0. Preface
To improve the basic service quality of the carrier pigeon (push) service, this article sorts out the whole process of receiving network packets.
In network programming we mostly deal with the socket API and the epoll model, and rarely touch the system kernel or the NIC driver. On the one hand, our systems may not need deep tuning; on the other hand, network programming involves hardware, drivers, the kernel, virtualization and other complex topics, which can be intimidating. There is plenty of material about NIC packet reception, but it is scattered. Here I organize the whole receive path and share it, hoping it helps. Some colleagues' diagrams and excerpts from online material are quoted in this article; references and sources are listed at the end. Thanks to those authors for sharing.
1. Overall process
Overall, NIC packet reception is the process of converting the high and low voltage levels on the network cable into frames in the NIC's FIFO and then copying them into the system's main memory (DDR3). It involves the NIC controller, the CPU, DMA and the driver, and corresponds to the physical layer and link layer of the OSI model, as shown in the figure below.
2. Key data structure
The kernel code involved in the network data flow is fairly complex, as shown in Figure 1 (the original figure is in the attachment). Three data structures play the most important roles in the NIC receive path: sk_buff, softnet_data and net_device.
Figure 1. Kernel network data flow
sk_buff
sk_buff is one of the most important data structures in the Linux network subsystem. An sk_buff is passed between the different protocol layers; to adapt to different protocols, most of its members are pointers, and some are unions. Among them, the data pointer and len change as the packet crosses protocol layers: on the receive path, when the data is handed to the upper layer, the lower-layer header is no longer needed. Figure 2 shows how the pointer and len change when sending a packet (different Linux versions differ slightly; the screenshot below is from Linux 2.6.20).
Figure 2. Changes of the sk_buff data pointer as it is passed between protocol layers
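As a concrete illustration of how data and len move on the receive path, the sketch below strips an Ethernet header with skb_pull(); it is a minimal, hypothetical example using mainline kernel helpers, not code from a specific driver.

#include <linux/skbuff.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

// Minimal sketch: strip the Ethernet header before handing the packet to IP.
// skb_pull() advances skb->data by ETH_HLEN and shrinks skb->len accordingly;
// the lower-layer header is still in the buffer, only the pointers change.
static void strip_eth_header(struct sk_buff *skb)
{
	struct ethhdr *eth = (struct ethhdr *)skb->data;   // MAC header at the front

	skb_pull(skb, ETH_HLEN);                           // data += 14, len -= 14
	if (eth->h_proto == htons(ETH_P_IP)) {
		struct iphdr *iph = (struct iphdr *)skb->data; // now points at the IP header
		(void)iph;                                     // the upper layer would parse this
	}
}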
softnet_data
The softnet_data structure is the processing queue between the NIC and the network layer. It is a per-CPU global: each CPU has its own instance. It carries data between the NIC interrupt and the poll method. Figure 3 illustrates the role of the variables in softnet_data.
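For reference, here is a simplified excerpt of this structure as it looked in the 2.6-era kernels discussed below (later kernels add more fields):

// Simplified excerpt of the per-CPU softnet_data (linux-2.6.2x era; fields
// differ slightly between kernel versions).
struct softnet_data {
	struct sk_buff_head input_pkt_queue;   // packets queued by netif_rx() (non-NAPI path)
	struct list_head    poll_list;         // napi_struct entries waiting to be polled
	struct sk_buff     *completion_queue;  // skbs to be freed after transmission
	struct napi_struct  backlog;           // built-in NAPI instance serving the non-NAPI path
};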
net_device
net_device represents a network device, which can be either a physical NIC or a virtual one; its NAPI poll callback is the receive function invoked from the soft interrupt.
Inside sk_buff there is a net_device *dev member, which changes as the sk_buff flows through the system. When the network device driver is initialized, the sk_buff receive queue is allocated and the dev pointer points to the network device that received the packet. When the original (physical) device receives a frame, it may select a suitable virtual device according to some algorithm and change dev to point to that virtual device's net_device structure.
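A small sketch of where skb->dev is first set on the receive path: drivers typically call eth_type_trans(), which records the receiving device in the skb and returns the protocol for the upper layer (the helper below is hypothetical; the two kernel calls are real).

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

// Hypothetical per-packet receive step of a driver, called from its poll
// function (softirq context): eth_type_trans() sets skb->dev to the receiving
// net_device, pulls the Ethernet header and returns the protocol id that the
// stack uses for demultiplexing.
static void hand_to_stack(struct net_device *dev, struct sk_buff *skb)
{
	skb->protocol = eth_type_trans(skb, dev);   // also sets skb->dev = dev
	netif_receive_skb(skb);                     // enter the protocol stack
}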
3. Network packet receiving principle
This section mainly draws on articles from the Internet, with remarks added at key points. Tencent mainly uses the Intel 82576 NIC with the Intel igb driver, which differ from the NIC and driver described below; the principle is the same, only some function names and processing details differ, which does not affect understanding.
There are three ways a network driver can receive packets:
No NAPI: every time the MAC receives an Ethernet frame, it raises a receive interrupt to the CPU, i.e. reception is driven entirely by interrupts.
The drawback is that under heavy traffic the CPU spends most of its time handling MAC interrupts.
Netpoll: when the network and I/O subsystems are not fully available, interrupts from the specified device are simulated, i.e. packets are received by polling.
The drawback is poor real-time behaviour.
NAPI: interrupt + polling. When the MAC receives a frame it raises a receive interrupt but masks it immediately.
The receive interrupt is not re-enabled until netdev_max_backlog packets (300 by default) have been processed, or all packets pending on the MAC have been drained.
netdev_max_backlog can be changed via sysctl (net.core.netdev_max_backlog)
or via /proc/sys/net/core/netdev_max_backlog
Figure 3. Relationship between softnet_data, the interface layer and the network layer
The rest of this section covers only the case where the kernel is configured to use NAPI, and only the TSEC driver. The kernel version is Linux 2.6.24.
NAPI-related data structures
Each network device (at the MAC layer) has its own net_device structure, which contains a napi_struct. Every time a packet arrives, the network device driver hooks its napi_struct onto a CPU-private variable (the per-CPU poll_list). When the soft interrupt runs, net_rx_action traverses the poll_list of the CPU-private variable, executes the poll hook of each napi_struct hanging on it, and moves packets from the driver into the network protocol stack.
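Stripped of driver details, the pattern looks roughly like the sketch below. The my_* names are hypothetical; napi_schedule() and napi_complete() are the generic helpers in current kernels, which the 2.6.24 gfar code quoted later reaches through netif_rx_schedule_prep()/__netif_rx_schedule() and netif_rx_complete().

#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/netdevice.h>

// Hypothetical private data of a NAPI driver.
struct my_priv {
	struct napi_struct napi;
	// ... RX ring state ...
};

// Hypothetical helper: drain up to 'budget' packets from the RX ring, hand
// each to the stack with netif_receive_skb(), and return how many were done.
static int my_clean_rx_ring(struct my_priv *priv, int budget)
{
	// device-specific descriptor-ring walk goes here
	return 0;
}

// Top half: mask further RX interrupts and ask for the poll to be run.
static irqreturn_t my_rx_isr(int irq, void *dev_id)
{
	struct my_priv *priv = dev_id;

	// device-specific register write: mask the device's RX interrupt
	napi_schedule(&priv->napi);   // hook our napi_struct onto this CPU's poll_list
	                              // and raise NET_RX_SOFTIRQ
	return IRQ_HANDLED;
}

// Bottom half: called by net_rx_action() via napi_struct.poll.
static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_priv *priv = container_of(napi, struct my_priv, napi);
	int done = my_clean_rx_ring(priv, budget);

	if (done < budget) {          // ring drained: leave polling mode
		napi_complete(napi);
		// device-specific register write: unmask the RX interrupt
	}
	return done;
}
// registered at probe time with: netif_napi_add(dev, &priv->napi, my_poll, 64);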
Preparations for kernel startup
3.1 Initialize the global network-related data structures and register the hook functions that handle the network soft interrupts
start_kernel()
--> rest_init()
--> do_basic_setup()
--> do_initcalls()
--> net_dev_init()
static int __init net_dev_init(void)
{
int i;
// each CPU has a private variable __get_cpu_var(softnet_data);
// its poll_list is important: the soft interrupt must traverse it
for_each_possible_cpu(i) {
struct softnet_data *queue;
queue = &per_cpu(softnet_data, i);
skb_queue_head_init(&queue->input_pkt_queue);
queue->completion_queue = NULL;
INIT_LIST_HEAD(&queue->poll_list);
queue->backlog.poll = process_backlog;
queue->backlog.weight = weight_p;
}
// hook the network transmit handler onto its soft interrupt
open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
// hook the network receive handler onto its soft interrupt
open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
}
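The backlog.poll = process_backlog assignment above is what serves drivers that do not use NAPI: they call netif_rx(), which parks the skb on this CPU's input_pkt_queue and schedules the backlog napi_struct, so the same softirq machinery drains it later. Below is a condensed, slightly simplified sketch of that path based on the 2.6-era netif_rx() (the real function also disables local interrupts around the queue manipulation, and __get_cpu_var() was later replaced by per-CPU accessors such as this_cpu_ptr()).

#include <linux/netdevice.h>
#include <linux/skbuff.h>

// Condensed sketch of the non-NAPI receive path (2.6-era, simplified).
// Called from the driver's hard-interrupt context.
static int netif_rx_sketch(struct sk_buff *skb)
{
	struct softnet_data *queue = &__get_cpu_var(softnet_data);

	if (skb_queue_len(&queue->input_pkt_queue) <= netdev_max_backlog) {
		__skb_queue_tail(&queue->input_pkt_queue, skb);  // park the packet
		// put the per-CPU backlog napi_struct on poll_list and raise
		// NET_RX_SOFTIRQ; process_backlog() later dequeues the skb and
		// calls netif_receive_skb()
		napi_schedule(&queue->backlog);
		return NET_RX_SUCCESS;
	}
	kfree_skb(skb);                                       // queue full: drop
	return NET_RX_DROP;
}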
Softirq
The "bottom half" mechanism of interrupt processing
Interrupt service routines usually run with interrupt requests disabled, to avoid the complexity of nested interrupt control. But interrupts are random events that can arrive at any time; if interrupts stay disabled for too long, the CPU cannot respond to other interrupt requests in time and interrupts get lost.
Therefore the Linux kernel's goal is to handle an interrupt request as quickly as possible and defer as much of the processing as possible. For example, when a data block arrives on the network cable and the interrupt controller receives the interrupt request signal, the kernel simply marks that data has arrived and lets the processor return to what it was running before; the rest of the processing is carried out later (for example, moving the data into a buffer where the receiving process can find it).
Therefore, the kernel divides interrupt processing into two parts: the top half and the bottom half. The top half (the interrupt service routine) is executed immediately, while the bottom half (some kernel functions) is deferred for later processing.
2.6 "bottom half" processing mechanism in the kernel:
1) soft interrupt request (softirq) mechanism (pay attention not to be confused with the signal of inter process communication)
2) tasklet mechanism
3) work queue mechanism
We can check the CPU usage of softirq through the top command:
A softirq is essentially a mechanism for registering callbacks. With ps -elf you can see that the registered functions are handled by dedicated daemons (ksoftirqd), one per CPU.
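As a minimal, hypothetical illustration of the top-half/bottom-half split, the snippet below uses a tasklet, which is itself built on top of softirqs: the ISR only schedules the deferred work and returns quickly. This uses the old-style tasklet API with an unsigned long argument; kernels since 5.9 changed the callback signature.

#include <linux/interrupt.h>

// Bottom half: runs later, outside hard-interrupt context, with interrupts enabled.
static void my_bottom_half(unsigned long data)
{
	// the time-consuming part: move data out of the device, wake up readers, ...
}

static DECLARE_TASKLET(my_tasklet, my_bottom_half, 0);

// Top half: acknowledge the hardware and defer everything else.
static irqreturn_t my_isr(int irq, void *dev_id)
{
	// read/clear the device's interrupt status register -- keep this short
	tasklet_schedule(&my_tasklet);   // ask for my_bottom_half() to run soon
	return IRQ_HANDLED;
}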
3.2 Load the network device driver
Note: the network device here means the MAC-layer device, i.e. the TSEC or a PCI NIC (the BCM5461 is the PHY). The network device driver creates the net_device data structure and initializes its hook functions such as open() and close(); the probe entry of the driver that handles the TSEC is gfar_probe.
// platform driver data structure of the TSEC
static struct platform_driver gfar_driver = {
.probe = gfar_probe,
.remove = gfar_remove,
.driver = {
.name = "fsl-gianfar",
}
}
int gfar_probe(struct platform_device *pdev)
{
dev = alloc_etherdev(sizeof(*priv));   // create the net_device data structure
dev->open = gfar_enet_open;
dev->hard_start_xmit = gfar_start_xmit;
dev->tx_timeout = gfar_timeout;
dev->watchdog_timeo = TX_TIMEOUT;
#ifdef CONFIG_GFAR_NAPI
netif_napi_add(dev, &priv->napi, gfar_poll, GFAR_DEV_WEIGHT);   // the poll hook will be called from the soft interrupt
#endif
#ifdef CONFIG_NET_POLL_CONTROLLER
dev->poll_controller = gfar_netpoll;
#endif
dev->stop = gfar_close;
dev->change_mtu = gfar_change_mtu;
dev->mtu = 1500;
dev->set_multicast_list = gfar_set_multi;
dev->set_mac_address = gfar_set_mac_address;
dev->ethtool_ops = &gfar_ethtool_ops; }
3.3 Enable the network device
3.3.1 The user runs ifconfig or a similar program, which enters the kernel through the ioctl system call
ioctl() system call on the socket --> sock_ioctl() --> dev_ioctl()   // handles SIOCSIFFLAGS
--> dev_get_by_name(net, ifr->ifr_name)   // find the net_device by name
--> dev_change_flags()   // checks IFF_UP
--> dev_open(net_device)   // calls the open hook function
For the TSEC, the open hook is gfar_enet_open(net_device)
3.3.2 In the network device's open hook function, allocate the receive BDs (buffer descriptors) and register the ISRs (RX, TX and ERR). For the TSEC:
gfar_enet_open()
--> allocate the rx_skbuff pointer array and initialize it to NULL
--> initialize the TX BDs
--> initialize the RX BDs, pre-allocating the skb that will hold each Ethernet frame; a one-shot DMA mapping is done here (note: DEFAULT_RX_BUFFER_SIZE 1536 ensures an skb can hold a full Ethernet frame)
rxbdp = priv->rx_bd_base;
for (i = 0; i < priv->rx_ring_size; i++) {
struct sk_buff *skb = NULL;
rxbdp->status = 0;
// the skb is actually allocated here, and rxbdp->bufptr and rxbdp->length are filled in
skb = gfar_new_skb(dev, rxbdp);
priv->rx_skbuff[i] = skb;
rxbdp++;
}
rxbdp--;
rxbdp->status |= RXBD_WRAP;   // set the WRAP flag on the last BD
Register the TSEC-related interrupt handlers: error, receive, transmit
request_irq(priv->interruptError, gfar_error, 0, "enet_error", dev);
request_irq(priv->interruptTransmit, gfar_transmit, 0, "enet_tx", dev);   // fires after a packet has been sent
request_irq(priv->interruptReceive, gfar_receive, 0, "enet_rx", dev);     // fires after a packet has been received
-->gfar_start(net_device)
// enable RX and TX
// turn on the TSEC's DMA registers
// mask off the interrupt events we don't care about
In the end, the TSEC-related data structures such as the BDs should look like the figure below.
3.4 Receive Ethernet packets in the interrupt
The TSEC's RX has been enabled. The path a network packet takes into memory is as follows:
network cable --> RJ45 port --> MDI differential pairs
--> BCM5461 (PHY chip, analog/digital conversion) --> MII bus
--> the TSEC's DMA engine automatically checks the next available RX BD
--> and DMAs the network packet into the memory the RX BD points to, i.e. skb->data
After a complete Ethernet frame has been received, the TSEC raises an RX external interrupt according to the event mask.
The CPU saves the context and, following the interrupt vector, starts executing the external interrupt handler do_IRQ().
do_IRQ() pseudo-code:
Top half: handle the hard interrupt
Read the interrupt source register and find that the external interrupt was generated by the network peripheral
Execute the network device's RX interrupt handler (the function differs per device but the flow is similar; for the TSEC it is gfar_receive)
1. Mask off the RX event so that further incoming packets no longer generate RX interrupts
2. Set the NAPI_STATE_SCHED flag in napi_struct.state
3. Attach the network device's napi_struct to the CPU-private variable __get_cpu_var(softnet_data).poll_list
4. Trigger the network receive soft interrupt (__raise_softirq_irqoff(NET_RX_SOFTIRQ) --> wakeup_softirqd())
Bottom half: soft interrupt processing
Execute all pending soft interrupt handlers in turn, including timers, tasklets, etc.
Execute the network receive soft interrupt handler net_rx_action
1. Traverse the CPU-private variable __get_cpu_var(softnet_data).poll_list
2. Take each napi_struct hanging on the poll_list and execute its hook napi_struct.poll()
(the hook differs per device but the flow is similar; for the TSEC it is gfar_poll)
3. If the poll hook has processed all pending packets, unmask the RX event so that new packets generate RX interrupts again
4. Call napi_complete(napi_struct *n)
5. Remove the napi_struct from __get_cpu_var(softnet_data).poll_list and clear the NAPI_STATE_SCHED flag in napi_struct.state
3.4.1 The TSEC's receive interrupt handler
gfar_receive(){
#ifdef CONFIG_GFAR_NAPI
// test and set NAPI_STATE_SCHED in the current net_device's napi_struct.state;
// net_rx_action, called in the soft interrupt, checks napi_struct.state
if (netif_rx_schedule_prep(dev, &priv->napi)) {
tempval = gfar_read(&priv->regs->imask);
tempval &= IMASK_RX_DISABLED;   // mask off RX, so no further RX interrupts are generated
gfar_write(&priv->regs->imask, tempval);
// hang the current net_device's napi_struct onto the CPU-private
// __get_cpu_var(softnet_data).poll_list, so that when net_rx_action runs
// in the soft interrupt it executes this net_device's
// napi_struct.poll() hook, i.e. gfar_poll()
__netif_rx_schedule(dev, &priv->napi);
}
#else
gfar_clean_rx_ring(dev, priv->rx_ring_size);
#endif
}
3.4.2 net_rx_action
net_rx_action(){
struct list_head *list = &__get_cpu_var(softnet_data).poll_list;
// many napi_structs are linked into a chain via napi_struct.poll_list;
// the CPU-private variable gives us the head of the chain, which we traverse
int budget = netdev_budget;   // this is net.core.netdev_budget (300 by default), tunable via sysctl
while (!list_empty(list)) {
struct napi_struct *n;
int work, weight;
local_irq_enable();
// take one napi_struct from the chain (it was added to the list by the receive interrupt handler, e.g. gfar_receive)
n = list_entry(list->next, struct napi_struct, poll_list);
weight = n->weight;
work = 0;
// check the state flag that was set in the receive interrupt;
// with NAPI the network device's own napi_struct.poll is used,
// which for the TSEC is gfar_poll
if (test_bit(NAPI_STATE_SCHED, &n->state))
work = n->poll(n, weight);
WARN_ON_ONCE(work > weight);
budget -= work;
local_irq_disable();
if (unlikely(work == weight)) {
if (unlikely(napi_disable_pending(n)))
// remove the napi_struct from the list and clear NAPI_STATE_SCHED
__napi_complete(n);
else
list_move_tail(&n->poll_list, list);
}
netpoll_poll_unlock(have);
}
out:
local_irq_enable();
}
static int gfar_poll(struct napi_struct *napi, int budget){
struct gfar_private *priv = container_of(napi, struct gfar_private, napi);
struct net_device *dev = priv->dev;   // the net_device corresponding to the TSEC
int howmany;
// walk dev's RX BDs, take the skbs and hand them to the protocol stack;
// returns the number of skbs processed, i.e. the number of Ethernet frames
howmany = gfar_clean_rx_ring(dev, budget);
// the following check is subtle:
// if fewer packets than the budget were received, everything was handled in this
// soft interrupt, so re-enable the RX interrupt;
// if the packet count reaches the budget, not everything could be handled in one
// soft interrupt, so leave the RX interrupt off and keep processing in subsequent
// soft interrupts until howmany < budget, and only then re-enable the RX interrupt
if (howmany < budget) {
netif_rx_complete(dev, napi);
gfar_write(&priv->regs->rstat, RSTAT_CLEAR_RHALT);
// re-enable the RX interrupt, which was turned off in gfar_receive()
gfar_write(&priv->regs->imask, IMASK_DEFAULT);
}
return howmany;
}
gfar_clean_rx_ring(dev, budget){
bdp = priv->cur_rx;
while (!((bdp->status & RXBD_EMPTY) || (--rx_work_limit < 0))) {
rmb();
skb = priv->rx_skbuff[priv->skb_currx];   // get the skb from rx_skbuff[]
howmany++;
dev->stats.rx_packets++;
pkt_len = bdp->length - 4;   // subtract the Ethernet FCS from the length
gfar_process_frame(dev, skb, pkt_len);
dev->stats.rx_bytes += pkt_len;
dev->last_rx = jiffies;
bdp->status &= ~RXBD_STATS;   // clear the rxbd status
skb = gfar_new_skb(dev, bdp); // Add another skb for the future
priv->rx_skbuff[priv->skb_currx] = skb;
// advance the BD pointer
if (bdp->status & RXBD_WRAP)
bdp = priv->rx_bd_base;   // this BD carries the WRAP flag: it is the last one, so wrap back to the base
else
bdp++;
priv->skb_currx = (priv->skb_currx + 1) & RX_RING_MOD_MASK(priv->rx_ring_size);
}
priv->cur_rx = bdp; /* Update the current rxbd pointer to be the next one */
return howmany;
}
gfar_process_frame()
--> RECEIVE(skb)   // calls netif_receive_skb(skb) to enter the protocol stack
#ifdef CONFIG_GFAR_NAPI
#define RECEIVE(x) netif_receive_skb(x)
#else
#define RECEIVE(x) netif_rx(x)
#endif
Using NAPI in the soft interrupt
The main flow of net_rx_action described above is shown in Figure 4. While a network soft interrupt is executing, the NIC's own RX interrupt is already masked, so no new receive interrupts are generated; local_irq_enable and local_irq_disable only control whether the CPU accepts interrupts. On entering net_rx_action, a budget is set, i.e. the maximum number of network packets to process; if several NICs are on the poll_list, they share this budget. At the same time each NIC also has its own weight (quota). After a NIC has processed the packets in its input queue there are two cases. In the first, many packets were received at once and the quota is used up, so the NIC's poll entry is moved to the end of the poll_list and its quota is reset, waiting for the next round of polling. In the second, few packets were received and the quota is not used up, which means the NIC is idle, so it removes itself from the poll_list and leaves polling mode. The whole net_rx_action exits in two cases: the budget is used up, or the time limit is exceeded.
Figure 4. Main execution flow of net_rx_action
3.5 DMA (8237A)
NIC packet reception involves DMA. The main job of DMA is to transfer data between a peripheral (such as the NIC) and main memory without the CPU's involvement (i.e. without the CPU copying the data with special I/O instructions). A brief introduction to the DMA principle follows, as shown in Figure 5.
Figure 5. DMA system composition
The NIC uses DMA (the DMA controller is usually on the system board, though some NICs have a built-in DMA controller). The ISR programs the DMA controller through the CPU (programming mainly means setting the DMA controller's registers; at this point the DMA controller is just an ordinary peripheral). After receiving the ISR's request, the DMA controller sends a bus-hold request to the CPU; once the CPU acknowledges it, the DMA controller takes over the bus and starts transferring data between the NIC buffer and memory. Meanwhile the CPU can continue executing other instructions. When the DMA operation is complete, the DMA controller releases control of the bus.
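From the driver's point of view, "programming the DMA" mostly means handing the device a bus address it may write into. Below is a hedged sketch of the usual streaming-DMA pattern for an RX buffer: the wrapper functions are hypothetical, while dma_map_single()/dma_unmap_single() are the real kernel API; gfar_new_skb() in the code above does essentially this for each RX BD.

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

// Hypothetical wrapper: map an skb's data area so the NIC can DMA a received
// frame into it. The returned bus/DMA address is what the driver writes into
// the RX buffer descriptor (BD); the CPU must not touch the buffer until it
// is unmapped again.
static dma_addr_t map_rx_buffer(struct device *dev, struct sk_buff *skb,
				unsigned int size)
{
	return dma_map_single(dev, skb->data, size, DMA_FROM_DEVICE);
}

// After the NIC signals completion, unmap before the stack reads the data.
static void unmap_rx_buffer(struct device *dev, dma_addr_t addr,
			    unsigned int size)
{
	dma_unmap_single(dev, addr, size, DMA_FROM_DEVICE);
}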
4. Network card multi-queue
Multi-queue is a hardware feature of the NIC that also needs kernel support. The Intel 82576 used by Tencent supports multi-queue, with kernel versions newer than 2.6.20. A single-queue NIC can generate only one interrupt signal, which can be handled by only one CPU, so in a multi-core system one core (CPU 0 by default) ends up with a high load. A multi-queue NIC maintains multiple receive and transmit queues internally and generates multiple interrupt signals, so different CPUs can handle the packets it receives, which improves performance, as shown in Figure 6.
Figure 6. Receive flow of a multi-queue NIC
MSI-X: one device can generate multiple interrupts. As shown in the figure below, interrupts 54-61, eth1-TxRx-[0-7], are the interrupt numbers occupied by the eth1 NIC.
CPU affinity: each interrupt number is configured to be handled by exactly one CPU. The values 01, 02, 04, etc. are hexadecimal masks; the bit set to 1 identifies the CPU that handles the interrupt.
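For example, to pin one of the queue interrupts to a specific CPU you write a hex mask into /proc/irq/<n>/smp_affinity. The small userspace program below is a hypothetical illustration using IRQ 54 (eth1-TxRx-0 in the figure) and CPU 2; it is equivalent to echo 4 > /proc/irq/54/smp_affinity.

#include <stdio.h>

// Hypothetical example: pin IRQ 54 (eth1-TxRx-0 above) to CPU 2 by writing a
// hex CPU mask to its smp_affinity file. Mask 4 = 0b100, i.e. only bit 2 set,
// so only CPU 2 will handle this interrupt.
int main(void)
{
	FILE *f = fopen("/proc/irq/54/smp_affinity", "w");

	if (!f) {
		perror("open /proc/irq/54/smp_affinity");
		return 1;
	}
	fputs("4\n", f);
	fclose(f);
	return 0;
}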
5. I/O virtualization: SR-IOV
Server virtualization is very common in distributed systems; it improves device utilization and operational efficiency. Server virtualization includes processor virtualization, memory virtualization and I/O device virtualization; network-related virtualization belongs to I/O virtualization. The point of I/O device virtualization is that a single I/O device can be shared by multiple virtual machines: for an application in the guest operating system, initiating an I/O operation looks the same as on a real hardware platform, and the difference in the I/O path lies in how the device driver accesses the hardware. After years of development the main I/O virtualization models are shown in Table 1, and early device emulation is shown in Figure 7. As can be seen, the path a network packet takes from the physical NIC to the virtual machine involves a lot of extra processing and is very inefficient. SR-IOV supports virtualization directly in hardware. As shown in Figure 8, the device is split into one physical function (PF) and multiple virtual functions (VFs), and each VF can be used as a lightweight I/O device by a virtual machine, so one device can be assigned to several virtual machines at once, solving the poor scalability caused by the limited number of devices in a virtualized system. Each VF has a unique RID (requester identifier) and its own key packet resources such as a transmit queue, a receive queue and a DMA channel, so every VF can send and receive packets independently. All VFs share the main device resources, such as data link layer processing and packet classification.
Table 1. Comparison of several I/O virtualization approaches
Figure 7. Device simulation
Figure 8. Device structure supporting SR-IOV
SR-IOV requires NIC support:
Special drivers are also needed:
In fact, the VF driver is similar to that of an ordinary NIC: in the end it reaches netif_receive_skb, and the received packet is then handed to the VF's VLAN virtual sub-device for processing.
Using SR-IOV in Docker
Activate the VFs:
#echo "options igb max_vfs=7" >>/etc/modprobe.d/igb.conf
#reboot
Set the VF's VLAN
#ip link set eth1 vf 0 vlan 12
Move the VF into the container's network namespace
#ip link set eth4 netns $pid
#ip netns exec $pid ip link set dev eth4 name eth1
#ip netns exec $pid ip link set dev eth1 up
In the container: set the IP and gateway
#ip addr add 10.217.121.107/21 dev eth1
#ip route add default via 10.217.120.1
6. References
1. Linux kernel source code analysis -- TCP / IP implementation, Volume I
2. Understanding Linux Network Internals
3. Research and implementation of network card virtualization based on SR-IOV Technology
4. 82576 sr-iov driver companion guide
5. Working principle of network card and optimization of high concurrency
6. Introduction to multi queue network card
7. NAPI mechanism analysis of the Linux kernel, ChinaUnix
8. Network packet receiving and sending process (1): from driver to protocol stack, CSDN
9. Interrupt processing "bottom half" mechanism, CSDN
10. Details of basic network devices on Linux, 51CTO
11. How DMA works, YouTube