| xj | b04a402 | 2021-11-25 15:01:52 +0800 | [diff] [blame] | 1 | Segmentation Offloads in the Linux Networking Stack | 
|  | 2 |  | 
|  | 3 | Introduction | 
|  | 4 | ============ | 
|  | 5 |  | 
|  | 6 | This document describes a set of techniques in the Linux networking stack | 
|  | 7 | to take advantage of segmentation offload capabilities of various NICs. | 
|  | 8 |  | 
|  | 9 | The following technologies are described: | 
|  | 10 | * TCP Segmentation Offload - TSO | 
|  | 11 | * UDP Fragmentation Offload - UFO | 
|  | 12 | * IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads | 
|  | 13 | * Generic Segmentation Offload - GSO | 
|  | 14 | * Generic Receive Offload - GRO | 
|  | 15 | * Partial Generic Segmentation Offload - GSO_PARTIAL | 
|  | 16 | * SCTP acceleration with GSO - GSO_BY_FRAGS | 
|  | 17 |  | 
|  | 18 | TCP Segmentation Offload | 
|  | 19 | ======================== | 
|  | 20 |  | 
|  | 21 | TCP segmentation allows a device to segment a single frame into multiple | 
|  | 22 | frames with a data payload size specified in skb_shinfo()->gso_size. | 
|  | 23 | When TCP segmentation is requested, the bit for either SKB_GSO_TCPV4 or | 
|  | 24 | SKB_GSO_TCPV6 should be set in skb_shinfo()->gso_type and | 
|  | 25 | skb_shinfo()->gso_size should be set to a non-zero value. | 
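
To make the role of gso_size concrete, the following userspace sketch works out how many frames result from a given TCP payload and how large the last one is. It is not kernel code: the helper names tso_segment_count and tso_last_segment_len are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function): number of frames produced
 * when payload_len bytes of TCP payload are cut into pieces that carry at
 * most gso_size bytes each.
 */
static size_t tso_segment_count(size_t payload_len, size_t gso_size)
{
	assert(gso_size != 0);	/* gso_size must be non-zero for TSO */
	return (payload_len + gso_size - 1) / gso_size;	/* round up */
}

/*
 * Hypothetical helper: payload bytes carried by the final frame.  Every
 * earlier frame carries exactly gso_size bytes.
 */
static size_t tso_last_segment_len(size_t payload_len, size_t gso_size)
{
	size_t rem;

	assert(gso_size != 0);
	rem = payload_len % gso_size;
	if (rem != 0 || payload_len == 0)
		return rem;
	return gso_size;	/* payload is an exact multiple of gso_size */
}
```

For example, a 4000 byte payload with a gso_size of 1448 yields three frames, the last of which carries 1104 bytes.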
|  | 26 |  | 
|  | 27 | TCP segmentation is dependent on support for the use of partial checksum | 
|  | 28 | offload.  For this reason TSO is normally disabled if the Tx checksum | 
|  | 29 | offload for a given device is disabled. | 
|  | 30 |  | 
|  | 31 | In order to support TCP segmentation offload it is necessary to populate | 
|  | 32 | the network and transport header offsets of the skbuff so that the device | 
|  | 33 | drivers will be able to determine the offsets of the IP or IPv6 header and the | 
|  | 34 | TCP header.  In addition, as CHECKSUM_PARTIAL is required, csum_start should | 
|  | 35 | also point to the TCP header of the packet. | 
|  | 36 |  | 
|  | 37 | For IPv4 segmentation we support one of two behaviors for the IP ID. | 
|  | 38 | The default behavior is to increment the IP ID with every segment.  If the | 
|  | 39 | GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP | 
|  | 40 | ID and all segments will use the same IP ID.  If a device has | 
|  | 41 | NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO | 
|  | 42 | and we will either increment the IP ID for all frames, or leave it at a | 
|  | 43 | static value based on driver preference. | 
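
The two IP ID behaviors can be sketched in userspace C as follows. The function tso_fill_ip_ids and its parameters are hypothetical names chosen for this illustration; the fixed_id flag stands in for the SKB_GSO_TCP_FIXEDID case, and the sketch assumes the incrementing ID wraps as a 16-bit field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): fill ids[] with the IPv4 ID
 * that each of nsegs segments would carry.
 *
 * fixed_id == false: default behavior, the ID increments with every segment.
 * fixed_id == true:  models SKB_GSO_TCP_FIXEDID, all segments share one ID.
 */
static void tso_fill_ip_ids(uint16_t first_id, bool fixed_id,
			    uint16_t *ids, size_t nsegs)
{
	size_t i;

	assert(ids != NULL || nsegs == 0);
	for (i = 0; i < nsegs; i++) {
		if (fixed_id)
			ids[i] = first_id;
		else
			ids[i] = (uint16_t)(first_id + i);	/* wraps at 65536 */
	}
}
```

A device with NETIF_F_TSO_MANGLEID set may produce either of these two sequences, depending on driver preference.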
|  | 44 |  | 
|  | 45 | UDP Fragmentation Offload | 
|  | 46 | ========================= | 
|  | 47 |  | 
|  | 48 | UDP fragmentation offload allows a device to fragment an oversized UDP | 
|  | 49 | datagram into multiple IPv4 fragments.  Many of the requirements for UDP | 
|  | 50 | fragmentation offload are the same as TSO.  However the IPv4 ID for | 
|  | 51 | fragments should not increment as a single IPv4 datagram is fragmented. | 
|  | 52 |  | 
|  | 53 | UFO is deprecated: modern kernels will no longer generate UFO skbs, but can | 
|  | 54 | still receive them from tuntap and similar devices. Offload of UDP-based | 
|  | 55 | tunnel protocols is still supported. | 
|  | 56 |  | 
|  | 57 | IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads | 
|  | 58 | ======================================================== | 
|  | 59 |  | 
|  | 60 | In addition to the offloads described above it is possible for a frame to | 
|  | 61 | contain additional headers such as an outer tunnel.  In order to account | 
|  | 62 | for such instances an additional set of segmentation offload types was | 
|  | 63 | introduced including SKB_GSO_IPXIP4, SKB_GSO_IPXIP6, SKB_GSO_GRE, and | 
|  | 64 | SKB_GSO_UDP_TUNNEL.  These extra segmentation types are used to identify | 
|  | 65 | cases where there is more than one set of headers.  For example in the | 
|  | 66 | case of IPIP and SIT we should have the network and transport headers moved | 
|  | 67 | from the standard list of headers to "inner" header offsets. | 
|  | 68 |  | 
|  | 69 | Currently only two levels of headers are supported.  The convention is to | 
|  | 70 | refer to the tunnel headers as the outer headers, while the encapsulated | 
|  | 71 | data is normally referred to as the inner headers.  Below is the list of | 
|  | 72 | calls to access the given headers: | 
|  | 73 |  | 
|  | 74 | IPIP/SIT Tunnel: | 
|  | 75 |             Outer                   Inner | 
|  | 76 | MAC         skb_mac_header | 
|  | 77 | Network     skb_network_header      skb_inner_network_header | 
|  | 78 | Transport   skb_transport_header | 
|  | 79 |  | 
|  | 80 | UDP/GRE Tunnel: | 
|  | 81 |             Outer                   Inner | 
|  | 82 | MAC         skb_mac_header          skb_inner_mac_header | 
|  | 83 | Network     skb_network_header      skb_inner_network_header | 
|  | 84 | Transport   skb_transport_header    skb_inner_transport_header | 
|  | 85 |  | 
|  | 86 | In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and | 
|  | 87 | SKB_GSO_UDP_TUNNEL_CSUM.  These two additional tunnel types indicate | 
|  | 88 | that the outer header also requests that a non-zero checksum be | 
|  | 89 | included in it. | 
|  | 90 |  | 
|  | 91 | Finally there is SKB_GSO_TUNNEL_REMCSUM which indicates that a given tunnel | 
|  | 92 | header has requested a remote checksum offload.  In this case the inner | 
|  | 93 | headers will be left with a partial checksum and only the outer header | 
|  | 94 | checksum will be computed. | 
|  | 95 |  | 
|  | 96 | Generic Segmentation Offload | 
|  | 97 | ============================ | 
|  | 98 |  | 
|  | 99 | Generic segmentation offload is a pure software offload that is meant to | 
|  | 100 | deal with cases where device drivers cannot perform the offloads described | 
|  | 101 | above.  What occurs in GSO is that a given skbuff will have its data broken | 
|  | 102 | out over multiple skbuffs that have been resized to match the MSS provided | 
|  | 103 | via skb_shinfo()->gso_size. | 
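
The splitting step can be pictured with the userspace sketch below, which cuts a payload into MSS-sized pieces and records where each piece starts. It is a simplified model rather than the kernel implementation: struct gso_piece and gso_split are hypothetical names, and header replication and checksum updates are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one output segment of the payload. */
struct gso_piece {
	size_t offset;	/* start of this piece within the original payload */
	size_t len;	/* payload bytes in this piece, at most mss */
};

/*
 * Hypothetical helper (not a kernel function): split payload_len bytes into
 * pieces of at most mss bytes.  At most max_out pieces are written to out[].
 * Returns the number of pieces written.
 */
static size_t gso_split(size_t payload_len, size_t mss,
			struct gso_piece *out, size_t max_out)
{
	size_t off = 0;
	size_t n = 0;

	assert(mss != 0);
	assert(out != NULL || max_out == 0);

	while (off < payload_len && n < max_out) {
		size_t len = payload_len - off;

		if (len > mss)
			len = mss;	/* every piece but the last is full */
		out[n].offset = off;
		out[n].len = len;
		off += len;
		n++;
	}
	return n;
}
```

In this model each piece would become the payload of its own skbuff, preceded by a copy of the original headers with lengths and checksums adjusted.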
|  | 104 |  | 
|  | 105 | Before enabling any hardware segmentation offload a corresponding software | 
|  | 106 | offload is required in GSO.  Otherwise a frame could be re-routed between | 
|  | 107 | devices and end up unable to be transmitted. | 
|  | 108 |  | 
|  | 109 | Generic Receive Offload | 
|  | 110 | ======================= | 
|  | 111 |  | 
|  | 112 | Generic receive offload is the complement to GSO.  Ideally any frame | 
|  | 113 | assembled by GRO should be segmented to create an identical sequence of | 
|  | 114 | frames using GSO, and any sequence of frames segmented by GSO should be | 
|  | 115 | able to be reassembled back to the original by GRO.  The only exception to | 
|  | 116 | this is IPv4 ID in the case that the DF bit is set for a given IP header. | 
|  | 117 | If the value of the IPv4 ID is not sequentially incrementing it will be | 
|  | 118 | altered so that it is when a frame assembled via GRO is segmented via GSO. | 
|  | 119 |  | 
|  | 120 | Partial Generic Segmentation Offload | 
|  | 121 | ==================================== | 
|  | 122 |  | 
|  | 123 | Partial generic segmentation offload is a hybrid between TSO and GSO.  What | 
|  | 124 | it effectively does is take advantage of certain traits of TCP and tunnels | 
|  | 125 | so that instead of having to rewrite the packet headers for each segment | 
|  | 126 | only the inner-most transport header and possibly the outer-most network | 
|  | 127 | header need to be updated.  This allows devices that do not support tunnel | 
|  | 128 | offloads or tunnel offloads with checksum to still make use of segmentation. | 
|  | 129 |  | 
|  | 130 | With the partial offload what occurs is that all headers excluding the | 
|  | 131 | inner transport header are updated such that they will contain the correct | 
|  | 132 | values as if the header were simply duplicated.  The one exception to this | 
|  | 133 | is the outer IPv4 ID field.  It is up to the device drivers to guarantee | 
|  | 134 | that the IPv4 ID field is incremented in the case that a given header does | 
|  | 135 | not have the DF bit set. | 
|  | 136 |  | 
|  | 137 | SCTP acceleration with GSO | 
|  | 138 | ========================== | 
|  | 139 |  | 
|  | 140 | SCTP - despite the lack of hardware support - can still take advantage of | 
|  | 141 | GSO to pass one large packet through the network stack, rather than | 
|  | 142 | multiple small packets. | 
|  | 143 |  | 
|  | 144 | This requires a different approach from other offloads, as SCTP packets | 
|  | 145 | cannot be just segmented to (P)MTU. Rather, the chunks must be contained in | 
|  | 146 | IP segments, padding respected. So unlike regular GSO, SCTP can't just | 
|  | 147 | generate a big skb, set gso_size to the fragmentation point and deliver it | 
|  | 148 | to the IP layer. | 
|  | 149 |  | 
|  | 150 | Instead, the SCTP protocol layer builds an skb with the segments correctly | 
|  | 151 | padded and stored as chained skbs, and skb_segment() splits based on those. | 
|  | 152 | To signal this, gso_size is set to the special value GSO_BY_FRAGS. | 
|  | 153 |  | 
|  | 154 | Therefore, any code in the core networking stack must be aware of the | 
|  | 155 | possibility that gso_size will be GSO_BY_FRAGS and handle that case | 
|  | 156 | appropriately. | 
|  | 157 |  | 
|  | 158 | There are some helpers to make this easier: | 
|  | 159 |  | 
|  | 160 | - skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if | 
|  | 161 | an skb is an SCTP GSO skb. | 
|  | 162 |  | 
|  | 163 | - For size checks, the skb_gso_validate_*_len family of helpers correctly | 
|  | 164 | considers GSO_BY_FRAGS. | 
|  | 165 |  | 
|  | 166 | - For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size | 
|  | 167 | will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs. | 
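
As an illustration of why code must special-case GSO_BY_FRAGS, the userspace sketch below performs a size check on a mock skb. The types mock_skb and mock_frag and the function mock_gso_fits are hypothetical and much simpler than the kernel's structures and its skb_gso_validate_*_len helpers. The value 0xFFFF for GSO_BY_FRAGS is reproduced here as an assumption about the kernel definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GSO_BY_FRAGS 0xFFFF	/* assumed to match the kernel's definition */

/* Hypothetical stand-in for one chained skb on the frag list. */
struct mock_frag {
	size_t len;			/* bytes in this pre-built segment */
	const struct mock_frag *next;
};

/* Hypothetical stand-in for the few skb fields this check needs. */
struct mock_skb {
	unsigned short gso_size;	/* MSS, or GSO_BY_FRAGS */
	size_t hdr_len;			/* header bytes added to every segment */
	const struct mock_frag *frag_list;
};

/*
 * Return true if every segment produced from skb fits within max_len.
 * With GSO_BY_FRAGS, gso_size is a marker and not a length, so the
 * segment sizes must be read from the frag list instead.
 */
static bool mock_gso_fits(const struct mock_skb *skb, size_t max_len)
{
	const struct mock_frag *f;

	assert(skb != NULL);

	if (skb->gso_size != GSO_BY_FRAGS)
		return skb->hdr_len + skb->gso_size <= max_len;

	for (f = skb->frag_list; f != NULL; f = f->next) {
		if (skb->hdr_len + f->len > max_len)
			return false;
	}
	return true;
}
```

Treating 0xFFFF as an ordinary segment size would make such a check reject every SCTP GSO skb, which is why size checks should go through helpers that know about GSO_BY_FRAGS.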
|  | 168 |  | 
|  | 169 | This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits | 
|  | 170 | set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE. |