Linux kernel driver for Elastic Network Adapter (ENA) family:
=============================================================

Overview:
=========
ENA is a networking interface designed to make good use of modern CPU
features and system architectures.

The ENA device exposes a lightweight management interface with a
minimal set of memory-mapped registers and an extendable command set
through an Admin Queue.

The driver supports a range of ENA devices, is link-speed independent
(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
a negotiated and extendable feature set.

Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.

ENA devices enable high speed and low overhead network traffic
processing by providing multiple Tx/Rx queue pairs (the maximum number
is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.

The ENA driver supports industry standard TCP/IP offload features such
as checksum offload and TCP transmit segmentation offload (TSO).
Receive-side scaling (RSS) is supported for multi-core scaling.

The ENA driver and its corresponding devices implement health
monitoring mechanisms such as a watchdog, which enables the device and
driver to recover in a manner transparent to the application, as well
as debug logs.

Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which saves several more microseconds of latency.

Supported PCI vendor ID/device IDs:
===================================
1d0f:0ec2 - ENA PF
1d0f:1ec2 - ENA PF with LLQ support
1d0f:ec20 - ENA VF
1d0f:ec21 - ENA VF with LLQ support

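The ID list above can be pictured as a small match table. The sketch
below is illustrative only: struct ena_id_entry and ena_id_lookup() are
hypothetical names invented for this example (the in-kernel driver
declares its IDs in ena_pci_id_tbl.h using the kernel's own PCI table
types), and the is_vf/llq flags simply restate the descriptions above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry, for illustration only. */
struct ena_id_entry {
	uint16_t vendor;
	uint16_t device;
	bool is_vf;        /* Virtual Function device */
	bool llq;          /* listed as "with LLQ support" */
	const char *name;
};

/* Vendor/device IDs copied from the list above. */
static const struct ena_id_entry ena_id_table[] = {
	{ 0x1d0f, 0x0ec2, false, false, "ENA PF" },
	{ 0x1d0f, 0x1ec2, false, true,  "ENA PF with LLQ support" },
	{ 0x1d0f, 0xec20, true,  false, "ENA VF" },
	{ 0x1d0f, 0xec21, true,  true,  "ENA VF with LLQ support" },
};

/* Return the matching entry, or NULL if the ID pair is not listed. */
static const struct ena_id_entry *ena_id_lookup(uint16_t vendor,
						uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(ena_id_table) / sizeof(ena_id_table[0]); i++) {
		if (ena_id_table[i].vendor == vendor &&
		    ena_id_table[i].device == device)
			return &ena_id_table[i];
	}
	return NULL;
}
```
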
ENA Source Code Directory Structure:
====================================
ena_com.[ch]      - Management communication layer. This layer is
                    responsible for handling all the management
                    (admin) communication between the device and the
                    driver.
ena_eth_com.[ch]  - Tx/Rx data path.
ena_admin_defs.h  - Definition of ENA management interface.
ena_eth_io_defs.h - Definition of ENA data path interface.
ena_common_defs.h - Common definitions for ena_com layer.
ena_regs_defs.h   - Definition of ENA PCI memory-mapped (MMIO) registers.
ena_netdev.[ch]   - Main Linux kernel driver.
ena_sysfs.[ch]    - Sysfs files.
ena_ethtool.c     - ethtool callbacks.
ena_pci_id_tbl.h  - Supported device IDs.

Management Interface:
=====================
The ENA management interface is exposed by means of:
- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)

ENA device MMIO Registers are accessed only during driver
initialization and are not involved in further normal device
operation.

AQ is used for submitting management commands, and the
results/responses are reported asynchronously through ACQ.

ENA introduces a very small set of management commands with room for
vendor-specific extensions. Most of the management operations are
framed in a generic Get/Set feature command.

The following admin queue commands are supported:
- Create I/O submission queue
- Create I/O completion queue
- Destroy I/O submission queue
- Destroy I/O completion queue
- Get feature
- Set feature
- Configure AENQ
- Get statistics

Refer to ena_admin_defs.h for the list of supported Get/Set Feature
properties.

The Asynchronous Event Notification Queue (AENQ) is a uni-directional
queue used by the ENA device to send to the driver events that cannot
be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below.

The events are:
	Group			Syndrome
	Link state change	- X -
	Fatal error		- X -
	Notification		Suspend traffic
	Notification		Resume traffic
	Keep-Alive		- X -

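The group/syndrome split in the table above could be handled with a
two-level dispatch along the following lines. This is a sketch only:
the enum names and values are hypothetical placeholders invented for
this example; the authoritative definitions are in ena_admin_defs.h and
may differ.

```c
/* Hypothetical identifiers mirroring the table above. */
enum aenq_group {
	AENQ_LINK_CHANGE,
	AENQ_FATAL_ERROR,
	AENQ_NOTIFICATION,
	AENQ_KEEP_ALIVE,
	AENQ_GROUP_COUNT
};

enum aenq_syndrome {
	AENQ_SYNDROME_NONE,    /* shown as "- X -" in the table */
	AENQ_SUSPEND_TRAFFIC,
	AENQ_RESUME_TRAFFIC
};

/* Dispatch on the group first; only Notification inspects the syndrome. */
static const char *aenq_describe(enum aenq_group group,
				 enum aenq_syndrome syndrome)
{
	switch (group) {
	case AENQ_LINK_CHANGE:
		return "link state change";
	case AENQ_FATAL_ERROR:
		return "fatal error";
	case AENQ_KEEP_ALIVE:
		return "keep-alive";
	case AENQ_NOTIFICATION:
		if (syndrome == AENQ_SUSPEND_TRAFFIC)
			return "suspend traffic";
		if (syndrome == AENQ_RESUME_TRAFFIC)
			return "resume traffic";
		return "unknown notification";
	default:
		return "unknown group";
	}
}
```
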
ACQ and AENQ share the same MSI-X vector.

Keep-Alive is a special mechanism that allows monitoring of the
device's health. The driver maintains a watchdog (WD) handler which,
if fired, logs the current state and statistics, then resets and
restarts the ENA device and driver. A Keep-Alive event is delivered by
the device every second. The driver re-arms the WD upon reception of a
Keep-Alive event. A missed Keep-Alive event causes the WD handler to
fire.

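The re-arm/fire behaviour described above can be sketched with a
timestamp comparison. The names below (struct ena_wd and the ena_wd_*
helpers) are hypothetical, and the timeout value is an assumption left
to the caller; the text only states that Keep-Alive events arrive every
second and that a missed event causes the watchdog to fire.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchdog state; all times are in milliseconds. */
struct ena_wd {
	uint64_t last_keep_alive_ms;
	uint64_t timeout_ms;   /* assumed: longer than the 1 s event period */
};

static void ena_wd_init(struct ena_wd *wd, uint64_t now_ms,
			uint64_t timeout_ms)
{
	wd->last_keep_alive_ms = now_ms;
	wd->timeout_ms = timeout_ms;
}

/* Called when a Keep-Alive AENQ event arrives: re-arm the watchdog. */
static void ena_wd_keep_alive(struct ena_wd *wd, uint64_t now_ms)
{
	wd->last_keep_alive_ms = now_ms;
}

/* Periodic check: true means the watchdog fired and a reset is due. */
static bool ena_wd_expired(const struct ena_wd *wd, uint64_t now_ms)
{
	return now_ms - wd->last_keep_alive_ms > wd->timeout_ms;
}
```

When ena_wd_expired() returns true, a caller would log state and
statistics and then trigger the reset path described above.
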
Data Path Interface:
====================
I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ respectively). Each SQ has a completion queue (CQ) associated
with it.

The SQs and CQs are implemented as descriptor rings in contiguous
physical memory.

The ENA driver supports two Queue Operation modes for Tx SQs:
- Regular mode
  * In this mode the Tx SQs reside in the host's memory. The ENA
    device fetches the ENA Tx descriptors and packet data from host
    memory.
- Low Latency Queue (LLQ) mode or "push-mode"
  * In this mode the driver pushes the transmit descriptors and the
    first 128 bytes of the packet directly to the ENA device memory
    space. The rest of the packet payload is fetched by the
    device. For this operation mode, the driver uses a dedicated PCI
    device memory BAR, which is mapped with write-combine capability.

The Rx SQs support only the regular mode.

Note: Not all ENA devices support LLQ, and this feature is negotiated
      with the device upon initialization. If the ENA device does not
      support LLQ mode, the driver falls back to the regular mode.

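The split between pushed bytes and device-fetched bytes in LLQ mode can
be sketched as follows. struct llq_entry and llq_split() are
hypothetical names invented for this example, and an ordinary array
stands in for the write-combined device memory; only the 128-byte
figure comes from the text above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LLQ_PUSH_BYTES 128   /* "first 128 bytes" from the text above */

/* Hypothetical stand-in for one entry in device memory. */
struct llq_entry {
	uint8_t push_buf[LLQ_PUSH_BYTES];
	size_t push_len;
};

/*
 * Copy at most the first 128 bytes of the packet into the entry.
 * Returns the number of bytes the device still has to fetch from
 * host memory.
 */
static size_t llq_split(const uint8_t *pkt, size_t len, struct llq_entry *e)
{
	e->push_len = len < LLQ_PUSH_BYTES ? len : LLQ_PUSH_BYTES;
	memcpy(e->push_buf, pkt, e->push_len);
	return len - e->push_len;
}
```

For a 300-byte packet this pushes 128 bytes and leaves 172 for the
device to fetch; a 60-byte packet is pushed in full.
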
The driver supports multi-queue for both Tx and Rx. This has various
benefits:
- Reduced CPU/thread/process contention on a given Ethernet interface.
- Cache miss rate on completion is reduced, particularly for data
  cache lines that hold the sk_buff structures.
- Increased process-level parallelism when handling received packets.
- Increased data cache hit rate, by steering kernel processing of
  packets to the CPU where the application thread consuming the
  packet is running.
- Interrupt redirection performed in hardware.

Interrupt Modes:
================
The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).

Management interrupt registration is performed when the Linux kernel
probes the adapter, and it is de-registered when the adapter is
removed. I/O queue interrupt registration is performed when the Linux
interface of the adapter is opened, and it is de-registered when the
interface is closed.

The management interrupt is named:
   ena-mgmnt@pci:<PCI domain:bus:slot.function>
and for each queue pair, an interrupt is named:
   <interface name>-Tx-Rx-<queue index>

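The two naming patterns above map directly onto format strings. The
helper names below are hypothetical and exist only for illustration;
the PCI address and interface name are supplied by the caller.

```c
#include <stdio.h>

/*
 * Format the management interrupt name from a PCI address string such
 * as "0000:00:05.0" (domain:bus:slot.function).
 */
static int ena_mgmnt_irq_name(char *buf, size_t len, const char *pci_addr)
{
	return snprintf(buf, len, "ena-mgmnt@pci:%s", pci_addr);
}

/* Format the per-queue-pair interrupt name, e.g. "eth0-Tx-Rx-3". */
static int ena_io_irq_name(char *buf, size_t len, const char *ifname,
			   unsigned int queue_index)
{
	return snprintf(buf, len, "%s-Tx-Rx-%u", ifname, queue_index);
}
```

Names registered this way are what the user sees in /proc/interrupts.
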
The ENA device operates in auto-mask and auto-clear interrupt
modes. That is, once MSI-X is delivered to the host, its Cause bit is
automatically cleared and the interrupt is masked. The interrupt is
unmasked by the driver after NAPI processing is complete.

Interrupt Moderation:
=====================
The ENA driver and device can operate in conventional or adaptive
interrupt moderation mode.

In conventional mode the driver instructs the device to postpone
interrupt posting according to a static interrupt delay value. The
interrupt delay value can be configured through ethtool(8). The
following ethtool parameters are supported by the driver: tx-usecs,
rx-usecs.

In adaptive interrupt moderation mode the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the traffic nature.

By default the ENA driver applies adaptive coalescing on Rx traffic
and conventional coalescing on Tx traffic.

Adaptive coalescing can be switched on/off through the ethtool(8)
adaptive-rx on|off parameter.

The driver chooses the interrupt delay value according to the number
of bytes and packets received between interrupt unmasking and
interrupt posting. The driver uses an interrupt delay table that
subdivides the range of received bytes/packets into 5 levels and
assigns an interrupt delay value to each level.

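A five-level delay table of the kind described above might look like
the sketch below. Every number in it is a placeholder chosen for
illustration, not a driver default, and struct intr_level and
intr_delay_for() are hypothetical names. The selection rule (first
level whose packet and byte bounds both cover the sample) is also an
assumption; the text does not spell out how the two counters are
combined.

```c
#include <stddef.h>
#include <stdint.h>

#define INTR_LEVELS 5   /* "5 levels" from the text above */

/* One row of a hypothetical delay table. */
struct intr_level {
	uint32_t pkts_max;      /* upper bound on packets for this level */
	uint32_t bytes_max;     /* upper bound on bytes for this level */
	uint32_t delay_usecs;   /* interrupt delay applied at this level */
};

/* Placeholder values only; not the driver's defaults. */
static const struct intr_level intr_table[INTR_LEVELS] = {
	{          4,       1024,   0 },
	{         16,       8192,  16 },
	{         64,      32768,  32 },
	{        256,     131072,  64 },
	{ UINT32_MAX, UINT32_MAX, 128 },
};

/* Pick the first level whose bounds cover both counters. */
static uint32_t intr_delay_for(uint32_t pkts, uint32_t bytes)
{
	size_t i;

	for (i = 0; i < INTR_LEVELS - 1; i++) {
		if (pkts <= intr_table[i].pkts_max &&
		    bytes <= intr_table[i].bytes_max)
			break;
	}
	return intr_table[i].delay_usecs;
}
```
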
The user can enable/disable adaptive moderation, modify the interrupt
delay table, and restore its default values through sysfs.

rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.

SKB:
====
The driver allocates an SKB for each frame received, from the Rx
handling path running in NAPI context. The allocation method depends
on the size of the packet. If the frame length is larger than
rx_copybreak, napi_get_frags() is used. Otherwise,
netdev_alloc_skb_ip_align() is used, the buffer content is copied (by
the CPU) to the SKB, and the buffer is recycled.

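The copy-versus-attach decision above can be modelled in user space as
follows. This is a sketch under stated assumptions: struct rx_buf,
struct rx_skb, rx_build_skb() and RX_LINEAR_MAX are hypothetical
stand-ins invented for this example, and no kernel SKB API is called.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RX_LINEAR_MAX 256   /* hypothetical cap on the copied size */

enum rx_skb_path { RX_PATH_COPY, RX_PATH_FRAGS };

/* Hypothetical Rx buffer that was filled by the device. */
struct rx_buf {
	const uint8_t *data;
	uint32_t len;
	bool recycled;          /* true if the buffer can be reused for Rx */
};

/* Hypothetical, much simplified model of an SKB. */
struct rx_skb {
	uint8_t linear[RX_LINEAR_MAX];
	uint32_t linear_len;
	const uint8_t *frag;    /* Rx buffer attached without copying */
	uint32_t frag_len;
};

static enum rx_skb_path rx_build_skb(struct rx_buf *buf,
				     uint32_t rx_copybreak,
				     struct rx_skb *skb)
{
	memset(skb, 0, sizeof(*skb));

	if (buf->len > rx_copybreak || buf->len > RX_LINEAR_MAX) {
		/* Large frame: hand the Rx buffer itself to the SKB. */
		skb->frag = buf->data;
		skb->frag_len = buf->len;
		buf->recycled = false;
		return RX_PATH_FRAGS;
	}

	/* Small frame: copy the payload and recycle the Rx buffer. */
	memcpy(skb->linear, buf->data, buf->len);
	skb->linear_len = buf->len;
	buf->recycled = true;
	return RX_PATH_COPY;
}
```

The trade-off is a CPU copy for small frames in exchange for not
having to unmap and replace the Rx buffer.
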
Statistics:
===========
The user can obtain ENA device and driver statistics using ethtool.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.

In addition, the driver logs the stats to syslog upon device reset.

MTU:
====
The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
via ip(8) or similar tools.

Stateless Offloads:
===================
The ENA driver supports:
- TSO over IPv4/IPv6
- TSO with ECN
- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads

RSS:
====
- The ENA device supports RSS that allows flexible Rx traffic
  steering.
- Toeplitz and CRC32 hash functions are supported.
- Different combinations of L2/L3/L4 fields can be configured as
  inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
  (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
  ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties).
- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
  function delivered in the Rx CQ descriptor is set in the received
  SKB.
- The user can provide a hash key, hash function, and configure the
  indirection table through ethtool(8).

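The role of the indirection table can be sketched as a lookup from the
32-bit hash to an Rx queue index. The table size, struct rss_table, and
both helpers are hypothetical choices made for this example; the real
table size is defined by the device, and the hash itself would come
from the Toeplitz or CRC32 function mentioned above.

```c
#include <stdint.h>

#define RSS_TABLE_SIZE 128   /* hypothetical; the real size is device-defined */

struct rss_table {
	uint16_t queue[RSS_TABLE_SIZE];
};

/*
 * Spread the queues round-robin across the table.
 * num_queues must be non-zero.
 */
static void rss_table_init(struct rss_table *t, uint16_t num_queues)
{
	unsigned int i;

	for (i = 0; i < RSS_TABLE_SIZE; i++)
		t->queue[i] = (uint16_t)(i % num_queues);
}

/* Map a 32-bit hash result to an Rx queue via the indirection table. */
static uint16_t rss_pick_queue(const struct rss_table *t, uint32_t hash)
{
	return t->queue[hash % RSS_TABLE_SIZE];
}
```

Rewriting individual table entries, as ethtool(8) allows, steers the
flows that hash to those entries onto a different queue without
changing the hash function or key.
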
DATA PATH:
==========
Tx:
---
ena_start_xmit() is called by the stack. This function does the following:
- Maps data buffers (skb->data and frags).
- Populates ena_buf for the push buffer (if the driver and device are
  in push mode).
- Prepares ENA bufs for the remaining frags.
- Allocates a new request ID from the empty req_id ring. The request
  ID is the index of the packet in the Tx info. This is used for
  out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
- Calls ena_com_prepare_tx(), an ENA communication layer function that
  converts the ena_bufs to ENA descriptors (and adds meta ENA
  descriptors as needed).
  * This function also copies the ENA descriptors and the push buffer
    to the device memory space (if in push mode).
- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
  interrupt is raised.
- The interrupt handler schedules NAPI.
- The ena_clean_tx_irq() function is called. This function handles the
  completion descriptors generated by the ENA device, with a single
  completion descriptor per completed packet.
  * req_id is retrieved from the completion descriptor. The tx_info of
    the packet is retrieved via the req_id. The data buffers are
    unmapped and req_id is returned to the empty req_id ring.
  * The function stops when all completion descriptors have been
    processed or the budget is reached.

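The request ID ring mentioned above is what makes out-of-order Tx
completions workable: an ID is taken when a packet is queued and handed
back when its completion arrives, in whatever order that happens. The
sketch below is a simplified user-space model; struct tx_ring, its
field names, and the tiny ring size are assumptions made for this
example and are not taken from the driver source.

```c
#include <stdint.h>

#define TX_RING_SIZE 4   /* tiny, for illustration only */

struct tx_info {
	int in_use;
};

struct tx_ring {
	uint16_t free_ids[TX_RING_SIZE];   /* ring of unused request IDs */
	uint16_t next_to_use;              /* slot the next ID is taken from */
	uint16_t next_to_clean;            /* slot a completed ID returns to */
	uint16_t in_flight;
	struct tx_info info[TX_RING_SIZE]; /* indexed by request ID */
};

static void tx_ring_init(struct tx_ring *r)
{
	uint16_t i;

	for (i = 0; i < TX_RING_SIZE; i++) {
		r->free_ids[i] = i;
		r->info[i].in_use = 0;
	}
	r->next_to_use = 0;
	r->next_to_clean = 0;
	r->in_flight = 0;
}

/* Take a request ID for a new packet; returns -1 if the ring is full. */
static int tx_alloc_req_id(struct tx_ring *r)
{
	uint16_t id;

	if (r->in_flight == TX_RING_SIZE)
		return -1;
	id = r->free_ids[r->next_to_use];
	r->next_to_use = (uint16_t)((r->next_to_use + 1) % TX_RING_SIZE);
	r->info[id].in_use = 1;
	r->in_flight++;
	return id;
}

/* Handle a completion for req_id; completions may arrive in any order. */
static int tx_complete_req_id(struct tx_ring *r, uint16_t id)
{
	if (id >= TX_RING_SIZE || !r->info[id].in_use)
		return -1;   /* invalid or duplicate completion */
	r->info[id].in_use = 0;
	r->free_ids[r->next_to_clean] = id;
	r->next_to_clean = (uint16_t)((r->next_to_clean + 1) % TX_RING_SIZE);
	r->in_flight--;
	return 0;
}
```

Because freed IDs are written back in completion order, the IDs handed
out later no longer follow ring order, which is why the ID rather than
the ring position is used to find the packet's tx_info.
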
Rx:
---
- When a packet is received from the ENA device, an interrupt is
  raised.
- The interrupt handler schedules NAPI.
- The ena_clean_rx_irq() function is called. This function calls
  ena_rx_pkt(), an ENA communication layer function, which returns the
  number of descriptors used for a new unhandled packet, and zero if
  no new packet is found.
- Then it calls the ena_eth_rx_skb() function.
- ena_eth_rx_skb() checks packet length:
  * If the packet is small (len <= rx_copybreak), the driver allocates
    an SKB for the new packet, and copies the packet payload into the
    SKB data buffer.
    - In this way the original data buffer is not passed to the stack
      and is reused for future Rx packets.
  * Otherwise the function unmaps the Rx buffer, then allocates the
    new SKB structure and hooks the Rx buffer to the SKB frags.
- The new SKB is updated with the necessary information (protocol,
  checksum hw verify result, etc.), and then passed to the network
  stack, using the NAPI interface function napi_gro_receive().
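
The Rx cleaning loop described above can be sketched as a poll routine
that keeps asking for the next packet until none is pending. The budget
limit is an assumption carried over from the Tx description and from
how NAPI polling generally works; struct rx_cq, rx_next_pkt() and
rx_clean() are hypothetical names invented for this example.

```c
#include <stddef.h>

/*
 * Hypothetical completion queue model: each element is the number of
 * descriptors used by one pending packet.
 */
struct rx_cq {
	const unsigned int *descs_per_pkt;
	size_t count;
	size_t next;
};

/*
 * Stand-in for the communication layer call described above: returns
 * the number of descriptors used by the next unhandled packet, or zero
 * if no new packet is found.
 */
static unsigned int rx_next_pkt(struct rx_cq *cq)
{
	if (cq->next >= cq->count)
		return 0;
	return cq->descs_per_pkt[cq->next++];
}

/*
 * Handle at most `budget` packets. Returns the number of packets
 * handled and stores the number of descriptors consumed.
 */
static int rx_clean(struct rx_cq *cq, int budget, unsigned int *descs_done)
{
	int done = 0;

	*descs_done = 0;
	while (done < budget) {
		unsigned int n = rx_next_pkt(cq);

		if (n == 0)
			break;          /* no new packet */
		*descs_done += n;       /* SKB construction would go here */
		done++;
	}
	return done;
}
```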