Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.  If there
is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).


Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.  Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.  (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)
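
The three cases above can be sketched in a few lines of Python.  This is
only an illustration, not code from Open vSwitch: flow keys are modeled as
plain dicts from attribute name to a hashable value, and the function name
plan_for_miss and its return strings are hypothetical.  The final branch
(the two keys disagree on a shared attribute) is not covered by the text
above; falling back to userspace forwarding there is an assumption.

```python
def plan_for_miss(kernel_key, user_key):
    """Decide how userspace might handle a packet passed up by the kernel.

    Both keys are dicts of attribute name -> hashable value (a simplified
    model).  Returns a (decision, key_to_install) tuple.
    """
    kernel_items = set(kernel_key.items())
    user_items = set(user_key.items())

    if user_items <= kernel_items:
        # Keys match, or the kernel parsed more than userspace understands:
        # set up a flow, always using the kernel-provided key.
        return ("install_flow", kernel_key)

    if kernel_items < user_items:
        # Userspace parsed more than the kernel: forward by hand and do not
        # install a kernel flow (unless the extra fields are known not to
        # affect forwarding, which this sketch does not try to decide).
        return ("forward_in_userspace", None)

    # Assumption: the keys disagree on a shared attribute.  The text does
    # not describe this case, so take the conservative path.
    return ("forward_in_userspace", None)
```

For instance, a kernel key that carries an extra "ipv6" attribute compared
with the userspace key yields ("install_flow", kernel_key), while the
reverse situation yields ("forward_in_userspace", None).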

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.  For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1:

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.:

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
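
As a rough illustration of this informal notation (not of the Netlink
encoding itself), the sketch below renders a key held as a list of
(name, value) pairs.  The data model and the names format_key and
_format_attr are assumptions made for this example: a dict value stands
for named arguments, a list value for nested attributes, and Python's
Ellipsis for arguments that are elided.

```python
def format_key(attrs):
    """Render a flow key given as a list of (name, value) pairs."""
    return ", ".join(_format_attr(name, value) for name, value in attrs)


def _format_attr(name, value):
    if isinstance(value, dict):
        # Named arguments, e.g. tcp(src=49163, dst=80).
        body = ", ".join(f"{k}={v}" for k, v in value.items())
    elif isinstance(value, list):
        # Nested attributes are rendered recursively inside the parentheses.
        body = format_key(value)
    elif value is Ellipsis:
        # Arguments not important to the discussion.
        body = "..."
    else:
        body = str(value)
    return f"{name}({body})"


key = [
    ("in_port", 1),
    ("eth", Ellipsis),
    ("eth_type", "0x0800"),
    ("ipv4", Ellipsis),
    ("tcp", {"src": 49163, "dst": 80}),
]
print(format_key(key))
```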


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that the user space program must process.
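
The mask semantics can be written down directly as bit operations.  The
sketch below applies them to a single 32-bit field (an IPv4 destination
address) for brevity; a real mask covers every attribute in the key.  The
helper names ip_bits and matches are made up for this example.

```python
import ipaddress


def ip_bits(text):
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(text))


def matches(packet_bits, key_bits, mask_bits):
    """True if the packet agrees with the key on every '1' bit of the mask.

    Bits where the mask is '0' are don't care bits and are ignored.
    """
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


key = ip_bits("172.18.0.52")
mask = ip_bits("255.255.0.0")  # exact match on the top 16 bits only

same_prefix = matches(ip_bits("172.18.9.1"), key, mask)
other_prefix = matches(ip_bits("172.19.0.52"), key, mask)
```

Here same_prefix is true and other_prefix is false: one wildcarded flow
covers every destination that shares the first 16 bits of the key, where
exact match flows would need a separate setup for each address.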

Support for the mask Netlink attribute is optional for both the kernel and user
space program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to fewer than
the user space program specified. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.
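
A user space program could check for overlap itself before installing a
flow.  The sketch below is one way to do that over integer-encoded keys and
masks; it is not the kernel's detection logic, and the function name
may_overlap is hypothetical.  Two flows can match the same packet exactly
when their keys agree on every bit that both masks require to match.

```python
def may_overlap(key_a, mask_a, key_b, mask_b):
    """True if some packet could match both (key, mask) pairs.

    Only bits that are '1' in both masks constrain both flows at once.  If
    the keys agree on all of those bits, a packet can be built that takes
    its remaining bits from whichever flow constrains them.
    """
    common = mask_a & mask_b
    return (key_a & common) == (key_b & common)


# 172.18.0.0/16 and 172.18.5.0/24 overlap; 172.19.0.0/16 overlaps neither.
wide = (0xAC120000, 0xFFFF0000)
narrow = (0xAC120500, 0xFFFFFF00)
other = (0xAC130000, 0xFFFF0000)
```

With these values, may_overlap(*wide, *narrow) is true, so the program
would have to pick one of the two flows or split the wider one.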


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.
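
The toy table below illustrates what this means for a caller.  It is a
hypothetical user space model, not the kernel's data structure: the class
name FlowTable, its methods, and the string used as a UFID are all
invented for the example.  A flow installed with a UFID is reachable only
through that UFID, while a flow installed without one keeps the key as its
handle.

```python
class FlowTable:
    """Toy flow table that indexes each flow by exactly one handle."""

    def __init__(self):
        self._by_handle = {}

    def add(self, key, actions, ufid=None):
        """Install a flow and return the handle for future operations."""
        # With a UFID the key is stored but not indexed.
        handle = ufid if ufid is not None else key
        self._by_handle[handle] = (key, actions)
        return handle

    def get(self, handle):
        """Look up a flow by its handle; returns None if it is unknown."""
        return self._by_handle.get(handle)


table = FlowTable()
key = (("in_port", 1), ("eth_type", 0x0800))  # hashable stand-in for a key
handle = table.add(key, ["output:2"], ufid="hypothetical-ufid-1")
```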


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious:

    ------------------------------------------------------------------
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ------------------------------------------------------------------

This rule does have less-obvious consequences, so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet.  The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata:

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this:

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as:

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
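
The difference between the two layouts can be made concrete with a small
model of an application that predates VLAN support.  Everything here is
illustrative: keys are lists of (name, value) pairs, the set
KNOWN_TO_OLD_APP and the function visible_to are hypothetical, and
"ignoring" an attribute is modeled as skipping it without looking inside.

```python
# Attributes an application written before VLAN support would understand.
KNOWN_TO_OLD_APP = {"eth", "eth_type", "ip", "tcp"}


def visible_to(known, key):
    """Return the top-level attributes that an application can interpret.

    Unknown attributes are skipped whole, including anything nested
    inside them.
    """
    return [(name, value) for name, value in key if name in known]


# Naive layout: "vlan" is added alongside the existing attributes.
flat = [
    ("eth", "..."),
    ("vlan", {"vid": 10, "pcp": 0}),
    ("eth_type", 0x0800),
    ("ip", {"proto": 6}),
    ("tcp", "..."),
]

# Actual layout: the inner headers are nested inside "encap".
nested = [
    ("eth", "..."),
    ("eth_type", 0x8100),
    ("vlan", {"vid": 10, "pcp": 0}),
    ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]),
]

old_view_flat = dict(visible_to(KNOWN_TO_OLD_APP, flat))
old_view_nested = dict(visible_to(KNOWN_TO_OLD_APP, nested))
```

With the flat layout the old application sees eth_type 0x0800 plus "ip"
and "tcp" and wrongly concludes that it is looking at untagged IP traffic.
With the nested layout it sees only "eth" and eth_type 0x8100, which is
exactly what it would have seen before VLAN parsing was added.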


Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always arrange for all packets with those values to be sent to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this:

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.  The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this:

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
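
The TCP case can be sketched as follows.  This is an illustration of the
rule, not the kernel's parser: the function name tcp_attr is hypothetical,
and treating anything shorter than the 20-byte minimum TCP header as
"missing" is an assumption made for the example.

```python
import struct

TCP_MIN_HEADER_LEN = 20  # assumption: shorter data counts as a missing header


def tcp_attr(l4_bytes):
    """Build the tcp attribute from the bytes that follow the IP header.

    A truncated or absent TCP header yields the "empty" attribute with
    all-zero ports rather than causing the packet to be dropped.
    """
    if len(l4_bytes) < TCP_MIN_HEADER_LEN:
        return {"src": 0, "dst": 0}
    # Source and destination ports are the first two big-endian 16-bit fields.
    src, dst = struct.unpack_from("!HH", l4_bytes, 0)
    return {"src": src, "dst": dst}


full_header = struct.pack("!HH", 49163, 80) + bytes(16)
complete = tcp_attr(full_header)
truncated = tcp_attr(full_header[:10])
```

Here complete carries ports 49163 and 80, while truncated is the all-zero
attribute, so userspace still receives a flow key and can decide what to
do with the packet.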


Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
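
Two of these rules lend themselves to a short sketch.  The code below is
illustrative only: keys are modeled as lists of (name, value) pairs with
nested lists for nested attributes, raw keys as opaque byte strings, and
the names has_duplicates and opaque_id are hypothetical.  SHA-256 is used
simply because it is in the Python standard library; any stable hash
would serve.

```python
import hashlib


def has_duplicates(key):
    """True if any attribute name repeats within a single nesting level."""
    names = [name for name, _ in key]
    if len(names) != len(set(names)):
        return True
    # The same name may appear again at a deeper level, so each nested
    # list is checked on its own.
    return any(isinstance(v, list) and has_duplicates(v) for _, v in key)


def opaque_id(raw_key):
    """Hash the raw bytes of a key without interpreting them.

    This relies on the kernel always composing a given key the same way.
    """
    return hashlib.sha256(raw_key).hexdigest()


vlan_key = [
    ("eth", "..."),
    ("eth_type", 0x8100),
    ("vlan", 0),
    ("encap", [("eth_type", 0x0800)]),
]
```

The vlan_key above is valid even though "eth_type" appears twice, because
the two occurrences sit at different nesting levels.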