This document is slightly dated but should give you an idea of how things
work.

What is it?
-----------

An extension to the filtering/classification architecture of Linux Traffic
Control.
Up to 2.6.8 the only action that could be "attached" to a filter was policing,
i.e. you could say something like:

-----
tc filter add dev lo parent ffff: protocol ip prio 10 u32 match ip src \
    127.0.0.1/32 flowid 1:1 police mtu 4000 rate 1500kbit burst 90k
-----

which means "if a packet is seen on the ingress of the lo device with
a source IP address of 127.0.0.1/32, we give it a classification id of 1:1 and
we execute a policing action which rate limits its bandwidth utilization
to 1.5Mbps".

The new extensions allow for more than just policing actions to be added.
They are also fully backward compatible. If you have a kernel that doesn't
understand them, then the effect is null, i.e. if you have a newer tc
but an older kernel, the actions are not installed. Likewise, if you
have a newer kernel but an older tc, the tc will use its existing
syntax, which will work fine. Of course, to get the required effect you need
both a newer tc and a newer kernel. If you are reading this, you have the
right tc ;->

A side effect is that we can now get stateless firewalling to work with tc.
Essentially this is now an alternative to iptables.
I won't go into details of my dislike for iptables at times, but
scalability is one of the main issues; however, if you need stateful
classification, use netfilter (for now).

This stuff works on both ingress and egress qdiscs.

Features
--------

1) New additional syntax and actions enabled. Note that the old syntax is
still valid.

Essentially this is still the same syntax as tc with a new construct,
"action". The syntax is of the form:
tc filter add <DEVICE> parent 1:0 protocol ip prio 10 <Filter description>
flowid 1:1 action <ACTION description>*

You can have as many actions as you want (within reason).

In the past the only real action was the policer, i.e. you could do something
along the lines of:
tc filter add dev lo parent ffff: protocol ip prio 10 u32 \
    match ip src 127.0.0.1/32 flowid 1:1 \
    police mtu 4000 rate 1500kbit burst 90k

Although you can still use the same syntax, now you can say:

tc filter add dev lo parent 1:0 protocol ip prio 10 u32 \
    match ip src 127.0.0.1/32 flowid 1:1 \
    action police mtu 4000 rate 1500kbit burst 90k

"Generic actions" (gact) at the moment are:
{ drop, pass, reclassify, continue }
(If you have others not listed here, give me a reason and we will add them.)
+drop says to drop the packet
+pass (ok is equivalent) says to accept it
+reclassify requests reclassification of the packet
+continue requests the next lookup to match
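
A minimal sketch of how a generic action might be attached to a filter. This
is an illustration only: the helper name drop_from, the address 10.0.0.9 and
the prio/flowid values are made up here, and it assumes an ingress qdisc
already exists on the device. By default it only prints the command; set
TC=tc and run it as root to actually install the filter.

```shell
#!/bin/sh
# Dry run by default: print the tc command instead of executing it.
# Assumption: export TC=tc (as root) to install the filter for real.
TC="${TC:-echo tc}"

# drop_from DEV SRC  (hypothetical helper, not part of tc)
# Drops every ingress IP packet on DEV whose source address matches SRC.
drop_from() {
    $TC filter add dev "$1" parent ffff: protocol ip prio 6 u32 \
        match ip src "$2" flowid 1:16 action drop
}

drop_from eth0 10.0.0.9/32
```

Swapping "drop" for pass, reclassify or continue selects one of the other
generic actions listed above.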

2) In order to take advantage of some of the targets written by the
iptables people, a classifier can have a packet massaged by an
iptables target. I have only tested with mangle targets up to now
(in fact, anything that is not in the mangle table is disabled right now).

In terms of hooks:
*ingress is mapped to the pre-routing hook
*egress is mapped to the post-routing hook
I don't see much value in the other hooks; if you do and email me good
reasons, the addition is trivial.

The syntax for using iptables targets becomes:
tc filter add ..... u32 <u32 syntax> action ipt -j <iptables target syntax>

example:
tc filter add dev lo parent ffff: protocol ip prio 8 u32 \
    match ip dst 127.0.0.8/32 flowid 1:12 \
    action ipt -j mark --set-mark 2

NOTE: flowid 1:12 is parsed as flowid 0x1:0x12. If you want a flowid of
decimal 12, use flowid 1:c.
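
Since both halves of a flowid are read as hexadecimal, a small helper can do
the conversion when you think in decimal. The function name flowid_for is
made up for this sketch; it only formats a string and does not call tc.

```shell
#!/bin/sh
# flowid_for MAJOR MINOR  (hypothetical helper)
# Takes decimal major and minor numbers and prints them in the
# hexadecimal major:minor form that tc expects.
flowid_for() {
    printf '%x:%x\n' "$1" "$2"
}

flowid_for 1 12    # prints 1:c
flowid_for 1 18    # prints 1:12
```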

3) A feature I call pipe
The motivation is derived from the Unix pipe mechanism, but applied to packets.
Essentially, take a matching packet and pass it through
action1 | action2 | action3 etc.
You could do something similar to this with the tc policer and the "continue"
operator, but this rather restricts it to just the policer and requires
multiple rules (and lookups, hence quite inefficient).

As an example -- and please note that this is just an example, _not_ The
Word You've Been Waiting For (yes, I have had problems giving examples
which ended up becoming dogma in documents, with people modifying them a
little to look clever):

I selected the metering rates to be small so that I can better show how
things work.

The script below does the following:
- An incoming packet from 10.0.0.21 is first given a firewall mark of 1.

- It is then metered to make sure it does not exceed its allocated rate of
1Kbps. If it doesn't exceed the rate, this is where we terminate action
execution.

- If it does exceed its rate, its "color" changes to a mark of 2 and it is
then passed through a second meter.

- The second meter is shared across all flows on that device. [I am surprised
that this seems not to be a well-known feature of the policer; Bert was telling
me that someone was writing a qdisc just to do sharing across multiple devices;
it must be the summer heat again; we've had someone doing that every year
around summer. The key to sharing is to use the "index" operator in your
policer rules (for example "index 20"). All your rules have to use the same
index to share.]

- If the second meter is exceeded, the color of the flow changes further to 3.

- We then pass the packet to another meter which is shared across all devices
in the system. If this meter is exceeded, we drop the packet.

Note the mark can be used further up the system to do things like policy
or more interesting things on the egress.
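
To make the "index" sharing concrete, here is a rough sketch in which two
filters for different sources name the same policer index and so draw on one
meter. The helper name shared_police and the second address 10.0.0.23 are
invented for the illustration, and the rate values are simply borrowed from
this document. It prints the commands by default; set TC=tc and run as root
(with an ingress qdisc on the device) to try it for real.

```shell
#!/bin/sh
# Dry run by default: print the tc commands instead of executing them.
# Assumption: export TC=tc (as root) to install the filters for real.
TC="${TC:-echo tc}"

# shared_police DEV SRC  (hypothetical helper, not part of tc)
# Attaches a policer with index 30 to packets from SRC arriving on DEV.
# Every filter that names index 30 shares the same meter.
shared_police() {
    $TC filter add dev "$1" parent ffff: protocol ip prio 1 u32 \
        match ip src "$2" flowid 1:15 \
        action police index 30 mtu 5000 rate 1kbit burst 10k drop
}

shared_police eth0 10.0.0.21/32
shared_police eth0 10.0.0.23/32
```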

------------------ cut here -------------------------------
#
# Add an ingress qdisc on eth0
tc qdisc add dev eth0 ingress
#
# If you see an incoming packet from 10.0.0.21:
#  1. first give it a mark of 1
#  2. then pass it through a policer which allows 1kbps; if the flow
#     doesn't exceed that rate, this is where we stop; if it exceeds it,
#     we pipe the packet to the next action
#  3. which marks the packet fwmark as 2 and pipes
#  4. next attempt to borrow b/width from a meter used across all flows
#     incoming on eth0 ("index 30"); if that is exceeded we pipe to the
#     next action
#  5. mark it as fwmark 3 if exceeded
#  6. and then attempt to borrow from a meter used by all devices in the
#     system ("index 20"). Should this be exceeded, drop the packet on
#     the floor.
tc filter add dev eth0 parent ffff: protocol ip prio 1 \
    u32 match ip src 10.0.0.21/32 flowid 1:15 \
    action ipt -j mark --set-mark 1 index 2 \
    action police rate 1kbit burst 9k pipe \
    action ipt -j mark --set-mark 2 \
    action police index 30 mtu 5000 rate 1kbit burst 10k pipe \
    action ipt -j mark --set-mark 3 \
    action police index 20 mtu 5000 rate 1kbit burst 90k drop
---------------------------------

Now let's see the actions installed with
"tc filter show parent ffff: dev eth0"

-------- output -----------
jroot# tc filter show parent ffff: dev eth0
filter protocol ip pref 1 u32
filter protocol ip pref 1 u32 fh 800: ht divisor 1
filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:15

action order 1: tablename: mangle hook: NF_IP_PRE_ROUTING
target MARK set 0x1 index 2

action order 2: police 1 action pipe rate 1Kbit burst 9Kb mtu 2Kb

action order 3: tablename: mangle hook: NF_IP_PRE_ROUTING
target MARK set 0x2 index 1

action order 4: police 30 action pipe rate 1Kbit burst 10Kb mtu 5000b

action order 5: tablename: mangle hook: NF_IP_PRE_ROUTING
target MARK set 0x3 index 3

action order 6: police 20 action drop rate 1Kbit burst 90Kb mtu 5000b

match 0a000015/ffffffff at 12
-------------------------------
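
The "match 0a000015/ffffffff at 12" line is the u32 form of the rule: the
32-bit value 0x0a000015 under a full mask, compared at byte offset 12 of the
IP header, which is where the IPv4 source address lives. A small sketch to
check that 10.0.0.21 really is 0a000015 (the function name ip_to_hex is made
up here):

```shell
#!/bin/sh
# ip_to_hex ADDR  (hypothetical helper)
# Prints a dotted-quad IPv4 address as the 8 hex digits u32 displays.
ip_to_hex() {
    # Turn the dots into spaces so the shell splits the address into octets.
    set -- $(echo "$1" | tr '.' ' ')
    printf '%02x%02x%02x%02x\n' "$1" "$2" "$3" "$4"
}

ip_to_hex 10.0.0.21    # prints 0a000015
```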

Note the ordering of the actions is based on the order in which we entered
them. In the future I will add explicit priorities.

Now let's run a ping -f from 10.0.0.21 to this host; stop the ping after
you see a few lines of dots.

----
[root@jzny hadi]# ping -f 10.0.0.22
PING 10.0.0.22 (10.0.0.22): 56 data bytes
| 202 | .................................................................................................................................................................................................................................................................................................................................................................................................................................................... |
--- 10.0.0.22 ping statistics ---
2248 packets transmitted, 1811 packets received, 19% packet loss
round-trip min/avg/max = 0.7/9.3/20.1 ms
-----------------------------

Now let's take a look at the stats with "tc -s filter show parent ffff: dev eth0"

--------------
jroot# tc -s filter show parent ffff: dev eth0
filter protocol ip pref 1 u32
filter protocol ip pref 1 u32 fh 800: ht divisor 1
filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:15

action order 1: tablename: mangle hook: NF_IP_PRE_ROUTING
target MARK set 0x1 index 2
Sent 188832 bytes 2248 pkts (dropped 0, overlimits 0)

action order 2: police 1 action pipe rate 1Kbit burst 9Kb mtu 2Kb
Sent 188832 bytes 2248 pkts (dropped 0, overlimits 2122)

action order 3: tablename: mangle hook: NF_IP_PRE_ROUTING
target MARK set 0x2 index 1
Sent 178248 bytes 2122 pkts (dropped 0, overlimits 0)

action order 4: police 30 action pipe rate 1Kbit burst 10Kb mtu 5000b
Sent 178248 bytes 2122 pkts (dropped 0, overlimits 1945)

action order 5: tablename: mangle hook: NF_IP_PRE_ROUTING
target MARK set 0x3 index 3
Sent 163380 bytes 1945 pkts (dropped 0, overlimits 0)

action order 6: police 20 action drop rate 1Kbit burst 90Kb mtu 5000b
Sent 163380 bytes 1945 pkts (dropped 0, overlimits 437)

match 0a000015/ffffffff at 12
-------------------------------
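
The counters can be cross-checked against the ping result. The sketch below
assumes that "overlimits" on each policer counts the packets that exceeded
that meter and were therefore piped on (or, for the last one, dropped); the
variable names are made up for the illustration. Each packet is 84 bytes on
the wire here (188832 / 2248), i.e. 56 data bytes plus ICMP and IP headers.

```shell
#!/bin/sh
# Counters copied from the "tc -s filter show" output above.
sent=2248     # packets seen by the first action (mark 1)
over1=2122    # overlimits on police 1  (per-flow meter)
over2=1945    # overlimits on police 30 (meter shared on eth0)
over3=437     # overlimits on police 20 (meter shared system-wide)

# Packets that stayed within each meter, so action execution stopped there.
within1=$((sent - over1))
within2=$((over1 - over2))
within3=$((over2 - over3))

accepted=$((within1 + within2 + within3))
loss_pct=$((over3 * 100 / sent))    # integer division, rounds down

echo "within meter 1: $within1, meter 2: $within2, meter 3: $within3"
echo "accepted: $accepted of $sent, loss: ${loss_pct}%"
```

This gives 126, 177 and 1508 packets within the three meters, 1811 accepted
in total and 19% loss, which agrees with the ping statistics shown earlier.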

Neat, eh?


Wanna write an action module?
------------------------------
It's easy. Either look at the code or send me email. I will document it at
some point; I will also accept documentation.

TODO
----

Lotsa goodies/features coming. Requests also being accepted.
At the moment the focus has been on getting the architecture in place.
Expect new things in the spare time I have to work on this
(particularly around the end of the year, when I typically get time off
from work).