lh | 9ed821d | 2023-04-07 01:36:19 -0700 | [diff] [blame^] | 1 | .TH CBQ 8 "8 December 2001" "iproute2" "Linux" |
| 2 | .SH NAME |
| 3 | CBQ \- Class Based Queueing |
| 4 | .SH SYNOPSIS |
| 5 | .B tc qdisc ... dev |
| 6 | dev |
| 7 | .B ( parent |
| 8 | classid |
| 9 | .B | root) [ handle |
| 10 | major: |
| 11 | .B ] cbq avpkt |
| 12 | bytes |
| 13 | .B bandwidth |
| 14 | rate |
| 15 | .B [ cell |
| 16 | bytes |
| 17 | .B ] [ ewma |
| 18 | log |
| 19 | .B ] [ mpu |
| 20 | bytes |
| 21 | .B ] |
| 22 | |
| 23 | .B tc class ... dev |
| 24 | dev |
| 25 | .B parent |
| 26 | major:[minor] |
| 27 | .B [ classid |
| 28 | major:minor |
| 29 | .B ] cbq allot |
| 30 | bytes |
| 31 | .B [ bandwidth |
| 32 | rate |
| 33 | .B ] [ rate |
| 34 | rate |
| 35 | .B ] prio |
| 36 | priority |
| 37 | .B [ weight |
| 38 | weight |
| 39 | .B ] [ minburst |
| 40 | packets |
| 41 | .B ] [ maxburst |
| 42 | packets |
| 43 | .B ] [ ewma |
| 44 | log |
| 45 | .B ] [ cell |
| 46 | bytes |
| 47 | .B ] avpkt |
| 48 | bytes |
| 49 | .B [ mpu |
| 50 | bytes |
| 51 | .B ] [ bounded isolated ] [ split |
| 52 | handle |
| 53 | .B & defmap |
| 54 | defmap |
| 55 | .B ] [ estimator |
| 56 | interval timeconstant |
| 57 | .B ] |
| 58 | |
| 59 | .SH DESCRIPTION |
| 60 | Class Based Queueing is a classful qdisc that implements a rich |
| 61 | linksharing hierarchy of classes. It contains shaping elements as |
| 62 | well as prioritizing capabilities. Shaping is performed using link |
| 63 | idle time calculations based on the timing of dequeue events and |
| 64 | underlying link bandwidth. |
| 65 | |
| 66 | .SH SHAPING ALGORITHM |
| 67 | Shaping is done using link idle time calculations, and actions taken if |
| 68 | these calculations deviate from set limits. |
| 69 | |
| 70 | When shaping a 10mbit/s connection to 1mbit/s, the link will |
| 71 | be idle 90% of the time. If it isn't, it needs to be throttled so that it |
| 72 | IS idle 90% of the time. |
| 73 | |
| 74 | From the kernel's perspective, this is hard to measure, so CBQ instead |
| 75 | derives the idle time from the number of microseconds (in fact, jiffies) |
| 76 | that elapse between requests from the device driver for more data. Combined |
| 77 | with the knowledge of packet sizes, this is used to approximate how full or |
| 78 | empty the link is. |
| 79 | |
| 80 | This is rather circumspect and doesn't always arrive at proper |
| 81 | results. For example, what is the actual link speed of an interface |
| 82 | that is not really able to transmit the full 100mbit/s of data, |
| 83 | perhaps because of a badly implemented driver? A PCMCIA network card |
| 84 | will also never achieve 100mbit/s because of the way the bus is |
| 85 | designed - again, how do we calculate the idle time? |
| 86 | |
| 87 | The physical link bandwidth may be ill defined in case of not-quite-real |
| 88 | network devices like PPP over Ethernet or PPTP over TCP/IP. The effective |
| 89 | bandwidth in that case is probably determined by the efficiency of pipes |
| 90 | to userspace - which not defined. |
| 91 | |
| 92 | During operations, the effective idletime is measured using an |
| 93 | exponential weighted moving average (EWMA), which considers recent |
| 94 | packets to be exponentially more important than past ones. The Unix |
| 95 | loadaverage is calculated in the same way. |
| 96 | |
| 97 | The calculated idle time is subtracted from the EWMA measured one, |
| 98 | the resulting number is called 'avgidle'. A perfectly loaded link has |
| 99 | an avgidle of zero: packets arrive exactly at the calculated |
| 100 | interval. |
| 101 | |
| 102 | An overloaded link has a negative avgidle and if it gets too negative, |
| 103 | CBQ throttles and is then 'overlimit'. |
| 104 | |
| 105 | Conversely, an idle link might amass a huge avgidle, which would then |
| 106 | allow infinite bandwidths after a few hours of silence. To prevent |
| 107 | this, avgidle is capped at |
| 108 | .B maxidle. |
| 109 | |
| 110 | If overlimit, in theory, the CBQ could throttle itself for exactly the |
| 111 | amount of time that was calculated to pass between packets, and then |
| 112 | pass one packet, and throttle again. Due to timer resolution constraints, |
| 113 | this may not be feasible, see the |
| 114 | .B minburst |
| 115 | parameter below. |
| 116 | |
| 117 | .SH CLASSIFICATION |
| 118 | Within the one CBQ instance many classes may exist. Each of these classes |
| 119 | contains another qdisc, by default |
| 120 | .BR tc-pfifo (8). |
| 121 | |
| 122 | When enqueueing a packet, CBQ starts at the root and uses various methods to |
| 123 | determine which class should receive the data. If a verdict is reached, this |
| 124 | process is repeated for the recipient class which might have further |
| 125 | means of classifying traffic to its children, if any. |
| 126 | |
| 127 | CBQ has the following methods available to classify a packet to any child |
| 128 | classes. |
| 129 | .TP |
| 130 | (i) |
| 131 | .B skb->priority class encoding. |
| 132 | Can be set from userspace by an application with the |
| 133 | .B SO_PRIORITY |
| 134 | setsockopt. |
| 135 | The |
| 136 | .B skb->priority class encoding |
| 137 | only applies if the skb->priority holds a major:minor handle of an existing |
| 138 | class within this qdisc. |
| 139 | .TP |
| 140 | (ii) |
| 141 | tc filters attached to the class. |
| 142 | .TP |
| 143 | (iii) |
| 144 | The defmap of a class, as set with the |
| 145 | .B split & defmap |
| 146 | parameters. The defmap may contain instructions for each possible Linux packet |
| 147 | priority. |
| 148 | |
| 149 | .P |
| 150 | Each class also has a |
| 151 | .B level. |
| 152 | Leaf nodes, attached to the bottom of the class hierarchy, have a level of 0. |
| 153 | .SH CLASSIFICATION ALGORITHM |
| 154 | |
| 155 | Classification is a loop, which terminates when a leaf class is found. At any |
| 156 | point the loop may jump to the fallback algorithm. |
| 157 | |
| 158 | The loop consists of the following steps: |
| 159 | .TP |
| 160 | (i) |
| 161 | If the packet is generated locally and has a valid classid encoded within its |
| 162 | .B skb->priority, |
| 163 | choose it and terminate. |
| 164 | |
| 165 | .TP |
| 166 | (ii) |
| 167 | Consult the tc filters, if any, attached to this child. If these return |
| 168 | a class which is not a leaf class, restart loop from the class returned. |
| 169 | If it is a leaf, choose it and terminate. |
| 170 | .TP |
| 171 | (iii) |
| 172 | If the tc filters did not return a class, but did return a classid, |
| 173 | try to find a class with that id within this qdisc. |
| 174 | Check if the found class is of a lower |
| 175 | .B level |
| 176 | than the current class. If so, and the returned class is not a leaf node, |
| 177 | restart the loop at the found class. If it is a leaf node, terminate. |
| 178 | If we found an upward reference to a higher level, enter the fallback |
| 179 | algorithm. |
| 180 | .TP |
| 181 | (iv) |
| 182 | If the tc filters did not return a class, nor a valid reference to one, |
| 183 | consider the minor number of the reference to be the priority. Retrieve |
| 184 | a class from the defmap of this class for the priority. If this did not |
| 185 | contain a class, consult the defmap of this class for the |
| 186 | .B BEST_EFFORT |
| 187 | class. If this is an upward reference, or no |
| 188 | .B BEST_EFFORT |
| 189 | class was defined, |
| 190 | enter the fallback algorithm. If a valid class was found, and it is not a |
| 191 | leaf node, restart the loop at this class. If it is a leaf, choose it and |
| 192 | terminate. If |
| 193 | neither the priority distilled from the classid, nor the |
| 194 | .B BEST_EFFORT |
| 195 | priority yielded a class, enter the fallback algorithm. |
| 196 | .P |
| 197 | The fallback algorithm resides outside of the loop and is as follows. |
| 198 | .TP |
| 199 | (i) |
| 200 | Consult the defmap of the class at which the jump to fallback occured. If |
| 201 | the defmap contains a class for the |
| 202 | .B |
| 203 | priority |
| 204 | of the class (which is related to the TOS field), choose this class and |
| 205 | terminate. |
| 206 | .TP |
| 207 | (ii) |
| 208 | Consult the map for a class for the |
| 209 | .B BEST_EFFORT |
| 210 | priority. If found, choose it, and terminate. |
| 211 | .TP |
| 212 | (iii) |
| 213 | Choose the class at which break out to the fallback algorithm occurred. Terminate. |
| 214 | .P |
| 215 | The packet is enqueued to the class which was chosen when either algorithm |
| 216 | terminated. It is therefore possible for a packet to be enqueued *not* at a |
| 217 | leaf node, but in the middle of the hierarchy. |
| 218 | |
| 219 | .SH LINK SHARING ALGORITHM |
| 220 | When dequeuing for sending to the network device, CBQ decides which of its |
| 221 | classes will be allowed to send. It does so with a Weighted Round Robin process |
| 222 | in which each class with packets gets a chance to send in turn. The WRR process |
| 223 | starts by asking the highest priority classes (lowest numerically - |
| 224 | highest semantically) for packets, and will continue to do so until they |
| 225 | have no more data to offer, in which case the process repeats for lower |
| 226 | priorities. |
| 227 | |
| 228 | .B CERTAINTY ENDS HERE, ANK PLEASE HELP |
| 229 | |
| 230 | Each class is not allowed to send at length though - they can only dequeue a |
| 231 | configurable amount of data during each round. |
| 232 | |
| 233 | If a class is about to go overlimit, and it is not |
| 234 | .B bounded |
| 235 | it will try to borrow avgidle from siblings that are not |
| 236 | .B isolated. |
| 237 | This process is repeated from the bottom upwards. If a class is unable |
| 238 | to borrow enough avgidle to send a packet, it is throttled and not asked |
| 239 | for a packet for enough time for the avgidle to increase above zero. |
| 240 | |
| 241 | .B I REALLY NEED HELP FIGURING THIS OUT. REST OF DOCUMENT IS PRETTY CERTAIN |
| 242 | .B AGAIN. |
| 243 | |
| 244 | .SH QDISC |
| 245 | The root qdisc of a CBQ class tree has the following parameters: |
| 246 | |
| 247 | .TP |
| 248 | parent major:minor | root |
| 249 | This mandatory parameter determines the place of the CBQ instance, either at the |
| 250 | .B root |
| 251 | of an interface or within an existing class. |
| 252 | .TP |
| 253 | handle major: |
| 254 | Like all other qdiscs, the CBQ can be assigned a handle. Should consist only |
| 255 | of a major number, followed by a colon. Optional. |
| 256 | .TP |
| 257 | avpkt bytes |
| 258 | For calculations, the average packet size must be known. It is silently capped |
| 259 | at a minimum of 2/3 of the interface MTU. Mandatory. |
| 260 | .TP |
| 261 | bandwidth rate |
| 262 | To determine the idle time, CBQ must know the bandwidth of your underlying |
| 263 | physical interface, or parent qdisc. This is a vital parameter, more about it |
| 264 | later. Mandatory. |
| 265 | .TP |
| 266 | cell |
| 267 | The cell size determines he granularity of packet transmission time calculations. Has a sensible default. |
| 268 | .TP |
| 269 | mpu |
| 270 | A zero sized packet may still take time to transmit. This value is the lower |
| 271 | cap for packet transmission time calculations - packets smaller than this value |
| 272 | are still deemed to have this size. Defaults to zero. |
| 273 | .TP |
| 274 | ewma log |
| 275 | When CBQ needs to measure the average idle time, it does so using an |
| 276 | Exponentially Weighted Moving Average which smoothes out measurements into |
| 277 | a moving average. The EWMA LOG determines how much smoothing occurs. Defaults |
| 278 | to 5. Lower values imply greater sensitivity. Must be between 0 and 31. |
| 279 | .P |
| 280 | A CBQ qdisc does not shape out of its own accord. It only needs to know certain |
| 281 | parameters about the underlying link. Actual shaping is done in classes. |
| 282 | |
| 283 | .SH CLASSES |
| 284 | Classes have a host of parameters to configure their operation. |
| 285 | |
| 286 | .TP |
| 287 | parent major:minor |
| 288 | Place of this class within the hierarchy. If attached directly to a qdisc |
| 289 | and not to another class, minor can be omitted. Mandatory. |
| 290 | .TP |
| 291 | classid major:minor |
| 292 | Like qdiscs, classes can be named. The major number must be equal to the |
| 293 | major number of the qdisc to which it belongs. Optional, but needed if this |
| 294 | class is going to have children. |
| 295 | .TP |
| 296 | weight weight |
| 297 | When dequeuing to the interface, classes are tried for traffic in a |
| 298 | round-robin fashion. Classes with a higher configured qdisc will generally |
| 299 | have more traffic to offer during each round, so it makes sense to allow |
| 300 | it to dequeue more traffic. All weights under a class are normalized, so |
| 301 | only the ratios matter. Defaults to the configured rate, unless the priority |
| 302 | of this class is maximal, in which case it is set to 1. |
| 303 | .TP |
| 304 | allot bytes |
| 305 | Allot specifies how many bytes a qdisc can dequeue |
| 306 | during each round of the process. This parameter is weighted using the |
| 307 | renormalized class weight described above. |
| 308 | |
| 309 | .TP |
| 310 | priority priority |
| 311 | In the round-robin process, classes with the lowest priority field are tried |
| 312 | for packets first. Mandatory. |
| 313 | |
| 314 | .TP |
| 315 | rate rate |
| 316 | Maximum rate this class and all its children combined can send at. Mandatory. |
| 317 | |
| 318 | .TP |
| 319 | bandwidth rate |
| 320 | This is different from the bandwidth specified when creating a CBQ disc. Only |
| 321 | used to determine maxidle and offtime, which are only calculated when |
| 322 | specifying maxburst or minburst. Mandatory if specifying maxburst or minburst. |
| 323 | |
| 324 | .TP |
| 325 | maxburst |
| 326 | This number of packets is used to calculate maxidle so that when |
| 327 | avgidle is at maxidle, this number of average packets can be burst |
| 328 | before avgidle drops to 0. Set it higher to be more tolerant of |
| 329 | bursts. You can't set maxidle directly, only via this parameter. |
| 330 | |
| 331 | .TP |
| 332 | minburst |
| 333 | As mentioned before, CBQ needs to throttle in case of |
| 334 | overlimit. The ideal solution is to do so for exactly the calculated |
| 335 | idle time, and pass 1 packet. However, Unix kernels generally have a |
| 336 | hard time scheduling events shorter than 10ms, so it is better to |
| 337 | throttle for a longer period, and then pass minburst packets in one |
| 338 | go, and then sleep minburst times longer. |
| 339 | |
| 340 | The time to wait is called the offtime. Higher values of minburst lead |
| 341 | to more accurate shaping in the long term, but to bigger bursts at |
| 342 | millisecond timescales. |
| 343 | |
| 344 | .TP |
| 345 | minidle |
| 346 | If avgidle is below 0, we are overlimits and need to wait until |
| 347 | avgidle will be big enough to send one packet. To prevent a sudden |
| 348 | burst from shutting down the link for a prolonged period of time, |
| 349 | avgidle is reset to minidle if it gets too low. |
| 350 | |
| 351 | Minidle is specified in negative microseconds, so 10 means that |
| 352 | avgidle is capped at -10us. |
| 353 | |
| 354 | .TP |
| 355 | bounded |
| 356 | Signifies that this class will not borrow bandwidth from its siblings. |
| 357 | .TP |
| 358 | isolated |
| 359 | Means that this class will not borrow bandwidth to its siblings |
| 360 | |
| 361 | .TP |
| 362 | split major:minor & defmap bitmap[/bitmap] |
| 363 | If consulting filters attached to a class did not give a verdict, |
| 364 | CBQ can also classify based on the packet's priority. There are 16 |
| 365 | priorities available, numbered from 0 to 15. |
| 366 | |
| 367 | The defmap specifies which priorities this class wants to receive, |
| 368 | specified as a bitmap. The Least Significant Bit corresponds to priority |
| 369 | zero. The |
| 370 | .B split |
| 371 | parameter tells CBQ at which class the decision must be made, which should |
| 372 | be a (grand)parent of the class you are adding. |
| 373 | |
| 374 | As an example, 'tc class add ... classid 10:1 cbq .. split 10:0 defmap c0' |
| 375 | configures class 10:0 to send packets with priorities 6 and 7 to 10:1. |
| 376 | |
| 377 | The complimentary configuration would then |
| 378 | be: 'tc class add ... classid 10:2 cbq ... split 10:0 defmap 3f' |
| 379 | Which would send all packets 0, 1, 2, 3, 4 and 5 to 10:1. |
| 380 | .TP |
| 381 | estimator interval timeconstant |
| 382 | CBQ can measure how much bandwidth each class is using, which tc filters |
| 383 | can use to classify packets with. In order to determine the bandwidth |
| 384 | it uses a very simple estimator that measures once every |
| 385 | .B interval |
| 386 | microseconds how much traffic has passed. This again is a EWMA, for which |
| 387 | the time constant can be specified, also in microseconds. The |
| 388 | .B time constant |
| 389 | corresponds to the sluggishness of the measurement or, conversely, to the |
| 390 | sensitivity of the average to short bursts. Higher values mean less |
| 391 | sensitivity. |
| 392 | |
| 393 | |
| 394 | |
| 395 | .SH SOURCES |
| 396 | .TP |
| 397 | o |
| 398 | Sally Floyd and Van Jacobson, "Link-sharing and Resource |
| 399 | Management Models for Packet Networks", |
| 400 | IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995 |
| 401 | |
| 402 | .TP |
| 403 | o |
| 404 | Sally Floyd, "Notes on CBQ and Guarantee Service", 1995 |
| 405 | |
| 406 | .TP |
| 407 | o |
| 408 | Sally Floyd, "Notes on Class-Based Queueing: Setting |
| 409 | Parameters", 1996 |
| 410 | |
| 411 | .TP |
| 412 | o |
| 413 | Sally Floyd and Michael Speer, "Experimental Results |
| 414 | for Class-Based Queueing", 1998, not published. |
| 415 | |
| 416 | |
| 417 | |
| 418 | .SH SEE ALSO |
| 419 | .BR tc (8) |
| 420 | |
| 421 | .SH AUTHOR |
| 422 | Alexey N. Kuznetsov, <kuznet@ms2.inr.ac.ru>. This manpage maintained by |
| 423 | bert hubert <ahu@ds9a.nl> |
| 424 | |
| 425 | |