lh | 9ed821d | 2023-04-07 01:36:19 -0700 | [diff] [blame^] | 1 | .TH "TC\-HFSC" 7 "31 October 2011" iproute2 Linux |
| 2 | .SH "NAME" |
| 3 | tc-hfcs \- Hierarchical Fair Service Curve |
| 4 | . |
| 5 | .SH "HISTORY & INTRODUCTION" |
| 6 | . |
| 7 | HFSC \- \fBHierarchical Fair Service Curve\fR was first presented at |
| 8 | SIGCOMM'97. Developed as a part of ALTQ (ALTernative Queuing) on NetBSD, found |
| 9 | its way quickly to other BSD systems, and then a few years ago became part of |
| 10 | the linux kernel. Still, it's not the most popular scheduling algorithm \- |
| 11 | especially if compared to HTB \- and it's not well documented from enduser's |
| 12 | perspective. This introduction aims to explain how HFSC works without |
| 13 | going to deep into math side of things (although some if it will be |
| 14 | inevitable). |
| 15 | |
| 16 | In short HFSC aims to: |
| 17 | . |
| 18 | .RS 4 |
| 19 | .IP \fB1)\fR 4 |
| 20 | guarantee precise bandwidth and delay allocation for all leaf classes (realtime |
| 21 | criterion) |
| 22 | .IP \fB2)\fR |
| 23 | allocate excess bandwidth fairly as specified by class hierarchy (linkshare & |
| 24 | upperlimit criterion) |
| 25 | .IP \fB3)\fR |
| 26 | minimize any discrepancy between the service curve and the actual amount of |
| 27 | service provided during linksharing |
| 28 | .RE |
| 29 | .PP |
| 30 | . |
| 31 | The main "selling" point of HFSC is feature \fB(1)\fR, which is achieved by |
| 32 | using nonlinear service curves (more about what it actually is later). This is |
| 33 | particularly useful in VoIP or games, where not only guarantee of consistent |
| 34 | bandwidth is important, but initial delay of a data stream as well. Note that |
| 35 | it matters only for leaf classes (where the actual queues are) \- thus class |
| 36 | hierarchy is ignored in realtime case. |
| 37 | |
| 38 | Feature \fB(2)\fR is well, obvious \- any algorithm featuring class hierarchy |
| 39 | (such as HTB or CBQ) strives to achieve that. HFSC does that well, although |
| 40 | you might end with unusual situations, if you define service curves carelessly |
| 41 | \- see section CORNER CASES for examples. |
| 42 | |
| 43 | Feature \fB(3)\fR is mentioned due to the nature of the problem. There may be |
| 44 | situations where it's either not possible to guarantee service of all curves at |
| 45 | the same time, and/or it's impossible to do so fairly. Both will be explained |
| 46 | later. Note that this is mainly related to interior (aka aggregate) classes, as |
| 47 | the leafs are already handled by \fB(1)\fR. Still \- it's perfectly possible to |
| 48 | create a leaf class w/o realtime service, and in such case \- the caveats will |
| 49 | naturally extend to leaf classes as well. |
| 50 | |
| 51 | .SH ABBREVIATIONS |
| 52 | For the remaining part of the document, we'll use following shortcuts: |
| 53 | .nf |
| 54 | .RS 4 |
| 55 | |
| 56 | RT \- realtime |
| 57 | LS \- linkshare |
| 58 | UL \- upperlimit |
| 59 | SC \- service curve |
| 60 | .fi |
| 61 | . |
| 62 | .SH "BASICS OF HFSC" |
| 63 | . |
| 64 | To understand how HFSC works, we must first introduce a service curve. |
| 65 | Overall, it's a nondecreasing function of some time unit, returning amount of |
| 66 | service (allowed or allocated amount of bandwidth) by some specific point in |
| 67 | time. The purpose of it should be subconsciously obvious \- if a class was |
| 68 | allowed to transfer not less than the amount specified by its service curve \- |
| 69 | then service curve is not violated. |
| 70 | |
| 71 | Still \- we need more elaborate criterion than just the above (although in |
| 72 | most generic case it can be reduced to it). The criterion has to take two |
| 73 | things into account: |
| 74 | . |
| 75 | .RS 4 |
| 76 | .IP \(bu 4 |
| 77 | idling periods |
| 78 | .IP \(bu |
| 79 | ability to "look back", so if during current active period service curve is violated, maybe it |
| 80 | isn't if we count excess bandwidth received during earlier active period(s) |
| 81 | .RE |
| 82 | .PP |
| 83 | Let's define the criterion as follows: |
| 84 | .RS 4 |
| 85 | .nf |
| 86 | .IP "\fB(1)\fR" 4 |
| 87 | For each t1, there must exist t0 in set B, so S(t1\-t0)\~<=\~w(t0,t1) |
| 88 | .fi |
| 89 | .RE |
| 90 | . |
| 91 | .PP |
| 92 | Here 'w' denotes the amount of service received during some time period between t0 |
| 93 | and t1. B is a set of all times, where a session becomes active after idling |
| 94 | period (further denoted as 'becoming backlogged'). For a clearer picture, |
| 95 | imagine two situations: |
| 96 | . |
| 97 | .RS 4 |
| 98 | .IP \fBa)\fR 4 |
| 99 | our session was active during two periods, with a small time gap between them |
| 100 | .IP \fBb)\fR |
| 101 | as in (a), but with a larger gap |
| 102 | .RE |
| 103 | . |
| 104 | .PP |
| 105 | Consider \fB(a)\fR \- if the service received during both periods meets |
| 106 | \fB(1)\fR, then all is good. But what if it doesn't do so during the 2nd |
| 107 | period ? If the amount of service received during the 1st period is bigger |
| 108 | than the service curve, then it might compensate for smaller service during |
| 109 | the 2nd period \fIand\fR the gap \- if the gap is small enough. |
| 110 | |
| 111 | If the gap is larger \fB(b)\fR \- then it's less likely to happen (unless the |
| 112 | excess bandwidth allocated during the 1st part was really large). Still, the |
| 113 | larger the gap \- the less interesting is what happened in the past (e.g. 10 |
| 114 | minutes ago) \- what matters is the current traffic that just started. |
| 115 | |
| 116 | From HFSC's perspective, more interesting is answering the following question: |
| 117 | when should we start transferring packets, so a service curve of a class is not |
| 118 | violated. Or rephrasing it: How much X() amount of service should a session |
| 119 | receive by time t, so the service curve is not violated. Function X() defined |
| 120 | as below is the basic building block of HFSC, used in: eligible, deadline, |
| 121 | virtual\-time and fit\-time curves. Of course, X() is based on equation |
| 122 | \fB(1)\fR and is defined recursively: |
| 123 | |
| 124 | .RS 4 |
| 125 | .IP \(bu 4 |
| 126 | At the 1st backlogged period beginning function X is initialized to generic |
| 127 | service curve assigned to a class |
| 128 | .IP \(bu |
| 129 | At any subsequent backlogged period, X() is: |
| 130 | .nf |
| 131 | \fBmin(X() from previous period ; w(t0)+S(t\-t0) for t>=t0),\fR |
| 132 | .fi |
| 133 | \&... where t0 denotes the beginning of the current backlogged period. |
| 134 | .RE |
| 135 | . |
| 136 | .PP |
| 137 | HFSC uses either linear, or two\-piece linear service curves. In case of |
| 138 | linear or two\-piece linear convex functions (first slope < second slope), |
| 139 | min() in X's definition reduces to the 2nd argument. But in case of two\-piece |
| 140 | concave functions, the 1st argument might quickly become lesser for some |
| 141 | t>=t0. Note, that for some backlogged period, X() is defined only from that |
| 142 | period's beginning. We also define X^(\-1)(w) as smallest t>=t0, for which |
| 143 | X(t)\~=\~w. We have to define it this way, as X() is usually not an injection. |
| 144 | |
| 145 | The above generic X() can be one of the following: |
| 146 | . |
| 147 | .RS 4 |
| 148 | .IP "E()" 4 |
| 149 | In realtime criterion, selects packets eligible for sending. If none are |
| 150 | eligible, HFSC will use linkshare criterion. Eligible time \&'et' is calculated |
| 151 | with reference to packets' heads ( et\~=\~E^(\-1)(w) ). It's based on RT |
| 152 | service curve, \fIbut in case of a convex curve, uses its 2nd slope only.\fR |
| 153 | .IP "D()" |
| 154 | In realtime criterion, selects the most suitable packet from the ones chosen |
| 155 | by E(). Deadline time \&'dt' corresponds to packets' tails |
| 156 | (dt\~=\~D^(\-1)(w+l), where \&'l' is packet's length). Based on RT service |
| 157 | curve. |
| 158 | .IP "V()" |
| 159 | In linkshare criterion, arbitrates which packet to send next. Note that V() is |
| 160 | function of a virtual time \- see \fBLINKSHARE CRITERION\fR section for |
| 161 | details. Virtual time \&'vt' corresponds to packets' heads |
| 162 | (vt\~=\~V^(\-1)(w)). Based on LS service curve. |
| 163 | .IP "F()" |
| 164 | An extension to linkshare criterion, used to limit at which speed linkshare |
| 165 | criterion is allowed to dequeue. Fit\-time 'ft' corresponds to packets' heads |
| 166 | as well (ft\~=\~F^(\-1)(w)). Based on UL service curve. |
| 167 | .RE |
| 168 | |
| 169 | Be sure to make clean distinction between session's RT, LS and UL service |
| 170 | curves and the above "utility" functions. |
| 171 | . |
| 172 | .SH "REALTIME CRITERION" |
| 173 | . |
| 174 | RT criterion \fIignores class hierarchy\fR and guarantees precise bandwidth and |
| 175 | delay allocation. We say that packet is eligible for sending, when current real |
| 176 | time is bigger than eligible time. From all packets eligible, the one most |
| 177 | suited for sending, is the one with the smallest deadline time. Sounds simply, |
| 178 | but consider following example: |
| 179 | |
| 180 | Interface 10mbit, two classes, both with two\-piece linear service curves: |
| 181 | .RS 4 |
| 182 | .IP \(bu 4 |
| 183 | 1st class \- 2mbit for 100ms, then 7mbit (convex \- 1st slope < 2nd slope) |
| 184 | .IP \(bu |
| 185 | 2nd class \- 7mbit for 100ms, then 2mbit (concave \- 1st slope > 2nd slope) |
| 186 | .RE |
| 187 | .PP |
| 188 | Assume for a moment, that we only use D() for both finding eligible packets, |
| 189 | and choosing the most fitting one, thus eligible time would be computed as |
| 190 | D^(\-1)(w) and deadline time would be computed as D^(\-1)(w+l). If the 2nd |
| 191 | class starts sending packets 1 second after the 1st class, it's of course |
| 192 | impossible to guarantee 14mbit, as the interface capability is only 10mbit. |
| 193 | The only workaround in this scenario is to allow the 1st class to send the |
| 194 | packets earlier that would normally be allowed. That's where separate E() comes |
| 195 | to help. Putting all the math aside (see HFSC paper for details), E() for RT |
| 196 | concave service curve is just like D(), but for the RT convex service curve \- |
| 197 | it's constructed using \fIonly\fR RT service curve's 2nd slope (in our example |
| 198 | \- 7mbit). |
| 199 | |
| 200 | The effect of such E() \- packets will be sent earlier, and at the same time |
| 201 | D() \fIwill\fR be updated \- so current deadline time calculated from it will |
| 202 | be bigger. Thus, when the 2nd class starts sending packets later, both the 1st |
| 203 | and the 2nd class will be eligible, but the 2nd session's deadline time will be |
| 204 | smaller and its packets will be sent first. When the 1st class becomes idle at |
| 205 | some later point, the 2nd class will be able to "buffer" up again for later |
| 206 | active period of the 1st class. |
| 207 | |
| 208 | A short remark \- in a situation, where the total amount of bandwidth |
| 209 | available on the interface is bigger than the allocated total realtime parts |
| 210 | (imagine interface 10 mbit, but 1mbit/2mbit and 2mbit/1mbit classes), the sole |
| 211 | speed of the interface could suffice to guarantee the times. |
| 212 | |
| 213 | Important part of RT criterion is that apart from updating its D() and E(), |
| 214 | also V() used by LS criterion is updated. Generally the RT criterion is |
| 215 | secondary to LS one, and used \fIonly\fR if there's a risk of violating precise |
| 216 | realtime requirements. Still, the "participation" in bandwidth distributed by |
| 217 | LS criterion is there, so V() has to be updated along the way. LS criterion can |
| 218 | than properly compensate for non\-ideal fair sharing situation, caused by RT |
| 219 | scheduling. If you use UL service curve its F() will be updated as well (UL |
| 220 | service curve is an extension to LS one \- see \fBUPPERLIMIT CRITERION\fR |
| 221 | section). |
| 222 | |
| 223 | Anyway \- careless specification of LS and RT service curves can lead to |
| 224 | potentially undesired situations (see CORNER CASES for examples). This wasn't |
| 225 | the case in HFSC paper where LS and RT service curves couldn't be specified |
| 226 | separately. |
| 227 | |
| 228 | .SH "LINKSHARING CRITERION" |
| 229 | . |
| 230 | LS criterion's task is to distribute bandwidth according to specified class |
| 231 | hierarchy. Contrary to RT criterion, there're no comparisons between current |
| 232 | real time and virtual time \- the decision is based solely on direct comparison |
| 233 | of virtual times of all active subclasses \- the one with the smallest vt wins |
| 234 | and gets scheduled. One immediate conclusion from this fact is that absolute |
| 235 | values don't matter \- only ratios between them (so for example, two children |
| 236 | classes with simple linear 1mbit service curves will get the same treatment |
| 237 | from LS criterion's perspective, as if they were 5mbit). The other conclusion |
| 238 | is, that in perfectly fluid system with linear curves, all virtual times across |
| 239 | whole class hierarchy would be equal. |
| 240 | |
| 241 | Why is VC defined in term of virtual time (and what is it) ? |
| 242 | |
| 243 | Imagine an example: class A with two children \- A1 and A2, both with let's say |
| 244 | 10mbit SCs. If A2 is idle, A1 receives all the bandwidth of A (and update its |
| 245 | V() in the process). When A2 becomes active, A1's virtual time is already |
| 246 | \fIfar\fR bigger than A2's one. Considering the type of decision made by LS |
| 247 | criterion, A1 would become idle for a lot of time. We can workaround this |
| 248 | situation by adjusting virtual time of the class becoming active \- we do that |
| 249 | by getting such time "up to date". HFSC uses a mean of the smallest and the |
| 250 | biggest virtual time of currently active children fit for sending. As it's not |
| 251 | real time anymore (excluding trivial case of situation where all classes become |
| 252 | active at the same time, and never become idle), it's called virtual time. |
| 253 | |
| 254 | Such approach has its price though. The problem is analogous to what was |
| 255 | presented in previous section and is caused by non\-linearity of service |
| 256 | curves: |
| 257 | .IP 1) 4 |
| 258 | either it's impossible to guarantee service curves and satisfy fairness |
| 259 | during certain time periods: |
| 260 | |
| 261 | .RS 4 |
| 262 | Recall the example from RT section, slightly modified (with 3mbit slopes |
| 263 | instead of 2mbit ones): |
| 264 | |
| 265 | .IP \(bu 4 |
| 266 | 1st class \- 3mbit for 100ms, then 7mbit (convex \- 1st slope < 2nd slope) |
| 267 | .IP \(bu |
| 268 | 2nd class \- 7mbit for 100ms, then 3mbit (concave \- 1st slope > 2nd slope) |
| 269 | |
| 270 | .PP |
| 271 | They sum up nicely to 10mbit \- interface's capacity. But if we wanted to only |
| 272 | use LS for guarantees and fairness \- it simply won't work. In LS context, |
| 273 | only V() is used for making decision which class to schedule. If the 2nd class |
| 274 | becomes active when the 1st one is in its second slope, the fairness will be |
| 275 | preserved \- ratio will be 1:1 (7mbit:7mbit), but LS itself is of course |
| 276 | unable to guarantee the absolute values themselves \- as it would have to go |
| 277 | beyond of what the interface is capable of. |
| 278 | .RE |
| 279 | |
| 280 | .IP 2) 4 |
| 281 | and/or it's impossible to guarantee service curves of all classes at the same |
| 282 | time [fairly or not]: |
| 283 | |
| 284 | .RS 4 |
| 285 | |
| 286 | This is similar to the above case, but a bit more subtle. We will consider two |
| 287 | subtrees, arbitrated by their common (root here) parent: |
| 288 | |
| 289 | .nf |
| 290 | R (root) -\ 10mbit |
| 291 | |
| 292 | A \- 7mbit, then 3mbit |
| 293 | A1 \- 5mbit, then 2mbit |
| 294 | A2 \- 2mbit, then 1mbit |
| 295 | |
| 296 | B \- 3mbit, then 7mbit |
| 297 | .fi |
| 298 | |
| 299 | R arbitrates between left subtree (A) and right (B). Assume that A2 and B are |
| 300 | constantly backlogged, and at some later point A1 becomes backlogged (when all |
| 301 | other classes are in their 2nd linear part). |
| 302 | |
| 303 | What happens now ? B (choice made by R) will \fIalways\fR get 7 mbit as R is |
| 304 | only (obviously) concerned with the ratio between its direct children. Thus A |
| 305 | subtree gets 3mbit, but its children would want (at the point when A1 became |
| 306 | backlogged) 5mbit + 1mbit. That's of course impossible, as they can only get |
| 307 | 3mbit due to interface limitation. |
| 308 | |
| 309 | In the left subtree \- we have the same situation as previously (fair split |
| 310 | between A1 and A2, but violated guarantees), but in the whole tree \- there's |
| 311 | no fairness (B got 7mbit, but A1 and A2 have to fit together in 3mbit) and |
| 312 | there's no guarantees for all classes (only B got what it wanted). Even if we |
| 313 | violated fairness in the A subtree and set A2's service curve to 0, A1 would |
| 314 | still not get the required bandwidth. |
| 315 | .RE |
| 316 | . |
| 317 | .SH "UPPERLIMIT CRITERION" |
| 318 | . |
| 319 | UL criterion is an extensions to LS one, that permits sending packets only |
| 320 | if current real time is bigger than fit\-time ('ft'). So the modified LS |
| 321 | criterion becomes: choose the smallest virtual time from all active children, |
| 322 | such that fit\-time < current real time also holds. Fit\-time is calculated |
| 323 | from F(), which is based on UL service curve. As you can see, it's role is |
| 324 | kinda similar to E() used in RT criterion. Also, for obvious reasons \- you |
| 325 | can't specify UL service curve without LS one. |
| 326 | |
| 327 | Main purpose of UL service curve is to limit HFSC to bandwidth available on the |
| 328 | upstream router (think adsl home modem/router, and linux server as |
| 329 | nat/firewall/etc. with 100mbit+ connection to mentioned modem/router). |
| 330 | Typically, it's used to create a single class directly under root, setting |
| 331 | linear UL service curve to available bandwidth \- and then creating your class |
| 332 | structure from that class downwards. Of course, you're free to add UL service |
| 333 | (linear or not) curve to any class with LS criterion. |
| 334 | |
| 335 | Important part about UL service curve is, that whenever at some point in time |
| 336 | a class doesn't qualify for linksharing due to its fit\-time, the next time it |
| 337 | does qualify, it will update its virtual time to the smallest virtual time of |
| 338 | all active children fit for linksharing. This way, one of the main things LS |
| 339 | criterion tries to achieve \- equality of all virtual times across whole |
| 340 | hierarchy \- is preserved (in perfectly fluid system with only linear curves, |
| 341 | all virtual times would be equal). |
| 342 | |
| 343 | Without that, 'vt' would lag behind other virtual times, and could cause |
| 344 | problems. Consider interface with capacity 10mbit, and following leaf classes |
| 345 | (just in case you're skipping this text quickly \- this example shows behavior |
| 346 | that \f(BIdoesn't happen\fR): |
| 347 | |
| 348 | .nf |
| 349 | A \- ls 5.0mbit |
| 350 | B \- ls 2.5mbit |
| 351 | C \- ls 2.5mbit, ul 2.5mbit |
| 352 | .fi |
| 353 | |
| 354 | If B was idle, while A and C were constantly backlogged, they would normally |
| 355 | (as far as LS criterion is concerned) divide bandwidth in 2:1 ratio. But due |
| 356 | to UL service curve in place, C would get at most 2.5mbit, and A would get the |
| 357 | remaining 7.5mbit. The longer the backlogged period, the more virtual times of |
| 358 | A and C would drift apart. If B became backlogged at some later point in time, |
| 359 | its virtual time would be set to (A's\~vt\~+\~C's\~vt)/2, thus blocking A from |
| 360 | sending any traffic, until B's virtual time catches up with A. |
| 361 | . |
| 362 | .SH "SEPARATE LS / RT SCs" |
| 363 | . |
| 364 | Another difference from original HFSC paper, is that RT and LS SCs can be |
| 365 | specified separately. Moreover \- leaf classes are allowed to have only either |
| 366 | RT SC or LS SC. For interior classes, only LS SCs make sense \- Any RT SC will |
| 367 | be ignored. |
| 368 | . |
| 369 | .SH "CORNER CASES" |
| 370 | . |
| 371 | Separate service curves for LS and RT criteria can lead to certain traps, |
| 372 | that come from "fighting" between ideal linksharing and enforced realtime |
| 373 | guarantees. Those situations didn't exist in original HFSC paper, where |
| 374 | specifying separate LS / RT service curves was not discussed. |
| 375 | |
| 376 | Consider interface with capacity 10mbit, with following leaf classes: |
| 377 | |
| 378 | .nf |
| 379 | A \- ls 5.0mbit, rt 8mbit |
| 380 | B \- ls 2.5mbit |
| 381 | C \- ls 2.5mbit |
| 382 | .fi |
| 383 | |
| 384 | Imagine A and C are constantly backlogged. As B is idle, A and C would divide |
| 385 | bandwidth in 2:1 ratio, considering LS service curve (so in theory \- 6.66 and |
| 386 | 3.33). Alas RT criterion takes priority, so A will get 8mbit and LS will be |
| 387 | able to compensate class C for only 2 mbit \- this will cause discrepancy |
| 388 | between virtual times of A and C. |
| 389 | |
| 390 | Assume this situation lasts for a lot of time with no idle periods, and |
| 391 | suddenly B becomes active. B's virtual time will be updated to |
| 392 | (A's\~vt\~+\~C's\~vt)/2, effectively landing in the middle between A's and C's |
| 393 | virtual time. The effect \- B, having no RT guarantees, will be punished and |
| 394 | will not be allowed to transfer until C's virtual time catches up. |
| 395 | |
| 396 | If the interface had higher capacity \- for example 100mbit, this example |
| 397 | would behave perfectly fine though. |
| 398 | |
| 399 | Let's look a bit closer at the above example \- it "cleverly" invalidates one |
| 400 | of the basic things LS criterion tries to achieve \- equality of all virtual |
| 401 | times across class hierarchy. Leaf classes without RT service curves are |
| 402 | literally left to their own fate (governed by messed up virtual times). |
| 403 | |
| 404 | Also - it doesn't make much sense. Class A will always be guaranteed up to |
| 405 | 8mbit, and this is more than any absolute bandwidth that could happen from its |
| 406 | LS criterion (excluding trivial case of only A being active). If the bandwidth |
| 407 | taken by A is smaller than absolute value from LS criterion, the unused part |
| 408 | will be automatically assigned to other active classes (as A has idling periods |
| 409 | in such case). The only "advantage" is, that even in case of low bandwidth on |
| 410 | average, bursts would be handled at the speed defined by RT criterion. Still, |
| 411 | if extra speed is needed (e.g. due to latency), non linear service curves |
| 412 | should be used in such case. |
| 413 | |
| 414 | In the other words - LS criterion is meaningless in the above example. |
| 415 | |
| 416 | You can quickly "workaround" it by making sure each leaf class has RT service |
| 417 | curve assigned (thus guaranteeing all of them will get some bandwidth), but it |
| 418 | doesn't make it any more valid. |
| 419 | |
| 420 | Keep in mind - if you use nonlinear curves and irregularities explained above |
| 421 | happen \fIonly\fR in the first segment, then there's little wrong with |
| 422 | "overusing" RT curve a bit: |
| 423 | |
| 424 | .nf |
| 425 | A \- ls 5.0mbit, rt 9mbit/30ms, then 1mbit |
| 426 | B \- ls 2.5mbit |
| 427 | C \- ls 2.5mbit |
| 428 | .fi |
| 429 | |
| 430 | Here, the vt of A will "spike" in the initial period, but then A will never get more |
| 431 | than 1mbit, until B & C catch up. Then everything will be back to normal. |
| 432 | . |
| 433 | .SH "LINUX AND TIMER RESOLUTION" |
| 434 | . |
| 435 | In certain situations, the scheduler can throttle itself and setup so |
| 436 | called watchdog to wakeup dequeue function at some time later. In case of HFSC |
| 437 | it happens when for example no packet is eligible for scheduling, and UL |
| 438 | service curve is used to limit the speed at which LS criterion is allowed to |
| 439 | dequeue packets. It's called throttling, and accuracy of it is dependent on |
| 440 | how the kernel is compiled. |
| 441 | |
| 442 | There're 3 important options in modern kernels, as far as timers' resolution |
| 443 | goes: \&'tickless system', \&'high resolution timer support' and \&'timer |
| 444 | frequency'. |
| 445 | |
| 446 | If you have \&'tickless system' enabled, then the timer interrupt will trigger |
| 447 | as slowly as possible, but each time a scheduler throttles itself (or any |
| 448 | other part of the kernel needs better accuracy), the rate will be increased as |
| 449 | needed / possible. The ceiling is either \&'timer frequency' if \&'high |
| 450 | resolution timer support' is not available or not compiled in, or it's |
| 451 | hardware dependent and can go \fIfar\fR beyond the highest \&'timer frequency' |
| 452 | setting available. |
| 453 | |
| 454 | If \&'tickless system' is not enabled, the timer will trigger at a fixed rate |
| 455 | specified by \&'timer frequency' \- regardless if high resolution timers are |
| 456 | or aren't available. |
| 457 | |
| 458 | This is important to keep those settings in mind, as in scenario like: no |
| 459 | tickless, no HR timers, frequency set to 100hz \- throttling accuracy would be |
| 460 | at 10ms. It doesn't automatically mean you would be limited to ~0.8mbit/s |
| 461 | (assuming packets at ~1KB) \- as long as your queues are prepared to cover for |
| 462 | timer inaccuracy. Of course, in case of e.g. locally generated udp traffic \- |
| 463 | appropriate socket size is needed as well. Short example to make it more |
| 464 | understandable (assume hardcore anti\-schedule settings \- HZ=100, no HR |
| 465 | timers, no tickless): |
| 466 | |
| 467 | .nf |
| 468 | tc qdisc add dev eth0 root handle 1:0 hfsc default 1 |
| 469 | tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10mbit |
| 470 | .fi |
| 471 | |
| 472 | Assuming packet of ~1KB size and HZ=100, that averages to ~0.8mbit \- anything |
| 473 | beyond it (e.g. the above example with specified rate over 10x bigger) will |
| 474 | require appropriate queuing and cause bursts every ~10 ms. As you can |
| 475 | imagine, any HFSC's RT guarantees will be seriously invalidated by that. |
| 476 | Aforementioned example is mainly important if you deal with old hardware \- as |
| 477 | it's particularly popular for home server chores. Even then, you can easily |
| 478 | set HZ=1000 and have very accurate scheduling for typical adsl speeds. |
| 479 | |
| 480 | Anything modern (apic or even hpet msi based timers + \&'tickless system') |
| 481 | will provide enough accuracy for superb 1gbit scheduling. For example, on one |
| 482 | of basically cheap dual core AMD boards I have with following settings: |
| 483 | |
| 484 | .nf |
| 485 | tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1 |
| 486 | tc class add dev eth0 paretn 1:0 classid 1:1 hfsc rt m2 300mbit |
| 487 | .fi |
| 488 | |
| 489 | And simple: |
| 490 | |
| 491 | .nf |
| 492 | nc \-u dst.host.com 54321 </dev/zero |
| 493 | nc \-l \-p 54321 >/dev/null |
| 494 | .fi |
| 495 | |
| 496 | \&...will yield following effects over period of ~10 seconds (taken from |
| 497 | /proc/interrupts): |
| 498 | |
| 499 | .nf |
| 500 | 319: 42124229 0 HPET_MSI\-edge hpet2 (before) |
| 501 | 319: 42436214 0 HPET_MSI\-edge hpet2 (after 10s.) |
| 502 | .fi |
| 503 | |
| 504 | That's roughly 31000/s. Now compare it with HZ=1000 setting. The obvious |
| 505 | drawback of it is that cpu load can be rather extensive with servicing that |
| 506 | many timer interrupts. Example with 300mbit RT service curve on 1gbit link is |
| 507 | particularly ugly, as it requires a lot of throttling with minuscule delays. |
| 508 | |
| 509 | Also note that it's just an example showing capability of current hardware. |
| 510 | The above example (essentially 300mbit TBF emulator) is pointless on internal |
| 511 | interface to begin with \- you will pretty much always want regular LS service |
| 512 | curve there, and in such scenario HFSC simply doesn't throttle at all. |
| 513 | |
| 514 | 300mbit RT service curve (selected columns from mpstat \-P ALL 1): |
| 515 | |
| 516 | .nf |
| 517 | 10:56:43 PM CPU %sys %irq %soft %idle |
| 518 | 10:56:44 PM all 20.10 6.53 34.67 37.19 |
| 519 | 10:56:44 PM 0 35.00 0.00 63.00 0.00 |
| 520 | 10:56:44 PM 1 4.95 12.87 6.93 73.27 |
| 521 | .fi |
| 522 | |
| 523 | So, in rare case you need those speeds with only RT service curve, or with UL |
| 524 | service curve \- remember about drawbacks. |
| 525 | . |
| 526 | .SH "CAVEAT: RANDOM ONLINE EXAMPLES" |
| 527 | . |
| 528 | For reasons unknown (though well guessed), many examples you can google love to |
| 529 | overuse UL criterion and stuff it in every node possible. This makes no sense |
| 530 | and works against what HFSC tries to do (and does pretty damn well). Use UL |
| 531 | where it makes sense - on the uppermost node to match upstream router's uplink |
| 532 | capacity. Or - in special cases, such as testing (limit certain subtree to some |
| 533 | speed) or customers that must never get more than certain speed. In the last |
| 534 | case you can usually achieve the same by just using RT criterion without LS+UL |
| 535 | on leaf nodes. |
| 536 | |
| 537 | As for router case - remember it's good to differentiate between "traffic to |
| 538 | router" (remote console, web config, etc.) and "outgoing traffic", so for |
| 539 | example: |
| 540 | |
| 541 | .nf |
| 542 | tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002 |
| 543 | tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50mbit |
| 544 | tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2mbit ul m2 2mbit |
| 545 | .fi |
| 546 | |
| 547 | \&... so "internet" tree under 1:1 and "router itself" as 1:999 |
| 548 | . |
| 549 | .SH "LAYER2 ADAPTATION" |
| 550 | . |
| 551 | Please refer to \fBtc\-stab\fR(8) |
| 552 | . |
| 553 | .SH "SEE ALSO" |
| 554 | . |
| 555 | \fBtc\fR(8), \fBtc\-hfsc\fR(8), \fBtc\-stab\fR(8) |
| 556 | |
| 557 | Please direct bugreports and patches to: <net...@vger.kernel.org> |
| 558 | . |
| 559 | .SH "AUTHOR" |
| 560 | . |
| 561 | Manpage created by Michal Soltys (sol...@ziu.info) |