TCP protocol
============

Last updated: 3 June 2017

Contents
========

- Congestion control
- How the new TCP output machine [nyi] works

Congestion control
==================

The following variables are used in the tcp_sock for congestion control:
snd_cwnd		The size of the congestion window.
snd_ssthresh		Slow start threshold. We are in slow start if
			snd_cwnd is less than this.
snd_cwnd_cnt		A counter used to slow down the rate of increase
			once we exceed the slow start threshold.
snd_cwnd_clamp		The maximum size that snd_cwnd can grow to.
snd_cwnd_stamp		Timestamp of when the congestion window was last
			validated.
snd_cwnd_used		Used as a high-water mark for how much of the
			congestion window is in use. It is used to adjust
			snd_cwnd down when the link is limited by the
			application rather than the network.

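To make the roles of these variables concrete, here is a minimal userspace
sketch of Reno-style window growth. It is an illustration only, not the
kernel's implementation: struct cwnd_state, in_slow_start() and on_ack() are
hypothetical names, and each ACK is assumed to cover exactly one segment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the tcp_sock fields listed above. */
struct cwnd_state {
	uint32_t snd_cwnd;		/* congestion window, in segments */
	uint32_t snd_ssthresh;		/* slow start threshold */
	uint32_t snd_cwnd_cnt;		/* ACKs counted towards next increase */
	uint32_t snd_cwnd_clamp;	/* upper bound for snd_cwnd */
};

/* We are in slow start while snd_cwnd is below snd_ssthresh. */
static int in_slow_start(const struct cwnd_state *s)
{
	return s->snd_cwnd < s->snd_ssthresh;
}

/* Process one ACK covering one segment (simplified Reno behaviour). */
static void on_ack(struct cwnd_state *s)
{
	assert(s->snd_cwnd > 0);

	if (in_slow_start(s)) {
		s->snd_cwnd++;		/* +1 per ACK */
	} else if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
		s->snd_cwnd_cnt = 0;	/* a full window has been ACKed */
		s->snd_cwnd++;		/* +1 per window of ACKs */
	}
	if (s->snd_cwnd > s->snd_cwnd_clamp)
		s->snd_cwnd = s->snd_cwnd_clamp;
}
```

Below the threshold the window grows by one segment per ACK. Above it,
snd_cwnd_cnt has to count a full window of ACKs before the window grows by
one, and snd_cwnd_clamp caps the result in both phases.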
As of 2.6.13, Linux supports pluggable congestion control algorithms.
A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered by passing a tcp_congestion_ops struct to
tcp_register_congestion_control. At a minimum, the congestion control
mechanism must provide a valid name and must implement either the ssthresh,
cong_avoid and undo_cwnd hooks or the "omnipotent" cong_control hook.

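The minimum requirement can be expressed as a small check. The sketch below
is a userspace model, not kernel code: struct ca_ops is a hypothetical,
trimmed-down mirror of tcp_congestion_ops with simplified hook signatures,
and ca_ops_valid() and the demo_* hooks are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of tcp_congestion_ops: only the fields named in
 * the text, with simplified signatures. */
struct ca_ops {
	const char *name;
	unsigned int (*ssthresh)(void *sk);
	void (*cong_avoid)(void *sk, unsigned int ack, unsigned int acked);
	unsigned int (*undo_cwnd)(void *sk);
	void (*cong_control)(void *sk);
};

/* Returns 1 if ops meets the minimum described above, 0 otherwise. */
static int ca_ops_valid(const struct ca_ops *ops)
{
	assert(ops != NULL);

	if (!ops->name || ops->name[0] == '\0')
		return 0;
	if (ops->cong_control)
		return 1;
	return ops->ssthresh && ops->cong_avoid && ops->undo_cwnd;
}

/* Do-nothing hooks, used only to populate example structs. */
static unsigned int demo_ssthresh(void *sk) { (void)sk; return 2; }
static void demo_cong_avoid(void *sk, unsigned int ack, unsigned int acked)
{
	(void)sk; (void)ack; (void)acked;
}
static unsigned int demo_undo_cwnd(void *sk) { (void)sk; return 10; }
static void demo_cong_control(void *sk) { (void)sk; }
```

A struct that sets name together with either the three classic hooks or
cong_control alone passes the check. One that supplies only ssthresh, or no
name, does not.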
Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space. The space is preallocated, so it
is important to check that your private data fits in it. Alternatively, the
data can be allocated elsewhere and a pointer to it stored here.

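One way to make the size check hard to forget is to do it at compile time.
The following userspace sketch assumes a 64-byte private area. CA_PRIV_SIZE,
struct demo_sock, struct demo_ca, demo_ca() and demo_init() are all
hypothetical names chosen for illustration, and the real size of the area
depends on the kernel version.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CA_PRIV_SIZE 64	/* hypothetical size of the preallocated area */

/* Hypothetical stand-in for the socket; the area is declared as 64-bit
 * words so that it is suitably aligned for the struct stored in it. */
struct demo_sock {
	uint64_t ca_priv[CA_PRIV_SIZE / sizeof(uint64_t)];
};

/* Example per-connection state of a made-up algorithm. */
struct demo_ca {
	uint32_t min_rtt_us;
	uint32_t last_cwnd;
	uint64_t epoch_start;
};

/* Refuse to compile if the private data would overflow the area. */
_Static_assert(sizeof(struct demo_ca) <= CA_PRIV_SIZE,
	       "demo_ca does not fit in the private area");

/* Plays the role that tcp_ca(tp) has in the text. */
static struct demo_ca *demo_ca(struct demo_sock *sk)
{
	return (struct demo_ca *)sk->ca_priv;
}

static void demo_init(struct demo_sock *sk)
{
	struct demo_ca *ca = demo_ca(sk);

	assert(ca != NULL);
	memset(ca, 0, sizeof(*ca));
	ca->min_rtt_us = UINT32_MAX;	/* no RTT sample yet */
}
```

If struct demo_ca later grows past the assumed limit, the build fails instead
of silently corrupting the fields that follow the private area.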
There are currently three kinds of congestion control algorithms. The
simplest ones are derived from TCP Reno (highspeed, scalable) and just
provide an alternative congestion window calculation. More complex
ones, like BIC, look at other events to provide better heuristics.
Finally, there are round-trip-time-based algorithms like Vegas and
Westwood+.

Good TCP congestion control is a complex problem because the algorithm
needs to maintain both fairness and performance. Please review current
research and the relevant RFCs before developing new modules.

The default congestion control mechanism is chosen based on the
DEFAULT_TCP_CONG Kconfig parameter. To override that default at run time,
set the sysctl net.ipv4.tcp_congestion_control. The module will be
autoloaded if needed and you will get the expected protocol. If you ask for
an unknown congestion control method, the sysctl attempt will fail.

If you remove a TCP congestion control module, the next available one is
used instead. Since reno cannot be built as a module and cannot be removed,
it is always available.

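The selection and fallback rules above can be modelled with an ordered list
whose first entry is the default. This is a toy userspace model under stated
assumptions, not the kernel's data structure: ca_list, ca_set_default(),
ca_remove() and ca_default() are hypothetical, "cubic" is used only as an
example of a removable algorithm, and module autoloading is not modelled.

```c
#include <assert.h>
#include <string.h>

#define MAX_CA 8

/* Hypothetical registry; entry 0 is the current default. */
static const char *ca_list[MAX_CA] = { "cubic", "reno" };
static int ca_count = 2;

static int ca_find(const char *name)
{
	for (int i = 0; i < ca_count; i++)
		if (strcmp(ca_list[i], name) == 0)
			return i;
	return -1;
}

static const char *ca_default(void)
{
	assert(ca_count > 0);	/* reno is always present */
	return ca_list[0];
}

/* Models writing the sysctl: unknown names are rejected. */
static int ca_set_default(const char *name)
{
	int i = ca_find(name);
	const char *chosen;

	if (i < 0)
		return -1;
	chosen = ca_list[i];
	/* Shift entries 0..i-1 down by one and put the choice first. */
	memmove(&ca_list[1], &ca_list[0], (size_t)i * sizeof(ca_list[0]));
	ca_list[0] = chosen;
	return 0;
}

/* Models module removal: reno is built in and cannot be removed. */
static int ca_remove(const char *name)
{
	int i = ca_find(name);

	if (i < 0 || strcmp(name, "reno") == 0)
		return -1;
	memmove(&ca_list[i], &ca_list[i + 1],
		(size_t)(ca_count - i - 1) * sizeof(ca_list[0]));
	ca_count--;
	return 0;
}
```

Removing the current default simply exposes the next entry, and because reno
can never be removed the list is never empty.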
How the new TCP output machine [nyi] works
==========================================

Data is kept on a single queue. The skb->users flag tells us whether the
frame has already been queued. To add a frame we append it to the end. ACK
processing walks down the list from the start.

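Since this design is marked as not yet implemented, the following is only a
sketch of the queue discipline just described. struct frame, struct txq,
txq_add() and txq_ack() are hypothetical names, and each frame is assumed to
cover the sequence range [seq, end_seq).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame covering sequence numbers [seq, end_seq). */
struct frame {
	uint32_t seq;
	uint32_t end_seq;
	struct frame *next;
};

struct txq {
	struct frame *head;	/* oldest unacknowledged frame */
	struct frame *tail;	/* where new frames are added */
};

/* New frames always go on the end of the queue. */
static void txq_add(struct txq *q, struct frame *f)
{
	assert(f != NULL);

	f->next = NULL;
	if (q->tail)
		q->tail->next = f;
	else
		q->head = f;
	q->tail = f;
}

/* Walk from the start and unlink every frame fully covered by ack.
 * Returns the number of frames released. */
static int txq_ack(struct txq *q, uint32_t ack)
{
	int freed = 0;

	/* The signed difference keeps the comparison valid across
	 * sequence number wrap-around. */
	while (q->head && (int32_t)(ack - q->head->end_seq) >= 0) {
		q->head = q->head->next;
		freed++;
	}
	if (!q->head)
		q->tail = NULL;
	return freed;
}
```

A cumulative ACK that lands in the middle of a frame releases only the frames
before it. The partially acknowledged frame stays at the head of the queue.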
We keep a set of control flags:

	sk->tcp_pend_event

		TCP_PEND_ACK			Ack needed
		TCP_ACK_NOW			Needed now
		TCP_WINDOW			Window update check
		TCP_WINZERO			Zero probing

	sk->transmit_queue		Start of the transmit queue
	sk->transmit_new		First new frame pointer
	sk->transmit_end		Where to add frames

	sk->tcp_last_tx_ack		Last ack seen
	sk->tcp_dup_ack			Dup ack count for fast retransmit

Frames are queued for output by tcp_write. We try to send frames
immediately if possible; otherwise we queue them and compute the body
checksum during the copy.

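Computing the checksum during the copy means the payload is read only once.
The sketch below shows the idea in portable C with a 16-bit ones' complement
sum, the kind used by the Internet checksum. copy_and_sum() is a hypothetical
helper written for this example, not a kernel function, and it returns the
folded sum without the final inversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes from src to dst and return the folded 16-bit ones'
 * complement sum of the data, treating it as big-endian 16-bit words. */
static uint16_t copy_and_sum(uint8_t *dst, const uint8_t *src, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	assert(dst != NULL && src != NULL);

	for (i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		sum += ((uint32_t)src[i] << 8) | src[i + 1];
	}
	if (i < len) {		/* odd trailing byte, padded with zero */
		dst[i] = src[i];
		sum += (uint32_t)src[i] << 8;
	}
	while (sum >> 16)	/* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 the word sum is 0x2ddf0, which
folds to 0xddf2, and the destination buffer ends up as an exact copy of the
source.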
When a write is done we try to clear any pending events and piggyback them
on the outgoing frame. If the window is full we queue full-sized frames. On
the first timeout in a zero window we split the queued frame.

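Piggybacking pending events on a write could look roughly like the sketch
below. The text names the flags but gives no values, so the bit assignments
here are hypothetical, as are struct out_frame and piggyback(). The choice to
leave TCP_WINZERO pending is also an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values are hypothetical; the text only names the flags. */
#define TCP_PEND_ACK	0x01
#define TCP_ACK_NOW	0x02
#define TCP_WINDOW	0x04
#define TCP_WINZERO	0x08

struct out_frame {
	int carries_ack;	/* frame acknowledges received data */
	int carries_window;	/* frame advertises the current window */
};

/* Called when a data frame is about to be written: fold the pending ACK
 * and window update into it and return the events still pending. */
static unsigned int piggyback(unsigned int pend_event, struct out_frame *f)
{
	assert(f != NULL);

	f->carries_ack = (pend_event & (TCP_PEND_ACK | TCP_ACK_NOW)) != 0;
	f->carries_window = (pend_event & TCP_WINDOW) != 0;
	return pend_event & ~(unsigned int)(TCP_PEND_ACK | TCP_ACK_NOW |
					    TCP_WINDOW);
}
```

The caller would store the return value back into the pending event word, so
a later write does not send the same acknowledgement twice.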
When the timer fires we walk the retransmit list to send any retransmits,
update the backoff timers, and so on. A change in the route table stamp
causes the header to be changed and recomputed. We add any new TCP-level
headers and finish the checksum again before sending.

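The text does not say how the backoff timers are updated. A conventional
choice, assumed here, is exponential backoff: each timeout doubles the
retransmit timeout up to a cap, and an acknowledgement of new data resets it.
struct rtx_timer, rtx_timeout(), rtx_acked() and the RTO_MAX_MS value are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RTO_MAX_MS 120000U	/* hypothetical upper bound */

struct rtx_timer {
	unsigned int rto_ms;	/* current retransmit timeout */
	unsigned int backoff;	/* consecutive timeouts so far */
};

/* On timeout: count it and double the timeout, up to the cap. */
static void rtx_timeout(struct rtx_timer *t)
{
	assert(t != NULL);

	t->backoff++;
	if (t->rto_ms > RTO_MAX_MS / 2)
		t->rto_ms = RTO_MAX_MS;
	else
		t->rto_ms *= 2;
}

/* When new data is acknowledged the backoff is cleared. */
static void rtx_acked(struct rtx_timer *t, unsigned int base_rto_ms)
{
	assert(t != NULL);

	t->backoff = 0;
	t->rto_ms = base_rto_ms;
}
```

Comparing against half of the cap before doubling avoids overflowing the
timeout when it is already close to the maximum.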