Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports.  The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are

 *	Addressing
        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that involve
        passing addresses between kernel and user space generally
        use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

 *	Socket interface
        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the getsockopt/
        setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

 *	sysctls
        RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
	AF_RDS and PF_RDS are the domain type to be used with socket(2)
	to create RDS sockets. SOL_RDS is the socket-level to be used
	with setsockopt(2) and getsockopt(2) for RDS specific socket
	options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDSIZE bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVSIZE option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.
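
        A minimal sketch of sizing both queues, assuming the SO_SNDSIZE
        and SO_RCVSIZE limits above are set through the standard
        SO_SNDBUF/SO_RCVBUF options (the sizes are arbitrary example
        values, not recommendations):

            int sndbuf = 1 << 20;   /* bytes that may be queued for send */
            int rcvbuf = 1 << 20;   /* soft limit on the receive queue   */

            setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));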

  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport, if one has not already been selected via the
	SO_RDS_TRANSPORT socket option.
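
        For example (a sketch; the local address and port below are
        placeholders):

            #include <netinet/in.h>
            #include <arpa/inet.h>

            struct sockaddr_in sin = {
                    .sin_family = AF_INET,
                    .sin_port   = htons(4000),  /* RDS port, not a TCP/UDP port */
            };

            inet_pton(AF_INET, "192.0.2.1", &sin.sin_addr);
            if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
                    /* handle error */;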

  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDSIZE will
        return -EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDSIZE threshold will return
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return ENOBUFS.
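
        A sketch of a send that handles the error cases above (the
        destination address and the buf/len payload are placeholders):

            struct sockaddr_in dst = { .sin_family = AF_INET,
                                       .sin_port   = htons(4000) };
            struct iovec iov  = { .iov_base = buf, .iov_len = len };
            struct msghdr msg = {
                    .msg_name = &dst, .msg_namelen = sizeof(dst),
                    .msg_iov  = &iov, .msg_iovlen  = 1,
            };

            inet_pton(AF_INET, "192.0.2.2", &dst.sin_addr);
            if (sendmsg(fd, &msg, 0) < 0) {
                    if (errno == EMSGSIZE)
                            /* message larger than the send buffer */;
                    else if (errno == EAGAIN)
                            /* send queue full: wait for POLLOUT, retry */;
                    else if (errno == ENOBUFS)
                            /* destination congested: wait for a congestion
                               update, then retry */;
            }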

  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        receive queue accounting is adjusted, and if the queue length
        drops below SO_RCVSIZE, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrived, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in manpages.
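
        A sketch of receiving a datagram together with any notifications
        (the control-message dispatch is only illustrative; see the
        rds(7) and rds-rdma(7) manpages for the actual cmsg types):

            char buf[8192], cbuf[CMSG_SPACE(256)];
            struct iovec iov  = { .iov_base = buf, .iov_len = sizeof(buf) };
            struct msghdr msg = {
                    .msg_iov     = &iov,  .msg_iovlen     = 1,
                    .msg_control = cbuf,  .msg_controllen = sizeof(cbuf),
            };
            struct cmsghdr *c;

            recvmsg(fd, &msg, 0);
            for (c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
                    if (c->cmsg_level == SOL_RDS)
                            /* dispatch on c->cmsg_type, e.g. an RDMA
                               completion or congestion notification */;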

  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        this - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.
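
        For instance, a poll-driven sender still has to cope with
        ENOBUFS (a sketch):

            #include <poll.h>

            struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };

            if (poll(&pfd, 1, -1) > 0) {
                    if (pfd.revents & POLLIN)
                            /* recvmsg() a datagram or a notification */;
                    if (pfd.revents & POLLOUT)
                            /* sendmsg(); be prepared for ENOBUFS if the
                               chosen destination is congested */;
            }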

  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        In particular, the application can cancel outstanding messages if
        it detects a timeout. For instance, if it tried to send a message,
        and the remote host is unreachable, RDS will keep trying forever.
        The application may decide it's not worth it, and cancel the
        operation. In this case, it would use RDS_CANCEL_SENT_TO to
        nuke any pending messages.
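
        For example, to drop everything still queued to one peer (the
        address is a placeholder):

            struct sockaddr_in dst = { .sin_family = AF_INET,
                                       .sin_port   = htons(4000) };

            inet_pton(AF_INET, "192.0.2.2", &dst.sin_addr);
            setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO, &dst, sizeof(dst));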

  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
	Set or read an integer defining the underlying
	encapsulating transport to be used for RDS packets on the
	socket. When setting the option, the integer argument may be
	one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
	value, RDS_TRANS_NONE will be returned on an unbound socket.
	This socket option may only be set exactly once on the socket,
	prior to binding it via the bind(2) system call. Attempts to
	set SO_RDS_TRANSPORT on a socket for which the transport has
	been previously attached explicitly (by SO_RDS_TRANSPORT) or
	implicitly (via bind(2)) will return an error of EOPNOTSUPP.
	An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
	always return EINVAL.
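
	For example, to force RDS-over-TCP before binding (a sketch;
	RDS_TRANS_TCP and friends come from <linux/rds.h>):

	    int trans = RDS_TRANS_TCP;

	    if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
	                   &trans, sizeof(trans)) < 0)
	            /* EOPNOTSUPP if a transport is already attached,
	               EINVAL if the value is RDS_TRANS_NONE */;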

RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):
    Fields:
      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          CONG_BITMAP - this is a congestion update bitmap
          ACK_REQUIRED - receiver must ack this packet
          RETRANSMITTED - packet has previously been sent
      h_credit:
          indicate to other end of connection that
          it has more credits available (i.e. there is
          more send room)
      h_padding[4]:
          unused, for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.
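
    Taken together, the field list above corresponds roughly to the
    following layout (a sketch only; see 'struct rds_header' in
    net/rds/rds.h for the authoritative definition):

        struct rds_header {
                __be64  h_sequence;    /* per-packet sequence number   */
                __be64  h_ack;         /* piggybacked ack              */
                __be32  h_len;         /* payload length, sans header  */
                __be16  h_sport;       /* source port                  */
                __be16  h_dport;       /* destination port             */
                __u8    h_flags;       /* CONG_BITMAP, ACK_REQUIRED... */
                __u8    h_credit;      /* send credits granted to peer */
                __u8    h_padding[4];
                __sum16 h_csum;        /* header checksum              */
                __u8    h_exthdr[16];  /* optional extension headers   */
        };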

  ACK and retransmit handling

      One might think that with reliable IB connections you wouldn't need
      to ack messages that have been received.  The problem is that IB
      hardware generates an ack message before it has DMAed the message
      into memory.  This creates a potential message loss if the HCA is
      disabled for any reason between when it sends the ack and before
      the message is DMAed and processed.  This is only a potential issue
      if another HCA is available for fail-over.

      Sending an ack immediately would allow the sender to free the sent
      message from their send queue quickly, but could cause excessive
      traffic to be used for acks. RDS piggybacks acks on sent data
      packets.  Ack-only packets are reduced by only allowing one to be
      in flight at a time, and by the sender only asking for acks when
      its send buffers start to fill up. All retransmissions are also
      acked.

  Flow Control

      RDS's IB transport uses a credit-based mechanism to verify that
      there is space in the peer's receive buffers for more data. This
      eliminates the need for hardware retries on the connection.
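
      The idea, in sketch form (illustrative only, not the kernel's
      code; the names below are hypothetical):

          struct cred_state { int send_credits; };

          static int can_post_message(struct cred_state *cs)
          {
                  if (cs->send_credits == 0)
                          return 0;        /* wait for a credit update */
                  cs->send_credits--;      /* one credit per message   */
                  return 1;
          }

          static void credit_update(struct cred_state *cs, int h_credit)
          {
                  cs->send_credits += h_credit;  /* peer reposted buffers */
          }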

  Congestion

      Messages waiting in the receive queue on the receiving socket
      are accounted against the socket's SO_RCVBUF option value.  Only
      the payload bytes in the message are accounted for.  If the
      number of bytes queued equals or exceeds rcvbuf then the socket
      is congested.  All sends attempted to this socket's address
      should block or return -EWOULDBLOCK.

      Applications are expected to be reasonably tuned such that this
      situation very rarely occurs.  An application encountering this
      "back-pressure" is considered a bug.

      This is implemented by having each node maintain bitmaps which
      indicate which ports on bound addresses are congested.  As the
      bitmap changes it is sent through all the connections which
      terminate in the local address of the bitmap which changed.

      The bitmaps are allocated as connections are brought up.  This
      avoids allocation in the interrupt handling path which queues
      messages on sockets.  The dense bitmaps let transports send the
      entire bitmap on any bitmap change reasonably efficiently.  This
      is much easier to implement than some finer-grained
      communication of per-port congestion.  The sender does a very
      inexpensive bit test to check if the port it's about to send to
      is congested or not.
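
      In sketch form, the sender-side check amounts to no more than
      (simplified, purely illustrative; the kernel's rds_cong_map code
      is the real implementation):

          /* one bit per 16-bit destination port, 64K bits per address */
          static int port_is_congested(const unsigned char *map,
                                       unsigned int port)
          {
                  return map[port / 8] & (1u << (port % 8));
          }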


RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.
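
  Concretely, each transport registers a table of operations with the
  general layer, roughly along these lines (an illustrative sketch
  with hypothetical member names; see 'struct rds_transport' in
  net/rds/rds.h for the real interface):

      struct rds_transport {
              int  (*conn_connect)(struct rds_connection *conn);
              void (*conn_shutdown)(struct rds_connection *conn);
              int  (*xmit)(struct rds_connection *conn,
                           struct rds_message *rm);
              int  (*recv)(struct rds_connection *conn);
              /* ... address checks, stats, congestion map handling ... */
      };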


RDS Kernel Structures
=====================

  struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.
  struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.
  struct rds_socket
    per-socket information
  struct rds_connection
    per-connection information
  struct rds_transport
    pointers to transport-specific functions
  struct rds_statistics
    non-transport-specific statistics
  struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.


The send path
=============

  rds_sendmsg()
    struct rds_message built from incoming data
    CMSGs parsed (e.g. RDMA ops)
    transport connection alloced and connected if not already
    rds_message placed on send queue
    send worker awoken
  rds_send_worker()
    calls rds_send_xmit() until queue is empty
  rds_send_xmit()
    transmits congestion map if one is pending
    may set ACK_REQUIRED
    calls transport to send either non-RDMA or RDMA message
    (RDMA ops never retransmitted)
  rds_ib_xmit()
    allocs work requests from send ring
    adds any new send credits available to peer (h_credits)
    maps the rds_message's sg list
    piggybacks ack
    populates work requests
    post send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
    looks at write completions
    unmaps recv buffer from device
    no errors, call rds_ib_process_recv()
    refill recv ring
  rds_ib_process_recv()
    validate header checksum
    copy header to rds_ib_incoming struct if start of a new datagram
    add to ibinc's fraglist
    if completed datagram:
      update cong map if datagram was cong update
      call rds_recv_incoming() otherwise
      note if ack is required
  rds_recv_incoming()
    drop duplicate packets
    respond to pings
    find the sock associated with this datagram
    add to sock queue
    wake up sock
    do some congestion calculations
  rds_recvmsg
    copy data into user iovec
    handle CMSGs
    return to application

Multipath RDS (mprds)
=====================
  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
  (though the concept can be extended to other transports). The classical
  implementation of RDS-over-TCP is implemented by demultiplexing multiple
  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
  port]) over a single TCP socket between the 2 IP addresses involved. This
  has the limitation that it ends up funneling multiple RDS flows over a
  single TCP flow, thus it
  (a) is upper-bounded by the single-flow bandwidth, and
  (b) suffers from head-of-line blocking for all the RDS sockets.

  Better throughput (for a fixed small packet size, MTU) can be achieved
  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
  connection. RDS sockets will be attached to a path based on some hash
  (e.g., of local address and RDS port number) and packets for that RDS
  socket will be sent over the attached path using TCP to segment/reassemble
  RDS datagrams on that path.

  Multipathed RDS is implemented by splitting the struct rds_connection into
  a common (to all paths) part, and a per-path struct rds_conn_path. All
  I/O workqs and reconnect threads are driven from the rds_conn_path.
  Transports such as TCP that are multipath capable may then set up a
  TCP socket per rds_conn_path, and this is managed by the transport via
  the transport-private cp_transport_data pointer.

  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths. The outgoing hash is computed based on the
  local address and port that the PF_RDS socket is bound to.
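
  Path selection therefore amounts to something like the following
  (a hypothetical sketch, not the kernel's exact hash function):

      /* pick one of the negotiated paths for a bound PF_RDS socket */
      static int rds_pick_path(uint16_t bound_port, int npaths)
      {
              /* any stable hash of the bound address/port will do */
              return bound_port % npaths;
      }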

  Additionally, even if the transport is MP capable, we may be
  peering with some node that does not support mprds, or supports
  a different number of paths. As a result, the peering nodes need
  to agree on the number of paths to be used for the connection.
  This is done by exchanging control packets before the
  first data packet. The control packet exchange must have completed
  prior to outgoing hash completion in rds_sendmsg() when the transport
  is multipath capable.

  The control packet is an RDS ping packet (i.e., a packet to RDS dest
  port 0) with the ping packet having an RDS extension header option of
  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
  number of paths supported by the sender. The "probe" ping packet will
  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
  be able to compute the min(sender_paths, rcvr_paths). The pong
  sent in response to a probe-ping should contain the rcvr's npaths
  when the rcvr is mprds-capable.

  If the rcvr is not mprds-capable, the exthdr in the ping will be
  ignored.  In this case the pong will not have any exthdrs, so the sender
  of the probe-ping can default to single-path mprds.