Blame - marvell/linux/Documentation/admin-guide/mm/numaperf.rst - T108

blob: a80c3c37226eb0244be4d549026b8fa0885fba93

b.liu	e958203	2025-04-17 19:18:16 +0800	1	.. _numaperf:
				2
				3	=============
				4	NUMA Locality
				5	=============
				6
				7	Some platforms may have multiple types of memory attached to a compute
				8	node. These disparate memory ranges may share some characteristics, such
				9	as CPU cache coherence, but may have different performance. For example,
				10	different media types and buses affect bandwidth and latency.
				11
				12	A system supports such heterogeneous memory by grouping each memory type
				13	under different domains, or "nodes", based on locality and performance
				14	characteristics. Some memory may share the same node as a CPU, while other
				15	memory is provided as memory-only nodes. While memory-only nodes do not
				16	provide CPUs, they may still be local to one or more compute nodes relative
				17	to other nodes. The following diagram shows one such example of two compute
				18	nodes with local memory and a memory-only node for each compute node::
				19
				20	+------------------+     +------------------+
				21	| Compute Node 0   +-----+ Compute Node 1   |
				22	| Local Node0 Mem  |     | Local Node1 Mem  |
				23	+--------+---------+     +--------+---------+
				24	         |                        |
				25	+--------+---------+     +--------+---------+
				26	| Slower Node2 Mem |     | Slower Node3 Mem |
				27	+------------------+     +--------+---------+
				28
				29	A "memory initiator" is a node containing one or more devices such as
				30	CPUs or separate memory I/O devices that can initiate memory requests.
				31	A "memory target" is a node containing one or more physical address
				32	ranges accessible from one or more memory initiators.
				33
				34	When multiple memory initiators exist, they may not all have the same
				35	performance when accessing a given memory target. Each initiator-target
				36	pair may be organized into different ranked access classes to represent
				37	this relationship. The highest performing initiator to a given target
				38	is considered to be one of that target's local initiators, and is given
				39	the highest access class, 0. Any given target may have one or more
				40	local initiators, and any given initiator may have multiple local
				41	memory targets.
				42
				43	To help applications match memory targets with their initiators, the
				44	kernel provides symlinks from each to the other. The following example lists the
				45	relationship for the access class "0" memory initiators and targets::
				46
				47	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
				48	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
				49
				50	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
				51	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
				52
				53	A memory initiator may have multiple memory targets in the same access
				54	class. The target memory's initiators in a given class indicate that the
				55	nodes' access characteristics share the same performance relative to other
				56	linked initiator nodes. The targets within an initiator's access class,
				57	though, do not necessarily perform the same as each other.
				58
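The symlink layout above can also be walked from a script. The following is a minimal sketch in Python, not part of the kernel interface: the helper name ``linked_nodes`` and its parameters are hypothetical, and it assumes only the ``accessN/targets`` and ``accessN/initiators`` directories described above.

```python
import os


def linked_nodes(node_dir, access_class=0, kind="targets"):
    """Return node names linked under <node_dir>/access<N>/<kind>/.

    node_dir is a path such as /sys/devices/system/node/node0 and kind is
    either "targets" or "initiators". The function name and arguments are
    illustrative assumptions, not a kernel-provided API.
    """
    path = os.path.join(node_dir, f"access{access_class}", kind)
    try:
        entries = os.listdir(path)
    except OSError:
        # The platform may not expose this access class for the node.
        return []
    # Keep only nodeN links; the initiators directory may also hold
    # attribute files such as read_latency.
    nodes = [e for e in entries if e.startswith("node") and e[4:].isdigit()]
    return sorted(nodes, key=lambda name: int(name[4:]))
```

For example, ``linked_nodes("/sys/devices/system/node/node0", kind="initiators")`` would list the class 0 initiators of node 0 on a system that provides them, and an empty list otherwise.
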
				59	================
				60	NUMA Performance
				61	================
				62
				63	Applications may wish to consider which node they want their memory to
				64	be allocated from based on the node's performance characteristics. If
				65	the system provides these attributes, the kernel exports them under the
				66	node sysfs hierarchy by appending the attributes directory under the
				67	memory node's access class 0 initiators as follows::
				68
				69	/sys/devices/system/node/nodeY/access0/initiators/
				70
				71	These attributes apply only when accessed from nodes that are linked
				72	under this access class's initiators.
				73
				74	The performance characteristics the kernel provides for the local initiators
				75	are exported as follows::
				76
				77	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
				78	/sys/devices/system/node/nodeY/access0/initiators/
				79	|-- read_bandwidth
				80	|-- read_latency
				81	|-- write_bandwidth
				82	`-- write_latency
				83
				84	The bandwidth attributes are provided in MiB/second.
				85
				86	The latency attributes are provided in nanoseconds.
				87
				88	The values reported here correspond to the rated latency and bandwidth
				89	for the platform.
				90
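As a rough illustration of how an application might consume these attributes, the sketch below reads whichever of the four files are present. The function name ``read_access_perf`` is hypothetical, and the sketch assumes each file holds a single decimal integer; bandwidth is in MiB/second and latency in nanoseconds, as stated above.

```python
import os

# Attribute file names as listed in the tree output above.
PERF_ATTRS = ("read_bandwidth", "read_latency", "write_bandwidth", "write_latency")


def read_access_perf(node_dir, access_class=0):
    """Return {attribute: int} for a memory target's access class.

    node_dir is a path such as /sys/devices/system/node/node1. Attributes
    that are missing or unparsable are skipped, since the platform may not
    provide them. Assumes one decimal integer per file.
    """
    base = os.path.join(node_dir, f"access{access_class}", "initiators")
    values = {}
    for name in PERF_ATTRS:
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return values
```

An application could call this for each candidate memory node and compare, for instance, ``read_latency`` values before choosing where to allocate.
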
				91	==========
				92	NUMA Cache
				93	==========
				94
				95	System memory may be constructed in a hierarchy of elements with various
				96	performance characteristics in order to provide a large address space of
				97	slower performing memory cached by a smaller, higher performing memory. The
				98	system physical addresses that memory initiators are aware of are provided
				99	by the last memory level in the hierarchy. The system meanwhile uses
				100	higher performing memory to transparently cache access to progressively
				101	slower levels.
				102
				103	The term "far memory" is used to denote the last level memory in the
				104	hierarchy. Each increasing cache level provides higher performing
				105	initiator access, and the term "near memory" represents the fastest
				106	cache provided by the system.
				107
				108	This numbering differs from CPU caches, where the cache level (e.g.
				109	L1, L2, L3) uses the CPU-side view in which each increased level is lower
				110	performing. In contrast, the memory cache level is centric to the last
				111	level memory, so the higher numbered cache level corresponds to memory
				112	nearer to the CPU, and further from far memory.
				113
				114	The memory-side caches are not directly addressable by software. When
				115	software accesses a system address, the system will return it from the
				116	near memory cache if it is present. If it is not present, the system
				117	accesses the next level of memory until there is either a hit in that
				118	cache level, or it reaches far memory.
				119
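The lookup order described above can be pictured with a toy model. This is only an illustration of the numbering convention: real memory-side caches are managed by the platform and are invisible to software, and the function name ``serving_level`` and the dictionary layout are assumptions made for this sketch.

```python
def serving_level(address, caches):
    """Return the memory-side cache level that serves address.

    caches maps a cache level number to the set of addresses that level
    currently holds. Returns None when no level hits, meaning the access
    reaches far memory. Purely a model; hardware does this transparently.
    """
    # Higher-numbered levels are nearer to the initiator, so try them first.
    for level in sorted(caches, reverse=True):
        if address in caches[level]:
            return level
    return None
```

With ``caches = {1: {0x1000, 0x2000}, 2: {0x1000}}``, address ``0x1000`` is served by level 2 (near memory), ``0x2000`` by level 1, and any other address falls through to far memory.
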
				120	An application does not need to know about caching attributes in order
				121	to use the system. Software may optionally query the memory cache
				122	attributes in order to maximize the performance out of such a setup.
				123	If the system provides a way for the kernel to discover this information,
				124	for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
				125	the kernel will append these attributes to the NUMA node memory target.
				126
				127	When the kernel first registers a memory cache with a node, the kernel
				128	will create the following directory::
				129
				130	/sys/devices/system/node/nodeX/memory_side_cache/
				131
				132	If that directory is not present, the system either does not provide
				133	a memory-side cache, or that information is not accessible to the kernel.
				134
				135	The attributes for each level of cache are provided under its cache
				136	level index::
				137
				138	/sys/devices/system/node/nodeX/memory_side_cache/indexA/
				139	/sys/devices/system/node/nodeX/memory_side_cache/indexB/
				140	/sys/devices/system/node/nodeX/memory_side_cache/indexC/
				141
				142	Each cache level's directory provides its attributes. For example, the
				143	following shows a single cache level and the attributes available for
				144	software to query::
				145
				146	# tree /sys/devices/system/node/node0/memory_side_cache/
				147	/sys/devices/system/node/node0/memory_side_cache/
				148	|-- index1
				149	|   |-- indexing
				150	|   |-- line_size
				151	|   |-- size
				152	|   `-- write_policy
				153
				154	The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
				155	for any other index-based, multi-way associativity.
				156
				157	The "line_size" is the number of bytes accessed from the next cache
				158	level on a miss.
				159
				160	The "size" is the number of bytes provided by this cache level.
				161
				162	The "write_policy" will be 0 for write-back, and non-zero for
				163	write-through caching.
				164
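Software that wants to query these attributes could do so along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, each attribute file is assumed to hold one decimal integer, and the 0 versus non-zero interpretation follows the descriptions above.

```python
import os

CACHE_ATTRS = ("indexing", "line_size", "size", "write_policy")


def read_memory_side_caches(node_dir):
    """Return {cache level: {attribute: int}} for one node.

    node_dir is a path such as /sys/devices/system/node/node0. An empty
    dict means the memory_side_cache directory is absent, so either there
    is no memory-side cache or the kernel cannot see it.
    """
    base = os.path.join(node_dir, "memory_side_cache")
    try:
        entries = os.listdir(base)
    except OSError:
        return {}
    caches = {}
    for entry in entries:
        # Only indexN directories describe cache levels.
        if not (entry.startswith("index") and entry[5:].isdigit()):
            continue
        attrs = {}
        for name in CACHE_ATTRS:
            try:
                with open(os.path.join(base, entry, name)) as f:
                    attrs[name] = int(f.read().strip())
            except (OSError, ValueError):
                continue
        caches[int(entry[5:])] = attrs
    return caches


def indexing_name(value):
    """0 means direct-mapped; any non-zero value is another indexing scheme."""
    return "direct-mapped" if value == 0 else "other indexing"


def write_policy_name(value):
    """0 means write-back; any non-zero value means write-through."""
    return "write-back" if value == 0 else "write-through"
```

For a node exposing a single ``index1`` directory, the returned dict has one key, ``1``, whose value holds the four attributes as integers.
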
				165	========
				166	See Also
				167	========
				168
				169	[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
				170	- Section 5.2.27