@node Resource Usage And Limitation, Non-Local Exits, Date and Time, Top
|  | 2 | @c %MENU% Functions for examining resource usage and getting and setting limits | 
|  | 3 | @chapter Resource Usage And Limitation | 
|  | 4 | This chapter describes functions for examining how much of various kinds of | 
|  | 5 | resources (CPU time, memory, etc.) a process has used and getting and setting | 
|  | 6 | limits on future usage. | 
|  | 7 |  | 
|  | 8 | @menu | 
|  | 9 | * Resource Usage::		Measuring various resources used. | 
|  | 10 | * Limits on Resources::		Specifying limits on resource usage. | 
|  | 11 | * Priority::			Reading or setting process run priority. | 
|  | 12 | * Memory Resources::            Querying memory available resources. | 
|  | 13 | * Processor Resources::         Learn about the processors available. | 
|  | 14 | @end menu | 
|  | 15 |  | 
|  | 16 |  | 
|  | 17 | @node Resource Usage | 
|  | 18 | @section Resource Usage | 
|  | 19 |  | 
|  | 20 | @pindex sys/resource.h | 
|  | 21 | The function @code{getrusage} and the data type @code{struct rusage} | 
|  | 22 | are used to examine the resource usage of a process.  They are declared | 
|  | 23 | in @file{sys/resource.h}. | 
|  | 24 |  | 
|  | 25 | @comment sys/resource.h | 
|  | 26 | @comment BSD | 
|  | 27 | @deftypefun int getrusage (int @var{processes}, struct rusage *@var{rusage}) | 
|  | 28 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 29 | @c On HURD, this calls task_info 3 times.  On UNIX, it's a syscall. | 
|  | 30 | This function reports resource usage totals for processes specified by | 
|  | 31 | @var{processes}, storing the information in @code{*@var{rusage}}. | 
|  | 32 |  | 
|  | 33 | In most systems, @var{processes} has only two valid values: | 
|  | 34 |  | 
|  | 35 | @table @code | 
|  | 36 | @comment sys/resource.h | 
|  | 37 | @comment BSD | 
|  | 38 | @item RUSAGE_SELF | 
|  | 39 | Just the current process. | 
|  | 40 |  | 
|  | 41 | @comment sys/resource.h | 
|  | 42 | @comment BSD | 
|  | 43 | @item RUSAGE_CHILDREN | 
|  | 44 | All child processes (direct and indirect) that have already terminated. | 
|  | 45 | @end table | 
|  | 46 |  | 
The return value of @code{getrusage} is zero for success, and @code{-1}
for failure.  The following @code{errno} error condition is defined for
this function:
|  | 49 |  | 
|  | 50 | @table @code | 
|  | 51 | @item EINVAL | 
|  | 52 | The argument @var{processes} is not valid. | 
|  | 53 | @end table | 
|  | 54 | @end deftypefun | 
|  | 55 |  | 
|  | 56 | One way of getting resource usage for a particular child process is with | 
|  | 57 | the function @code{wait4}, which returns totals for a child when it | 
|  | 58 | terminates.  @xref{BSD Wait Functions}. | 
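
For illustration, here is a sketch of a parent collecting the CPU time
of one specific child with @code{wait4}; error checking and the child's
real work are omitted:

@smallexample
#include <sys/wait.h>
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>

void
report_child_usage (void)
@{
  pid_t child = fork ();
  if (child == 0)
    @{
      /* ... the child's work goes here ... */
      _exit (0);
    @}

  int status;
  struct rusage usage;
  if (wait4 (child, &status, 0, &usage) == child)
    printf ("child used %ld.%06ld seconds of user time\n",
            (long) usage.ru_utime.tv_sec,
            (long) usage.ru_utime.tv_usec);
@}
@end smallexample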
|  | 59 |  | 
|  | 60 | @comment sys/resource.h | 
|  | 61 | @comment BSD | 
|  | 62 | @deftp {Data Type} {struct rusage} | 
|  | 63 | This data type stores various resource usage statistics.  It has the | 
|  | 64 | following members, and possibly others: | 
|  | 65 |  | 
|  | 66 | @table @code | 
|  | 67 | @item struct timeval ru_utime | 
|  | 68 | Time spent executing user instructions. | 
|  | 69 |  | 
|  | 70 | @item struct timeval ru_stime | 
|  | 71 | Time spent in operating system code on behalf of @var{processes}. | 
|  | 72 |  | 
|  | 73 | @item long int ru_maxrss | 
|  | 74 | The maximum resident set size used, in kilobytes.  That is, the maximum | 
|  | 75 | number of kilobytes of physical memory that @var{processes} used | 
|  | 76 | simultaneously. | 
|  | 77 |  | 
|  | 78 | @item long int ru_ixrss | 
|  | 79 | An integral value expressed in kilobytes times ticks of execution, which | 
|  | 80 | indicates the amount of memory used by text that was shared with other | 
|  | 81 | processes. | 
|  | 82 |  | 
|  | 83 | @item long int ru_idrss | 
|  | 84 | An integral value expressed the same way, which is the amount of | 
|  | 85 | unshared memory used for data. | 
|  | 86 |  | 
|  | 87 | @item long int ru_isrss | 
|  | 88 | An integral value expressed the same way, which is the amount of | 
|  | 89 | unshared memory used for stack space. | 
|  | 90 |  | 
|  | 91 | @item long int ru_minflt | 
|  | 92 | The number of page faults which were serviced without requiring any I/O. | 
|  | 93 |  | 
|  | 94 | @item long int ru_majflt | 
|  | 95 | The number of page faults which were serviced by doing I/O. | 
|  | 96 |  | 
|  | 97 | @item long int ru_nswap | 
|  | 98 | The number of times @var{processes} was swapped entirely out of main memory. | 
|  | 99 |  | 
|  | 100 | @item long int ru_inblock | 
|  | 101 | The number of times the file system had to read from the disk on behalf | 
|  | 102 | of @var{processes}. | 
|  | 103 |  | 
|  | 104 | @item long int ru_oublock | 
|  | 105 | The number of times the file system had to write to the disk on behalf | 
|  | 106 | of @var{processes}. | 
|  | 107 |  | 
|  | 108 | @item long int ru_msgsnd | 
|  | 109 | Number of IPC messages sent. | 
|  | 110 |  | 
|  | 111 | @item long int ru_msgrcv | 
|  | 112 | Number of IPC messages received. | 
|  | 113 |  | 
|  | 114 | @item long int ru_nsignals | 
|  | 115 | Number of signals received. | 
|  | 116 |  | 
|  | 117 | @item long int ru_nvcsw | 
|  | 118 | The number of times @var{processes} voluntarily invoked a context switch | 
|  | 119 | (usually to wait for some service). | 
|  | 120 |  | 
|  | 121 | @item long int ru_nivcsw | 
|  | 122 | The number of times an involuntary context switch took place (because | 
|  | 123 | a time slice expired, or another process of higher priority was | 
|  | 124 | scheduled). | 
|  | 125 | @end table | 
|  | 126 | @end deftp | 
|  | 127 |  | 
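As a rough sketch of how these declarations fit together, the following
fragment prints a few statistics for the calling process; the exact set
of members that contain meaningful values varies between systems:

@smallexample
#include <sys/resource.h>
#include <stdio.h>

void
print_own_usage (void)
@{
  struct rusage usage;
  if (getrusage (RUSAGE_SELF, &usage) == 0)
    @{
      printf ("user time:   %ld.%06ld seconds\n",
              (long) usage.ru_utime.tv_sec,
              (long) usage.ru_utime.tv_usec);
      printf ("system time: %ld.%06ld seconds\n",
              (long) usage.ru_stime.tv_sec,
              (long) usage.ru_stime.tv_usec);
      printf ("maximum RSS: %ld kilobytes\n", usage.ru_maxrss);
    @}
@}
@end smallexample
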
|  | 128 | @code{vtimes} is a historical function that does some of what | 
|  | 129 | @code{getrusage} does.  @code{getrusage} is a better choice. | 
|  | 130 |  | 
@code{vtimes} and its @code{struct vtimes} data structure are declared
in @file{sys/vtimes.h}.
|  | 133 | @pindex sys/vtimes.h | 
|  | 134 |  | 
|  | 135 | @comment sys/vtimes.h | 
|  | 136 | @deftypefun int vtimes (struct vtimes *@var{current}, struct vtimes *@var{child}) | 
|  | 137 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 138 | @c Calls getrusage twice. | 
|  | 139 |  | 
|  | 140 | @code{vtimes} reports resource usage totals for a process. | 
|  | 141 |  | 
|  | 142 | If @var{current} is non-null, @code{vtimes} stores resource usage totals for | 
|  | 143 | the invoking process alone in the structure to which it points.  If | 
|  | 144 | @var{child} is non-null, @code{vtimes} stores resource usage totals for all | 
|  | 145 | past children (which have terminated) of the invoking process in the structure | 
|  | 146 | to which it points. | 
|  | 147 |  | 
|  | 148 | @deftp {Data Type} {struct vtimes} | 
|  | 149 | This data type contains information about the resource usage of a process. | 
|  | 150 | Each member corresponds to a member of the @code{struct rusage} data type | 
|  | 151 | described above. | 
|  | 152 |  | 
@table @code
@item vm_utime
User CPU time.  Analogous to @code{ru_utime} in @code{struct rusage}.
@item vm_stime
System CPU time.  Analogous to @code{ru_stime} in @code{struct rusage}.
@item vm_idsrss
Data and stack memory.  The sum of the values that would be reported as
@code{ru_idrss} and @code{ru_isrss} in @code{struct rusage}.
@item vm_ixrss
Shared memory.  Analogous to @code{ru_ixrss} in @code{struct rusage}.
@item vm_maxrss
Maximum resident set size.  Analogous to @code{ru_maxrss} in
@code{struct rusage}.
@item vm_majflt
Major page faults.  Analogous to @code{ru_majflt} in @code{struct rusage}.
@item vm_minflt
Minor page faults.  Analogous to @code{ru_minflt} in @code{struct rusage}.
@item vm_nswap
Swap count.  Analogous to @code{ru_nswap} in @code{struct rusage}.
@item vm_inblk
Disk reads.  Analogous to @code{ru_inblock} in @code{struct rusage}.
@item vm_oublk
Disk writes.  Analogous to @code{ru_oublock} in @code{struct rusage}.
@end table
|  | 177 | @end deftp | 
|  | 178 |  | 
|  | 179 |  | 
The return value is zero if the function succeeds; @code{-1} otherwise.

@end deftypefun
|  | 188 |  | 
|  | 189 | @node Limits on Resources | 
|  | 190 | @section Limiting Resource Usage | 
|  | 191 | @cindex resource limits | 
|  | 192 | @cindex limits on resource usage | 
|  | 193 | @cindex usage limits | 
|  | 194 |  | 
|  | 195 | You can specify limits for the resource usage of a process.  When the | 
|  | 196 | process tries to exceed a limit, it may get a signal, or the system call | 
|  | 197 | by which it tried to do so may fail, depending on the resource.  Each | 
|  | 198 | process initially inherits its limit values from its parent, but it can | 
|  | 199 | subsequently change them. | 
|  | 200 |  | 
|  | 201 | There are two per-process limits associated with a resource: | 
|  | 202 | @cindex limit | 
|  | 203 |  | 
|  | 204 | @table @dfn | 
|  | 205 | @item current limit | 
The current limit is the value the system will not allow usage to
exceed.  It is also called the ``soft limit'' because the process being
limited can raise or lower it at will, as long as it stays at or below
the maximum limit.
|  | 209 | @cindex current limit | 
|  | 210 | @cindex soft limit | 
|  | 211 |  | 
|  | 212 | @item maximum limit | 
|  | 213 | The maximum limit is the maximum value to which a process is allowed to | 
|  | 214 | set its current limit.  It is also called the ``hard limit'' because | 
|  | 215 | there is no way for a process to get around it.  A process may lower | 
|  | 216 | its own maximum limit, but only the superuser may increase a maximum | 
|  | 217 | limit. | 
|  | 218 | @cindex maximum limit | 
|  | 219 | @cindex hard limit | 
|  | 220 | @end table | 
|  | 221 |  | 
|  | 222 | @pindex sys/resource.h | 
|  | 223 | The symbols for use with @code{getrlimit}, @code{setrlimit}, | 
|  | 224 | @code{getrlimit64}, and @code{setrlimit64} are defined in | 
|  | 225 | @file{sys/resource.h}. | 
|  | 226 |  | 
|  | 227 | @comment sys/resource.h | 
|  | 228 | @comment BSD | 
|  | 229 | @deftypefun int getrlimit (int @var{resource}, struct rlimit *@var{rlp}) | 
|  | 230 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 231 | @c Direct syscall on most systems. | 
|  | 232 | Read the current and maximum limits for the resource @var{resource} | 
|  | 233 | and store them in @code{*@var{rlp}}. | 
|  | 234 |  | 
|  | 235 | The return value is @code{0} on success and @code{-1} on failure.  The | 
|  | 236 | only possible @code{errno} error condition is @code{EFAULT}. | 
|  | 237 |  | 
|  | 238 | When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a | 
|  | 239 | 32-bit system this function is in fact @code{getrlimit64}.  Thus, the | 
|  | 240 | LFS interface transparently replaces the old interface. | 
|  | 241 | @end deftypefun | 
|  | 242 |  | 
|  | 243 | @comment sys/resource.h | 
|  | 244 | @comment Unix98 | 
|  | 245 | @deftypefun int getrlimit64 (int @var{resource}, struct rlimit64 *@var{rlp}) | 
|  | 246 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 247 | @c Direct syscall on most systems, wrapper to getrlimit otherwise. | 
|  | 248 | This function is similar to @code{getrlimit} but its second parameter is | 
|  | 249 | a pointer to a variable of type @code{struct rlimit64}, which allows it | 
to read values which wouldn't fit in the members of a @code{struct
|  | 251 | rlimit}. | 
|  | 252 |  | 
|  | 253 | If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a | 
|  | 254 | 32-bit machine, this function is available under the name | 
|  | 255 | @code{getrlimit} and so transparently replaces the old interface. | 
|  | 256 | @end deftypefun | 
|  | 257 |  | 
|  | 258 | @comment sys/resource.h | 
|  | 259 | @comment BSD | 
|  | 260 | @deftypefun int setrlimit (int @var{resource}, const struct rlimit *@var{rlp}) | 
|  | 261 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 262 | @c Direct syscall on most systems; lock-taking critical section on HURD. | 
Set the current and maximum limits for the resource @var{resource}
to the values supplied in @code{*@var{rlp}}.
|  | 265 |  | 
|  | 266 | The return value is @code{0} on success and @code{-1} on failure.  The | 
|  | 267 | following @code{errno} error condition is possible: | 
|  | 268 |  | 
|  | 269 | @table @code | 
|  | 270 | @item EPERM | 
|  | 271 | @itemize @bullet | 
|  | 272 | @item | 
|  | 273 | The process tried to raise a current limit beyond the maximum limit. | 
|  | 274 |  | 
|  | 275 | @item | 
The process tried to raise a maximum limit, but is not the superuser.
|  | 277 | @end itemize | 
|  | 278 | @end table | 
|  | 279 |  | 
|  | 280 | When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a | 
|  | 281 | 32-bit system this function is in fact @code{setrlimit64}.  Thus, the | 
|  | 282 | LFS interface transparently replaces the old interface. | 
|  | 283 | @end deftypefun | 
|  | 284 |  | 
|  | 285 | @comment sys/resource.h | 
|  | 286 | @comment Unix98 | 
|  | 287 | @deftypefun int setrlimit64 (int @var{resource}, const struct rlimit64 *@var{rlp}) | 
|  | 288 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 289 | @c Wrapper for setrlimit or direct syscall. | 
|  | 290 | This function is similar to @code{setrlimit} but its second parameter is | 
|  | 291 | a pointer to a variable of type @code{struct rlimit64} which allows it | 
to set values which wouldn't fit in the members of a @code{struct
|  | 293 | rlimit}. | 
|  | 294 |  | 
|  | 295 | If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a | 
|  | 296 | 32-bit machine this function is available under the name | 
|  | 297 | @code{setrlimit} and so transparently replaces the old interface. | 
|  | 298 | @end deftypefun | 
|  | 299 |  | 
|  | 300 | @comment sys/resource.h | 
|  | 301 | @comment BSD | 
|  | 302 | @deftp {Data Type} {struct rlimit} | 
|  | 303 | This structure is used with @code{getrlimit} to receive limit values, | 
|  | 304 | and with @code{setrlimit} to specify limit values for a particular process | 
|  | 305 | and resource.  It has two fields: | 
|  | 306 |  | 
|  | 307 | @table @code | 
|  | 308 | @item rlim_t rlim_cur | 
|  | 309 | The current limit | 
|  | 310 |  | 
|  | 311 | @item rlim_t rlim_max | 
|  | 312 | The maximum limit. | 
|  | 313 | @end table | 
|  | 314 |  | 
|  | 315 | For @code{getrlimit}, the structure is an output; it receives the current | 
|  | 316 | values.  For @code{setrlimit}, it specifies the new values. | 
|  | 317 | @end deftp | 
|  | 318 |  | 
|  | 319 | For the LFS functions a similar type is defined in @file{sys/resource.h}. | 
|  | 320 |  | 
|  | 321 | @comment sys/resource.h | 
|  | 322 | @comment Unix98 | 
|  | 323 | @deftp {Data Type} {struct rlimit64} | 
|  | 324 | This structure is analogous to the @code{rlimit} structure above, but | 
|  | 325 | its components have wider ranges.  It has two fields: | 
|  | 326 |  | 
|  | 327 | @table @code | 
|  | 328 | @item rlim64_t rlim_cur | 
|  | 329 | This is analogous to @code{rlimit.rlim_cur}, but with a different type. | 
|  | 330 |  | 
|  | 331 | @item rlim64_t rlim_max | 
|  | 332 | This is analogous to @code{rlimit.rlim_max}, but with a different type. | 
|  | 333 | @end table | 
|  | 334 |  | 
|  | 335 | @end deftp | 
|  | 336 |  | 
|  | 337 | Here is a list of resources for which you can specify a limit.  Memory | 
|  | 338 | and file sizes are measured in bytes. | 
|  | 339 |  | 
|  | 340 | @table @code | 
|  | 341 | @comment sys/resource.h | 
|  | 342 | @comment BSD | 
|  | 343 | @item RLIMIT_CPU | 
|  | 344 | @vindex RLIMIT_CPU | 
|  | 345 | The maximum amount of CPU time the process can use.  If it runs for | 
|  | 346 | longer than this, it gets a signal: @code{SIGXCPU}.  The value is | 
|  | 347 | measured in seconds.  @xref{Operation Error Signals}. | 
|  | 348 |  | 
|  | 349 | @comment sys/resource.h | 
|  | 350 | @comment BSD | 
|  | 351 | @item RLIMIT_FSIZE | 
|  | 352 | @vindex RLIMIT_FSIZE | 
|  | 353 | The maximum size of file the process can create.  Trying to write a | 
|  | 354 | larger file causes a signal: @code{SIGXFSZ}.  @xref{Operation Error | 
|  | 355 | Signals}. | 
|  | 356 |  | 
|  | 357 | @comment sys/resource.h | 
|  | 358 | @comment BSD | 
|  | 359 | @item RLIMIT_DATA | 
|  | 360 | @vindex RLIMIT_DATA | 
|  | 361 | The maximum size of data memory for the process.  If the process tries | 
|  | 362 | to allocate data memory beyond this amount, the allocation function | 
|  | 363 | fails. | 
|  | 364 |  | 
|  | 365 | @comment sys/resource.h | 
|  | 366 | @comment BSD | 
|  | 367 | @item RLIMIT_STACK | 
|  | 368 | @vindex RLIMIT_STACK | 
|  | 369 | The maximum stack size for the process.  If the process tries to extend | 
|  | 370 | its stack past this size, it gets a @code{SIGSEGV} signal. | 
|  | 371 | @xref{Program Error Signals}. | 
|  | 372 |  | 
|  | 373 | @comment sys/resource.h | 
|  | 374 | @comment BSD | 
|  | 375 | @item RLIMIT_CORE | 
|  | 376 | @vindex RLIMIT_CORE | 
|  | 377 | The maximum size core file that this process can create.  If the process | 
|  | 378 | terminates and would dump a core file larger than this, then no core | 
|  | 379 | file is created.  So setting this limit to zero prevents core files from | 
|  | 380 | ever being created. | 
|  | 381 |  | 
|  | 382 | @comment sys/resource.h | 
|  | 383 | @comment BSD | 
|  | 384 | @item RLIMIT_RSS | 
|  | 385 | @vindex RLIMIT_RSS | 
|  | 386 | The maximum amount of physical memory that this process should get. | 
|  | 387 | This parameter is a guide for the system's scheduler and memory | 
|  | 388 | allocator; the system may give the process more memory when there is a | 
|  | 389 | surplus. | 
|  | 390 |  | 
|  | 391 | @comment sys/resource.h | 
|  | 392 | @comment BSD | 
|  | 393 | @item RLIMIT_MEMLOCK | 
|  | 394 | The maximum amount of memory that can be locked into physical memory (so | 
|  | 395 | it will never be paged out). | 
|  | 396 |  | 
|  | 397 | @comment sys/resource.h | 
|  | 398 | @comment BSD | 
|  | 399 | @item RLIMIT_NPROC | 
|  | 400 | The maximum number of processes that can be created with the same user ID. | 
|  | 401 | If you have reached the limit for your user ID, @code{fork} will fail | 
|  | 402 | with @code{EAGAIN}.  @xref{Creating a Process}. | 
|  | 403 |  | 
|  | 404 | @comment sys/resource.h | 
|  | 405 | @comment BSD | 
|  | 406 | @item RLIMIT_NOFILE | 
|  | 407 | @vindex RLIMIT_NOFILE | 
|  | 408 | @itemx RLIMIT_OFILE | 
|  | 409 | @vindex RLIMIT_OFILE | 
|  | 410 | The maximum number of files that the process can open.  If it tries to | 
|  | 411 | open more files than this, its open attempt fails with @code{errno} | 
|  | 412 | @code{EMFILE}.  @xref{Error Codes}.  Not all systems support this limit; | 
|  | 413 | GNU does, and 4.4 BSD does. | 
|  | 414 |  | 
|  | 415 | @comment sys/resource.h | 
|  | 416 | @comment Unix98 | 
|  | 417 | @item RLIMIT_AS | 
|  | 418 | @vindex RLIMIT_AS | 
|  | 419 | The maximum size of total memory that this process should get.  If the | 
|  | 420 | process tries to allocate more memory beyond this amount with, for | 
|  | 421 | example, @code{brk}, @code{malloc}, @code{mmap} or @code{sbrk}, the | 
|  | 422 | allocation function fails. | 
|  | 423 |  | 
|  | 424 | @comment sys/resource.h | 
|  | 425 | @comment BSD | 
|  | 426 | @item RLIM_NLIMITS | 
|  | 427 | @vindex RLIM_NLIMITS | 
|  | 428 | The number of different resource limits.  Any valid @var{resource} | 
|  | 429 | operand must be less than @code{RLIM_NLIMITS}. | 
|  | 430 | @end table | 
|  | 431 |  | 
|  | 432 | @comment sys/resource.h | 
|  | 433 | @comment BSD | 
|  | 434 | @deftypevr Constant rlim_t RLIM_INFINITY | 
|  | 435 | This constant stands for a value of ``infinity'' when supplied as | 
|  | 436 | the limit value in @code{setrlimit}. | 
|  | 437 | @end deftypevr | 
|  | 438 |  | 
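As a concrete sketch of how @code{getrlimit} and @code{setrlimit} work
together, the following fragment raises the current limit on open files
to the maximum limit and then disables core files; error handling is
reduced to returning @code{-1}:

@smallexample
#include <sys/resource.h>

int
adjust_limits (void)
@{
  struct rlimit rl;

  if (getrlimit (RLIMIT_NOFILE, &rl) != 0)
    return -1;
  rl.rlim_cur = rl.rlim_max;    /* raise soft limit to hard limit */
  if (setrlimit (RLIMIT_NOFILE, &rl) != 0)
    return -1;

  /* A zero limit prevents core files from being created.  */
  rl.rlim_cur = 0;
  rl.rlim_max = 0;
  return setrlimit (RLIMIT_CORE, &rl);
@}
@end smallexample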
|  | 439 |  | 
|  | 440 | The following are historical functions to do some of what the functions | 
|  | 441 | above do.  The functions above are better choices. | 
|  | 442 |  | 
|  | 443 | @code{ulimit} and the command symbols are declared in @file{ulimit.h}. | 
|  | 444 | @pindex ulimit.h | 
|  | 445 |  | 
|  | 446 | @comment ulimit.h | 
|  | 447 | @comment BSD | 
|  | 448 | @deftypefun {long int} ulimit (int @var{cmd}, @dots{}) | 
|  | 449 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 450 | @c Wrapper for getrlimit, setrlimit or | 
|  | 451 | @c sysconf(_SC_OPEN_MAX)->getdtablesize->getrlimit. | 
|  | 452 |  | 
|  | 453 | @code{ulimit} gets the current limit or sets the current and maximum | 
|  | 454 | limit for a particular resource for the calling process according to the | 
command @var{cmd}.
|  | 456 |  | 
|  | 457 | If you are getting a limit, the command argument is the only argument. | 
|  | 458 | If you are setting a limit, there is a second argument: | 
|  | 459 | @code{long int} @var{limit} which is the value to which you are setting | 
|  | 460 | the limit. | 
|  | 461 |  | 
|  | 462 | The @var{cmd} values and the operations they specify are: | 
|  | 463 | @table @code | 
|  | 464 |  | 
@item UL_GETFSIZE
|  | 466 | Get the current limit on the size of a file, in units of 512 bytes. | 
|  | 467 |  | 
@item UL_SETFSIZE
|  | 469 | Set the current and maximum limit on the size of a file to @var{limit} * | 
|  | 470 | 512 bytes. | 
|  | 471 |  | 
|  | 472 | @end table | 
|  | 473 |  | 
|  | 474 | There are also some other @var{cmd} values that may do things on some | 
|  | 475 | systems, but they are not supported. | 
|  | 476 |  | 
|  | 477 | Only the superuser may increase a maximum limit. | 
|  | 478 |  | 
|  | 479 | When you successfully get a limit, the return value of @code{ulimit} is | 
|  | 480 | that limit, which is never negative.  When you successfully set a limit, | 
|  | 481 | the return value is zero.  When the function fails, the return value is | 
|  | 482 | @code{-1} and @code{errno} is set according to the reason: | 
|  | 483 |  | 
|  | 484 | @table @code | 
|  | 485 | @item EPERM | 
A process tried to increase a maximum limit, but is not the superuser.
|  | 487 | @end table | 
|  | 488 |  | 
|  | 489 |  | 
|  | 490 | @end deftypefun | 
|  | 491 |  | 
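As a small sketch using the @code{UL_GETFSIZE} command described above,
the current file size limit can be read and printed like this:

@smallexample
#include <ulimit.h>
#include <stdio.h>

void
show_file_size_limit (void)
@{
  long int blocks = ulimit (UL_GETFSIZE);
  if (blocks >= 0)
    printf ("file size limit: %ld units of 512 bytes\n", blocks);
@}
@end smallexample
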
|  | 492 | @code{vlimit} and its resource symbols are declared in @file{sys/vlimit.h}. | 
|  | 493 | @pindex sys/vlimit.h | 
|  | 494 |  | 
|  | 495 | @comment sys/vlimit.h | 
|  | 496 | @comment BSD | 
|  | 497 | @deftypefun int vlimit (int @var{resource}, int @var{limit}) | 
|  | 498 | @safety{@prelim{}@mtunsafe{@mtasurace{:setrlimit}}@asunsafe{}@acsafe{}} | 
|  | 499 | @c It calls getrlimit and modifies the rlim_cur field before calling | 
|  | 500 | @c setrlimit.  There's a window for a concurrent call to setrlimit that | 
|  | 501 | @c modifies e.g. rlim_max, which will be lost if running as super-user. | 
|  | 502 |  | 
|  | 503 | @code{vlimit} sets the current limit for a resource for a process. | 
|  | 504 |  | 
|  | 505 | @var{resource} identifies the resource: | 
|  | 506 |  | 
|  | 507 | @table @code | 
|  | 508 | @item LIM_CPU | 
|  | 509 | Maximum CPU time.  Same as @code{RLIMIT_CPU} for @code{setrlimit}. | 
|  | 510 | @item LIM_FSIZE | 
|  | 511 | Maximum file size.  Same as @code{RLIMIT_FSIZE} for @code{setrlimit}. | 
|  | 512 | @item LIM_DATA | 
|  | 513 | Maximum data memory.  Same as @code{RLIMIT_DATA} for @code{setrlimit}. | 
|  | 514 | @item LIM_STACK | 
|  | 515 | Maximum stack size.  Same as @code{RLIMIT_STACK} for @code{setrlimit}. | 
|  | 516 | @item LIM_CORE | 
Maximum core file size.  Same as @code{RLIMIT_CORE} for @code{setrlimit}.
|  | 518 | @item LIM_MAXRSS | 
|  | 519 | Maximum physical memory.  Same as @code{RLIMIT_RSS} for @code{setrlimit}. | 
|  | 520 | @end table | 
|  | 521 |  | 
|  | 522 | The return value is zero for success, and @code{-1} with @code{errno} set | 
|  | 523 | accordingly for failure: | 
|  | 524 |  | 
|  | 525 | @table @code | 
|  | 526 | @item EPERM | 
|  | 527 | The process tried to set its current limit beyond its maximum limit. | 
|  | 528 | @end table | 
|  | 529 |  | 
|  | 530 | @end deftypefun | 
|  | 531 |  | 
|  | 532 | @node Priority | 
|  | 533 | @section Process CPU Priority And Scheduling | 
|  | 534 | @cindex process priority | 
|  | 535 | @cindex cpu priority | 
|  | 536 | @cindex priority of a process | 
|  | 537 |  | 
|  | 538 | When multiple processes simultaneously require CPU time, the system's | 
|  | 539 | scheduling policy and process CPU priorities determine which processes | 
|  | 540 | get it.  This section describes how that determination is made and | 
|  | 541 | @glibcadj{} functions to control it. | 
|  | 542 |  | 
|  | 543 | It is common to refer to CPU scheduling simply as scheduling and a | 
|  | 544 | process' CPU priority simply as the process' priority, with the CPU | 
|  | 545 | resource being implied.  Bear in mind, though, that CPU time is not the | 
|  | 546 | only resource a process uses or that processes contend for.  In some | 
|  | 547 | cases, it is not even particularly important.  Giving a process a high | 
|  | 548 | ``priority'' may have very little effect on how fast a process runs with | 
|  | 549 | respect to other processes.  The priorities discussed in this section | 
|  | 550 | apply only to CPU time. | 
|  | 551 |  | 
|  | 552 | CPU scheduling is a complex issue and different systems do it in wildly | 
|  | 553 | different ways.  New ideas continually develop and find their way into | 
|  | 554 | the intricacies of the various systems' scheduling algorithms.  This | 
|  | 555 | section discusses the general concepts, some specifics of systems | 
|  | 556 | that commonly use @theglibc{}, and some standards. | 
|  | 557 |  | 
|  | 558 | For simplicity, we talk about CPU contention as if there is only one CPU | 
in the system.  But all the same principles apply when a system has
|  | 560 | multiple CPUs, and knowing that the number of processes that can run at | 
|  | 561 | any one time is equal to the number of CPUs, you can easily extrapolate | 
|  | 562 | the information. | 
|  | 563 |  | 
|  | 564 | The functions described in this section are all defined by the POSIX.1 | 
|  | 565 | and POSIX.1b standards (the @code{sched@dots{}} functions are POSIX.1b). | 
|  | 566 | However, POSIX does not define any semantics for the values that these | 
|  | 567 | functions get and set.  In this chapter, the semantics are based on the | 
|  | 568 | Linux kernel's implementation of the POSIX standard.  As you will see, | 
|  | 569 | the Linux implementation is quite the inverse of what the authors of the | 
|  | 570 | POSIX syntax had in mind. | 
|  | 571 |  | 
|  | 572 | @menu | 
|  | 573 | * Absolute Priority::               The first tier of priority.  Posix | 
|  | 574 | * Realtime Scheduling::             Scheduling among the process nobility | 
|  | 575 | * Basic Scheduling Functions::      Get/set scheduling policy, priority | 
|  | 576 | * Traditional Scheduling::          Scheduling among the vulgar masses | 
|  | 577 | * CPU Affinity::                    Limiting execution to certain CPUs | 
|  | 578 | @end menu | 
|  | 579 |  | 
|  | 580 |  | 
|  | 581 |  | 
|  | 582 | @node Absolute Priority | 
|  | 583 | @subsection Absolute Priority | 
|  | 584 | @cindex absolute priority | 
|  | 585 | @cindex priority, absolute | 
|  | 586 |  | 
|  | 587 | Every process has an absolute priority, and it is represented by a number. | 
|  | 588 | The higher the number, the higher the absolute priority. | 
|  | 589 |  | 
|  | 590 | @cindex realtime CPU scheduling | 
|  | 591 | On systems of the past, and most systems today, all processes have | 
|  | 592 | absolute priority 0 and this section is irrelevant.  In that case, | 
|  | 593 | @xref{Traditional Scheduling}.  Absolute priorities were invented to | 
|  | 594 | accommodate realtime systems, in which it is vital that certain processes | 
|  | 595 | be able to respond to external events happening in real time, which | 
|  | 596 | means they cannot wait around while some other process that @emph{wants | 
|  | 597 | to}, but doesn't @emph{need to} run occupies the CPU. | 
|  | 598 |  | 
|  | 599 | @cindex ready to run | 
|  | 600 | @cindex preemptive scheduling | 
|  | 601 | When two processes are in contention to use the CPU at any instant, the | 
|  | 602 | one with the higher absolute priority always gets it.  This is true even if the | 
|  | 603 | process with the lower priority is already using the CPU (i.e., the | 
|  | 604 | scheduling is preemptive).  Of course, we're only talking about | 
|  | 605 | processes that are running or ``ready to run,'' which means they are | 
|  | 606 | ready to execute instructions right now.  When a process blocks to wait | 
|  | 607 | for something like I/O, its absolute priority is irrelevant. | 
|  | 608 |  | 
|  | 609 | @cindex runnable process | 
|  | 610 | @strong{NB:}  The term ``runnable'' is a synonym for ``ready to run.'' | 
|  | 611 |  | 
|  | 612 | When two processes are running or ready to run and both have the same | 
|  | 613 | absolute priority, it's more interesting.  In that case, who gets the | 
|  | 614 | CPU is determined by the scheduling policy.  If the processes have | 
|  | 615 | absolute priority 0, the traditional scheduling policy described in | 
|  | 616 | @ref{Traditional Scheduling} applies.  Otherwise, the policies described | 
|  | 617 | in @ref{Realtime Scheduling} apply. | 
|  | 618 |  | 
|  | 619 | You normally give an absolute priority above 0 only to a process that | 
|  | 620 | can be trusted not to hog the CPU.  Such processes are designed to block | 
|  | 621 | (or terminate) after relatively short CPU runs. | 
|  | 622 |  | 
|  | 623 | A process begins life with the same absolute priority as its parent | 
|  | 624 | process.  Functions described in @ref{Basic Scheduling Functions} can | 
|  | 625 | change it. | 
|  | 626 |  | 
|  | 627 | Only a privileged process can change a process' absolute priority to | 
|  | 628 | something other than @code{0}.  Only a privileged process or the | 
|  | 629 | target process' owner can change its absolute priority at all. | 
|  | 630 |  | 
|  | 631 | POSIX requires absolute priority values used with the realtime | 
|  | 632 | scheduling policies to be consecutive with a range of at least 32.  On | 
|  | 633 | Linux, they are 1 through 99.  The functions | 
@code{sched_get_priority_max} and @code{sched_get_priority_min} portably
|  | 635 | tell you what the range is on a particular system. | 
|  | 636 |  | 
|  | 637 |  | 
|  | 638 | @subsubsection Using Absolute Priority | 
|  | 639 |  | 
One thing you must keep in mind when designing realtime applications is
|  | 641 | that having higher absolute priority than any other process doesn't | 
|  | 642 | guarantee the process can run continuously.  Two things that can wreck a | 
|  | 643 | good CPU run are interrupts and page faults. | 
|  | 644 |  | 
|  | 645 | Interrupt handlers live in that limbo between processes.  The CPU is | 
|  | 646 | executing instructions, but they aren't part of any process.  An | 
|  | 647 | interrupt will stop even the highest priority process.  So you must | 
|  | 648 | allow for slight delays and make sure that no device in the system has | 
|  | 649 | an interrupt handler that could cause too long a delay between | 
|  | 650 | instructions for your process. | 
|  | 651 |  | 
|  | 652 | Similarly, a page fault causes what looks like a straightforward | 
|  | 653 | sequence of instructions to take a long time.  The fact that other | 
|  | 654 | processes get to run while the page faults in is of no consequence, | 
|  | 655 | because as soon as the I/O is complete, the high priority process will | 
|  | 656 | kick them out and run again, but the wait for the I/O itself could be a | 
|  | 657 | problem.  To neutralize this threat, use @code{mlock} or | 
|  | 658 | @code{mlockall}. | 
|  | 659 |  | 
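For example, a realtime process might lock all of its present and future
pages into physical memory before entering its time-critical phase.
This is only a sketch; @code{mlockall} and the @code{MCL_CURRENT} and
@code{MCL_FUTURE} flags are declared in @file{sys/mman.h}, and
@code{handle_lock_failure} stands for whatever recovery is appropriate
in your program:

@smallexample
#include <sys/mman.h>

/* Keep every current and future page of this process resident,
   so it cannot take page faults.  */
if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0)
  handle_lock_failure ();
@end smallexample
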
|  | 660 | There are a few ramifications of the absoluteness of this priority on a | 
|  | 661 | single-CPU system that you need to keep in mind when you choose to set a | 
|  | 662 | priority and also when you're working on a program that runs with high | 
|  | 663 | absolute priority.  Consider a process that has higher absolute priority | 
|  | 664 | than any other process in the system and due to a bug in its program, it | 
|  | 665 | gets into an infinite loop.  It will never cede the CPU.  You can't run | 
|  | 666 | a command to kill it because your command would need to get the CPU in | 
|  | 667 | order to run.  The errant program is in complete control.  It controls | 
|  | 668 | the vertical, it controls the horizontal. | 
|  | 669 |  | 
|  | 670 | There are two ways to avoid this: 1) keep a shell running somewhere with | 
|  | 671 | a higher absolute priority.  2) keep a controlling terminal attached to | 
|  | 672 | the high priority process group.  All the priority in the world won't | 
|  | 673 | stop an interrupt handler from running and delivering a signal to the | 
|  | 674 | process if you hit Control-C. | 
|  | 675 |  | 
|  | 676 | Some systems use absolute priority as a means of allocating a fixed | 
|  | 677 | percentage of CPU time to a process.  To do this, a super high priority | 
|  | 678 | privileged process constantly monitors the process' CPU usage and raises | 
|  | 679 | its absolute priority when the process isn't getting its entitled share | 
|  | 680 | and lowers it when the process is exceeding it. | 
|  | 681 |  | 
|  | 682 | @strong{NB:}  The absolute priority is sometimes called the ``static | 
|  | 683 | priority.''  We don't use that term in this manual because it misses the | 
|  | 684 | most important feature of the absolute priority:  its absoluteness. | 
|  | 685 |  | 
|  | 686 |  | 
|  | 687 | @node Realtime Scheduling | 
|  | 688 | @subsection Realtime Scheduling | 
|  | 689 | @cindex realtime scheduling | 
|  | 690 |  | 
|  | 691 | Whenever two processes with the same absolute priority are ready to run, | 
|  | 692 | the kernel has a decision to make, because only one can run at a time. | 
|  | 693 | If the processes have absolute priority 0, the kernel makes this decision | 
|  | 694 | as described in @ref{Traditional Scheduling}.  Otherwise, the decision | 
|  | 695 | is as described in this section. | 
|  | 696 |  | 
|  | 697 | If two processes are ready to run but have different absolute priorities, | 
|  | 698 | the decision is much simpler, and is described in @ref{Absolute | 
|  | 699 | Priority}. | 
|  | 700 |  | 
|  | 701 | Each process has a scheduling policy.  For processes with absolute | 
|  | 702 | priority other than zero, there are two available: | 
|  | 703 |  | 
|  | 704 | @enumerate | 
|  | 705 | @item | 
|  | 706 | First Come First Served | 
|  | 707 | @item | 
|  | 708 | Round Robin | 
|  | 709 | @end enumerate | 
|  | 710 |  | 
|  | 711 | The most sensible case is where all the processes with a certain | 
|  | 712 | absolute priority have the same scheduling policy.  We'll discuss that | 
|  | 713 | first. | 
|  | 714 |  | 
|  | 715 | In Round Robin, processes share the CPU, each one running for a small | 
|  | 716 | quantum of time (``time slice'') and then yielding to another in a | 
|  | 717 | circular fashion.  Of course, only processes that are ready to run and | 
|  | 718 | have the same absolute priority are in this circle. | 
|  | 719 |  | 
|  | 720 | In First Come First Served, the process that has been waiting the | 
|  | 721 | longest to run gets the CPU, and it keeps it until it voluntarily | 
|  | 722 | relinquishes the CPU, runs out of things to do (blocks), or gets | 
|  | 723 | preempted by a higher priority process. | 
|  | 724 |  | 
|  | 725 | First Come First Served, along with maximal absolute priority and | 
|  | 726 | careful control of interrupts and page faults, is the one to use when a | 
|  | 727 | process absolutely, positively has to run at full CPU speed or not at | 
|  | 728 | all. | 
|  | 729 |  | 
|  | 730 | Judicious use of @code{sched_yield} function invocations by processes | 
|  | 731 | with First Come First Served scheduling policy forms a good compromise | 
|  | 732 | between Round Robin and First Come First Served. | 
|  | 733 |  | 
|  | 734 | To understand how scheduling works when processes of different scheduling | 
|  | 735 | policies occupy the same absolute priority, you have to know the nitty | 
|  | 736 | gritty details of how processes enter and exit the ready to run list: | 
|  | 737 |  | 
|  | 738 | In both cases, the ready to run list is organized as a true queue, where | 
|  | 739 | a process gets pushed onto the tail when it becomes ready to run and is | 
|  | 740 | popped off the head when the scheduler decides to run it.  Note that | 
|  | 741 | ready to run and running are two mutually exclusive states.  When the | 
|  | 742 | scheduler runs a process, that process is no longer ready to run and no | 
|  | 743 | longer in the ready to run list.  When the process stops running, it | 
|  | 744 | may go back to being ready to run again. | 
|  | 745 |  | 
|  | 746 | The only difference between a process that is assigned the Round Robin | 
scheduling policy and a process that is assigned First Come First Served
|  | 748 | is that in the former case, the process is automatically booted off the | 
|  | 749 | CPU after a certain amount of time.  When that happens, the process goes | 
|  | 750 | back to being ready to run, which means it enters the queue at the tail. | 
|  | 751 | The time quantum we're talking about is small.  Really small.  This is | 
|  | 752 | not your father's timesharing.  For example, with the Linux kernel, the | 
|  | 753 | round robin time slice is a thousand times shorter than its typical | 
|  | 754 | time slice for traditional scheduling. | 
|  | 755 |  | 
|  | 756 | A process begins life with the same scheduling policy as its parent process. | 
|  | 757 | Functions described in @ref{Basic Scheduling Functions} can change it. | 
|  | 758 |  | 
|  | 759 | Only a privileged process can set the scheduling policy of a process | 
|  | 760 | that has absolute priority higher than 0. | 
|  | 761 |  | 
|  | 762 | @node Basic Scheduling Functions | 
|  | 763 | @subsection Basic Scheduling Functions | 
|  | 764 |  | 
|  | 765 | This section describes functions in @theglibc{} for setting the | 
|  | 766 | absolute priority and scheduling policy of a process. | 
|  | 767 |  | 
|  | 768 | @strong{Portability Note:}  On systems that have the functions in this | 
section, the macro @code{_POSIX_PRIORITY_SCHEDULING} is defined in
@file{unistd.h}.
|  | 771 |  | 
|  | 772 | For the case that the scheduling policy is traditional scheduling, more | 
|  | 773 | functions to fine tune the scheduling are in @ref{Traditional Scheduling}. | 
|  | 774 |  | 
|  | 775 | Don't try to make too much out of the naming and structure of these | 
|  | 776 | functions.  They don't match the concepts described in this manual | 
|  | 777 | because the functions are as defined by POSIX.1b, but the implementation | 
|  | 778 | on systems that use @theglibc{} is the inverse of what the POSIX | 
|  | 779 | structure contemplates.  The POSIX scheme assumes that the primary | 
|  | 780 | scheduling parameter is the scheduling policy and that the priority | 
|  | 781 | value, if any, is a parameter of the scheduling policy.  In the | 
|  | 782 | implementation, though, the priority value is king and the scheduling | 
|  | 783 | policy, if anything, only fine tunes the effect of that priority. | 
|  | 784 |  | 
|  | 785 | The symbols in this section are declared by including file @file{sched.h}. | 
|  | 786 |  | 
|  | 787 | @comment sched.h | 
|  | 788 | @comment POSIX | 
|  | 789 | @deftp {Data Type} {struct sched_param} | 
|  | 790 | This structure describes an absolute priority. | 
|  | 791 | @table @code | 
|  | 792 | @item int sched_priority | 
|  | 793 | absolute priority value | 
|  | 794 | @end table | 
|  | 795 | @end deftp | 
|  | 796 |  | 
|  | 797 | @comment sched.h | 
|  | 798 | @comment POSIX | 
|  | 799 | @deftypefun int sched_setscheduler (pid_t @var{pid}, int @var{policy}, const struct sched_param *@var{param}) | 
|  | 800 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 801 | @c Direct syscall, Linux only. | 
|  | 802 |  | 
|  | 803 | This function sets both the absolute priority and the scheduling policy | 
|  | 804 | for a process. | 
|  | 805 |  | 
|  | 806 | It assigns the absolute priority value given by @var{param} and the | 
|  | 807 | scheduling policy @var{policy} to the process with Process ID @var{pid}, | 
|  | 808 | or the calling process if @var{pid} is zero.  If @var{policy} is | 
|  | 809 | negative, @code{sched_setscheduler} keeps the existing scheduling policy. | 
|  | 810 |  | 
|  | 811 | The following macros represent the valid values for @var{policy}: | 
|  | 812 |  | 
|  | 813 | @table @code | 
|  | 814 | @item SCHED_OTHER | 
|  | 815 | Traditional Scheduling | 
|  | 816 | @item SCHED_FIFO | 
|  | 817 | First In First Out | 
|  | 818 | @item SCHED_RR | 
|  | 819 | Round Robin | 
|  | 820 | @end table | 
|  | 821 |  | 
|  | 822 | @c The Linux kernel code (in sched.c) actually reschedules the process, | 
|  | 823 | @c but it puts it at the head of the run queue, so I'm not sure just what | 
|  | 824 | @c the effect is, but it must be subtle. | 
|  | 825 |  | 
On success, the return value is @code{0}.  Otherwise, it is @code{-1}
and @code{errno} is set accordingly.  The @code{errno} values specific
to this function are:
|  | 829 |  | 
|  | 830 | @table @code | 
|  | 831 | @item EPERM | 
|  | 832 | @itemize @bullet | 
|  | 833 | @item | 
|  | 834 | The calling process does not have @code{CAP_SYS_NICE} permission and | 
@var{policy} is not @code{SCHED_OTHER} (or it is negative and the
existing policy is not @code{SCHED_OTHER}).
|  | 837 |  | 
|  | 838 | @item | 
|  | 839 | The calling process does not have @code{CAP_SYS_NICE} permission and its | 
|  | 840 | owner is not the target process' owner.  I.e., the effective uid of the | 
|  | 841 | calling process is neither the effective nor the real uid of process | 
|  | 842 | @var{pid}. | 
|  | 843 | @c We need a cross reference to the capabilities section, when written. | 
|  | 844 | @end itemize | 
|  | 845 |  | 
|  | 846 | @item ESRCH | 
|  | 847 | There is no process with pid @var{pid} and @var{pid} is not zero. | 
|  | 848 |  | 
|  | 849 | @item EINVAL | 
|  | 850 | @itemize @bullet | 
|  | 851 | @item | 
|  | 852 | @var{policy} does not identify an existing scheduling policy. | 
|  | 853 |  | 
|  | 854 | @item | 
|  | 855 | The absolute priority value identified by *@var{param} is outside the | 
|  | 856 | valid range for the scheduling policy @var{policy} (or the existing | 
|  | 857 | scheduling policy if @var{policy} is negative) or @var{param} is | 
|  | 858 | null.  @code{sched_get_priority_max} and @code{sched_get_priority_min} | 
|  | 859 | tell you what the valid range is. | 
|  | 860 |  | 
|  | 861 | @item | 
|  | 862 | @var{pid} is negative. | 
|  | 863 | @end itemize | 
|  | 864 | @end table | 
|  | 865 |  | 
|  | 866 | @end deftypefun | 
|  | 867 |  | 
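Here is a sketch of how a sufficiently privileged process might switch
itself to Round Robin scheduling at the lowest realtime priority, using
only the functions described in this section; error checking is
abbreviated:

@smallexample
#include <sched.h>

int
go_realtime (void)
@{
  struct sched_param param;

  param.sched_priority = sched_get_priority_min (SCHED_RR);
  if (param.sched_priority == -1)
    return -1;

  /* A pid of zero means the calling process.  */
  return sched_setscheduler (0, SCHED_RR, &param);
@}
@end smallexample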
|  | 868 |  | 
|  | 869 | @comment sched.h | 
|  | 870 | @comment POSIX | 
|  | 871 | @deftypefun int sched_getscheduler (pid_t @var{pid}) | 
|  | 872 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 873 | @c Direct syscall, Linux only. | 
|  | 874 |  | 
|  | 875 | This function returns the scheduling policy assigned to the process with | 
|  | 876 | Process ID (pid) @var{pid}, or the calling process if @var{pid} is zero. | 
|  | 877 |  | 
|  | 878 | The return value is the scheduling policy.  See | 
|  | 879 | @code{sched_setscheduler} for the possible values. | 
|  | 880 |  | 
|  | 881 | If the function fails, the return value is instead @code{-1} and | 
|  | 882 | @code{errno} is set accordingly. | 
|  | 883 |  | 
|  | 884 | The @code{errno} values specific to this function are: | 
|  | 885 |  | 
|  | 886 | @table @code | 
|  | 887 |  | 
|  | 888 | @item ESRCH | 
|  | 889 | There is no process with pid @var{pid} and it is not zero. | 
|  | 890 |  | 
|  | 891 | @item EINVAL | 
|  | 892 | @var{pid} is negative. | 
|  | 893 |  | 
|  | 894 | @end table | 
|  | 895 |  | 
|  | 896 | Note that this function is not an exact mate to @code{sched_setscheduler} | 
|  | 897 | because while that function sets the scheduling policy and the absolute | 
|  | 898 | priority, this function gets only the scheduling policy.  To get the | 
|  | 899 | absolute priority, use @code{sched_getparam}. | 
|  | 900 |  | 
|  | 901 | @end deftypefun | 
|  | 902 |  | 
|  | 903 |  | 
|  | 904 | @comment sched.h | 
|  | 905 | @comment POSIX | 
|  | 906 | @deftypefun int sched_setparam (pid_t @var{pid}, const struct sched_param *@var{param}) | 
|  | 907 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 908 | @c Direct syscall, Linux only. | 
|  | 909 |  | 
|  | 910 | This function sets a process' absolute priority. | 
|  | 911 |  | 
|  | 912 | It is functionally identical to @code{sched_setscheduler} with | 
|  | 913 | @var{policy} = @code{-1}. | 
|  | 914 |  | 
|  | 915 | @c in fact, that's how it's implemented in Linux. | 
|  | 916 |  | 
|  | 917 | @end deftypefun | 
|  | 918 |  | 
|  | 919 | @comment sched.h | 
|  | 920 | @comment POSIX | 
|  | 921 | @deftypefun int sched_getparam (pid_t @var{pid}, struct sched_param *@var{param}) | 
|  | 922 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 923 | @c Direct syscall, Linux only. | 
|  | 924 |  | 
|  | 925 | This function returns a process' absolute priority. | 
|  | 926 |  | 
|  | 927 | @var{pid} is the Process ID (pid) of the process whose absolute priority | 
|  | 928 | you want to know. | 
|  | 929 |  | 
|  | 930 | @var{param} is a pointer to a structure in which the function stores the | 
|  | 931 | absolute priority of the process. | 
|  | 932 |  | 
On success, the return value is @code{0}.  Otherwise, it is @code{-1}
and @code{errno} is set accordingly.  The @code{errno} values specific
to this function are:
|  | 936 |  | 
|  | 937 | @table @code | 
|  | 938 |  | 
|  | 939 | @item ESRCH | 
|  | 940 | There is no process with pid @var{pid} and it is not zero. | 
|  | 941 |  | 
|  | 942 | @item EINVAL | 
|  | 943 | @var{pid} is negative. | 
|  | 944 |  | 
|  | 945 | @end table | 
|  | 946 |  | 
|  | 947 | @end deftypefun | 
|  | 948 |  | 
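Taken together, @code{sched_getscheduler} and @code{sched_getparam}
recover everything that @code{sched_setscheduler} sets.  A minimal
sketch:

@smallexample
#include <sched.h>
#include <sys/types.h>
#include <stdio.h>

void
show_scheduling (pid_t pid)
@{
  struct sched_param param;
  int policy = sched_getscheduler (pid);

  if (policy != -1 && sched_getparam (pid, &param) == 0)
    printf ("pid %ld: policy %d, absolute priority %d\n",
            (long) pid, policy, param.sched_priority);
@}
@end smallexample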
|  | 949 |  | 
|  | 950 | @comment sched.h | 
|  | 951 | @comment POSIX | 
|  | 952 | @deftypefun int sched_get_priority_min (int @var{policy}) | 
|  | 953 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 954 | @c Direct syscall, Linux only. | 
|  | 955 |  | 
|  | 956 | This function returns the lowest absolute priority value that is | 
|  | 957 | allowable for a process with scheduling policy @var{policy}. | 
|  | 958 |  | 
On Linux, it is 0 for @code{SCHED_OTHER} and 1 for everything else.

On success, the return value is that minimum priority value.  On
failure, it is @code{-1} and @code{errno} is set accordingly.  The
@code{errno} values specific to this function are:
|  | 964 |  | 
|  | 965 | @table @code | 
|  | 966 | @item EINVAL | 
|  | 967 | @var{policy} does not identify an existing scheduling policy. | 
|  | 968 | @end table | 
|  | 969 |  | 
|  | 970 | @end deftypefun | 
|  | 971 |  | 
|  | 972 | @comment sched.h | 
|  | 973 | @comment POSIX | 
|  | 974 | @deftypefun int sched_get_priority_max (int @var{policy}) | 
|  | 975 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 976 | @c Direct syscall, Linux only. | 
|  | 977 |  | 
This function returns the highest absolute priority value that is
allowable for a process with scheduling policy @var{policy}.

On Linux, it is 0 for @code{SCHED_OTHER} and 99 for everything else.

On success, the return value is that maximum priority value.  On
failure, it is @code{-1} and @code{errno} is set accordingly.  The
@code{errno} values specific to this function are:
|  | 986 |  | 
|  | 987 | @table @code | 
|  | 988 | @item EINVAL | 
|  | 989 | @var{policy} does not identify an existing scheduling policy. | 
|  | 990 | @end table | 
|  | 991 |  | 
|  | 992 | @end deftypefun | 
|  | 993 |  | 
|  | 994 | @comment sched.h | 
|  | 995 | @comment POSIX | 
|  | 996 | @deftypefun int sched_rr_get_interval (pid_t @var{pid}, struct timespec *@var{interval}) | 
|  | 997 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 998 | @c Direct syscall, Linux only. | 
|  | 999 |  | 
|  | 1000 | This function returns the length of the quantum (time slice) used with | 
|  | 1001 | the Round Robin scheduling policy, if it is used, for the process with | 
|  | 1002 | Process ID @var{pid}. | 
|  | 1003 |  | 
It stores the length of that time slice in @code{*@var{interval}}.
|  | 1005 | @c We need a cross-reference to where timespec is explained.  But that | 
|  | 1006 | @c section doesn't exist yet, and the time chapter needs to be slightly | 
|  | 1007 | @c reorganized so there is a place to put it (which will be right next | 
|  | 1008 | @c to timeval, which is presently misplaced).  2000.05.07. | 
|  | 1009 |  | 
|  | 1010 | With a Linux kernel, the round robin time slice is always 150 | 
|  | 1011 | microseconds, and @var{pid} need not even be a real pid. | 
|  | 1012 |  | 
|  | 1013 | The return value is @code{0} on success and in the pathological case | 
|  | 1014 | that it fails, the return value is @code{-1} and @code{errno} is set | 
|  | 1015 | accordingly.  There is nothing specific that can go wrong with this | 
|  | 1016 | function, so there are no specific @code{errno} values. | 
|  | 1017 |  | 
|  | 1018 | @end deftypefun | 
|  | 1019 |  | 
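For instance, the quantum in effect for the calling process could be
displayed like this (a sketch; @code{struct timespec} carries seconds
and nanoseconds):

@smallexample
#include <sched.h>
#include <stdio.h>

void
show_rr_quantum (void)
@{
  struct timespec quantum;
  if (sched_rr_get_interval (0, &quantum) == 0)
    printf ("round robin time slice: %ld.%09ld seconds\n",
            (long) quantum.tv_sec, (long) quantum.tv_nsec);
@}
@end smallexample
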
|  | 1020 | @comment sched.h | 
|  | 1021 | @comment POSIX | 
|  | 1022 | @deftypefun int sched_yield (void) | 
|  | 1023 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1024 | @c Direct syscall on Linux; alias to swtch on HURD. | 
|  | 1025 |  | 
|  | 1026 | This function voluntarily gives up the process' claim on the CPU. | 
|  | 1027 |  | 
|  | 1028 | Technically, @code{sched_yield} causes the calling process to be made | 
|  | 1029 | immediately ready to run (as opposed to running, which is what it was | 
|  | 1030 | before).  This means that if it has absolute priority higher than 0, it | 
|  | 1031 | gets pushed onto the tail of the queue of processes that share its | 
|  | 1032 | absolute priority and are ready to run, and it will run again when its | 
|  | 1033 | turn next arrives.  If its absolute priority is 0, it is more | 
|  | 1034 | complicated, but still has the effect of yielding the CPU to other | 
|  | 1035 | processes. | 
|  | 1036 |  | 
|  | 1037 | If there are no other processes that share the calling process' absolute | 
|  | 1038 | priority, this function doesn't have any effect. | 
|  | 1039 |  | 
|  | 1040 | To the extent that the containing program is oblivious to what other | 
|  | 1041 | processes in the system are doing and how fast it executes, this | 
|  | 1042 | function appears as a no-op. | 
|  | 1043 |  | 
|  | 1044 | The return value is @code{0} on success and in the pathological case | 
|  | 1045 | that it fails, the return value is @code{-1} and @code{errno} is set | 
|  | 1046 | accordingly.  There is nothing specific that can go wrong with this | 
|  | 1047 | function, so there are no specific @code{errno} values. | 
|  | 1048 |  | 
|  | 1049 | @end deftypefun | 
|  | 1050 |  | 
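As suggested in @ref{Realtime Scheduling}, a First Come First Served
process can get a round-robin-like effect by yielding at convenient
points.  A sketch of such a loop, where @code{do_one_unit_of_work} is a
hypothetical function standing for a short burst of real work:

@smallexample
#include <sched.h>

for (;;)
  @{
    do_one_unit_of_work ();
    /* Let any other ready process at this absolute priority run.  */
    sched_yield ();
  @}
@end smallexample
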
|  | 1051 | @node Traditional Scheduling | 
|  | 1052 | @subsection Traditional Scheduling | 
|  | 1053 | @cindex scheduling, traditional | 
|  | 1054 |  | 
|  | 1055 | This section is about the scheduling among processes whose absolute | 
|  | 1056 | priority is 0.  When the system hands out the scraps of CPU time that | 
|  | 1057 | are left over after the processes with higher absolute priority have | 
|  | 1058 | taken all they want, the scheduling described herein determines who | 
|  | 1059 | among the great unwashed processes gets them. | 
|  | 1060 |  | 
|  | 1061 | @menu | 
|  | 1062 | * Traditional Scheduling Intro:: | 
|  | 1063 | * Traditional Scheduling Functions:: | 
|  | 1064 | @end menu | 
|  | 1065 |  | 
|  | 1066 | @node Traditional Scheduling Intro | 
|  | 1067 | @subsubsection Introduction To Traditional Scheduling | 
|  | 1068 |  | 
Long before there was absolute priority (@pxref{Absolute Priority}),
Unix systems were scheduling the CPU using this system.  When POSIX came
in like the Romans and imposed absolute priorities to accommodate the
|  | 1072 | needs of realtime processing, it left the indigenous Absolute Priority | 
|  | 1073 | Zero processes to govern themselves by their own familiar scheduling | 
|  | 1074 | policy. | 
|  | 1075 |  | 
|  | 1076 | Indeed, absolute priorities higher than zero are not available on many | 
|  | 1077 | systems today and are not typically used when they are, being intended | 
|  | 1078 | mainly for computers that do realtime processing.  So this section | 
|  | 1079 | describes the only scheduling many programmers need to be concerned | 
|  | 1080 | about. | 
|  | 1081 |  | 
|  | 1082 | But just to be clear about the scope of this scheduling: Any time a | 
|  | 1083 | process with an absolute priority of 0 and a process with an absolute | 
|  | 1084 | priority higher than 0 are ready to run at the same time, the one with | 
|  | 1085 | absolute priority 0 does not run.  If it's already running when the | 
|  | 1086 | higher priority ready-to-run process comes into existence, it stops | 
|  | 1087 | immediately. | 
|  | 1088 |  | 
In addition to its absolute priority of zero, every process has another
priority, which we will refer to as ``dynamic priority'' because it
changes over time.  The dynamic priority is meaningless for processes
with an absolute priority higher than zero.
|  | 1093 |  | 
|  | 1094 | The dynamic priority sometimes determines who gets the next turn on the | 
|  | 1095 | CPU.  Sometimes it determines how long turns last.  Sometimes it | 
|  | 1096 | determines whether a process can kick another off the CPU. | 
|  | 1097 |  | 
In Linux, the value is a combination of these things, but mostly it
just determines the length of the time slice.  The higher a process'
|  | 1100 | dynamic priority, the longer a shot it gets on the CPU when it gets one. | 
|  | 1101 | If it doesn't use up its time slice before giving up the CPU to do | 
|  | 1102 | something like wait for I/O, it is favored for getting the CPU back when | 
|  | 1103 | it's ready for it, to finish out its time slice.  Other than that, | 
|  | 1104 | selection of processes for new time slices is basically round robin. | 
|  | 1105 | But the scheduler does throw a bone to the low priority processes: A | 
|  | 1106 | process' dynamic priority rises every time it is snubbed in the | 
|  | 1107 | scheduling process.  In Linux, even the fat kid gets to play. | 
|  | 1108 |  | 
|  | 1109 | The fluctuation of a process' dynamic priority is regulated by another | 
|  | 1110 | value: The ``nice'' value.  The nice value is an integer, usually in the | 
|  | 1111 | range -20 to 20, and represents an upper limit on a process' dynamic | 
|  | 1112 | priority.  The higher the nice number, the lower that limit. | 
|  | 1113 |  | 
|  | 1114 | On a typical Linux system, for example, a process with a nice value of | 
|  | 1115 | 20 can get only 10 milliseconds on the CPU at a time, whereas a process | 
|  | 1116 | with a nice value of -20 can achieve a high enough priority to get 400 | 
|  | 1117 | milliseconds. | 
|  | 1118 |  | 
|  | 1119 | The idea of the nice value is deferential courtesy.  In the beginning, | 
|  | 1120 | in the Unix garden of Eden, all processes shared equally in the bounty | 
|  | 1121 | of the computer system.  But not all processes really need the same | 
|  | 1122 | share of CPU time, so the nice value gave a courteous process the | 
|  | 1123 | ability to refuse its equal share of CPU time that others might prosper. | 
|  | 1124 | Hence, the higher a process' nice value, the nicer the process is. | 
|  | 1125 | (Then a snake came along and offered some process a negative nice value | 
|  | 1126 | and the system became the crass resource allocation system we know | 
|  | 1127 | today). | 
|  | 1128 |  | 
|  | 1129 | Dynamic priorities tend upward and downward with an objective of | 
|  | 1130 | smoothing out allocation of CPU time and giving quick response time to | 
|  | 1131 | infrequent requests.  But they never exceed their nice limits, so on a | 
|  | 1132 | heavily loaded CPU, the nice value effectively determines how fast a | 
|  | 1133 | process runs. | 
|  | 1134 |  | 
|  | 1135 | In keeping with the socialistic heritage of Unix process priority, a | 
|  | 1136 | process begins life with the same nice value as its parent process and | 
|  | 1137 | can raise it at will.  A process can also raise the nice value of any | 
|  | 1138 | other process owned by the same user (or effective user).  But only a | 
|  | 1139 | privileged process can lower its nice value.  A privileged process can | 
|  | 1140 | also raise or lower another process' nice value. | 
|  | 1141 |  | 
@glibcadj{} functions for getting and setting nice values are described
below in @ref{Traditional Scheduling Functions}.
|  | 1144 |  | 
|  | 1145 | @node Traditional Scheduling Functions | 
|  | 1146 | @subsubsection Functions For Traditional Scheduling | 
|  | 1147 |  | 
|  | 1148 | @pindex sys/resource.h | 
|  | 1149 | This section describes how you can read and set the nice value of a | 
|  | 1150 | process.  All these symbols are declared in @file{sys/resource.h}. | 
|  | 1151 |  | 
The function and macro names are defined by POSIX, and refer to
``priority,'' but the functions actually have to do with nice values, as
the terms are used in both this manual and POSIX.
|  | 1155 |  | 
|  | 1156 | The range of valid nice values depends on the kernel, but typically it | 
|  | 1157 | runs from @code{-20} to @code{20}.  A lower nice value corresponds to | 
|  | 1158 | higher priority for the process.  These constants describe the range of | 
|  | 1159 | priority values: | 
|  | 1160 |  | 
|  | 1161 | @vtable @code | 
|  | 1162 | @comment sys/resource.h | 
|  | 1163 | @comment BSD | 
|  | 1164 | @item PRIO_MIN | 
|  | 1165 | The lowest valid nice value. | 
|  | 1166 |  | 
|  | 1167 | @comment sys/resource.h | 
|  | 1168 | @comment BSD | 
|  | 1169 | @item PRIO_MAX | 
|  | 1170 | The highest valid nice value. | 
|  | 1171 | @end vtable | 
|  | 1172 |  | 
|  | 1173 | @comment sys/resource.h | 
|  | 1174 | @comment BSD,POSIX | 
|  | 1175 | @deftypefun int getpriority (int @var{class}, int @var{id}) | 
|  | 1176 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1177 | @c Direct syscall on UNIX.  On HURD, calls _hurd_priority_which_map. | 
|  | 1178 | Return the nice value of a set of processes; @var{class} and @var{id} | 
|  | 1179 | specify which ones (see below).  If the processes specified do not all | 
|  | 1180 | have the same nice value, this returns the lowest value that any of them | 
|  | 1181 | has. | 
|  | 1182 |  | 
The return value is the nice value on success, and @code{-1} on
failure; in the latter case @code{errno} is set accordingly.  The
@code{errno} values specific to this function are:
|  | 1186 |  | 
|  | 1187 | @table @code | 
|  | 1188 | @item ESRCH | 
|  | 1189 | The combination of @var{class} and @var{id} does not match any existing | 
|  | 1190 | process. | 
|  | 1191 |  | 
|  | 1192 | @item EINVAL | 
|  | 1193 | The value of @var{class} is not valid. | 
|  | 1194 | @end table | 
|  | 1195 |  | 
|  | 1196 | If the return value is @code{-1}, it could indicate failure, or it could | 
|  | 1197 | be the nice value.  The only way to make certain is to set @code{errno = | 
|  | 1198 | 0} before calling @code{getpriority}, then use @code{errno != 0} | 
|  | 1199 | afterward as the criterion for failure. | 
|  | 1200 | @end deftypefun | 
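
For example, here is a small sketch of the @code{errno} convention just
described.  The process ID @code{0} refers to the calling process (see
the description of @var{class} and @var{id} below):

@smallexample
#include <sys/resource.h>
#include <errno.h>
#include <stdio.h>

int
main (void)
@{
  /* Distinguish a legitimate nice value of -1 from a failure.  */
  errno = 0;
  int niceval = getpriority (PRIO_PROCESS, 0);
  if (niceval == -1 && errno != 0)
    @{
      perror ("getpriority");
      return 1;
    @}
  printf ("current nice value: %d\n", niceval);
  return 0;
@}
@end smallexample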
|  | 1201 |  | 
|  | 1202 | @comment sys/resource.h | 
|  | 1203 | @comment BSD,POSIX | 
|  | 1204 | @deftypefun int setpriority (int @var{class}, int @var{id}, int @var{niceval}) | 
|  | 1205 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1206 | @c Direct syscall on UNIX.  On HURD, calls _hurd_priority_which_map. | 
|  | 1207 | Set the nice value of a set of processes to @var{niceval}; @var{class} | 
|  | 1208 | and @var{id} specify which ones (see below). | 
|  | 1209 |  | 
|  | 1210 | The return value is @code{0} on success, and @code{-1} on | 
failure.  The following @code{errno} error conditions are possible for
|  | 1212 | this function: | 
|  | 1213 |  | 
|  | 1214 | @table @code | 
|  | 1215 | @item ESRCH | 
|  | 1216 | The combination of @var{class} and @var{id} does not match any existing | 
|  | 1217 | process. | 
|  | 1218 |  | 
|  | 1219 | @item EINVAL | 
|  | 1220 | The value of @var{class} is not valid. | 
|  | 1221 |  | 
|  | 1222 | @item EPERM | 
|  | 1223 | The call would set the nice value of a process which is owned by a different | 
|  | 1224 | user than the calling process (i.e., the target process' real or effective | 
|  | 1225 | uid does not match the calling process' effective uid) and the calling | 
|  | 1226 | process does not have @code{CAP_SYS_NICE} permission. | 
|  | 1227 |  | 
|  | 1228 | @item EACCES | 
|  | 1229 | The call would lower the process' nice value and the process does not have | 
|  | 1230 | @code{CAP_SYS_NICE} permission. | 
|  | 1231 | @end table | 
|  | 1232 |  | 
|  | 1233 | @end deftypefun | 
|  | 1234 |  | 
|  | 1235 | The arguments @var{class} and @var{id} together specify a set of | 
|  | 1236 | processes in which you are interested.  These are the possible values of | 
|  | 1237 | @var{class}: | 
|  | 1238 |  | 
|  | 1239 | @vtable @code | 
|  | 1240 | @comment sys/resource.h | 
|  | 1241 | @comment BSD | 
|  | 1242 | @item PRIO_PROCESS | 
|  | 1243 | One particular process.  The argument @var{id} is a process ID (pid). | 
|  | 1244 |  | 
|  | 1245 | @comment sys/resource.h | 
|  | 1246 | @comment BSD | 
|  | 1247 | @item PRIO_PGRP | 
|  | 1248 | All the processes in a particular process group.  The argument @var{id} is | 
|  | 1249 | a process group ID (pgid). | 
|  | 1250 |  | 
|  | 1251 | @comment sys/resource.h | 
|  | 1252 | @comment BSD | 
|  | 1253 | @item PRIO_USER | 
|  | 1254 | All the processes owned by a particular user (i.e., whose real uid | 
|  | 1255 | indicates the user).  The argument @var{id} is a user ID (uid). | 
|  | 1256 | @end vtable | 
|  | 1257 |  | 
|  | 1258 | If the argument @var{id} is 0, it stands for the calling process, its | 
|  | 1259 | process group, or its owner (real uid), according to @var{class}. | 
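
As an illustration, the following sketch sets the nice value of the
calling process to @code{10} (an arbitrary example value); as long as
this does not lower the current value, no special privilege is
required:

@smallexample
#include <sys/resource.h>
#include <stdio.h>

int
main (void)
@{
  /* Set the nice value of the calling process (id 0) to 10.
     Lowering the value instead would require the privileges
     described above.  */
  if (setpriority (PRIO_PROCESS, 0, 10) == -1)
    @{
      perror ("setpriority");
      return 1;
    @}
  return 0;
@}
@end smallexample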
|  | 1260 |  | 
|  | 1261 | @comment unistd.h | 
|  | 1262 | @comment BSD | 
|  | 1263 | @deftypefun int nice (int @var{increment}) | 
|  | 1264 | @safety{@prelim{}@mtunsafe{@mtasurace{:setpriority}}@asunsafe{}@acsafe{}} | 
|  | 1265 | @c Calls getpriority before and after setpriority, using the result of | 
|  | 1266 | @c the first call to compute the argument for setpriority.  This creates | 
|  | 1267 | @c a window for a concurrent setpriority (or nice) call to be lost or | 
|  | 1268 | @c exhibit surprising behavior. | 
|  | 1269 | Increment the nice value of the calling process by @var{increment}. | 
|  | 1270 | The return value is the new nice value on success, and @code{-1} on | 
|  | 1271 | failure.  In the case of failure, @code{errno} will be set to the | 
|  | 1272 | same values as for @code{setpriority}. | 
|  | 1273 |  | 
|  | 1274 |  | 
|  | 1275 | Here is an equivalent definition of @code{nice}: | 
|  | 1276 |  | 
|  | 1277 | @smallexample | 
|  | 1278 | int | 
|  | 1279 | nice (int increment) | 
|  | 1280 | @{ | 
  int result, old = getpriority (PRIO_PROCESS, 0);
  result = setpriority (PRIO_PROCESS, 0, old + increment);
  if (result != -1)
    return old + increment;
  else
    return -1;
|  | 1287 | @} | 
|  | 1288 | @end smallexample | 
|  | 1289 | @end deftypefun | 
|  | 1290 |  | 
|  | 1291 |  | 
|  | 1292 | @node CPU Affinity | 
|  | 1293 | @subsection Limiting execution to certain CPUs | 
|  | 1294 |  | 
On a multi-processor system the operating system usually distributes
the runnable processes over all available CPUs in a way which allows
the system to work most efficiently.  Which processes and threads run
can be controlled to some extent with the scheduling functionality
described in the last sections.  But which CPU finally executes which
process or thread is not covered.
|  | 1301 |  | 
|  | 1302 | There are a number of reasons why a program might want to have control | 
|  | 1303 | over this aspect of the system as well: | 
|  | 1304 |  | 
|  | 1305 | @itemize @bullet | 
|  | 1306 | @item | 
One thread or process is responsible for absolutely critical work
which must under no circumstances be interrupted or hindered from
making progress by other processes or threads using CPU resources.  In
this case the special process would be confined to a CPU which no
other process or thread is allowed to use.
|  | 1312 |  | 
|  | 1313 | @item | 
|  | 1314 | The access to certain resources (RAM, I/O ports) has different costs | 
|  | 1315 | from different CPUs.  This is the case in NUMA (Non-Uniform Memory | 
|  | 1316 | Architecture) machines.  Preferably memory should be accessed locally | 
|  | 1317 | but this requirement is usually not visible to the scheduler. | 
Therefore forcing a process or thread onto the CPUs which have local
access to the most-used memory helps to significantly boost the
performance.
|  | 1321 |  | 
|  | 1322 | @item | 
|  | 1323 | In controlled runtimes resource allocation and book-keeping work (for | 
|  | 1324 | instance garbage collection) is performance local to processors.  This | 
|  | 1325 | can help to reduce locking costs if the resources do not have to be | 
|  | 1326 | protected from concurrent accesses from different processors. | 
|  | 1327 | @end itemize | 
|  | 1328 |  | 
The POSIX standard up to this date is not of much help to solve this
problem.  The Linux kernel provides a set of interfaces to allow
specifying @emph{affinity sets} for a process.  The scheduler will
schedule the thread or process on CPUs specified by the affinity
masks.  The interfaces which @theglibc{} defines follow to some
extent the Linux kernel interface.
|  | 1335 |  | 
|  | 1336 | @comment sched.h | 
|  | 1337 | @comment GNU | 
|  | 1338 | @deftp {Data Type} cpu_set_t | 
|  | 1339 | This data set is a bitset where each bit represents a CPU.  How the | 
|  | 1340 | system's CPUs are mapped to bits in the bitset is system dependent. | 
The data type has a fixed size; in the unlikely case that the number
of bits is not sufficient to describe the CPUs of the system a
different interface has to be used.
|  | 1344 |  | 
|  | 1345 | This type is a GNU extension and is defined in @file{sched.h}. | 
|  | 1346 | @end deftp | 
|  | 1347 |  | 
To manipulate the bitset, to set and reset bits, a number of macros are
|  | 1349 | defined.  Some of the macros take a CPU number as a parameter.  Here | 
|  | 1350 | it is important to never exceed the size of the bitset.  The following | 
|  | 1351 | macro specifies the number of bits in the @code{cpu_set_t} bitset. | 
|  | 1352 |  | 
|  | 1353 | @comment sched.h | 
|  | 1354 | @comment GNU | 
|  | 1355 | @deftypevr Macro int CPU_SETSIZE | 
|  | 1356 | The value of this macro is the maximum number of CPUs which can be | 
|  | 1357 | handled with a @code{cpu_set_t} object. | 
|  | 1358 | @end deftypevr | 
|  | 1359 |  | 
|  | 1360 | The type @code{cpu_set_t} should be considered opaque; all | 
|  | 1361 | manipulation should happen via the next four macros. | 
|  | 1362 |  | 
|  | 1363 | @comment sched.h | 
|  | 1364 | @comment GNU | 
|  | 1365 | @deftypefn Macro void CPU_ZERO (cpu_set_t *@var{set}) | 
|  | 1366 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1367 | @c CPU_ZERO ok | 
|  | 1368 | @c  __CPU_ZERO_S ok | 
|  | 1369 | @c   memset dup ok | 
|  | 1370 | This macro initializes the CPU set @var{set} to be the empty set. | 
|  | 1371 |  | 
|  | 1372 | This macro is a GNU extension and is defined in @file{sched.h}. | 
|  | 1373 | @end deftypefn | 
|  | 1374 |  | 
|  | 1375 | @comment sched.h | 
|  | 1376 | @comment GNU | 
|  | 1377 | @deftypefn Macro void CPU_SET (int @var{cpu}, cpu_set_t *@var{set}) | 
|  | 1378 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1379 | @c CPU_SET ok | 
|  | 1380 | @c  __CPU_SET_S ok | 
|  | 1381 | @c   __CPUELT ok | 
|  | 1382 | @c   __CPUMASK ok | 
|  | 1383 | This macro adds @var{cpu} to the CPU set @var{set}. | 
|  | 1384 |  | 
|  | 1385 | The @var{cpu} parameter must not have side effects since it is | 
|  | 1386 | evaluated more than once. | 
|  | 1387 |  | 
|  | 1388 | This macro is a GNU extension and is defined in @file{sched.h}. | 
|  | 1389 | @end deftypefn | 
|  | 1390 |  | 
|  | 1391 | @comment sched.h | 
|  | 1392 | @comment GNU | 
|  | 1393 | @deftypefn Macro void CPU_CLR (int @var{cpu}, cpu_set_t *@var{set}) | 
|  | 1394 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1395 | @c CPU_CLR ok | 
|  | 1396 | @c  __CPU_CLR_S ok | 
|  | 1397 | @c   __CPUELT dup ok | 
|  | 1398 | @c   __CPUMASK dup ok | 
|  | 1399 | This macro removes @var{cpu} from the CPU set @var{set}. | 
|  | 1400 |  | 
|  | 1401 | The @var{cpu} parameter must not have side effects since it is | 
|  | 1402 | evaluated more than once. | 
|  | 1403 |  | 
|  | 1404 | This macro is a GNU extension and is defined in @file{sched.h}. | 
|  | 1405 | @end deftypefn | 
|  | 1406 |  | 
|  | 1407 | @comment sched.h | 
|  | 1408 | @comment GNU | 
|  | 1409 | @deftypefn Macro int CPU_ISSET (int @var{cpu}, const cpu_set_t *@var{set}) | 
|  | 1410 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1411 | @c CPU_ISSET ok | 
|  | 1412 | @c  __CPU_ISSET_S ok | 
|  | 1413 | @c   __CPUELT dup ok | 
|  | 1414 | @c   __CPUMASK dup ok | 
|  | 1415 | This macro returns a nonzero value (true) if @var{cpu} is a member | 
|  | 1416 | of the CPU set @var{set}, and zero (false) otherwise. | 
|  | 1417 |  | 
|  | 1418 | The @var{cpu} parameter must not have side effects since it is | 
|  | 1419 | evaluated more than once. | 
|  | 1420 |  | 
|  | 1421 | This macro is a GNU extension and is defined in @file{sched.h}. | 
|  | 1422 | @end deftypefn | 
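
Here is a short sketch using these macros to construct a CPU set from
scratch (the macros are GNU extensions, so @code{_GNU_SOURCE} must be
defined before including @file{sched.h}):

@smallexample
#define _GNU_SOURCE 1
#include <sched.h>
#include <stdio.h>

int
main (void)
@{
  cpu_set_t set;

  CPU_ZERO (&set);      /* Start with the empty set.  */
  CPU_SET (0, &set);    /* Add CPU 0.  */
  CPU_SET (1, &set);    /* Add CPU 1.  */
  CPU_CLR (1, &set);    /* Remove CPU 1 again.  */

  printf ("CPU 0 is %sin the set\n",
          CPU_ISSET (0, &set) ? "" : "not ");
  return 0;
@}
@end smallexample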
|  | 1423 |  | 
|  | 1424 |  | 
|  | 1425 | CPU bitsets can be constructed from scratch or the currently installed | 
|  | 1426 | affinity mask can be retrieved from the system. | 
|  | 1427 |  | 
|  | 1428 | @comment sched.h | 
|  | 1429 | @comment GNU | 
|  | 1430 | @deftypefun int sched_getaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, cpu_set_t *@var{cpuset}) | 
|  | 1431 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1432 | @c Wrapped syscall to zero out past the kernel cpu set size; Linux | 
|  | 1433 | @c only. | 
|  | 1434 |  | 
This function stores the CPU affinity mask for the process or thread
with the ID @var{pid} in the @var{cpusetsize} bytes long bitmap
pointed to by @var{cpuset}.  If successful, the function always
initializes all bits in the @code{cpu_set_t} object and returns zero.

If @var{pid} does not correspond to a process or thread on the system,
or the function fails for some other reason, it returns @code{-1}
and @code{errno} is set to represent the error condition.
|  | 1443 |  | 
|  | 1444 | @table @code | 
|  | 1445 | @item ESRCH | 
|  | 1446 | No process or thread with the given ID found. | 
|  | 1447 |  | 
|  | 1448 | @item EFAULT | 
The pointer @var{cpuset} does not point to a valid object.
|  | 1450 | @end table | 
|  | 1451 |  | 
|  | 1452 | This function is a GNU extension and is declared in @file{sched.h}. | 
|  | 1453 | @end deftypefun | 
|  | 1454 |  | 
Note that it is not portably possible to use this interface to
retrieve the affinity information for individual POSIX threads.  A
separate interface must be provided for that.
|  | 1458 |  | 
|  | 1459 | @comment sched.h | 
|  | 1460 | @comment GNU | 
|  | 1461 | @deftypefun int sched_setaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, const cpu_set_t *@var{cpuset}) | 
|  | 1462 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1463 | @c Wrapped syscall to detect attempts to set bits past the kernel cpu | 
|  | 1464 | @c set size; Linux only. | 
|  | 1465 |  | 
|  | 1466 | This function installs the @var{cpusetsize} bytes long affinity mask | 
|  | 1467 | pointed to by @var{cpuset} for the process or thread with the ID @var{pid}. | 
If successful the function returns zero and the scheduler will in the
future take the affinity information into account.
|  | 1470 |  | 
|  | 1471 | If the function fails it will return @code{-1} and @code{errno} is set | 
|  | 1472 | to the error code: | 
|  | 1473 |  | 
|  | 1474 | @table @code | 
|  | 1475 | @item ESRCH | 
|  | 1476 | No process or thread with the given ID found. | 
|  | 1477 |  | 
|  | 1478 | @item EFAULT | 
The pointer @var{cpuset} does not point to a valid object.
|  | 1480 |  | 
|  | 1481 | @item EINVAL | 
The bitset is not valid.  This might mean that the affinity set does
not leave a processor for the process or thread to run on.
|  | 1484 | @end table | 
|  | 1485 |  | 
|  | 1486 | This function is a GNU extension and is declared in @file{sched.h}. | 
|  | 1487 | @end deftypefun | 
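
Putting the pieces together, here is a minimal sketch which reads the
affinity mask of the calling thread and then restricts execution to
CPU 0.  Passing @code{0} as the @var{pid} refers to the calling thread:

@smallexample
#define _GNU_SOURCE 1
#include <sched.h>
#include <stdio.h>

int
main (void)
@{
  cpu_set_t set;

  /* Retrieve the current affinity mask.  */
  if (sched_getaffinity (0, sizeof (set), &set) == -1)
    @{
      perror ("sched_getaffinity");
      return 1;
    @}
  printf ("CPU 0 is currently %s\n",
          CPU_ISSET (0, &set) ? "allowed" : "not allowed");

  /* Restrict execution to CPU 0 only.  */
  CPU_ZERO (&set);
  CPU_SET (0, &set);
  if (sched_setaffinity (0, sizeof (set), &set) == -1)
    @{
      perror ("sched_setaffinity");
      return 1;
    @}
  return 0;
@}
@end smallexample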
|  | 1488 |  | 
|  | 1489 |  | 
|  | 1490 | @node Memory Resources | 
|  | 1491 | @section Querying memory available resources | 
|  | 1492 |  | 
The amount of memory available in the system and the way it is organized
often determines the way programs can and have to work.  For
|  | 1495 | functions like @code{mmap} it is necessary to know about the size of | 
|  | 1496 | individual memory pages and knowing how much memory is available enables | 
|  | 1497 | a program to select appropriate sizes for, say, caches.  Before we get | 
|  | 1498 | into these details a few words about memory subsystems in traditional | 
|  | 1499 | Unix systems will be given. | 
|  | 1500 |  | 
|  | 1501 | @menu | 
|  | 1502 | * Memory Subsystem::           Overview about traditional Unix memory handling. | 
|  | 1503 | * Query Memory Parameters::    How to get information about the memory | 
|  | 1504 | subsystem? | 
|  | 1505 | @end menu | 
|  | 1506 |  | 
|  | 1507 | @node Memory Subsystem | 
|  | 1508 | @subsection Overview about traditional Unix memory handling | 
|  | 1509 |  | 
|  | 1510 | @cindex address space | 
|  | 1511 | @cindex physical memory | 
|  | 1512 | @cindex physical address | 
|  | 1513 | Unix systems normally provide processes virtual address spaces.  This | 
|  | 1514 | means that the addresses of the memory regions do not have to correspond | 
|  | 1515 | directly to the addresses of the actual physical memory which stores the | 
|  | 1516 | data.  An extra level of indirection is introduced which translates | 
|  | 1517 | virtual addresses into physical addresses.  This is normally done by the | 
|  | 1518 | hardware of the processor. | 
|  | 1519 |  | 
|  | 1520 | @cindex shared memory | 
Using a virtual address space has several advantages.  The most important
|  | 1522 | is process isolation.  The different processes running on the system | 
|  | 1523 | cannot interfere directly with each other.  No process can write into | 
|  | 1524 | the address space of another process (except when shared memory is used | 
|  | 1525 | but then it is wanted and controlled). | 
|  | 1526 |  | 
|  | 1527 | Another advantage of virtual memory is that the address space the | 
|  | 1528 | processes see can actually be larger than the physical memory available. | 
The physical memory can be extended by storage on an external medium
|  | 1530 | where the content of currently unused memory regions is stored.  The | 
|  | 1531 | address translation can then intercept accesses to these memory regions | 
|  | 1532 | and make memory content available again by loading the data back into | 
|  | 1533 | memory.  This concept makes it necessary that programs which have to use | 
|  | 1534 | lots of memory know the difference between available virtual address | 
|  | 1535 | space and available physical memory.  If the working set of virtual | 
|  | 1536 | memory of all the processes is larger than the available physical memory | 
|  | 1537 | the system will slow down dramatically due to constant swapping of | 
|  | 1538 | memory content from the memory to the storage media and back.  This is | 
|  | 1539 | called ``thrashing''. | 
|  | 1540 | @cindex thrashing | 
|  | 1541 |  | 
|  | 1542 | @cindex memory page | 
|  | 1543 | @cindex page, memory | 
|  | 1544 | A final aspect of virtual memory which is important and follows from | 
|  | 1545 | what is said in the last paragraph is the granularity of the virtual | 
address space handling.  As said above, the virtual address handling
stores memory content externally, and it cannot do this on a
byte-by-byte basis.  The administrative overhead does not allow this
(leaving aside the processor hardware).  Instead several thousand bytes
are handled together and form a @dfn{page}.  The size of each page is
always a power of two bytes.  The smallest page size in use today is
4096 bytes, with 8192, 16384, and 65536 being other popular sizes.
|  | 1553 |  | 
|  | 1554 | @node Query Memory Parameters | 
|  | 1555 | @subsection How to get information about the memory subsystem? | 
|  | 1556 |  | 
The page size of the virtual memory the process sees is essential to
know in several situations.  Some programming interfaces (e.g.,
@code{mmap}, @pxref{Memory-mapped I/O}) require the user to provide
information adjusted to the page size.  In the case of @code{mmap} it is
necessary to provide a length argument which is a multiple of the page
size.  Another place where knowledge about the page size is useful
is in memory allocation.  If one allocates pieces of memory in larger
chunks which are then subdivided by the application code it is useful to
adjust the size of the larger blocks to the page size.  If the total
memory requirement for the block is close to (but not larger than) a
multiple of the page size the kernel's memory handling can work more
effectively since it only has to allocate memory pages which are fully
used.  (To do this optimization it is necessary to know a bit about the
memory allocator; it will require a bit of memory itself for each block,
and this overhead must not push the total size over the page size
multiple.)
|  | 1572 |  | 
The page size traditionally was a compile time constant.  But recent
developments in processors changed this.  Processors now support
different page sizes and they can possibly even vary among different
|  | 1576 | processes on the same system.  Therefore the system should be queried at | 
|  | 1577 | runtime about the current page size and no assumptions (except about it | 
|  | 1578 | being a power of two) should be made. | 
|  | 1579 |  | 
|  | 1580 | @vindex _SC_PAGESIZE | 
|  | 1581 | The correct interface to query about the page size is @code{sysconf} | 
|  | 1582 | (@pxref{Sysconf Definition}) with the parameter @code{_SC_PAGESIZE}. | 
|  | 1583 | There is a much older interface available, too. | 
|  | 1584 |  | 
|  | 1585 | @comment unistd.h | 
|  | 1586 | @comment BSD | 
|  | 1587 | @deftypefun int getpagesize (void) | 
|  | 1588 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} | 
|  | 1589 | @c Obtained from the aux vec at program startup time.  GNU/Linux/m68k is | 
|  | 1590 | @c the exception, with the possibility of a syscall. | 
|  | 1591 | The @code{getpagesize} function returns the page size of the process. | 
|  | 1592 | This value is fixed for the runtime of the process but can vary in | 
|  | 1593 | different runs of the application. | 
|  | 1594 |  | 
|  | 1595 | The function is declared in @file{unistd.h}. | 
|  | 1596 | @end deftypefun | 
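
For example, here is a small sketch which queries the page size at
runtime and rounds a requested length up to the next multiple of it,
relying only on the page size being a power of two:

@smallexample
#include <unistd.h>
#include <stdio.h>

int
main (void)
@{
  /* Prefer sysconf (_SC_PAGESIZE) over the older getpagesize.  */
  size_t pagesize = sysconf (_SC_PAGESIZE);
  size_t request = 100000;

  /* Round up to the next page size multiple; this works because
     the page size is a power of two.  */
  size_t rounded = (request + pagesize - 1) & ~(pagesize - 1);

  printf ("page size: %zu, rounded length: %zu\n", pagesize, rounded);
  return 0;
@}
@end smallexample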
|  | 1597 |  | 
|  | 1598 | Widely available on @w{System V} derived systems is a method to get | 
|  | 1599 | information about the physical memory the system has.  The call | 
|  | 1600 |  | 
|  | 1601 | @vindex _SC_PHYS_PAGES | 
|  | 1602 | @cindex sysconf | 
|  | 1603 | @smallexample | 
|  | 1604 | sysconf (_SC_PHYS_PAGES) | 
|  | 1605 | @end smallexample | 
|  | 1606 |  | 
|  | 1607 | @noindent | 
returns the total number of pages of physical memory the system has.
|  | 1609 | This does not mean all this memory is available.  This information can | 
|  | 1610 | be found using | 
|  | 1611 |  | 
|  | 1612 | @vindex _SC_AVPHYS_PAGES | 
|  | 1613 | @cindex sysconf | 
|  | 1614 | @smallexample | 
|  | 1615 | sysconf (_SC_AVPHYS_PAGES) | 
|  | 1616 | @end smallexample | 
|  | 1617 |  | 
|  | 1618 | These two values help to optimize applications.  The value returned for | 
|  | 1619 | @code{_SC_AVPHYS_PAGES} is the amount of memory the application can use | 
|  | 1620 | without hindering any other process (given that no other process | 
|  | 1621 | increases its memory usage).  The value returned for | 
|  | 1622 | @code{_SC_PHYS_PAGES} is more or less a hard limit for the working set. | 
|  | 1623 | If all applications together constantly use more than that amount of | 
|  | 1624 | memory the system is in trouble. | 
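
A short sketch combining these values with the page size to report the
total and currently available physical memory in bytes:

@smallexample
#include <unistd.h>
#include <stdio.h>

int
main (void)
@{
  long pagesize = sysconf (_SC_PAGESIZE);
  long phys = sysconf (_SC_PHYS_PAGES);
  long avphys = sysconf (_SC_AVPHYS_PAGES);

  /* Multiply the number of pages by the page size to get bytes.  */
  printf ("total physical memory:     %lld bytes\n",
          (long long) phys * pagesize);
  printf ("available physical memory: %lld bytes\n",
          (long long) avphys * pagesize);
  return 0;
@}
@end smallexample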
|  | 1625 |  | 
@Theglibc{} provides, in addition to the already described way to
get this information, two functions.  They are declared in the file
@file{sys/sysinfo.h}.  Programmers should prefer to use the
@code{sysconf} method described above.
|  | 1630 |  | 
|  | 1631 | @comment sys/sysinfo.h | 
|  | 1632 | @comment GNU | 
|  | 1633 | @deftypefun {long int} get_phys_pages (void) | 
|  | 1634 | @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}} | 
|  | 1635 | @c This fopens a /proc file and scans it for the requested information. | 
The @code{get_phys_pages} function returns the total number of pages of
physical memory the system has.  To get the amount of memory this number
has to be multiplied by the page size.
|  | 1639 |  | 
|  | 1640 | This function is a GNU extension. | 
|  | 1641 | @end deftypefun | 
|  | 1642 |  | 
|  | 1643 | @comment sys/sysinfo.h | 
|  | 1644 | @comment GNU | 
|  | 1645 | @deftypefun {long int} get_avphys_pages (void) | 
|  | 1646 | @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}} | 
The @code{get_avphys_pages} function returns the number of available
pages of physical memory the system has.  To get the amount of memory
this number has to be multiplied by the page size.
|  | 1650 |  | 
|  | 1651 | This function is a GNU extension. | 
|  | 1652 | @end deftypefun | 
|  | 1653 |  | 
|  | 1654 | @node Processor Resources | 
|  | 1655 | @section Learn about the processors available | 
|  | 1656 |  | 
|  | 1657 | The use of threads or processes with shared memory allows an application | 
|  | 1658 | to take advantage of all the processing power a system can provide.  If | 
|  | 1659 | the task can be parallelized the optimal way to write an application is | 
|  | 1660 | to have at any time as many processes running as there are processors. | 
|  | 1661 | To determine the number of processors available to the system one can | 
|  | 1662 | run | 
|  | 1663 |  | 
|  | 1664 | @vindex _SC_NPROCESSORS_CONF | 
|  | 1665 | @cindex sysconf | 
|  | 1666 | @smallexample | 
|  | 1667 | sysconf (_SC_NPROCESSORS_CONF) | 
|  | 1668 | @end smallexample | 
|  | 1669 |  | 
|  | 1670 | @noindent | 
|  | 1671 | which returns the number of processors the operating system configured. | 
|  | 1672 | But it might be possible for the operating system to disable individual | 
|  | 1673 | processors and so the call | 
|  | 1674 |  | 
|  | 1675 | @vindex _SC_NPROCESSORS_ONLN | 
|  | 1676 | @cindex sysconf | 
|  | 1677 | @smallexample | 
|  | 1678 | sysconf (_SC_NPROCESSORS_ONLN) | 
|  | 1679 | @end smallexample | 
|  | 1680 |  | 
|  | 1681 | @noindent | 
|  | 1682 | returns the number of processors which are currently online (i.e., | 
|  | 1683 | available). | 
|  | 1684 |  | 
|  | 1685 | For these two pieces of information @theglibc{} also provides | 
|  | 1686 | functions to get the information directly.  The functions are declared | 
|  | 1687 | in @file{sys/sysinfo.h}. | 
|  | 1688 |  | 
|  | 1689 | @comment sys/sysinfo.h | 
|  | 1690 | @comment GNU | 
|  | 1691 | @deftypefun int get_nprocs_conf (void) | 
|  | 1692 | @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}} | 
|  | 1693 | @c This function reads from from /sys using dir streams (single user, so | 
|  | 1694 | @c no @mtasurace issue), and on some arches, from /proc using streams. | 
|  | 1695 | The @code{get_nprocs_conf} function returns the number of processors the | 
|  | 1696 | operating system configured. | 
|  | 1697 |  | 
|  | 1698 | This function is a GNU extension. | 
|  | 1699 | @end deftypefun | 
|  | 1700 |  | 
|  | 1701 | @comment sys/sysinfo.h | 
|  | 1702 | @comment GNU | 
|  | 1703 | @deftypefun int get_nprocs (void) | 
|  | 1704 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}} | 
|  | 1705 | @c This function reads from /proc using file descriptor I/O. | 
|  | 1706 | The @code{get_nprocs} function returns the number of available processors. | 
|  | 1707 |  | 
|  | 1708 | This function is a GNU extension. | 
|  | 1709 | @end deftypefun | 
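
For example, here is a minimal sketch which queries the number of
processors currently online in both ways, as one might do to decide how
many worker threads or processes to start:

@smallexample
#include <unistd.h>
#include <sys/sysinfo.h>
#include <stdio.h>

int
main (void)
@{
  /* Both report the number of processors currently online;
     the sysconf form is the more portable one.  */
  long online = sysconf (_SC_NPROCESSORS_ONLN);
  int nprocs = get_nprocs ();

  printf ("online processors: %ld (get_nprocs: %d)\n", online, nprocs);
  return 0;
@}
@end smallexample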
|  | 1710 |  | 
|  | 1711 | @cindex load average | 
|  | 1712 | Before starting more threads it should be checked whether the processors | 
|  | 1713 | are not already overused.  Unix systems calculate something called the | 
@dfn{load average}.  This is a number indicating how many processes were
running or ready to run.  This number is averaged over different periods
of time (normally 1, 5, and 15 minutes).
|  | 1717 |  | 
|  | 1718 | @comment stdlib.h | 
|  | 1719 | @comment BSD | 
|  | 1720 | @deftypefun int getloadavg (double @var{loadavg}[], int @var{nelem}) | 
|  | 1721 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}} | 
|  | 1722 | @c Calls host_info on HURD; on Linux, opens /proc/loadavg, reads from | 
|  | 1723 | @c it, closes it, without cancellation point, and calls strtod_l with | 
|  | 1724 | @c the C locale to convert the strings to doubles. | 
|  | 1725 | This function gets the 1, 5 and 15 minute load averages of the | 
|  | 1726 | system.  The values are placed in @var{loadavg}.  @code{getloadavg} will | 
|  | 1727 | place at most @var{nelem} elements into the array but never more than | 
|  | 1728 | three elements.  The return value is the number of elements written to | 
@var{loadavg}, or @code{-1} on error.
|  | 1730 |  | 
|  | 1731 | This function is declared in @file{stdlib.h}. | 
|  | 1732 | @end deftypefun |
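
Here is a small sketch reading the three load averages:

@smallexample
#include <stdlib.h>
#include <stdio.h>

int
main (void)
@{
  double loadavg[3];
  int n = getloadavg (loadavg, 3);

  if (n == -1)
    @{
      fputs ("load averages not available\n", stderr);
      return 1;
    @}
  for (int i = 0; i < n; ++i)
    printf ("load average over %s minutes: %.2f\n",
            i == 0 ? "1" : i == 1 ? "5" : "15", loadavg[i]);
  return 0;
@}
@end smallexample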