@node Character Set Handling, Locales, String and Array Utilities, Top
				2	@c %MENU% Support for extended character sets
				3	@chapter Character Set Handling
				4
				5	@ifnottex
				6	@macro cal{text}
				7	\text\
				8	@end macro
				9	@end ifnottex
				10
				11	Character sets used in the early days of computing had only six, seven,
				12	or eight bits for each character: there was never a case where more than
				13	eight bits (one byte) were used to represent a single character. The
				14	limitations of this approach became more apparent as more people
				15	grappled with non-Roman character sets, where not all the characters
				16	that make up a language's character set can be represented by @math{2^8}
				17	choices. This chapter shows the functionality that was added to the C
				18	library to support multiple character sets.
				19
				20	@menu
				21	* Extended Char Intro:: Introduction to Extended Characters.
				22	* Charset Function Overview:: Overview about Character Handling
				23	Functions.
				24	* Restartable multibyte conversion:: Restartable multibyte conversion
				25	Functions.
				26	* Non-reentrant Conversion:: Non-reentrant Conversion Function.
				27	* Generic Charset Conversion:: Generic Charset Conversion.
				28	@end menu
				29
				30
				31	@node Extended Char Intro
				32	@section Introduction to Extended Characters
				33
				34	A variety of solutions is available to overcome the differences between
				35	character sets with a 1:1 relation between bytes and characters and
				36	character sets with ratios of 2:1 or 4:1. The remainder of this
				37	section gives a few examples to help understand the design decisions
				38	made while developing the functionality of the @w{C library}.
				39
				40	@cindex internal representation
				41	A distinction we have to make right away is between internal and
				42	external representation. @dfn{Internal representation} means the
				43	representation used by a program while keeping the text in memory.
				44	External representations are used when text is stored or transmitted
				45	through some communication channel. Examples of external
				46	representations include files waiting in a directory to be
				47	read and parsed.
				48
				49	Traditionally there has been no difference between the two representations.
				50	It was equally comfortable and useful to use the same single-byte
				51	representation internally and externally. This comfort level decreases
				52	with more and larger character sets.
				53
				54	One of the problems to overcome with the internal representation is
				55	handling text that is externally encoded using different character
				56	sets. Assume a program that reads two texts and compares them using
				57	some metric. The comparison can be usefully done only if the texts are
				58	internally kept in a common format.
				59
				60	@cindex wide character
				61	For such a common format (@math{=} character set) eight bits are certainly
				62	no longer enough. So the smallest entity will have to grow: @dfn{wide
				63	characters} will now be used. Instead of one byte per character, two or
				64	four will be used instead. (Three are not good to address in memory and
				65	more than four bytes seem not to be necessary).
				66
				67	@cindex Unicode
				68	@cindex ISO 10646
				69	As shown in some other part of this manual,
				70	@c !!! Ahem, wide char string functions are not yet covered -- drepper
				71	a completely new family has been created of functions that can handle wide
				72	character texts in memory. The most commonly used character sets for such
				73	internal wide character representations are Unicode and @w{ISO 10646}
				74	(also known as UCS for Universal Character Set). Unicode was originally
				75	planned as a 16-bit character set; whereas, @w{ISO 10646} was designed to
				76	be a 31-bit large code space. The two standards are practically identical.
				77	They have the same character repertoire and code table, but Unicode specifies
				78	added semantics. At the moment, only characters in the first @code{0x10000}
				79	code positions (the so-called Basic Multilingual Plane, BMP) have been
				80	assigned, but the assignment of more specialized characters outside this
				81	16-bit space is already in progress. A number of encodings have been
				82	defined for Unicode and @w{ISO 10646} characters:
				83	@cindex UCS-2
				84	@cindex UCS-4
				85	@cindex UTF-8
				86	@cindex UTF-16
				87	UCS-2 is a 16-bit word that can only represent characters
from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
				89	and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
				90	ASCII characters are represented by ASCII bytes and non-ASCII characters
				91	by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
				92	of UCS-2 in which pairs of certain UCS-2 words can be used to encode
				93	non-BMP characters up to @code{0x10ffff}.
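
The surrogate-pair mechanism of UTF-16 mentioned above can be sketched
in a few lines of C. This is only an illustration: the function name
@code{utf16_encode} is hypothetical and not part of the C library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a library function: encode the code point
   CP as UTF-16 into OUT.  Returns the number of 16-bit units written
   (1 or 2), or 0 if CP cannot be encoded (it lies in the surrogate
   range or beyond 0x10ffff).  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
    return 0;
  if (cp < 0x10000)
    {
      /* BMP characters are stored as a single unit, as in UCS-2.  */
      out[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;                                /* 20 bits remain */
  out[0] = (uint16_t) (0xd800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xdc00 + (cp & 0x3ff));  /* low surrogate */
  return 2;
}
```

A caller passes a two-element array and checks the return value to
learn how many units were produced.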
				94
				95	To represent wide characters the @code{char} type is not suitable. For
				96	this reason the @w{ISO C} standard introduces a new type that is
				97	designed to keep one character of a wide character string. To maintain
				98	the similarity there is also a type corresponding to @code{int} for
				99	those functions that take a single wide character.
				100
				101	@comment stddef.h
				102	@comment ISO
				103	@deftp {Data type} wchar_t
				104	This data type is used as the base type for wide character strings.
				105	In other words, arrays of objects of this type are the equivalent of
				106	@code{char[]} for multibyte character strings. The type is defined in
				107	@file{stddef.h}.
				108
				109	The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
				110	say anything specific about the representation. It only requires that
				111	this type is capable of storing all elements of the basic character set.
				112	Therefore it would be legitimate to define @code{wchar_t} as @code{char},
				113	which might make sense for embedded systems.
				114
				115	But in @theglibc{} @code{wchar_t} is always 32 bits wide and, therefore,
				116	capable of representing all UCS-4 values and, therefore, covering all of
				117	@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type
				118	and thereby follow Unicode very strictly. This definition is perfectly
				119	fine with the standard, but it also means that to represent all
				120	characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
				121	characters, which is in fact a multi-wide-character encoding. But
				122	resorting to multi-wide-character encoding contradicts the purpose of the
				123	@code{wchar_t} type.
				124	@end deftp
				125
				126	@comment wchar.h
				127	@comment ISO
				128	@deftp {Data type} wint_t
				129	@code{wint_t} is a data type used for parameters and variables that
				130	contain a single wide character. As the name suggests this type is the
				131	equivalent of @code{int} when using the normal @code{char} strings. The
types @code{wchar_t} and @code{wint_t} often have the same
representation if they are 32 bits wide, but if @code{wchar_t} is
defined as @code{char}, the type @code{wint_t} must be defined as
@code{int} due to the parameter promotion.
				136
				137	@pindex wchar.h
				138	This type is defined in @file{wchar.h} and was introduced in
				139	@w{Amendment 1} to @w{ISO C90}.
				140	@end deftp
				141
As for the @code{char} data type, macros are available for
specifying the minimum and maximum value representable in an object of
type @code{wchar_t}.
				145
				146	@comment wchar.h
				147	@comment ISO
@deftypevr Macro wchar_t WCHAR_MIN
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wchar_t}.
				151
				152	This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
				153	@end deftypevr
				154
				155	@comment wchar.h
				156	@comment ISO
@deftypevr Macro wchar_t WCHAR_MAX
The macro @code{WCHAR_MAX} evaluates to the maximum value representable
by an object of type @code{wchar_t}.
				160
				161	This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
				162	@end deftypevr
				163
Another special wide character value is the equivalent of @code{EOF}.
				165
				166	@comment wchar.h
				167	@comment ISO
				168	@deftypevr Macro wint_t WEOF
				169	The macro @code{WEOF} evaluates to a constant expression of type
				170	@code{wint_t} whose value is different from any member of the extended
				171	character set.
				172
				173	@code{WEOF} need not be the same value as @code{EOF} and unlike
				174	@code{EOF} it also need @emph{not} be negative. In other words, sloppy
				175	code like
				176
				177	@smallexample
@{
  int c;
  @dots{}
  while ((c = getc (fp)) >= 0)
    @dots{}
@}
				184	@end smallexample
				185
				186	@noindent
				187	has to be rewritten to use @code{WEOF} explicitly when wide characters
				188	are used:
				189
				190	@smallexample
@{
  wint_t c;
  @dots{}
  while ((c = getwc (fp)) != WEOF)
    @dots{}
@}
				197	@end smallexample
				198
				199	@pindex wchar.h
				200	This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
				201	defined in @file{wchar.h}.
				202	@end deftypevr
				203
				204
These internal representations present problems when it comes to storage
and transmittal. Because each single wide character consists of more
than one byte, they are affected by byte-ordering. Thus, machines with
different endiannesses would see different values when accessing the same
				209	data. This byte ordering concern also applies for communication protocols
				210	that are all byte-based and therefore require that the sender has to
				211	decide about splitting the wide character in bytes. A last (but not least
				212	important) point is that wide characters often require more storage space
				213	than a customized byte-oriented character set.
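
One way for a sender to decide how a wide character is split into bytes
is to fix the byte order explicitly instead of writing the in-memory
object. The following sketch assumes a 32-bit value and big-endian
order on the wire; the names @code{ucs4_store_be} and
@code{ucs4_load_be} are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helpers, not library functions.  Store the 32-bit
   value WC into OUT with the most significant byte first, so that
   the result does not depend on the endianness of the host.  */
void
ucs4_store_be (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Rebuild the value from four bytes stored by ucs4_store_be.  */
uint32_t
ucs4_load_be (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Both sides of a channel that use these two functions see the same
value regardless of their native byte order.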
				214
				215	@cindex multibyte character
				216	@cindex EBCDIC
				217	For all the above reasons, an external encoding that is different from
				218	the internal encoding is often used if the latter is UCS-2 or UCS-4.
				219	The external encoding is byte-based and can be chosen appropriately for
				220	the environment and for the texts to be handled. A variety of different
				221	character sets can be used for this external encoding (information that
				222	will not be exhaustively presented here--instead, a description of the
				223	major groups will suffice). All of the ASCII-based character sets
				224	fulfill one requirement: they are "filesystem safe." This means that
				225	the character @code{'/'} is used in the encoding @emph{only} to
				226	represent itself. Things are a bit different for character sets like
				227	EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
family used by IBM), but if the operating system does not understand
EBCDIC directly the parameters to system calls have to be converted
first anyhow.
				231
				232	@itemize @bullet
				233	@item
				234	The simplest character sets are single-byte character sets. There can
				235	be only up to 256 characters (for @w{8 bit} character sets), which is
				236	not sufficient to cover all languages but might be sufficient to handle
a specific text. Handling of @w{8 bit} character sets is simple. This
				238	is not true for other kinds presented later, and therefore, the
				239	application one uses might require the use of @w{8 bit} character sets.
				240
				241	@cindex ISO 2022
				242	@item
				243	The @w{ISO 2022} standard defines a mechanism for extended character
				244	sets where one character @emph{can} be represented by more than one
				245	byte. This is achieved by associating a state with the text.
				246	Characters that can be used to change the state can be embedded in the
				247	text. Each byte in the text might have a different interpretation in each
				248	state. The state might even influence whether a given byte stands for a
				249	character on its own or whether it has to be combined with some more
				250	bytes.
				251
				252	@cindex EUC
				253	@cindex Shift_JIS
				254	@cindex SJIS
				255	In most uses of @w{ISO 2022} the defined character sets do not allow
				256	state changes that cover more than the next character. This has the
				257	big advantage that whenever one can identify the beginning of the byte
				258	sequence of a character one can interpret a text correctly. Examples of
				259	character sets using this policy are the various EUC character sets
				260	(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
				261	or Shift_JIS (SJIS, a Japanese encoding).
				262
				263	But there are also character sets using a state that is valid for more
				264	than one character and has to be changed by another byte sequence.
				265	Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
				266
				267	@item
				268	@cindex ISO 6937
Early attempts to fix 8 bit character sets for other languages using the
Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
representing characters like the acute accent do not produce output
themselves: one has to combine them with other characters to get the
desired result. For example, one writes the byte sequence @code{0xc2 0x61}
(non-spacing acute accent, followed by lower-case `a') to get the ``small
a with acute'' character. To get the acute accent character on its own,
				276	one has to write @code{0xc2 0x20} (the non-spacing acute followed by a
				277	space).
				278
				279	Character sets like @w{ISO 6937} are used in some embedded systems such
				280	as teletex.
				281
				282	@item
				283	@cindex UTF-8
				284	Instead of converting the Unicode or @w{ISO 10646} text used internally,
				285	it is often also sufficient to simply use an encoding different than
				286	UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
				287	encoding: UTF-8. This encoding is able to represent all of @w{ISO
				288	10646} 31 bits in a byte string of length one to six.
				289
				290	@cindex UTF-7
				291	There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
				292	but UTF-8 is today the only encoding that should be used. In fact, with
				293	any luck UTF-8 will soon be the only external encoding that has to be
				294	supported. It proves to be universally usable and its only disadvantage
				295	is that it favors Roman languages by making the byte string
				296	representation of other scripts (Cyrillic, Greek, Asian scripts) longer
				297	than necessary if using a specific character set for these scripts.
				298	Methods like the Unicode compression scheme can alleviate these
				299	problems.
				300	@end itemize
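
The UTF-8 scheme described in the last item, with byte strings of
length one to six covering the 31-bit code space, can be sketched as
follows. The function @code{utf8_encode} is a hypothetical helper
written for illustration; it does not reject surrogate values and is
not how the C library performs the conversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a library function: encode the 31-bit
   value CP into OUT using the original one-to-six byte UTF-8 scheme.
   Returns the number of bytes written, or 0 if CP needs more than
   31 bits.  */
size_t
utf8_encode (uint32_t cp, unsigned char out[6])
{
  /* limit[i] is the first value that no longer fits in i + 1 bytes.  */
  static const uint32_t limit[6] =
    { 0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000u };
  /* lead[i] is the marker of the first byte of an (i + 1)-byte sequence.  */
  static const unsigned char lead[6] =
    { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
  size_t len = 0;

  while (len < 6 && cp >= limit[len])
    ++len;
  if (len == 6)
    return 0;
  ++len;                        /* LEN is now the sequence length */

  /* Fill the continuation bytes from the end, six bits at a time.  */
  for (size_t i = len - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (cp & 0x3f));
      cp >>= 6;
    }
  out[0] = (unsigned char) (lead[len - 1] | cp);
  return len;
}
```

ASCII values come out as a single unchanged byte, which is what makes
the encoding ASCII compatible and filesystem safe.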
				301
The question remaining is how to select the character set or encoding
to use. The answer: you cannot decide it yourself; it is decided
by the developers of the system or the majority of the users. Since the
				305	goal is interoperability one has to use whatever the other people one
				306	works with use. If there are no constraints, the selection is based on
				307	the requirements the expected circle of users will have. In other words,
				308	if a project is expected to be used in only, say, Russia it is fine to use
				309	KOI8-R or a similar character set. But if at the same time people from,
				310	say, Greece are participating one should use a character set that allows
				311	all people to collaborate.
				312
				313	The most widely useful solution seems to be: go with the most general
				314	character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
				315	and problems about users not being able to use their own language
				316	adequately are a thing of the past.
				317
				318	One final comment about the choice of the wide character representation
				319	is necessary at this point. We have said above that the natural choice
				320	is using Unicode or @w{ISO 10646}. This is not required, but at least
				321	encouraged, by the @w{ISO C} standard. The standard defines at least a
				322	macro @code{__STDC_ISO_10646__} that is only defined on systems where
				323	the @code{wchar_t} type encodes @w{ISO 10646} characters. If this
				324	symbol is not defined one should avoid making assumptions about the wide
				325	character representation. If the programmer uses only the functions
				326	provided by the C library to handle wide character strings there should
				327	be no compatibility problems with other systems.
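
A program that needs to know whether @code{wchar_t} values are
@w{ISO 10646} code points can test the macro at compile time. The
sketch below is only an illustration; both function names are
hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   wchar_t values are ISO 10646 code points, 0 if nothing is known.  */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  return 1;
#else
  return 0;
#endif
}

/* Hypothetical helper: test for LATIN SMALL LETTER A.  The numeric
   code point is compared only when the representation is known;
   otherwise the wide character constant is used.  */
int
wide_is_latin_small_a (wchar_t wc)
{
  if (wchar_is_iso10646 ())
    return wc == 0x61;
  return wc == L'a';
}
```

Code that never compares wide characters against numeric code points
does not need such a test at all.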
				328
				329	@node Charset Function Overview
				330	@section Overview about Character Handling Functions
				331
				332	A Unix @w{C library} contains three different sets of functions in two
				333	families to handle character set conversion. One of the function families
				334	(the most commonly used) is specified in the @w{ISO C90} standard and,
				335	therefore, is portable even beyond the Unix world. Unfortunately this
				336	family is the least useful one. These functions should be avoided
				337	whenever possible, especially when developing libraries (as opposed to
				338	applications).
				339
The second family of functions was introduced in the early Unix standards
				341	(XPG2) and is still part of the latest and greatest Unix standard:
				342	@w{Unix 98}. It is also the most powerful and useful set of functions.
				343	But we will start with the functions defined in @w{Amendment 1} to
				344	@w{ISO C90}.
				345
				346	@node Restartable multibyte conversion
				347	@section Restartable Multibyte Conversion Functions
				348
				349	The @w{ISO C} standard defines functions to convert strings from a
				350	multibyte representation to wide character strings. There are a number
				351	of peculiarities:
				352
				353	@itemize @bullet
				354	@item
				355	The character set assumed for the multibyte encoding is not specified
				356	as an argument to the functions. Instead the character set specified by
				357	the @code{LC_CTYPE} category of the current locale is used; see
				358	@ref{Locale Categories}.
				359
				360	@item
				361	The functions handling more than one character at a time require NUL
				362	terminated strings as the argument (i.e., converting blocks of text
				363	does not work unless one can add a NUL byte at an appropriate place).
				364	@Theglibc{} contains some extensions to the standard that allow
				365	specifying a size, but basically they also expect terminated strings.
				366	@end itemize
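
Both peculiarities show up in even a small wrapper. The following
sketch converts a NUL-terminated multibyte string with
@code{mbsrtowcs} (@pxref{Converting Strings}); the wrapper name
@code{mb_to_wide} is hypothetical. No character set is passed in:
a real program would normally call @code{setlocale (LC_CTYPE, "")}
beforehand, otherwise the conversion follows the default @code{"C"}
locale.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper, not a library function: convert the
   NUL-terminated multibyte string SRC into DST, which has room for
   DSTLEN wide characters.  The result is always terminated and
   truncated if necessary.  Returns the number of wide characters
   stored, not counting the terminator, or (size_t) -1 on error.  */
size_t
mb_to_wide (const char *src, wchar_t *dst, size_t dstlen)
{
  mbstate_t state;
  size_t n;

  if (dstlen == 0)
    return (size_t) -1;

  /* Start the conversion in the initial shift state.  */
  memset (&state, '\0', sizeof (state));

  /* Leave room for the terminating wide NUL.  */
  n = mbsrtowcs (dst, &src, dstlen - 1, &state);
  if (n == (size_t) -1)
    return n;                   /* invalid multibyte sequence */

  dst[n] = L'\0';
  return n;
}
```

The encoding of @var{src} is taken from the locale in effect at the
time of the call, not from any argument.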
				367
				368	Despite these limitations the @w{ISO C} functions can be used in many
				369	contexts. In graphical user interfaces, for instance, it is not
				370	uncommon to have functions that require text to be displayed in a wide
				371	character string if the text is not simple ASCII. The text itself might
				372	come from a file with translations and the user should decide about the
				373	current locale, which determines the translation and therefore also the
				374	external encoding used. In such a situation (and many others) the
				375	functions described here are perfect. If more freedom while performing
				376	the conversion is necessary take a look at the @code{iconv} functions
				377	(@pxref{Generic Charset Conversion}).
				378
				379	@menu
				380	* Selecting the Conversion:: Selecting the conversion and its properties.
				381	* Keeping the state:: Representing the state of the conversion.
				382	* Converting a Character:: Converting Single Characters.
				383	* Converting Strings:: Converting Multibyte and Wide Character
				384	Strings.
				385	* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
				386	@end menu
				387
				388	@node Selecting the Conversion
				389	@subsection Selecting the conversion and its properties
				390
				391	We already said above that the currently selected locale for the
				392	@code{LC_CTYPE} category decides about the conversion that is performed
				393	by the functions we are about to describe. Each locale uses its own
				394	character set (given as an argument to @code{localedef}) and this is the
				395	one assumed as the external multibyte encoding. The wide character
				396	set is always UCS-4 in @theglibc{}.
				397
				398	A characteristic of each multibyte character set is the maximum number
				399	of bytes that can be necessary to represent one character. This
				400	information is quite important when writing code that uses the
				401	conversion functions (as shown in the examples below).
				402	The @w{ISO C} standard defines two macros that provide this information.
				403
				404
				405	@comment limits.h
				406	@comment ISO
				407	@deftypevr Macro int MB_LEN_MAX
				408	@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
				409	sequence for a single character in any of the supported locales. It is
				410	a compile-time constant and is defined in @file{limits.h}.
				411	@pindex limits.h
				412	@end deftypevr
				413
				414	@comment stdlib.h
				415	@comment ISO
				416	@deftypevr Macro int MB_CUR_MAX
				417	@code{MB_CUR_MAX} expands into a positive integer expression that is the
				418	maximum number of bytes in a multibyte character in the current locale.
				419	The value is never greater than @code{MB_LEN_MAX}. Unlike
				420	@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
				421	@theglibc{} it is not.
				422
				423	@pindex stdlib.h
				424	@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
				425	@end deftypevr
				426
				427	Two different macros are necessary since strictly @w{ISO C90} compilers
				428	do not allow variable length array definitions, but still it is desirable
				429	to avoid dynamic allocation. This incomplete piece of code shows the
				430	problem:
				431
				432	@smallexample
@{
  char buf[MB_LEN_MAX];
  size_t len = 0;

  while (! feof (fp))
    @{
      /* @r{Top up @var{buf} to} MB_CUR_MAX @r{bytes.}  */
      len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      /* @r{@dots{} process} buf@r{, consuming @var{used} bytes}  */
      memmove (buf, &buf[used], len - used);
      len -= used;
    @}
@}
				444	@end smallexample
				445
The code in the inner loop is expected to always have enough bytes in
				447	the array @var{buf} to convert one multibyte character. The array
				448	@var{buf} has to be sized statically since many compilers do not allow a
				449	variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX}
				450	bytes are always available in @var{buf}. Note that it isn't
				451	a problem if @code{MB_CUR_MAX} is not a compile-time constant.
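
A more complete version of this refill pattern might look as follows.
It is a sketch under several assumptions: the function name
@code{count_mb_chars} is hypothetical, it merely counts characters,
and it uses @code{mbrtowc}, which is described later in this chapter,
to find out how many bytes each character occupies.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a library function: count the multibyte
   characters that can be read from FP in the current locale.
   Returns (size_t) -1 if the input is invalid or ends in the middle
   of a character.  */
size_t
count_mb_chars (FILE *fp)
{
  char buf[MB_LEN_MAX];         /* sized statically */
  size_t len = 0;               /* bytes currently held in buf */
  size_t count = 0;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));

  for (;;)
    {
      wchar_t wc;
      size_t used;

      /* Top up the buffer to at most MB_CUR_MAX bytes.  */
      len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      if (len == 0)
        break;                  /* nothing left to convert */

      used = mbrtowc (&wc, buf, len, &state);
      if (used == (size_t) -1 || used == (size_t) -2)
        return (size_t) -1;
      if (used == 0)
        used = 1;               /* assume a NUL character is one byte */

      /* Drop the consumed bytes and keep the remainder.  */
      memmove (buf, buf + used, len - used);
      len -= used;
      ++count;
    }
  return count;
}
```

The array is sized with @code{MB_LEN_MAX}, while the amount actually
read is governed by @code{MB_CUR_MAX} at run time.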
				452
				453
				454	@node Keeping the state
				455	@subsection Representing the state of the conversion
				456
				457	@cindex stateful
				458	In the introduction of this chapter it was said that certain character
				459	sets use a @dfn{stateful} encoding. That is, the encoded values depend
				460	in some way on the previous bytes in the text.
				461
				462	Since the conversion functions allow converting a text in more than one
				463	step we must have a way to pass this information from one call of the
				464	functions to another.
				465
				466	@comment wchar.h
				467	@comment ISO
				468	@deftp {Data type} mbstate_t
				469	@cindex shift state
				470	A variable of type @code{mbstate_t} can contain all the information
				471	about the @dfn{shift state} needed from one call to a conversion
				472	function to another.
				473
				474	@pindex wchar.h
				475	@code{mbstate_t} is defined in @file{wchar.h}. It was introduced in
				476	@w{Amendment 1} to @w{ISO C90}.
				477	@end deftp
				478
				479	To use objects of type @code{mbstate_t} the programmer has to define such
				480	objects (normally as local variables on the stack) and pass a pointer to
				481	the object to the conversion functions. This way the conversion function
				482	can update the object if the current multibyte character set is stateful.
				483
				484	There is no specific function or initializer to put the state object in
				485	any specific state. The rules are that the object should always
				486	represent the initial state before the first use, and this is achieved by
				487	clearing the whole variable with code such as follows:
				488
				489	@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{from now on @var{state} can be used.}  */
  @dots{}
@}
				496	@end smallexample
				497
				498	When using the conversion functions to generate output it is often
				499	necessary to test whether the current state corresponds to the initial
				500	state. This is necessary, for example, to decide whether to emit
				501	escape sequences to set the state to the initial state at certain
				502	sequence points. Communication protocols often require this.
				503
				504	@comment wchar.h
				505	@comment ISO
				506	@deftypefun int mbsinit (const mbstate_t *@var{ps})
				507	@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
				508	@c ps is dereferenced once, unguarded. This would call for @mtsrace:ps,
				509	@c but since a single word-sized field is (atomically) accessed, any
				510	@c race here would be harmless. Other functions that take an optional
				511	@c mbstate_t* argument named ps are marked with @mtasurace:<func>/!ps,
				512	@c to indicate that the function uses a static buffer if ps is NULL.
				513	@c These could also have been marked with @mtsrace:ps, but we'll omit
				514	@c that for brevity, for it's somewhat redundant with the @mtasurace.
				515	The @code{mbsinit} function determines whether the state object pointed
				516	to by @var{ps} is in the initial state. If @var{ps} is a null pointer or
				517	the object is in the initial state the return value is nonzero. Otherwise
				518	it is zero.
				519
				520	@pindex wchar.h
				521	@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
				522	declared in @file{wchar.h}.
				523	@end deftypefun
				524
				525	Code using @code{mbsinit} often looks similar to this:
				526
				527	@c Fix the example to explicitly say how to generate the escape sequence
				528	@c to restore the initial state.
				529	@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{Use @var{state}.}  */
  @dots{}
  if (! mbsinit (&state))
    @{
      /* @r{Emit code to return to initial state.}  */
      const wchar_t empty[] = L"";
      const wchar_t *srcp = empty;
      wcsrtombs (outbuf, &srcp, outbuflen, &state);
    @}
  @dots{}
@}
				544	@end smallexample
				545
				546	The code to emit the escape sequence to get back to the initial state is
				547	interesting. The @code{wcsrtombs} function can be used to determine the
				548	necessary output code (@pxref{Converting Strings}). Please note that with
				549	@theglibc{} it is not necessary to perform this extra action for the
				550	conversion from multibyte text to wide character text since the wide
				551	character encoding is not stateful. But there is nothing mentioned in
any standard that prohibits @code{wchar_t} from using a stateful
				553	encoding.
				554
				555	@node Converting a Character
				556	@subsection Converting Single Characters
				557
				558	The most fundamental of the conversion functions are those dealing with
				559	single characters. Please note that this does not always mean single
				560	bytes. But since there is very often a subset of the multibyte
				561	character set that consists of single byte sequences, there are
				562	functions to help with converting bytes. Frequently, ASCII is a subpart
				563	of the multibyte character set. In such a scenario, each ASCII character
				564	stands for itself, and all other characters have at least a first byte
				565	that is beyond the range @math{0} to @math{127}.
				566
				567	@comment wchar.h
				568	@comment ISO
				569	@deftypefun wint_t btowc (int @var{c})
				570	@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				571	@c Calls btowc_fct or __fct; reads from locale, and from the
				572	@c get_gconv_fcts result multiple times. get_gconv_fcts calls
				573	@c __wcsmbs_load_conv to initialize the ctype if it's null.
				574	@c wcsmbs_load_conv takes a non-recursive wrlock before allocating
				575	@c memory for the fcts structure, initializing it, and then storing it
				576	@c in the locale object. The initialization involves dlopening and a
				577	@c lot more.
				578	The @code{btowc} function (``byte to wide character'') converts a valid
				579	single byte character @var{c} in the initial shift state into the wide
				580	character equivalent using the conversion rules from the currently
				581	selected locale of the @code{LC_CTYPE} category.
				582
				583	If @code{(unsigned char) @var{c}} is no valid single byte multibyte
				584	character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.
				585
				586	Please note the restriction of @var{c} being tested for validity only in
				587	the initial shift state. No @code{mbstate_t} object is used from
				588	which the state information is taken, and the function also does not use
				589	any static state.
				590
				591	@pindex wchar.h
				592	The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
				593	and is declared in @file{wchar.h}.
				594	@end deftypefun
				595
				596	Despite the limitation that the single byte value is always interpreted
				597	in the initial state, this function is actually useful most of the time.
Most character sets are either entirely single-byte character sets or they
are extensions of ASCII. This makes it possible to write code like this
(not that this specific example is very useful):
				601
				602	@smallexample
wchar_t *
itow (unsigned long int val)
@{
  static wchar_t buf[30];
  wchar_t *wcp = &buf[29];
  *wcp = L'\0';
  while (val != 0)
    @{
      *--wcp = btowc ('0' + val % 10);
      val /= 10;
    @}
  if (wcp == &buf[29])
    *--wcp = L'0';
  return wcp;
@}
				618	@end smallexample
				619
				620	Why is it necessary to use such a complicated implementation and not
				621	simply cast @code{'0' + val % 10} to a wide character? The answer is
				622	that there is no guarantee that one can perform this kind of arithmetic
on the characters of the character set used for @code{wchar_t}
				624	representation. In other situations the bytes are not constant at
				625	compile time and so the compiler cannot do the work. In situations like
				626	this, using @code{btowc} is required.
				627
				628	@noindent
				629	There is also a function for the conversion in the other direction.
				630
				631	@comment wchar.h
				632	@comment ISO
				633	@deftypefun int wctob (wint_t @var{c})
				634	@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				635	The @code{wctob} function (``wide character to byte'') takes as the
				636	parameter a valid wide character. If the multibyte representation for
				637	this character in the initial state is exactly one byte long, the return
				638	value of this function is this character. Otherwise the return value is
				639	@code{EOF}.
				640
				641	@pindex wchar.h
				642	@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
				643	is declared in @file{wchar.h}.
				644	@end deftypefun
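
A typical use of @code{wctob} is to check whether a wide character
string can be stored in a plain @code{char} array with one byte per
character. The sketch below uses a hypothetical helper name,
@code{narrow_wcs}, and simply gives up at the first character that needs
more than one byte.

```c
#include <stddef.h>
#include <stdio.h>   /* EOF */
#include <wchar.h>

/* Store WS in DST (capacity N bytes, terminator included) if every
   wide character has a one-byte representation in the initial state.
   Returns 0 on success and -1 otherwise.  Hypothetical helper.  */
int
narrow_wcs (char *dst, size_t n, const wchar_t *ws)
{
  size_t i;

  for (i = 0; ws[i] != L'\0'; ++i)
    {
      int c;
      if (i + 1 >= n)
        return -1;              /* No room for this byte plus NUL.  */
      c = wctob ((wint_t) ws[i]);
      if (c == EOF)
        return -1;              /* Needs more than one byte.  */
      dst[i] = (char) c;
    }
  if (i >= n)
    return -1;                  /* N was zero.  */
  dst[i] = '\0';
  return 0;
}
```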
				645
There are more general functions to convert single characters from
multibyte representation to wide characters and vice versa. These
				648	functions pose no limit on the length of the multibyte representation
				649	and they also do not require it to be in the initial state.
				650
				651	@comment wchar.h
				652	@comment ISO
@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
				654	@safety{@prelim{}@mtunsafe{@mtasurace{:mbrtowc/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				655	@cindex stateful
				656	The @code{mbrtowc} function (``multibyte restartable to wide
				657	character'') converts the next multibyte character in the string pointed
				658	to by @var{s} into a wide character and stores it in the wide character
				659	string pointed to by @var{pwc}. The conversion is performed according
				660	to the locale currently selected for the @code{LC_CTYPE} category. If
				661	the conversion for the character set used in the locale requires a state,
				662	the multibyte string is interpreted in the state represented by the
				663	object pointed to by @var{ps}. If @var{ps} is a null pointer, a static,
				664	internal state variable used only by the @code{mbrtowc} function is
				665	used.
				666
				667	If the next multibyte character corresponds to the NUL wide character,
				668	the return value of the function is @math{0} and the state object is
				669	afterwards in the initial state. If the next @var{n} or fewer bytes
				670	form a correct multibyte character, the return value is the number of
				671	bytes starting from @var{s} that form the multibyte character. The
				672	conversion state is updated according to the bytes consumed in the
				673	conversion. In both cases the wide character (either the @code{L'\0'}
				674	or the one found in the conversion) is stored in the string pointed to
				675	by @var{pwc} if @var{pwc} is not null.
				676
				677	If the first @var{n} bytes of the multibyte string possibly form a valid
				678	multibyte character but there are more than @var{n} bytes needed to
				679	complete it, the return value of the function is @code{(size_t) -2} and
				680	no value is stored. Please note that this can happen even if @var{n}
				681	has a value greater than or equal to @code{MB_CUR_MAX} since the input
				682	might contain redundant shift sequences.
				683
If the first @var{n} bytes of the multibyte string cannot possibly form
				685	a valid multibyte character, no value is stored, the global variable
				686	@code{errno} is set to the value @code{EILSEQ}, and the function returns
				687	@code{(size_t) -1}. The conversion state is afterwards undefined.
				688
				689	@pindex wchar.h
				690	@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
				691	is declared in @file{wchar.h}.
				692	@end deftypefun
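
Because an incomplete character is remembered in the state object, the
input does not have to be split at character boundaries. The following
sketch (the name @code{decode_chunk} is hypothetical) decodes one chunk
of a longer byte stream. It assumes that @var{out} has room for @var{n}
wide characters, and it treats an embedded NUL as occupying one byte,
which is a simplification for stateful character sets.

```c
#include <stddef.h>
#include <wchar.h>

/* Decode the N bytes at BUF, continuing from the state in *PS, and
   store the wide characters in OUT (room for N elements assumed).
   Returns the number of wide characters stored, or (size_t) -1 for an
   invalid sequence.  A character that is cut off at the end of the
   chunk stays pending in *PS and is completed by the next call.
   Hypothetical helper.  */
size_t
decode_chunk (wchar_t *out, const char *buf, size_t n, mbstate_t *ps)
{
  size_t count = 0;

  while (n > 0)
    {
      size_t r = mbrtowc (&out[count], buf, n, ps);
      if (r == (size_t) -1)
        return (size_t) -1;
      if (r == (size_t) -2)
        break;                  /* All N bytes were absorbed into *PS.  */
      if (r == 0)
        r = 1;                  /* Embedded NUL: assume one byte.  */
      buf += r;
      n -= r;
      ++count;
    }
  return count;
}
```

The caller zero-fills the @code{mbstate_t} object once, before the first
chunk, and then passes the same object to every call.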
				693
				694	Use of @code{mbrtowc} is straightforward. A function that copies a
				695	multibyte string into a wide character string while at the same time
				696	converting all lowercase characters into uppercase could look like this
				697	(this is not the final version, just an example; it has no error
				698	checking, and sometimes leaks memory):
				699
				700	@smallexample
wchar_t *
mbstouwcs (const char *s)
@{
  /* @r{Include the terminating NUL byte in the conversion.} */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t tmp[1];
  mbstate_t state;
  size_t nbytes;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
    @{
      if (nbytes >= (size_t) -2)
        /* @r{Invalid input string.} */
        return NULL;
      *wcp++ = towupper (tmp[0]);
      len -= nbytes;
      s += nbytes;
    @}
  /* @r{Terminate the result string.} */
  *wcp = L'\0';
  return result;
@}
				723	@end smallexample
				724
				725	The use of @code{mbrtowc} should be clear. A single wide character is
				726	stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
				727	in the variable @var{nbytes}. If the conversion is successful, the
uppercase variant of the wide character is stored in the @var{result}
array and the pointer to the input string and the number of available
bytes are adjusted.
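
For comparison, here is a sketch of how the missing error checking could
be added. The name @code{mbstouwcs_checked} is made up for this
illustration; the function releases the result on every failure path and
reports conversion failures through @code{errno}.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Convert S to a newly allocated uppercase wide character string.
   Returns NULL if memory is exhausted or if S is not a valid multibyte
   string (errno is then EILSEQ).  Hypothetical variant.  */
wchar_t *
mbstouwcs_checked (const char *s)
{
  /* Include the terminating NUL byte in the conversion.  */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  mbstate_t state;

  if (result == NULL)
    return NULL;
  memset (&state, '\0', sizeof (state));
  for (;;)
    {
      wchar_t wc;
      size_t nbytes = mbrtowc (&wc, s, len, &state);
      if (nbytes == 0)
        {
          /* The NUL wide character was reached; terminate the result.  */
          *wcp = L'\0';
          return result;
        }
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or truncated input: do not leak the buffer.  */
          free (result);
          errno = EILSEQ;
          return NULL;
        }
      *wcp++ = (wchar_t) towupper ((wint_t) wc);
      len -= nbytes;
      s += nbytes;
    }
}
```

The caller owns the returned array and releases it with @code{free}.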
				731
The only non-obvious thing about @code{mbstouwcs} might be the way memory
is allocated for the result. The above code uses the fact that there
				734	can never be more wide characters in the converted results than there are
				735	bytes in the multibyte input string. This method yields a pessimistic
				736	guess about the size of the result, and if many wide character strings
				737	have to be constructed this way or if the strings are long, the extra
				738	memory required to be allocated because the input string contains
				739	multibyte characters might be significant. The allocated memory block can
				740	be resized to the correct size before returning it, but a better solution
				741	might be to allocate just the right amount of space for the result right
				742	away. Unfortunately there is no function to compute the length of the wide
				743	character string directly from the multibyte string. There is, however, a
				744	function that does part of the work.
				745
				746	@comment wchar.h
				747	@comment ISO
@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
				749	@safety{@prelim{}@mtunsafe{@mtasurace{:mbrlen/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				750	The @code{mbrlen} function (``multibyte restartable length'') computes
				751	the number of at most @var{n} bytes starting at @var{s}, which form the
				752	next valid and complete multibyte character.
				753
				754	If the next multibyte character corresponds to the NUL wide character,
				755	the return value is @math{0}. If the next @var{n} bytes form a valid
				756	multibyte character, the number of bytes belonging to this multibyte
				757	character byte sequence is returned.
				758
				759	If the first @var{n} bytes possibly form a valid multibyte
				760	character but the character is incomplete, the return value is
				761	@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid
				762	and the return value is @code{(size_t) -1}.
				763
				764	The multibyte sequence is interpreted in the state represented by the
				765	object pointed to by @var{ps}. If @var{ps} is a null pointer, a state
				766	object local to @code{mbrlen} is used.
				767
				768	@pindex wchar.h
				769	@code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and
				770	is declared in @file{wchar.h}.
				771	@end deftypefun
				772
				773	The attentive reader now will note that @code{mbrlen} can be implemented
				774	as
				775
				776	@smallexample
				777	mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
				778	@end smallexample
				779
				780	This is true and in fact is mentioned in the official specification.
				781	How can this function be used to determine the length of the wide
				782	character string created from a multibyte character string? It is not
				783	directly usable, but we can define a function @code{mbslen} using it:
				784
				785	@smallexample
size_t
mbslen (const char *s)
@{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    @{
      if (nbytes >= (size_t) -2)
        /* @r{Something is wrong.} */
        return (size_t) -1;
      s += nbytes;
      ++result;
    @}
  return result;
@}
				803	@end smallexample
				804
				805	This function simply calls @code{mbrlen} for each multibyte character
				806	in the string and counts the number of function calls. Please note that
				807	we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
				808	call. This is acceptable since a) this value is larger than the length of
				809	the longest multibyte character sequence and b) we know that the string
				810	@var{s} ends with a NUL byte, which cannot be part of any other multibyte
				811	character sequence but the one representing the NUL wide character.
				812	Therefore, the @code{mbrlen} function will never read invalid memory.
				813
Now that this function is available (just to make this clear, this
function is @emph{not} part of @theglibc{}) we can compute the
number of bytes required to store the wide character string converted
from the multibyte character string @var{s} using
				818
				819	@smallexample
				820	wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
				821	@end smallexample
				822
				823	Please note that the @code{mbslen} function is quite inefficient. The
				824	implementation of @code{mbstouwcs} with @code{mbslen} would have to
				825	perform the conversion of the multibyte character input string twice, and
				826	this conversion might be quite expensive. So it is necessary to think
				827	about the consequences of using the easier but imprecise method before
				828	doing the work twice.
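
A middle course is to keep the pessimistic allocation and give the
unused tail back once the conversion is done, which avoids the second
pass. A minimal sketch, assuming the string was allocated with
@code{malloc} and is already NUL-terminated; the name
@code{wcs_shrink_to_fit} is hypothetical:

```c
#include <stdlib.h>
#include <wchar.h>

/* Shrink the malloc'ed, NUL-terminated wide character string WS to the
   size it actually needs.  If the reallocation fails, the original
   block is still valid and is returned unchanged.
   Hypothetical helper.  */
wchar_t *
wcs_shrink_to_fit (wchar_t *ws)
{
  size_t used = wcslen (ws) + 1;
  wchar_t *fit = realloc (ws, used * sizeof (wchar_t));
  return fit != NULL ? fit : ws;
}
```

Whether the allocator really returns the freed tail to the system is up
to the implementation, so this is a hint rather than a guarantee.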
				829
				830	@comment wchar.h
				831	@comment ISO
@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
				833	@safety{@prelim{}@mtunsafe{@mtasurace{:wcrtomb/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				834	@c wcrtomb uses a static, non-thread-local unguarded state variable when
				835	@c PS is NULL. When a state is passed in, and it's not used
				836	@c concurrently in other threads, this function behaves safely as long
				837	@c as gconv modules don't bring MT safety issues of their own.
				838	@c Attempting to load gconv modules or to build conversion chains in
				839	@c signal handlers may encounter gconv databases or caches in a
				840	@c partially-updated state, and asynchronous cancellation may leave them
				841	@c in such states, besides leaking the lock that guards them.
				842	@c get_gconv_fcts ok
				843	@c wcsmbs_load_conv ok
				844	@c norm_add_slashes ok
				845	@c wcsmbs_getfct ok
				846	@c gconv_find_transform ok
				847	@c gconv_read_conf (libc_once)
				848	@c gconv_lookup_cache ok
				849	@c find_module_idx ok
				850	@c find_module ok
				851	@c gconv_find_shlib (ok)
				852	@c ->init_fct (assumed ok)
				853	@c gconv_get_builtin_trans ok
				854	@c gconv_release_step ok
				855	@c do_lookup_alias ok
				856	@c find_derivation ok
				857	@c derivation_lookup ok
				858	@c increment_counter ok
				859	@c gconv_find_shlib ok
				860	@c step->init_fct (assumed ok)
				861	@c gen_steps ok
				862	@c gconv_find_shlib ok
				863	@c dlopen (presumed ok)
				864	@c dlsym (presumed ok)
				865	@c step->init_fct (assumed ok)
				866	@c step->end_fct (assumed ok)
				867	@c gconv_get_builtin_trans ok
				868	@c gconv_release_step ok
				869	@c add_derivation ok
				870	@c gconv_close_transform ok
				871	@c gconv_release_step ok
				872	@c step->end_fct (assumed ok)
				873	@c gconv_release_shlib ok
				874	@c dlclose (presumed ok)
				875	@c gconv_release_cache ok
				876	@c ->tomb->__fct (assumed ok)
				877	The @code{wcrtomb} function (``wide character restartable to
				878	multibyte'') converts a single wide character into a multibyte string
				879	corresponding to that wide character.
				880
If @var{s} is a null pointer, the function resets the state stored in
the object pointed to by @var{ps} (or the internal @code{mbstate_t}
object) to the initial state. This can also be achieved by a call like
this:

@smallexample
wcrtomb (temp_buf, L'\0', ps)
@end smallexample

@noindent
since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it
writes into an internal buffer, which is guaranteed to be large enough.
				893
				894	If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
				895	necessary, a shift sequence to get the state @var{ps} into the initial
				896	state followed by a single NUL byte, which is stored in the string
				897	@var{s}.
				898
Otherwise a byte sequence (possibly including shift sequences) is written
into the string @var{s}. This only happens if @var{wc} is a valid wide
character (i.e., it has a multibyte representation in the character set
selected by the @code{LC_CTYPE} category of the current locale). If
@var{wc} is not a valid wide character, nothing is stored in the string
@var{s}, @code{errno} is set to @code{EILSEQ}, the conversion state in
@var{ps} is undefined and the return value is @code{(size_t) -1}.
				906
				907	If no error occurred the function returns the number of bytes stored in
				908	the string @var{s}. This includes all bytes representing shift
				909	sequences.
				910
				911	One word about the interface of the function: there is no parameter
				912	specifying the length of the array @var{s}. Instead the function
				913	assumes that there are at least @code{MB_CUR_MAX} bytes available since
				914	this is the maximum length of any byte sequence representing a single
				915	character. So the caller has to make sure that there is enough space
				916	available, otherwise buffer overruns can occur.
				917
				918	@pindex wchar.h
				919	@code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is
				920	declared in @file{wchar.h}.
				921	@end deftypefun
				922
				923	Using @code{wcrtomb} is as easy as using @code{mbrtowc}. The following
				924	example appends a wide character string to a multibyte character string.
				925	Again, the code is not really useful (or correct), it is simply here to
				926	demonstrate the use and some problems.
				927
				928	@smallexample
char *
mbscatwcs (char *s, size_t len, const wchar_t *ws)
@{
  mbstate_t state;
  /* @r{Find the end of the existing string.} */
  char *wp = strchr (s, '\0');
  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    @{
      size_t nbytes;
      if (len < MB_CUR_MAX)
        @{
          /* @r{We cannot guarantee that the next}
             @r{character fits into the buffer, so}
             @r{return an error.} */
          errno = E2BIG;
          return NULL;
        @}
      nbytes = wcrtomb (wp, *ws, &state);
      if (nbytes == (size_t) -1)
        /* @r{Error in the conversion.} */
        return NULL;
      len -= nbytes;
      wp += nbytes;
    @}
  while (*ws++ != L'\0');
  return s;
@}
				958	@end smallexample
				959
				960	First the function has to find the end of the string currently in the
				961	array @var{s}. The @code{strchr} call does this very efficiently since a
				962	requirement for multibyte character representations is that the NUL byte
				963	is never used except to represent itself (and in this context, the end
				964	of the string).
				965
After initializing the state object the loop is entered where the first
task is to make sure there is enough room in the array @var{s}. We
abort if there are not at least @code{MB_CUR_MAX} bytes available. This
is not always optimal but we have no other choice. We might have fewer
than @code{MB_CUR_MAX} bytes available but the next multibyte character
might also be only one byte long. At the time the @code{wcrtomb} call
returns it is too late to decide whether the buffer was large enough. If
this solution is unsuitable, there is a very slow but more accurate
solution.
				975
@smallexample
  @dots{}
  if (len < MB_CUR_MAX)
    @{
      mbstate_t temp_state;
      char temp_buf[MB_LEN_MAX];
      memcpy (&temp_state, &state, sizeof (state));
      if (wcrtomb (temp_buf, *ws, &temp_state) > len)
        @{
          /* @r{The next character does not fit}
             @r{into the buffer, so return an}
             @r{error.} */
          errno = E2BIG;
          return NULL;
        @}
    @}
  @dots{}
@end smallexample

Here we first perform the conversion that might overflow the buffer into
a scratch array, so that we are afterwards in the position to make an
exact decision about the buffer size. Please note that a null pointer
cannot be used as the destination for this trial conversion: if @var{s}
is a null pointer, @code{wcrtomb} ignores @var{wc} and behaves as if the
NUL wide character were converted. A scratch array of @code{MB_LEN_MAX}
bytes is large enough for any single character. The most unusual thing
about this piece of code certainly is the duplication of the conversion
state object, but if a change of the state is necessary to emit the next
multibyte character, we want to have the same shift state change
performed in the real conversion. Therefore, we have to preserve the
initial shift state information.
				1004
				1005	There are certainly many more and even better solutions to this problem.
				1006	This example is only provided for educational purposes.
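
One such alternative avoids the second conversion altogether: every
character is first converted into a scratch array of @code{MB_LEN_MAX}
bytes and copied into @var{s} only if it fits. The sketch below uses the
made-up name @code{mbscatwcs_exact} and, like the example above, assumes
that the existing string ends in the initial shift state. On failure it
restores the original terminating NUL byte.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Append WS to the multibyte string in S, whose array is LEN bytes
   long in total.  Returns S, or NULL with errno set to E2BIG (no room)
   or EILSEQ (unconvertible character).  On failure S holds its
   original contents again.  Hypothetical variant.  */
char *
mbscatwcs_exact (char *s, size_t len, const wchar_t *ws)
{
  char *end = strchr (s, '\0');   /* Original terminator.  */
  char *wp = end;
  mbstate_t state;

  len -= (size_t) (wp - s);
  memset (&state, '\0', sizeof (state));
  do
    {
      char tmp[MB_LEN_MAX];
      size_t nbytes = wcrtomb (tmp, *ws, &state);
      if (nbytes == (size_t) -1 || nbytes > len)
        {
          if (nbytes != (size_t) -1)
            errno = E2BIG;
          *end = '\0';            /* Undo the partial append.  */
          return NULL;
        }
      memcpy (wp, tmp, nbytes);
      wp += nbytes;
      len -= nbytes;
    }
  while (*ws++ != L'\0');
  return s;
}
```

The price is one extra @code{memcpy} per character, which is usually far
cheaper than converting each character twice.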
				1007
				1008	@node Converting Strings
				1009	@subsection Converting Multibyte and Wide Character Strings
				1010
				1011	The functions described in the previous section only convert a single
				1012	character at a time. Most operations to be performed in real-world
				1013	programs include strings and therefore the @w{ISO C} standard also
				1014	defines conversions on entire strings. However, the defined set of
				1015	functions is quite limited; therefore, @theglibc{} contains a few
				1016	extensions that can help in some important situations.
				1017
				1018	@comment wchar.h
				1019	@comment ISO
@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
				1021	@safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1022	The @code{mbsrtowcs} function (``multibyte string restartable to wide
				1023	character string'') converts a NUL-terminated multibyte character
				1024	string at @code{*@var{src}} into an equivalent wide character string,
				1025	including the NUL wide character at the end. The conversion is started
				1026	using the state information from the object pointed to by @var{ps} or
				1027	from an internal object of @code{mbsrtowcs} if @var{ps} is a null
				1028	pointer. Before returning, the state object is updated to match the state
				1029	after the last converted character. The state is the initial state if the
				1030	terminating NUL byte is reached and converted.
				1031
				1032	If @var{dst} is not a null pointer, the result is stored in the array
				1033	pointed to by @var{dst}; otherwise, the conversion result is not
				1034	available since it is stored in an internal buffer.
				1035
				1036	If @var{len} wide characters are stored in the array @var{dst} before
				1037	reaching the end of the input string, the conversion stops and @var{len}
				1038	is returned. If @var{dst} is a null pointer, @var{len} is never checked.
				1039
				1040	Another reason for a premature return from the function call is if the
				1041	input string contains an invalid multibyte sequence. In this case the
				1042	global variable @code{errno} is set to @code{EILSEQ} and the function
				1043	returns @code{(size_t) -1}.
				1044
				1045	@c XXX The ISO C9x draft seems to have a problem here. It says that PS
				1046	@c is not updated if DST is NULL. This is not said straightforward and
				1047	@c none of the other functions is described like this. It would make sense
				1048	@c to define the function this way but I don't think it is meant like this.
				1049
				1050	In all other cases the function returns the number of wide characters
				1051	converted during this call. If @var{dst} is not null, @code{mbsrtowcs}
				1052	stores in the pointer pointed to by @var{src} either a null pointer (if
				1053	the NUL byte in the input string was reached) or the address of the byte
				1054	following the last converted multibyte character.
				1055
				1056	@pindex wchar.h
				1057	@code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
				1058	declared in @file{wchar.h}.
				1059	@end deftypefun
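
Since @var{len} is ignored when @var{dst} is a null pointer, a first call
can be used only to count the wide characters, and a second call then
fills an array of exactly the right size. The following sketch shows
this idiom; the name @code{mbs_to_wcs_dup} is hypothetical, and the
input is converted twice, so the cost is comparable to the
@code{mbslen} approach.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return a newly allocated wide character copy of the multibyte
   string S, or NULL if S is invalid or memory is exhausted.
   Hypothetical helper.  */
wchar_t *
mbs_to_wcs_dup (const char *s)
{
  mbstate_t state;
  const char *src = s;
  wchar_t *ws;
  size_t n;

  /* First pass: only count the wide characters.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  ws = malloc ((n + 1) * sizeof (wchar_t));
  if (ws == NULL)
    return NULL;

  /* Second pass: convert, including the terminating NUL.  */
  src = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (ws, &src, n + 1, &state) == (size_t) -1)
    {
      free (ws);
      return NULL;
    }
  return ws;
}
```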
				1060
The definition of the @code{mbsrtowcs} function has one important
limitation. The requirement that the input at @code{*@var{src}} has to be
a NUL-terminated string causes problems if one wants to convert buffers
with text. A buffer is normally not a collection of NUL-terminated strings
but instead a continuous collection of lines, separated by newline
characters. Now
				1066	assume that a function to convert one line from a buffer is needed. Since
				1067	the line is not NUL-terminated, the source pointer cannot directly point
				1068	into the unmodified text buffer. This means, either one inserts the NUL
				1069	byte at the appropriate place for the time of the @code{mbsrtowcs}
				1070	function call (which is not doable for a read-only buffer or in a
				1071	multi-threaded application) or one copies the line in an extra buffer
				1072	where it can be terminated by a NUL byte. Note that it is not in general
				1073	possible to limit the number of characters to convert by setting the
				1074	parameter @var{len} to any specific value. Since it is not known how
				1075	many bytes each multibyte character sequence is in length, one can only
				1076	guess.
				1077
				1078	@cindex stateful
				1079	There is still a problem with the method of NUL-terminating a line right
				1080	after the newline character, which could lead to very strange results.
				1081	As said in the description of the @code{mbsrtowcs} function above the
				1082	conversion state is guaranteed to be in the initial shift state after
				1083	processing the NUL byte at the end of the input string. But this NUL
				1084	byte is not really part of the text (i.e., the conversion state after
				1085	the newline in the original text could be something different than the
				1086	initial shift state and therefore the first character of the next line
				1087	is encoded using this state). But the state in question is never
				1088	accessible to the user since the conversion stops after the NUL byte
				1089	(which resets the state). Most stateful character sets in use today
				1090	require that the shift state after a newline be the initial state--but
this is not a strict guarantee. Therefore, simply NUL-terminating a
piece of a running text is not always an adequate solution and should
never be used in general-purpose code.
				1094
				1095	The generic conversion interface (@pxref{Generic Charset Conversion})
				1096	does not have this limitation (it simply works on buffers, not
				1097	strings), and @theglibc{} contains a set of functions that take
				1098	additional parameters specifying the maximal number of bytes that are
				1099	consumed from the input string. This way the problem of
				1100	@code{mbsrtowcs}'s example above could be solved by determining the line
				1101	length and passing this length to the function.
				1102
				1103	@comment wchar.h
				1104	@comment ISO
@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
				1106	@safety{@prelim{}@mtunsafe{@mtasurace{:wcsrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1107	The @code{wcsrtombs} function (``wide character string restartable to
				1108	multibyte string'') converts the NUL-terminated wide character string at
				1109	@code{*@var{src}} into an equivalent multibyte character string and
				1110	stores the result in the array pointed to by @var{dst}. The NUL wide
				1111	character is also converted. The conversion starts in the state
described in the object pointed to by @var{ps} or by a state object
local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
				1114	@var{dst} is a null pointer, the conversion is performed as usual but the
				1115	result is not available. If all characters of the input string were
				1116	successfully converted and if @var{dst} is not a null pointer, the
				1117	pointer pointed to by @var{src} gets assigned a null pointer.
				1118
				1119	If one of the wide characters in the input string has no valid multibyte
				1120	character equivalent, the conversion stops early, sets the global
				1121	variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
				1122
Another reason for a premature stop is if @var{dst} is not a null
pointer and storing the next converted character would bring the total
to more than @var{len} bytes in the array @var{dst}. In this case the
pointer pointed to by @var{src} is assigned a value pointing to the wide
character right after the last one successfully converted.
				1129
				1130	Except in the case of an encoding error the return value of the
				1131	@code{wcsrtombs} function is the number of bytes in all the multibyte
				1132	character sequences stored in @var{dst}. Before returning the state in
				1133	the object pointed to by @var{ps} (or the internal object in case
				1134	@var{ps} is a null pointer) is updated to reflect the state after the
				1135	last conversion. The state is the initial shift state in case the
				1136	terminating NUL wide character was converted.
				1137
				1138	@pindex wchar.h
				1139	The @code{wcsrtombs} function was introduced in @w{Amendment 1} to
				1140	@w{ISO C90} and is declared in @file{wchar.h}.
				1141	@end deftypefun
				1142
				1143	The restriction mentioned above for the @code{mbsrtowcs} function applies
				1144	here also. There is no possibility of directly controlling the number of
				1145	input characters. One has to place the NUL wide character at the correct
				1146	place or control the consumed input indirectly via the available output
				1147	array size (the @var{len} parameter).
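
Controlling the input through the output size looks as follows in
practice. This sketch writes a wide character string to a stream through
a small fixed buffer; @code{put_wcs} is a made-up name, and the buffer
size of four times @code{MB_LEN_MAX} is an arbitrary choice that merely
guarantees progress.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Write the multibyte form of WS to FP, converting it piece by piece.
   Returns 0 on success and -1 on a conversion or write error.
   Hypothetical helper.  */
int
put_wcs (FILE *fp, const wchar_t *ws)
{
  char chunk[4 * MB_LEN_MAX];
  const wchar_t *src = ws;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  /* wcsrtombs sets SRC to a null pointer once the terminating NUL
     wide character has been converted.  */
  while (src != NULL)
    {
      size_t n = wcsrtombs (chunk, &src, sizeof (chunk), &state);
      if (n == (size_t) -1)
        return -1;
      if (n == 0 && src != NULL)
        return -1;              /* No progress; should not happen.  */
      if (fwrite (chunk, 1, n, fp) != n)
        return -1;
    }
  return 0;
}
```

Because the same state object is passed to every call, shift sequences
stay consistent across the pieces.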
				1148
				1149	@comment wchar.h
				1150	@comment GNU
@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
				1152	@safety{@prelim{}@mtunsafe{@mtasurace{:mbsnrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1153	The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
				1154	function. All the parameters are the same except for @var{nmc}, which is
				1155	new. The return value is the same as for @code{mbsrtowcs}.
				1156
				1157	This new parameter specifies how many bytes at most can be used from the
				1158	multibyte character string. In other words, the multibyte character
				1159	string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte
				1160	is found within the @var{nmc} first bytes of the string, the conversion
				1161	stops here.
				1162
				1163	This function is a GNU extension. It is meant to work around the
				1164	problems mentioned above. Now it is possible to convert a buffer with
				1165	multibyte character text piece for piece without having to care about
				1166	inserting NUL bytes and the effect of NUL bytes on the conversion state.
				1167	@end deftypefun
				1168
				1169	A function to convert a multibyte string into a wide character string
				1170	and display it could be written like this (this is not a really useful
				1171	example):
				1172
				1173	@smallexample
void
showmbs (const char *src, FILE *fp)
@{
  mbstate_t state;
  int cnt = 0;
  memset (&state, '\0', sizeof (state));
  while (1)
    @{
      wchar_t linebuf[100];
      const char *endp = strchr (src, '\n');
      size_t n;

      /* @r{Exit if there is no more line.} */
      if (endp == NULL)
        break;

      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
      /* @r{Stop at an invalid multibyte sequence.} */
      if (n == (size_t) -1)
        break;
      linebuf[n] = L'\0';
      fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf);
      /* @r{Continue with the line after the newline.} */
      src = endp + 1;
    @}
@}
				1195	@end smallexample
				1196
				1197	There is no problem with the state after a call to @code{mbsnrtowcs}.
				1198	Since we don't insert characters in the strings that were not in there
				1199	right from the beginning and we use @var{state} only for the conversion
				1200	of the given buffer, there is no problem with altering the state.
				1201
				1202	@comment wchar.h
				1203	@comment GNU
@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
				1205	@safety{@prelim{}@mtunsafe{@mtasurace{:wcsnrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1206	The @code{wcsnrtombs} function implements the conversion from wide
				1207	character strings to multibyte character strings. It is similar to
				1208	@code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra
				1209	parameter, which specifies the length of the input string.
				1210
				1211	No more than @var{nwc} wide characters from the input string
				1212	@code{*@var{src}} are converted. If the input string contains a NUL
				1213	wide character in the first @var{nwc} characters, the conversion stops at
				1214	this place.
				1215
				1216	The @code{wcsnrtombs} function is a GNU extension and just like
				1217	@code{mbsnrtowcs} helps in situations where no NUL-terminated input
				1218	strings are available.
				1219	@end deftypefun
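
As an illustration, the following sketch converts a slice of wide
characters that is not NUL-terminated. The name
@code{wcs_slice_to_mbs} is hypothetical. The feature test macro is an
assumption about the build environment, and the function does not emit a
closing shift sequence, so it is only adequate for stateless character
sets.

```c
#define _POSIX_C_SOURCE 200809L  /* Assumed to expose wcsnrtombs.  */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Convert the first NWC wide characters at WS into the array DST of
   DSTSIZE bytes and NUL-terminate the result.  Returns the number of
   bytes stored, not counting the terminator, or (size_t) -1 if a
   character cannot be converted or DST is too small.
   Hypothetical helper.  */
size_t
wcs_slice_to_mbs (char *dst, size_t dstsize,
                  const wchar_t *ws, size_t nwc)
{
  const wchar_t *src = ws;
  mbstate_t state;
  size_t n;

  if (dstsize == 0)
    return (size_t) -1;
  memset (&state, '\0', sizeof (state));
  /* Reserve one byte for the terminator we add ourselves.  */
  n = wcsnrtombs (dst, &src, nwc, dstsize - 1, &state);
  if (n == (size_t) -1)
    return (size_t) -1;
  /* SRC is null if a NUL wide character ended the conversion;
     otherwise it must have reached the end of the slice (or a NUL
     wide character that was left unconverted).  */
  if (src != NULL && src != ws + nwc && *src != L'\0')
    return (size_t) -1;
  dst[n] = '\0';
  return n;
}
```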
				1220
				1221
				1222	@node Multibyte Conversion Example
				1223	@subsection A Complete Multibyte Conversion Example
				1224
				1225	The example programs given in the last sections are only brief and do
				1226	not contain all the error checking, etc. Presented here is a complete
				1227	and documented example. It features the @code{mbrtowc} function but it
				1228	should be easy to derive versions using the other functions.
				1229
				1230	@smallexample
				1231	int
				1232	file_mbsrtowcs (int input, int output)
				1233	@{
				1234	/* @r{Note the use of @code{MB_LEN_MAX}.}
				1235	@r{@code{MB_CUR_MAX} cannot portably be used here.} */
				1236	char buffer[BUFSIZ + MB_LEN_MAX];
				1237	mbstate_t state;
				1238	int filled = 0;
				1239	int eof = 0;
				1240
				1241	/* @r{Initialize the state.} */
				1242	memset (&state, '\0', sizeof (state));
				1243
				1244	while (!eof)
				1245	@{
				1246	ssize_t nread;
				1247	ssize_t nwrite;
				1248	char *inp = buffer;
				1249	wchar_t outbuf[BUFSIZ];
				1250	wchar_t *outp = outbuf;
				1251
				1252	/* @r{Fill up the buffer from the input file.} */
				1253	nread = read (input, buffer + filled, BUFSIZ);
				1254	if (nread < 0)
				1255	@{
				1256	perror ("read");
				1257	return 0;
				1258	@}
				1259	/* @r{If we reach end of file, make a note to read no more.} */
				1260	if (nread == 0)
				1261	eof = 1;
				1262
				1263	/* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
				1264	filled += nread;
				1265
				1266	/* @r{Convert those bytes to wide characters--as many as we can.} */
				1267	while (1)
				1268	@{
				1269	size_t thislen = mbrtowc (outp, inp, filled, &state);
				1270	/* @r{Stop converting at invalid character;}
				1271	@r{this can mean we have read just the first part}
				1272	@r{of a valid character.} */
				1273	if (thislen == (size_t) -1)
				1274	break;
				1275	/* @r{We want to handle embedded NUL bytes}
				1276	@r{but the return value is 0. Correct this.} */
				1277	if (thislen == 0)
				1278	thislen = 1;
				1279	/* @r{Advance past this character.} */
				1280	inp += thislen;
				1281	filled -= thislen;
				1282	++outp;
				1283	@}
				1284
				1285	/* @r{Write the wide characters we just made.} */
				1286	nwrite = write (output, outbuf,
				1287	(outp - outbuf) * sizeof (wchar_t));
				1288	if (nwrite < 0)
				1289	@{
				1290	perror ("write");
				1291	return 0;
				1292	@}
				1293
				1294	/* @r{See if we have a @emph{real} invalid character.} */
				1295	if ((eof && filled > 0) || filled >= MB_CUR_MAX)
				1296	@{
				1297	error (0, 0, "invalid multibyte character");
				1298	return 0;
				1299	@}
				1300
				1301	/* @r{If any characters must be carried forward,}
				1302	@r{put them at the beginning of @code{buffer}.} */
				1303	if (filled > 0)
				1304	memmove (buffer, inp, filled);
				1305	@}
				1306
				1307	return 1;
				1308	@}
				1309	@end smallexample
				1310
				1311
				1312	@node Non-reentrant Conversion
				1313	@section Non-reentrant Conversion Function
				1314
				1315	The functions described in the previous section are defined in
				1316	@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard
				1317	also contained functions for character set conversion. The reason that
				1318	these original functions are not described first is that they are almost
				1319	entirely useless.
				1320
				1321	The problem is that all the conversion functions described in the
				1322	original @w{ISO C90} use an internal state that the caller cannot access.
				1323	Using such a hidden state implies that multiple conversions at the same
				1324	time (not only when using threads) cannot be done, and that you cannot
				1325	first convert single characters and then strings since you cannot tell
				1326	the conversion functions which state to use.
				1327
				1328	These original functions are therefore usable only in a very limited set
				1329	of situations. One must complete converting the entire string before
				1330	starting a new one, and each string/text must be converted with the same
				1331	function (there is no problem with the library itself; it is guaranteed
				1332	that no library function changes the state of any of these functions).
				1333	@strong{For the above reasons it is strongly recommended that the functions
				1334	described in the previous section be used in place of non-reentrant
				1335	conversion functions.}
				1336
				1337	@menu
				1338	* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single
				1339	Characters.
				1340	* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings.
				1341	* Shift State:: States in Non-reentrant Functions.
				1342	@end menu
				1343
				1344	@node Non-reentrant Character Conversion
				1345	@subsection Non-reentrant Conversion of Single Characters
				1346
				1347	@comment stdlib.h
				1348	@comment ISO
				1349	@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size})
				1350	@safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1351	The @code{mbtowc} (``multibyte to wide character'') function when called
				1352	with non-null @var{string} converts the first multibyte character
				1353	beginning at @var{string} to its corresponding wide character code. It
				1354	stores the result in @code{*@var{result}}.
				1355
				1356	@code{mbtowc} never examines more than @var{size} bytes. (The idea is
				1357	to supply for @var{size} the number of bytes of data you have in hand.)
				1358
				1359	@code{mbtowc} with non-null @var{string} distinguishes three
				1360	possibilities: the first @var{size} bytes at @var{string} start with
				1361	valid multibyte characters, they start with an invalid byte sequence or
				1362	just part of a character, or @var{string} points to an empty string (a
				1363	null character).
				1364
				1365	For a valid multibyte character, @code{mbtowc} converts it to a wide
				1366	character and stores that in @code{*@var{result}}, and returns the
				1367	number of bytes in that character (always at least @math{1} and never
				1368	more than @var{size}).
				1369
				1370	For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an
				1371	empty string, it returns @math{0}, also storing @code{'\0'} in
				1372	@code{*@var{result}}.
				1373
				1374	If the multibyte character code uses shift characters, then
				1375	@code{mbtowc} maintains and updates a shift state as it scans. If you
				1376	call @code{mbtowc} with a null pointer for @var{string}, that
				1377	initializes the shift state to its standard initial value. It also
				1378	returns nonzero if the multibyte character code in use actually has a
				1379	shift state. @xref{Shift State}.
				1380	@end deftypefun
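As an illustration of the return values described above, here is a minimal sketch that decodes a whole string with @code{mbtowc}. The function name @code{decode_with_mbtowc} and its error convention are hypothetical, chosen only for this example.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the multibyte string S into OUT, which
   has room for OUTLEN wide characters including the terminator.
   Returns the number of wide characters stored, not counting the
   terminator, or (size_t) -1 for an invalid sequence or a too-small
   output buffer.  */
size_t
decode_with_mbtowc (const char *s, wchar_t *out, size_t outlen)
{
  size_t remaining = strlen (s);
  size_t count = 0;

  if (outlen == 0)
    return (size_t) -1;

  /* Reset the hidden shift state before starting a new string.  */
  mbtowc (NULL, NULL, 0);

  while (remaining > 0)
    {
      wchar_t wc;
      int len = mbtowc (&wc, s, remaining);

      if (len < 0)
        return (size_t) -1;     /* Invalid or incomplete character.  */
      if (len == 0)
        break;                  /* Null character: end of string.  */
      if (count + 1 >= outlen)
        return (size_t) -1;     /* No room left for the terminator.  */

      out[count++] = wc;
      s += len;
      remaining -= len;
    }

  out[count] = L'\0';
  return count;
}
```

Because the shift state is hidden inside the library, two such loops must not be interleaved.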
				1381
				1382	@comment stdlib.h
				1383	@comment ISO
				1384	@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
				1385	@safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1386	The @code{wctomb} (``wide character to multibyte'') function converts
				1387	the wide character code @var{wchar} to its corresponding multibyte
				1388	character sequence, and stores the result in bytes starting at
				1389	@var{string}. At most @code{MB_CUR_MAX} bytes are stored.
				1390
				1391	@code{wctomb} with non-null @var{string} distinguishes three
				1392	possibilities for @var{wchar}: a valid wide character code (one that can
				1393	be translated to a multibyte character), an invalid code, and
				1394	@code{L'\0'}.
				1395
				1396	Given a valid code, @code{wctomb} converts it to a multibyte character,
				1397	storing the bytes starting at @var{string}. Then it returns the number
				1398	of bytes in that character (always at least @math{1} and never more
				1399	than @code{MB_CUR_MAX}).
				1400
				1401	If @var{wchar} is an invalid wide character code, @code{wctomb} returns
				1402	@math{-1}. If @var{wchar} is @code{L'\0'}, it stores @code{'\0'} and returns
				1403	the number of bytes stored (@math{1} unless a shift sequence precedes it).
				1404
				1405	If the multibyte character code uses shift characters, then
				1406	@code{wctomb} maintains and updates a shift state as it scans. If you
				1407	call @code{wctomb} with a null pointer for @var{string}, that
				1408	initializes the shift state to its standard initial value. It also
				1409	returns nonzero if the multibyte character code in use actually has a
				1410	shift state. @xref{Shift State}.
				1411
				1412	Calling this function with a @var{wchar} argument of zero when
				1413	@var{string} is not null has the side-effect of reinitializing the
				1414	stored shift state @emph{as well as} storing the multibyte character
				1415	@code{'\0'} and returning the number of bytes stored.
				1416	@end deftypefun
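A minimal sketch of a single call to @code{wctomb} might look like the following. The helper name @code{encode_one} is hypothetical, and the caller is assumed to supply a buffer of at least @code{MB_LEN_MAX + 1} bytes.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: store the multibyte form of WC in BUF, which
   must have room for MB_LEN_MAX + 1 bytes, and NUL-terminate it.
   Returns the number of bytes wctomb produced, or -1 if WC has no
   representation in the current locale.  */
int
encode_one (wchar_t wc, char *buf)
{
  int len;

  /* Put the hidden shift state into its initial value.  */
  wctomb (NULL, L'\0');

  len = wctomb (buf, wc);
  if (len < 0)
    return -1;

  buf[len] = '\0';
  return len;
}
```

Resetting the state on every call is only reasonable for isolated characters; when encoding a sequence, the reset belongs before the first character only.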
				1417
				1418	Similar to @code{mbrlen} there is also a non-reentrant function that
				1419	computes the length of a multibyte character. It can be defined in
				1420	terms of @code{mbtowc}.
				1421
				1422	@comment stdlib.h
				1423	@comment ISO
				1424	@deftypefun int mblen (const char *@var{string}, size_t @var{size})
				1425	@safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1426	The @code{mblen} function with a non-null @var{string} argument returns
				1427	the number of bytes that make up the multibyte character beginning at
				1428	@var{string}, never examining more than @var{size} bytes. (The idea is
				1429	to supply for @var{size} the number of bytes of data you have in hand.)
				1430
				1431	The return value of @code{mblen} distinguishes three possibilities: the
				1432	first @var{size} bytes at @var{string} start with valid multibyte
				1433	characters, they start with an invalid byte sequence or just part of a
				1434	character, or @var{string} points to an empty string (a null character).
				1435
				1436	For a valid multibyte character, @code{mblen} returns the number of
				1437	bytes in that character (always at least @code{1} and never more than
				1438	@var{size}). For an invalid byte sequence, @code{mblen} returns
				1439	@math{-1}. For an empty string, it returns @math{0}.
				1440
				1441	If the multibyte character code uses shift characters, then @code{mblen}
				1442	maintains and updates a shift state as it scans. If you call
				1443	@code{mblen} with a null pointer for @var{string}, that initializes the
				1444	shift state to its standard initial value. It also returns a nonzero
				1445	value if the multibyte character code in use actually has a shift state.
				1446	@xref{Shift State}.
				1447
				1448	@pindex stdlib.h
				1449	The function @code{mblen} is declared in @file{stdlib.h}.
				1450	@end deftypefun
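The remark that this function can be defined in terms of @code{mbtowc} can be sketched as follows. The name @code{my_mblen} is hypothetical. The sketch is not an exact replacement: @w{ISO C} specifies that the real @code{mblen} leaves the conversion state of @code{mbtowc} untouched, whereas this version shares it.

```c
#include <stdlib.h>

/* Hypothetical stand-in for mblen.  Passing a null result pointer
   asks mbtowc to measure the character without storing it.  */
int
my_mblen (const char *string, size_t size)
{
  return mbtowc (NULL, string, size);
}
```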
				1451
				1452
				1453	@node Non-reentrant String Conversion
				1454	@subsection Non-reentrant Conversion of Strings
				1455
				1456	For convenience the @w{ISO C90} standard also defines functions to
				1457	convert entire strings instead of single characters. These functions
				1458	suffer from the same problems as their reentrant counterparts from
				1459	@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
				1460
				1461	@comment stdlib.h
				1462	@comment ISO
				1463	@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
				1464	@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1465	@c Odd... Although this was supposed to be non-reentrant, the internal
				1466	@c state is not a static buffer, but an automatic variable.
				1467	The @code{mbstowcs} (``multibyte string to wide character string'')
				1468	function converts the null-terminated string of multibyte characters
				1469	@var{string} to an array of wide character codes, storing not more than
				1470	@var{size} wide characters into the array beginning at @var{wstring}.
				1471	The terminating null character counts towards the size, so if @var{size}
				1472	is less than the actual number of wide characters resulting from
				1473	@var{string}, no terminating null character is stored.
				1474
				1475	The conversion of characters from @var{string} begins in the initial
				1476	shift state.
				1477
				1478	If an invalid multibyte character sequence is found, the @code{mbstowcs}
				1479	function returns a value of @math{-1}. Otherwise, it returns the number
				1480	of wide characters stored in the array @var{wstring}. This number does
				1481	not include the terminating null character, which is present if the
				1482	number is less than @var{size}.
				1483
				1484	Here is an example showing how to convert a string of multibyte
				1485	characters, allocating enough space for the result.
				1486
				1487	@smallexample
				1488	wchar_t *
				1489	mbstowcs_alloc (const char *string)
				1490	@{
				1491	size_t size = strlen (string) + 1;
				1492	wchar_t *buf = xmalloc (size * sizeof (wchar_t));
				1493
				1494	size = mbstowcs (buf, string, size);
				1495	if (size == (size_t) -1)
				1496	return NULL;
				1497	buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
				1498	return buf;
				1499	@}
				1500	@end smallexample
				1501
				1502	@end deftypefun
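An alternative to sizing the buffer with @code{strlen} is to ask @code{mbstowcs} for the exact length first. The sketch below assumes the POSIX behavior that a null @var{wstring} makes the function only count the wide characters; this is not guaranteed by @w{ISO C90} itself. The name @code{mbstowcs_exact} is hypothetical, and plain @code{malloc} stands in for @code{xmalloc}.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert STRING into a newly allocated wide
   character string of exactly the needed size.  Returns NULL on an
   invalid multibyte sequence or allocation failure.  */
wchar_t *
mbstowcs_exact (const char *string)
{
  /* Assumed POSIX behavior: a null destination only counts.  */
  size_t len = mbstowcs (NULL, string, 0);
  wchar_t *buf;

  if (len == (size_t) -1)
    return NULL;

  buf = malloc ((len + 1) * sizeof (wchar_t));
  if (buf == NULL)
    return NULL;

  /* LEN + 1 leaves room for the terminating null wide character.  */
  if (mbstowcs (buf, string, len + 1) == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  return buf;
}
```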
				1503
				1504	@comment stdlib.h
				1505	@comment ISO
				1506	@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
				1507	@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1508	The @code{wcstombs} (``wide character string to multibyte string'')
				1509	function converts the null-terminated wide character array @var{wstring}
				1510	into a string containing multibyte characters, storing not more than
				1511	@var{size} bytes starting at @var{string}, followed by a terminating
				1512	null character if there is room. The conversion of characters begins in
				1513	the initial shift state.
				1514
				1515	The terminating null character counts towards the size, so if @var{size}
				1516	is less than or equal to the number of bytes needed for @var{wstring}, no
				1517	terminating null character is stored.
				1518
				1519	If a code that does not correspond to a valid multibyte character is
				1520	found, the @code{wcstombs} function returns a value of @math{-1}.
				1521	Otherwise, the return value is the number of bytes stored in the array
				1522	@var{string}. This number does not include the terminating null character,
				1523	which is present if the number is less than @var{size}.
				1524	@end deftypefun
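By analogy with the @code{mbstowcs} example, the opposite direction could be sketched as follows. The name @code{wcstombs_alloc} is hypothetical, plain @code{malloc} and @code{realloc} are used instead of @code{xmalloc}, and the buffer is sized on the assumption that no character needs more than @code{MB_LEN_MAX} bytes.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert WSTRING into a newly allocated
   multibyte string.  Returns NULL if a wide character cannot be
   converted or memory runs out.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  size_t size = wcslen (wstring) * MB_LEN_MAX + 1;
  char *buf = malloc (size);
  char *shrunk;
  size_t len;

  if (buf == NULL)
    return NULL;

  len = wcstombs (buf, wstring, size);
  if (len == (size_t) -1)
    {
      free (buf);
      return NULL;
    }

  /* LEN is less than SIZE, so the terminating null byte was stored.
     Give back the unused tail; keep the old block if that fails.  */
  shrunk = realloc (buf, len + 1);
  return shrunk != NULL ? shrunk : buf;
}
```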
				1525
				1526	@node Shift State
				1527	@subsection States in Non-reentrant Functions
				1528
				1529	In some multibyte character codes, the @emph{meaning} of any particular
				1530	byte sequence is not fixed; it depends on what other sequences have come
				1531	earlier in the same string. Typically there are just a few sequences that
				1532	can change the meaning of other sequences; these few are called
				1533	@dfn{shift sequences} and we say that they set the @dfn{shift state} for
				1534	other sequences that follow.
				1535
				1536	To illustrate shift state and shift sequences, suppose we decide that
				1537	the sequence @code{0200} (just one byte) enters Japanese mode, in which
				1538	pairs of bytes in the range from @code{0240} to @code{0377} are single
				1539	characters, while @code{0201} enters Latin-1 mode, in which single bytes
				1540	in the range from @code{0240} to @code{0377} are characters, and
				1541	interpreted according to the ISO Latin-1 character set. This is a
				1542	multibyte code that has two alternative shift states (``Japanese mode''
				1543	and ``Latin-1 mode''), and two shift sequences that specify particular
				1544	shift states.
				1545
				1546	When the multibyte character code in use has shift states, then
				1547	@code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update
				1548	the current shift state as they scan the string. To make this work
				1549	properly, you must follow these rules:
				1550
				1551	@itemize @bullet
				1552	@item
				1553	Before starting to scan a string, call the function with a null pointer
				1554	for the multibyte character address---for example, @code{mblen (NULL,
				1555	0)}. This initializes the shift state to its standard initial value.
				1556
				1557	@item
				1558	Scan the string one character at a time, in order. Do not ``back up''
				1559	and rescan characters already scanned, and do not intersperse the
				1560	processing of different strings.
				1561	@end itemize
				1562
				1563	Here is an example of using @code{mblen} following these rules:
				1564
				1565	@smallexample
				1566	void
				1567	scan_string (char *s)
				1568	@{
				1569	int length = strlen (s);
				1570
				1571	/* @r{Initialize shift state.} */
				1572	mblen (NULL, 0);
				1573
				1574	while (1)
				1575	@{
				1576	int thischar = mblen (s, length);
				1577	/* @r{Deal with end of string and invalid characters.} */
				1578	if (thischar == 0)
				1579	break;
				1580	if (thischar == -1)
				1581	@{
				1582	error (0, 0, "invalid multibyte character");
				1583	break;
				1584	@}
				1585	/* @r{Advance past this character.} */
				1586	s += thischar;
				1587	length -= thischar;
				1588	@}
				1589	@}
				1590	@end smallexample
				1591
				1592	The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
				1593	reentrant when using a multibyte code that uses a shift state. However,
				1594	no other library functions call these functions, so you don't have to
				1595	worry that the shift state will be changed mysteriously.
				1596
				1597
				1598	@node Generic Charset Conversion
				1599	@section Generic Charset Conversion
				1600
				1601	The conversion functions mentioned so far in this chapter all had in
				1602	common that they operate on character sets that are not directly
				1603	specified by the functions. The multibyte encoding used is specified by
				1604	the currently selected locale for the @code{LC_CTYPE} category. The
				1605	wide character set is fixed by the implementation (in the case of @theglibc{}
				1606	it is always UCS-4 encoded @w{ISO 10646}).
				1607
				1608	This of course causes several problems when it comes to general character
				1609	conversion:
				1610
				1611	@itemize @bullet
				1612	@item
				1613	For every conversion where neither the source nor the destination
				1614	character set is the character set of the locale for the @code{LC_CTYPE}
				1615	category, one has to change the @code{LC_CTYPE} locale using
				1616	@code{setlocale}.
				1617
				1618	Changing the @code{LC_CTYPE} locale introduces major problems for the rest
				1619	of the program since several more functions (e.g., the character
				1620	classification functions, @pxref{Classification of Characters}) use the
				1621	@code{LC_CTYPE} category.
				1622
				1623	@item
				1624	Parallel conversions to and from different character sets are not
				1625	possible since the @code{LC_CTYPE} selection is global and shared by all
				1626	threads.
				1627
				1628	@item
				1629	If neither the source nor the destination character set is the character
				1630	set used for @code{wchar_t} representation, there is at least a two-step
				1631	process necessary to convert a text using the functions above. One would
				1632	have to select the source character set as the multibyte encoding,
				1633	convert the text into a @code{wchar_t} text, select the destination
				1634	character set as the multibyte encoding, and convert the wide character
				1635	text to the multibyte (@math{=} destination) character set.
				1636
				1637	Even if this is possible (which is not guaranteed) it is very tiresome
				1638	work. It also suffers even more from the two problems raised above
				1639	because the locale has to be changed constantly.
				1640	@end itemize
				1641
				1642	The XPG2 standard defines a completely new set of functions, which has
				1643	none of these limitations. They are not at all coupled to the selected
				1644	locales, and they have no constraints on the character sets selected for
				1645	source and destination. Only the set of available conversions limits
				1646	them. The standard does not specify that any conversion at all must be
				1647	available. Such availability is a measure of the quality of the
				1648	implementation.
				1649
				1650	In the following text, first the interface to @code{iconv} and then the
				1651	conversion function will be described. Comparisons with other
				1652	implementations will show what obstacles stand in the way of portable
				1653	applications. Finally, the implementation is described in so far as might
				1654	interest the advanced user who wants to extend conversion capabilities.
				1655
				1656	@menu
				1657	* Generic Conversion Interface:: Generic Character Set Conversion Interface.
				1658	* iconv Examples:: A complete @code{iconv} example.
				1659	* Other iconv Implementations:: Some Details about other @code{iconv}
				1660	Implementations.
				1661	* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C
				1662	library.
				1663	@end menu
				1664
				1665	@node Generic Conversion Interface
				1666	@subsection Generic Character Set Conversion Interface
				1667
				1668	This set of functions follows the traditional cycle of using a resource:
				1669	open--use--close. The interface consists of three functions, each of
				1670	which implements one step.
				1671
				1672	Before the interfaces are described it is necessary to introduce a
				1673	data type. Just like other open--use--close interfaces the functions
				1674	introduced here work using handles and the @file{iconv.h} header
				1675	defines a special type for the handles used.
				1676
				1677	@comment iconv.h
				1678	@comment XPG2
				1679	@deftp {Data Type} iconv_t
				1680	This data type is an abstract type defined in @file{iconv.h}. The user
				1681	must not assume anything about the definition of this type; it must be
				1682	completely opaque.
				1683
				1684	Objects of this type can get assigned handles for the conversions using
				1685	the @code{iconv} functions. The objects themselves need not be freed, but
				1686	the conversions that the handles stand for must be.
				1687	@end deftp
				1688
				1689	@noindent
				1690	The first step is the function to create a handle.
				1691
				1692	@comment iconv.h
				1693	@comment XPG2
				1694	@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
				1695	@safety{@prelim{}@mtsafe{@mtslocale{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}}
				1696	@c Calls malloc if tocode and/or fromcode are too big for alloca. Calls
				1697	@c strip and upstr on both, then gconv_open. strip and upstr call
				1698	@c isalnum_l and toupper_l with the C locale. gconv_open may MT-safely
				1699	@c tokenize toset, replace unspecified codesets with the current locale
				1700	@c (possibly two different accesses), and finally it calls
				1701	@c gconv_find_transform and initializes the gconv_t result with all the
				1702	@c steps in the conversion sequence, running each one's initializer,
				1703	@c destructing and releasing them all if anything fails.
				1704
				1705	The @code{iconv_open} function has to be used before starting a
				1706	conversion. The two parameters this function takes determine the
				1707	source and destination character set for the conversion, and if the
				1708	implementation has the possibility to perform such a conversion, the
				1709	function returns a handle.
				1710
				1711	If the wanted conversion is not available, the @code{iconv_open} function
				1712	returns @code{(iconv_t) -1}. In this case the global variable
				1713	@code{errno} can have the following values:
				1714
				1715	@table @code
				1716	@item EMFILE
				1717	The process already has @code{OPEN_MAX} file descriptors open.
				1718	@item ENFILE
				1719	The system limit of open files is reached.
				1720	@item ENOMEM
				1721	Not enough memory to carry out the operation.
				1722	@item EINVAL
				1723	The conversion from @var{fromcode} to @var{tocode} is not supported.
				1724	@end table
				1725
				1726	It is not possible to use the same descriptor in different threads to
				1727	perform independent conversions. The data structures associated
				1728	with the descriptor include information about the conversion state.
				1729	This state must not be corrupted by sharing the descriptor between conversions.
				1730
				1731	An @code{iconv} descriptor is like a file descriptor in that a new
				1732	descriptor must be created for every use. The descriptor does not stand
				1733	for all of the conversions from @var{fromcode} to @var{tocode}.
				1734
				1735	The @glibcadj{} implementation of @code{iconv_open} has one
				1736	significant extension to other implementations. To ease the extension
				1737	of the set of available conversions, the implementation allows storing
				1738	the necessary files with data and code in an arbitrary number of
				1739	directories. How this extension must be written will be explained below
				1740	(@pxref{glibc iconv Implementation}). Here it is only important to say
				1741	that all directories mentioned in the @code{GCONV_PATH} environment
				1742	variable are considered only if they contain a file @file{gconv-modules}.
				1743	These directories need not necessarily be created by the system
				1744	administrator. In fact, this extension is introduced to help users
				1745	writing and using their own, new conversions. Of course, this does not
				1746	work for security reasons in SUID binaries; in this case only the system
				1747	directory is considered and this normally is
				1748	@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment
				1749	variable is examined exactly once at the first call of the
				1750	@code{iconv_open} function. Later modifications of the variable have no
				1751	effect.
				1752
				1753	@pindex iconv.h
				1754	The @code{iconv_open} function was introduced early in the X/Open
				1755	Portability Guide, @w{version 2}. It is supported by all commercial
				1756	Unices as it is required for the Unix branding. However, the quality and
				1757	completeness of the implementation varies widely. The @code{iconv_open}
				1758	function is declared in @file{iconv.h}.
				1759	@end deftypefun
				1760
				1761	The @code{iconv} implementation can associate large data structures with
				1762	the handle returned by @code{iconv_open}. Therefore, it is crucial to
				1763	free all the resources once all conversions are carried out and the
				1764	conversion is not needed anymore.
				1765
				1766	@comment iconv.h
				1767	@comment XPG2
				1768	@deftypefun int iconv_close (iconv_t @var{cd})
				1769	@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{}}}
				1770	@c Calls gconv_close to destruct and release each of the conversion
				1771	@c steps, release the gconv_t object, then call gconv_close_transform.
				1772	@c Access to the gconv_t object is not guarded, but calling iconv_close
				1773	@c concurrently with any other use is undefined.
				1774
				1775	The @code{iconv_close} function frees all resources associated with the
				1776	handle @var{cd}, which must have been returned by a successful call to
				1777	the @code{iconv_open} function.
				1778
				1779	If the function call was successful the return value is @math{0}.
				1780	Otherwise it is @math{-1} and @code{errno} is set appropriately.
				1781	Defined errors are:
				1782
				1783	@table @code
				1784	@item EBADF
				1785	The conversion descriptor is invalid.
				1786	@end table
				1787
				1788	@pindex iconv.h
				1789	The @code{iconv_close} function was introduced together with the rest
				1790	of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}.
				1791	@end deftypefun
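The open and close steps can be sketched together as a small probe that tests whether a conversion exists. The function name @code{conversion_available} and its return convention are hypothetical; only @code{iconv_open}, @code{iconv_close} and the @code{errno} values documented above are relied upon.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if a conversion from FROMCODE to
   TOCODE can be opened, 0 if it is not supported, and -1 for any
   other failure.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    {
      if (errno == EINVAL)
        return 0;               /* No such conversion.  */
      perror ("iconv_open");
      return -1;
    }

  /* Every successful iconv_open needs a matching iconv_close.  */
  if (iconv_close (cd) != 0)
    {
      perror ("iconv_close");
      return -1;
    }
  return 1;
}
```

Which character set names are accepted depends on the implementation, so the names passed in are best treated as configuration rather than constants.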
				1792
				1793	The standard defines only one actual conversion function. This has,
				1794	therefore, the most general interface: it allows conversion from one
				1795	buffer to another. Conversion from a file to a buffer, vice versa, or
				1796	even file to file can be implemented on top of it.
				1797
				1798	@comment iconv.h
				1799	@comment XPG2
				1800	@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
				1801	@safety{@prelim{}@mtsafe{@mtsrace{:cd}}@assafe{}@acunsafe{@acucorrupt{}}}
				1802	@c Without guarding access to the iconv_t object pointed to by cd, call
				1803	@c the conversion function to convert inbuf or flush the internal
				1804	@c conversion state.
				1805	@cindex stateful
				1806	The @code{iconv} function converts the text in the input buffer
				1807	according to the rules associated with the descriptor @var{cd} and
				1808	stores the result in the output buffer. It is possible to call the
				1809	function for the same text several times in a row since for stateful
				1810	character sets the necessary state information is kept in the data
				1811	structures associated with the descriptor.
				1812
				1813	The input buffer is specified by @code{*@var{inbuf}} and it contains
				1814	@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for
				1815	communicating the used input back to the caller (see below). It is
				1816	important to note that the buffer pointer is of type @code{char *} and the
				1817	length is measured in bytes even if the input text is encoded in wide
				1818	characters.
				1819
				1820	The output buffer is specified in a similar way. @code{*@var{outbuf}}
				1821	points to the beginning of the buffer with at least
				1822	@code{*@var{outbytesleft}} bytes room for the result. The buffer
				1823	pointer again is of type @code{char *} and the length is measured in
				1824	bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the
				1825	conversion is performed but no output is available.
				1826
				1827	If @var{inbuf} is a null pointer, the @code{iconv} function performs the
				1828	necessary action to put the state of the conversion into the initial
				1829	state. This is obviously a no-op for non-stateful encodings, but if the
				1830	encoding has a state, such a function call might put some byte sequences
				1831	in the output buffer, which perform the necessary state changes. The
				1832	next call with @var{inbuf} not being a null pointer then simply goes on
				1833	from the initial state. It is important that the programmer never makes
				1834	any assumption as to whether the conversion has to deal with states.
				1835	Even if the input and output character sets are not stateful, the
				1836	implementation might still have to keep states. This is due to the
				1837	implementation chosen for @theglibc{} as it is described below.
				1838	Therefore an @code{iconv} call to reset the state should always be
				1839	performed if some protocol requires this for the output text.
				1840
				1841	The conversion stops for one of three reasons. The first is that all
				1842	characters from the input buffer are converted. This actually can mean
				1843	two things: either all bytes from the input buffer are consumed or
				1844	there are some bytes at the end of the buffer that possibly can form a
				1845	complete character but the input is incomplete. The second reason for a
				1846	stop is that the output buffer is full. And the third reason is that
				1847	the input contains invalid characters.
				1848
				1849	In all of these cases the buffer pointers after the last successful
				1850	conversion, for input and output buffer, are stored in @code{*@var{inbuf}} and
				1851	@code{*@var{outbuf}}, and the available room in each buffer is stored in
				1852	@code{*@var{inbytesleft}} and @code{*@var{outbytesleft}}.
				1853
				1854	Since the character sets selected in the @code{iconv_open} call can be
				1855	almost arbitrary, there can be situations where the input buffer contains
				1856	valid characters, which have no identical representation in the output
				1857	character set. The behavior in this situation is undefined. The
				1858	@emph{current} behavior of @theglibc{} in this situation is to
				1859	return with an error immediately. This certainly is not the most
				1860	desirable solution; therefore, future versions will provide better ones,
				1861	but they are not yet finished.
				1862
				1863	If all input from the input buffer is successfully converted and stored
				1864	in the output buffer, the function returns the number of non-reversible
				1865	conversions performed. In all other cases the return value is
				1866	@code{(size_t) -1} and @code{errno} is set appropriately. In such cases
				1867	the value pointed to by @var{inbytesleft} is nonzero.
				1868
				1869	@table @code
				1870	@item EILSEQ
				1871	The conversion stopped because of an invalid byte sequence in the input.
				1872	After the call, @code{*@var{inbuf}} points at the first byte of the
				1873	invalid byte sequence.
				1874
				1875	@item E2BIG
				1876	The conversion stopped because it ran out of space in the output buffer.
				1877
				1878	@item EINVAL
				1879	The conversion stopped because of an incomplete byte sequence at the end
				1880	of the input buffer.
				1881
				1882	@item EBADF
				1883	The @var{cd} argument is invalid.
				1884	@end table
				1885
				1886	@pindex iconv.h
				1887	The @code{iconv} function was introduced in the XPG2 standard and is
				1888	declared in the @file{iconv.h} header.
				1889	@end deftypefun
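
The stopping conditions and error codes described above can be told
apart with a few lines of code.  What follows is only a minimal sketch:
the helper name @code{convert_buffer} and its parameters are assumptions
made for illustration and are not part of any library.  It performs one
@code{iconv} call on a complete buffer, flushes a possible shift
sequence, and reports how many bytes were produced.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes at IN from charset FROM to
   charset TO, storing at most OUTSIZE bytes at OUT.  Returns the number
   of bytes written, or (size_t) -1 with errno set by iconv_open/iconv.  */
size_t
convert_buffer (const char *to, const char *from,
                const char *in, size_t inlen,
                char *out, size_t outsize)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;   /* iconv only reads through this pointer.  */
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;

  size_t rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (rc != (size_t) -1)
    /* Write the sequence that returns a stateful target encoding to
       its initial state, if one is needed.  */
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);

  int saved_errno = errno;     /* iconv_close may change errno.  */
  iconv_close (cd);

  if (rc == (size_t) -1)
    {
      /* EILSEQ: invalid input.  E2BIG: OUT is too small.
         EINVAL: the input ends in an incomplete sequence.  */
      errno = saved_errno;
      return (size_t) -1;
    }
  return outsize - outleft;
}
```

A caller that wants to continue after @code{E2BIG} or @code{EINVAL}
would need the updated pointers as well; this sketch discards them for
brevity.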
				1890
				1891	The definition of the @code{iconv} function is quite good overall. It
				1892	provides quite flexible functionality. The only problems lie in the
				1893	boundary cases, which are incomplete byte sequences at the end of the
				1894	input buffer and invalid input. A third problem, which is not really
				1895	a design problem, is the way conversions are selected. The standard
				1896	does not say anything about the legitimate names or about a minimal set
				1897	of available conversions. We will see how this negatively impacts other
				1898	implementations, as demonstrated below.
				1899
				1900	@node iconv Examples
				1901	@subsection A complete @code{iconv} example
				1902
				1903	The example below features a solution for a common problem. Given that
				1904	one knows the internal encoding used by the system for @code{wchar_t}
				1905	strings, one often is in the position to read text from a file and store
				1906	it in wide character buffers. One can do this using @code{mbsrtowcs},
				1907	but then we run into the problems discussed above.
				1908
				1909	@smallexample
				1910	int
				1911	file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
				1912	@{
				1913	  char inbuf[BUFSIZ];
				1914	  size_t insize = 0;
				1915	  char *wrptr = (char *) outbuf;
				1916	  int result = 0;
				1917	  iconv_t cd;
				1918
				1919	  cd = iconv_open ("WCHAR_T", charset);
				1920	  if (cd == (iconv_t) -1)
				1921	    @{
				1922	      /* @r{Something went wrong.} */
				1923	      if (errno == EINVAL)
				1924	        error (0, 0, "conversion from '%s' to wchar_t not available",
				1925	               charset);
				1926	      else
				1927	        perror ("iconv_open");
				1928
				1929	      /* @r{Terminate the output string.} */
				1930	      *outbuf = L'\0';
				1931
				1932	      return -1;
				1933	    @}
				1934
				1935	  while (avail > 0)
				1936	    @{
				1937	      size_t nread;
				1938	      size_t nconv;
				1939	      char *inptr = inbuf;
				1940
				1941	      /* @r{Read more input.} */
				1942	      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
				1943	      if (nread == 0)
				1944	        @{
				1945	          /* @r{When we come here the file is completely read.}
				1946	             @r{This still could mean there are some unused}
				1947	             @r{characters in the @code{inbuf}. Put them back.} */
				1948	          if (lseek (fd, -insize, SEEK_CUR) == -1)
				1949	            result = -1;
				1950
				1951	          /* @r{Now write out the byte sequence to get into the}
				1952	             @r{initial state if this is necessary.} */
				1953	          iconv (cd, NULL, NULL, &wrptr, &avail);
				1954
				1955	          break;
				1956	        @}
				1957	      insize += nread;
				1958
				1959	      /* @r{Do the conversion.} */
				1960	      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
				1961	      if (nconv == (size_t) -1)
				1962	        @{
				1963	          /* @r{Not everything went right. It might only be}
				1964	             @r{an unfinished byte sequence at the end of the}
				1965	             @r{buffer. Or it is a real problem.} */
				1966	          if (errno == EINVAL)
				1967	            /* @r{This is harmless. Simply move the unused}
				1968	               @r{bytes to the beginning of the buffer so that}
				1969	               @r{they can be used in the next round.} */
				1970	            memmove (inbuf, inptr, insize);
				1971	          else
				1972	            @{
				1973	              /* @r{It is a real problem. Maybe we ran out of}
				1974	                 @r{space in the output buffer or we have invalid}
				1975	                 @r{input. In any case back the file pointer to}
				1976	                 @r{the position of the last processed byte.} */
				1977	              lseek (fd, -insize, SEEK_CUR);
				1978	              result = -1;
				1979	              break;
				1980	            @}
				1981	        @}
				1982	    @}
				1983
				1984	  /* @r{Terminate the output string.} */
				1985	  if (avail >= sizeof (wchar_t))
				1986	    *((wchar_t *) wrptr) = L'\0';
				1987
				1988	  if (iconv_close (cd) != 0)
				1989	    perror ("iconv_close");
				1990
				1991	  return (wchar_t *) wrptr - outbuf;
				1992	@}
				1993	@end smallexample
				1994
				1995	@cindex stateful
				1996	This example shows the most important aspects of using the @code{iconv}
				1997	functions. It shows how successive calls to @code{iconv} can be used to
				1998	convert large amounts of text. The user does not have to care about
				1999	stateful encodings as the functions take care of everything.
				2000
				2001	An interesting point is the case where @code{iconv} returns an error and
				2002	@code{errno} is set to @code{EINVAL}. This is not really an error in the
				2003	transformation. It can happen whenever the input character set contains
				2004	byte sequences of more than one byte for some character and texts are not
				2005	processed in one piece. In this case there is a chance that a multibyte
				2006	sequence is cut. The caller can then simply read the remainder of the
				2007	text and feed the offending bytes together with new characters from the
				2008	input to @code{iconv} and continue the work. The internal state kept in
				2009	the descriptor is @emph{not} left unspecified after such an event, unlike
				2010	the state of the conversion functions from the @w{ISO C} standard.
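
The carry-over of an incomplete sequence can be isolated in a small
helper.  The following is a sketch under stated assumptions: the type
@code{struct pending} and the function @code{feed_chunk} are
hypothetical names, and the fixed buffer sizes are chosen only to keep
the example short.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical carry-over state: the bytes of a multibyte sequence that
   was cut off at the end of the previous chunk.  */
struct pending
{
  char bytes[16];
  size_t len;
};

/* Convert one chunk of input, prepending what the previous call left in
   P.  Returns 0 on success, including the harmless case of an
   incomplete tail (which is saved in P), and -1 on a real error.  This
   sketch uses a fixed scratch buffer and rejects larger chunks with
   errno set to ERANGE.  */
int
feed_chunk (iconv_t cd, struct pending *p,
            const char *chunk, size_t chunk_len,
            char **outptr, size_t *outleft)
{
  char buf[256];
  char *inptr = buf;
  size_t inleft = p->len + chunk_len;

  if (inleft > sizeof buf)
    {
      errno = ERANGE;
      return -1;
    }
  memcpy (buf, p->bytes, p->len);
  memcpy (buf + p->len, chunk, chunk_len);
  p->len = 0;

  if (iconv (cd, &inptr, &inleft, outptr, outleft) == (size_t) -1)
    {
      if (errno != EINVAL || inleft > sizeof p->bytes)
        return -1;
      /* EINVAL: the input ended inside a multibyte sequence.  Keep the
         unused bytes for the next call.  */
      memcpy (p->bytes, inptr, inleft);
      p->len = inleft;
    }
  return 0;
}
```

A complete program would also have to keep the unconverted input when
@code{iconv} fails with @code{E2BIG}; the sketch treats that case as an
error.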
				2011
				2012	The example also shows the problem of using wide character strings with
				2013	@code{iconv}. As explained in the description of the @code{iconv}
				2014	function above, the function always takes a pointer to a @code{char}
				2015	array and the available space is measured in bytes. In the example, the
				2016	output buffer is a wide character buffer; therefore, we use a local
				2017	variable @var{wrptr} of type @code{char *}, which is used in the
				2018	@code{iconv} calls.
				2019
				2020	This looks rather innocent but can lead to problems on platforms that
				2021	have tight restrictions on alignment. Therefore the caller of @code{iconv}
				2022	has to make sure that the pointers passed are suitable for access of
				2023	characters from the appropriate character set. Since, in the
				2024	above case, the input parameter to the function is a @code{wchar_t}
				2025	pointer, this is the case (unless the user violates alignment when
				2026	computing the parameter). But in other situations, especially when
				2027	writing generic functions where one does not know what type of character
				2028	set one uses and, therefore, treats text as a sequence of bytes, it might
				2029	become tricky.
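
One way to stay on the safe side is to let the caller own a real
@code{wchar_t} array and to create the @code{char *} view only for the
@code{iconv} call.  The sketch below illustrates this; the function name
@code{utf8_to_wcs} is an assumption, and the source character set is
fixed to UTF-8 only to keep the example small.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: convert INLEN bytes of UTF-8 text into the
   wchar_t array OUT, which has room for OUTCHARS elements including the
   terminator.  Because OUT is a genuine wchar_t array, the char * view
   handed to iconv is suitably aligned.  Returns the number of wide
   characters stored, or (size_t) -1 on failure.  */
size_t
utf8_to_wcs (const char *in, size_t inlen, wchar_t *out, size_t outchars)
{
  if (outchars == 0)
    return (size_t) -1;

  iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;
  char *wrptr = (char *) out;          /* Byte view of the wide buffer.  */
  size_t inleft = inlen;
  /* Space is measured in bytes; one element is kept for the L'\0'.  */
  size_t avail = (outchars - 1) * sizeof (wchar_t);

  size_t rc = iconv (cd, &inptr, &inleft, &wrptr, &avail);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  size_t n = (size_t) ((wchar_t *) wrptr - out);
  out[n] = L'\0';
  return n;
}
```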
				2030
				2031	@node Other iconv Implementations
				2032	@subsection Some Details about other @code{iconv} Implementations
				2033
				2034	This is not really the place to discuss the @code{iconv} implementation
				2035	of other systems but it is necessary to know a bit about them to write
				2036	portable programs. The above-mentioned problems with the specification
				2037	of the @code{iconv} functions can lead to portability issues.
				2038
				2039	The first thing to notice is that, due to the large number of character
				2040	sets in use, it is certainly not practical to encode the conversions
				2041	directly in the C library. Therefore, the conversion information must
				2042	come from files outside the C library. This is usually done in one or
				2043	both of the following ways:
				2044
				2045	@itemize @bullet
				2046	@item
				2047	The C library contains a set of generic conversion functions that can
				2048	read the needed conversion tables and other information from data files.
				2049	These files get loaded when necessary.
				2050
				2051	This solution is problematic as it requires a great deal of effort to
				2052	apply to all character sets (potentially an infinite set). The
				2053	differences in the structure of the different character sets are so large
				2054	that many different variants of the table-processing functions must be
				2055	developed. In addition, the generic nature of these functions makes them
				2056	slower than specifically implemented functions.
				2057
				2058	@item
				2059	The C library only contains a framework that can dynamically load
				2060	object files and execute the conversion functions contained therein.
				2061
				2062	This solution provides much more flexibility. The C library itself
				2063	contains only very little code and therefore reduces the general memory
				2064	footprint. Also, with a documented interface between the C library and
				2065	the loadable modules it is possible for third parties to extend the set
				2066	of available conversion modules. A drawback of this solution is that
				2067	dynamic loading must be available.
				2068	@end itemize
				2069
				2070	Some implementations in commercial Unices implement a mixture of these
				2071	possibilities; the majority implement only the second solution. Using
				2072	loadable modules moves the code out of the library itself and keeps
				2073	the door open for extensions and improvements, but this design is also
				2074	limiting on some platforms since not many platforms support dynamic
				2075	loading in statically linked programs. On platforms without this
				2076	capability it is therefore not possible to use this interface in
				2077	statically linked programs. @Theglibc{} has, on ELF platforms, no
				2078	problems with dynamic loading in these situations; therefore, this
				2079	point is moot. The danger is that one gets acquainted with this
				2080	situation and forgets about the restrictions on other systems.
				2081
				2082	A second thing to know about other @code{iconv} implementations is that
				2083	the number of available conversions is often very limited. Some
				2084	implementations provide, in the standard release (not special
				2085	international or developer releases), at most 100 to 200 conversion
				2086	possibilities. This does not mean 200 different character sets are
				2087	supported; for example, conversions from one character set to a set of 10
				2088	others might count as 10 conversions. Together with the other direction
				2089	this makes 20 conversion possibilities used up by one character set. One
				2090	can imagine the thin coverage these platforms provide. Some Unix vendors
				2091	even provide only a handful of conversions, which renders them useless for
				2092	almost all uses.
				2093
				2094	This directly leads to a third and probably the most problematic point.
				2095	Given the way the @code{iconv} conversion functions are implemented on all
				2096	known Unix systems, the availability of the conversion functions from
				2097	character set @math{@cal{A}} to @math{@cal{B}} and of the conversion from
				2098	@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
				2099	conversion from @math{@cal{A}} to @math{@cal{C}} is available.
				2100
				2101	This might not seem unreasonable or problematic at first, but it is
				2102	quite a big problem, as one will notice shortly after hitting it. To show
				2103	the problem, assume we write a program that has to convert from
				2104	@math{@cal{A}} to @math{@cal{C}}. A call like
				2105
				2106	@smallexample
				2107	cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
				2108	@end smallexample
				2109
				2110	@noindent
				2111	fails according to the assumption above. But what does the program
				2112	do now? The conversion is necessary; therefore, simply giving up is not
				2113	an option.
				2114
				2115	This is a nuisance. The @code{iconv} function should take care of this.
				2116	But how should the program proceed from here on? If it tries to convert
				2117	to character set @math{@cal{B}} first, the two @code{iconv_open}
				2118	calls
				2119
				2120	@smallexample
				2121	cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
				2122	@end smallexample
				2123
				2124	@noindent
				2125	and
				2126
				2127	@smallexample
				2128	cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
				2129	@end smallexample
				2130
				2131	@noindent
				2132	will succeed, but how to find @math{@cal{B}}?
				2133
				2134	Unfortunately, the answer is: there is no general solution. On some
				2135	systems guessing might help. On those systems most character sets can
				2136	convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides
				2137	this, only some very system-specific methods can help. Since the
				2138	conversion functions come from loadable modules and these modules must
				2139	be stored somewhere in the filesystem, one @emph{could} try to find them
				2140	and determine from the available files which conversions are available
				2141	and whether there is an indirect route from @math{@cal{A}} to
				2142	@math{@cal{C}}.
				2143
				2144	This example shows one of the design errors of @code{iconv} mentioned
				2145	above. It should at least be possible to determine the list of available
				2146	conversions programmatically so that if @code{iconv_open} says there is no
				2147	such conversion, one could make sure this also is true for indirect
				2148	routes.
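
If a suitable intermediate character set is known or guessed, the
two-step route described above can be spelled out by hand.  The
following is a minimal sketch, not a general solution: the names
@code{one_step} and @code{convert_via_pivot} are hypothetical, the
intermediate text is held in a fixed-size scratch buffer, and the
caller has to supply the pivot name (UTF-8 being the usual guess).

```c
#include <iconv.h>
#include <stddef.h>

/* Perform a single conversion of a complete buffer.  Returns the number
   of bytes written to OUT, or (size_t) -1 on any failure.  */
static size_t
one_step (const char *to, const char *from,
          const char *in, size_t inlen, char *out, size_t outsize)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;
  size_t inleft = inlen;
  size_t outleft = outsize;

  size_t rc = iconv (cd, &inptr, &inleft, &out, &outleft);
  if (rc != (size_t) -1)
    /* Return a stateful target encoding to its initial state.  */
    rc = iconv (cd, NULL, NULL, &out, &outleft);
  iconv_close (cd);

  return rc == (size_t) -1 ? (size_t) -1 : outsize - outleft;
}

/* Hypothetical helper: convert from FROM to TO by way of PIVOT, for
   systems where the direct conversion is not offered.  */
size_t
convert_via_pivot (const char *to, const char *pivot, const char *from,
                   const char *in, size_t inlen,
                   char *out, size_t outsize)
{
  char tmp[1024];              /* Scratch space for the pivot text.  */
  size_t n = one_step (pivot, from, in, inlen, tmp, sizeof tmp);
  if (n == (size_t) -1)
    return (size_t) -1;
  return one_step (to, pivot, tmp, n, out, outsize);
}
```

The sketch does not solve the real difficulty, which is finding a pivot
that both conversions support; it only shows the mechanics once one has
been chosen.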
				2149
				2150	@node glibc iconv Implementation
				2151	@subsection The @code{iconv} Implementation in @theglibc{}
				2152
				2153	After reading about the problems of @code{iconv} implementations in the
				2154	last section it is certainly good to note that the implementation in
				2155	@theglibc{} has none of the problems mentioned above. What
				2156	follows is a step-by-step analysis of the points raised above. The
				2157	evaluation is based on the current state of the development (as of
				2158	January 1999). The development of the @code{iconv} functions is not
				2159	complete, but basic functionality has solidified.
				2160
				2161	@Theglibc{}'s @code{iconv} implementation uses shared loadable
				2162	modules to implement the conversions. A very small number of
				2163	conversions are built into the library itself but these are only rather
				2164	trivial conversions.
				2165
				2166	All the benefits of loadable modules are available in the @glibcadj{}
				2167	implementation. This is especially appealing since the interface is
				2168	well documented (see below), and it, therefore, is easy to write new
				2169	conversion modules. The drawback of using loadable objects is not a
				2170	problem in @theglibc{}, at least on ELF systems. Since the
				2171	library is able to load shared objects even in statically linked
				2172	binaries, static linking need not be forbidden in case one wants to use
				2173	@code{iconv}.
				2174
				2175	The second mentioned problem is the number of supported conversions.
				2176	Currently, @theglibc{} supports more than 150 character sets. Because of
				2177	the way the implementation is designed, the number of supported conversions
				2178	is greater than 22350 (@math{150} times @math{149}). If any conversion
				2179	from or to a character set is missing, it can be added easily.
				2180
				2181	Particularly impressive as it may be, this high number is due to the
				2182	fact that the @glibcadj{} implementation of @code{iconv} does not have
				2183	the third problem mentioned above (i.e., whenever there is a conversion
				2184	from a character set @math{@cal{A}} to @math{@cal{B}} and from
				2185	@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
				2186	@math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open}
				2187	returns an error and sets @code{errno} to @code{EINVAL}, there is no
				2188	known way, directly or indirectly, to perform the wanted conversion.
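
A program can use this property as a simple availability test.  The
helper below is a sketch; its name @code{conversion_available} is an
assumption made for this example.

```c
#include <errno.h>
#include <iconv.h>

/* Hypothetical helper: return 1 if a conversion from FROM to TO can be
   opened and 0 otherwise.  On failure errno is left as set by
   iconv_open; EINVAL means that no route is known.  */
int
conversion_available (const char *to, const char *from)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return 0;
  iconv_close (cd);
  return 1;
}
```

On other systems a result of 0 only says that the direct conversion is
missing, for the reasons given in the previous section.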
				2189
				2190	@cindex triangulation
				2191	Triangulation is achieved by providing for each character set a
				2192	conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646}
				2193	as an intermediate representation it is possible to @dfn{triangulate}
				2194	(i.e., convert with an intermediate representation).
				2195
				2196	There is no inherent requirement to provide a conversion to @w{ISO
				2197	10646} for a new character set, and it is also possible to provide other
				2198	conversions where neither source nor destination character set is @w{ISO
				2199	10646}. The existing set of conversions is simply meant to cover all
				2200	conversions that might be of interest.
				2201
				2202	@cindex ISO-2022-JP
				2203	@cindex EUC-JP
				2204	All currently available conversions use the triangulation method above,
				2205	making some conversions run unnecessarily slowly. If, for example, somebody
				2206	often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
				2207	would involve direct conversion between the two character sets, skipping
				2208	the intermediate conversion to @w{ISO 10646}. The two character sets of interest
				2209	are much more similar to each other than to @w{ISO 10646}.
				2210
				2211	In such a situation one easily can write a new conversion and provide it
				2212	as a better alternative. The @glibcadj{} @code{iconv} implementation
				2213	would automatically use the module implementing the conversion if it is
				2214	specified to be more efficient.
				2215
				2216	@subsubsection Format of @file{gconv-modules} files
				2217
				2218	All information about the available conversions comes from a file named
				2219	@file{gconv-modules}, which can be found in any of the directories along
				2220	the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented
				2221	text files, where each of the lines has one of the following formats:
				2222
				2223	@itemize @bullet
				2224	@item
				2225	If the first non-whitespace character is a @kbd{#} the line contains only
				2226	comments and is ignored.
				2227
				2228	@item
				2229	Lines starting with @code{alias} define an alias name for a character
				2230	set. Two more words are expected on the line. The first word
				2231	defines the alias name, and the second defines the original name of the
				2232	character set. The effect is that it is possible to use the alias name
				2233	in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and
				2234	achieve the same result as when using the real character set name.
				2235
				2236	This is quite important as a character set often has many different
				2237	names. There is normally an official name but this need not correspond to
				2238	the most popular name. Besides this, many character sets have special
				2239	names that are somehow constructed. For example, all character sets
				2240	specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}}
				2241	where @var{nnn} is the registration number. This allows programs that
				2242	know about the registration number to construct character set names and
				2243	use them in @code{iconv_open} calls. More on the available names and
				2244	aliases follows below.
				2245
				2246	@item
				2247	Lines starting with @code{module} introduce an available conversion
				2248	module. These lines must contain three or four more words.
				2249
				2250	The first word specifies the source character set, the second word the
				2251	destination character set of conversion implemented in this module, and
				2252	the third word is the name of the loadable module. The filename is
				2253	constructed by appending the usual shared object suffix (normally
				2254	@file{.so}) and this file is then supposed to be found in the same
				2255	directory the @file{gconv-modules} file is in. The last word on the line,
				2256	which is optional, is a numeric value representing the cost of the
				2257	conversion. If this word is missing, a cost of @math{1} is assumed. The
				2258	numeric value itself does not matter that much; what counts are the
				2259	relative values of the sums of costs for all possible conversion paths.
				2260	Below is a more precise description of the use of the cost value.
				2261	@end itemize
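
The three line formats can be recognized with very little code.  The
following parser is only an illustration of the format as described
above and is not the code used by @theglibc{}; the names
@code{struct gm_line} and @code{parse_gconv_line} as well as the field
sizes are assumptions.

```c
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_OTHER, LINE_COMMENT, LINE_ALIAS, LINE_MODULE };

/* Hypothetical record for one parsed line.  */
struct gm_line
{
  enum line_kind kind;
  char a[64];      /* Alias name, or source character set.  */
  char b[64];      /* Original name, or destination character set.  */
  char file[64];   /* Module name without suffix (module lines only).  */
  int cost;        /* Module lines only; 1 if the word is missing.  */
};

enum line_kind
parse_gconv_line (const char *line, struct gm_line *out)
{
  char word[16];

  memset (out, 0, sizeof *out);
  out->kind = LINE_OTHER;

  /* A line whose first non-whitespace character is '#' is a comment.  */
  line += strspn (line, " \t");
  if (*line == '#')
    return out->kind = LINE_COMMENT;

  if (sscanf (line, "%15s", word) != 1)
    return LINE_OTHER;

  if (strcmp (word, "alias") == 0)
    {
      if (sscanf (line, "%*s %63s %63s", out->a, out->b) == 2)
        out->kind = LINE_ALIAS;
    }
  else if (strcmp (word, "module") == 0)
    {
      int n = sscanf (line, "%*s %63s %63s %63s %d",
                      out->a, out->b, out->file, &out->cost);
      if (n >= 3)
        {
          if (n == 3)
            out->cost = 1;     /* The optional cost defaults to 1.  */
          out->kind = LINE_MODULE;
        }
    }
  return out->kind;
}
```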
				2262
				2263	Return to the example above, where one has written a module to directly
				2264	convert from ISO-2022-JP to EUC-JP and back. All that has to be done is
				2265	to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a directory
				2266	and add a file @file{gconv-modules} with the following content in the
				2267	same directory:
				2268
				2269	@smallexample
				2270	module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
				2271	module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
				2272	@end smallexample
				2273
				2274	To see why this is sufficient, it is necessary to understand how the
				2275	conversion used by @code{iconv} (and described in the descriptor) is
				2276	selected. The approach to this problem is quite simple.
				2277
				2278	At the first call of the @code{iconv_open} function the program reads
				2279	all available @file{gconv-modules} files and builds up two tables: one
				2280	containing all the known aliases and another that contains the
				2281	information about the conversions and which shared object implements
				2282	them.
				2283
				2284	@subsubsection Finding the conversion path in @code{iconv}
				2285
				2286	The set of available conversions forms a directed graph with weighted
				2287	edges. The weights on the edges are the costs specified in the
				2288	@file{gconv-modules} files. The @code{iconv_open} function uses an
				2289	algorithm suitable for searching for the best path in such a graph and so
				2290	constructs a list of conversions that must be performed in succession
				2291	to get the transformation from the source to the destination character
				2292	set.
				2293
				2294	Explaining why the above @file{gconv-modules} file allows the
				2295	@code{iconv} implementation to resolve the specific ISO-2022-JP to
				2296	EUC-JP conversion module instead of the conversion coming with the
				2297	library itself is straightforward. Since the latter conversion takes two
				2298	steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
				2299	EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules}
				2300	file, however, specifies that the new conversion module can perform this
				2301	conversion with only the cost of @math{1}.
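
The cost comparison can be reproduced with any shortest-path search.
The text does not say which algorithm @code{iconv_open} uses, so the
sketch below simply picks Bellman-Ford relaxation over a tiny edge
list; the enumeration constants and the names @code{struct conv_edge}
and @code{cheapest_route} are assumptions made for this example.

```c
#include <limits.h>
#include <stddef.h>

/* The three character sets of the running example.  */
enum { CS_ISO2022JP, CS_INTERNAL, CS_EUCJP, CS_COUNT };

/* One module line: a conversion from FROM to TO with the given cost.  */
struct conv_edge
{
  int from;
  int to;
  int cost;
};

/* Return the cheapest total cost of converting FROM to TO using the
   given edges, or -1 if no route exists.  */
int
cheapest_route (const struct conv_edge *edges, size_t nedges,
                int from, int to)
{
  int dist[CS_COUNT];

  for (int i = 0; i < CS_COUNT; i++)
    dist[i] = INT_MAX;
  dist[from] = 0;

  /* Bellman-Ford: CS_COUNT - 1 rounds of relaxing every edge.  */
  for (int round = 0; round < CS_COUNT - 1; round++)
    for (size_t e = 0; e < nedges; e++)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[to] == INT_MAX ? -1 : dist[to];
}
```

With only the two edges through the intermediate representation the
result is 2; adding the direct edge with cost 1 lowers it to 1, which
is why the new module is preferred.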
				2302
				2303	A mysterious item about the @file{gconv-modules} file above (and also
				2304	the file coming with @theglibc{}) are the names of the character
				2305	sets specified in the @code{module} lines. Why do almost all the names
				2306	end in @code{//}? And this is not all: the names can actually be
				2307	regular expressions. At this point in time this mystery should not be
				2308	revealed, unless you have the relevant spell-casting materials: ashes
				2309	from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
				2310	blessed by St.@: Emacs, assorted herbal roots from Central America, sand
				2311	from Cebu, etc. Sorry! @strong{The part of the implementation where
				2312	this is used is not yet finished. For now please simply follow the
				2313	existing examples. It'll become clearer once it is. --drepper}
				2314
				2315	A last remark about the @file{gconv-modules} is about the names not
				2316	ending with @code{//}. A character set named @code{INTERNAL} is often
				2317	mentioned. From the discussion above and the chosen name it should have
				2318	become clear that this is the name for the representation used in the
				2319	intermediate step of the triangulation. We have said that this is UCS-4
				2320	but actually that is not quite right. The UCS-4 specification also
				2321	includes the specification of the byte ordering used. Since a UCS-4 value
				2322	consists of four bytes, a stored value is affected by byte ordering. The
				2323	internal representation is @emph{not} the same as UCS-4 in case the byte
				2324	ordering of the processor (or at least the running process) is not the
				2325	same as the one required for UCS-4. This is done for performance reasons
				2326	as one does not want to perform unnecessary byte-swapping operations if
				2327	one is not interested in actually seeing the result in UCS-4. To avoid
				2328	trouble with endianness, the internal representation consistently is named
				2329	@code{INTERNAL} even on big-endian systems where the representations are
				2330	identical.
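
The difference between the two representations is only the order of
the four bytes in memory.  The following sketch makes this visible;
the function names are hypothetical and nothing here is taken from the
library sources.

```c
#include <stdint.h>
#include <string.h>

/* Store CODE the way the text describes UCS-4: most significant byte
   first, independent of the host.  */
void
store_ucs4 (unsigned char out[4], uint32_t code)
{
  out[0] = (unsigned char) (code >> 24);
  out[1] = (unsigned char) (code >> 16);
  out[2] = (unsigned char) (code >> 8);
  out[3] = (unsigned char) code;
}

/* Store CODE in the byte order of the host, in the manner of the
   INTERNAL representation: no byte swapping is ever needed.  */
void
store_internal (unsigned char out[4], uint32_t code)
{
  memcpy (out, &code, sizeof code);
}

/* Return nonzero if both representations coincide on this host, which
   is the case on big-endian machines only.  */
int
internal_equals_ucs4 (void)
{
  unsigned char a[4];
  unsigned char b[4];

  store_ucs4 (a, 0x20AC);
  store_internal (b, 0x20AC);
  return memcmp (a, b, 4) == 0;
}
```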
				2331
				2332	@subsubsection @code{iconv} module data structures
				2333
				2334	So far this section has described how modules are located and considered
				2335	to be used. What remains to be described is the interface of the modules
				2336	so that one can write new ones. This section describes the interface as
				2337	it is in use in January 1999. The interface will change a bit in the
				2338	future but, with luck, only in an upwardly compatible way.
				2339
				2340	The definitions necessary to write new modules are publicly available
				2341	in the non-standard header @file{gconv.h}. The following text,
				2342	therefore, describes the definitions from this header file. First,
				2343	however, it is necessary to get an overview.
				2344
				2345	From the perspective of the user of @code{iconv} the interface is quite
				2346	simple: the @code{iconv_open} function returns a handle that can be used
				2347	in calls to @code{iconv}, and finally the handle is freed with a call to
				2348	@code{iconv_close}. The problem is that the handle has to be able to
				2349	represent the possibly long sequences of conversion steps and also the
				2350	state of each conversion since the handle is all that is passed to the
				2351	@code{iconv} function. Therefore, the data structures are really the
				2352	elements necessary to understanding the implementation.
				2353
				2354	We need two different kinds of data structures. The first describes the
				2355	conversion and the second describes the state etc. There are really two
				2356	type definitions like this in @file{gconv.h}.
				2357	@pindex gconv.h
				2358
				2359	@comment gconv.h
				2360	@comment GNU
				2361	@deftp {Data type} {struct __gconv_step}
				2362	This data structure describes one conversion a module can perform. For
				2363	each function in a loaded module with conversion functions there is
				2364	exactly one object of this type. This object is shared by all users of
				2365	the conversion (i.e., this object does not contain any information
				2366	corresponding to an actual conversion; it only describes the conversion
				2367	itself).
				2368
				2369	@table @code
				2370	@item struct __gconv_loaded_object *__shlib_handle
				2371	@itemx const char *__modname
				2372	@itemx int __counter
				2373	All these elements of the structure are used internally in the C library
				2374	to coordinate loading and unloading the shared object. One must not expect any
				2375	of the other elements to be available or initialized.
				2376
				2377	@item const char *__from_name
				2378	@itemx const char *__to_name
				2379	@code{__from_name} and @code{__to_name} contain the names of the source and
				2380	destination character sets. They can be used to identify the actual
				2381	conversion to be carried out since one module might implement conversions
				2382	for more than one character set and/or direction.
				2383
				2384	@item gconv_fct __fct
				2385	@itemx gconv_init_fct __init_fct
				2386	@itemx gconv_end_fct __end_fct
				2387	These elements contain pointers to the functions in the loadable module.
				2388	The interface will be explained below.
				2389
				2390	@item int __min_needed_from
				2391	@itemx int __max_needed_from
				2392	@itemx int __min_needed_to
				2393	@itemx int __max_needed_to
				2394	These values have to be supplied in the init function of the module. The
				2395	@code{__min_needed_from} value specifies how many bytes a character of
				2396	the source character set at least needs. The @code{__max_needed_from}
				2397	specifies the maximum value that also includes possible shift sequences.
				2398
				2399	The @code{__min_needed_to} and @code{__max_needed_to} values serve the
				2400	same purpose as @code{__min_needed_from} and @code{__max_needed_from} but
				2401	this time for the destination character set.
				2402
				2403	It is crucial that these values be accurate since otherwise the
				2404	conversion functions will have problems or not work at all.
				2405
				2406	@item int __stateful
				2407	This element must also be initialized by the init function.
				2408	@code{__stateful} is nonzero if the source character set is stateful.
				2409	Otherwise it is zero.
				2410
				2411	@item void *__data
				2412	This element can be used freely by the conversion functions in the
				2413	module. @code{__data} can be used to communicate extra information
				2414	from one call to another. @code{__data} need not be initialized if
				2415	not needed at all. If the @code{__data} element is assigned a pointer
				2416	to dynamically allocated memory (presumably in the init function) it has
				2417	to be made sure that the end function deallocates the memory. Otherwise
				2418	the application will leak memory.
				2419
				2420	It is important to be aware that this data structure is shared by all
				2421	users of this specific conversion and therefore the @code{__data}
				2422	element must not contain data specific to one specific use of the
				2423	conversion function.
				2424	@end table
				2425	@end deftp
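
To see why the four size values have to be accurate, consider how a
caller might size an output buffer from them.  The following is a
sketch under stated assumptions: @code{struct step_sizes} is a stand-in
holding only the four values, not the real structure from
@file{gconv.h}, and @code{worst_case_output} is a hypothetical helper.

```c
#include <stddef.h>

/* Stand-in for the four size fields described above.  The real
   structure is declared in gconv.h and is not used here.  */
struct step_sizes
{
  int min_needed_from;
  int max_needed_from;
  int min_needed_to;
  int max_needed_to;
};

/* Upper bound on the output produced from INLEN input bytes: the input
   holds at most INLEN / min_needed_from complete characters, and each
   of them needs at most max_needed_to bytes in the destination.  */
size_t
worst_case_output (const struct step_sizes *s, size_t inlen)
{
  size_t max_chars = inlen / (size_t) s->min_needed_from;
  return max_chars * (size_t) s->max_needed_to;
}
```

With illustrative values of 1 for @code{min_needed_from} and 4 for
@code{max_needed_to}, ten input bytes can require up to 40 output
bytes.  If a module claimed 2 for @code{max_needed_to} instead, a
buffer sized this way would be too small.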
				2426
				2427	@comment gconv.h
				2428	@comment GNU
				2429	@deftp {Data type} {struct __gconv_step_data}
				2430	This is the data structure that contains the information specific to
				2431	each use of the conversion functions.
				2432
				2433
				2434	@table @code
				2435	@item char *__outbuf
				2436	@itemx char *__outbufend
				2437	These elements specify the output buffer for the conversion step. The
				2438	@code{__outbuf} element points to the beginning of the buffer, and
				2439	@code{__outbufend} points to the byte following the last byte in the
				2440	buffer. The conversion function must not assume anything about the size
				2441	of the buffer but it can be safely assumed that there is room for at
				2442	least one complete character in the output buffer.
				2443
				2444	Once the conversion is finished, if the conversion is the last step, the
				2445	@code{__outbuf} element must be modified to point after the last byte
				2446	written into the buffer to signal how much output is available. If this
				2447	conversion step is not the last one, the element must not be modified.
				2448	The @code{__outbufend} element must not be modified.
				2449
				2450	@item int __is_last
				2451	This element is nonzero if this conversion step is the last one. This
				2452	information is necessary for the recursion. See the description of the
				2453	conversion function internals below. This element must never be
				2454	modified.
				2455
				2456	@item int __invocation_counter
				2457	The conversion function can use this element to see how many calls of
				2458	the conversion function already happened. Some character sets require a
				2459	certain prolog when generating output, and by comparing this value with
				2460	zero, one can find out whether it is the first call and whether,
				2461	therefore, the prolog should be emitted. This element must never be
				2462	modified.
				2463
				2464	@item int __internal_use
				2465	This element is another one rarely used but needed in certain
				2466	situations. It is assigned a nonzero value in case the conversion
				2467	functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the
				2468	function is not used directly through the @code{iconv} interface).
				2469
				2470	This sometimes makes a difference as it is expected that the
				2471	@code{iconv} functions are used to translate entire texts while the
				2472	@code{mbsrtowcs} functions are normally used only to convert single
				2473	strings and might be used multiple times to convert entire texts.
				2474
				2475	But in this situation we would have problems complying with some rules of
				2476	the character set specification. Some character sets require a prolog,
				2477	which must appear exactly once for an entire text. If a number of
				2478	@code{mbsrtowcs} calls are used to convert the text, only the first call
				2479	must add the prolog. However, because there is no communication between the
				2480	different calls of @code{mbsrtowcs}, the conversion functions have no
				2481	way to find this out. The situation is different for sequences
				2482	of @code{iconv} calls since the handle allows access to the needed
				2483	information.
				2484
				2485	The @code{int __internal_use} element is mostly used together with
				2486	@code{__invocation_counter} as follows:
				2487
				2488	@smallexample
				2489	if (!data->__internal_use
				2490	    && data->__invocation_counter == 0)
				2491	  /* @r{Emit prolog.} */
				2492	  @dots{}
				2493	@end smallexample
				2494
				2495	This element must never be modified.
				2496
				2497	@item mbstate_t *__statep
				2498	The @code{__statep} element points to an object of type @code{mbstate_t}
				2499	(@pxref{Keeping the state}). The conversion of a stateful character
				2500	set must use the object pointed to by @code{__statep} to store
				2501	information about the conversion state. The @code{__statep} element
				2502	itself must never be modified.
				2503
				2504	@item mbstate_t __state
				2505	This element must @emph{never} be used directly. It is only part of
				2506	this structure to have the needed space allocated.
				2507	@end table
				2508	@end deftp
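The update rule for @code{__outbuf} described above can be illustrated
with a small sketch. The structure and function used here
(@code{step_data_sketch} and @code{finish_step}) are hypothetical
stand-ins that model only three members of @code{struct
__gconv_step_data}; they are not part of @theglibc{}.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct __gconv_step_data (assumption: only
   the members discussed in the text are modeled).  */
struct step_data_sketch
{
  char *outbuf;     /* plays the role of __outbuf */
  char *outbufend;  /* plays the role of __outbufend */
  int is_last;      /* plays the role of __is_last */
};

/* Store up to LEN converted bytes in the output buffer.  Only the
   last step advances outbuf; outbufend is never modified.  Returns
   the number of bytes stored.  */
size_t
finish_step (struct step_data_sketch *data, const char *bytes, size_t len)
{
  size_t room, n;

  assert (data->outbuf <= data->outbufend);
  room = (size_t) (data->outbufend - data->outbuf);
  n = len < room ? len : room;
  memcpy (data->outbuf, bytes, n);
  if (data->is_last)
    data->outbuf += n;
  return n;
}
```

In this sketch a step that is not the last one leaves @code{outbuf}
untouched, so the caller still sees the start of the intermediate
buffer.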
				2509
				2510	@subsubsection @code{iconv} module interfaces
				2511
				2512	With the knowledge about the data structures we now can describe the
				2513	conversion function itself. To understand the interface a bit of
				2514	knowledge is necessary about the functionality in the C library that
				2515	loads the objects with the conversions.
				2516
				2517	It is often the case that one conversion is used more than once (i.e.,
				2518	there are several @code{iconv_open} calls for the same set of character
				2519	sets during one program run). The @code{mbsrtowcs} et al.@: functions in
				2520	@theglibc{} also use the @code{iconv} functionality, which
				2521	increases the number of uses of the same functions even more.
				2522
				2523	Because of this multiple use of conversions, the modules do not get
				2524	loaded exclusively for one conversion. Instead a module once loaded can
				2525	be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls
				2526	at the same time. The splitting of the information between conversion-
				2527	function-specific information and conversion data makes this possible.
				2528	The last section showed the two data structures used to do this.
				2529
				2530	This is of course also reflected in the interface and semantics of the
				2531	functions that the modules must provide. There are three functions that
				2532	must have the following names:
				2533
				2534	@table @code
				2535	@item gconv_init
				2536	The @code{gconv_init} function initializes the conversion function
				2537	specific data structure. This very same object is shared by all
				2538	conversions that use this conversion and, therefore, no state information
				2539	about the conversion itself may be stored in it. If a module
				2540	implements more than one conversion, the @code{gconv_init} function will
				2541	be called multiple times.
				2542
				2543	@item gconv_end
				2544	The @code{gconv_end} function is responsible for freeing all resources
				2545	allocated by the @code{gconv_init} function. If there is nothing to do,
				2546	this function can be missing. Special care must be taken if the module
				2547	implements more than one conversion and the @code{gconv_init} function
				2548	does not allocate the same resources for all conversions.
				2549
				2550	@item gconv
				2551	This is the actual conversion function. It is called to convert one
				2552	block of text. It gets passed the conversion step information
				2553	initialized by @code{gconv_init} and the conversion data, specific to
				2554	this use of the conversion functions.
				2555	@end table
				2556
				2557	Three data types are defined, one for each of the three module interface
				2558	functions, and these define the interface.
				2559
				2560	@comment gconv.h
				2561	@comment GNU
				2562	@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *)
				2563	This specifies the interface of the initialization function of the
				2564	module. It is called exactly once for each conversion the module
				2565	implements.
				2566
				2567	As explained in the description of the @code{struct __gconv_step} data
				2568	structure above the initialization function has to initialize parts of
				2569	it.
				2570
				2571	@table @code
				2572	@item __min_needed_from
				2573	@itemx __max_needed_from
				2574	@itemx __min_needed_to
				2575	@itemx __max_needed_to
				2576	These elements must be initialized to the exact numbers of the minimum
				2577	and maximum number of bytes used by one character in the source and
				2578	destination character sets, respectively. If the characters all have the
				2579	same size, the minimum and maximum values are the same.
				2580
				2581	@item __stateful
				2582	This element must be initialized to a nonzero value if the source
				2583	character set is stateful. Otherwise it must be zero.
				2584	@end table
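For a stateless source character set in which every character has the
same size, these elements reduce to constants. The following sketch
assumes a conversion from ISO-8859-1 (one byte per character) to the
@code{INTERNAL} character set (four bytes per character). The
structure @code{step_sketch} and the function
@code{init_latin1_to_internal} are hypothetical stand-ins, not the
real @code{struct __gconv_step} or an existing module.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the members of struct __gconv_step that the
   initialization function has to set (assumption: names without the
   leading underscores are illustrative only).  */
struct step_sketch
{
  int min_needed_from;
  int max_needed_from;
  int min_needed_to;
  int max_needed_to;
  int stateful;
};

/* Fixed-width source and destination: minimum and maximum are equal
   in both directions, and no shift state is involved.  */
void
init_latin1_to_internal (struct step_sketch *step)
{
  assert (step != NULL);
  step->min_needed_from = 1;
  step->max_needed_from = 1;
  step->min_needed_to = 4;
  step->max_needed_to = 4;
  step->stateful = 0;
}
```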
				2585
				2586	If the initialization function needs to communicate some information
				2587	to the conversion function, this communication can happen using the
				2588	@code{__data} element of the @code{__gconv_step} structure. But since
				2589	this data is shared by all the conversions, it must not be modified by
				2590	the conversion function. The example below shows how this can be used.
				2591
				2592	@smallexample
				2593	#define MIN_NEEDED_FROM 1
				2594	#define MAX_NEEDED_FROM 4
				2595	#define MIN_NEEDED_TO 4
				2596	#define MAX_NEEDED_TO 4
				2597
				2598	int
				2599	gconv_init (struct __gconv_step *step)
				2600	@{
				2601	  /* @r{Determine which direction.} */
				2602	  struct iso2022jp_data *new_data;
				2603	  enum direction dir = illegal_dir;
				2604	  enum variant var = illegal_var;
				2605	  int result;
				2606
				2607	  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
				2608	    @{
				2609	      dir = from_iso2022jp;
				2610	      var = iso2022jp;
				2611	    @}
				2612	  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
				2613	    @{
				2614	      dir = to_iso2022jp;
				2615	      var = iso2022jp;
				2616	    @}
				2617	  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
				2618	    @{
				2619	      dir = from_iso2022jp;
				2620	      var = iso2022jp2;
				2621	    @}
				2622	  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
				2623	    @{
				2624	      dir = to_iso2022jp;
				2625	      var = iso2022jp2;
				2626	    @}
				2627
				2628	  result = __GCONV_NOCONV;
				2629	  if (dir != illegal_dir)
				2630	    @{
				2631	      new_data = (struct iso2022jp_data *)
				2632	        malloc (sizeof (struct iso2022jp_data));
				2633
				2634	      result = __GCONV_NOMEM;
				2635	      if (new_data != NULL)
				2636	        @{
				2637	          new_data->dir = dir;
				2638	          new_data->var = var;
				2639	          step->__data = new_data;
				2640
				2641	          if (dir == from_iso2022jp)
				2642	            @{
				2643	              step->__min_needed_from = MIN_NEEDED_FROM;
				2644	              step->__max_needed_from = MAX_NEEDED_FROM;
				2645	              step->__min_needed_to = MIN_NEEDED_TO;
				2646	              step->__max_needed_to = MAX_NEEDED_TO;
				2647	            @}
				2648	          else
				2649	            @{
				2650	              step->__min_needed_from = MIN_NEEDED_TO;
				2651	              step->__max_needed_from = MAX_NEEDED_TO;
				2652	              step->__min_needed_to = MIN_NEEDED_FROM;
				2653	              step->__max_needed_to = MAX_NEEDED_FROM + 2;
				2654	            @}
				2655
				2656	          /* @r{Yes, this is a stateful encoding.} */
				2657	          step->__stateful = 1;
				2658
				2659	          result = __GCONV_OK;
				2660	        @}
				2661	    @}
				2662
				2663	  return result;
				2664	@}
				2665	@end smallexample
				2666
				2667	The function first checks which conversion is wanted. The module from
				2668	which this function is taken implements four different conversions;
				2669	which one is selected can be determined by comparing the names. The
				2670	comparison should always be done without paying attention to the case.
				2671
				2672	Next, a data structure, which contains the necessary information about
				2673	which conversion is selected, is allocated. The data structure
				2674	@code{struct iso2022jp_data} is locally defined since, outside the
				2675	module, this data is not used at all. Please note that if all four
				2676	conversions this module supports are requested, there are four data
				2677	blocks.
				2678
				2679	One interesting thing is the initialization of the @code{__min_} and
				2680	@code{__max_} elements of the step data object. A single ISO-2022-JP
				2681	character can consist of one to four bytes. Therefore the
				2682	@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
				2683	this way. The output is always the @code{INTERNAL} character set (aka
				2684	UCS-4) and therefore each character consists of exactly four bytes. For
				2685	the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
				2686	account that escape sequences might be necessary to switch the character
				2687	sets. Therefore the @code{__max_needed_to} element for this direction
				2688	gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
				2689	two bytes needed for the escape sequences to signal the switching. The
				2690	asymmetry in the maximum values for the two directions can be explained
				2691	easily: when reading ISO-2022-JP text, escape sequences can be handled
				2692	alone (i.e., it is not necessary to process a real character since the
				2693	effect of the escape sequence can be recorded in the state information).
				2694	The situation is different for the other direction. Since it is in
				2695	general not known which character comes next, one cannot emit escape
				2696	sequences to change the state in advance. This means the escape
				2697	sequences have to be emitted together with the next character.
				2698	Therefore one needs more room than only for the character itself.
				2699
				2700	The possible return values of the initialization function are:
				2701
				2702	@table @code
				2703	@item __GCONV_OK
				2704	The initialization succeeded.
				2705	@item __GCONV_NOCONV
				2706	The requested conversion is not supported in the module. This can
				2707	happen if the @file{gconv-modules} file has errors.
				2708	@item __GCONV_NOMEM
				2709	Memory required to store additional information could not be allocated.
				2710	@end table
				2711	@end deftypevr
				2712
				2713	The function called before the module is unloaded is significantly
				2714	simpler. It often has nothing at all to do, in which case it can be left
				2715	out completely.
				2716
				2717	@comment gconv.h
				2718	@comment GNU
				2719	@deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)
				2720	The task of this function is to free all resources allocated in the
				2721	initialization function. Therefore only the @code{__data} element of
				2722	the object pointed to by the argument is of interest. Continuing the
				2723	example from the initialization function, the finalization function
				2724	looks like this:
				2725
				2726	@smallexample
				2727	void
				2728	gconv_end (struct __gconv_step *data)
				2729	@{
				2730	  free (data->__data);
				2731	@}
				2732	@end smallexample
				2733	@end deftypevr
				2734
				2735	The most important function is the conversion function itself, which can
				2736	get quite complicated for complex character sets. But since this is not
				2737	of interest here, we will only describe a possible skeleton for the
				2738	conversion function.
				2739
				2740	@comment gconv.h
				2741	@comment GNU
				2742	@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
				2743	The conversion function can be called for two basic reasons: to convert
				2744	text or to reset the state. From the description of the @code{iconv}
				2745	function it can be seen why the flushing mode is necessary. Which mode
				2746	is selected is determined by the sixth argument, an integer. This
				2747	argument being nonzero means that flushing is selected.
				2748
				2749	Common to both modes is where the output buffer can be found. The
				2750	information about this buffer is stored in the conversion step data. A
				2751	pointer to this information is passed as the second argument to this
				2752	function. The description of the @code{struct __gconv_step_data}
				2753	structure has more information on the conversion step data.
				2754
				2755	@cindex stateful
				2756	What has to be done for flushing depends on the source character set.
				2757	If the source character set is not stateful, nothing has to be done.
				2758	Otherwise the function has to emit a byte sequence to bring the state
				2759	object into the initial state. Once this has happened, the other
				2760	conversion modules in the chain of conversions have to get the same
				2761	chance. Whether another step follows can be determined from the
				2762	@code{__is_last} element of the step data structure to which the second
				2763	parameter points.
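As an illustration of the flushing mode, the following sketch emits a
reset sequence for a step that produces ISO-2022-JP-style output. It
relies on the fact that the three bytes @code{ESC ( B} switch
ISO-2022-JP back to ASCII. The state type, the function name
@code{emit_reset}, and its return convention are assumptions made for
this example; a real module keeps the state in the object that
@code{__statep} points to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-valued shift state for the sketch.  */
enum charset_state { STATE_ASCII = 0, STATE_JISX0208 = 1 };

/* Bring *STATE back to the initial state, writing the escape
   sequence at *OUTPTR if one is needed.  Returns 0 on success and
   -1 if the output buffer has no room for the sequence.  */
int
emit_reset (enum charset_state *state, char **outptr, char *outend)
{
  static const char to_ascii[3] = { 0x1b, '(', 'B' };

  assert (*outptr <= outend);
  if (*state == STATE_ASCII)
    return 0;                       /* Already in the initial state.  */
  if ((size_t) (outend - *outptr) < sizeof to_ascii)
    return -1;                      /* Caller has to report full output.  */
  memcpy (*outptr, to_ascii, sizeof to_ascii);
  *outptr += sizeof to_ascii;
  *state = STATE_ASCII;
  return 0;
}
```

Only after such a function succeeds would the step pass the flush
request on to the next step in the chain.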
				2764
				2765	The more interesting mode is when actual text has to be converted. The
				2766	first step in this case is to convert as much text as possible from the
				2767	input buffer and store the result in the output buffer. The start of the
				2768	input buffer is determined by the third argument, which is a pointer to a
				2769	pointer variable referencing the beginning of the buffer. The fourth
				2770	argument is a pointer to the byte right after the last byte in the buffer.
				2771
				2772	The conversion has to be performed according to the current state if the
				2773	character set is stateful. The state is stored in an object pointed to
				2774	by the @code{__statep} element of the step data (second argument). Once
				2775	either the input buffer is empty or the output buffer is full the
				2776	conversion stops. At this point, the pointer variable referenced by the
				2777	third parameter must point to the byte following the last processed
				2778	byte (i.e., if all of the input is consumed, this pointer and the fourth
				2779	parameter have the same value).
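The stopping rule and the update of the input pointer can be sketched
for the simplest case, a stateless fixed-width conversion from
ISO-8859-1 to four-byte characters in host byte order. This is an
illustration only: the status names and the function
@code{latin1_to_ucs4} are hypothetical, and the output buffer is passed
as explicit parameters instead of through the step data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes; they only mirror the idea of
   __GCONV_EMPTY_INPUT and __GCONV_FULL_OUTPUT.  */
enum sketch_status { SKETCH_EMPTY_INPUT, SKETCH_FULL_OUTPUT };

/* Convert as much as possible.  On return *INBUF points to the byte
   following the last processed byte and *OUTBUF points after the last
   byte written.  */
enum sketch_status
latin1_to_ucs4 (const char **inbuf, const char *inbufend,
                char **outbuf, char *outbufend)
{
  const char *in = *inbuf;
  char *out = *outbuf;
  enum sketch_status status = SKETCH_EMPTY_INPUT;

  assert (in <= inbufend && out <= outbufend);
  while (in < inbufend)
    {
      uint32_t wc;

      if ((size_t) (outbufend - out) < sizeof wc)
        {
          /* No room for one more complete character.  */
          status = SKETCH_FULL_OUTPUT;
          break;
        }
      wc = (unsigned char) *in++;
      memcpy (out, &wc, sizeof wc);
      out += sizeof wc;
    }

  *inbuf = in;
  *outbuf = out;
  return status;
}
```

If all input is consumed, @code{*inbuf} equals @code{inbufend} on
return, which matches the rule stated above.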
				2780
				2781	What now happens depends on whether this step is the last one. If it is
				2782	the last step, the only thing that has to be done is to update the
				2783	@code{__outbuf} element of the step data structure to point after the
				2784	last written byte. This update gives the caller the information on how
				2785	much text is available in the output buffer. In addition, the variable
				2786	pointed to by the fifth parameter, which is of type @code{size_t}, must
				2787	be incremented by the number of characters (@emph{not bytes}) that were
				2788	converted in a non-reversible way. Then, the function can return.
				2789
				2790	In case the step is not the last one, the later conversion functions have
				2791	to get a chance to do their work. Therefore, the appropriate conversion
				2792	function has to be called. The information about the functions is
				2793	stored in the conversion data structures, passed as the first parameter.
				2794	This information and the step data are stored in arrays, so the next
				2795	element in both cases can be found by simple pointer arithmetic:
				2796
				2797	@smallexample
				2798	int
				2799	gconv (struct __gconv_step *step, struct __gconv_step_data *data,
				2800	       const char **inbuf, const char *inbufend, size_t *written,
				2801	       int do_flush)
				2802	@{
				2803	  struct __gconv_step *next_step = step + 1;
				2804	  struct __gconv_step_data *next_data = data + 1;
				2805	  @dots{}
				2806	@end smallexample
				2807
				2808	The @code{next_step} pointer references the next step information and
				2809	@code{next_data} the next data record. The call of the next function
				2810	therefore will look similar to this:
				2811
				2812	@smallexample
				2813	next_step->__fct (next_step, next_data, &outerr, outbuf,
				2814	                  written, 0)
				2815	@end smallexample
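The array layout and the call of the next step can be tried out with a
reduced model. In the sketch below all names (@code{sk_step},
@code{sk_data}, @code{sk_fct}, @code{sk_copy_step}) are hypothetical,
the function type omits the @code{written} and flush arguments, and
each step merely copies bytes. Error recovery after the call is left
out.

```c
#include <assert.h>

struct sk_step;
struct sk_data;

/* Simplified function type; the real __gconv_fct takes two more
   arguments.  */
typedef int (*sk_fct) (struct sk_step *, struct sk_data *,
                       const char **, const char *);

struct sk_step { sk_fct fct; };
struct sk_data { char *outbuf; char *outbufend; int is_last; };

/* Copy the input into this step's output buffer, then either publish
   the result (last step) or hand it to the next step.  */
int
sk_copy_step (struct sk_step *step, struct sk_data *data,
              const char **inbuf, const char *inbufend)
{
  char *out = data->outbuf;

  assert (*inbuf <= inbufend);
  while (*inbuf < inbufend && out < data->outbufend)
    *out++ = *(*inbuf)++;

  if (data->is_last)
    {
      data->outbuf = out;
      return 0;
    }
  else
    {
      /* Steps and step data live in parallel arrays.  */
      struct sk_step *next_step = step + 1;
      struct sk_data *next_data = data + 1;
      const char *outerr = data->outbuf;

      return next_step->fct (next_step, next_data, &outerr, out);
    }
}
```

With two steps, the first one leaves its own @code{outbuf} unchanged
and only the second one advances the pointer of the final buffer.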
				2816
				2817	But this is not all. Once the function call returns, the conversion
				2818	function might have some more to do. If the return value of the function
				2819	is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
				2820	buffer. Unless the input buffer is empty, the conversion functions start
				2821	all over again and process the rest of the input buffer. If the return
				2822	value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
				2823	to recover from this.
				2824
				2825	A requirement for the conversion function is that the input buffer
				2826	pointer (the third argument) always point to the last character that
				2827	was put in converted form into the output buffer. This is trivially
				2828	true after the conversion performed in the current step, but if the
				2829	conversion functions deeper downstream stop prematurely, not all
				2830	characters from the output buffer are consumed and, therefore, the input
				2831	buffer pointers must be backed off to the right position.
				2832
				2833	Correcting the input buffers is easy to do if the input and output
				2834	character sets have a fixed width for all characters. In this situation
				2835	we can compute how many characters are left in the output buffer and,
				2836	therefore, can correct the input buffer pointer appropriately with a
				2837	similar computation. Things get tricky if either character set
				2838	has characters represented with variable length byte sequences, and it
				2839	gets even more complicated if the conversion has to take care of the
				2840	state. In these cases the conversion has to be performed once again, from
				2841	the known state before the initial conversion (i.e., if necessary the
				2842	state of the conversion has to be reset and the conversion loop has to be
				2843	executed again). The difference now is that it is known how much input
				2844	must be created, and the conversion can stop before converting the first
				2845	unused character. Once this is done the input buffer pointers must be
				2846	updated again and the function can return.
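For the easy case, where both character sets have a fixed width, the
correction is plain arithmetic. The helper below is a hypothetical
sketch (the name @code{back_off_input} and its parameters are not part
of the interface) and assumes the next step stopped on a character
boundary.

```c
#include <assert.h>
#include <stddef.h>

/* IN_END is the input position reached by this step, OUT_END the end
   of the output this step produced, and OUTERR the position up to
   which the next step consumed that output.  FROM_WIDTH and TO_WIDTH
   are the fixed character sizes of the two character sets.  Returns
   the corrected input position.  */
const char *
back_off_input (const char *in_end, const char *outerr,
                const char *out_end, size_t from_width, size_t to_width)
{
  size_t unused_bytes, unused_chars;

  assert (outerr <= out_end);
  assert (from_width > 0 && to_width > 0);
  unused_bytes = (size_t) (out_end - outerr);
  /* Assumption: only whole characters are left over.  */
  assert (unused_bytes % to_width == 0);
  unused_chars = unused_bytes / to_width;
  return in_end - unused_chars * from_width;
}
```

For example, if five one-byte characters were converted to four-byte
characters and the next step consumed only twelve output bytes, the
input pointer has to move back by two bytes.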
				2847
				2848	One final thing should be mentioned. If it is necessary for the
				2849	conversion to know whether it is the first invocation (in case a prolog
				2850	has to be emitted), the conversion function should increment the
				2851	@code{__invocation_counter} element of the step data structure just
				2852	before returning to the caller. See the description of the @code{struct
				2853	__gconv_step_data} structure above for more information on how this can
				2854	be used.
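The interplay of the counter with the prolog test can be sketched as
follows. The structure @code{counter_sketch} and the function
@code{convert_once} are hypothetical; they model only the two relevant
members of the step data, and the actual conversion is omitted.

```c
#include <assert.h>

/* Stand-in for the two members of struct __gconv_step_data used
   here.  */
struct counter_sketch
{
  int invocation_counter;  /* plays the role of __invocation_counter */
  int internal_use;        /* plays the role of __internal_use */
};

/* Model one call of a conversion function.  Returns nonzero if this
   call would have to emit the prolog.  The counter is incremented
   just before returning.  */
int
convert_once (struct counter_sketch *data)
{
  int emit_prolog;

  assert (data->invocation_counter >= 0);
  emit_prolog = !data->internal_use && data->invocation_counter == 0;

  /* The conversion itself would happen here.  */

  ++data->invocation_counter;
  return emit_prolog;
}
```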
				2855
				2856	The return value must be one of the following values:
				2857
				2858	@table @code
				2859	@item __GCONV_EMPTY_INPUT
				2860	All input was consumed and there is room left in the output buffer.
				2861	@item __GCONV_FULL_OUTPUT
				2862	No more room in the output buffer. In case this is not the last step
				2863	this value is propagated down from the call of the next conversion
				2864	function in the chain.
				2865	@item __GCONV_INCOMPLETE_INPUT
				2866	The input buffer is not entirely empty since it contains an incomplete
				2867	character sequence.
				2868	@end table
				2869
				2870	The following example provides a framework for a conversion function.
				2871	To write a new conversion, only the holes in this
				2872	implementation have to be filled in.
				2873
				2874	@smallexample
				2875	int
				2876	gconv (struct __gconv_step *step, struct __gconv_step_data *data,
				2877	       const char **inbuf, const char *inbufend, size_t *written,
				2878	       int do_flush)
				2879	@{
				2880	  struct __gconv_step *next_step = step + 1;
				2881	  struct __gconv_step_data *next_data = data + 1;
				2882	  __gconv_fct fct = next_step->__fct;
				2883	  int status;
				2884
				2885	  /* @r{If the function is called with no input this means we have}
				2886	     @r{to reset to the initial state. The possibly partly}
				2887	     @r{converted input is dropped.} */
				2888	  if (do_flush)
				2889	    @{
				2890	      status = __GCONV_OK;
				2891
				2892	      /* @r{Possibly emit a byte sequence which puts the state object}
				2893	         @r{into the initial state.} */
				2894
				2895	      /* @r{Call the steps down the chain if there are any but only}
				2896	         @r{if we successfully emitted the escape sequence.} */
				2897	      if (status == __GCONV_OK && ! data->__is_last)
				2898	        status = fct (next_step, next_data, NULL, NULL,
				2899	                      written, 1);
				2900	    @}
				2901	  else
				2902	    @{
				2903	      /* @r{We preserve the initial values of the pointer variables.} */
				2904	      const char *inptr = *inbuf;
				2905	      char *outbuf = data->__outbuf;
				2906	      char *outend = data->__outbufend;
				2907	      char *outptr;
				2908
				2909	      do
				2910	        @{
				2911	          /* @r{Remember the start value for this round.} */
				2912	          inptr = *inbuf;
				2913	          /* @r{The outbuf buffer is empty.} */
				2914	          outptr = outbuf;
				2915
				2916	          /* @r{For stateful encodings the state must be saved here.} */
				2917
				2918	          /* @r{Run the conversion loop. @code{status} is set}
				2919	             @r{appropriately afterwards.} */
				2920
				2921	          /* @r{If this is the last step, leave the loop. There is}
				2922	             @r{nothing we can do.} */
				2923	          if (data->__is_last)
				2924	            @{
				2925	              /* @r{Store information about how many bytes are}
				2926	                 @r{available.} */
				2927	              data->__outbuf = outbuf;
				2928
				2929	              /* @r{If any non-reversible conversions were performed,}
				2930	                 @r{add the number to @code{*written}.} */
				2931
				2932	              break;
				2933	            @}
				2934
				2935	          /* @r{Write out all output that was produced.} */
				2936	          if (outbuf > outptr)
				2937	            @{
				2938	              const char *outerr = data->__outbuf;
				2939	              int result;
				2940
				2941	              result = fct (next_step, next_data, &outerr,
				2942	                            outbuf, written, 0);
				2943
				2944	              if (result != __GCONV_EMPTY_INPUT)
				2945	                @{
				2946	                  if (outerr != outbuf)
				2947	                    @{
				2948	                      /* @r{Reset the input buffer pointer. We}
				2949	                         @r{document here the complex case.} */
				2950	                      size_t nstatus;
				2951
				2952	                      /* @r{Reload the pointers.} */
				2953	                      *inbuf = inptr;
				2954	                      outbuf = outptr;
				2955
				2956	                      /* @r{Possibly reset the state.} */
				2957
				2958	                      /* @r{Redo the conversion, but this time}
				2959	                         @r{the end of the output buffer is at}
				2960	                         @r{@code{outerr}.} */
				2961	                    @}
				2962
				2963	                  /* @r{Change the status.} */
				2964	                  status = result;
				2965	                @}
				2966	              else
				2967	                /* @r{All the output is consumed, we can make}
				2968	                   @r{another run if everything was ok.} */
				2969	                if (status == __GCONV_FULL_OUTPUT)
				2970	                  status = __GCONV_OK;
				2971	            @}
				2972	        @}
				2973	      while (status == __GCONV_OK);
				2974
				2975	      /* @r{We finished one use of this step.} */
				2976	      ++data->__invocation_counter;
				2977	    @}
				2978
				2979	  return status;
				2980	@}
				2981	@end smallexample
				2982	@end deftypevr
				2983
				2984	This information should be sufficient to write new modules. Anybody
				2985	doing so should also take a look at the available source code in the
				2986	@glibcadj{} sources. It contains many examples of working and optimized
				2987	modules.
				2988
				2989	@c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation