@node Character Set Handling, Locales, String and Array Utilities, Top
| 2 | @c %MENU% Support for extended character sets |
| 3 | @chapter Character Set Handling |
| 4 | |
| 5 | @ifnottex |
| 6 | @macro cal{text} |
| 7 | \text\ |
| 8 | @end macro |
| 9 | @end ifnottex |
| 10 | |
| 11 | Character sets used in the early days of computing had only six, seven, |
| 12 | or eight bits for each character: there was never a case where more than |
| 13 | eight bits (one byte) were used to represent a single character. The |
| 14 | limitations of this approach became more apparent as more people |
| 15 | grappled with non-Roman character sets, where not all the characters |
| 16 | that make up a language's character set can be represented by @math{2^8} |
| 17 | choices. This chapter shows the functionality that was added to the C |
| 18 | library to support multiple character sets. |
| 19 | |
| 20 | @menu |
| 21 | * Extended Char Intro:: Introduction to Extended Characters. |
| 22 | * Charset Function Overview:: Overview about Character Handling |
| 23 | Functions. |
| 24 | * Restartable multibyte conversion:: Restartable multibyte conversion |
| 25 | Functions. |
| 26 | * Non-reentrant Conversion:: Non-reentrant Conversion Function. |
| 27 | * Generic Charset Conversion:: Generic Charset Conversion. |
| 28 | @end menu |
| 29 | |
| 30 | |
| 31 | @node Extended Char Intro |
| 32 | @section Introduction to Extended Characters |
| 33 | |
| 34 | A variety of solutions is available to overcome the differences between |
| 35 | character sets with a 1:1 relation between bytes and characters and |
| 36 | character sets with ratios of 2:1 or 4:1. The remainder of this |
| 37 | section gives a few examples to help understand the design decisions |
| 38 | made while developing the functionality of the @w{C library}. |
| 39 | |
| 40 | @cindex internal representation |
| 41 | A distinction we have to make right away is between internal and |
| 42 | external representation. @dfn{Internal representation} means the |
| 43 | representation used by a program while keeping the text in memory. |
| 44 | External representations are used when text is stored or transmitted |
| 45 | through some communication channel. Examples of external |
| 46 | representations include files waiting in a directory to be |
| 47 | read and parsed. |
| 48 | |
| 49 | Traditionally there has been no difference between the two representations. |
| 50 | It was equally comfortable and useful to use the same single-byte |
| 51 | representation internally and externally. This comfort level decreases |
| 52 | with more and larger character sets. |
| 53 | |
| 54 | One of the problems to overcome with the internal representation is |
| 55 | handling text that is externally encoded using different character |
| 56 | sets. Assume a program that reads two texts and compares them using |
| 57 | some metric. The comparison can be usefully done only if the texts are |
| 58 | internally kept in a common format. |
| 59 | |
| 60 | @cindex wide character |
For such a common format (@math{=} character set) eight bits are certainly
no longer enough. So the smallest entity will have to grow: @dfn{wide
characters} will now be used. Instead of one byte per character, two or
four will be used. (Three bytes are awkward to address in memory, and
more than four bytes do not seem to be necessary.)
| 66 | |
| 67 | @cindex Unicode |
| 68 | @cindex ISO 10646 |
As shown elsewhere in this manual,
@c !!! Ahem, wide char string functions are not yet covered -- drepper
a completely new family of functions has been created that can handle wide
character texts in memory. The most commonly used character sets for such
internal wide character representations are Unicode and @w{ISO 10646}
(also known as UCS for Universal Character Set). Unicode was originally
planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
be a 31-bit large code space. The two standards are practically identical.
They have the same character repertoire and code table, but Unicode specifies
added semantics. Originally, only characters in the first @code{0x10000}
code positions (the so-called Basic Multilingual Plane, BMP) were
assigned; more specialized characters outside this 16-bit space have
been assigned since. A number of encodings have been
defined for Unicode and @w{ISO 10646} characters:
| 83 | @cindex UCS-2 |
| 84 | @cindex UCS-4 |
| 85 | @cindex UTF-8 |
| 86 | @cindex UTF-16 |
UCS-2 is a 16-bit word that can only represent characters
from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
| 89 | and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where |
| 90 | ASCII characters are represented by ASCII bytes and non-ASCII characters |
| 91 | by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension |
| 92 | of UCS-2 in which pairs of certain UCS-2 words can be used to encode |
| 93 | non-BMP characters up to @code{0x10ffff}. |
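
The UTF-8 byte layout mentioned above can be illustrated with a short
sketch. The function @code{utf8_encode} is a hypothetical helper and not
part of the C library. As a simplifying assumption it covers only code
points up to @code{0x10ffff} (at most four bytes), not the full 31-bit
range, and it rejects the UTF-16 surrogate range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a C library function: encode the code
   point CP as UTF-8 into OUT, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if CP is a surrogate
   or lies above 0x10ffff.  */
size_t
utf8_encode (uint32_t cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;               /* Reserved for UTF-16 surrogates.  */
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}
```

Every byte of a multi-byte sequence has its high bit set, so no such
byte can be mistaken for an ASCII character.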
| 94 | |
| 95 | To represent wide characters the @code{char} type is not suitable. For |
| 96 | this reason the @w{ISO C} standard introduces a new type that is |
| 97 | designed to keep one character of a wide character string. To maintain |
| 98 | the similarity there is also a type corresponding to @code{int} for |
| 99 | those functions that take a single wide character. |
| 100 | |
| 101 | @comment stddef.h |
| 102 | @comment ISO |
| 103 | @deftp {Data type} wchar_t |
| 104 | This data type is used as the base type for wide character strings. |
| 105 | In other words, arrays of objects of this type are the equivalent of |
| 106 | @code{char[]} for multibyte character strings. The type is defined in |
| 107 | @file{stddef.h}. |
| 108 | |
| 109 | The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not |
| 110 | say anything specific about the representation. It only requires that |
| 111 | this type is capable of storing all elements of the basic character set. |
| 112 | Therefore it would be legitimate to define @code{wchar_t} as @code{char}, |
| 113 | which might make sense for embedded systems. |
| 114 | |
But in @theglibc{} @code{wchar_t} is always 32 bits wide and, therefore,
capable of representing all UCS-4 values, which covers all of
@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type
| 118 | and thereby follow Unicode very strictly. This definition is perfectly |
| 119 | fine with the standard, but it also means that to represent all |
| 120 | characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate |
| 121 | characters, which is in fact a multi-wide-character encoding. But |
| 122 | resorting to multi-wide-character encoding contradicts the purpose of the |
| 123 | @code{wchar_t} type. |
| 124 | @end deftp |
| 125 | |
| 126 | @comment wchar.h |
| 127 | @comment ISO |
| 128 | @deftp {Data type} wint_t |
| 129 | @code{wint_t} is a data type used for parameters and variables that |
| 130 | contain a single wide character. As the name suggests this type is the |
| 131 | equivalent of @code{int} when using the normal @code{char} strings. The |
types @code{wchar_t} and @code{wint_t} often have the same
representation if they are 32 bits wide, but if @code{wchar_t} is
defined as @code{char} the type @code{wint_t} must be defined as
@code{int} due to the parameter promotion.
| 136 | |
| 137 | @pindex wchar.h |
| 138 | This type is defined in @file{wchar.h} and was introduced in |
| 139 | @w{Amendment 1} to @w{ISO C90}. |
| 140 | @end deftp |
| 141 | |
As for the @code{char} data type, macros are available for
specifying the minimum and maximum value representable in an object of
type @code{wchar_t}.
| 145 | |
| 146 | @comment wchar.h |
| 147 | @comment ISO |
| 148 | @deftypevr Macro wint_t WCHAR_MIN |
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wchar_t}.
| 151 | |
| 152 | This macro was introduced in @w{Amendment 1} to @w{ISO C90}. |
| 153 | @end deftypevr |
| 154 | |
| 155 | @comment wchar.h |
| 156 | @comment ISO |
| 157 | @deftypevr Macro wint_t WCHAR_MAX |
The macro @code{WCHAR_MAX} evaluates to the maximum value representable
by an object of type @code{wchar_t}.
| 160 | |
| 161 | This macro was introduced in @w{Amendment 1} to @w{ISO C90}. |
| 162 | @end deftypevr |
| 163 | |
| 164 | Another special wide character value is the equivalent to @code{EOF}. |
| 165 | |
| 166 | @comment wchar.h |
| 167 | @comment ISO |
| 168 | @deftypevr Macro wint_t WEOF |
| 169 | The macro @code{WEOF} evaluates to a constant expression of type |
| 170 | @code{wint_t} whose value is different from any member of the extended |
| 171 | character set. |
| 172 | |
| 173 | @code{WEOF} need not be the same value as @code{EOF} and unlike |
| 174 | @code{EOF} it also need @emph{not} be negative. In other words, sloppy |
| 175 | code like |
| 176 | |
@smallexample
@{
  int c;
  @dots{}
  while ((c = getc (fp)) >= 0)
    @dots{}
@}
@end smallexample
| 185 | |
| 186 | @noindent |
| 187 | has to be rewritten to use @code{WEOF} explicitly when wide characters |
| 188 | are used: |
| 189 | |
@smallexample
@{
  wint_t c;
  @dots{}
  while ((c = getwc (fp)) != WEOF)
    @dots{}
@}
@end smallexample
| 198 | |
| 199 | @pindex wchar.h |
| 200 | This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| 201 | defined in @file{wchar.h}. |
| 202 | @end deftypevr |
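
As a minimal sketch of the explicit @code{WEOF} comparison, the following
function counts the wide characters left in a stream. The name
@code{count_wide_chars} is hypothetical; the stream is assumed to be
unoriented or already wide-oriented, since @code{getwc} cannot read from
a byte-oriented stream.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count the wide characters remaining in FP.
   FP must not be byte-oriented.  */
size_t
count_wide_chars (FILE *fp)
{
  size_t n = 0;
  wint_t c;     /* wint_t rather than wchar_t, so it can hold WEOF.  */

  /* Compare against WEOF explicitly; WEOF need not be negative.  */
  while ((c = getwc (fp)) != WEOF)
    ++n;
  return n;
}
```

Declaring @code{c} as @code{wchar_t} instead of @code{wint_t} would be a
mistake on systems where @code{WEOF} does not fit into @code{wchar_t}.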
| 203 | |
| 204 | |
These internal representations present problems when it comes to storage
and transmission. Because each single wide character consists of more
than one byte, they are affected by byte-ordering. Thus, machines with
different endiannesses would see different values when accessing the same
data. This byte ordering concern also applies to communication protocols
that are all byte-based and therefore require the sender to
decide how to split the wide character into bytes. A last (but not least
important) point is that wide characters often require more storage space
than a customized byte-oriented character set.
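
To make the byte-ordering concern concrete, here is a small sketch that
writes a 32-bit character value in a fixed (big-endian) byte order and
reads it back, so that sender and receiver agree regardless of their
native endianness. The helper names @code{ucs4_store_be} and
@code{ucs4_load_be} are hypothetical, and the choice of big-endian order
is only an assumption for the example.

```c
#include <stdint.h>

/* Hypothetical helper: store WC into OUT in big-endian byte order.  */
void
ucs4_store_be (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) ((wc >> 16) & 0xff);
  out[2] = (unsigned char) ((wc >> 8) & 0xff);
  out[3] = (unsigned char) (wc & 0xff);
}

/* Hypothetical helper: read back a value stored by ucs4_store_be.  */
uint32_t
ucs4_load_be (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Because the shifts operate on values rather than on memory, the same
code produces the same bytes on little-endian and big-endian machines.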
| 214 | |
| 215 | @cindex multibyte character |
| 216 | @cindex EBCDIC |
| 217 | For all the above reasons, an external encoding that is different from |
| 218 | the internal encoding is often used if the latter is UCS-2 or UCS-4. |
| 219 | The external encoding is byte-based and can be chosen appropriately for |
| 220 | the environment and for the texts to be handled. A variety of different |
| 221 | character sets can be used for this external encoding (information that |
| 222 | will not be exhaustively presented here--instead, a description of the |
| 223 | major groups will suffice). All of the ASCII-based character sets |
| 224 | fulfill one requirement: they are "filesystem safe." This means that |
| 225 | the character @code{'/'} is used in the encoding @emph{only} to |
| 226 | represent itself. Things are a bit different for character sets like |
| 227 | EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set |
| 228 | family used by IBM), but if the operating system does not understand |
EBCDIC directly the parameters to system calls have to be converted
| 230 | first anyhow. |
| 231 | |
| 232 | @itemize @bullet |
| 233 | @item |
| 234 | The simplest character sets are single-byte character sets. There can |
| 235 | be only up to 256 characters (for @w{8 bit} character sets), which is |
| 236 | not sufficient to cover all languages but might be sufficient to handle |
a specific text. Handling of @w{8 bit} character sets is simple. This
| 238 | is not true for other kinds presented later, and therefore, the |
| 239 | application one uses might require the use of @w{8 bit} character sets. |
| 240 | |
| 241 | @cindex ISO 2022 |
| 242 | @item |
| 243 | The @w{ISO 2022} standard defines a mechanism for extended character |
| 244 | sets where one character @emph{can} be represented by more than one |
| 245 | byte. This is achieved by associating a state with the text. |
| 246 | Characters that can be used to change the state can be embedded in the |
| 247 | text. Each byte in the text might have a different interpretation in each |
| 248 | state. The state might even influence whether a given byte stands for a |
| 249 | character on its own or whether it has to be combined with some more |
| 250 | bytes. |
| 251 | |
| 252 | @cindex EUC |
| 253 | @cindex Shift_JIS |
| 254 | @cindex SJIS |
| 255 | In most uses of @w{ISO 2022} the defined character sets do not allow |
| 256 | state changes that cover more than the next character. This has the |
| 257 | big advantage that whenever one can identify the beginning of the byte |
| 258 | sequence of a character one can interpret a text correctly. Examples of |
| 259 | character sets using this policy are the various EUC character sets |
| 260 | (used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN) |
| 261 | or Shift_JIS (SJIS, a Japanese encoding). |
| 262 | |
| 263 | But there are also character sets using a state that is valid for more |
| 264 | than one character and has to be changed by another byte sequence. |
| 265 | Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN. |
| 266 | |
| 267 | @item |
| 268 | @cindex ISO 6937 |
Early attempts to fix 8 bit character sets for other languages using the
Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
representing characters like the acute accent do not produce output
themselves: one has to combine them with other characters to get the
desired result. For example, the byte sequence @code{0xc2 0x61}
(non-spacing acute accent, followed by lower-case `a') is needed to get
the ``small a with acute'' character. To get the acute accent character
on its own, one has to write @code{0xc2 0x20} (the non-spacing acute
followed by a space).
| 278 | |
| 279 | Character sets like @w{ISO 6937} are used in some embedded systems such |
| 280 | as teletex. |
| 281 | |
| 282 | @item |
| 283 | @cindex UTF-8 |
Instead of converting the Unicode or @w{ISO 10646} text used internally,
it is often also sufficient to simply use an encoding different from
UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
encoding: UTF-8. This encoding is able to represent all 31 bits of
@w{ISO 10646} in a byte string of length one to six.
| 289 | |
| 290 | @cindex UTF-7 |
| 291 | There were a few other attempts to encode @w{ISO 10646} such as UTF-7, |
| 292 | but UTF-8 is today the only encoding that should be used. In fact, with |
| 293 | any luck UTF-8 will soon be the only external encoding that has to be |
| 294 | supported. It proves to be universally usable and its only disadvantage |
| 295 | is that it favors Roman languages by making the byte string |
| 296 | representation of other scripts (Cyrillic, Greek, Asian scripts) longer |
| 297 | than necessary if using a specific character set for these scripts. |
| 298 | Methods like the Unicode compression scheme can alleviate these |
| 299 | problems. |
| 300 | @end itemize |
| 301 | |
The question remaining is how to select the character set or encoding
to use. The answer: you cannot decide about it yourself; it is decided
by the developers of the system or the majority of the users. Since the
| 305 | goal is interoperability one has to use whatever the other people one |
| 306 | works with use. If there are no constraints, the selection is based on |
| 307 | the requirements the expected circle of users will have. In other words, |
| 308 | if a project is expected to be used in only, say, Russia it is fine to use |
| 309 | KOI8-R or a similar character set. But if at the same time people from, |
| 310 | say, Greece are participating one should use a character set that allows |
| 311 | all people to collaborate. |
| 312 | |
| 313 | The most widely useful solution seems to be: go with the most general |
| 314 | character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding |
| 315 | and problems about users not being able to use their own language |
| 316 | adequately are a thing of the past. |
| 317 | |
| 318 | One final comment about the choice of the wide character representation |
| 319 | is necessary at this point. We have said above that the natural choice |
| 320 | is using Unicode or @w{ISO 10646}. This is not required, but at least |
encouraged, by the @w{ISO C} standard. The standard defines the
macro @code{__STDC_ISO_10646__}, which is defined only on systems where
the @code{wchar_t} type encodes @w{ISO 10646} characters. If this
| 324 | symbol is not defined one should avoid making assumptions about the wide |
| 325 | character representation. If the programmer uses only the functions |
| 326 | provided by the C library to handle wide character strings there should |
| 327 | be no compatibility problems with other systems. |
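
A minimal sketch of how a program might guard such an assumption
follows. The function @code{wchar_code_point} is hypothetical; it returns
the @w{ISO 10646} code position of a wide character only when
@code{__STDC_ISO_10646__} says that this interpretation is valid, and
@code{-1} otherwise.

```c
#include <wchar.h>

/* Hypothetical helper: return the ISO 10646 code position of WC, or
   -1 if the implementation makes no promise about the encoding of
   wchar_t.  */
long
wchar_code_point (wchar_t wc)
{
#ifdef __STDC_ISO_10646__
  /* wchar_t values are ISO 10646 code positions.  */
  return (long) wc;
#else
  /* No assumption about the wide character representation is safe.  */
  (void) wc;
  return -1;
#endif
}
```

Code that needs the numeric code position can then fall back to another
strategy, for example a conversion with @code{iconv}, when the result is
@code{-1}.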
| 328 | |
| 329 | @node Charset Function Overview |
| 330 | @section Overview about Character Handling Functions |
| 331 | |
| 332 | A Unix @w{C library} contains three different sets of functions in two |
| 333 | families to handle character set conversion. One of the function families |
| 334 | (the most commonly used) is specified in the @w{ISO C90} standard and, |
| 335 | therefore, is portable even beyond the Unix world. Unfortunately this |
| 336 | family is the least useful one. These functions should be avoided |
| 337 | whenever possible, especially when developing libraries (as opposed to |
| 338 | applications). |
| 339 | |
The second family of functions was introduced in the early Unix standards
| 341 | (XPG2) and is still part of the latest and greatest Unix standard: |
| 342 | @w{Unix 98}. It is also the most powerful and useful set of functions. |
| 343 | But we will start with the functions defined in @w{Amendment 1} to |
| 344 | @w{ISO C90}. |
| 345 | |
| 346 | @node Restartable multibyte conversion |
| 347 | @section Restartable Multibyte Conversion Functions |
| 348 | |
| 349 | The @w{ISO C} standard defines functions to convert strings from a |
| 350 | multibyte representation to wide character strings. There are a number |
| 351 | of peculiarities: |
| 352 | |
| 353 | @itemize @bullet |
| 354 | @item |
| 355 | The character set assumed for the multibyte encoding is not specified |
| 356 | as an argument to the functions. Instead the character set specified by |
| 357 | the @code{LC_CTYPE} category of the current locale is used; see |
| 358 | @ref{Locale Categories}. |
| 359 | |
| 360 | @item |
| 361 | The functions handling more than one character at a time require NUL |
| 362 | terminated strings as the argument (i.e., converting blocks of text |
| 363 | does not work unless one can add a NUL byte at an appropriate place). |
| 364 | @Theglibc{} contains some extensions to the standard that allow |
| 365 | specifying a size, but basically they also expect terminated strings. |
| 366 | @end itemize |
| 367 | |
| 368 | Despite these limitations the @w{ISO C} functions can be used in many |
| 369 | contexts. In graphical user interfaces, for instance, it is not |
| 370 | uncommon to have functions that require text to be displayed in a wide |
| 371 | character string if the text is not simple ASCII. The text itself might |
| 372 | come from a file with translations and the user should decide about the |
| 373 | current locale, which determines the translation and therefore also the |
| 374 | external encoding used. In such a situation (and many others) the |
| 375 | functions described here are perfect. If more freedom while performing |
| 376 | the conversion is necessary take a look at the @code{iconv} functions |
| 377 | (@pxref{Generic Charset Conversion}). |
| 378 | |
| 379 | @menu |
| 380 | * Selecting the Conversion:: Selecting the conversion and its properties. |
| 381 | * Keeping the state:: Representing the state of the conversion. |
| 382 | * Converting a Character:: Converting Single Characters. |
| 383 | * Converting Strings:: Converting Multibyte and Wide Character |
| 384 | Strings. |
| 385 | * Multibyte Conversion Example:: A Complete Multibyte Conversion Example. |
| 386 | @end menu |
| 387 | |
| 388 | @node Selecting the Conversion |
| 389 | @subsection Selecting the conversion and its properties |
| 390 | |
| 391 | We already said above that the currently selected locale for the |
| 392 | @code{LC_CTYPE} category decides about the conversion that is performed |
| 393 | by the functions we are about to describe. Each locale uses its own |
| 394 | character set (given as an argument to @code{localedef}) and this is the |
| 395 | one assumed as the external multibyte encoding. The wide character |
| 396 | set is always UCS-4 in @theglibc{}. |
| 397 | |
| 398 | A characteristic of each multibyte character set is the maximum number |
| 399 | of bytes that can be necessary to represent one character. This |
| 400 | information is quite important when writing code that uses the |
| 401 | conversion functions (as shown in the examples below). |
| 402 | The @w{ISO C} standard defines two macros that provide this information. |
| 403 | |
| 404 | |
| 405 | @comment limits.h |
| 406 | @comment ISO |
| 407 | @deftypevr Macro int MB_LEN_MAX |
| 408 | @code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte |
| 409 | sequence for a single character in any of the supported locales. It is |
| 410 | a compile-time constant and is defined in @file{limits.h}. |
| 411 | @pindex limits.h |
| 412 | @end deftypevr |
| 413 | |
| 414 | @comment stdlib.h |
| 415 | @comment ISO |
| 416 | @deftypevr Macro int MB_CUR_MAX |
| 417 | @code{MB_CUR_MAX} expands into a positive integer expression that is the |
| 418 | maximum number of bytes in a multibyte character in the current locale. |
| 419 | The value is never greater than @code{MB_LEN_MAX}. Unlike |
| 420 | @code{MB_LEN_MAX} this macro need not be a compile-time constant, and in |
| 421 | @theglibc{} it is not. |
| 422 | |
| 423 | @pindex stdlib.h |
| 424 | @code{MB_CUR_MAX} is defined in @file{stdlib.h}. |
| 425 | @end deftypevr |
| 426 | |
Two different macros are necessary since strict @w{ISO C90} compilers
| 428 | do not allow variable length array definitions, but still it is desirable |
| 429 | to avoid dynamic allocation. This incomplete piece of code shows the |
| 430 | problem: |
| 431 | |
@smallexample
@{
  char buf[MB_LEN_MAX];
  size_t len = 0;

  while (! feof (fp))
    @{
      len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      /* @r{@dots{} process} buf@r{, setting} used @r{to the bytes consumed} */
      memmove (buf, &buf[used], len - used);
      len -= used;
    @}
@}
@end smallexample
| 445 | |
The code in the loop body is expected to always have enough bytes in
the array @var{buf} to convert one multibyte character. The array
@var{buf} has to be sized statically since many compilers do not allow a
variable size. The @code{fread} call makes sure that up to
@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it isn't
a problem if @code{MB_CUR_MAX} is not a compile-time constant.
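
The relationship between the two macros can be checked with a short
sketch. The function @code{locale_mb_max} is a hypothetical helper: it
sizes a buffer with the compile-time constant and reads the limit of the
current locale at run time.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: return MB_CUR_MAX for the current locale, or 0
   if it would not fit into a buffer of MB_LEN_MAX bytes (which should
   never happen).  */
size_t
locale_mb_max (void)
{
  /* MB_LEN_MAX is a compile-time constant, so it may size an array
     even under strict ISO C90 rules.  */
  char buf[MB_LEN_MAX];

  /* MB_CUR_MAX may expand to a function call; evaluate it at run time.  */
  size_t cur = MB_CUR_MAX;

  return cur <= sizeof buf ? cur : 0;
}
```

The result depends on the locale selected with @code{setlocale}, so it
should be evaluated again after the locale changes.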
| 452 | |
| 453 | |
| 454 | @node Keeping the state |
| 455 | @subsection Representing the state of the conversion |
| 456 | |
| 457 | @cindex stateful |
| 458 | In the introduction of this chapter it was said that certain character |
| 459 | sets use a @dfn{stateful} encoding. That is, the encoded values depend |
| 460 | in some way on the previous bytes in the text. |
| 461 | |
| 462 | Since the conversion functions allow converting a text in more than one |
| 463 | step we must have a way to pass this information from one call of the |
| 464 | functions to another. |
| 465 | |
| 466 | @comment wchar.h |
| 467 | @comment ISO |
| 468 | @deftp {Data type} mbstate_t |
| 469 | @cindex shift state |
| 470 | A variable of type @code{mbstate_t} can contain all the information |
| 471 | about the @dfn{shift state} needed from one call to a conversion |
| 472 | function to another. |
| 473 | |
| 474 | @pindex wchar.h |
| 475 | @code{mbstate_t} is defined in @file{wchar.h}. It was introduced in |
| 476 | @w{Amendment 1} to @w{ISO C90}. |
| 477 | @end deftp |
| 478 | |
| 479 | To use objects of type @code{mbstate_t} the programmer has to define such |
| 480 | objects (normally as local variables on the stack) and pass a pointer to |
| 481 | the object to the conversion functions. This way the conversion function |
| 482 | can update the object if the current multibyte character set is stateful. |
| 483 | |
| 484 | There is no specific function or initializer to put the state object in |
| 485 | any specific state. The rules are that the object should always |
| 486 | represent the initial state before the first use, and this is achieved by |
| 487 | clearing the whole variable with code such as follows: |
| 488 | |
@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{from now on @var{state} can be used.} */
  @dots{}
@}
@end smallexample
| 497 | |
| 498 | When using the conversion functions to generate output it is often |
| 499 | necessary to test whether the current state corresponds to the initial |
| 500 | state. This is necessary, for example, to decide whether to emit |
| 501 | escape sequences to set the state to the initial state at certain |
| 502 | sequence points. Communication protocols often require this. |
| 503 | |
| 504 | @comment wchar.h |
| 505 | @comment ISO |
| 506 | @deftypefun int mbsinit (const mbstate_t *@var{ps}) |
| 507 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} |
| 508 | @c ps is dereferenced once, unguarded. This would call for @mtsrace:ps, |
| 509 | @c but since a single word-sized field is (atomically) accessed, any |
| 510 | @c race here would be harmless. Other functions that take an optional |
| 511 | @c mbstate_t* argument named ps are marked with @mtasurace:<func>/!ps, |
| 512 | @c to indicate that the function uses a static buffer if ps is NULL. |
| 513 | @c These could also have been marked with @mtsrace:ps, but we'll omit |
| 514 | @c that for brevity, for it's somewhat redundant with the @mtasurace. |
| 515 | The @code{mbsinit} function determines whether the state object pointed |
| 516 | to by @var{ps} is in the initial state. If @var{ps} is a null pointer or |
| 517 | the object is in the initial state the return value is nonzero. Otherwise |
| 518 | it is zero. |
| 519 | |
| 520 | @pindex wchar.h |
| 521 | @code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| 522 | declared in @file{wchar.h}. |
| 523 | @end deftypefun |
| 524 | |
| 525 | Code using @code{mbsinit} often looks similar to this: |
| 526 | |
| 527 | @c Fix the example to explicitly say how to generate the escape sequence |
| 528 | @c to restore the initial state. |
@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{Use @var{state}.} */
  @dots{}
  if (! mbsinit (&state))
    @{
      /* @r{Emit code to return to initial state.} */
      const wchar_t empty[] = L"";
      const wchar_t *srcp = empty;
      wcsrtombs (outbuf, &srcp, outbuflen, &state);
    @}
  @dots{}
@}
@end smallexample
| 545 | |
| 546 | The code to emit the escape sequence to get back to the initial state is |
| 547 | interesting. The @code{wcsrtombs} function can be used to determine the |
| 548 | necessary output code (@pxref{Converting Strings}). Please note that with |
| 549 | @theglibc{} it is not necessary to perform this extra action for the |
| 550 | conversion from multibyte text to wide character text since the wide |
character encoding is not stateful. But there is nothing mentioned in
any standard that prohibits @code{wchar_t} from using a stateful
encoding.
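
The pattern discussed above can be packaged into two small functions.
This is only a sketch: the names @code{init_state} and
@code{emit_return_to_initial} are hypothetical, and the caller is assumed
to provide an output buffer large enough for the shift sequence.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: put *PS into the initial conversion state.  */
void
init_state (mbstate_t *ps)
{
  memset (ps, '\0', sizeof *ps);
}

/* Hypothetical helper: if *PS is not in the initial state, write the
   bytes needed to return to it into OUTBUF by converting an empty wide
   character string.  Returns the number of bytes written, excluding
   the terminating null byte, or (size_t) -1 on error.  */
size_t
emit_return_to_initial (char *outbuf, size_t outbuflen, mbstate_t *ps)
{
  const wchar_t empty[] = L"";
  const wchar_t *srcp = empty;

  if (mbsinit (ps))
    return 0;                   /* Nothing to emit.  */
  return wcsrtombs (outbuf, &srcp, outbuflen, ps);
}
```

With a stateless multibyte encoding such as UTF-8 the state object is
always in the initial state at character boundaries, so the second
function returns zero without writing anything.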
| 554 | |
| 555 | @node Converting a Character |
| 556 | @subsection Converting Single Characters |
| 557 | |
| 558 | The most fundamental of the conversion functions are those dealing with |
| 559 | single characters. Please note that this does not always mean single |
| 560 | bytes. But since there is very often a subset of the multibyte |
| 561 | character set that consists of single byte sequences, there are |
| 562 | functions to help with converting bytes. Frequently, ASCII is a subpart |
| 563 | of the multibyte character set. In such a scenario, each ASCII character |
| 564 | stands for itself, and all other characters have at least a first byte |
| 565 | that is beyond the range @math{0} to @math{127}. |
| 566 | |
| 567 | @comment wchar.h |
| 568 | @comment ISO |
| 569 | @deftypefun wint_t btowc (int @var{c}) |
| 570 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 571 | @c Calls btowc_fct or __fct; reads from locale, and from the |
| 572 | @c get_gconv_fcts result multiple times. get_gconv_fcts calls |
| 573 | @c __wcsmbs_load_conv to initialize the ctype if it's null. |
| 574 | @c wcsmbs_load_conv takes a non-recursive wrlock before allocating |
| 575 | @c memory for the fcts structure, initializing it, and then storing it |
| 576 | @c in the locale object. The initialization involves dlopening and a |
| 577 | @c lot more. |
| 578 | The @code{btowc} function (``byte to wide character'') converts a valid |
| 579 | single byte character @var{c} in the initial shift state into the wide |
| 580 | character equivalent using the conversion rules from the currently |
| 581 | selected locale of the @code{LC_CTYPE} category. |
| 582 | |
If @code{(unsigned char) @var{c}} is not a valid single-byte multibyte
character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.
| 585 | |
| 586 | Please note the restriction of @var{c} being tested for validity only in |
| 587 | the initial shift state. No @code{mbstate_t} object is used from |
| 588 | which the state information is taken, and the function also does not use |
| 589 | any static state. |
| 590 | |
| 591 | @pindex wchar.h |
| 592 | The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90} |
| 593 | and is declared in @file{wchar.h}. |
| 594 | @end deftypefun |
| 595 | |
Despite the limitation that the single byte value is always interpreted
in the initial state, this function is actually useful most of the time.
Most character sets are either entirely single-byte character sets or they
are extensions of ASCII. This makes it possible to write code like this
(not that this specific example is very useful):
| 601 | |
@smallexample
wchar_t *
itow (unsigned long int val)
@{
  static wchar_t buf[30];
  wchar_t *wcp = &buf[29];
  *wcp = L'\0';
  while (val != 0)
    @{
      *--wcp = btowc ('0' + val % 10);
      val /= 10;
    @}
  if (wcp == &buf[29])
    *--wcp = L'0';
  return wcp;
@}
@end smallexample
| 619 | |
| 620 | Why is it necessary to use such a complicated implementation and not |
| 621 | simply cast @code{'0' + val % 10} to a wide character? The answer is |
| 622 | that there is no guarantee that one can perform this kind of arithmetic |
| 623 | on the characters of the character set used for the @code{wchar_t} |
| 624 | representation. In other situations the bytes are not constant at |
| 625 | compile time and so the compiler cannot do the work. In situations like |
| 626 | this, using @code{btowc} is required. |
| 627 | |
| 628 | @noindent |
| 629 | There is also a function for the conversion in the other direction. |
| 630 | |
| 631 | @comment wchar.h |
| 632 | @comment ISO |
| 633 | @deftypefun int wctob (wint_t @var{c}) |
| 634 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 635 | The @code{wctob} function (``wide character to byte'') takes as the |
| 636 | parameter a valid wide character. If the multibyte representation for |
| 637 | this character in the initial state is exactly one byte long, the return |
| 638 | value of this function is that byte, as an @code{unsigned char} converted |
| 639 | to @code{int}. Otherwise the return value is @code{EOF}. |
| 640 | |
| 641 | @pindex wchar.h |
| 642 | @code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and |
| 643 | is declared in @file{wchar.h}. |
| 644 | @end deftypefun |
| 645 | |
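As a small illustration of how the two functions fit together, the
following sketch checks whether a byte survives a round trip through
@code{btowc} and @code{wctob}. The helper name @code{round_trips} is made
up for this example and is not part of any library; the sketch relies only
on the behavior described above.

```c
#include <stdio.h>   /* EOF */
#include <wchar.h>   /* btowc, wctob, wint_t, WEOF */

/* Return 1 if the single byte C survives a round trip through
   btowc and wctob in the current LC_CTYPE locale, 0 otherwise.
   round_trips is a hypothetical helper, not a library function.  */
int
round_trips (int c)
{
  wint_t wc;

  if (c == EOF)
    return 0;
  wc = btowc (c);
  if (wc == WEOF)
    /* C is not a complete character in the initial shift state.  */
    return 0;
  /* wctob yields the byte as an unsigned char converted to int.  */
  return wctob (wc) == (int) (unsigned char) c;
}
```

In the @code{"C"} locale, @code{round_trips ('A')} is expected to yield 1,
while @code{round_trips (EOF)} yields 0.
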
| 646 | There are more general functions to convert single characters from |
| 647 | multibyte representation to wide characters and vice versa. These |
| 648 | functions pose no limit on the length of the multibyte representation |
| 649 | and they also do not require it to be in the initial state. |
| 650 | |
| 651 | @comment wchar.h |
| 652 | @comment ISO |
| 653 | @deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps}) |
| 654 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbrtowc/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 655 | @cindex stateful |
| 656 | The @code{mbrtowc} function (``multibyte restartable to wide |
| 657 | character'') converts the next multibyte character in the string pointed |
| 658 | to by @var{s} into a wide character and stores it in the wide character |
| 659 | string pointed to by @var{pwc}. The conversion is performed according |
| 660 | to the locale currently selected for the @code{LC_CTYPE} category. If |
| 661 | the conversion for the character set used in the locale requires a state, |
| 662 | the multibyte string is interpreted in the state represented by the |
| 663 | object pointed to by @var{ps}. If @var{ps} is a null pointer, a static, |
| 664 | internal state variable used only by the @code{mbrtowc} function is |
| 665 | used. |
| 666 | |
| 667 | If the next multibyte character corresponds to the NUL wide character, |
| 668 | the return value of the function is @math{0} and the state object is |
| 669 | afterwards in the initial state. If the next @var{n} or fewer bytes |
| 670 | form a correct multibyte character, the return value is the number of |
| 671 | bytes starting from @var{s} that form the multibyte character. The |
| 672 | conversion state is updated according to the bytes consumed in the |
| 673 | conversion. In both cases the wide character (either the @code{L'\0'} |
| 674 | or the one found in the conversion) is stored in the string pointed to |
| 675 | by @var{pwc} if @var{pwc} is not null. |
| 676 | |
| 677 | If the first @var{n} bytes of the multibyte string possibly form a valid |
| 678 | multibyte character but there are more than @var{n} bytes needed to |
| 679 | complete it, the return value of the function is @code{(size_t) -2} and |
| 680 | no value is stored. Please note that this can happen even if @var{n} |
| 681 | has a value greater than or equal to @code{MB_CUR_MAX} since the input |
| 682 | might contain redundant shift sequences. |
| 683 | |
| 684 | If the first @var{n} bytes of the multibyte string cannot possibly form |
| 685 | a valid multibyte character, no value is stored, the global variable |
| 686 | @code{errno} is set to the value @code{EILSEQ}, and the function returns |
| 687 | @code{(size_t) -1}. The conversion state is afterwards undefined. |
| 688 | |
| 689 | @pindex wchar.h |
| 690 | @code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and |
| 691 | is declared in @file{wchar.h}. |
| 692 | @end deftypefun |
| 693 | |
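The four kinds of return value described above can be told apart as in the
following sketch. The enumeration and the function @code{classify_next}
are hypothetical names introduced only for this illustration; the sketch
is one possible way to wrap the call, not the way the library does it.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical names, used only in this sketch.  */
enum next_status
{
  NEXT_CHAR,        /* A complete character was converted.  */
  NEXT_NUL,         /* The NUL wide character was converted.  */
  NEXT_INCOMPLETE,  /* More than N bytes are needed.  */
  NEXT_INVALID      /* The bytes cannot form a valid character.  */
};

/* Convert the next multibyte character in the N bytes at S.  On
   NEXT_CHAR, *USED is the number of bytes consumed and *WC the
   result; on NEXT_NUL, *WC is L'\0' and *USED is 0.  */
enum next_status
classify_next (const char *s, size_t n, mbstate_t *ps,
               wchar_t *wc, size_t *used)
{
  size_t r = mbrtowc (wc, s, n, ps);

  *used = 0;
  if (r == (size_t) -1)
    return NEXT_INVALID;      /* errno is EILSEQ here.  */
  if (r == (size_t) -2)
    return NEXT_INCOMPLETE;   /* Nothing was stored in *WC.  */
  if (r == 0)
    return NEXT_NUL;
  *used = r;
  return NEXT_CHAR;
}
```

The two special values have to be tested before the result is used as a
byte count, since both are very large when interpreted as @code{size_t}.
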
| 694 | Use of @code{mbrtowc} is straightforward. A function that copies a |
| 695 | multibyte string into a wide character string while at the same time |
| 696 | converting all lowercase characters into uppercase could look like this |
| 697 | (this is not the final version, just an example; it has no error |
| 698 | checking, and sometimes leaks memory): |
| 699 | |
| 700 | @smallexample |
| 701 | wchar_t * |
| 702 | mbstouwcs (const char *s) |
| 703 | @{ |
| 704 |   size_t len = strlen (s); |
| 705 |   wchar_t *result = malloc ((len + 1) * sizeof (wchar_t)); |
| 706 |   wchar_t *wcp = result; |
| 707 |   wchar_t tmp[1]; |
| 708 |   mbstate_t state; |
| 709 |   size_t nbytes; |
| 710 |   memset (&state, '\0', sizeof (state)); |
| 711 |   /* Pass len + 1 so that mbrtowc also sees the terminating NUL.  */ |
| 712 |   while ((nbytes = mbrtowc (tmp, s, len + 1, &state)) > 0) |
| 713 |     @{ |
| 714 |       if (nbytes >= (size_t) -2) |
| 715 |         return NULL;    /* Invalid input string.  */ |
| 716 |       *wcp++ = towupper (tmp[0]); |
| 717 |       len -= nbytes; |
| 718 |       s += nbytes; |
| 719 |     @} |
| 720 |   *wcp = L'\0'; |
| 721 |   return result; |
| 722 | @} |
| 723 | @end smallexample |
| 724 | |
| 725 | The use of @code{mbrtowc} should be clear. A single wide character is |
| 726 | stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored |
| 727 | in the variable @var{nbytes}. If the conversion is successful, the |
| 728 | uppercase variant of the wide character is stored in the @var{result} |
| 729 | array and the pointer to the input string and the number of available |
| 730 | bytes are adjusted. |
| 731 | |
| 732 | The only non-obvious thing about @code{mbstouwcs} might be the way memory |
| 733 | is allocated for the result. The above code uses the fact that there |
| 734 | can never be more wide characters in the converted results than there are |
| 735 | bytes in the multibyte input string. This method yields a pessimistic |
| 736 | guess about the size of the result, and if many wide character strings |
| 737 | have to be constructed this way or if the strings are long, the extra |
| 738 | memory required to be allocated because the input string contains |
| 739 | multibyte characters might be significant. The allocated memory block can |
| 740 | be resized to the correct size before returning it, but a better solution |
| 741 | might be to allocate just the right amount of space for the result right |
| 742 | away. Unfortunately there is no function to compute the length of the wide |
| 743 | character string directly from the multibyte string. There is, however, a |
| 744 | function that does part of the work. |
| 745 | |
| 746 | @comment wchar.h |
| 747 | @comment ISO |
| 748 | @deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps}) |
| 749 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbrlen/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 750 | The @code{mbrlen} function (``multibyte restartable length'') computes |
| 751 | the number of bytes, among at most @var{n} bytes starting at @var{s}, |
| 752 | that form the next valid and complete multibyte character. |
| 753 | |
| 754 | If the next multibyte character corresponds to the NUL wide character, |
| 755 | the return value is @math{0}. If the next @var{n} or fewer bytes form a |
| 756 | valid multibyte character, the number of bytes belonging to this |
| 757 | multibyte character byte sequence is returned. |
| 758 | |
| 759 | If the first @var{n} bytes possibly form a valid multibyte |
| 760 | character but the character is incomplete, the return value is |
| 761 | @code{(size_t) -2}. Otherwise the multibyte character sequence is invalid |
| 762 | and the return value is @code{(size_t) -1}. |
| 763 | |
| 764 | The multibyte sequence is interpreted in the state represented by the |
| 765 | object pointed to by @var{ps}. If @var{ps} is a null pointer, a state |
| 766 | object local to @code{mbrlen} is used. |
| 767 | |
| 768 | @pindex wchar.h |
| 769 | @code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and |
| 770 | is declared in @file{wchar.h}. |
| 771 | @end deftypefun |
| 772 | |
| 773 | The attentive reader now will note that @code{mbrlen} can be implemented |
| 774 | as |
| 775 | |
| 776 | @smallexample |
| 777 | mbrtowc (NULL, s, n, ps != NULL ? ps : &internal) |
| 778 | @end smallexample |
| 779 | |
| 780 | This is true and in fact is mentioned in the official specification. |
| 781 | How can this function be used to determine the length of the wide |
| 782 | character string created from a multibyte character string? It is not |
| 783 | directly usable, but we can define a function @code{mbslen} using it: |
| 784 | |
| 785 | @smallexample |
| 786 | size_t |
| 787 | mbslen (const char *s) |
| 788 | @{ |
| 789 | mbstate_t state; |
| 790 | size_t result = 0; |
| 791 | size_t nbytes; |
| 792 | memset (&state, '\0', sizeof (state)); |
| 793 | while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0) |
| 794 | @{ |
| 795 | if (nbytes >= (size_t) -2) |
| 796 | /* @r{Something is wrong.} */ |
| 797 | return (size_t) -1; |
| 798 | s += nbytes; |
| 799 | ++result; |
| 800 | @} |
| 801 | return result; |
| 802 | @} |
| 803 | @end smallexample |
| 804 | |
| 805 | This function simply calls @code{mbrlen} for each multibyte character |
| 806 | in the string and counts the number of function calls. Please note that |
| 807 | we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen} |
| 808 | call. This is acceptable since a) this value is no smaller than the length |
| 809 | of the longest multibyte character sequence and b) we know that the string |
| 810 | @var{s} ends with a NUL byte, which cannot be part of any other multibyte |
| 811 | character sequence but the one representing the NUL wide character. |
| 812 | Therefore, the @code{mbrlen} function will never read invalid memory. |
| 813 | |
| 814 | Now that this function is available (just to make this clear, this |
| 815 | function is @emph{not} part of @theglibc{}) we can compute the |
| 816 | number of bytes required to store the wide character string converted |
| 817 | from the multibyte character string @var{s} using |
| 818 | |
| 819 | @smallexample |
| 820 | wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t); |
| 821 | @end smallexample |
| 822 | |
| 823 | Please note that the @code{mbslen} function is quite inefficient. The |
| 824 | implementation of @code{mbstouwcs} with @code{mbslen} would have to |
| 825 | perform the conversion of the multibyte character input string twice, and |
| 826 | this conversion might be quite expensive. So it is necessary to think |
| 827 | about the consequences of using the easier but imprecise method before |
| 828 | doing the work twice. |
| 829 | |
| 830 | @comment wchar.h |
| 831 | @comment ISO |
| 832 | @deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps}) |
| 833 | @safety{@prelim{}@mtunsafe{@mtasurace{:wcrtomb/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 834 | @c wcrtomb uses a static, non-thread-local unguarded state variable when |
| 835 | @c PS is NULL. When a state is passed in, and it's not used |
| 836 | @c concurrently in other threads, this function behaves safely as long |
| 837 | @c as gconv modules don't bring MT safety issues of their own. |
| 838 | @c Attempting to load gconv modules or to build conversion chains in |
| 839 | @c signal handlers may encounter gconv databases or caches in a |
| 840 | @c partially-updated state, and asynchronous cancellation may leave them |
| 841 | @c in such states, besides leaking the lock that guards them. |
| 842 | @c get_gconv_fcts ok |
| 843 | @c wcsmbs_load_conv ok |
| 844 | @c norm_add_slashes ok |
| 845 | @c wcsmbs_getfct ok |
| 846 | @c gconv_find_transform ok |
| 847 | @c gconv_read_conf (libc_once) |
| 848 | @c gconv_lookup_cache ok |
| 849 | @c find_module_idx ok |
| 850 | @c find_module ok |
| 851 | @c gconv_find_shlib (ok) |
| 852 | @c ->init_fct (assumed ok) |
| 853 | @c gconv_get_builtin_trans ok |
| 854 | @c gconv_release_step ok |
| 855 | @c do_lookup_alias ok |
| 856 | @c find_derivation ok |
| 857 | @c derivation_lookup ok |
| 858 | @c increment_counter ok |
| 859 | @c gconv_find_shlib ok |
| 860 | @c step->init_fct (assumed ok) |
| 861 | @c gen_steps ok |
| 862 | @c gconv_find_shlib ok |
| 863 | @c dlopen (presumed ok) |
| 864 | @c dlsym (presumed ok) |
| 865 | @c step->init_fct (assumed ok) |
| 866 | @c step->end_fct (assumed ok) |
| 867 | @c gconv_get_builtin_trans ok |
| 868 | @c gconv_release_step ok |
| 869 | @c add_derivation ok |
| 870 | @c gconv_close_transform ok |
| 871 | @c gconv_release_step ok |
| 872 | @c step->end_fct (assumed ok) |
| 873 | @c gconv_release_shlib ok |
| 874 | @c dlclose (presumed ok) |
| 875 | @c gconv_release_cache ok |
| 876 | @c ->tomb->__fct (assumed ok) |
| 877 | The @code{wcrtomb} function (``wide character restartable to |
| 878 | multibyte'') converts a single wide character into a multibyte string |
| 879 | corresponding to that wide character. |
| 880 | |
| 881 | If @var{s} is a null pointer, the function resets the state stored in |
| 882 | the object pointed to by @var{ps} (or the internal @code{mbstate_t} |
| 883 | object) to the initial state. This can also be achieved by a call like |
| 884 | this: |
| 885 | |
| 886 | @smallexample |
| 887 | wcrtomb (temp_buf, L'\0', ps) |
| 888 | @end smallexample |
| 889 | |
| 890 | @noindent |
| 891 | since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it |
| 892 | writes into an internal buffer, which is guaranteed to be large enough. |
| 893 | |
| 894 | If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if |
| 895 | necessary, a shift sequence to get the state @var{ps} into the initial |
| 896 | state followed by a single NUL byte, which is stored in the string |
| 897 | @var{s}. |
| 898 | |
| 899 | Otherwise a byte sequence (possibly including shift sequences) is written |
| 900 | into the string @var{s}. This only happens if @var{wc} is a valid wide |
| 901 | character (i.e., it has a multibyte representation in the character set |
| 902 | selected by the locale of the @code{LC_CTYPE} category). If @var{wc} is |
| 903 | not a valid wide character, nothing is stored in the string @var{s}, |
| 904 | @code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps} |
| 905 | is undefined and the return value is @code{(size_t) -1}. |
| 906 | |
| 907 | If no error occurred the function returns the number of bytes stored in |
| 908 | the string @var{s}. This includes all bytes representing shift |
| 909 | sequences. |
| 910 | |
| 911 | One word about the interface of the function: there is no parameter |
| 912 | specifying the length of the array @var{s}. Instead the function |
| 913 | assumes that there are at least @code{MB_CUR_MAX} bytes available since |
| 914 | this is the maximum length of any byte sequence representing a single |
| 915 | character. So the caller has to make sure that there is enough space |
| 916 | available, otherwise buffer overruns can occur. |
| 917 | |
| 918 | @pindex wchar.h |
| 919 | @code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| 920 | declared in @file{wchar.h}. |
| 921 | @end deftypefun |
| 922 | |
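A minimal sketch of a single conversion with @code{wcrtomb} could look as
follows. The helpers @code{wc_to_mbs} and @code{reset_shift_state} are
hypothetical names chosen for this illustration. The output buffer is
sized with @code{MB_LEN_MAX}, which is a compile-time constant and never
smaller than @code{MB_CUR_MAX}.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* wcrtomb, mbstate_t */

/* Hypothetical helper: convert WC, starting in the initial shift
   state, into OUT and NUL-terminate the result.  Returns the
   number of bytes wcrtomb stored, or (size_t) -1 if WC has no
   multibyte representation in the current locale.  */
size_t
wc_to_mbs (char out[MB_LEN_MAX + 1], wchar_t wc)
{
  mbstate_t state;
  size_t n;

  memset (&state, '\0', sizeof (state));
  n = wcrtomb (out, wc, &state);
  if (n == (size_t) -1)
    return n;
  out[n] = '\0';
  return n;
}

/* Hypothetical helper: put *PS back into the initial shift state.
   With a null destination the bytes produced are discarded.  */
void
reset_shift_state (mbstate_t *ps)
{
  (void) wcrtomb (NULL, L'\0', ps);
}
```

Because the state is reset on every call, @code{wc_to_mbs} is only suitable
for isolated characters; converting a running text requires keeping one
state object across calls.
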
| 923 | Using @code{wcrtomb} is as easy as using @code{mbrtowc}. The following |
| 924 | example appends a wide character string to a multibyte character string. |
| 925 | Again, the code is not really useful (or correct), it is simply here to |
| 926 | demonstrate the use and some problems. |
| 927 | |
| 928 | @smallexample |
| 929 | char * |
| 930 | mbscatwcs (char *s, size_t len, const wchar_t *ws) |
| 931 | @{ |
| 932 | mbstate_t state; |
| 933 | /* @r{Find the end of the existing string.} */ |
| 934 | char *wp = strchr (s, '\0'); |
| 935 | len -= wp - s; |
| 936 | memset (&state, '\0', sizeof (state)); |
| 937 | do |
| 938 | @{ |
| 939 | size_t nbytes; |
| 940 | if (len < MB_CUR_MAX) |
| 941 | @{ |
| 942 | /* @r{We cannot guarantee that the next} |
| 943 | @r{character fits into the buffer, so} |
| 944 | @r{return an error.} */ |
| 945 | errno = E2BIG; |
| 946 | return NULL; |
| 947 | @} |
| 948 | nbytes = wcrtomb (wp, *ws, &state); |
| 949 | if (nbytes == (size_t) -1) |
| 950 | /* @r{Error in the conversion.} */ |
| 951 | return NULL; |
| 952 | len -= nbytes; |
| 953 | wp += nbytes; |
| 954 | @} |
| 955 | while (*ws++ != L'\0'); |
| 956 | return s; |
| 957 | @} |
| 958 | @end smallexample |
| 959 | |
| 960 | First the function has to find the end of the string currently in the |
| 961 | array @var{s}. The @code{strchr} call does this very efficiently since a |
| 962 | requirement for multibyte character representations is that the NUL byte |
| 963 | is never used except to represent itself (and in this context, the end |
| 964 | of the string). |
| 965 | |
| 966 | After initializing the state object the loop is entered where the first |
| 967 | task is to make sure there is enough room in the array @var{s}. We |
| 968 | abort if there are not at least @code{MB_CUR_MAX} bytes available. This |
| 969 | is not always optimal but we have no other choice. We might have fewer |
| 970 | than @code{MB_CUR_MAX} bytes available but the next multibyte character |
| 971 | might also be only one byte long. At the time the @code{wcrtomb} call |
| 972 | returns it is too late to decide whether the buffer was large enough. If |
| 973 | this solution is unsuitable, there is a very slow but more accurate |
| 974 | solution. |
| 975 | |
| 976 | @smallexample |
| 977 | @dots{} |
| 978 |   if (len < MB_CUR_MAX) |
| 979 |     @{ |
| 980 |       char temp_buf[MB_LEN_MAX]; |
| 981 |       mbstate_t temp_state = state; |
| 982 |       if (wcrtomb (temp_buf, *ws, &temp_state) > len) |
| 983 |         @{ |
| 984 |           /* @r{We cannot guarantee that the next} |
| 985 |              @r{character fits into the buffer, so} |
| 986 |              @r{return an error.} */ |
| 987 |           errno = E2BIG; |
| 988 |           return NULL; |
| 989 |         @} |
| 990 |     @} |
| 991 | @dots{} |
| 992 | @end smallexample |
| 993 | |
| 994 | Here we perform the conversion that might overflow the buffer so that |
| 995 | we are afterwards in the position to make an exact decision about the |
| 996 | buffer size. Please note that the destination of the new @code{wcrtomb} |
| 997 | call is a scratch buffer of @code{MB_LEN_MAX} bytes and not a null pointer; |
| 998 | with a null pointer the function would ignore @code{*ws} and only reset the |
| 999 | state. The most unusual thing about this piece of code certainly is the |
| 1000 | duplication of the conversion state object, but if a change of the state is |
| 1001 | necessary to emit the next multibyte character, we want to have the same |
| 1002 | shift state change performed in the real conversion. Therefore, we have to |
| 1003 | preserve the initial shift state information. |
| 1004 | |
| 1005 | There are certainly many more and even better solutions to this problem. |
| 1006 | This example is only provided for educational purposes. |
| 1007 | |
| 1008 | @node Converting Strings |
| 1009 | @subsection Converting Multibyte and Wide Character Strings |
| 1010 | |
| 1011 | The functions described in the previous section only convert a single |
| 1012 | character at a time. Most operations to be performed in real-world |
| 1013 | programs include strings and therefore the @w{ISO C} standard also |
| 1014 | defines conversions on entire strings. However, the defined set of |
| 1015 | functions is quite limited; therefore, @theglibc{} contains a few |
| 1016 | extensions that can help in some important situations. |
| 1017 | |
| 1018 | @comment wchar.h |
| 1019 | @comment ISO |
| 1020 | @deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| 1021 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1022 | The @code{mbsrtowcs} function (``multibyte string restartable to wide |
| 1023 | character string'') converts a NUL-terminated multibyte character |
| 1024 | string at @code{*@var{src}} into an equivalent wide character string, |
| 1025 | including the NUL wide character at the end. The conversion is started |
| 1026 | using the state information from the object pointed to by @var{ps} or |
| 1027 | from an internal object of @code{mbsrtowcs} if @var{ps} is a null |
| 1028 | pointer. Before returning, the state object is updated to match the state |
| 1029 | after the last converted character. The state is the initial state if the |
| 1030 | terminating NUL byte is reached and converted. |
| 1031 | |
| 1032 | If @var{dst} is not a null pointer, the result is stored in the array |
| 1033 | pointed to by @var{dst}; otherwise, the conversion result is not |
| 1034 | available since it is stored in an internal buffer. |
| 1035 | |
| 1036 | If @var{len} wide characters are stored in the array @var{dst} before |
| 1037 | reaching the end of the input string, the conversion stops and @var{len} |
| 1038 | is returned. If @var{dst} is a null pointer, @var{len} is never checked. |
| 1039 | |
| 1040 | Another reason for a premature return from the function call is if the |
| 1041 | input string contains an invalid multibyte sequence. In this case the |
| 1042 | global variable @code{errno} is set to @code{EILSEQ} and the function |
| 1043 | returns @code{(size_t) -1}. |
| 1044 | |
| 1045 | @c XXX The ISO C9x draft seems to have a problem here. It says that PS |
| 1046 | @c is not updated if DST is NULL. This is not said straightforward and |
| 1047 | @c none of the other functions is described like this. It would make sense |
| 1048 | @c to define the function this way but I don't think it is meant like this. |
| 1049 | |
| 1050 | In all other cases the function returns the number of wide characters |
| 1051 | converted during this call. If @var{dst} is not null, @code{mbsrtowcs} |
| 1052 | stores in the pointer pointed to by @var{src} either a null pointer (if |
| 1053 | the NUL byte in the input string was reached) or the address of the byte |
| 1054 | following the last converted multibyte character. |
| 1055 | |
| 1056 | @pindex wchar.h |
| 1057 | @code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is |
| 1058 | declared in @file{wchar.h}. |
| 1059 | @end deftypefun |
| 1060 | |
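Since @var{len} is not checked when @var{dst} is a null pointer, a first
call can be used purely to measure the result, followed by a second call
that performs the conversion into a buffer of exactly the right size. The
following is a sketch of that two-pass approach; the name
@code{mbs_to_wcs_dup} is hypothetical, and the cost of converting the
input twice is accepted here for the sake of a precise allocation.

```c
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbsrtowcs, mbstate_t */

/* Hypothetical helper: return a malloc'ed wide character copy of
   the multibyte string S, or NULL if S contains an invalid
   sequence or memory is exhausted.  */
wchar_t *
mbs_to_wcs_dup (const char *s)
{
  mbstate_t state;
  const char *src = s;
  wchar_t *result;
  size_t n;

  /* First pass: DST is null, so only the length is computed.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  /* The count does not include the terminating NUL wide character.  */
  result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert for real, again from the initial state.  */
  memset (&state, '\0', sizeof (state));
  src = s;
  if (mbsrtowcs (result, &src, n + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

The caller owns the returned block and releases it with @code{free}.
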
| 1061 | The definition of the @code{mbsrtowcs} function has one important |
| 1062 | limitation. The requirement that the input at @code{*@var{src}} has to be |
| 1063 | a NUL-terminated string causes problems if one wants to convert buffers |
| 1064 | with text. A buffer is normally not a collection of NUL-terminated strings |
| 1065 | but instead a continuous collection of lines, separated by newline |
| 1066 | characters. Now assume that a function to convert one line from a buffer is |
| 1067 | needed. Since the line is not NUL-terminated, the source pointer cannot |
| 1068 | directly point into the unmodified text buffer. This means, either one |
| 1069 | inserts the NUL byte at the appropriate place for the time of the |
| 1070 | @code{mbsrtowcs} function call (which is not doable for a read-only buffer |
| 1071 | or in a multi-threaded application) or one copies the line into an extra |
| 1072 | buffer where it can be terminated by a NUL byte. Note that it is not in |
| 1073 | general possible to limit the number of characters to convert by setting |
| 1074 | the parameter @var{len} to any specific value. Since it is not known how |
| 1075 | many bytes each multibyte character sequence is in length, one can only |
| 1076 | guess. |
| 1077 | |
| 1078 | @cindex stateful |
| 1079 | There is still a problem with the method of NUL-terminating a line right |
| 1080 | after the newline character, which could lead to very strange results. |
| 1081 | As said in the description of the @code{mbsrtowcs} function above the |
| 1082 | conversion state is guaranteed to be in the initial shift state after |
| 1083 | processing the NUL byte at the end of the input string. But this NUL |
| 1084 | byte is not really part of the text (i.e., the conversion state after |
| 1085 | the newline in the original text could be something different than the |
| 1086 | initial shift state and therefore the first character of the next line |
| 1087 | is encoded using this state). But the state in question is never |
| 1088 | accessible to the user since the conversion stops after the NUL byte |
| 1089 | (which resets the state). Most stateful character sets in use today |
| 1090 | require that the shift state after a newline be the initial state, but |
| 1091 | this is not a strict guarantee. Therefore, simply NUL-terminating a |
| 1092 | piece of a running text is not always an adequate solution and should |
| 1093 | never be used in general-purpose code. |
| 1094 | |
| 1095 | The generic conversion interface (@pxref{Generic Charset Conversion}) |
| 1096 | does not have this limitation (it simply works on buffers, not |
| 1097 | strings), and @theglibc{} contains a set of functions that take |
| 1098 | additional parameters specifying the maximal number of bytes that are |
| 1099 | consumed from the input string. This way the problem of |
| 1100 | @code{mbsrtowcs}'s example above could be solved by determining the line |
| 1101 | length and passing this length to the function. |
| 1102 | |
| 1103 | @comment wchar.h |
| 1104 | @comment ISO |
| 1105 | @deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| 1106 | @safety{@prelim{}@mtunsafe{@mtasurace{:wcsrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1107 | The @code{wcsrtombs} function (``wide character string restartable to |
| 1108 | multibyte string'') converts the NUL-terminated wide character string at |
| 1109 | @code{*@var{src}} into an equivalent multibyte character string and |
| 1110 | stores the result in the array pointed to by @var{dst}. The NUL wide |
| 1111 | character is also converted. The conversion starts in the state |
| 1112 | described in the object pointed to by @var{ps} or by a state object |
| 1113 | local to @code{wcsrtombs} in case @var{ps} is a null pointer. If |
| 1114 | @var{dst} is a null pointer, the conversion is performed as usual but the |
| 1115 | result is not available. If all characters of the input string were |
| 1116 | successfully converted and if @var{dst} is not a null pointer, the |
| 1117 | pointer pointed to by @var{src} gets assigned a null pointer. |
| 1118 | |
| 1119 | If one of the wide characters in the input string has no valid multibyte |
| 1120 | character equivalent, the conversion stops early, sets the global |
| 1121 | variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. |
| 1122 | |
| 1123 | Another reason for a premature stop is if @var{dst} is not a null |
| 1124 | pointer and the next converted character would require more than |
| 1125 | @var{len} bytes in total in the array @var{dst}. In this case (and if |
| 1126 | @var{dst} is not a null pointer) the pointer pointed to by @var{src} is |
| 1127 | assigned a value pointing to the wide character right after the last one |
| 1128 | successfully converted. |
| 1129 | |
| 1130 | Except in the case of an encoding error the return value of the |
| 1131 | @code{wcsrtombs} function is the number of bytes in all the multibyte |
| 1132 | character sequences stored in @var{dst}. Before returning the state in |
| 1133 | the object pointed to by @var{ps} (or the internal object in case |
| 1134 | @var{ps} is a null pointer) is updated to reflect the state after the |
| 1135 | last conversion. The state is the initial shift state in case the |
| 1136 | terminating NUL wide character was converted. |
| 1137 | |
| 1138 | @pindex wchar.h |
| 1139 | The @code{wcsrtombs} function was introduced in @w{Amendment 1} to |
| 1140 | @w{ISO C90} and is declared in @file{wchar.h}. |
| 1141 | @end deftypefun |
| 1142 | |
| 1143 | The restriction mentioned above for the @code{mbsrtowcs} function applies |
| 1144 | here also. There is no possibility of directly controlling the number of |
| 1145 | input characters. One has to place the NUL wide character at the correct |
| 1146 | place or control the consumed input indirectly via the available output |
| 1147 | array size (the @var{len} parameter). |
| 1148 | |
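Controlling the consumed input through the size of the output array can be
sketched as follows: a long wide character string is converted one small
chunk at a time, and the updated source pointer tells us where to continue.
The function name @code{put_wcs_chunked} and the chunk size are assumptions
made for this illustration only.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdio.h>    /* FILE, fwrite */
#include <string.h>   /* memset */
#include <wchar.h>    /* wcsrtombs, mbstate_t */

/* Hypothetical helper: write the multibyte form of WS to FP,
   converting one small chunk at a time.  Returns the number of
   bytes written, or (size_t) -1 on a conversion or write error.  */
size_t
put_wcs_chunked (const wchar_t *ws, FILE *fp)
{
  /* At least MB_LEN_MAX bytes, so every call can make progress.  */
  char chunk[4 * MB_LEN_MAX];
  mbstate_t state;
  size_t total = 0;

  memset (&state, '\0', sizeof (state));
  while (ws != NULL)
    {
      /* wcsrtombs advances WS and stores a null pointer in it once
         the terminating NUL wide character has been converted.  */
      size_t n = wcsrtombs (chunk, &ws, sizeof (chunk), &state);
      if (n == (size_t) -1)
        return n;
      if (fwrite (chunk, 1, n, fp) != n)
        return (size_t) -1;
      total += n;
    }
  return total;
}
```

The same state object is used for all chunks, so shift sequences emitted in
one chunk remain in effect for the next one.
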
| 1149 | @comment wchar.h |
| 1150 | @comment GNU |
| 1151 | @deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| 1152 | @safety{@prelim{}@mtunsafe{@mtasurace{:mbsnrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1153 | The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs} |
| 1154 | function. All the parameters are the same except for @var{nmc}, which is |
| 1155 | new. The return value is the same as for @code{mbsrtowcs}. |
| 1156 | |
| 1157 | This new parameter specifies how many bytes at most can be used from the |
| 1158 | multibyte character string. In other words, the multibyte character |
| 1159 | string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte |
| 1160 | is found within the first @var{nmc} bytes of the string, the conversion |
| 1161 | stops there. |
| 1162 | |
| 1163 | This function is a GNU extension. It is meant to work around the |
| 1164 | problems mentioned above. Now it is possible to convert a buffer with |
| 1165 | multibyte character text piece for piece without having to care about |
| 1166 | inserting NUL bytes and the effect of NUL bytes on the conversion state. |
| 1167 | @end deftypefun |
| 1168 | |
| 1169 | A function to convert a multibyte string into a wide character string |
| 1170 | and display it could be written like this (this is not a really useful |
| 1171 | example): |
| 1172 | |
| 1173 | @smallexample |
| 1174 | void |
| 1175 | showmbs (const char *src, FILE *fp) |
| 1176 | @{ |
| 1177 |   mbstate_t state; |
| 1178 |   int cnt = 0; |
| 1179 |   memset (&state, '\0', sizeof (state)); |
| 1180 |   while (1) |
| 1181 |     @{ |
| 1182 |       wchar_t linebuf[100]; |
| 1183 |       const char *endp = strchr (src, '\n'); |
| 1184 |       /* @r{Exit if there is no more line.} */ |
| 1185 |       if (endp == NULL) |
| 1186 |         break; |
| 1187 |       size_t n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state); |
| 1188 |       if (n == (size_t) -1) |
| 1189 |         break;          /* @r{Invalid multibyte sequence.} */ |
| 1190 |       linebuf[n] = L'\0'; |
| 1191 |       fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf); |
| 1192 |       src = endp + 1;   /* @r{Skip the newline.} */ |
| 1193 |     @} |
| 1194 | @} |
| 1195 | @end smallexample |
| 1196 | |
| 1197 | There is no problem with the state after a call to @code{mbsnrtowcs}. |
| 1198 | Since we don't insert characters in the strings that were not in there |
| 1199 | right from the beginning and we use @var{state} only for the conversion |
| 1200 | of the given buffer, there is no problem with altering the state. |
| 1201 | |
| 1202 | @comment wchar.h |
| 1203 | @comment GNU |
| 1204 | @deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps}) |
| 1205 | @safety{@prelim{}@mtunsafe{@mtasurace{:wcsnrtombs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1206 | The @code{wcsnrtombs} function implements the conversion from wide |
| 1207 | character strings to multibyte character strings. It is similar to |
| 1208 | @code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra |
| 1209 | parameter, which specifies the length of the input string. |
| 1210 | |
| 1211 | No more than @var{nwc} wide characters from the input string |
| 1212 | @code{*@var{src}} are converted. If the input string contains a NUL |
| 1213 | wide character in the first @var{nwc} characters, the conversion stops at |
| 1214 | this place. |
| 1215 | |
| 1216 | The @code{wcsnrtombs} function is a GNU extension and just like |
| 1217 | @code{mbsnrtowcs} helps in situations where no NUL-terminated input |
| 1218 | strings are available. |
| 1219 | @end deftypefun |
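As a rough sketch of how @code{wcsnrtombs} can be applied to input that is not NUL-terminated, the following helper converts exactly the first @var{nwc} wide characters of a buffer. The helper name @code{wide_slice_to_mb} and its return convention are illustrative assumptions, not part of the library.

```c
#define _GNU_SOURCE 1   /* wcsnrtombs is a GNU extension. */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the first NWC wide characters of SRC
   into DST (capacity DSTSIZE bytes) and NUL-terminate the result.
   Returns the number of bytes written, not counting the terminator,
   or (size_t) -1 if a character cannot be converted or the output
   does not fit.  */
size_t
wide_slice_to_mb (char *dst, size_t dstsize, const wchar_t *src, size_t nwc)
{
  mbstate_t state;
  const wchar_t *p = src;
  size_t n;

  assert (dst != NULL && dstsize > 0);
  memset (&state, '\0', sizeof (state));

  /* Leave one byte free for the terminator we add ourselves.  */
  n = wcsnrtombs (dst, &p, nwc, dstsize - 1, &state);
  if (n == (size_t) -1)
    return n;

  /* P is null if a NUL wide character ended the conversion early;
     otherwise it must have reached the end of the slice.  */
  if (p != NULL && p != src + nwc)
    return (size_t) -1;

  dst[n] = '\0';
  return n;
}
```

Calling @code{wide_slice_to_mb (buf, sizeof buf, L"hello world", 5)} would leave the five bytes of the first word in @code{buf}, without the caller having to plant a NUL wide character in the source text.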
| 1220 | |
| 1221 | |
| 1222 | @node Multibyte Conversion Example |
| 1223 | @subsection A Complete Multibyte Conversion Example |
| 1224 | |
| 1225 | The example programs given in the last sections are only brief and do |
| 1226 | not contain all the error checking, etc. Presented here is a complete |
| 1227 | and documented example. It features the @code{mbrtowc} function but it |
| 1228 | should be easy to derive versions using the other functions. |
| 1229 | |
| 1230 | @smallexample |
| 1231 | int |
| 1232 | file_mbsrtowcs (int input, int output) |
| 1233 | @{ |
| 1234 | /* @r{Note the use of @code{MB_LEN_MAX}.} |
| 1235 | @r{@code{MB_CUR_MAX} cannot portably be used here.} */ |
| 1236 | char buffer[BUFSIZ + MB_LEN_MAX]; |
| 1237 | mbstate_t state; |
| 1238 | int filled = 0; |
| 1239 | int eof = 0; |
| 1240 | |
| 1241 | /* @r{Initialize the state.} */ |
| 1242 | memset (&state, '\0', sizeof (state)); |
| 1243 | |
| 1244 | while (!eof) |
| 1245 | @{ |
| 1246 | ssize_t nread; |
| 1247 | ssize_t nwrite; |
| 1248 | char *inp = buffer; |
| 1249 | wchar_t outbuf[BUFSIZ]; |
| 1250 | wchar_t *outp = outbuf; |
| 1251 | |
| 1252 | /* @r{Fill up the buffer from the input file.} */ |
| 1253 | nread = read (input, buffer + filled, BUFSIZ); |
| 1254 | if (nread < 0) |
| 1255 | @{ |
| 1256 | perror ("read"); |
| 1257 | return 0; |
| 1258 | @} |
| 1259 | /* @r{If we reach end of file, make a note to read no more.} */ |
| 1260 | if (nread == 0) |
| 1261 | eof = 1; |
| 1262 | |
| 1263 | /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */ |
| 1264 | filled += nread; |
| 1265 | |
| 1266 | /* @r{Convert those bytes to wide characters--as many as we can.} */ |
| 1267 | while (filled > 0) |
| 1268 | @{ |
| 1269 | mbstate_t trial = state; |
| 1270 | size_t thislen = mbrtowc (outp, inp, filled, &trial); |
| 1271 | /* @r{Stop at an invalid sequence (-1) or at the first part of a} |
| 1272 | @r{valid character (-2); @code{state} is untouched, so we can retry.} */ |
| 1273 | if (thislen == (size_t) -1 || thislen == (size_t) -2) |
| 1274 | break; |
| 1275 | state = trial; |
| 1276 | /* @r{An embedded NUL byte yields 0 but still uses one byte.} */ |
| 1277 | if (thislen == 0) |
| 1278 | thislen = 1; |
| 1279 | /* @r{Advance past this character.} */ |
| 1280 | inp += thislen; |
| 1281 | filled -= thislen; |
| 1282 | ++outp; |
| 1283 | @} |
| 1284 | |
| 1285 | /* @r{Write the wide characters we just made.} */ |
| 1286 | nwrite = write (output, outbuf, |
| 1287 | (outp - outbuf) * sizeof (wchar_t)); |
| 1288 | if (nwrite < 0) |
| 1289 | @{ |
| 1290 | perror ("write"); |
| 1291 | return 0; |
| 1292 | @} |
| 1293 | |
| 1294 | /* @r{See if we have a @emph{real} invalid character.} */ |
| 1295 | if ((eof && filled > 0) || filled >= MB_CUR_MAX) |
| 1296 | @{ |
| 1297 | error (0, 0, "invalid multibyte character"); |
| 1298 | return 0; |
| 1299 | @} |
| 1300 | |
| 1301 | /* @r{If any characters must be carried forward,} |
| 1302 | @r{put them at the beginning of @code{buffer}.} */ |
| 1303 | if (filled > 0) |
| 1304 | memmove (buffer, inp, filled); |
| 1305 | @} |
| 1306 | |
| 1307 | return 1; |
| 1308 | @} |
| 1309 | @end smallexample |
| 1310 | |
| 1311 | |
| 1312 | @node Non-reentrant Conversion |
| 1313 | @section Non-reentrant Conversion Function |
| 1314 | |
| 1315 | The functions described in the previous section are defined in |
| 1316 | @w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard |
| 1317 | also contained functions for character set conversion. The reason that |
| 1318 | these original functions are not described first is that they are almost |
| 1319 | entirely useless. |
| 1320 | |
| 1321 | The problem is that all the conversion functions described in the |
| 1322 | original @w{ISO C90} keep their state internally. Such hidden state implies that |
| 1323 | multiple conversions at the same time (not only when using threads) |
| 1324 | cannot be done, and that you cannot first convert single characters and |
| 1325 | then strings since you cannot tell the conversion functions which state |
| 1326 | to use. |
| 1327 | |
| 1328 | These original functions are therefore usable only in a very limited set |
| 1329 | of situations. One must complete converting the entire string before |
| 1330 | starting a new one, and each string/text must be converted with the same |
| 1331 | function (there is no problem with the library itself; it is guaranteed |
| 1332 | that no library function changes the state of any of these functions). |
| 1333 | @strong{For the above reasons it is strongly recommended that the functions |
| 1334 | described in the previous section be used in place of non-reentrant |
| 1335 | conversion functions.} |
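To make the limitation concrete, here is a minimal sketch of something the restartable functions allow but the original functions cannot express: two conversions advancing in lockstep, each with its own @code{mbstate_t} object. The function name @code{common_prefix_chars} is a hypothetical illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count how many leading characters (not bytes)
   two multibyte strings have in common.  Each string gets its own
   conversion state, so the two scans can be interleaved freely.  */
size_t
common_prefix_chars (const char *a, const char *b)
{
  mbstate_t sa, sb;
  size_t count = 0;

  assert (a != NULL && b != NULL);
  memset (&sa, '\0', sizeof (sa));
  memset (&sb, '\0', sizeof (sb));

  while (1)
    {
      wchar_t wa, wb;
      size_t la = mbrtowc (&wa, a, strlen (a) + 1, &sa);
      size_t lb = mbrtowc (&wb, b, strlen (b) + 1, &sb);

      /* Stop at the end of either string (0) or on any error
         ((size_t) -1 or (size_t) -2).  */
      if (la == 0 || lb == 0 || la >= (size_t) -2 || lb >= (size_t) -2)
        break;
      if (wa != wb)
        break;
      a += la;
      b += lb;
      ++count;
    }
  return count;
}
```

With @code{mbtowc} the same loop would feed bytes from both strings through a single hidden state, which goes wrong as soon as the encoding in use has shift states.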
| 1336 | |
| 1337 | @menu |
| 1338 | * Non-reentrant Character Conversion:: Non-reentrant Conversion of Single |
| 1339 | Characters. |
| 1340 | * Non-reentrant String Conversion:: Non-reentrant Conversion of Strings. |
| 1341 | * Shift State:: States in Non-reentrant Functions. |
| 1342 | @end menu |
| 1343 | |
| 1344 | @node Non-reentrant Character Conversion |
| 1345 | @subsection Non-reentrant Conversion of Single Characters |
| 1346 | |
| 1347 | @comment stdlib.h |
| 1348 | @comment ISO |
| 1349 | @deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size}) |
| 1350 | @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1351 | The @code{mbtowc} (``multibyte to wide character'') function when called |
| 1352 | with non-null @var{string} converts the first multibyte character |
| 1353 | beginning at @var{string} to its corresponding wide character code. It |
| 1354 | stores the result in @code{*@var{result}}. |
| 1355 | |
| 1356 | @code{mbtowc} never examines more than @var{size} bytes. (The idea is |
| 1357 | to supply for @var{size} the number of bytes of data you have in hand.) |
| 1358 | |
| 1359 | @code{mbtowc} with non-null @var{string} distinguishes three |
| 1360 | possibilities: the first @var{size} bytes at @var{string} start with |
| 1361 | valid multibyte characters, they start with an invalid byte sequence or |
| 1362 | just part of a character, or @var{string} points to an empty string (a |
| 1363 | null character). |
| 1364 | |
| 1365 | For a valid multibyte character, @code{mbtowc} converts it to a wide |
| 1366 | character and stores that in @code{*@var{result}}, and returns the |
| 1367 | number of bytes in that character (always at least @math{1} and never |
| 1368 | more than @var{size}). |
| 1369 | |
| 1370 | For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an |
| 1371 | empty string, it returns @math{0}, also storing @code{'\0'} in |
| 1372 | @code{*@var{result}}. |
| 1373 | |
| 1374 | If the multibyte character code uses shift characters, then |
| 1375 | @code{mbtowc} maintains and updates a shift state as it scans. If you |
| 1376 | call @code{mbtowc} with a null pointer for @var{string}, that |
| 1377 | initializes the shift state to its standard initial value. It also |
| 1378 | returns nonzero if the multibyte character code in use actually has a |
| 1379 | shift state. @xref{Shift State}. |
| 1380 | @end deftypefun |
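The following sketch shows one way the return values of @code{mbtowc} might be used to walk over a string. The function name @code{count_chars_mbtowc} is an assumption made for this illustration. Because @code{mbtowc} keeps a hidden state, the sketch is only appropriate when no other scan is in progress.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number of multibyte characters in
   S, or -1 if S contains an invalid or incomplete sequence.  */
long
count_chars_mbtowc (const char *s)
{
  size_t left;
  long count = 0;

  assert (s != NULL);
  left = strlen (s);

  /* Reset the hidden shift state before scanning a new string.  */
  mbtowc (NULL, NULL, 0);

  while (left > 0)
    {
      wchar_t wc;
      int len = mbtowc (&wc, s, left);

      if (len < 0)
        return -1;
      /* A result of 0 means a null character; stop there.  */
      if (len == 0)
        break;
      s += len;
      left -= len;
      ++count;
    }
  return count;
}
```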
| 1381 | |
| 1382 | @comment stdlib.h |
| 1383 | @comment ISO |
| 1384 | @deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar}) |
| 1385 | @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1386 | The @code{wctomb} (``wide character to multibyte'') function converts |
| 1387 | the wide character code @var{wchar} to its corresponding multibyte |
| 1388 | character sequence, and stores the result in bytes starting at |
| 1389 | @var{string}. At most @code{MB_CUR_MAX} bytes are stored. |
| 1390 | |
| 1391 | @code{wctomb} with non-null @var{string} distinguishes three |
| 1392 | possibilities for @var{wchar}: a valid wide character code (one that can |
| 1393 | be translated to a multibyte character), an invalid code, and |
| 1394 | @code{L'\0'}. |
| 1395 | |
| 1396 | Given a valid code, @code{wctomb} converts it to a multibyte character, |
| 1397 | storing the bytes starting at @var{string}. Then it returns the number |
| 1398 | of bytes in that character (always at least @math{1} and never more |
| 1399 | than @code{MB_CUR_MAX}). |
| 1400 | |
| 1401 | If @var{wchar} is an invalid wide character code, @code{wctomb} returns |
| 1402 | @math{-1}. If @var{wchar} is @code{L'\0'}, it stores a null byte and returns the |
| 1403 | number of bytes stored (@math{1} unless a shift sequence precedes the null byte). |
| 1404 | |
| 1405 | If the multibyte character code uses shift characters, then |
| 1406 | @code{wctomb} maintains and updates a shift state as it scans. If you |
| 1407 | call @code{wctomb} with a null pointer for @var{string}, that |
| 1408 | initializes the shift state to its standard initial value. It also |
| 1409 | returns nonzero if the multibyte character code in use actually has a |
| 1410 | shift state. @xref{Shift State}. |
| 1411 | |
| 1412 | Calling this function with a @var{wchar} argument of zero when |
| 1413 | @var{string} is not null has the side-effect of reinitializing the |
| 1414 | stored shift state @emph{as well as} storing the multibyte character |
| 1415 | @code{'\0'} and returning the number of bytes stored. |
| 1416 | @end deftypefun |
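A minimal sketch of encoding a whole wide string one character at a time with @code{wctomb} might look as follows. The name @code{encode_with_wctomb} and its error convention are assumptions of this illustration. The sketch relies on the ISO C rule that converting @code{L'\0'} stores a null byte and returns at least @math{1}.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: encode the wide string WS into DST, which has
   room for DSTSIZE bytes.  Returns the number of bytes stored, not
   counting the terminating null byte, or (size_t) -1 if a code is
   invalid or DST is too small.  */
size_t
encode_with_wctomb (char *dst, size_t dstsize, const wchar_t *ws)
{
  size_t used = 0;

  assert (dst != NULL && ws != NULL);

  /* Reset the hidden shift state before starting a new string.  */
  wctomb (NULL, L'\0');

  while (1)
    {
      char tmp[MB_LEN_MAX];
      int len = wctomb (tmp, *ws);

      if (len < 0 || (size_t) len > dstsize - used)
        return (size_t) -1;
      memcpy (dst + used, tmp, len);
      /* The terminating null wide character is converted as well, so
         DST ends with a null byte; do not count that byte.  */
      if (*ws == L'\0')
        return used + len - 1;
      used += len;
      ++ws;
    }
}
```

Converting into a temporary array of @code{MB_LEN_MAX} bytes first keeps @code{wctomb} from writing past the end of @code{dst} when only a few bytes of room remain.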
| 1417 | |
| 1418 | Similar to @code{mbrlen} there is also a non-reentrant function that |
| 1419 | computes the length of a multibyte character. It can be defined in |
| 1420 | terms of @code{mbtowc}. |
| 1421 | |
| 1422 | @comment stdlib.h |
| 1423 | @comment ISO |
| 1424 | @deftypefun int mblen (const char *@var{string}, size_t @var{size}) |
| 1425 | @safety{@prelim{}@mtunsafe{@mtasurace{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1426 | The @code{mblen} function with a non-null @var{string} argument returns |
| 1427 | the number of bytes that make up the multibyte character beginning at |
| 1428 | @var{string}, never examining more than @var{size} bytes. (The idea is |
| 1429 | to supply for @var{size} the number of bytes of data you have in hand.) |
| 1430 | |
| 1431 | The return value of @code{mblen} distinguishes three possibilities: the |
| 1432 | first @var{size} bytes at @var{string} start with valid multibyte |
| 1433 | characters, they start with an invalid byte sequence or just part of a |
| 1434 | character, or @var{string} points to an empty string (a null character). |
| 1435 | |
| 1436 | For a valid multibyte character, @code{mblen} returns the number of |
| 1437 | bytes in that character (always at least @math{1} and never more than |
| 1438 | @var{size}). For an invalid byte sequence, @code{mblen} returns |
| 1439 | @math{-1}. For an empty string, it returns @math{0}. |
| 1440 | |
| 1441 | If the multibyte character code uses shift characters, then @code{mblen} |
| 1442 | maintains and updates a shift state as it scans. If you call |
| 1443 | @code{mblen} with a null pointer for @var{string}, that initializes the |
| 1444 | shift state to its standard initial value. It also returns a nonzero |
| 1445 | value if the multibyte character code in use actually has a shift state. |
| 1446 | @xref{Shift State}. |
| 1447 | |
| 1448 | @pindex stdlib.h |
| 1449 | The function @code{mblen} is declared in @file{stdlib.h}. |
| 1450 | @end deftypefun |
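The remark that @code{mblen} can be defined in terms of @code{mbtowc} can be sketched as below. The name @code{mblen_via_mbtowc} is hypothetical. Unlike the real @code{mblen}, this stand-in shares, and therefore changes, the hidden state of @code{mbtowc}.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for mblen: discard the converted wide
   character and keep only the length that mbtowc reports.  A null
   STRING resets the shift state, exactly as it does for mbtowc.  */
int
mblen_via_mbtowc (const char *string, size_t size)
{
  int len = mbtowc (NULL, string, size);

  /* For a non-null STRING the result is -1, 0, or a byte count that
     never exceeds SIZE.  */
  assert (string == NULL || len <= 0 || (size_t) len <= size);
  return len;
}
```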
| 1451 | |
| 1452 | |
| 1453 | @node Non-reentrant String Conversion |
| 1454 | @subsection Non-reentrant Conversion of Strings |
| 1455 | |
| 1456 | For convenience the @w{ISO C90} standard also defines functions to |
| 1457 | convert entire strings instead of single characters. These functions |
| 1458 | suffer from the same problems as their reentrant counterparts from |
| 1459 | @w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}. |
| 1460 | |
| 1461 | @comment stdlib.h |
| 1462 | @comment ISO |
| 1463 | @deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size}) |
| 1464 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1465 | @c Odd... Although this was supposed to be non-reentrant, the internal |
| 1466 | @c state is not a static buffer, but an automatic variable. |
| 1467 | The @code{mbstowcs} (``multibyte string to wide character string'') |
| 1468 | function converts the null-terminated string of multibyte characters |
| 1469 | @var{string} to an array of wide character codes, storing not more than |
| 1470 | @var{size} wide characters into the array beginning at @var{wstring}. |
| 1471 | The terminating null character counts towards the size, so if @var{size} |
| 1472 | is less than or equal to the number of wide characters resulting from |
| 1473 | @var{string}, no terminating null character is stored. |
| 1474 | |
| 1475 | The conversion of characters from @var{string} begins in the initial |
| 1476 | shift state. |
| 1477 | |
| 1478 | If an invalid multibyte character sequence is found, the @code{mbstowcs} |
| 1479 | function returns a value of @math{-1}. Otherwise, it returns the number |
| 1480 | of wide characters stored in the array @var{wstring}. This number does |
| 1481 | not include the terminating null character, which is present if the |
| 1482 | number is less than @var{size}. |
| 1483 | |
| 1484 | Here is an example showing how to convert a string of multibyte |
| 1485 | characters, allocating enough space for the result. |
| 1486 | |
| 1487 | @smallexample |
| 1488 | wchar_t * |
| 1489 | mbstowcs_alloc (const char *string) |
| 1490 | @{ |
| 1491 | size_t size = strlen (string) + 1; |
| 1492 | wchar_t *buf = xmalloc (size * sizeof (wchar_t)); |
| 1493 | |
| 1494 | size = mbstowcs (buf, string, size); |
| 1495 | if (size != (size_t) -1) |
| 1496 | return xrealloc (buf, (size + 1) * sizeof (wchar_t)); |
| 1497 | free (buf); |
| 1498 | return NULL; |
| 1499 | @} |
| 1500 | @end smallexample |
| 1501 | |
| 1502 | @end deftypefun |
| 1503 | |
| 1504 | @comment stdlib.h |
| 1505 | @comment ISO |
| 1506 | @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size}) |
| 1507 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1508 | The @code{wcstombs} (``wide character string to multibyte string'') |
| 1509 | function converts the null-terminated wide character array @var{wstring} |
| 1510 | into a string containing multibyte characters, storing not more than |
| 1511 | @var{size} bytes starting at @var{string}, followed by a terminating |
| 1512 | null character if there is room. The conversion of characters begins in |
| 1513 | the initial shift state. |
| 1514 | |
| 1515 | The terminating null character counts towards the size, so if @var{size} |
| 1516 | is less than or equal to the number of bytes needed for @var{wstring}, no |
| 1517 | terminating null character is stored. |
| 1518 | |
| 1519 | If a code that does not correspond to a valid multibyte character is |
| 1520 | found, the @code{wcstombs} function returns a value of @math{-1}. |
| 1521 | Otherwise, the return value is the number of bytes stored in the array |
| 1522 | @var{string}. This number does not include the terminating null character, |
| 1523 | which is present if the number is less than @var{size}. |
| 1524 | @end deftypefun |
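By analogy with the @code{mbstowcs} example, a wide character string could be converted into freshly allocated storage roughly as follows. The name @code{wcstombs_alloc} is an assumption of this sketch, and plain @code{malloc} is used so that the block does not depend on helper functions from outside the C library.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical counterpart of mbstowcs_alloc: return a newly
   allocated multibyte copy of WSTRING, or NULL if memory runs out or
   a wide character cannot be converted.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  size_t size;
  size_t n;
  char *buf;

  assert (wstring != NULL);

  /* Each wide character, including the terminator, needs at most
     MB_CUR_MAX bytes.  */
  size = (wcslen (wstring) + 1) * MB_CUR_MAX;
  buf = malloc (size);
  if (buf == NULL)
    return NULL;

  n = wcstombs (buf, wstring, size);
  if (n == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  return buf;
}
```

The buffer is sized for the worst case, so the terminating null byte always fits. A caller that cares about memory could shrink the result with @code{realloc} afterwards.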
| 1525 | |
| 1526 | @node Shift State |
| 1527 | @subsection States in Non-reentrant Functions |
| 1528 | |
| 1529 | In some multibyte character codes, the @emph{meaning} of any particular |
| 1530 | byte sequence is not fixed; it depends on what other sequences have come |
| 1531 | earlier in the same string. Typically there are just a few sequences that |
| 1532 | can change the meaning of other sequences; these few are called |
| 1533 | @dfn{shift sequences} and we say that they set the @dfn{shift state} for |
| 1534 | other sequences that follow. |
| 1535 | |
| 1536 | To illustrate shift state and shift sequences, suppose we decide that |
| 1537 | the sequence @code{0200} (just one byte) enters Japanese mode, in which |
| 1538 | pairs of bytes in the range from @code{0240} to @code{0377} are single |
| 1539 | characters, while @code{0201} enters Latin-1 mode, in which single bytes |
| 1540 | in the range from @code{0240} to @code{0377} are characters, and |
| 1541 | interpreted according to the ISO Latin-1 character set. This is a |
| 1542 | multibyte code that has two alternative shift states (``Japanese mode'' |
| 1543 | and ``Latin-1 mode''), and two shift sequences that specify particular |
| 1544 | shift states. |
| 1545 | |
| 1546 | When the multibyte character code in use has shift states, then |
| 1547 | @code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update |
| 1548 | the current shift state as they scan the string. To make this work |
| 1549 | properly, you must follow these rules: |
| 1550 | |
| 1551 | @itemize @bullet |
| 1552 | @item |
| 1553 | Before starting to scan a string, call the function with a null pointer |
| 1554 | for the multibyte character address---for example, @code{mblen (NULL, |
| 1555 | 0)}. This initializes the shift state to its standard initial value. |
| 1556 | |
| 1557 | @item |
| 1558 | Scan the string one character at a time, in order. Do not ``back up'' |
| 1559 | and rescan characters already scanned, and do not intersperse the |
| 1560 | processing of different strings. |
| 1561 | @end itemize |
| 1562 | |
| 1563 | Here is an example of using @code{mblen} following these rules: |
| 1564 | |
| 1565 | @smallexample |
| 1566 | void |
| 1567 | scan_string (char *s) |
| 1568 | @{ |
| 1569 | int length = strlen (s); |
| 1570 | |
| 1571 | /* @r{Initialize shift state.} */ |
| 1572 | mblen (NULL, 0); |
| 1573 | |
| 1574 | while (1) |
| 1575 | @{ |
| 1576 | int thischar = mblen (s, length); |
| 1577 | /* @r{Deal with end of string and invalid characters.} */ |
| 1578 | if (thischar == 0) |
| 1579 | break; |
| 1580 | if (thischar == -1) |
| 1581 | @{ |
| 1582 | error (0, 0, "invalid multibyte character"); |
| 1583 | break; |
| 1584 | @} |
| 1585 | /* @r{Advance past this character.} */ |
| 1586 | s += thischar; |
| 1587 | length -= thischar; |
| 1588 | @} |
| 1589 | @} |
| 1590 | @end smallexample |
| 1591 | |
| 1592 | The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not |
| 1593 | reentrant when using a multibyte code that uses a shift state. However, |
| 1594 | no other library functions call these functions, so you don't have to |
| 1595 | worry that the shift state will be changed mysteriously. |
| 1596 | |
| 1597 | |
| 1598 | @node Generic Charset Conversion |
| 1599 | @section Generic Charset Conversion |
| 1600 | |
| 1601 | The conversion functions mentioned so far in this chapter all had in |
| 1602 | common that they operate on character sets that are not directly |
| 1603 | specified by the functions. The multibyte encoding used is specified by |
| 1604 | the currently selected locale for the @code{LC_CTYPE} category. The |
| 1605 | wide character set is fixed by the implementation (in the case of @theglibc{} |
| 1606 | it is always UCS-4 encoded @w{ISO 10646}). |
| 1607 | |
| 1608 | This of course causes several problems when it comes to general character |
| 1609 | conversion: |
| 1610 | |
| 1611 | @itemize @bullet |
| 1612 | @item |
| 1613 | For every conversion where neither the source nor the destination |
| 1614 | character set is the character set of the locale for the @code{LC_CTYPE} |
| 1615 | category, one has to change the @code{LC_CTYPE} locale using |
| 1616 | @code{setlocale}. |
| 1617 | |
| 1618 | Changing the @code{LC_CTYPE} locale introduces major problems for the rest |
| 1619 | of the program since several more functions (e.g., the character |
| 1620 | classification functions, @pxref{Classification of Characters}) use the |
| 1621 | @code{LC_CTYPE} category. |
| 1622 | |
| 1623 | @item |
| 1624 | Parallel conversions to and from different character sets are not |
| 1625 | possible since the @code{LC_CTYPE} selection is global and shared by all |
| 1626 | threads. |
| 1627 | |
| 1628 | @item |
| 1629 | If neither the source nor the destination character set is the character |
| 1630 | set used for @code{wchar_t} representation, there is at least a two-step |
| 1631 | process necessary to convert a text using the functions above. One would |
| 1632 | have to select the source character set as the multibyte encoding, |
| 1633 | convert the text into a @code{wchar_t} text, select the destination |
| 1634 | character set as the multibyte encoding, and convert the wide character |
| 1635 | text to the multibyte (@math{=} destination) character set. |
| 1636 | |
| 1637 | Even if this is possible (which is not guaranteed) it is very tedious |
| 1638 | work. Plus it suffers from the other two problems even more due to |
| 1639 | the constant changing of the locale. |
| 1640 | @end itemize |
| 1641 | |
| 1642 | The XPG2 standard defines a completely new set of functions, which has |
| 1643 | none of these limitations. They are not at all coupled to the selected |
| 1644 | locales, and they have no constraints on the character sets selected for |
| 1645 | source and destination. Only the set of available conversions limits |
| 1646 | them. The standard does not specify that any conversion at all must be |
| 1647 | available. Such availability is a measure of the quality of the |
| 1648 | implementation. |
| 1649 | |
| 1650 | In the following text, first the interface to @code{iconv} and then the |
| 1651 | conversion function will be described. Comparisons with other |
| 1652 | implementations will show what obstacles stand in the way of portable |
| 1653 | applications. Finally, the implementation is described in so far as might |
| 1654 | interest the advanced user who wants to extend conversion capabilities. |
| 1655 | |
| 1656 | @menu |
| 1657 | * Generic Conversion Interface:: Generic Character Set Conversion Interface. |
| 1658 | * iconv Examples:: A complete @code{iconv} example. |
| 1659 | * Other iconv Implementations:: Some Details about other @code{iconv} |
| 1660 | Implementations. |
| 1661 | * glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C |
| 1662 | library. |
| 1663 | @end menu |
| 1664 | |
| 1665 | @node Generic Conversion Interface |
| 1666 | @subsection Generic Character Set Conversion Interface |
| 1667 | |
| 1668 | This set of functions follows the traditional cycle of using a resource: |
| 1669 | open--use--close. The interface consists of three functions, each of |
| 1670 | which implements one step. |
| 1671 | |
| 1672 | Before the interfaces are described it is necessary to introduce a |
| 1673 | data type. Just like other open--use--close interfaces the functions |
| 1674 | introduced here work using handles and the @file{iconv.h} header |
| 1675 | defines a special type for the handles used. |
| 1676 | |
| 1677 | @comment iconv.h |
| 1678 | @comment XPG2 |
| 1679 | @deftp {Data Type} iconv_t |
| 1680 | This data type is an abstract type defined in @file{iconv.h}. The user |
| 1681 | must not assume anything about the definition of this type; it must be |
| 1682 | completely opaque. |
| 1683 | |
| 1684 | Objects of this type can get assigned handles for the conversions using |
| 1685 | the @code{iconv} functions. The objects themselves need not be freed, but |
| 1686 | the conversions that the handles stand for must be. |
| 1687 | @end deftp |
| 1688 | |
| 1689 | @noindent |
| 1690 | The first step is the function to create a handle. |
| 1691 | |
| 1692 | @comment iconv.h |
| 1693 | @comment XPG2 |
| 1694 | @deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode}) |
| 1695 | @safety{@prelim{}@mtsafe{@mtslocale{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} |
| 1696 | @c Calls malloc if tocode and/or fromcode are too big for alloca. Calls |
| 1697 | @c strip and upstr on both, then gconv_open. strip and upstr call |
| 1698 | @c isalnum_l and toupper_l with the C locale. gconv_open may MT-safely |
| 1699 | @c tokenize toset, replace unspecified codesets with the current locale |
| 1700 | @c (possibly two different accesses), and finally it calls |
| 1701 | @c gconv_find_transform and initializes the gconv_t result with all the |
| 1702 | @c steps in the conversion sequence, running each one's initializer, |
| 1703 | @c destructing and releasing them all if anything fails. |
| 1704 | |
| 1705 | The @code{iconv_open} function has to be used before starting a |
| 1706 | conversion. The two parameters this function takes determine the |
| 1707 | source and destination character set for the conversion, and if the |
| 1708 | implementation has the possibility to perform such a conversion, the |
| 1709 | function returns a handle. |
| 1710 | |
| 1711 | If the wanted conversion is not available, the @code{iconv_open} function |
| 1712 | returns @code{(iconv_t) -1}. In this case the global variable |
| 1713 | @code{errno} can have the following values: |
| 1714 | |
| 1715 | @table @code |
| 1716 | @item EMFILE |
| 1717 | The process already has @code{OPEN_MAX} file descriptors open. |
| 1718 | @item ENFILE |
| 1719 | The system limit of open files is reached. |
| 1720 | @item ENOMEM |
| 1721 | Not enough memory to carry out the operation. |
| 1722 | @item EINVAL |
| 1723 | The conversion from @var{fromcode} to @var{tocode} is not supported. |
| 1724 | @end table |
| 1725 | |
| 1726 | It is not possible to use the same descriptor in different threads to |
| 1727 | perform independent conversions. The data structures associated |
| 1728 | with the descriptor include information about the conversion state. |
| 1729 | This must not be messed up by using it in different conversions. |
| 1730 | |
| 1731 | An @code{iconv} descriptor is like a file descriptor in that a new |
| 1732 | descriptor must be created for every use. The descriptor does not stand for all |
| 1733 | of the conversions from @var{fromcode} to @var{tocode}. |
| 1734 | |
| 1735 | The @glibcadj{} implementation of @code{iconv_open} has one |
| 1736 | significant extension to other implementations. To ease the extension |
| 1737 | of the set of available conversions, the implementation allows storing |
| 1738 | the necessary files with data and code in an arbitrary number of |
| 1739 | directories. How this extension must be written will be explained below |
| 1740 | (@pxref{glibc iconv Implementation}). Here it is only important to say |
| 1741 | that all directories mentioned in the @code{GCONV_PATH} environment |
| 1742 | variable are considered only if they contain a file @file{gconv-modules}. |
| 1743 | These directories need not necessarily be created by the system |
| 1744 | administrator. In fact, this extension is introduced to help users |
| 1745 | writing and using their own, new conversions. Of course, this does not |
| 1746 | work for security reasons in SUID binaries; in this case only the system |
| 1747 | directory is considered and this normally is |
| 1748 | @file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment |
| 1749 | variable is examined exactly once at the first call of the |
| 1750 | @code{iconv_open} function. Later modifications of the variable have no |
| 1751 | effect. |
| 1752 | |
| 1753 | @pindex iconv.h |
| 1754 | The @code{iconv_open} function was introduced early in the X/Open |
| 1755 | Portability Guide, @w{version 2}. It is supported by all commercial |
| 1756 | Unices as it is required for the Unix branding. However, the quality and |
| 1757 | completeness of the implementation varies widely. The @code{iconv_open} |
| 1758 | function is declared in @file{iconv.h}. |
| 1759 | @end deftypefun |
| 1760 | |
| 1761 | The @code{iconv} implementation can associate large data structures with |
| 1762 | the handle returned by @code{iconv_open}. Therefore, it is crucial to |
| 1763 | free all the resources once all conversions are carried out and the |
| 1764 | conversion is not needed anymore. |
| 1765 | |
| 1766 | @comment iconv.h |
| 1767 | @comment XPG2 |
| 1768 | @deftypefun int iconv_close (iconv_t @var{cd}) |
| 1769 | @safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{}}} |
| 1770 | @c Calls gconv_close to destruct and release each of the conversion |
| 1771 | @c steps, release the gconv_t object, then call gconv_close_transform. |
| 1772 | @c Access to the gconv_t object is not guarded, but calling iconv_close |
| 1773 | @c concurrently with any other use is undefined. |
| 1774 | |
| 1775 | The @code{iconv_close} function frees all resources associated with the |
| 1776 | handle @var{cd}, which must have been returned by a successful call to |
| 1777 | the @code{iconv_open} function. |
| 1778 | |
| 1779 | If the function call was successful the return value is @math{0}. |
| 1780 | Otherwise it is @math{-1} and @code{errno} is set appropriately. |
| 1781 | Defined errors are: |
| 1782 | |
| 1783 | @table @code |
| 1784 | @item EBADF |
| 1785 | The conversion descriptor is invalid. |
| 1786 | @end table |
| 1787 | |
| 1788 | @pindex iconv.h |
| 1789 | The @code{iconv_close} function was introduced together with the rest |
| 1790 | of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}. |
| 1791 | @end deftypefun |
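The open and close steps can be sketched together as below. The helper name @code{conversion_available} and its three-way return value are assumptions of this illustration. The charset names passed in by a caller are likewise only examples, since the set of supported conversions depends on the installation.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: report whether a conversion from FROMCODE to
   TOCODE is available.  Returns 1 if it is, 0 if the conversion is
   not supported, and -1 for any other error.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd;

  assert (tocode != NULL && fromcode != NULL);

  cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    {
      if (errno == EINVAL)
        return 0;
      perror ("iconv_open");
      return -1;
    }

  /* Release the resources tied to the handle once it is unneeded.  */
  if (iconv_close (cd) != 0)
    {
      perror ("iconv_close");
      return -1;
    }
  return 1;
}
```

A call such as @code{conversion_available ("UCS-4LE", "UTF-8")} asks for a conversion from UTF-8 to UCS-4. Note the argument order: the destination comes first.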
| 1792 | |
| 1793 | The standard defines only one actual conversion function. This has, |
| 1794 | therefore, the most general interface: it allows conversion from one |
| 1795 | buffer to another. Conversion from a file to a buffer, vice versa, or |
| 1796 | even file to file can be implemented on top of it. |
| 1797 | |
| 1798 | @comment iconv.h |
| 1799 | @comment XPG2 |
| 1800 | @deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) |
| 1801 | @safety{@prelim{}@mtsafe{@mtsrace{:cd}}@assafe{}@acunsafe{@acucorrupt{}}} |
| 1802 | @c Without guarding access to the iconv_t object pointed to by cd, call |
| 1803 | @c the conversion function to convert inbuf or flush the internal |
| 1804 | @c conversion state. |
| 1805 | @cindex stateful |
| 1806 | The @code{iconv} function converts the text in the input buffer |
| 1807 | according to the rules associated with the descriptor @var{cd} and |
| 1808 | stores the result in the output buffer. It is possible to call the |
| 1809 | function for the same text several times in a row since for stateful |
| 1810 | character sets the necessary state information is kept in the data |
| 1811 | structures associated with the descriptor. |
| 1812 | |
| 1813 | The input buffer is specified by @code{*@var{inbuf}} and it contains |
| 1814 | @code{*@var{inbytesleft}} bytes. The extra indirection is necessary for |
| 1815 | communicating the used input back to the caller (see below). It is |
| 1816 | important to note that the buffer pointer is of type @code{char *} and the |
| 1817 | length is measured in bytes even if the input text is encoded in wide |
| 1818 | characters. |
| 1819 | |
| 1820 | The output buffer is specified in a similar way. @code{*@var{outbuf}} |
| 1821 | points to the beginning of the buffer with at least |
| 1822 | @code{*@var{outbytesleft}} bytes room for the result. The buffer |
| 1823 | pointer again is of type @code{char *} and the length is measured in
| 1824 | bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the |
| 1825 | conversion is performed but no output is available. |
| 1826 | |
| 1827 | If @var{inbuf} is a null pointer, the @code{iconv} function performs the |
| 1828 | necessary action to put the state of the conversion into the initial |
| 1829 | state. This is obviously a no-op for non-stateful encodings, but if the |
| 1830 | encoding has a state, such a function call might put some byte sequences |
| 1831 | in the output buffer, which perform the necessary state changes. The |
| 1832 | next call with @var{inbuf} not being a null pointer then simply goes on |
| 1833 | from the initial state. It is important that the programmer never makes |
| 1834 | any assumption as to whether the conversion has to deal with states. |
| 1835 | Even if the input and output character sets are not stateful, the |
| 1836 | implementation might still have to keep states. This is due to the |
| 1837 | implementation chosen for @theglibc{} as it is described below. |
| 1838 | Therefore an @code{iconv} call to reset the state should always be |
| 1839 | performed if some protocol requires this for the output text. |
| 1840 | |
| 1841 | The conversion stops for one of three reasons. The first is that all |
| 1842 | characters from the input buffer are converted. This actually can mean |
| 1843 | two things: either all bytes from the input buffer are consumed or |
| 1844 | there are some bytes at the end of the buffer that possibly can form a |
| 1845 | complete character but the input is incomplete. The second reason for a |
| 1846 | stop is that the output buffer is full. And the third reason is that |
| 1847 | the input contains invalid characters. |
| 1848 | |
| 1849 | In all of these cases the buffer pointers after the last successful |
| 1850 | conversion, for input and output buffer, are stored in @var{inbuf} and |
| 1851 | @var{outbuf}, and the available room in each buffer is stored in |
| 1852 | @var{inbytesleft} and @var{outbytesleft}. |
| 1853 | |
| 1854 | Since the character sets selected in the @code{iconv_open} call can be |
| 1855 | almost arbitrary, there can be situations where the input buffer contains |
| 1856 | valid characters, which have no identical representation in the output |
| 1857 | character set. The behavior in this situation is undefined. The |
| 1858 | @emph{current} behavior of @theglibc{} in this situation is to |
| 1859 | return with an error immediately. This certainly is not the most |
| 1860 | desirable solution; therefore, future versions will provide better ones, |
| 1861 | but they are not yet finished. |
| 1862 | |
| 1863 | If all input from the input buffer is successfully converted and stored |
| 1864 | in the output buffer, the function returns the number of non-reversible |
| 1865 | conversions performed. In all other cases the return value is |
| 1866 | @code{(size_t) -1} and @code{errno} is set appropriately. In such cases |
| 1867 | the value pointed to by @var{inbytesleft} is nonzero. |
| 1868 | |
| 1869 | @table @code |
| 1870 | @item EILSEQ |
| 1871 | The conversion stopped because of an invalid byte sequence in the input. |
| 1872 | After the call, @code{*@var{inbuf}} points at the first byte of the |
| 1873 | invalid byte sequence. |
| 1874 | |
| 1875 | @item E2BIG |
| 1876 | The conversion stopped because it ran out of space in the output buffer. |
| 1877 | |
| 1878 | @item EINVAL |
| 1879 | The conversion stopped because of an incomplete byte sequence at the end |
| 1880 | of the input buffer. |
| 1881 | |
| 1882 | @item EBADF |
| 1883 | The @var{cd} argument is invalid. |
| 1884 | @end table |
| 1885 | |
| 1886 | @pindex iconv.h |
| 1887 | The @code{iconv} function was introduced in the XPG2 standard and is |
| 1888 | declared in the @file{iconv.h} header. |
| 1889 | @end deftypefun |
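
As a rough illustration of the calling convention described above, the
following sketch converts one complete in-memory buffer and then asks
@code{iconv} to return to the initial state.  The helper name
@code{convert_all} is hypothetical and exists only for this example; it
is not part of the @code{iconv} interface.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert SRCLEN bytes at SRC with descriptor CD
   into DST, which has room for DSTLEN bytes.  Returns the number of
   bytes written, or (size_t) -1 with errno left as iconv set it.  */
size_t
convert_all (iconv_t cd, const char *src, size_t srclen,
             char *dst, size_t dstlen)
{
  /* iconv takes a non-const pointer but only reads the input.  */
  char *inptr = (char *) src;
  char *outptr = dst;
  size_t inleft = srclen;
  size_t outleft = dstlen;

  assert (dst != NULL);

  /* On return, inptr and outptr have advanced past what was
     processed; inleft and outleft hold what remains.  */
  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    return (size_t) -1;   /* EILSEQ, EINVAL, E2BIG or EBADF.  */

  /* A null input buffer requests a return to the initial state.  For
     stateful encodings this may write a shift sequence.  */
  if (iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    return (size_t) -1;

  return dstlen - outleft;
}
```

A caller would obtain @var{cd} from @code{iconv_open}, for instance with
the (assumed to be available) names @code{"UTF-16LE"} and
@code{"UTF-8"}, and release it with @code{iconv_close} afterwards.
Because the helper treats every failure alike, a real program would
normally inspect @code{errno} to tell a full output buffer from invalid
input.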
| 1890 | |
| 1891 | The definition of the @code{iconv} function is quite good overall. It |
| 1892 | provides quite flexible functionality. The only problems lie in the |
| 1893 | boundary cases, which are incomplete byte sequences at the end of the |
| 1894 | input buffer and invalid input. A third problem, which is not really |
| 1895 | a design problem, is the way conversions are selected. The standard |
| 1896 | does not say anything about the legitimate names or about a minimal set
| 1897 | of available conversions. We will see how this negatively impacts other
| 1898 | implementations, as demonstrated below. |
| 1899 | |
| 1900 | @node iconv Examples |
| 1901 | @subsection A complete @code{iconv} example |
| 1902 | |
| 1903 | The example below features a solution for a common problem. Given that |
| 1904 | one knows the internal encoding used by the system for @code{wchar_t} |
| 1905 | strings, one often is in the position to read text from a file and store |
| 1906 | it in wide character buffers. One can do this using @code{mbsrtowcs}, |
| 1907 | but then we run into the problems discussed above. |
| 1908 | |
| 1909 | @smallexample |
| 1910 | int
| 1911 | file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
| 1912 | @{
| 1913 |   char inbuf[BUFSIZ];
| 1914 |   size_t insize = 0;
| 1915 |   char *wrptr = (char *) outbuf;
| 1916 |   int result = 0;
| 1917 |   iconv_t cd;
| 1918 |
| 1919 |   cd = iconv_open ("WCHAR_T", charset);
| 1920 |   if (cd == (iconv_t) -1)
| 1921 |     @{
| 1922 |       /* @r{Something went wrong.} */
| 1923 |       if (errno == EINVAL)
| 1924 |         error (0, 0, "conversion from '%s' to wchar_t not available",
| 1925 |                charset);
| 1926 |       else
| 1927 |         perror ("iconv_open");
| 1928 |
| 1929 |       /* @r{Terminate the output string.} */
| 1930 |       *outbuf = L'\0';
| 1931 |
| 1932 |       return -1;
| 1933 |     @}
| 1934 |
| 1935 |   while (avail > 0)
| 1936 |     @{
| 1937 |       ssize_t nread;
| 1938 |       size_t nconv;
| 1939 |       char *inptr = inbuf;
| 1940 |
| 1941 |       /* @r{Read more input.} */
| 1942 |       nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
| 1943 |       if (nread <= 0)
| 1944 |         @{
| 1945 |           /* @r{When we come here the file is completely read}
| 1946 |              @r{(or reading failed). This still could mean there are}
| 1947 |              @r{unused characters in the @code{inbuf}. Put them back.} */
| 1948 |           if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
| 1949 |             result = -1;
| 1950 |
| 1951 |           /* @r{Now write out the byte sequence to get into the}
| 1952 |              @r{initial state if this is necessary.} */
| 1953 |           iconv (cd, NULL, NULL, &wrptr, &avail);
| 1954 |
| 1955 |           break;
| 1956 |         @}
| 1957 |       insize += nread;
| 1958 |
| 1959 |       /* @r{Do the conversion.} */
| 1960 |       nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
| 1961 |       if (nconv == (size_t) -1)
| 1962 |         @{
| 1963 |           /* @r{Not everything went right. It might only be}
| 1964 |              @r{an unfinished byte sequence at the end of the}
| 1965 |              @r{buffer. Or it is a real problem.} */
| 1966 |           if (errno == EINVAL)
| 1967 |             /* @r{This is harmless. Simply move the unused}
| 1968 |                @r{bytes to the beginning of the buffer so that}
| 1969 |                @r{they can be used in the next round.} */
| 1970 |             memmove (inbuf, inptr, insize);
| 1971 |           else
| 1972 |             @{
| 1973 |               /* @r{It is a real problem. Maybe we ran out of}
| 1974 |                  @r{space in the output buffer or we have invalid}
| 1975 |                  @r{input. In any case back the file pointer to}
| 1976 |                  @r{the position of the last processed byte.} */
| 1977 |               lseek (fd, -(off_t) insize, SEEK_CUR);
| 1978 |               result = -1;
| 1979 |               break;
| 1980 |             @}
| 1981 |         @}
| 1982 |     @}
| 1983 |
| 1984 |   /* @r{Terminate the output string.} */
| 1985 |   if (avail >= sizeof (wchar_t))
| 1986 |     *((wchar_t *) wrptr) = L'\0';
| 1987 |
| 1988 |   if (iconv_close (cd) != 0)
| 1989 |     perror ("iconv_close");
| 1990 |
| 1991 |   return (wchar_t *) wrptr - outbuf;
| 1992 | @}
| 1993 | @end smallexample |
| 1994 | |
| 1995 | @cindex stateful |
| 1996 | This example shows the most important aspects of using the @code{iconv} |
| 1997 | functions. It shows how successive calls to @code{iconv} can be used to |
| 1998 | convert large amounts of text. The user does not have to care about |
| 1999 | stateful encodings as the functions take care of everything. |
| 2000 | |
| 2001 | An interesting point is the case where @code{iconv} returns an error and |
| 2002 | @code{errno} is set to @code{EINVAL}. This is not really an error in the |
| 2003 | transformation. It can happen whenever the input character set contains |
| 2004 | byte sequences of more than one byte for some character and texts are not |
| 2005 | processed in one piece. In this case there is a chance that a multibyte |
| 2006 | sequence is cut. The caller can then simply read more of the input, feed
| 2007 | the offending bytes together with the newly read bytes to
| 2008 | @code{iconv}, and continue the work. The internal state kept in
| 2009 | the descriptor is @emph{not} unspecified after such an event as is the |
| 2010 | case with the conversion functions from the @w{ISO C} standard. |
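
The handling of @code{EINVAL} can also be shown in isolation.  The
following is a minimal sketch, assuming the text arrives in chunks of a
bounded size; the type @code{struct chunk_conv}, the function
@code{convert_chunk}, and both size limits are hypothetical names chosen
for this illustration.  The bytes of a sequence that was cut at the end
of one chunk are saved and put in front of the next chunk.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

enum { CHUNK_MAX = 256, PENDING_MAX = 16 };

/* Hypothetical wrapper that remembers the bytes of a multibyte
   sequence cut off at the end of a chunk.  */
struct chunk_conv
{
  iconv_t cd;
  char pending[PENDING_MAX];
  size_t npending;
};

/* Convert one chunk of at most CHUNK_MAX bytes.  Returns 0 on success
   (including the harmless EINVAL case) and -1 on a real error.  */
int
convert_chunk (struct chunk_conv *cc, const char *chunk, size_t len,
               char **outptr, size_t *outleft)
{
  char buf[PENDING_MAX + CHUNK_MAX];
  char *inptr = buf;
  size_t inleft;

  assert (len <= CHUNK_MAX);

  /* Put the saved tail in front of the new bytes.  */
  memcpy (buf, cc->pending, cc->npending);
  memcpy (buf + cc->npending, chunk, len);
  inleft = cc->npending + len;
  cc->npending = 0;

  if (iconv (cc->cd, &inptr, &inleft, outptr, outleft) == (size_t) -1)
    {
      if (errno != EINVAL || inleft > PENDING_MAX)
        return -1;
      /* Incomplete sequence at the end: keep it for the next call.  */
      memcpy (cc->pending, inptr, inleft);
      cc->npending = inleft;
    }
  return 0;
}
```

If @code{npending} is still nonzero after the last chunk, the input
ended in the middle of a character; what to do then is up to the
application.  This sketch also gives up when the output buffer is full,
where a real program would make room and retry.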
| 2011 | |
| 2012 | The example also shows the problem of using wide character strings with |
| 2013 | @code{iconv}. As explained in the description of the @code{iconv} |
| 2014 | function above, the function always takes a pointer to a @code{char} |
| 2015 | array and the available space is measured in bytes. In the example, the |
| 2016 | output buffer is a wide character buffer; therefore, we use a local |
| 2017 | variable @var{wrptr} of type @code{char *}, which is used in the |
| 2018 | @code{iconv} calls. |
| 2019 | |
| 2020 | This looks rather innocent but can lead to problems on platforms that |
| 2021 | have tight restrictions on alignment. Therefore the caller of @code{iconv}
| 2022 | has to make sure that the pointers passed are suitable for access of |
| 2023 | characters from the appropriate character set. Since, in the |
| 2024 | above case, the input parameter to the function is a @code{wchar_t} |
| 2025 | pointer, this is the case (unless the user violates alignment when |
| 2026 | computing the parameter). But in other situations, especially when |
| 2027 | writing generic functions where one does not know what type of character |
| 2028 | set one uses and, therefore, treats text as a sequence of bytes, it might |
| 2029 | become tricky. |
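
The safe pattern can be reduced to a few lines.  In this sketch, which
uses the hypothetical helper name @code{to_wide}, the output storage is
declared by the caller as a @code{wchar_t} array, so the @code{char *}
alias handed to @code{iconv} is suitably aligned, and the element count
is converted to a byte count before the call.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: convert SRCLEN bytes of CHARSET text into the
   wide character array DST, which has room for DSTCOUNT elements.
   Returns the number of wide characters stored, or (size_t) -1.  */
size_t
to_wide (const char *charset, const char *src, size_t srclen,
         wchar_t *dst, size_t dstcount)
{
  iconv_t cd = iconv_open ("WCHAR_T", charset);
  /* DST is a wchar_t array, so this alias is suitably aligned.  */
  char *outptr = (char *) dst;
  /* iconv counts bytes, not array elements.  */
  size_t outleft = dstcount * sizeof (wchar_t);
  char *inptr = (char *) src;
  size_t inleft = srclen;
  size_t rc;

  assert (dst != NULL);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  /* outptr advanced by whole wide characters, so casting back is safe.  */
  return (size_t) ((wchar_t *) outptr - dst);
}
```

Passing a pointer into the middle of an arbitrary byte buffer as
@var{dst} would defeat the purpose; that is exactly the situation the
paragraph above warns about.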
| 2030 | |
| 2031 | @node Other iconv Implementations |
| 2032 | @subsection Some Details about other @code{iconv} Implementations |
| 2033 | |
| 2034 | This is not really the place to discuss the @code{iconv} implementation |
| 2035 | of other systems but it is necessary to know a bit about them to write |
| 2036 | portable programs. The above-mentioned problems with the specification
| 2037 | of the @code{iconv} functions can lead to portability issues. |
| 2038 | |
| 2039 | The first thing to notice is that, due to the large number of character |
| 2040 | sets in use, it is certainly not practical to encode the conversions |
| 2041 | directly in the C library. Therefore, the conversion information must |
| 2042 | come from files outside the C library. This is usually done in one or |
| 2043 | both of the following ways: |
| 2044 | |
| 2045 | @itemize @bullet |
| 2046 | @item |
| 2047 | The C library contains a set of generic conversion functions that can |
| 2048 | read the needed conversion tables and other information from data files. |
| 2049 | These files get loaded when necessary. |
| 2050 | |
| 2051 | This solution is problematic as it requires a great deal of effort to |
| 2052 | apply to all character sets (potentially an infinite set). The |
| 2053 | differences in the structure of the different character sets are so large
| 2054 | that many different variants of the table-processing functions must be
| 2055 | developed. In addition, the generic nature of these functions makes them
| 2056 | slower than specifically implemented functions. |
| 2057 | |
| 2058 | @item |
| 2059 | The C library only contains a framework that can dynamically load |
| 2060 | object files and execute the conversion functions contained therein. |
| 2061 | |
| 2062 | This solution provides much more flexibility. The C library itself |
| 2063 | contains only very little code and therefore reduces the general memory |
| 2064 | footprint. Also, with a documented interface between the C library and |
| 2065 | the loadable modules it is possible for third parties to extend the set |
| 2066 | of available conversion modules. A drawback of this solution is that |
| 2067 | dynamic loading must be available. |
| 2068 | @end itemize |
| 2069 | |
| 2070 | Some implementations in commercial Unices implement a mixture of these |
| 2071 | possibilities; the majority implement only the second solution. Using |
| 2072 | loadable modules moves the code out of the library itself and keeps |
| 2073 | the door open for extensions and improvements, but this design is also |
| 2074 | limiting on some platforms since not many platforms support dynamic |
| 2075 | loading in statically linked programs. On platforms without this |
| 2076 | capability it is therefore not possible to use this interface in |
| 2077 | statically linked programs. @Theglibc{} has, on ELF platforms, no |
| 2078 | problems with dynamic loading in these situations; therefore, this |
| 2079 | point is moot. The danger is that one gets acquainted with this |
| 2080 | situation and forgets about the restrictions on other systems. |
| 2081 | |
| 2082 | A second thing to know about other @code{iconv} implementations is that |
| 2083 | the number of available conversions is often very limited. Some |
| 2084 | implementations provide, in the standard release (not special |
| 2085 | international or developer releases), at most 100 to 200 conversion |
| 2086 | possibilities. This does not mean 200 different character sets are |
| 2087 | supported; for example, conversions from one character set to a set of 10 |
| 2088 | others might count as 10 conversions. Together with the other direction |
| 2089 | this makes 20 conversion possibilities used up by one character set. One |
| 2090 | can imagine the thin coverage these platforms provide. Some Unix vendors
| 2091 | even provide only a handful of conversions, which renders them useless for |
| 2092 | almost all uses. |
| 2093 | |
| 2094 | This directly leads to a third and probably the most problematic point. |
| 2095 | Given the way the @code{iconv} conversion functions are implemented on
| 2096 | all known Unix systems, the availability of the conversion from
| 2097 | character set @math{@cal{A}} to @math{@cal{B}} and of the conversion from
| 2098 | @math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
| 2099 | conversion from @math{@cal{A}} to @math{@cal{C}} is available.
| 2100 | |
| 2101 | This might not seem unreasonable or problematic at first, but it is
| 2102 | quite a big problem, as one will notice shortly after hitting it. To show
| 2103 | the problem, assume we write a program that has to convert from
| 2104 | @math{@cal{A}} to @math{@cal{C}}. A call like |
| 2105 | |
| 2106 | @smallexample |
| 2107 | cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}"); |
| 2108 | @end smallexample |
| 2109 | |
| 2110 | @noindent |
| 2111 | fails according to the assumption above. But what does the program |
| 2112 | do now? The conversion is necessary; therefore, simply giving up is not |
| 2113 | an option. |
| 2114 | |
| 2115 | This is a nuisance. The @code{iconv} function should take care of this. |
| 2116 | But how should the program proceed from here on? If it tries to convert |
| 2117 | to character set @math{@cal{B}} first, the two @code{iconv_open}
| 2118 | calls |
| 2119 | |
| 2120 | @smallexample |
| 2121 | cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}"); |
| 2122 | @end smallexample |
| 2123 | |
| 2124 | @noindent |
| 2125 | and |
| 2126 | |
| 2127 | @smallexample |
| 2128 | cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}"); |
| 2129 | @end smallexample |
| 2130 | |
| 2131 | @noindent |
| 2132 | will succeed, but how to find @math{@cal{B}}? |
| 2133 | |
| 2134 | Unfortunately, the answer is: there is no general solution. On some |
| 2135 | systems guessing might help. On those systems most character sets can |
| 2136 | convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides
| 2137 | this, only some very system-specific methods can help. Since the
| 2138 | conversion functions come from loadable modules and these modules must
| 2139 | be stored somewhere in the filesystem, one @emph{could} try to find them
| 2140 | and determine from the available files which conversions are available
| 2141 | and whether there is an indirect route from @math{@cal{A}} to |
| 2142 | @math{@cal{C}}. |
| 2143 | |
| 2144 | This example shows one of the design errors of @code{iconv} mentioned |
| 2145 | above. It should at least be possible to determine the list of available |
| 2146 | conversions programmatically so that if @code{iconv_open} says there is no
| 2147 | such conversion, one could make sure this also is true for indirect |
| 2148 | routes. |
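
The guessing strategy mentioned above might look as follows.  This is
only a sketch: @code{struct route} and @code{open_route} are
hypothetical names, and whether a given pivot character set (UTF-8 is
the usual guess) is actually available is something the program can
only find out by trying.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical pair of descriptors.  STEP2 is (iconv_t) -1 when the
   direct conversion was available and no second step is needed.  */
struct route
{
  iconv_t step1;
  iconv_t step2;
};

/* Try FROM -> TO directly, then FROM -> PIVOT -> TO.  Returns 0 on
   success and -1 if neither route can be opened.  */
int
open_route (struct route *r, const char *to, const char *from,
            const char *pivot)
{
  assert (r != NULL);
  r->step2 = (iconv_t) -1;

  r->step1 = iconv_open (to, from);
  if (r->step1 != (iconv_t) -1)
    return 0;

  /* The direct conversion is missing; guess an intermediate set.  */
  r->step1 = iconv_open (pivot, from);
  if (r->step1 == (iconv_t) -1)
    return -1;

  r->step2 = iconv_open (to, pivot);
  if (r->step2 == (iconv_t) -1)
    {
      iconv_close (r->step1);
      r->step1 = (iconv_t) -1;
      return -1;
    }
  return 0;
}
```

When two steps are in use, the caller has to run the output of the
first descriptor through the second one and close both descriptors at
the end.  A failure of @code{open_route} still does not prove that no
indirect route exists, only that this particular guess did not work.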
| 2149 | |
| 2150 | @node glibc iconv Implementation |
| 2151 | @subsection The @code{iconv} Implementation in @theglibc{} |
| 2152 | |
| 2153 | After reading about the problems of @code{iconv} implementations in the |
| 2154 | last section it is certainly good to note that the implementation in |
| 2155 | @theglibc{} has none of the problems mentioned above. What |
| 2156 | follows is a step-by-step analysis of the points raised above. The |
| 2157 | evaluation is based on the current state of the development (as of |
| 2158 | January 1999). The development of the @code{iconv} functions is not |
| 2159 | complete, but basic functionality has solidified. |
| 2160 | |
| 2161 | @Theglibc{}'s @code{iconv} implementation uses shared loadable |
| 2162 | modules to implement the conversions. A very small number of |
| 2163 | conversions are built into the library itself but these are only rather |
| 2164 | trivial conversions. |
| 2165 | |
| 2166 | All the benefits of loadable modules are available in the @glibcadj{} |
| 2167 | implementation. This is especially appealing since the interface is |
| 2168 | well documented (see below), and it, therefore, is easy to write new |
| 2169 | conversion modules. The drawback of using loadable objects is not a |
| 2170 | problem in @theglibc{}, at least on ELF systems. Since the |
| 2171 | library is able to load shared objects even in statically linked |
| 2172 | binaries, static linking need not be forbidden in case one wants to use |
| 2173 | @code{iconv}. |
| 2174 | |
| 2175 | The second mentioned problem is the number of supported conversions. |
| 2176 | Currently, @theglibc{} supports more than 150 character sets. The |
| 2177 | way the implementation is designed, the number of supported conversions
| 2178 | is greater than 22350 (@math{150} times @math{149}). If any conversion |
| 2179 | from or to a character set is missing, it can be added easily. |
| 2180 | |
| 2181 | Impressive as it may be, this high number is due to the
| 2182 | fact that the @glibcadj{} implementation of @code{iconv} does not have |
| 2183 | the third problem mentioned above (i.e., whenever there is a conversion |
| 2184 | from a character set @math{@cal{A}} to @math{@cal{B}} and from |
| 2185 | @math{@cal{B}} to @math{@cal{C}} it is always possible to convert from |
| 2186 | @math{@cal{A}} to @math{@cal{C}} directly). If @code{iconv_open}
| 2187 | returns an error and sets @code{errno} to @code{EINVAL}, there is no |
| 2188 | known way, directly or indirectly, to perform the wanted conversion. |
| 2189 | |
| 2190 | @cindex triangulation |
| 2191 | Triangulation is achieved by providing for each character set a |
| 2192 | conversion from and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} |
| 2193 | as an intermediate representation it is possible to @dfn{triangulate} |
| 2194 | (i.e., convert with an intermediate representation). |
| 2195 | |
| 2196 | There is no inherent requirement to provide a conversion to @w{ISO |
| 2197 | 10646} for a new character set, and it is also possible to provide other |
| 2198 | conversions where neither source nor destination character set is @w{ISO |
| 2199 | 10646}. The existing set of conversions is simply meant to cover all |
| 2200 | conversions that might be of interest. |
| 2201 | |
| 2202 | @cindex ISO-2022-JP |
| 2203 | @cindex EUC-JP |
| 2204 | All currently available conversions use the triangulation method above, |
| 2205 | making some conversions run unnecessarily slowly. If, for example, somebody
| 2206 | often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
| 2207 | would involve direct conversion between the two character sets, skipping
| 2208 | the intermediate conversion to @w{ISO 10646}. The two character sets of interest
| 2209 | are much more similar to each other than to @w{ISO 10646}. |
| 2210 | |
| 2211 | In such a situation one easily can write a new conversion and provide it |
| 2212 | as a better alternative. The @glibcadj{} @code{iconv} implementation |
| 2213 | would automatically use the module implementing the conversion if it is |
| 2214 | specified to be more efficient. |
| 2215 | |
| 2216 | @subsubsection Format of @file{gconv-modules} files |
| 2217 | |
| 2218 | All information about the available conversions comes from a file named |
| 2219 | @file{gconv-modules}, which can be found in any of the directories along |
| 2220 | the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented |
| 2221 | text files, where each of the lines has one of the following formats: |
| 2222 | |
| 2223 | @itemize @bullet |
| 2224 | @item |
| 2225 | If the first non-whitespace character is a @kbd{#} the line contains only |
| 2226 | comments and is ignored. |
| 2227 | |
| 2228 | @item |
| 2229 | Lines starting with @code{alias} define an alias name for a character |
| 2230 | set. Two more words are expected on the line. The first word |
| 2231 | defines the alias name, and the second defines the original name of the |
| 2232 | character set. The effect is that it is possible to use the alias name |
| 2233 | in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and |
| 2234 | achieve the same result as when using the real character set name. |
| 2235 | |
| 2236 | This is quite important as a character set often has many different
| 2237 | names. There is normally an official name but this need not correspond to
| 2238 | the most popular name. Besides this, many character sets have special
| 2239 | names that are somehow constructed. For example, all character sets |
| 2240 | specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}} |
| 2241 | where @var{nnn} is the registration number. This allows programs that |
| 2242 | know about the registration number to construct character set names and |
| 2243 | use them in @code{iconv_open} calls. More on the available names and |
| 2244 | aliases follows below. |
| 2245 | |
| 2246 | @item |
| 2247 | Lines starting with @code{module} introduce an available conversion |
| 2248 | module. These lines must contain three or four more words. |
| 2249 | |
| 2250 | The first word specifies the source character set, the second word the |
| 2251 | destination character set of conversion implemented in this module, and |
| 2252 | the third word is the name of the loadable module. The filename is |
| 2253 | constructed by appending the usual shared object suffix (normally |
| 2254 | @file{.so}) and this file is then supposed to be found in the same |
| 2255 | directory the @file{gconv-modules} file is in. The last word on the line, |
| 2256 | which is optional, is a numeric value representing the cost of the |
| 2257 | conversion. If this word is missing, a cost of @math{1} is assumed. The |
| 2258 | numeric value itself does not matter that much; what counts are the |
| 2259 | relative values of the sums of costs for all possible conversion paths. |
| 2260 | Below is a more precise description of the use of the cost value. |
| 2261 | @end itemize |
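
To make the three line formats concrete, here is a small classifier for
a single line.  It is a sketch written for this description and not the
parser used inside @theglibc{}; the names @code{struct gconv_line} and
@code{parse_gconv_line}, the fixed field sizes, and the decision to
ignore empty lines are all assumptions of the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

/* Hypothetical record for one parsed line.  */
struct gconv_line
{
  enum line_kind kind;
  char from[64];   /* alias name, or source character set */
  char to[64];     /* original name, or destination character set */
  char file[64];   /* module name without the shared object suffix */
  int cost;        /* only meaningful for module lines */
};

enum line_kind
parse_gconv_line (const char *line, struct gconv_line *out)
{
  char keyword[16];
  int n;

  assert (out != NULL);
  memset (out, 0, sizeof *out);
  out->kind = LINE_IGNORED;

  /* Skip leading whitespace; comments and empty lines are ignored.  */
  line += strspn (line, " \t");
  if (*line == '\0' || *line == '\n' || *line == '#')
    return out->kind;

  out->cost = 1;   /* default when the optional cost word is missing */
  n = sscanf (line, "%15s %63s %63s %63s %d",
              keyword, out->from, out->to, out->file, &out->cost);

  if (strcmp (keyword, "alias") == 0 && n == 3)
    out->kind = LINE_ALIAS;
  else if (strcmp (keyword, "module") == 0 && n >= 4)
    out->kind = LINE_MODULE;
  else
    out->kind = LINE_INVALID;
  return out->kind;
}
```

For a @code{module} line without a fourth word the @code{cost} member
keeps its default of @math{1}, matching the rule given above.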
| 2262 | |
| 2263 | Let us return to the example above, where one has written a module to directly
| 2264 | convert from ISO-2022-JP to EUC-JP and back. All that has to be done is
| 2265 | to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a directory
| 2266 | and add a file @file{gconv-modules} with the following content in the |
| 2267 | same directory: |
| 2268 | |
| 2269 | @smallexample |
| 2270 | module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1 |
| 2271 | module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1 |
| 2272 | @end smallexample |
| 2273 | |
| 2274 | To see why this is sufficient, it is necessary to understand how the |
| 2275 | conversion used by @code{iconv} (and described in the descriptor) is |
| 2276 | selected. The approach to this problem is quite simple. |
| 2277 | |
| 2278 | At the first call of the @code{iconv_open} function the program reads |
| 2279 | all available @file{gconv-modules} files and builds up two tables: one |
| 2280 | containing all the known aliases and another that contains the |
| 2281 | information about the conversions and which shared object implements |
| 2282 | them. |
| 2283 | |
| 2284 | @subsubsection Finding the conversion path in @code{iconv} |
| 2285 | |
| 2286 | The set of available conversions forms a directed graph with weighted
| 2287 | edges. The weights on the edges are the costs specified in the
| 2288 | @file{gconv-modules} files. The @code{iconv_open} function uses an
| 2289 | algorithm suitable for searching for the best path in such a graph and so
| 2290 | constructs a list of conversions that must be performed in succession |
| 2291 | to get the transformation from the source to the destination character |
| 2292 | set. |
| 2293 | |
| 2294 | Explaining why the above @file{gconv-modules} file allows the
| 2295 | @code{iconv} implementation to resolve the specific ISO-2022-JP to |
| 2296 | EUC-JP conversion module instead of the conversion coming with the |
| 2297 | library itself is straightforward. Since the latter conversion takes two |
| 2298 | steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to |
| 2299 | EUC-JP), the cost is @math{1+1 = 2}. The above @file{gconv-modules} |
| 2300 | file, however, specifies that the new conversion module can perform this
| 2301 | conversion with only the cost of @math{1}. |
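
The cost comparison can be reproduced with any shortest-path search.
The sketch below uses a plain edge-relaxation loop (Bellman-Ford style)
over a tiny graph; it is meant to illustrate the idea of summing costs
along a path and is not the algorithm @code{iconv_open} actually uses.
The names @code{struct edge} and @code{cheapest_path} and the limit of
16 nodes are assumptions of the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One available conversion: an edge from one character set to
   another, with the cost taken from a module line.  */
struct edge
{
  int from;
  int to;
  int cost;
};

/* Cheapest total cost from SRC to DST over NEDGES directed edges
   among NNODES nodes (at most 16), or -1 if DST is unreachable.  */
int
cheapest_path (const struct edge *edges, size_t nedges, int nnodes,
               int src, int dst)
{
  int dist[16];
  int i;
  size_t e;

  assert (nnodes > 0 && nnodes <= 16);
  assert (src >= 0 && src < nnodes && dst >= 0 && dst < nnodes);

  for (i = 0; i < nnodes; i++)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge nnodes - 1 times.  */
  for (i = 1; i < nnodes; i++)
    for (e = 0; e < nedges; e++)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

With three nodes standing for ISO-2022-JP, the intermediate
representation, and EUC-JP, the two stock edges of cost @math{1} each
give a total of @math{2}; adding a direct edge of cost @math{1} makes
the direct route the cheaper one, which is why the new module is
preferred.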
| 2302 | |
| 2303 | A mysterious item about the @file{gconv-modules} file above (and also |
| 2304 | the file coming with @theglibc{}) is the names of the character
| 2305 | sets specified in the @code{module} lines. Why do almost all the names |
| 2306 | end in @code{//}? And this is not all: the names can actually be |
| 2307 | regular expressions. At this point in time this mystery should not be |
| 2308 | revealed, unless you have the relevant spell-casting materials: ashes |
| 2309 | from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix |
| 2310 | blessed by St.@: Emacs, assorted herbal roots from Central America, sand |
| 2311 | from Cebu, etc. Sorry! @strong{The part of the implementation where |
| 2312 | this is used is not yet finished. For now please simply follow the |
| 2313 | existing examples. It'll become clearer once it is. --drepper} |
| 2314 | |
| 2315 | A last remark about the @file{gconv-modules} file is about the names not
| 2316 | ending with @code{//}. A character set named @code{INTERNAL} is often |
| 2317 | mentioned. From the discussion above and the chosen name it should have |
| 2318 | become clear that this is the name for the representation used in the |
| 2319 | intermediate step of the triangulation. We have said that this is UCS-4 |
| 2320 | but actually that is not quite right. The UCS-4 specification also |
| 2321 | includes the specification of the byte ordering used. Since a UCS-4 value |
| 2322 | consists of four bytes, a stored value is affected by byte ordering. The |
| 2323 | internal representation is @emph{not} the same as UCS-4 in case the byte |
| 2324 | ordering of the processor (or at least the running process) is not the |
| 2325 | same as the one required for UCS-4. This is done for performance reasons |
| 2326 | as one does not want to perform unnecessary byte-swapping operations if |
| 2327 | one is not interested in actually seeing the result in UCS-4. To avoid |
| 2328 | trouble with endianness, the internal representation consistently is named |
| 2329 | @code{INTERNAL} even on big-endian systems where the representations are |
| 2330 | identical. |
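
The difference between the two representations is only the order of the
four bytes in memory.  The following sketch stores the same value once
in the big-endian order that the text above associates with UCS-4 and
once in the byte order of the host; the function names are hypothetical
and the code is independent of the actual @code{INTERNAL}
implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store code point CP with the most significant byte first.  */
void
store_ucs4 (uint32_t cp, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) (cp >> 24);
  out[1] = (unsigned char) (cp >> 16);
  out[2] = (unsigned char) (cp >> 8);
  out[3] = (unsigned char) cp;
}

/* Store CP in whatever byte order the host uses.  No swapping is
   needed, which is the performance argument made above.  */
void
store_host_order (uint32_t cp, unsigned char out[4])
{
  assert (out != NULL);
  memcpy (out, &cp, 4);
}
```

On a big-endian host both functions fill the buffer identically; on a
little-endian host the bytes come out in reverse order.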
| 2331 | |
| 2332 | @subsubsection @code{iconv} module data structures |
| 2333 | |
| 2334 | So far this section has described how modules are located and considered |
| 2335 | to be used. What remains to be described is the interface of the modules |
| 2336 | so that one can write new ones. This section describes the interface as |
| 2337 | it is in use in January 1999. The interface will change a bit in the |
| 2338 | future but, with luck, only in an upwardly compatible way. |
| 2339 | |
| 2340 | The definitions necessary to write new modules are publicly available |
| 2341 | in the non-standard header @file{gconv.h}. The following text, |
| 2342 | therefore, describes the definitions from this header file. First, |
| 2343 | however, it is necessary to get an overview. |
| 2344 | |
| 2345 | From the perspective of the user of @code{iconv} the interface is quite |
| 2346 | simple: the @code{iconv_open} function returns a handle that can be used |
| 2347 | in calls to @code{iconv}, and finally the handle is freed with a call to |
| 2348 | @code{iconv_close}. The problem is that the handle has to be able to |
| 2349 | represent the possibly long sequences of conversion steps and also the |
| 2350 | state of each conversion since the handle is all that is passed to the |
| 2351 | @code{iconv} function. Therefore, the data structures are really the |
| 2352 | elements necessary to understand the implementation.
| 2353 | |
| 2354 | We need two different kinds of data structures. The first describes the |
| 2355 | conversion and the second describes the state etc. There are really two |
| 2356 | type definitions like this in @file{gconv.h}. |
| 2357 | @pindex gconv.h |
| 2358 | |
| 2359 | @comment gconv.h |
| 2360 | @comment GNU |
| 2361 | @deftp {Data type} {struct __gconv_step} |
| 2362 | This data structure describes one conversion a module can perform. For |
| 2363 | each function in a loaded module with conversion functions there is |
| 2364 | exactly one object of this type. This object is shared by all users of |
| 2365 | the conversion (i.e., this object does not contain any information |
| 2366 | corresponding to an actual conversion; it only describes the conversion |
| 2367 | itself). |
| 2368 | |
| 2369 | @table @code |
| 2370 | @item struct __gconv_loaded_object *__shlib_handle |
| 2371 | @itemx const char *__modname |
| 2372 | @itemx int __counter |
| 2373 | All these elements of the structure are used internally in the C library |
| 2374 | to coordinate loading and unloading the shared object. One must not expect any
| 2375 | of the other elements to be available or initialized. |
| 2376 | |
| 2377 | @item const char *__from_name |
| 2378 | @itemx const char *__to_name |
| 2379 | @code{__from_name} and @code{__to_name} contain the names of the source and |
| 2380 | destination character sets. They can be used to identify the actual |
| 2381 | conversion to be carried out since one module might implement conversions |
| 2382 | for more than one character set and/or direction. |
| 2383 | |
| 2384 | @item __gconv_fct __fct
| 2385 | @itemx __gconv_init_fct __init_fct
| 2386 | @itemx __gconv_end_fct __end_fct
| 2387 | These elements contain pointers to the functions in the loadable module. |
| 2388 | The interface will be explained below. |
| 2389 | |
| 2390 | @item int __min_needed_from |
| 2391 | @itemx int __max_needed_from |
| 2392 | @itemx int __min_needed_to |
| 2393 | @itemx int __max_needed_to
| 2394 | These values have to be supplied in the init function of the module. The |
| 2395 | @code{__min_needed_from} value specifies how many bytes a character of |
| 2396 | the source character set at least needs. The @code{__max_needed_from} |
| 2397 | specifies the maximum value that also includes possible shift sequences. |
| 2398 | |
| 2399 | The @code{__min_needed_to} and @code{__max_needed_to} values serve the |
| 2400 | same purpose as @code{__min_needed_from} and @code{__max_needed_from} but |
| 2401 | this time for the destination character set. |
| 2402 | |
| 2403 | It is crucial that these values be accurate since otherwise the |
| 2404 | conversion functions will have problems or not work at all. |
| 2405 | |
| 2406 | @item int __stateful |
| 2407 | This element must also be initialized by the init function. |
| 2408 | @code{__stateful} is nonzero if the source character set is stateful.
| 2409 | Otherwise it is zero. |
| 2410 | |
| 2411 | @item void *__data |
| 2412 | This element can be used freely by the conversion functions in the |
| 2413 | module. @code{__data} can be used to communicate extra information
| 2414 | from one call to another. @code{__data} need not be initialized if
| 2415 | not needed at all. If the @code{__data} element is assigned a pointer
| 2416 | to dynamically allocated memory (presumably in the init function) it has |
| 2417 | to be made sure that the end function deallocates the memory. Otherwise |
| 2418 | the application will leak memory. |
| 2419 | |
| 2420 | It is important to be aware that this data structure is shared by all |
| 2421 | users of this specific conversion and therefore the @code{__data}
| 2422 | element must not contain data specific to one specific use of the |
| 2423 | conversion function. |
| 2424 | @end table |
| 2425 | @end deftp |
| 2426 | |
| 2427 | @comment gconv.h |
| 2428 | @comment GNU |
| 2429 | @deftp {Data type} {struct __gconv_step_data} |
| 2430 | This is the data structure that contains the information specific to |
| 2431 | each use of the conversion functions. |
| 2432 | |
| 2433 | |
| 2434 | @table @code |
| 2435 | @item char *__outbuf |
| 2436 | @itemx char *__outbufend |
| 2437 | These elements specify the output buffer for the conversion step. The |
| 2438 | @code{__outbuf} element points to the beginning of the buffer, and |
| 2439 | @code{__outbufend} points to the byte following the last byte in the |
| 2440 | buffer. The conversion function must not assume anything about the size |
| 2441 | of the buffer but it can be safely assumed that there is room for at
| 2442 | least one complete character in the output buffer. |
| 2443 | |
| 2444 | Once the conversion is finished, if the conversion is the last step, the |
| 2445 | @code{__outbuf} element must be modified to point after the last byte |
| 2446 | written into the buffer to signal how much output is available. If this |
| 2447 | conversion step is not the last one, the element must not be modified. |
| 2448 | The @code{__outbufend} element must not be modified. |
| 2449 | |
| 2450 | @item int __is_last |
| 2451 | This element is nonzero if this conversion step is the last one. This |
| 2452 | information is necessary for the recursion. See the description of the |
| 2453 | conversion function internals below. This element must never be |
| 2454 | modified. |
| 2455 | |
| 2456 | @item int __invocation_counter |
| 2457 | The conversion function can use this element to see how many calls of |
| 2458 | the conversion function already happened. Some character sets require a |
| 2459 | certain prolog when generating output, and by comparing this value with |
| 2460 | zero, one can find out whether it is the first call and whether, |
| 2461 | therefore, the prolog should be emitted. This element must never be |
| 2462 | modified. |
| 2463 | |
| 2464 | @item int __internal_use |
| 2465 | This element is another one rarely used but needed in certain |
| 2466 | situations. It is assigned a nonzero value in case the conversion |
| 2467 | functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the
| 2468 | function is not used directly through the @code{iconv} interface). |
| 2469 | |
| 2470 | This sometimes makes a difference as it is expected that the |
| 2471 | @code{iconv} functions are used to translate entire texts while the |
| 2472 | @code{mbsrtowcs} functions are normally used only to convert single |
| 2473 | strings and might be used multiple times to convert entire texts. |
| 2474 | |
| 2475 | But in this situation we would have a problem complying with some rules of
| 2476 | the character set specification. Some character sets require a prolog, |
| 2477 | which must appear exactly once for an entire text. If a number of |
| 2478 | @code{mbsrtowcs} calls are used to convert the text, only the first call |
| 2479 | must add the prolog. However, because there is no communication between the |
| 2480 | different calls of @code{mbsrtowcs}, the conversion functions have no |
| 2481 | possibility to find this out. The situation is different for sequences |
| 2482 | of @code{iconv} calls since the handle allows access to the needed |
| 2483 | information. |
| 2484 | |
| 2485 | The @code{__internal_use} element is mostly used together with
| 2486 | @code{__invocation_counter} as follows: |
| 2487 | |
| 2488 | @smallexample |
| 2489 | if (!data->__internal_use |
| 2490 | && data->__invocation_counter == 0) |
| 2491 | /* @r{Emit prolog.} */ |
| 2492 | @dots{} |
| 2493 | @end smallexample |
| 2494 | |
| 2495 | This element must never be modified. |
| 2496 | |
| 2497 | @item mbstate_t *__statep |
| 2498 | The @code{__statep} element points to an object of type @code{mbstate_t} |
| 2499 | (@pxref{Keeping the state}). The conversion of a stateful character |
| 2500 | set must use the object pointed to by @code{__statep} to store |
| 2501 | information about the conversion state. The @code{__statep} element |
| 2502 | itself must never be modified. |
| 2503 | |
| 2504 | @item mbstate_t __state |
| 2505 | This element must @emph{never} be used directly. It is only part of |
| 2506 | this structure to have the needed space allocated. |
| 2507 | @end table |
| 2508 | @end deftp |
| 2509 | |
| 2510 | @subsubsection @code{iconv} module interfaces |
| 2511 | |
| 2512 | With the knowledge about the data structures we now can describe the |
| 2513 | conversion function itself. To understand the interface a bit of |
| 2514 | knowledge is necessary about the functionality in the C library that |
| 2515 | loads the objects with the conversions. |
| 2516 | |
| 2517 | It is often the case that one conversion is used more than once (i.e., |
| 2518 | there are several @code{iconv_open} calls for the same set of character |
| 2519 | sets during one program run). The @code{mbsrtowcs} et al.@: functions in
| 2520 | @theglibc{} also use the @code{iconv} functionality, which |
| 2521 | increases the number of uses of the same functions even more. |
| 2522 | |
| 2523 | Because of this multiple use of conversions, the modules do not get |
| 2524 | loaded exclusively for one conversion. Instead a module once loaded can |
| 2525 | be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls |
| 2526 | at the same time. The splitting of the information between
| 2527 | conversion-function-specific information and conversion data makes this possible.
| 2528 | The last section showed the two data structures used to do this. |
| 2529 | |
| 2530 | This is of course also reflected in the interface and semantics of the |
| 2531 | functions that the modules must provide. There are three functions that |
| 2532 | must have the following names: |
| 2533 | |
| 2534 | @table @code |
| 2535 | @item gconv_init |
| 2536 | The @code{gconv_init} function initializes the conversion function |
| 2537 | specific data structure. This very same object is shared by all |
| 2538 | conversions that use this conversion and, therefore, no state information |
| 2539 | about the conversion itself must be stored in here. If a module |
| 2540 | implements more than one conversion, the @code{gconv_init} function will |
| 2541 | be called multiple times. |
| 2542 | |
| 2543 | @item gconv_end |
| 2544 | The @code{gconv_end} function is responsible for freeing all resources |
| 2545 | allocated by the @code{gconv_init} function. If there is nothing to do, |
| 2546 | this function can be missing. Special care must be taken if the module |
| 2547 | implements more than one conversion and the @code{gconv_init} function |
| 2548 | does not allocate the same resources for all conversions. |
| 2549 | |
| 2550 | @item gconv |
| 2551 | This is the actual conversion function. It is called to convert one |
| 2552 | block of text. It gets passed the conversion step information |
| 2553 | initialized by @code{gconv_init} and the conversion data, specific to |
| 2554 | this use of the conversion functions. |
| 2555 | @end table |
| 2556 | |
| 2557 | There are three data types defined for the three module interface |
| 2558 | functions and these define the interface. |
| 2559 | |
| 2560 | @comment gconv.h |
| 2561 | @comment GNU |
| 2562 | @deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *) |
| 2563 | This specifies the interface of the initialization function of the |
| 2564 | module. It is called exactly once for each conversion the module |
| 2565 | implements. |
| 2566 | |
| 2567 | As explained in the description of the @code{struct __gconv_step} data |
| 2568 | structure above the initialization function has to initialize parts of |
| 2569 | it. |
| 2570 | |
| 2571 | @table @code |
| 2572 | @item __min_needed_from |
| 2573 | @itemx __max_needed_from |
| 2574 | @itemx __min_needed_to |
| 2575 | @itemx __max_needed_to |
| 2576 | These elements must be initialized to the exact numbers of the minimum |
| 2577 | and maximum number of bytes used by one character in the source and |
| 2578 | destination character sets, respectively. If the characters all have the |
| 2579 | same size, the minimum and maximum values are the same. |
| 2580 | |
| 2581 | @item __stateful |
| 2582 | This element must be initialized to a nonzero value if the source |
| 2583 | character set is stateful. Otherwise it must be zero. |
| 2584 | @end table |
| 2585 | |
| 2586 | If the initialization function needs to communicate some information |
| 2587 | to the conversion function, this communication can happen using the |
| 2588 | @code{__data} element of the @code{__gconv_step} structure. But since |
| 2589 | this data is shared by all the conversions, it must not be modified by |
| 2590 | the conversion function. The example below shows how this can be used. |
| 2591 | |
| 2592 | @smallexample |
| 2593 | #define MIN_NEEDED_FROM 1 |
| 2594 | #define MAX_NEEDED_FROM 4 |
| 2595 | #define MIN_NEEDED_TO 4 |
| 2596 | #define MAX_NEEDED_TO 4 |
| 2597 | |
| 2598 | int |
| 2599 | gconv_init (struct __gconv_step *step) |
| 2600 | @{ |
| 2601 | /* @r{Determine which direction.} */ |
| 2602 | struct iso2022jp_data *new_data; |
| 2603 | enum direction dir = illegal_dir; |
| 2604 | enum variant var = illegal_var; |
| 2605 | int result; |
| 2606 | |
| 2607 | if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0) |
| 2608 | @{ |
| 2609 | dir = from_iso2022jp; |
| 2610 | var = iso2022jp; |
| 2611 | @} |
| 2612 | else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0) |
| 2613 | @{ |
| 2614 | dir = to_iso2022jp; |
| 2615 | var = iso2022jp; |
| 2616 | @} |
| 2617 | else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0) |
| 2618 | @{ |
| 2619 | dir = from_iso2022jp; |
| 2620 | var = iso2022jp2; |
| 2621 | @} |
| 2622 | else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0) |
| 2623 | @{ |
| 2624 | dir = to_iso2022jp; |
| 2625 | var = iso2022jp2; |
| 2626 | @} |
| 2627 | |
| 2628 | result = __GCONV_NOCONV; |
| 2629 | if (dir != illegal_dir) |
| 2630 | @{ |
| 2631 | new_data = (struct iso2022jp_data *) |
| 2632 | malloc (sizeof (struct iso2022jp_data)); |
| 2633 | |
| 2634 | result = __GCONV_NOMEM; |
| 2635 | if (new_data != NULL) |
| 2636 | @{ |
| 2637 | new_data->dir = dir; |
| 2638 | new_data->var = var; |
| 2639 | step->__data = new_data; |
| 2640 | |
| 2641 | if (dir == from_iso2022jp) |
| 2642 | @{ |
| 2643 | step->__min_needed_from = MIN_NEEDED_FROM; |
| 2644 | step->__max_needed_from = MAX_NEEDED_FROM; |
| 2645 | step->__min_needed_to = MIN_NEEDED_TO; |
| 2646 | step->__max_needed_to = MAX_NEEDED_TO; |
| 2647 | @} |
| 2648 | else |
| 2649 | @{ |
| 2650 | step->__min_needed_from = MIN_NEEDED_TO; |
| 2651 | step->__max_needed_from = MAX_NEEDED_TO; |
| 2652 | step->__min_needed_to = MIN_NEEDED_FROM; |
| 2653 | step->__max_needed_to = MAX_NEEDED_FROM + 2; |
| 2654 | @} |
| 2655 | |
| 2656 | /* @r{Yes, this is a stateful encoding.} */ |
| 2657 | step->__stateful = 1; |
| 2658 | |
| 2659 | result = __GCONV_OK; |
| 2660 | @} |
| 2661 | @} |
| 2662 | |
| 2663 | return result; |
| 2664 | @} |
| 2665 | @end smallexample |
| 2666 | |
| 2667 | The function first checks which conversion is wanted. The module from |
| 2668 | which this function is taken implements four different conversions; |
| 2669 | which one is selected can be determined by comparing the names. The |
| 2670 | comparison should always be done without paying attention to the case. |
| 2671 | |
| 2672 | Next, a data structure, which contains the necessary information about |
| 2673 | which conversion is selected, is allocated. The data structure |
| 2674 | @code{struct iso2022jp_data} is locally defined since, outside the |
| 2675 | module, this data is not used at all. Please note that if all four |
| 2676 | conversions this module supports are requested there are four data
| 2677 | blocks. |
| 2678 | |
| 2679 | One interesting thing is the initialization of the @code{__min_} and |
| 2680 | @code{__max_} elements of the step data object. A single ISO-2022-JP |
| 2681 | character can consist of one to four bytes. Therefore the |
| 2682 | @code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined |
| 2683 | this way. The output is always the @code{INTERNAL} character set (aka |
| 2684 | UCS-4) and therefore each character consists of exactly four bytes. For |
| 2685 | the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into |
| 2686 | account that escape sequences might be necessary to switch the character |
| 2687 | sets. Therefore the @code{__max_needed_to} element for this direction |
| 2688 | gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the |
| 2689 | two bytes needed for the escape sequences to signal the switching. The
| 2690 | asymmetry in the maximum values for the two directions can be explained |
| 2691 | easily: when reading ISO-2022-JP text, escape sequences can be handled |
| 2692 | alone (i.e., it is not necessary to process a real character since the |
| 2693 | effect of the escape sequence can be recorded in the state information). |
| 2694 | The situation is different for the other direction. Since it is in |
| 2695 | general not known which character comes next, one cannot emit escape |
| 2696 | sequences to change the state in advance. This means that the escape
| 2697 | sequences have to be emitted together with the next character.
| 2698 | Therefore one needs more room than only for the character itself. |
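
To see why accurate values matter, consider how a caller might size an
output buffer from them. The following is a minimal sketch under our
own assumptions: @code{struct step_sizes} and @code{worst_case_output}
are hypothetical names that merely mirror the four size elements, and
the computation is not claimed to be what the C library does
internally.

```c
#include <stddef.h>

/* Hypothetical stand-in for the four size elements of
   struct __gconv_step.  */
struct step_sizes
{
  int min_needed_from;
  int max_needed_from;
  int min_needed_to;
  int max_needed_to;
};

/* Upper bound on the output bytes produced for INBYTES bytes of input:
   the input holds at most INBYTES / min_needed_from characters (rounded
   up), and each of them needs at most max_needed_to output bytes.  */
static size_t
worst_case_output (const struct step_sizes *s, size_t inbytes)
{
  size_t min_from = (size_t) s->min_needed_from;
  size_t max_chars = (inbytes + min_from - 1) / min_from;
  return max_chars * (size_t) s->max_needed_to;
}
```

With the values from the example above for the direction from
@code{INTERNAL} to ISO-2022-JP (4, 4, 1, and 6), 40 bytes of input are
at most ten characters and therefore need at most 60 bytes of output.
If @code{__max_needed_to} were given as 4, a buffer sized this way could
be too small whenever escape sequences are emitted.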
| 2699 | |
| 2700 | The possible return values of the initialization function are: |
| 2701 | |
| 2702 | @table @code |
| 2703 | @item __GCONV_OK |
| 2704 | The initialization succeeded.
| 2705 | @item __GCONV_NOCONV |
| 2706 | The requested conversion is not supported in the module. This can |
| 2707 | happen if the @file{gconv-modules} file has errors. |
| 2708 | @item __GCONV_NOMEM |
| 2709 | Memory required to store additional information could not be allocated. |
| 2710 | @end table |
| 2711 | @end deftypevr |
| 2712 | |
| 2713 | The function called before the module is unloaded is significantly |
| 2714 | easier. It often has nothing at all to do, in which case it can be left
| 2715 | out completely. |
| 2716 | |
| 2717 | @comment gconv.h |
| 2718 | @comment GNU |
| 2719 | @deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)
| 2720 | The task of this function is to free all resources allocated in the |
| 2721 | initialization function. Therefore only the @code{__data} element of |
| 2722 | the object pointed to by the argument is of interest. Continuing the |
| 2723 | example from the initialization function, the finalization function |
| 2724 | looks like this: |
| 2725 | |
| 2726 | @smallexample |
| 2727 | void |
| 2728 | gconv_end (struct __gconv_step *data) |
| 2729 | @{ |
| 2730 | free (data->__data); |
| 2731 | @} |
| 2732 | @end smallexample |
| 2733 | @end deftypevr |
| 2734 | |
| 2735 | The most important function is the conversion function itself, which can |
| 2736 | get quite complicated for complex character sets. But since this is not |
| 2737 | of interest here, we will only describe a possible skeleton for the |
| 2738 | conversion function. |
| 2739 | |
| 2740 | @comment gconv.h |
| 2741 | @comment GNU |
| 2742 | @deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int) |
| 2743 | The conversion function can be called for two basic reasons: to convert
| 2744 | text or to reset the state. From the description of the @code{iconv} |
| 2745 | function it can be seen why the flushing mode is necessary. What mode |
| 2746 | is selected is determined by the sixth argument, an integer. This |
| 2747 | argument being nonzero means that flushing is selected. |
| 2748 | |
| 2749 | Common to both modes is where the output buffer can be found. The |
| 2750 | information about this buffer is stored in the conversion step data. A |
| 2751 | pointer to this information is passed as the second argument to this |
| 2752 | function. The description of the @code{struct __gconv_step_data} |
| 2753 | structure has more information on the conversion step data. |
| 2754 | |
| 2755 | @cindex stateful |
| 2756 | What has to be done for flushing depends on the source character set. |
| 2757 | If the source character set is not stateful, nothing has to be done. |
| 2758 | Otherwise the function has to emit a byte sequence to bring the state |
| 2759 | object into the initial state. Once all this has happened, the other
| 2760 | conversion modules in the chain of conversions have to get the same
| 2761 | chance. Whether another step follows can be determined from the
| 2762 | @code{__is_last} element of the step data structure to which the second
| 2763 | parameter points.
| 2764 | |
| 2765 | The more interesting mode is when actual text has to be converted. The |
| 2766 | first step in this case is to convert as much text as possible from the |
| 2767 | input buffer and store the result in the output buffer. The start of the |
| 2768 | input buffer is determined by the third argument, which is a pointer to a |
| 2769 | pointer variable referencing the beginning of the buffer. The fourth |
| 2770 | argument is a pointer to the byte right after the last byte in the buffer. |
| 2771 | |
| 2772 | The conversion has to be performed according to the current state if the |
| 2773 | character set is stateful. The state is stored in an object pointed to |
| 2774 | by the @code{__statep} element of the step data (second argument). Once |
| 2775 | either the input buffer is empty or the output buffer is full the |
| 2776 | conversion stops. At this point, the pointer variable referenced by the |
| 2777 | third parameter must point to the byte following the last processed |
| 2778 | byte (i.e., if all of the input is consumed, this pointer and the fourth |
| 2779 | parameter have the same value). |
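
The stop conditions and the pointer update just described can be
sketched with a deliberately trivial, stateless conversion from
one-byte characters to four-byte UCS-4LE characters. Everything in this
sketch is an assumption made for illustration: the function name, the
@code{SKETCH_} status values standing in for the @code{__GCONV_} codes,
and the explicit output pointer parameters used instead of the step
data structure.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the __GCONV_ status codes.  */
enum { SKETCH_EMPTY_INPUT, SKETCH_FULL_OUTPUT };

/* Convert one-byte characters to four-byte little-endian characters
   until the input is used up or the output buffer has no room for
   another character.  On return *INBUF points after the last processed
   input byte and *OUTBUF after the last written output byte.  */
static int
sketch_convert (const char **inbuf, const char *inbufend,
                char **outbuf, char *outbufend)
{
  const char *in = *inbuf;
  char *out = *outbuf;
  int status = SKETCH_EMPTY_INPUT;

  while (in < inbufend)
    {
      if (outbufend - out < 4)
        {
          /* No room for a complete output character.  */
          status = SKETCH_FULL_OUTPUT;
          break;
        }
      out[0] = *in++;
      out[1] = out[2] = out[3] = 0;
      out += 4;
    }

  /* Report how far we got in both buffers.  */
  *inbuf = in;
  *outbuf = out;
  return status;
}
```

If three input bytes meet an eight-byte output buffer, the loop stops
after two characters with the output buffer full, and the input pointer
is left at the third, still unprocessed byte.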
| 2780 | |
| 2781 | What now happens depends on whether this step is the last one. If it is |
| 2782 | the last step, the only thing that has to be done is to update the |
| 2783 | @code{__outbuf} element of the step data structure to point after the |
| 2784 | last written byte. This update gives the caller the information on how |
| 2785 | much text is available in the output buffer. In addition, the variable |
| 2786 | pointed to by the fifth parameter, which is of type @code{size_t}, must |
| 2787 | be incremented by the number of characters (@emph{not bytes}) that were |
| 2788 | converted in a non-reversible way. Then, the function can return. |
| 2789 | |
| 2790 | In case the step is not the last one, the later conversion functions have |
| 2791 | to get a chance to do their work. Therefore, the appropriate conversion |
| 2792 | function has to be called. The information about the functions is |
| 2793 | stored in the conversion data structures, passed as the first parameter. |
| 2794 | This information and the step data are stored in arrays, so the next |
| 2795 | element in both cases can be found by simple pointer arithmetic: |
| 2796 | |
| 2797 | @smallexample |
| 2798 | int |
| 2799 | gconv (struct __gconv_step *step, struct __gconv_step_data *data, |
| 2800 | const char **inbuf, const char *inbufend, size_t *written, |
| 2801 | int do_flush) |
| 2802 | @{ |
| 2803 | struct __gconv_step *next_step = step + 1; |
| 2804 | struct __gconv_step_data *next_data = data + 1; |
| 2805 | @dots{} |
| 2806 | @end smallexample |
| 2807 | |
| 2808 | The @code{next_step} pointer references the next step information and |
| 2809 | @code{next_data} the next data record. The call of the next function |
| 2810 | therefore will look similar to this: |
| 2811 | |
| 2812 | @smallexample |
| 2813 | next_step->__fct (next_step, next_data, &outerr, outbuf, |
| 2814 | written, 0) |
| 2815 | @end smallexample |
| 2816 | |
| 2817 | But this is not yet all. Once the function call returns the conversion |
| 2818 | function might have some more to do. If the return value of the function |
| 2819 | is @code{__GCONV_EMPTY_INPUT}, more room is available in the output |
| 2820 | buffer. Unless the input buffer is empty, the conversion functions start
| 2821 | all over again and process the rest of the input buffer. If the return |
| 2822 | value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have |
| 2823 | to recover from this. |
| 2824 | |
| 2825 | A requirement for the conversion function is that the input buffer |
| 2826 | pointer (the third argument) always point to the last character that |
| 2827 | was put in converted form into the output buffer. This is trivially |
| 2828 | true after the conversion performed in the current step, but if the |
| 2829 | conversion functions deeper downstream stop prematurely, not all |
| 2830 | characters from the output buffer are consumed and, therefore, the input |
| 2831 | buffer pointers must be backed off to the right position. |
| 2832 | |
| 2833 | Correcting the input buffers is easy to do if the input and output |
| 2834 | character sets have a fixed width for all characters. In this situation |
| 2835 | we can compute how many characters are left in the output buffer and, |
| 2836 | therefore, can correct the input buffer pointer appropriately with a |
| 2837 | similar computation. Things get tricky if either character set
| 2838 | has characters represented with variable length byte sequences, and it |
| 2839 | gets even more complicated if the conversion has to take care of the |
| 2840 | state. In these cases the conversion has to be performed once again, from |
| 2841 | the known state before the initial conversion (i.e., if necessary the |
| 2842 | state of the conversion has to be reset and the conversion loop has to be |
| 2843 | executed again). The difference now is that it is known how much output
| 2844 | must be created, and the conversion can stop before converting the first
| 2845 | unused character. Once this is done the input buffer pointers must be |
| 2846 | updated again and the function can return. |
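
For the easy case, where both character sets use a fixed width, the
correction amounts to a little pointer arithmetic. The helper below is
a sketch under that assumption only; its name and parameters are
hypothetical and do not appear in @file{gconv.h}.

```c
#include <assert.h>
#include <stddef.h>

/* IN_AFTER and OUT_AFTER point just past the input consumed and the
   output produced by this step.  OUTERR is where the next step stopped
   reading our output.  FROM_WIDTH and TO_WIDTH are the fixed character
   widths of the source and destination character sets.  Returns the
   corrected input pointer.  */
static const char *
back_off_input (const char *in_after, const char *out_after,
                const char *outerr, size_t from_width, size_t to_width)
{
  assert (from_width > 0 && to_width > 0);
  assert (outerr <= out_after);

  /* Characters that are still sitting unconsumed in the output.  */
  size_t unconsumed = (size_t) (out_after - outerr) / to_width;

  /* Each of them corresponds to FROM_WIDTH bytes of input.  */
  return in_after - unconsumed * from_width;
}
```

For example, if four one-byte characters were converted into 16 bytes
and the next step consumed only the first eight of them, two characters
remain unconsumed and the input pointer has to be moved back by two
bytes.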
| 2847 | |
| 2848 | One final thing should be mentioned. If it is necessary for the |
| 2849 | conversion to know whether it is the first invocation (in case a prolog |
| 2850 | has to be emitted), the conversion function should increment the |
| 2851 | @code{__invocation_counter} element of the step data structure just |
| 2852 | before returning to the caller. See the description of the @code{struct |
| 2853 | __gconv_step_data} structure above for more information on how this can |
| 2854 | be used. |
| 2855 | |
| 2856 | The return value must be one of the following values: |
| 2857 | |
| 2858 | @table @code |
| 2859 | @item __GCONV_EMPTY_INPUT |
| 2860 | All input was consumed and there is room left in the output buffer. |
| 2861 | @item __GCONV_FULL_OUTPUT |
| 2862 | No more room in the output buffer. In case this is not the last step |
| 2863 | this value is propagated down from the call of the next conversion |
| 2864 | function in the chain. |
| 2865 | @item __GCONV_INCOMPLETE_INPUT |
| 2866 | The input buffer is not entirely empty since it contains an incomplete |
| 2867 | character sequence. |
| 2868 | @end table |
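
An application never sees these values directly. To our understanding,
the @code{iconv} function reports @code{__GCONV_FULL_OUTPUT} as
@code{E2BIG} and @code{__GCONV_INCOMPLETE_INPUT} as @code{EINVAL}
through @code{errno}. The following sketch lets one observe this from
the outside; the helper name @code{convert_once} and the choice of
character sets are assumptions made for the example.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes of UTF-8 text to UCS-4LE
   with a single iconv call.  Returns 0 if all input was converted,
   otherwise the errno value reported by iconv_open or iconv.  */
static int
convert_once (const char *in, size_t inlen, char *out, size_t outsize)
{
  iconv_t cd = iconv_open ("UCS-4LE", "UTF-8");
  if (cd == (iconv_t) -1)
    return errno;

  char *inptr = (char *) in;
  size_t inleft = inlen;
  char *outptr = out;
  size_t outleft = outsize;
  int err = 0;

  /* Save errno right away; iconv_close might change it.  */
  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    err = errno;

  iconv_close (cd);
  return err;
}
```

Converting two characters into a buffer with room for only one should
yield @code{E2BIG}, and input that ends in the middle of a multibyte
sequence should yield @code{EINVAL}.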
| 2869 | |
| 2870 | The following example provides a framework for a conversion function. |
| 2871 | In case a new conversion has to be written the holes in this |
| 2872 | implementation have to be filled and that is it. |
| 2873 | |
| 2874 | @smallexample |
| 2875 | int |
| 2876 | gconv (struct __gconv_step *step, struct __gconv_step_data *data, |
| 2877 | const char **inbuf, const char *inbufend, size_t *written, |
| 2878 | int do_flush) |
| 2879 | @{ |
| 2880 | struct __gconv_step *next_step = step + 1; |
| 2881 | struct __gconv_step_data *next_data = data + 1; |
| 2882 | __gconv_fct fct = next_step->__fct;
| 2883 | int status; |
| 2884 | |
| 2885 | /* @r{If the function is called with no input this means we have} |
| 2886 | @r{to reset to the initial state. The possibly partly} |
| 2887 | @r{converted input is dropped.} */ |
| 2888 | if (do_flush) |
| 2889 | @{ |
| 2890 | status = __GCONV_OK; |
| 2891 | |
| 2892 | /* @r{Possibly emit a byte sequence which puts the state object}
| 2893 | @r{into the initial state.} */ |
| 2894 | |
| 2895 | /* @r{Call the steps down the chain if there are any but only} |
| 2896 | @r{if we successfully emitted the escape sequence.} */ |
| 2897 | if (status == __GCONV_OK && ! data->__is_last) |
| 2898 | status = fct (next_step, next_data, NULL, NULL, |
| 2899 | written, 1); |
| 2900 | @} |
| 2901 | else |
| 2902 | @{ |
| 2903 | /* @r{We preserve the initial values of the pointer variables.} */ |
| 2904 | const char *inptr = *inbuf; |
| 2905 | char *outbuf = data->__outbuf; |
| 2906 | char *outend = data->__outbufend; |
| 2907 | char *outptr; |
| 2908 | |
| 2909 | do |
| 2910 | @{ |
| 2911 | /* @r{Remember the start value for this round.} */ |
| 2912 | inptr = *inbuf; |
| 2913 | /* @r{The outbuf buffer is empty.} */ |
| 2914 | outptr = outbuf; |
| 2915 | |
| 2916 | /* @r{For stateful encodings the state must be saved here.} */
| 2917 | |
| 2918 | /* @r{Run the conversion loop. @code{status} is set} |
| 2919 | @r{appropriately afterwards.} */ |
| 2920 | |
| 2921 | /* @r{If this is the last step, leave the loop. There is} |
| 2922 | @r{nothing we can do.} */ |
| 2923 | if (data->__is_last) |
| 2924 | @{ |
| 2925 | /* @r{Store information about how many bytes are} |
| 2926 | @r{available.} */ |
| 2927 | data->__outbuf = outbuf; |
| 2928 | |
| 2929 | /* @r{If any non-reversible conversions were performed,} |
| 2930 | @r{add the number to @code{*written}.} */ |
| 2931 | |
| 2932 | break; |
| 2933 | @} |
| 2934 | |
| 2935 | /* @r{Write out all output that was produced.} */ |
| 2936 | if (outbuf > outptr) |
| 2937 | @{ |
| 2938 | const char *outerr = data->__outbuf; |
| 2939 | int result; |
| 2940 | |
| 2941 | result = fct (next_step, next_data, &outerr, |
| 2942 | outbuf, written, 0); |
| 2943 | |
| 2944 | if (result != __GCONV_EMPTY_INPUT) |
| 2945 | @{ |
| 2946 | if (outerr != outbuf) |
| 2947 | @{ |
| 2948 | /* @r{Reset the input buffer pointer. We} |
| 2949 | @r{document here the complex case.} */ |
| 2950 | size_t nstatus; |
| 2951 | |
| 2952 | /* @r{Reload the pointers.} */ |
| 2953 | *inbuf = inptr; |
| 2954 | outbuf = outptr; |
| 2955 | |
| 2956 | /* @r{Possibly reset the state.} */ |
| 2957 | |
| 2958 | /* @r{Redo the conversion, but this time} |
| 2959 | @r{the end of the output buffer is at} |
| 2960 | @r{@code{outerr}.} */ |
| 2961 | @} |
| 2962 | |
| 2963 | /* @r{Change the status.} */ |
| 2964 | status = result; |
| 2965 | @} |
| 2966 | else |
| 2967 | /* @r{All the output is consumed, we can make} |
| 2968 | @r{ another run if everything was ok.} */ |
| 2969 | if (status == __GCONV_FULL_OUTPUT) |
| 2970 | status = __GCONV_OK; |
| 2971 | @} |
| 2972 | @} |
| 2973 | while (status == __GCONV_OK); |
| 2974 | |
| 2975 | /* @r{We finished one use of this step.} */ |
| 2976 | ++data->__invocation_counter; |
| 2977 | @} |
| 2978 | |
| 2979 | return status; |
| 2980 | @} |
| 2981 | @end smallexample |
| 2982 | @end deftypevr |
| 2983 | |
| 2984 | This information should be sufficient to write new modules. Anybody |
| 2985 | doing so should also take a look at the available source code in the |
| 2986 | @glibcadj{} sources. It contains many examples of working and optimized |
| 2987 | modules. |
| 2988 | |
| 2989 | @c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation |