| 1 | @node Message Translation, Searching and Sorting, Locales, Top |
| 2 | @c %MENU% How to make the program speak the user's language |
| 3 | @chapter Message Translation |
| 4 | |
| 5 | The program's interface with the user should be designed to ease the user's |
| 6 | task. One way to ease the user's task is to use messages in whatever |
| 7 | language the user prefers. |
| 8 | |
| 9 | Printing messages in different languages can be implemented in different |
| 10 | ways. One could add all the different languages in the source code and |
| 11 | choose among the variants every time a message has to be printed. This is |
| 12 | certainly not a good solution since extending the set of languages is |
| 13 | cumbersome (the code must be changed) and the code itself can become |
| 14 | really big with dozens of message sets. |
| 15 | |
| 16 | A better solution is to keep the message sets for each language |
| 17 | in separate files which are loaded at runtime depending on the language |
| 18 | selection of the user. |
| 19 | |
| 20 | @Theglibc{} provides two different sets of functions to support |
| 21 | message translation. The problem is that neither of the interfaces is |
| 22 | officially defined by the POSIX standard. The @code{catgets} family of |
| 23 | functions is defined in the X/Open standard but this is derived from |
| 24 | industry decisions and therefore not necessarily based on reasonable |
| 25 | decisions. |
| 26 | |
| 27 | As mentioned above the message catalog handling provides easy |
| 28 | extendibility by using external data files which contain the message |
| 29 | translations. I.e., these files contain for each of the messages used |
| 30 | in the program a translation for the appropriate language. So the tasks |
| 31 | of the message handling functions are |
| 32 | |
| 33 | @itemize @bullet |
| 34 | @item |
| 35 | locate the external data file with the appropriate translations |
| 36 | @item |
| 37 | load the data and make it possible to address the messages |
| 38 | @item |
| 39 | map a given key to the translated message |
| 40 | @end itemize |
| 41 | |
| 42 | The two approaches mainly differ in the implementation of this last |
| 43 | step. Decisions made in the last step influence the rest of the design. |
| 44 | |
| 45 | @menu |
| 46 | * Message catalogs a la X/Open:: The @code{catgets} family of functions. |
| 47 | * The Uniforum approach:: The @code{gettext} family of functions. |
| 48 | @end menu |
| 49 | |
| 50 | |
| 51 | @node Message catalogs a la X/Open |
| 52 | @section X/Open Message Catalog Handling |
| 53 | |
| 54 | The @code{catgets} functions are based on the simple scheme: |
| 55 | |
| 56 | @quotation |
| 57 | Associate every message to translate in the source code with a unique |
| 58 | identifier. To retrieve a message from a catalog file solely the |
| 59 | identifier is used. |
| 60 | @end quotation |
| 61 | |
| 62 | This means for the author of the program that s/he will have to make |
| 63 | sure the meaning of the identifier in the program code and in the |
| 64 | message catalogs are always the same. |
| 65 | |
| 66 | Before a message can be translated the catalog file must be located. |
| 67 | The user of the program must be able to guide the responsible function |
| 68 | to find whatever catalog the user wants. This is separated from what |
| 69 | the programmer had in mind. |
| 70 | |
| 71 | All the types, constants and functions for the @code{catgets} functions |
| 72 | are defined/declared in the @file{nl_types.h} header file. |
| 73 | |
| 74 | @menu |
| 75 | * The catgets Functions:: The @code{catgets} function family. |
| 76 | * The message catalog files:: Format of the message catalog files. |
| 77 | * The gencat program:: How to generate message catalog files which |
| 78 | can be used by the functions. |
| 79 | * Common Usage:: How to use the @code{catgets} interface. |
| 80 | @end menu |
| 81 | |
| 82 | |
| 83 | @node The catgets Functions |
| 84 | @subsection The @code{catgets} function family |
| 85 | |
| 86 | @comment nl_types.h |
| 87 | @comment X/Open |
| 88 | @deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag}) |
| 89 | @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}} |
| 90 | @c catopen @mtsenv @ascuheap @acsmem |
| 91 | @c strchr ok |
| 92 | @c setlocale(,NULL) ok |
| 93 | @c getenv @mtsenv |
| 94 | @c strlen ok |
| 95 | @c alloca ok |
| 96 | @c stpcpy ok |
| 97 | @c malloc @ascuheap @acsmem |
| 98 | @c __open_catalog @ascuheap @acsmem |
| 99 | @c strchr ok |
| 100 | @c open_not_cancel_2 @acsfd |
| 101 | @c strlen ok |
| 102 | @c ENOUGH ok |
| 103 | @c alloca ok |
| 104 | @c memcpy ok |
| 105 | @c fxstat64 ok |
| 106 | @c __set_errno ok |
| 107 | @c mmap @acsmem |
| 108 | @c malloc dup @ascuheap @acsmem |
| 109 | @c read_not_cancel ok |
| 110 | @c free dup @ascuheap @acsmem |
| 111 | @c munmap ok |
| 112 | @c close_not_cancel_no_status ok |
| 113 | @c free @ascuheap @acsmem |
| 114 | The @code{catopen} function tries to locate the message data file named |
| 115 | @var{cat_name} and loads it when found. The return value is of an |
| 116 | opaque type and can be used in calls to the other functions to refer to |
| 117 | this loaded catalog. |
| 118 | |
| 119 | The return value is @code{(nl_catd) -1} in case the function failed and |
| 120 | no catalog was loaded. The global variable @var{errno} contains a code |
| 121 | for the error causing the failure. But even if the function call |
| 122 | succeeded this does not mean that all messages can be translated. |
| 123 | |
| 124 | Locating the catalog file must happen in a way which lets the user of |
| 125 | the program influence the decision. It is up to the user to decide |
| 126 | about the language to use and sometimes it is useful to use alternate |
| 127 | catalog files. All this can be specified by the user by setting some |
| 128 | environment variables. |
| 129 | |
| 130 | The first problem is to find out where all the message catalogs are |
| 131 | stored. Every program could have its own place to keep all the |
| 132 | different files but usually the catalog files are grouped by languages |
| 133 | and the catalogs for all programs are kept in the same place. |
| 134 | |
| 135 | @cindex NLSPATH environment variable |
| 136 | To tell the @code{catopen} function where the catalog for the program |
| 137 | can be found the user can set the environment variable @code{NLSPATH} to |
| 138 | a value which describes her/his choice. Since this value must be usable |
| 139 | for different languages and locales it cannot be a simple string. |
| 140 | Instead it is a format string (similar to @code{printf}'s). An example |
| 141 | is |
| 142 | |
| 143 | @smallexample |
| 144 | /usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N |
| 145 | @end smallexample |
| 146 | |
| 147 | First one can see that more than one directory can be specified (with |
| 148 | the usual syntax of separating them by colons). The next things to |
| 149 | observe are the format elements, @code{%L} and @code{%N} in this case. |
| 150 | The @code{catopen} function knows about several of them and each one |
| 151 | is replaced by a different value. |
| 152 | |
| 153 | @table @code |
| 154 | @item %N |
| 155 | This format element is substituted with the name of the catalog file. |
| 156 | This is the value of the @var{cat_name} argument given to |
| 157 | @code{catopen}. |
| 158 | |
| 159 | @item %L |
| 160 | This format element is substituted with the name of the currently |
| 161 | selected locale for translating messages. How this is determined is |
| 162 | explained below. |
| 163 | |
| 164 | @item %l |
| 165 | (This is the lowercase ell.) This format element is substituted with the |
| 166 | language element of the locale name. The string describing the selected |
| 167 | locale is expected to have the form |
| 168 | @code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the |
| 169 | first part @var{lang}. |
| 170 | |
| 171 | @item %t |
| 172 | This format element is substituted by the territory part @var{terr} of |
| 173 | the name of the currently selected locale. See the explanation of the |
| 174 | format above. |
| 175 | |
| 176 | @item %c |
| 177 | This format element is substituted by the codeset part @var{codeset} of |
| 178 | the name of the currently selected locale. See the explanation of the |
| 179 | format above. |
| 180 | |
| 181 | @item %% |
| 182 | Since @code{%} is used as a meta character there must be a way to |
| 183 | express the @code{%} character in the result itself. Using @code{%%} |
| 184 | does this just like it works for @code{printf}. |
| 185 | @end table |
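
The substitution rules in the table above can be sketched in a few lines
of C.  This is only an illustration, not the code @code{catopen} uses:
the function @code{expand_nlspath_element} and its helper
@code{append_bytes} are hypothetical names, the locale name is assumed
to have the plain form @code{@var{lang}[_@var{terr}[.@var{codeset}]]},
and unknown format elements are copied unchanged, which is an assumption
rather than documented behavior.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any library: append N bytes of SRC to
   the string in OUT, which has room for OUTSZ bytes including the NUL.  */
static int
append_bytes (char *out, size_t outsz, size_t *len, const char *src, size_t n)
{
  if (n >= outsz - *len)
    return -1;
  memcpy (out + *len, src, n);
  *len += n;
  out[*len] = '\0';
  return 0;
}

/* Expand one NLSPATH element TMPL for the catalog NAME and the locale
   name LOCALE.  Returns 0 on success and -1 if OUT is too small.  */
int
expand_nlspath_element (const char *tmpl, const char *name,
                        const char *locale, char *out, size_t outsz)
{
  /* Split LOCALE of the form lang[_terr[.codeset]].  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = "", *codeset = "";
  size_t terr_len = 0, codeset_len = 0, len = 0;
  const char *p = locale + lang_len;

  if (*p == '_')
    {
      terr = p + 1;
      terr_len = strcspn (terr, ".");
      p = terr + terr_len;
    }
  if (*p == '.')
    {
      codeset = p + 1;
      codeset_len = strlen (codeset);
    }

  if (outsz == 0)
    return -1;
  out[0] = '\0';

  for (const char *s = tmpl; *s != '\0'; ++s)
    {
      const char *src = s;      /* Ordinary characters are copied as-is.  */
      size_t n = 1;

      if (*s == '%' && s[1] != '\0')
        {
          ++s;
          switch (*s)
            {
            case 'N': src = name; n = strlen (name); break;
            case 'L': src = locale; n = strlen (locale); break;
            case 'l': src = locale; n = lang_len; break;
            case 't': src = terr; n = terr_len; break;
            case 'c': src = codeset; n = codeset_len; break;
            case '%': src = "%"; n = 1; break;
            /* Assumption: unknown elements are kept unchanged.  */
            default: src = s - 1; n = 2; break;
            }
        }
      if (append_bytes (out, outsz, &len, src, n) != 0)
        return -1;
    }
  return 0;
}
```

With the locale name @code{de_DE.UTF-8} and the catalog name
@code{hello.cat}, the element @code{/usr/share/locale/%L/LC_MESSAGES/%N}
expands to @code{/usr/share/locale/de_DE.UTF-8/LC_MESSAGES/hello.cat}.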
| 186 | |
| 187 | |
| 188 | Using @code{NLSPATH} allows arbitrary directories to be searched for |
| 189 | message catalogs while still allowing different languages to be used. |
| 190 | If the @code{NLSPATH} environment variable is not set, the default value |
| 191 | is |
| 192 | |
| 193 | @smallexample |
| 194 | @var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N |
| 195 | @end smallexample |
| 196 | |
| 197 | @noindent |
| 198 | where @var{prefix} is given to @code{configure} while installing @theglibc{} |
| 199 | (this value is in many cases @code{/usr} or the empty string). |
| 200 | |
| 201 | The remaining problem is to decide which locale name must be used. This |
| 202 | value determines the substitution of the format elements mentioned above. |
| 203 | First of all the user can specify a path in the message catalog name |
| 204 | (i.e., the name contains a slash character). In this situation the |
| 205 | @code{NLSPATH} environment variable is not used. The catalog must exist |
| 206 | as specified in the program, perhaps relative to the current working |
| 207 | directory. This situation is not desirable and catalog names should |
| 208 | never be written this way. Besides, this behavior is not portable |
| 209 | to all other platforms providing the @code{catgets} interface. |
| 210 | |
| 211 | @cindex LC_ALL environment variable |
| 212 | @cindex LC_MESSAGES environment variable |
| 213 | @cindex LANG environment variable |
| 214 | Otherwise the values of environment variables from the standard |
| 215 | environment are examined (@pxref{Standard Environment}). Which |
| 216 | variables are examined is decided by the @var{flag} parameter of |
| 217 | @code{catopen}. If the value is @code{NL_CAT_LOCALE} (which is defined |
| 218 | in @file{nl_types.h}) then the @code{catopen} function uses the name of |
| 219 | the locale currently selected for the @code{LC_MESSAGES} category. |
| 220 | |
| 221 | If @var{flag} is zero the @code{LANG} environment variable is examined. |
| 222 | This is a left-over from the early days where the concept of the locales |
| 223 | had not even reached the level of POSIX locales. |
| 224 | |
| 225 | The environment variable and the locale name should have a value of the |
| 226 | form @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above. |
| 227 | If no environment variable is set the @code{"C"} locale is used which |
| 228 | prevents any translation. |
| 229 | |
| 230 | The return value of @code{catgets} is in any case a valid string. Either |
| 231 | it is a translation from a message catalog or it is the same as the |
| 232 | @var{string} parameter. So a piece of code to decide whether a |
| 233 | translation actually happened must look like this: |
| 234 | |
| 235 | @smallexample |
| 236 | @{ |
| 237 | char *trans = catgets (desc, set, msg, input_string); |
| 238 | if (trans == input_string) |
| 239 | @{ |
| 240 | /* Something went wrong. */ |
| 241 | @} |
| 242 | @} |
| 243 | @end smallexample |
| 244 | |
| 245 | @noindent |
| 246 | When an error occurs the global variable @var{errno} is set to |
| 247 | |
| 248 | @table @var |
| 249 | @item EBADF |
| 250 | The catalog does not exist. |
| 251 | @item ENOMSG |
| 252 | The set/message tuple does not name an existing element in the |
| 253 | message catalog. |
| 254 | @end table |
| 255 | |
| 256 | While it sometimes can be useful to test for errors programs normally |
| 257 | will avoid any test. If the translation is not available it is no big |
| 258 | problem if the original, untranslated message is printed. Either the |
| 259 | user understands this as well or s/he will look for the reason why the |
| 260 | messages are not translated. |
| 261 | @end deftypefun |
| 262 | |
| 263 | Please note that the currently selected locale does not depend on a call |
| 264 | to the @code{setlocale} function. It is not necessary that the locale |
| 265 | data files for this locale exist and calling @code{setlocale} succeeds. |
| 266 | The @code{catopen} function directly reads the values of the environment |
| 267 | variables. |
| 268 | |
| 269 | |
| 270 | @deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string}) |
| 271 | @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} |
| 272 | The function @code{catgets} has to be used to access the message catalog |
| 273 | previously opened using the @code{catopen} function. The |
| 274 | @var{catalog_desc} parameter must be a value previously returned by |
| 275 | @code{catopen}. |
| 276 | |
| 277 | The next two parameters, @var{set} and @var{message}, reflect the |
| 278 | internal organization of the message catalog files. This will be |
| 279 | explained in detail below. For now it is interesting to know that a |
| 280 | catalog can consist of several sets and the messages in each set are |
| 281 | individually numbered. Neither the set numbers nor the |
| 282 | message numbers need to be consecutive. They can be arbitrarily chosen. |
| 283 | But each message (unless equal to another one) must have its own unique |
| 284 | pair of set and message number. |
| 285 | |
| 286 | Since it is not guaranteed that the message catalog for the language |
| 287 | selected by the user exists the last parameter @var{string} helps to |
| 288 | handle this case gracefully. If no matching string can be found |
| 289 | @var{string} is returned. This means for the programmer that |
| 290 | |
| 291 | @itemize @bullet |
| 292 | @item |
| 293 | the @var{string} parameters should contain reasonable text (this also |
| 294 | helps to understand the program since otherwise there would be no hint |
| 295 | on the string which is expected to be returned). |
| 296 | @item |
| 297 | all @var{string} arguments should be written in the same language. |
| 298 | @end itemize |
| 299 | @end deftypefun |
| 300 | |
| 301 | It is somewhat uncomfortable to write a program using the @code{catgets} |
| 302 | functions if no supporting functionality is available. Since each |
| 303 | set/message number tuple must be unique the programmer must keep lists |
| 304 | of the messages at the same time the code is written. And the work |
| 305 | between several people working on the same project must be coordinated. |
| 306 | We will see how these problems can be relaxed a bit (@pxref{Common |
| 307 | Usage}). |
| 308 | |
| 309 | @deftypefun int catclose (nl_catd @var{catalog_desc}) |
| 310 | @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acucorrupt{} @acsmem{}}} |
| 311 | @c catclose @ascuheap @acucorrupt @acsmem |
| 312 | @c __set_errno ok |
| 313 | @c munmap ok |
| 314 | @c free @ascuheap @acsmem |
| 315 | The @code{catclose} function can be used to free the resources |
| 316 | associated with a message catalog which previously was opened by a call |
| 317 | to @code{catopen}. If the resources can be successfully freed the |
| 318 | function returns @code{0}. Otherwise it returns @code{@minus{}1} and the |
| 319 | global variable @var{errno} is set. Errors can occur if the catalog |
| 320 | descriptor @var{catalog_desc} is not valid in which case @var{errno} is |
| 321 | set to @code{EBADF}. |
| 322 | @end deftypefun |
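
The three functions are typically used together.  The following is a
minimal sketch under stated assumptions: @code{copy_message} is a
hypothetical helper, not part of the interface, and the catalog name
passed to it is up to the caller.  It opens the catalog, looks up one
message, copies the result while the catalog is still open, and closes
the catalog again.  It uses the pointer comparison described above to
report whether a translation was found.

```c
#include <nl_types.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: look up message MSG of set SET in the catalog
   CAT_NAME and copy the result into BUF.  Returns 1 if a translation was
   found, 0 if FALLBACK was used, and -1 if BUF is too small.  */
int
copy_message (const char *cat_name, int set, int msg,
              const char *fallback, char *buf, size_t bufsz)
{
  const char *text = fallback;
  nl_catd catd = catopen (cat_name, NL_CAT_LOCALE);
  int result;

  /* catgets returns its last argument unchanged when no translation
     is available, so FALLBACK doubles as the default text.  */
  if (catd != (nl_catd) -1)
    text = catgets (catd, set, msg, fallback);

  if (strlen (text) >= bufsz)
    result = -1;
  else
    {
      /* Copy before catclose: a translated string belongs to the
         catalog and must not be used after the catalog is closed.  */
      strcpy (buf, text);
      result = text != fallback;
    }

  if (catd != (nl_catd) -1)
    catclose (catd);
  return result;
}
```

A long-running program would normally keep the catalog open instead of
opening and closing it for every message; the helper does both only to
stay self-contained.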
| 323 | |
| 324 | |
| 325 | @node The message catalog files |
| 326 | @subsection Format of the message catalog files |
| 327 | |
| 328 | The only reasonable way to translate all the messages of a program and |
| 329 | store the result in a message catalog file which can be read by the |
| 330 | @code{catopen} function is to give all the message text to the |
| 331 | translator and let her/him translate them all. I.e., we must have a |
| 332 | file with entries which associate the set/message tuple with a specific |
| 333 | translation. This file format is specified in the X/Open standard and |
| 334 | is as follows: |
| 335 | |
| 336 | @itemize @bullet |
| 337 | @item |
| 338 | Lines containing only whitespace characters or empty lines are ignored. |
| 339 | |
| 340 | @item |
| 341 | Lines which contain as the first non-whitespace character a @code{$} |
| 342 | followed by a whitespace character are comments and are also ignored. |
| 343 | |
| 344 | @item |
| 345 | If a line contains as the first non-whitespace characters the sequence |
| 346 | @code{$set} followed by a whitespace character an additional argument |
| 347 | is required to follow. This argument can either be: |
| 348 | |
| 349 | @itemize @minus |
| 350 | @item |
| 351 | a number. In this case the value of this number determines the set |
| 352 | to which the following messages are added. |
| 353 | |
| 354 | @item |
| 355 | an identifier consisting of alphanumeric characters plus the underscore |
| 356 | character. In this case the set automatically gets a number assigned. |
| 357 | This value is one more than the largest set number which has appeared so far. |
| 358 | |
| 359 | How to use the symbolic names is explained in section @ref{Common Usage}. |
| 360 | |
| 361 | It is an error if a symbol name appears more than once. All following |
| 362 | messages are placed in a set with this number. |
| 363 | @end itemize |
| 364 | |
| 365 | @item |
| 366 | If a line contains as the first non-whitespace characters the sequence |
| 367 | @code{$delset} followed by a whitespace character an additional argument |
| 368 | is required to follow. This argument can either be: |
| 369 | |
| 370 | @itemize @minus |
| 371 | @item |
| 372 | a number. In this case the value of this number determines the set |
| 373 | which will be deleted. |
| 374 | |
| 375 | @item |
| 376 | an identifier consisting of alphanumeric characters plus the underscore |
| 377 | character. This symbolic identifier must match a name for a set which |
| 378 | previously was defined. It is an error if the name is unknown. |
| 379 | @end itemize |
| 380 | |
| 381 | In both cases all messages in the specified set will be removed. They |
| 382 | will not appear in the output. But if this set is later selected again |
| 383 | with a @code{$set} command, messages can be added again and these |
| 384 | messages will appear in the output. |
| 385 | |
| 386 | @item |
| 387 | If a line contains after leading whitespace the sequence |
| 388 | @code{$quote}, the quoting character used for this input file is |
| 389 | changed to the first non-whitespace character following the |
| 390 | @code{$quote}. If no non-whitespace character is present before the |
| 391 | line ends quoting is disabled. |
| 392 | |
| 393 | By default no quoting character is used. In this mode strings are |
| 394 | terminated with the first unescaped line break. If there is a |
| 395 | @code{$quote} sequence present newlines need not be escaped. Instead a |
| 396 | string is terminated with the first unescaped appearance of the quote |
| 397 | character. |
| 398 | |
| 399 | A common usage of this feature would be to set the quote character to |
| 400 | @code{"}. Then any appearance of the @code{"} in the strings must |
| 401 | be escaped using the backslash (i.e., @code{\"} must be written). |
| 402 | |
| 403 | @item |
| 404 | Any other line must start with a number or an alphanumeric identifier |
| 405 | (with the underscore character included). The following characters |
| 406 | (starting after the first whitespace character) will form the string |
| 407 | which gets associated with the currently selected set and the message |
| 408 | number represented by the number or identifier respectively. |
| 409 | |
| 410 | If the start of the line is a number the message number is obvious. It |
| 411 | is an error if the same message number already appeared for this set. |
| 412 | |
| 413 | If the leading token was an identifier the message number gets |
| 414 | automatically assigned. The value is the current maximum message |
| 415 | number for this set plus one. It is an error if the identifier was |
| 416 | already used for a message in this set. It is OK to reuse the |
| 417 | identifier for a message in another set. How to use the symbolic |
| 418 | identifiers will be explained below (@pxref{Common Usage}). There is |
| 419 | one limitation with the identifier: it must not be @code{Set}. The |
| 420 | reason will be explained below. |
| 421 | |
| 422 | The text of the messages can contain escape characters. The usual bunch |
| 423 | of characters known from the @w{ISO C} language are recognized |
| 424 | (@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f}, |
| 425 | @code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of |
| 426 | a character code). |
| 427 | @end itemize |
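
The rules above amount to a classification of each input line by its
first non-whitespace characters.  The following sketch illustrates this;
it is not how @code{gencat} is implemented.  The names
@code{line_kind}, @code{classify_line} and @code{starts_with_directive}
are hypothetical, and treating a lone @code{$} at the end of a line as a
comment is an assumption.

```c
#include <ctype.h>
#include <string.h>

enum line_kind
{
  LINE_BLANK, LINE_COMMENT, LINE_SET, LINE_DELSET,
  LINE_QUOTE, LINE_MESSAGE, LINE_INVALID
};

/* Return nonzero if LINE starts with WORD followed by whitespace.  */
static int
starts_with_directive (const char *line, const char *word)
{
  size_t n = strlen (word);
  return strncmp (line, word, n) == 0 && isspace ((unsigned char) line[n]);
}

/* Classify one line of a message catalog source file.  */
enum line_kind
classify_line (const char *line)
{
  /* Leading whitespace never changes the meaning of a line.  */
  while (isspace ((unsigned char) *line))
    ++line;

  if (*line == '\0')
    return LINE_BLANK;

  if (line[0] == '$')
    {
      /* Assumption: a lone '$' at the end of the line is a comment.  */
      if (line[1] == '\0' || isspace ((unsigned char) line[1]))
        return LINE_COMMENT;
      if (starts_with_directive (line, "$set"))
        return LINE_SET;
      if (starts_with_directive (line, "$delset"))
        return LINE_DELSET;
      if (strncmp (line, "$quote", 6) == 0)
        return LINE_QUOTE;
      return LINE_INVALID;
    }

  /* Messages start with a number or a symbolic identifier.  */
  if (isalnum ((unsigned char) *line) || *line == '_')
    return LINE_MESSAGE;
  return LINE_INVALID;
}
```

A complete reader would go on to parse the argument of each directive
and the message text, including the quote character and the escape
sequences.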
| 428 | |
| 429 | @strong{Important:} The handling of identifiers instead of numbers for |
| 430 | the set and messages is a GNU extension. Systems strictly following the |
| 431 | X/Open specification do not have this feature. An example for a message |
| 432 | catalog file is this: |
| 433 | |
| 434 | @smallexample |
| 435 | $ This is a leading comment. |
| 436 | $quote " |
| 437 | |
| 438 | $set SetOne |
| 439 | 1 Message with ID 1. |
| 440 | two " Message with ID \"two\", which gets the value 2 assigned" |
| 441 | |
| 442 | $set SetTwo |
| 443 | $ Since the last set got the number 1 assigned this set has number 2. |
| 444 | 4000 "The numbers can be arbitrary, they need not start at one." |
| 445 | @end smallexample |
| 446 | |
| 447 | This small example shows various aspects: |
| 448 | @itemize @bullet |
| 449 | @item |
| 450 | Lines 1 and 9 are comments since they start with @code{$} followed by |
| 451 | a whitespace. |
| 452 | @item |
| 453 | The quoting character is set to @code{"}. Otherwise the quotes in the |
| 454 | message definition would have to be left out and in this case the |
| 455 | message with the identifier @code{two} would lose its leading whitespace. |
| 456 | @item |
| 457 | Mixing numbered messages with messages having symbolic names is no |
| 458 | problem and the numbering happens automatically. |
| 459 | @end itemize |
| 460 | |
| 461 | |
| 462 | While this file format is pretty easy it is not the best possible for |
| 463 | use in a running program. The @code{catopen} function would have to |
| 464 | parse the file and handle syntactic errors gracefully. This is not so |
| 465 | easy and the whole process is pretty slow. Therefore the @code{catgets} |
| 466 | functions expect the data in another more compact and ready-to-use file |
| 467 | format. There is a special program @code{gencat} which is explained in |
| 468 | detail in the next section. |
| 469 | |
| 470 | Files in this other format are not human readable. To be easy to use by |
| 471 | programs it is a binary file. But the format is byte order independent |
| 472 | so translation files can be shared by systems of arbitrary architecture |
| 473 | (as long as they use @theglibc{}). |
| 474 | |
| 475 | Details about the binary file format are not important to know since |
| 476 | these files are always created by the @code{gencat} program. The |
| 477 | sources of @theglibc{} also provide the sources for the |
| 478 | @code{gencat} program and so the interested reader can look through |
| 479 | these source files to learn about the file format. |
| 480 | |
| 481 | |
| 482 | @node The gencat program |
| 483 | @subsection Generate Message Catalog Files |
| 484 | |
| 485 | @cindex gencat |
| 486 | The @code{gencat} program is specified in the X/Open standard and the |
| 487 | GNU implementation follows this specification and so processes |
| 488 | all correctly formed input files. Additionally some extensions are |
| 489 | implemented which help to work in a more reasonable way with the |
| 490 | @code{catgets} functions. |
| 491 | |
| 492 | The @code{gencat} program can be invoked in two ways: |
| 493 | |
| 494 | @example |
| 495 | gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}] |
| 496 | @end example |
| 497 | |
| 498 | This is the interface defined in the X/Open standard. If no |
| 499 | @var{Input-File} parameter is given input will be read from standard |
| 500 | input. Multiple input files will be read as if they are concatenated. |
| 501 | If @var{Output-File} is also missing, the output will be written to |
| 502 | standard output. To provide the interface one is used to from other |
| 503 | programs a second interface is provided. |
| 504 | |
| 505 | @smallexample |
| 506 | gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{} |
| 507 | @end smallexample |
| 508 | |
| 509 | The option @samp{-o} is used to specify the output file and all file |
| 510 | arguments are used as input files. |
| 511 | |
| 512 | Besides this one can use @file{-} or @file{/dev/stdin} for |
| 513 | @var{Input-File} to denote the standard input. Correspondingly one can |
| 514 | use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote |
| 515 | standard output. Using @file{-} as a file name is allowed in X/Open |
| 516 | while using the device names is a GNU extension. |
| 517 | |
| 518 | The @code{gencat} program works by concatenating all input files and |
| 519 | then @strong{merging} the resulting collection of message sets with a |
| 520 | possibly existing output file. This is done by removing all messages |
| 521 | with set/message number tuples matching any of the generated messages |
| 522 | from the output file and then adding all the new messages. To |
| 523 | regenerate a catalog file while ignoring the old contents therefore |
| 524 | requires removing the output file if it exists. If the output is |
| 525 | written to standard output no merging takes place. |
| 526 | |
| 527 | @noindent |
| 528 | The following table shows the options understood by the @code{gencat} |
| 529 | program. The X/Open standard does not specify any option for the |
| 530 | program so all of these are GNU extensions. |
| 531 | |
| 532 | @table @samp |
| 533 | @item -V |
| 534 | @itemx --version |
| 535 | Print the version information and exit. |
| 536 | @item -h |
| 537 | @itemx --help |
| 538 | Print a usage message listing all available options, then exit successfully. |
| 539 | @item --new |
| 540 | Never merge the new messages from the input files with the old content |
| 541 | of the output file. The old content of the output file is discarded. |
| 542 | @item -H |
| 543 | @itemx --header=name |
| 544 | This option is used to emit the symbolic names given to sets and |
| 545 | messages in the input files for use in the program. Details about how |
| 546 | to use this are given in the next section. The @var{name} parameter to |
| 547 | this option specifies the name of the output file. It will contain a |
| 548 | number of C preprocessor @code{#define}s to associate a name with a |
| 549 | number. |
| 550 | |
| 551 | Please note that the generated file only contains the symbols from the |
| 552 | input files. If the output is merged with the previous content of the |
| 553 | output file the possibly existing symbols from the file(s) which |
| 554 | generated the old output files are not in the generated header file. |
| 555 | @end table |
| 556 | |
| 557 | |
| 558 | @node Common Usage |
| 559 | @subsection How to use the @code{catgets} interface |
| 560 | |
| 561 | The @code{catgets} functions can be used in two different ways: by |
| 562 | following slavishly the X/Open specs and not relying on the extensions, |
| 563 | or by using the GNU extensions. We will take a look at the former |
| 564 | method first to understand the benefits of extensions. |
| 565 | |
| 566 | @subsubsection Not using symbolic names |
| 567 | |
| 568 | Since the X/Open format of the message catalog files does not allow |
| 569 | symbol names we have to work with numbers all the time. When we start |
| 570 | writing a program we have to replace all appearances of translatable |
| 571 | strings with something like |
| 572 | |
| 573 | @smallexample |
| 574 | catgets (catdesc, set, msg, "string") |
| 575 | @end smallexample |
| 576 | |
| 577 | @noindent |
| 578 | @var{catdesc} is retrieved from a call to @code{catopen} which is |
| 579 | normally done once at the program start. The @code{"string"} is the |
| 580 | string we want to translate. The problems start with the set and |
| 581 | message numbers. |
| 582 | |
| 583 | In a bigger program several programmers usually work at the same time on |
| 584 | the program and so coordinating the number allocation is crucial. |
| 585 | Though no two different strings must be indexed by the same tuple of |
| 586 | numbers it is highly desirable to reuse the numbers for equal strings |
| 587 | with equal translations (please note that there might be strings which |
| 588 | are equal in one language but have different translations due to |
| 589 | different contexts). |
| 590 | |
| 591 | The allocation process can be relaxed a bit by different set numbers for |
| 592 | different parts of the program. So the number of developers who have to |
| 593 | coordinate the allocation can be reduced. But lists must still be kept |
| 594 | to track the allocation and errors can easily happen. These errors |
| 595 | cannot be discovered by the compiler or the @code{catgets} functions. |
| 596 | Only the user of the program might see wrong messages printed. In the |
| 597 | worst cases the messages are so irritating that they cannot be |
| 598 | recognized as wrong. Think about the translations for @code{"true"} and |
| 599 | @code{"false"} being exchanged. This could result in a disaster. |
| 600 | |
| 601 | |
| 602 | @subsubsection Using symbolic names |
| 603 | |
| 604 | The problems mentioned in the last section derive from the fact that: |
| 605 | |
| 606 | @enumerate |
| 607 | @item |
| 608 | the numbers are allocated once and due to the possibly frequent use of |
| 609 | them it is difficult to change a number later. |
| 610 | @item |
| 611 | the numbers do not allow one to guess anything about the string and |
| 612 | therefore collisions can easily happen. |
| 613 | @end enumerate |
| 614 | |
| 615 | By consistently using symbolic names and by providing a method which maps |
| 616 | the string content to a symbolic name (however this will happen) one can |
| 617 | prevent both problems above. The cost of this is that the programmer |
| 618 | has to write a complete message catalog file while s/he is writing the |
| 619 | program itself. |
| 620 | |
| 621 | This is necessary since the symbolic names must be mapped to numbers |
| 622 | before the program sources can be compiled. In the last section it was |
| 623 | described how to generate a header containing the mapping of the names. |
| 624 | E.g., for the example message file given in the last section we could |
| 625 | call the @code{gencat} program as follows (assume @file{ex.msg} contains |
| 626 | the sources). |
| 627 | |
| 628 | @smallexample |
| 629 | gencat -H ex.h -o ex.cat ex.msg |
| 630 | @end smallexample |
| 631 | |
| 632 | @noindent |
| 633 | This generates a header file with the following content: |
| 634 | |
| 635 | @smallexample |
| 636 | #define SetTwoSet 0x2 /* ex.msg:8 */ |
| 637 | |
| 638 | #define SetOneSet 0x1 /* ex.msg:4 */ |
| 639 | #define SetOnetwo 0x2 /* ex.msg:6 */ |
| 640 | @end smallexample |
| 641 | |
| 642 | As can be seen the various symbols given in the source file are mangled |
| 643 | to generate unique identifiers and these identifiers get numbers |
| 644 | assigned. Reading the source file and knowing the rules makes it |
| 645 | possible to predict the content of the header file (it is deterministic) |
| 646 | but this is not necessary. The @code{gencat} program can take care of |
| 647 | everything. All the programmer has to do is to put the generated header |
| 648 | file in the dependency list of the source files of her/his project and |
| 649 | to add a rule to regenerate the header if any of the input files |
| 650 | changes. |
| 651 | |
| 652 | One word about the symbol mangling. Every symbol consists of two parts: |
| 653 | the name of the message set plus the name of the message or the special |
| 654 | string @code{Set}. So @code{SetOnetwo} means this macro can be used to |
| 655 | access the translation with identifier @code{two} in the message set |
| 656 | @code{SetOne}. |
| 657 | |
| 658 | The other names denote the names of the message sets. The special |
| 659 | string @code{Set} is used in the place of the message identifier. |
| 660 | |
| 661 | If in the code the second string of the set @code{SetOne} is used the C |
| 662 | code should look like this: |
| 663 | |
| 664 | @smallexample |
| 665 | catgets (catdesc, SetOneSet, SetOnetwo, |
| 666 | " Message with ID \"two\", which gets the value 2 assigned") |
| 667 | @end smallexample |
| 668 | |
| 669 | Writing the function call this way allows the message number and even |
| 670 | the set number to be changed without requiring any change in the C source |
| 671 | code. (The text of the string is normally not the same; this is only |
| 672 | for this example.) |
| 673 | |
| 674 | |
| 675 | @subsubsection How to develop with symbolic names |
| 676 | |
| 677 | To illustrate the usual way to work with symbolic names |
| 678 | here is a little example. Assume we want to write the very complex and |
| 679 | famous greeting program. We start by writing the code as usual: |
| 680 | |
| 681 | @smallexample |
| 682 | #include <stdio.h> |
| 683 | int |
| 684 | main (void) |
| 685 | @{ |
| 686 |   printf ("Hello, world!\n"); |
| 687 |   return 0; |
| 688 | @} |
| 689 | @end smallexample |
| 690 | |
| 691 | Now we want to internationalize the message and therefore replace the |
| 692 | message with whatever the user wants. |
| 693 | |
| 694 | @smallexample |
| 695 | #include <nl_types.h> |
| 696 | #include <stdio.h> |
| 697 | #include "msgnrs.h" |
| 698 | int |
| 699 | main (void) |
| 700 | @{ |
| 701 |   nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE); |
| 702 |   printf (catgets (catdesc, MainSet, MainHello, |
| 703 |                    "Hello, world!\n")); |
| 704 |   catclose (catdesc); |
| 705 |   return 0; |
| 706 | @} |
| 707 | @end smallexample |
| 708 | |
| 709 | We see how the catalog object is opened and the returned descriptor used |
| 710 | in the other function calls. It is not really necessary to check for |
| 711 | failure of any of the functions since even in these situations the |
| 712 | functions will behave reasonably. @code{catgets} will simply return the |
| 713 | untranslated default string. |
| 714 | |
| 715 | What remains unspecified here are the constants @code{MainSet} and |
| 716 | @code{MainHello}. These are the symbolic names describing the |
| 717 | message. To get the actual definitions which match the information in |
| 718 | the catalog file we have to create the message catalog source file and |
| 719 | process it using the @code{gencat} program. |
| 720 | |
| 721 | @smallexample |
| 722 | $ Messages for the famous greeting program. |
| 723 | $quote " |
| 724 | |
| 725 | $set Main |
| 726 | Hello "Hallo, Welt!\n" |
| 727 | @end smallexample |
| 728 | |
| 729 | Now we can start building the program (assume the message catalog source |
| 730 | file is named @file{hello.msg} and the program source file @file{hello.c}): |
| 731 | |
| 732 | @smallexample |
| 733 | % gencat -H msgnrs.h -o hello.cat hello.msg |
| 734 | % cat msgnrs.h |
| 735 | #define MainSet 0x1 /* hello.msg:4 */ |
| 736 | #define MainHello 0x1 /* hello.msg:5 */ |
| 737 | % gcc -o hello hello.c -I. |
| 738 | % cp hello.cat /usr/share/locale/de/LC_MESSAGES |
| 739 | % echo $LC_ALL |
| 740 | de |
| 741 | % ./hello |
| 742 | Hallo, Welt! |
| 743 | % |
| 744 | @end smallexample |
| 745 | |
| 746 | The call of the @code{gencat} program creates the missing header file |
| 747 | @file{msgnrs.h} as well as the message catalog binary. The former is |
| 748 | used in the compilation of @file{hello.c} while the latter is placed in a |
| 749 | directory in which the @code{catopen} function will try to locate it. |
| 750 | Please check the @code{LC_ALL} environment variable and the default path |
| 751 | for @code{catopen} presented in the description above. |
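
The remark that failures need not be checked can be made concrete with a small sketch. It is only an illustration: the function name @code{copy_greeting} and the catalog name @file{no-such-example.cat} are made up for this example, and the catalog is assumed not to be installed anywhere. Under that assumption @code{catopen} fails and @code{catgets} hands back the default string.

```c
#include <assert.h>
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: copy the greeting into BUF (of size LEN) and
   return the length of the message.  The catalog name is assumed
   not to exist, so catopen fails and catgets falls back to the
   default string.  */
int
copy_greeting (char *buf, size_t len)
{
  assert (buf != NULL && len > 0);

  nl_catd catdesc = catopen ("no-such-example.cat", NL_CAT_LOCALE);

  /* No error check is needed before this call: with an invalid
     descriptor catgets simply returns its last argument.  Set and
     message number 1 are placeholders.  */
  const char *msg = catgets (catdesc, 1, 1, "Hello, world!\n");

  /* Copy before closing; a string taken from a real catalog may
     become invalid once the catalog is closed.  */
  int n = snprintf (buf, len, "%s", msg);

  if (catdesc != (nl_catd) -1)
    catclose (catdesc);
  return n;
}
```

Calling @code{copy_greeting (buf, sizeof buf)} with a 64-byte buffer should leave the untranslated greeting in @code{buf}.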
| 752 | |
| 753 | |
| 754 | @node The Uniforum approach |
| 755 | @section The Uniforum approach to Message Translation |
| 756 | |
| 757 | Sun Microsystems tried to standardize a different approach to message |
| 758 | translation in the Uniforum group. There never was a real standard |
| 759 | defined but still the interface was used in Sun's operating systems. |
| 760 | Since this approach fits better in the development process of free |
| 761 | software it is also used throughout the GNU project and the GNU |
| 762 | @file{gettext} package provides support for this outside @theglibc{}. |
| 763 | |
| 764 | The code of the @file{libintl} from GNU @file{gettext} is the same as |
| 765 | the code in @theglibc{}. So the documentation in the GNU |
| 766 | @file{gettext} manual is also valid for the functionality here. The |
| 767 | following text will describe the library functions in detail. But the |
| 768 | numerous helper programs are not described in this manual. Instead |
| 769 | people should read the GNU @file{gettext} manual |
| 770 | (@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}). |
| 771 | We will only give a short overview. |
| 772 | |
| 773 | Though the @code{catgets} functions are available by default on more |
| 774 | systems the @code{gettext} interface is at least as portable as the |
| 775 | former. The GNU @file{gettext} package can be used wherever the |
| 776 | functions are not available. |
| 777 | |
| 778 | |
| 779 | @menu |
| 780 | * Message catalogs with gettext:: The @code{gettext} family of functions. |
| 781 | * Helper programs for gettext:: Programs to handle message catalogs |
| 782 | for @code{gettext}. |
| 783 | @end menu |
| 784 | |
| 785 | |
| 786 | @node Message catalogs with gettext |
| 787 | @subsection The @code{gettext} family of functions |
| 788 | |
| 789 | Although the paradigm underlying the @code{gettext} approach to message |
| 790 | translation is different from that of the @code{catgets} functions, the |
| 791 | basic functionality is equivalent. There are functions of the following |
| 792 | categories: |
| 793 | |
| 794 | @menu |
| 795 | * Translation with gettext:: What has to be done to translate a message. |
| 796 | * Locating gettext catalog:: How to determine which catalog to be used. |
| 797 | * Advanced gettext functions:: Additional functions for more complicated |
| 798 | situations. |
| 799 | * Charset conversion in gettext:: How to specify the output character set |
| 800 | @code{gettext} uses. |
| 801 | * GUI program problems:: How to use @code{gettext} in GUI programs. |
| 802 | * Using gettextized software:: The possibilities of the user to influence |
| 803 | the way @code{gettext} works. |
| 804 | @end menu |
| 805 | |
| 806 | @node Translation with gettext |
| 807 | @subsubsection What has to be done to translate a message? |
| 808 | |
| 809 | The @code{gettext} functions have a very simple interface. The most |
| 810 | basic function just takes the string which shall be translated as the |
| 811 | argument and it returns the translation. This is fundamentally |
| 812 | different from the @code{catgets} approach where an extra key is |
| 813 | necessary and the original string is only used for the error case. |
| 814 | |
| 815 | If the string which has to be translated is the only argument this of |
| 816 | course means the string itself is the key. I.e., the translation will |
| 817 | be selected based on the original string. The message catalogs must |
| 818 | therefore contain the original strings plus one translation for any such |
| 819 | string. The task of the @code{gettext} function is to compare the |
| 820 | argument string with the available strings in the catalog and return the |
| 821 | appropriate translation. Of course this process is optimized so that |
| 822 | it is not more expensive than an access using an atomic key |
| 823 | like in @code{catgets}. |
| 824 | |
| 825 | The @code{gettext} approach has some advantages but also some |
| 826 | disadvantages. Please see the GNU @file{gettext} manual for a detailed |
| 827 | discussion of the pros and cons. |
| 828 | |
| 829 | All the definitions and declarations for @code{gettext} can be found in |
| 830 | the @file{libintl.h} header file. On systems where these functions are |
| 831 | not part of the C library they can be found in a separate library named |
| 832 | @file{libintl.a} (or accordingly different for shared libraries). |
| 833 | |
| 834 | @comment libintl.h |
| 835 | @comment GNU |
| 836 | @deftypefun {char *} gettext (const char *@var{msgid}) |
| 837 | @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}} |
| 838 | @c Wrapper for dcgettext. |
| 839 | The @code{gettext} function searches the currently selected message |
| 840 | catalogs for a string which is equal to @var{msgid}. If there is such a |
| 841 | string available it is returned. Otherwise the argument string |
| 842 | @var{msgid} is returned. |
| 843 | |
| 844 | Please note that although the return value is @code{char *} the |
| 845 | returned string must not be changed. This broken type results from the |
| 846 | history of the function and does not reflect the way the function should |
| 847 | be used. |
| 848 | |
| 849 | Please note that above we wrote ``message catalogs'' (plural). This is |
| 850 | a specialty of the GNU implementation of these functions and we will |
| 851 | say more about this when we talk about the ways message catalogs are |
| 852 | selected (@pxref{Locating gettext catalog}). |
| 853 | |
| 854 | The @code{gettext} function does not modify the value of the global |
| 855 | @var{errno} variable. This is necessary to make it possible to write |
| 856 | something like |
| 857 | |
| 858 | @smallexample |
| 859 | printf (gettext ("Operation failed: %m\n")); |
| 860 | @end smallexample |
| 861 | |
| 862 | Here the @var{errno} value is used in the @code{printf} function while |
| 863 | processing the @code{%m} format element and if the @code{gettext} |
| 864 | function would change this value (it is called before @code{printf} is |
| 865 | called) we would get a wrong message. |
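
A minimal sketch of this guarantee, assuming a hypothetical helper named @code{gettext_keeps_errno}: it stores a known value in @code{errno}, calls @code{gettext}, and reports whether the value survived the call.

```c
#include <assert.h>
#include <errno.h>
#include <libintl.h>
#include <stddef.h>

/* Hypothetical check: return nonzero if a gettext call leaves the
   value of errno untouched.  */
int
gettext_keeps_errno (int value)
{
  errno = value;
  const char *text = gettext ("Operation failed");
  int unchanged = (errno == value);

  assert (text != NULL);
  return unchanged;
}
```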
| 866 | |
| 867 | So there is no easy way to detect a missing message catalog besides |
| 868 | comparing the argument string with the result. But it is normally the |
| 869 | task of the user to react to missing catalogs. The program cannot guess |
| 870 | when a message catalog is really necessary since a user who speaks |
| 871 | the language the program was developed in does not need any translation. |
| 872 | @end deftypefun |
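
The comparison mentioned above could look roughly like the following sketch. The helper name @code{has_translation} is an assumption made for this example; it is not part of @file{libintl.h}. Note that the check cannot distinguish a missing catalog from a translation which happens to be identical to the original string.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/* Hypothetical helper: report whether gettext found a translation
   for MSGID that differs from MSGID itself.  A result of zero means
   either that no catalog was found or that the translation is
   identical to the original text.  */
int
has_translation (const char *msgid)
{
  assert (msgid != NULL);

  const char *text = gettext (msgid);

  /* An untranslated message comes back as the argument itself, so
     the pointer comparison is the cheap first test.  */
  return text != msgid && strcmp (text, msgid) != 0;
}
```

In a program that has not selected any locale or installed any catalog, @code{has_translation ("Hello, world!")} is expected to return zero.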
| 873 | |
| 874 | The remaining two functions to access the message catalog add some |
| 875 | functionality to select a message catalog which is not the default one. |
| 876 | This is important if parts of the program are developed independently. |
| 877 | Every part can have its own message catalog and all of them can be used |
| 878 | at the same time. The C library itself is an example: internally it |
| 879 | uses the @code{gettext} functions but since it must not depend on a |
| 880 | currently selected default message catalog it must specify all ambiguous |
| 881 | information. |
| 882 | |
| 883 | @comment libintl.h |
| 884 | @comment GNU |
| 885 | @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid}) |
| 886 | @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}} |
| 887 | @c Wrapper for dcgettext. |
| 888 | The @code{dgettext} function acts just like the @code{gettext} |
| 889 | function. It only takes an additional first argument @var{domainname} |
| 890 | which guides the selection of the message catalogs which are searched |
| 891 | for the translation. If the @var{domainname} parameter is the null |
| 892 | pointer the @code{dgettext} function is exactly equivalent to |
| 893 | @code{gettext} since the default value for the domain name is used. |
| 894 | |
| 895 | As for @code{gettext} the return value type is @code{char *} which is an |
| 896 | anachronism. The returned string must never be modified. |
| 897 | @end deftypefun |
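
As an illustration of the independently developed program parts mentioned above, a library might wrap @code{dgettext} as in the following sketch. The domain name @code{libfoo-example} and the function name @code{libfoo_translate} are hypothetical; a real library would use the name under which its catalogs are installed.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/* Assumed domain name of a hypothetical library.  */
#define LIBFOO_DOMAIN "libfoo-example"

/* Translate MSGID using the library's own catalog, regardless of
   which default domain the main program selected with textdomain.
   The result must not be modified by the caller.  */
const char *
libfoo_translate (const char *msgid)
{
  assert (msgid != NULL);
  return dgettext (LIBFOO_DOMAIN, msgid);
}
```

If no catalog for the domain can be found, the function simply returns its argument.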
| 898 | |
| 899 | @comment libintl.h |
| 900 | @comment GNU |
| 901 | @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category}) |
| 902 | @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}} |
| 903 | @c dcgettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem |
| 904 | @c dcigettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem |
| 905 | @c libc_rwlock_rdlock @asulock @aculock |
| 906 | @c current_locale_name ok [protected from @mtslocale] |
| 907 | @c tfind ok |
| 908 | @c libc_rwlock_unlock ok |
| 909 | @c plural_lookup ok |
| 910 | @c plural_eval ok |
| 911 | @c rawmemchr ok |
| 912 | @c DETERMINE_SECURE ok, nothing |
| 913 | @c strcmp ok |
| 914 | @c strlen ok |
| 915 | @c getcwd @ascuheap @acsmem @acsfd |
| 916 | @c strchr ok |
| 917 | @c stpcpy ok |
| 918 | @c category_to_name ok |
| 919 | @c guess_category_value @mtsenv |
| 920 | @c getenv @mtsenv |
| 921 | @c current_locale_name dup ok [protected from @mtslocale by dcigettext] |
| 922 | @c strcmp ok |
| 923 | @c ENABLE_SECURE ok |
| 924 | @c _nl_find_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem |
| 925 | @c libc_rwlock_rdlock dup @asulock @aculock |
| 926 | @c _nl_make_l10nflist dup @ascuheap @acsmem |
| 927 | @c libc_rwlock_unlock dup ok |
| 928 | @c _nl_load_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem |
| 929 | @c libc_lock_lock_recursive @aculock |
| 930 | @c libc_lock_unlock_recursive @aculock |
| 931 | @c open->open_not_cancel_2 @acsfd |
| 932 | @c fstat ok |
| 933 | @c mmap dup @acsmem |
| 934 | @c close->close_not_cancel_no_status @acsfd |
| 935 | @c malloc dup @ascuheap @acsmem |
| 936 | @c read->read_not_cancel ok |
| 937 | @c munmap dup @acsmem |
| 938 | @c W dup ok |
| 939 | @c strlen dup ok |
| 940 | @c get_sysdep_segment_value ok |
| 941 | @c memcpy dup ok |
| 942 | @c hash_string dup ok |
| 943 | @c free dup @ascuheap @acsmem |
| 944 | @c libc_rwlock_init ok |
| 945 | @c _nl_find_msg dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem |
| 946 | @c libc_rwlock_fini ok |
| 947 | @c EXTRACT_PLURAL_EXPRESSION @ascuheap @acsmem |
| 948 | @c strstr dup ok |
| 949 | @c isspace ok |
| 950 | @c strtoul ok |
| 951 | @c PLURAL_PARSE @ascuheap @acsmem |
| 952 | @c malloc dup @ascuheap @acsmem |
| 953 | @c free dup @ascuheap @acsmem |
| 954 | @c INIT_GERMANIC_PLURAL ok, nothing |
| 955 | @c the pre-C99 variant is @acucorrupt [protected from @mtuinit by dcigettext] |
| 956 | @c _nl_expand_alias dup @ascuheap @asulock @acsmem @acsfd @aculock |
| 957 | @c _nl_explode_name dup @ascuheap @acsmem |
| 958 | @c libc_rwlock_wrlock dup @asulock @aculock |
| 959 | @c free dup @asulock @aculock @acsfd @acsmem |
| 960 | @c _nl_find_msg @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem |
| 961 | @c _nl_load_domain dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem |
| 962 | @c strlen ok |
| 963 | @c hash_string ok |
| 964 | @c W ok |
| 965 | @c SWAP ok |
| 966 | @c bswap_32 ok |
| 967 | @c strcmp ok |
| 968 | @c get_output_charset @mtsenv @ascuheap @acsmem |
| 969 | @c getenv dup @mtsenv |
| 970 | @c strlen dup ok |
| 971 | @c malloc dup @ascuheap @acsmem |
| 972 | @c memcpy dup ok |
| 973 | @c libc_rwlock_rdlock dup @asulock @aculock |
| 974 | @c libc_rwlock_unlock dup ok |
| 975 | @c libc_rwlock_wrlock dup @asulock @aculock |
| 976 | @c realloc @ascuheap @acsmem |
| 977 | @c strdup @ascuheap @acsmem |
| 978 | @c strstr ok |
| 979 | @c strcspn ok |
| 980 | @c mempcpy dup ok |
| 981 | @c norm_add_slashes dup ok |
| 982 | @c gconv_open @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsmem @acsfd |
| 983 | @c [protected from @mtslocale by dcigettext locale lock] |
| 984 | @c free dup @ascuheap @acsmem |
| 985 | @c libc_lock_lock @asulock @aculock |
| 986 | @c calloc @ascuheap @acsmem |
| 987 | @c gconv dup @acucorrupt [protected from @mtsrace and @asucorrupt by lock] |
| 988 | @c libc_lock_unlock ok |
| 989 | @c malloc @ascuheap @acsmem |
| 990 | @c mempcpy ok |
| 991 | @c memcpy ok |
| 992 | @c strcpy ok |
| 993 | @c libc_rwlock_wrlock @asulock @aculock |
| 994 | @c tsearch @ascuheap @acucorrupt @acsmem [protected from @mtsrace and @asucorrupt] |
| 995 | @c transcmp ok |
| 996 | @c strmp dup ok |
| 997 | @c free @ascuheap @acsmem |
| 998 | The @code{dcgettext} function adds another argument to those which |
| 999 | @code{dgettext} takes. This argument @var{category} specifies the last |
| 1000 | piece of information needed to locate the message catalog. I.e., the |
| 1001 | domain name and the locale category exactly specify which message |
| 1002 | catalog has to be used (relative to a given directory, see below). |
| 1003 | |
| 1004 | The @code{dgettext} function can be expressed in terms of |
| 1005 | @code{dcgettext} by using |
| 1006 | |
| 1007 | @smallexample |
| 1008 | dcgettext (domain, string, LC_MESSAGES) |
| 1009 | @end smallexample |
| 1010 | |
| 1011 | @noindent |
| 1012 | instead of |
| 1013 | |
| 1014 | @smallexample |
| 1015 | dgettext (domain, string) |
| 1016 | @end smallexample |
| 1017 | |
| 1018 | This also shows which values are expected for the third parameter. One |
| 1019 | has to use the available selectors for the categories available in |
| 1020 | @file{locale.h}. Normally the available values are @code{LC_CTYPE}, |
| 1021 | @code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, |
| 1022 | @code{LC_NUMERIC}, and @code{LC_TIME}. Please note that @code{LC_ALL} |
| 1023 | must not be used and even though the names might suggest this, there is |
| 1024 | no relation to the environment variables of the same names. |
| 1025 | |
| 1026 | The @code{dcgettext} function is only implemented for compatibility with |
| 1027 | other systems which have @code{gettext} functions. There is not really |
| 1028 | any situation where it is necessary (or useful) to use a value other |
| 1029 | than @code{LC_MESSAGES} for the @var{category} parameter. We are |
| 1030 | dealing with messages here and any other choice can only be confusing. |
| 1031 | |
| 1032 | As for @code{gettext} the return value type is @code{char *} which is an |
| 1033 | anachronism. The returned string must never be modified. |
| 1034 | @end deftypefun |
| 1035 | |
| 1036 | When using the three functions above in a program it is a frequent case |
| 1037 | that the @var{msgid} argument is a constant string. So it is worth |
| 1038 | optimizing this case. A moment's thought shows that |
| 1039 | as long as no new message catalog is loaded the translation of a message |
| 1040 | will not change. This optimization is actually implemented by the |
| 1041 | @code{gettext}, @code{dgettext} and @code{dcgettext} functions. |
| 1042 | |
| 1043 | |
| 1044 | @node Locating gettext catalog |
| 1045 | @subsubsection How to determine which catalog to be used |
| 1046 | |
| 1047 | The functions to retrieve the translations for a given message have a |
| 1048 | remarkably simple interface. But to still give the user of the program |
| 1049 | the opportunity to select exactly the translation s/he wants and |
| 1050 | also to give the programmer the possibility to influence the way |
| 1051 | catalog files are located there is a quite complicated |
| 1052 | underlying mechanism which controls all this. The code is complicated, |
| 1053 | but its use is easy. |
| 1054 | |
| 1055 | Basically we have two different tasks to perform which can also be |
| 1056 | performed by the @code{catgets} functions: |
| 1057 | |
| 1058 | @enumerate |
| 1059 | @item |
| 1060 | Locate the set of message catalogs. There are a number of files for |
| 1061 | different languages which all belong to the package. Usually they |
| 1062 | are all stored in the filesystem below a certain directory. |
| 1063 | |
| 1064 | There can be arbitrarily many packages installed and they can follow |
| 1065 | different guidelines for the placement of their files. |
| 1066 | |
| 1067 | @item |
| 1068 | Relative to the location specified by the package the actual translation |
| 1069 | files must be searched, based on the wishes of the user. I.e., for each |
| 1070 | language the user selects the program should be able to locate the |
| 1071 | appropriate file. |
| 1072 | @end enumerate |
| 1073 | |
| 1074 | This is the functionality required by the specifications for |
| 1075 | @code{gettext} and this is also what the @code{catgets} functions are |
| 1076 | able to do. But there are some problems unresolved: |
| 1077 | |
| 1078 | @itemize @bullet |
| 1079 | @item |
| 1080 | The language to be used can be specified in several different ways. |
| 1081 | There is no generally accepted standard for this and the user always |
| 1082 | expects the program to understand what s/he means. E.g., to select the |
| 1083 | German translation one could write @code{de}, @code{german}, or |
| 1084 | @code{deutsch} and the program should always react the same. |
| 1085 | |
| 1086 | @item |
| 1087 | Sometimes the specification of the user is too detailed. If s/he, e.g., |
| 1088 | specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany, |
| 1089 | coded using the @w{ISO 8859-1} character set there is the possibility |
| 1090 | that a message catalog matching this exactly is not available. But |
| 1091 | there could be a catalog matching @code{de} and if the character set |
| 1092 | used on the machine is always @w{ISO 8859-1} there is no reason why this |
| 1093 | latter message catalog should not be used. (We call this @dfn{message |
| 1094 | inheritance}.) |
| 1095 | |
| 1096 | @item |
| 1097 | If a catalog for a wanted language is not available it is not always the |
| 1098 | second best choice to fall back on the language of the developer and |
| 1099 | simply not translate any message. Instead a user might be better able |
| 1100 | to read the messages in another language and so the user of the program |
| 1101 | should be able to define a precedence order of languages. |
| 1102 | @end itemize |
| 1103 | |
| 1104 | We can divide the configuration actions in two parts: the one is |
| 1105 | performed by the programmer, the other by the user. We will start with |
| 1106 | the functions the programmer can use since the user configuration will |
| 1107 | be based on this. |
| 1108 | |
| 1109 | As the descriptions of the functions in the last sections already mention, |
| 1110 | separate sets of messages can be selected by a @dfn{domain name}. This is a |
| 1111 | simple string which should be unique for each program part which uses a |
| 1112 | separate domain. It is possible to use arbitrarily many domains in one |
| 1113 | program at the same time. E.g., @theglibc{} itself uses a domain |
| 1114 | named @code{libc} while the program using the C Library could use a |
| 1115 | domain named @code{foo}. The important point is that at any time |
| 1116 | exactly one domain is active. This is controlled with the following |
| 1117 | function. |
| 1118 | |
| 1119 | @comment libintl.h |
| 1120 | @comment GNU |
| 1121 | @deftypefun {char *} textdomain (const char *@var{domainname}) |
| 1122 | @safety{@prelim{}@mtsafe{}@asunsafe{@asulock{} @ascuheap{}}@acunsafe{@aculock{} @acsmem{}}} |
| 1123 | @c textdomain @asulock @ascuheap @aculock @acsmem |
| 1124 | @c libc_rwlock_wrlock @asulock @aculock |
| 1125 | @c strcmp ok |
| 1126 | @c strdup @ascuheap @acsmem |
| 1127 | @c free @ascuheap @acsmem |
| 1128 | @c libc_rwlock_unlock ok |
| 1129 | The @code{textdomain} function sets the default domain, which is used in |
| 1130 | all future @code{gettext} calls, to @var{domainname}. Please note that |
| 1131 | @code{dgettext} and @code{dcgettext} calls are not influenced if the |
| 1132 | @var{domainname} parameter of these functions is not the null pointer. |
| 1133 | |
| 1134 | Before the first call to @code{textdomain} the default domain is |
| 1135 | @code{messages}. This is the name specified in the specification of |
| 1136 | the @code{gettext} API. This name is as good as any other name. No |
| 1137 | program should ever really use a domain with this name since this can |
| 1138 | only lead to problems. |
| 1139 | |
| 1140 | The function returns the value which is from now on taken as the default |
| 1141 | domain. If the system ran out of memory the returned value is |
| 1142 | @code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}. |
| 1143 | Despite the return value type being @code{char *} the return string must |
| 1144 | not be changed. It is allocated internally by the @code{textdomain} |
| 1145 | function. |
| 1146 | |
| 1147 | If the @var{domainname} parameter is the null pointer no new default |
| 1148 | domain is set. Instead the currently selected default domain is |
| 1149 | returned. |
| 1150 | |
| 1151 | If the @var{domainname} parameter is the empty string the default domain |
| 1152 | is reset to its initial value, the domain with the name @code{messages}. |
| 1153 | This possibility is questionable to use since the domain @code{messages} |
| 1154 | really never should be used. |
| 1155 | @end deftypefun |
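
The three ways of calling @code{textdomain} described above (setting a domain, querying with a null pointer, and resetting with the empty string) can be sketched as follows. The helper names @code{select_domain} and @code{is_current_domain} are assumptions made for this illustration, and @code{foo} is just the sample domain name used earlier in the text.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/* Select DOMAIN as the default domain and return nonzero on success.
   Passing the empty string resets the default domain to "messages".  */
int
select_domain (const char *domain)
{
  assert (domain != NULL);

  /* A null result signals that memory was exhausted; errno is then
     set to ENOMEM.  */
  return textdomain (domain) != NULL;
}

/* Return nonzero if DOMAIN is the currently selected default domain.
   Passing a null pointer to textdomain only queries the setting.  */
int
is_current_domain (const char *domain)
{
  assert (domain != NULL);

  const char *current = textdomain (NULL);
  return current != NULL && strcmp (current, domain) == 0;
}
```

A program would typically call something like @code{select_domain ("foo")} once at startup, right after setting the locale.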
| 1156 | |
| 1157 | @comment libintl.h |
| 1158 | @comment GNU |
| 1159 | @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname}) |
| 1160 | @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}} |
| 1161 | @c bindtextdomain @ascuheap @acsmem |
| 1162 | @c set_binding_values @ascuheap @acsmem |
| 1163 | @c libc_rwlock_wrlock dup @asulock @aculock |
| 1164 | @c strcmp dup ok |
| 1165 | @c strdup dup @ascuheap @acsmem |
| 1166 | @c free dup @ascuheap @acsmem |
| 1167 | @c malloc dup @ascuheap @acsmem |
| 1168 | The @code{bindtextdomain} function can be used to specify the directory |
| 1169 | which contains the message catalogs for domain @var{domainname} for the |
| 1170 | different languages. To be correct, this is the directory where the |
| 1171 | hierarchy of directories is expected. Details are explained below. |
| 1172 | |
| 1173 | For the programmer it is important to note that the translations which |
| 1174 | come with the program have to be placed in a directory hierarchy starting |
| 1175 | at, say, @file{/foo/bar}. Then the program should make a |
| 1176 | @code{bindtextdomain} call to bind the domain for the current program to |
| 1177 | this directory. This ensures that the catalogs are found. A correctly |
| 1178 | running program does not depend on the user setting an environment |
| 1179 | variable. |
| 1180 | |
| 1181 | The @code{bindtextdomain} function can be used several times and if the |
| 1182 | @var{domainname} argument is different the previously bound domains |
| 1183 | will not be overwritten. |
| 1184 | |
| 1185 | If a program which uses @code{bindtextdomain} also calls the |
| 1186 | @code{chdir} function at some point to change the current working |
| 1187 | directory it is important that the @var{dirname} string is an |
| 1188 | absolute pathname. Otherwise the addressed directory might vary with |
| 1189 | the time. |
| 1190 | |
| 1191 | If the @var{dirname} parameter is the null pointer @code{bindtextdomain} |
| 1192 | returns the currently selected directory for the domain with the name |
| 1193 | @var{domainname}. |
| 1194 | |
| 1195 | The @code{bindtextdomain} function returns a pointer to a string |
| 1196 | containing the name of the selected directory. The string is |
| 1197 | allocated internally in the function and must not be changed by the |
| 1198 | user. If the system ran out of memory during the execution of |
| 1199 | @code{bindtextdomain} the return value is @code{NULL} and the global |
| 1200 | variable @var{errno} is set accordingly. |
| 1201 | @end deftypefun |
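
Binding a domain and querying the binding again might look like the following sketch. The helper name @code{bind_and_verify} is hypothetical, and the directory passed to it (for instance the @file{/foo/bar} used in the description above) is only a placeholder; the sketch does not require the directory to exist.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/* Bind DOMAIN to the absolute directory DIR and confirm the binding
   by querying it again with a null pointer.  Returns nonzero if the
   directory reported by bindtextdomain equals DIR.  */
int
bind_and_verify (const char *domain, const char *dir)
{
  /* Insist on an absolute path so that a later chdir call cannot
     change the meaning of the binding.  */
  assert (domain != NULL && dir != NULL && dir[0] == '/');

  if (bindtextdomain (domain, dir) == NULL)
    return 0;                   /* Out of memory; errno has been set.  */

  const char *bound = bindtextdomain (domain, NULL);
  return bound != NULL && strcmp (bound, dir) == 0;
}
```

Binding a second domain afterwards is not expected to disturb the first binding, so repeated queries for the same domain keep returning the same directory.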
| 1202 | |
| 1203 | |
| 1204 | @node Advanced gettext functions |
| 1205 | @subsubsection Additional functions for more complicated situations |
| 1206 | |
| 1207 | The functions of the @code{gettext} family described so far (and all the |
| 1208 | @code{catgets} functions as well) have one problem in the real world |
| 1209 | which has been neglected completely in all existing approaches. What |
| 1210 | is meant here is the handling of plural forms. |
| 1211 | |
| 1212 | Looking through Unix source code before the time anybody thought about |
| 1213 | internationalization (and, sadly, even afterwards) one can often find |
| 1214 | code similar to the following: |
| 1215 | |
| 1216 | @smallexample |
| 1217 | printf ("%d file%s deleted", n, n == 1 ? "" : "s"); |
| 1218 | @end smallexample |
| 1219 | |
| 1220 | @noindent |
| 1221 | After the first complaints from people internationalizing the code people |
| 1222 | either completely avoided formulations like this or used strings like |
| 1223 | @code{"file(s)"}. Both look unnatural and should be avoided. First |
| 1224 | tries to solve the problem correctly looked like this: |
| 1225 | |
| 1226 | @smallexample |
| 1227 | if (n == 1) |
| 1228 | printf ("%d file deleted", n); |
| 1229 | else |
| 1230 | printf ("%d files deleted", n); |
| 1231 | @end smallexample |
| 1232 | |
| 1233 | But this does not solve the problem. It helps languages where the |
| 1234 | plural form of a noun is not simply constructed by adding an `s' but |
| 1235 | that is all. Once again people fell into the trap of believing the |
| 1236 | rules their language is using are universal. But the handling of plural |
| 1237 | forms differs widely between the language families. Two things can |
| 1238 | differ between language families (and even within them): |
| 1239 | |
| 1240 | @itemize @bullet |
| 1241 | @item |
| 1242 | The way plural forms are built differs. This is a problem with |
| 1243 | languages which have many irregularities. German, for instance, is a |
| 1244 | drastic case. Though English and German are part of the same language |
| 1245 | family (Germanic), the almost regular forming of plural noun forms |
| 1246 | (appending an `s') is hardly found in German. |
| 1247 | |
| 1248 | @item |
| 1249 | The number of plural forms differs. This is somewhat surprising for |
| 1250 | those who only have experience with Romance and Germanic languages |
| 1251 | since here the number is the same (there are two). |
| 1252 | |
| 1253 | But other language families have only one form or many forms. More |
| 1254 | information on this in an extra section. |
| 1255 | @end itemize |
| 1256 | |
| 1257 | The consequence of this is that application writers should not try to |
| 1258 | solve the problem in their code. This would be localization since it is |
| 1259 | only usable for certain, hardcoded language environments. Instead the |
| 1260 | extended @code{gettext} interface should be used. |
| 1261 | |
| 1262 | Instead of the single key string, these extra functions take two |
| 1263 | strings and a numerical argument. The idea behind this is that using |
| 1264 | the numerical argument and the first string as a key, the implementation |
| 1265 | can select the right plural form using rules specified by the translator. |
| 1266 | The two string arguments are also used to provide a return |
| 1267 | value in case no message catalog is found (similar to the normal |
| 1268 | @code{gettext} behavior). In this case the rules for Germanic languages |
| 1269 | are used and it is assumed that the first string argument is the singular |
| 1270 | form, the second the plural form. |
| 1271 | |
| 1272 | This has the consequence that programs without language catalogs can |
| 1273 | display the correct strings only if the program itself is written using |
| 1274 | a Germanic language. This is a limitation but since @theglibc{} |
| 1275 | (as well as the GNU @code{gettext} package) is written as part of the |
| 1276 | GNU package and the coding standards for the GNU project require programs |
| 1277 | to be written in English, this solution nevertheless fulfills its |
| 1278 | purpose. |
| 1279 | |
| 1280 | @comment libintl.h |
| 1281 | @comment GNU |
| 1282 | @deftypefun {char *} ngettext (const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}) |
| 1283 | @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}} |
| 1284 | @c Wrapper for dcngettext. |
| 1285 | The @code{ngettext} function is similar to the @code{gettext} function |
| 1286 | as it finds the message catalogs in the same way. But it takes two |
| 1287 | extra arguments. The @var{msgid1} parameter must contain the singular |
| 1288 | form of the string to be converted. It is also used as the key for the |
| 1289 | search in the catalog. The @var{msgid2} parameter is the plural form. |
| 1290 | The parameter @var{n} is used to determine the plural form. If no |
| 1291 | message catalog is found, @var{msgid1} is returned if @code{@var{n} == 1}, |
| 1292 | otherwise @var{msgid2}. |
| 1293 | |
| 1294 | An example of the use of this function is: |
| 1295 | |
| 1296 | @smallexample |
| 1297 | printf (ngettext ("%d file removed", "%d files removed", n), n); |
| 1298 | @end smallexample |
| 1299 | |
| 1300 | Please note that the numeric value @var{n} has to be passed to the |
| 1301 | @code{printf} function as well. It is not sufficient to pass it only to |
| 1302 | @code{ngettext}. |
| 1303 | @end deftypefun |
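
As a minimal sketch of the fallback behavior described above, the
hypothetical helper @code{format_removed} below (it is not part of
@theglibc{}) formats the message into a caller-supplied buffer.  It
assumes that no message catalog is installed for the default domain, so
that @code{ngettext} falls back to returning one of its two arguments,
and that @var{n} is not negative:

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper, not part of the C library: formats the
   "files removed" message for N (assumed to be non-negative) into
   BUF and returns the snprintf result.  N is used twice: once by
   ngettext to select the form and once by snprintf to fill in the
   %d conversion.  */
static int
format_removed (char *buf, size_t size, int n)
{
  const char *fmt = ngettext ("%d file removed", "%d files removed",
                              (unsigned long int) n);
  return snprintf (buf, size, fmt, n);
}
```

Without a catalog only the value 1 selects the singular string; every
other value, including zero, selects the plural string.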
| 1304 | |
| 1305 | @comment libintl.h |
| 1306 | @comment GNU |
| 1307 | @deftypefun {char *} dngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}) |
| 1308 | @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}} |
| 1309 | @c Wrapper for dcngettext. |
| 1310 | The @code{dngettext} function is similar to the @code{dgettext} function in the |
| 1311 | way the message catalog is selected. The difference is that it takes |
| 1312 | two extra parameters to provide the correct plural form. These two |
| 1313 | parameters are handled in the same way @code{ngettext} handles them. |
| 1314 | @end deftypefun |
| 1315 | |
| 1316 | @comment libintl.h |
| 1317 | @comment GNU |
| 1318 | @deftypefun {char *} dcngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}, int @var{category}) |
| 1319 | @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}} |
| 1320 | @c Wrapper for dcigettext. |
| 1321 | The @code{dcngettext} function is similar to the @code{dcgettext} function in the |
| 1322 | way the message catalog is selected. The difference is that it takes |
| 1323 | two extra parameters to provide the correct plural form. These two |
| 1324 | parameters are handled in the same way @code{ngettext} handles them. |
| 1325 | @end deftypefun |
| 1326 | |
| 1327 | @subsubheading The problem of plural forms |
| 1328 | |
| 1329 | A description of the problem can be found at the beginning of the last |
| 1330 | section. The question now is how to solve it. Without the input |
| 1331 | of linguists (which was not available) it was not possible to determine |
| 1332 | whether there are only a few different ways in which plural forms are |
| 1333 | formed or whether the number can increase with every new supported |
| 1334 | language. |
| 1335 | |
| 1336 | Therefore the solution implemented is to allow the translator to specify |
| 1337 | the rules of how to select the plural form. Since the formula varies |
| 1338 | with every language this is the only viable solution except for |
| 1339 | hardcoding the information in the code (which still would require the |
| 1340 | possibility of extensions to not prevent the use of new languages). The |
| 1341 | details are explained in the GNU @code{gettext} manual. Here only a |
| 1342 | bit of information is provided. |
| 1343 | |
| 1344 | The information about the plural form selection has to be stored in the |
| 1345 | header entry (the one with the empty @code{msgid} string). It looks |
| 1346 | like this: |
| 1347 | |
| 1348 | @smallexample |
| 1349 | Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1; |
| 1350 | @end smallexample |
| 1351 | |
| 1352 | The @code{nplurals} value must be a decimal number which specifies how |
| 1353 | many different plural forms exist for this language. The string |
| 1354 | following @code{plural} is an expression which is using the C language |
| 1355 | syntax. Exceptions are that no negative numbers are allowed, numbers |
| 1356 | must be decimal, and the only variable allowed is @code{n}. This |
| 1357 | expression will be evaluated whenever one of the functions |
| 1358 | @code{ngettext}, @code{dngettext}, or @code{dcngettext} is called. The |
| 1359 | numeric value passed to these functions is then substituted for all uses |
| 1360 | of the variable @code{n} in the expression. The resulting value then |
| 1361 | must be greater than or equal to zero and smaller than the value given as the |
| 1362 | value of @code{nplurals}. |
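
The selection step can be sketched in C as follows.  The names
@code{NPLURALS}, @code{plural_index}, and @code{select_form} are
hypothetical stand-ins for the two parts of the header entry shown above
and for the lookup itself; they are not part of the library, and
returning a null pointer for an out-of-range index is merely one
possible choice for this illustration:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two parts of the header entry
   "Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;".  */
enum { NPLURALS = 2 };

static unsigned long int
plural_index (unsigned long int n)
{
  return n == 1 ? 0 : 1;
}

/* Returns the entry of FORMS (an array of NPLURALS strings) selected
   for N, or NULL if the expression yields an index that is not
   smaller than NPLURALS.  */
static const char *
select_form (const char *const forms[], unsigned long int n)
{
  unsigned long int idx = plural_index (n);
  return idx < NPLURALS ? forms[idx] : NULL;
}
```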
| 1363 | |
| 1364 | @noindent |
| 1365 | The following rules are known at this point. The languages are listed |
| 1366 | with their families. But this does not necessarily mean the information can be |
| 1367 | generalized for the whole family (as can be easily seen in the table |
| 1368 | below).@footnote{Additions are welcome. Send appropriate information to |
| 1369 | @email{bug-glibc-manual@@gnu.org}.} |
| 1370 | |
| 1371 | @table @asis |
| 1372 | @item Only one form: |
| 1373 | Some languages only require one single form. There is no distinction |
| 1374 | between the singular and plural form. An appropriate header entry |
| 1375 | would look like this: |
| 1376 | |
| 1377 | @smallexample |
| 1378 | Plural-Forms: nplurals=1; plural=0; |
| 1379 | @end smallexample |
| 1380 | |
| 1381 | @noindent |
| 1382 | Languages with this property include: |
| 1383 | |
| 1384 | @table @asis |
| 1385 | @item Finno-Ugric family |
| 1386 | Hungarian |
| 1387 | @item Asian family |
| 1388 | Japanese, Korean |
| 1389 | @item Turkic/Altaic family |
| 1390 | Turkish |
| 1391 | @end table |
| 1392 | |
| 1393 | @item Two forms, singular used for one only |
| 1394 | This is the form used in most existing programs since it is what English |
| 1395 | is using. A header entry would look like this: |
| 1396 | |
| 1397 | @smallexample |
| 1398 | Plural-Forms: nplurals=2; plural=n != 1; |
| 1399 | @end smallexample |
| 1400 | |
| 1401 | (Note: this uses the feature of C expressions that boolean expressions |
| 1402 | have the value zero or one.) |
| 1403 | |
| 1404 | @noindent |
| 1405 | Languages with this property include: |
| 1406 | |
| 1407 | @table @asis |
| 1408 | @item Germanic family |
| 1409 | Danish, Dutch, English, German, Norwegian, Swedish |
| 1410 | @item Finno-Ugric family |
| 1411 | Estonian, Finnish |
| 1412 | @item Latin/Greek family |
| 1413 | Greek |
| 1414 | @item Semitic family |
| 1415 | Hebrew |
| 1416 | @item Romance family |
| 1417 | Italian, Portuguese, Spanish |
| 1418 | @item Artificial |
| 1419 | Esperanto |
| 1420 | @end table |
| 1421 | |
| 1422 | @item Two forms, singular used for zero and one |
| 1423 | Exceptional case in the language family. The header entry would be: |
| 1424 | |
| 1425 | @smallexample |
| 1426 | Plural-Forms: nplurals=2; plural=n>1; |
| 1427 | @end smallexample |
| 1428 | |
| 1429 | @noindent |
| 1430 | Languages with this property include: |
| 1431 | |
| 1432 | @table @asis |
| 1433 | @item Romance family |
| 1434 | French, Brazilian Portuguese |
| 1435 | @end table |
| 1436 | |
| 1437 | @item Three forms, special case for zero |
| 1438 | The header entry would be: |
| 1439 | |
| 1440 | @smallexample |
| 1441 | Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n != 0 ? 1 : 2; |
| 1442 | @end smallexample |
| 1443 | |
| 1444 | @noindent |
| 1445 | Languages with this property include: |
| 1446 | |
| 1447 | @table @asis |
| 1448 | @item Baltic family |
| 1449 | Latvian |
| 1450 | @end table |
| 1451 | |
| 1452 | @item Three forms, special cases for one and two |
| 1453 | The header entry would be: |
| 1454 | |
| 1455 | @smallexample |
| 1456 | Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2; |
| 1457 | @end smallexample |
| 1458 | |
| 1459 | @noindent |
| 1460 | Languages with this property include: |
| 1461 | |
| 1462 | @table @asis |
| 1463 | @item Celtic |
| 1464 | Gaeilge (Irish) |
| 1465 | @end table |
| 1466 | |
| 1467 | @item Three forms, special case for numbers ending in 1[2-9] |
| 1468 | The header entry would look like this: |
| 1469 | |
| 1470 | @smallexample |
| 1471 | Plural-Forms: nplurals=3; \ |
| 1472 | plural=n%10==1 && n%100!=11 ? 0 : \ |
| 1473 | n%10>=2 && (n%100<10 || n%100>=20) ? 1 : 2; |
| 1474 | @end smallexample |
| 1475 | |
| 1476 | @noindent |
| 1477 | Languages with this property include: |
| 1478 | |
| 1479 | @table @asis |
| 1480 | @item Baltic family |
| 1481 | Lithuanian |
| 1482 | @end table |
| 1483 | |
| 1484 | @item Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4] |
| 1485 | The header entry would look like this: |
| 1486 | |
| 1487 | @smallexample |
| 1488 | Plural-Forms: nplurals=3; \ |
| 1489 | plural=n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1; |
| 1490 | @end smallexample |
| 1491 | |
| 1492 | @noindent |
| 1493 | Languages with this property include: |
| 1494 | |
| 1495 | @table @asis |
| 1496 | @item Slavic family |
| 1497 | Croatian, Czech, Russian, Ukrainian |
| 1498 | @end table |
| 1499 | |
| 1500 | @item Three forms, special cases for 1 and 2, 3, 4 |
| 1501 | The header entry would look like this: |
| 1502 | |
| 1503 | @smallexample |
| 1504 | Plural-Forms: nplurals=3; \ |
| 1505 | plural=(n==1) ? 1 : (n>=2 && n<=4) ? 2 : 0; |
| 1506 | @end smallexample |
| 1507 | |
| 1508 | @noindent |
| 1509 | Languages with this property include: |
| 1510 | |
| 1511 | @table @asis |
| 1512 | @item Slavic family |
| 1513 | Slovak |
| 1514 | @end table |
| 1515 | |
| 1516 | @item Three forms, special case for one and some numbers ending in 2, 3, or 4 |
| 1517 | The header entry would look like this: |
| 1518 | |
| 1519 | @smallexample |
| 1520 | Plural-Forms: nplurals=3; \ |
| 1521 | plural=n==1 ? 0 : \ |
| 1522 | n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2; |
| 1523 | @end smallexample |
| 1524 | |
| 1525 | @noindent |
| 1526 | Languages with this property include: |
| 1527 | |
| 1528 | @table @asis |
| 1529 | @item Slavic family |
| 1530 | Polish |
| 1531 | @end table |
| 1532 | |
| 1533 | @item Four forms, special case for one and all numbers ending in 02, 03, or 04 |
| 1534 | The header entry would look like this: |
| 1535 | |
| 1536 | @smallexample |
| 1537 | Plural-Forms: nplurals=4; \ |
| 1538 | plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3; |
| 1539 | @end smallexample |
| 1540 | |
| 1541 | @noindent |
| 1542 | Languages with this property include: |
| 1543 | |
| 1544 | @table @asis |
| 1545 | @item Slavic family |
| 1546 | Slovenian |
| 1547 | @end table |
| 1548 | @end table |
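
Because the @code{plural} expressions use C syntax, a translator can
check a rule by copying it into a small C function and trying a few
values.  The following sketch does this for three of the entries in the
table above; the function names are hypothetical and chosen only for
this illustration:

```c
/* Rule listed for Latvian: three forms, special case for zero.  */
static unsigned long int
plural_latvian (unsigned long int n)
{
  return n % 10 == 1 && n % 100 != 11 ? 0 : n != 0 ? 1 : 2;
}

/* Rule listed for Croatian, Czech, Russian, and Ukrainian.  */
static unsigned long int
plural_russian (unsigned long int n)
{
  return n % 100 / 10 == 1 ? 2 : n % 10 == 1 ? 0 : (n + 9) % 10 > 3 ? 2 : 1;
}

/* Rule listed for Polish.  */
static unsigned long int
plural_polish (unsigned long int n)
{
  return n == 1 ? 0
         : n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20) ? 1
         : 2;
}
```

Each function returns the index of the form to use, which is always
smaller than the @code{nplurals} value of the corresponding entry.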
| 1549 | |
| 1550 | |
| 1551 | @node Charset conversion in gettext |
| 1552 | @subsubsection How to specify the output character set @code{gettext} uses |
| 1553 | |
| 1554 | @code{gettext} not only looks up a translation in a message catalog. It |
| 1555 | also converts the translation on the fly to the desired output character |
| 1556 | set. This is useful if the user is working in a different character set |
| 1557 | than the translator who created the message catalog, because it avoids |
| 1558 | distributing variants of message catalogs which differ only in the |
| 1559 | character set. |
| 1560 | |
| 1561 | The output character set is, by default, the value of @code{nl_langinfo |
| 1562 | (CODESET)}, which depends on the @code{LC_CTYPE} part of the current |
| 1563 | locale. But programs which store strings in a locale independent way |
| 1564 | (e.g. UTF-8) can request that @code{gettext} and related functions |
| 1565 | return the translations in that encoding, by use of the |
| 1566 | @code{bind_textdomain_codeset} function. |
| 1567 | |
| 1568 | Note that the @var{msgid} argument to @code{gettext} is not subject to |
| 1569 | character set conversion. Also, when @code{gettext} does not find a |
| 1570 | translation for @var{msgid}, it returns @var{msgid} unchanged -- |
| 1571 | independently of the current output character set. It is therefore |
| 1572 | recommended that all @var{msgid}s be US-ASCII strings. |
| 1573 | |
| 1574 | @comment libintl.h |
| 1575 | @comment GNU |
| 1576 | @deftypefun {char *} bind_textdomain_codeset (const char *@var{domainname}, const char *@var{codeset}) |
| 1577 | @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}} |
| 1578 | @c bind_textdomain_codeset @ascuheap @acsmem |
| 1579 | @c set_binding_values dup @ascuheap @acsmem |
| 1580 | The @code{bind_textdomain_codeset} function can be used to specify the |
| 1581 | output character set for message catalogs for domain @var{domainname}. |
| 1582 | The @var{codeset} argument must be a valid codeset name which can be used |
| 1583 | for the @code{iconv_open} function, or a null pointer. |
| 1584 | |
| 1585 | If the @var{codeset} parameter is the null pointer, |
| 1586 | @code{bind_textdomain_codeset} returns the currently selected codeset |
| 1587 | for the domain with the name @var{domainname}. It returns @code{NULL} if |
| 1588 | no codeset has yet been selected. |
| 1589 | |
| 1590 | The @code{bind_textdomain_codeset} function can be used several times. |
| 1591 | If used multiple times with the same @var{domainname} argument, the |
| 1592 | later call overrides the settings made by the earlier one. |
| 1593 | |
| 1594 | The @code{bind_textdomain_codeset} function returns a pointer to a |
| 1595 | string containing the name of the selected codeset. The string is |
| 1596 | allocated internally in the function and must not be changed by the |
| 1597 | user. If the system ran out of memory during the execution of |
| 1598 | @code{bind_textdomain_codeset}, the return value is @code{NULL} and the |
| 1599 | global variable @code{errno} is set accordingly. |
| 1600 | @end deftypefun |
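
A program that keeps its strings in UTF-8 might request that encoding
as in the following sketch.  The helper name @code{request_utf8} is
hypothetical, and the domain name is whatever the program otherwise
uses:

```c
#include <libintl.h>
#include <stddef.h>

/* Hypothetical helper: asks for the translations of DOMAIN to be
   returned in UTF-8.  Returns 0 on success and -1 if
   bind_textdomain_codeset reported a failure by returning NULL.  */
static int
request_utf8 (const char *domain)
{
  const char *selected = bind_textdomain_codeset (domain, "UTF-8");
  return selected != NULL ? 0 : -1;
}
```

A later call with a null pointer as the @var{codeset} argument can be
used to query the setting made this way.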
| 1601 | |
| 1602 | |
| 1603 | @node GUI program problems |
| 1604 | @subsubsection How to use @code{gettext} in GUI programs |
| 1605 | |
| 1606 | One place where the @code{gettext} functions, if used normally, have big |
| 1607 | problems is within programs with graphical user interfaces (GUIs). The |
| 1608 | problem is that many of the strings which have to be translated are very |
| 1609 | short. They have to appear in pull-down menus, which restricts the |
| 1610 | length. But strings which do not contain entire sentences or at |
| 1611 | least large fragments of a sentence may appear in more than one |
| 1612 | situation in the program and might need different translations. This is |
| 1613 | especially true for the one-word strings which are frequently used in |
| 1614 | GUI programs. |
| 1615 | |
| 1616 | As a consequence many people say that the @code{gettext} approach is |
| 1617 | wrong and instead @code{catgets} should be used which indeed does not |
| 1618 | have this problem. But there is a very simple and powerful method to |
| 1619 | handle this kind of problem with the @code{gettext} functions. |
| 1620 | |
| 1621 | @noindent |
| 1622 | As an example consider the following fictional situation. A GUI program |
| 1623 | has a menu bar with the following entries: |
| 1624 | |
| 1625 | @smallexample |
| 1626 | +------------+------------+--------------------------------------+ |
| 1627 | | File | Printer | | |
| 1628 | +------------+------------+--------------------------------------+ |
| 1629 | | Open | | Select | |
| 1630 | | New | | Open | |
| 1631 | +----------+ | Connect | |
| 1632 | +----------+ |
| 1633 | @end smallexample |
| 1634 | |
| 1635 | To have the strings @code{File}, @code{Printer}, @code{Open}, |
| 1636 | @code{New}, @code{Select}, and @code{Connect} translated there has to be |
| 1637 | at some point in the code a call to a function of the @code{gettext} |
| 1638 | family. But in two places the string passed into the function would be |
| 1639 | @code{Open}. The translations might not be the same and therefore we |
| 1640 | are in the dilemma described above. |
| 1641 | |
| 1642 | One solution to this problem is to artificially lengthen the strings |
| 1643 | to make them unambiguous. But what would the program do if no |
| 1644 | translation is available? The lengthened string is not what should be |
| 1645 | printed. So we should use a slightly modified version of the functions. |
| 1646 | |
| 1647 | To lengthen the strings a uniform method should be used. E.g., in the |
| 1648 | example above the strings could be chosen as |
| 1649 | |
| 1650 | @smallexample |
| 1651 | Menu|File |
| 1652 | Menu|Printer |
| 1653 | Menu|File|Open |
| 1654 | Menu|File|New |
| 1655 | Menu|Printer|Select |
| 1656 | Menu|Printer|Open |
| 1657 | Menu|Printer|Connect |
| 1658 | @end smallexample |
| 1659 | |
| 1660 | Now all the strings are different and if now instead of @code{gettext} |
| 1661 | the following little wrapper function is used, everything works just |
| 1662 | fine: |
| 1663 | |
| 1664 | @cindex sgettext |
| 1665 | @smallexample |
| 1666 | char * |
| 1667 | sgettext (const char *msgid) |
| 1668 | @{ |
| 1669 | char *msgval = gettext (msgid); |
| 1670 | if (msgval == msgid) |
| 1671 | msgval = strrchr (msgid, '|') + 1; |
| 1672 | return msgval; |
| 1673 | @} |
| 1674 | @end smallexample |
| 1675 | |
| 1676 | What this little function does is to recognize the case when no |
| 1677 | translation is available. This can be done very efficiently by a |
| 1678 | pointer comparison since the return value is the input value. If there |
| 1679 | is no translation we know that the input string is in the format we used |
| 1680 | for the Menu entries and therefore contains a @code{|} character. We |
| 1681 | simply search for the last occurrence of this character and return a |
| 1682 | pointer to the character following it. That's it! |
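
The wrapper relies on every untranslated string containing the
separator: if it does not, @code{strrchr} returns a null pointer and
adding one to it is undefined.  A slightly more defensive variation,
given here only as a sketch under the hypothetical name
@code{sgettext_checked}, returns such strings unchanged:

```c
#include <libintl.h>
#include <string.h>

/* Variation of the wrapper above (hypothetical name): also copes
   with untranslated strings that contain no '|' separator.  */
static const char *
sgettext_checked (const char *msgid)
{
  const char *msgval = gettext (msgid);
  if (msgval == msgid)
    {
      /* No translation found: strip the context prefix, if any.  */
      const char *sep = strrchr (msgid, '|');
      if (sep != NULL)
        msgval = sep + 1;
    }
  return msgval;
}
```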
| 1683 | |
| 1684 | If one now consistently uses the lengthened string form and replaces |
| 1685 | the @code{gettext} calls with calls to @code{sgettext} (this is normally |
| 1686 | limited to very few places in the GUI implementation) then it is |
| 1687 | possible to produce a program which can be internationalized. |
| 1688 | |
| 1689 | With advanced compilers (such as GNU C) one can write the |
| 1690 | @code{sgettext} function as an inline function or as a macro like this: |
| 1691 | |
| 1692 | @cindex sgettext |
| 1693 | @smallexample |
| 1694 | #define sgettext(msgid) \ |
| 1695 | (@{ const char *__msgid = (msgid); \ |
| 1696 | char *__msgval = gettext (__msgid); \ |
| 1697 | if (__msgval == __msgid) \ |
| 1698 | __msgval = strrchr (__msgid, '|') + 1; \ |
| 1699 | __msgval; @}) |
| 1700 | @end smallexample |
| 1701 | |
| 1702 | The other @code{gettext} functions (@code{dgettext}, @code{dcgettext} |
| 1703 | and the @code{ngettext} equivalents) can and should have corresponding |
| 1704 | functions as well which look almost identical, except for the parameters |
| 1705 | and the call to the underlying function. |
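
As an illustration, a counterpart for @code{ngettext} could look like
the following sketch.  The name @code{sngettext} is hypothetical, and
the code assumes that an untranslated call returns one of the two
argument pointers unchanged:

```c
#include <libintl.h>
#include <string.h>

/* Hypothetical counterpart of sgettext for plural messages.  If no
   translation is found, ngettext returns MSGID1 or MSGID2 itself;
   in that case the context prefix up to the last '|' is skipped.  */
static const char *
sngettext (const char *msgid1, const char *msgid2, unsigned long int n)
{
  const char *msgval = ngettext (msgid1, msgid2, n);
  if (msgval == msgid1 || msgval == msgid2)
    {
      const char *sep = strrchr (msgval, '|');
      if (sep != NULL)
        msgval = sep + 1;
    }
  return msgval;
}
```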
| 1706 | |
| 1707 | Now there is of course the question of why such functions do not exist in |
| 1708 | @theglibc{}. There are two parts to the answer to this question. |
| 1709 | |
| 1710 | @itemize @bullet |
| 1711 | @item |
| 1712 | They are easy to write and therefore can be provided by the project they |
| 1713 | are used in. This is not an answer by itself and must be seen together |
| 1714 | with the second part which is: |
| 1715 | |
| 1716 | @item |
| 1717 | There is no way the C library can contain a version which can work |
| 1718 | everywhere. The problem is the selection of the character to separate |
| 1719 | the prefix from the actual string in the lengthened string. The |
| 1720 | examples above used @code{|} which is a quite good choice because it |
| 1721 | resembles a notation frequently used in this context and it also is a |
| 1722 | character not often used in message strings. |
| 1723 | |
| 1724 | But what if the character is used in message strings? Or if the chosen |
| 1725 | character is not available in the character set on the machine one |
| 1726 | compiles on (e.g., @code{|} is not required to exist for @w{ISO C}; this is |
| 1727 | why the @file{iso646.h} file exists in @w{ISO C} programming environments)? |
| 1728 | @end itemize |
| 1729 | |
| 1730 | There is only one more comment left to make. The wrapper function above |
| 1731 | requires that the translated strings are not lengthened themselves. |
| 1732 | This is only logical. There is no need to disambiguate the strings |
| 1733 | (since they are never used as keys for a search) and one also saves |
| 1734 | quite some memory and disk space by doing this. |
| 1735 | |
| 1736 | |
| 1737 | @node Using gettextized software |
| 1738 | @subsubsection User influence on @code{gettext} |
| 1739 | |
| 1740 | The last sections described what the programmer can do to |
| 1741 | internationalize the messages of the program. But it is ultimately up to |
| 1742 | the user to select the messages s/he wants to see. S/He must understand |
| 1743 | them. |
| 1744 | |
| 1745 | The POSIX locale model uses the environment variables @code{LC_COLLATE}, |
| 1746 | @code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC}, |
| 1747 | and @code{LC_TIME} to select the locale which is to be used. This way |
| 1748 | the user can influence lots of functions. As we mentioned above the |
| 1749 | @code{gettext} functions also take advantage of this. |
| 1750 | |
| 1751 | To understand how this happens it is necessary to take a look at the |
| 1752 | various components of the filename which gets computed to locate a |
| 1753 | message catalog. It is composed as follows: |
| 1754 | |
| 1755 | @smallexample |
| 1756 | @var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo |
| 1757 | @end smallexample |
| 1758 | |
| 1759 | The default value for @var{dir_name} is system specific. It is computed |
| 1760 | from the value given as the prefix while configuring the C library. |
| 1761 | This value normally is @file{/usr} or @file{/}. For the former the |
| 1762 | complete @var{dir_name} is: |
| 1763 | |
| 1764 | @smallexample |
| 1765 | /usr/share/locale |
| 1766 | @end smallexample |
| 1767 | |
| 1768 | We can use @file{/usr/share} since the @file{.mo} files containing the |
| 1769 | message catalogs are system independent, so all systems can use the same |
| 1770 | files. If the program executed the @code{bindtextdomain} function for |
| 1771 | the message domain that is currently handled, the @code{dir_name} |
| 1772 | component is exactly the value which was given to the function as |
| 1773 | the second parameter. I.e., @code{bindtextdomain} allows overwriting |
| 1774 | the only system dependent and fixed value to make it possible to |
| 1775 | address files anywhere in the filesystem. |
| 1776 | |
| 1777 | The @var{category} is the name of the locale category which was selected |
| 1778 | in the program code. For @code{gettext} and @code{dgettext} this is |
| 1779 | always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the |
| 1780 | value of the third parameter. As said above, using a category other |
| 1781 | than @code{LC_MESSAGES} should be avoided. |
| 1782 | |
| 1783 | The @var{locale} component is computed based on the category used. Just |
| 1784 | like for the @code{setlocale} function, the user's selection comes |
| 1785 | into play here. Some environment variables are examined in a fixed order |
| 1786 | and the first environment variable set determines the return value of |
| 1787 | the lookup process. In detail, for the category @code{LC_xxx} the |
| 1788 | following variables in this order are examined: |
| 1789 | |
| 1790 | @table @code |
| 1791 | @item LANGUAGE |
| 1792 | @item LC_ALL |
| 1793 | @item LC_xxx |
| 1794 | @item LANG |
| 1795 | @end table |
| 1796 | |
| 1797 | This looks very familiar. With the exception of the @code{LANGUAGE} |
| 1798 | environment variable this is exactly the lookup order the |
| 1799 | @code{setlocale} function uses. But why introduce the @code{LANGUAGE} |
| 1800 | variable? |
| 1801 | |
| 1802 | The reason is that the syntax of the values these variables can have is |
| 1803 | different from what is expected by the @code{setlocale} function. If we |
| 1804 | set @code{LC_ALL} to a value following the extended syntax, |
| 1805 | the @code{setlocale} function would never be able to use the |
| 1806 | value of this variable. An additional variable removes this |
| 1807 | problem, and in addition we can select the language independently of the locale |
| 1808 | setting, which sometimes is useful. |
| 1809 | |
| 1810 | While for the @code{LC_xxx} variables the value should consist of |
| 1811 | exactly one specification of a locale the @code{LANGUAGE} variable's |
| 1812 | value can consist of a colon separated list of locale names. The |
| 1813 | attentive reader will realize that this is the way we manage to |
| 1814 | implement one of our additional demands above: we want to be able to |
| 1815 | specify an ordered list of languages. |
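
The lookup order listed above can be sketched as follows.  The helper
@code{first_locale_setting} is hypothetical, it treats an empty value
like an unset variable (an assumption made for this sketch), and it
leaves out any further details of the real implementation:

```c
#include <stdlib.h>

/* Hypothetical helper: returns the value of the first variable that
   is set to a non-empty string, examining LANGUAGE, LC_ALL, the
   variable named CATEGORY_NAME (for example "LC_MESSAGES"), and LANG
   in this order.  Returns NULL if none of them is set.  */
static const char *
first_locale_setting (const char *category_name)
{
  const char *names[] = { "LANGUAGE", "LC_ALL", category_name, "LANG" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    {
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}
```

For @code{LANGUAGE} the returned value may be a colon-separated list
which the caller still has to split into individual locale names.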
| 1816 | |
| 1817 | Back to the constructed filename we have only one component missing. |
| 1818 | The @var{domain_name} part is the name which was either registered using |
| 1819 | the @code{textdomain} function or which was given to @code{dgettext} or |
| 1820 | @code{dcgettext} as the first parameter. Now it becomes obvious that a |
| 1821 | good choice for the domain name in the program code is a string which is |
| 1822 | closely related to the program/package name. E.g., for @theglibc{} |
| 1823 | the domain name is @code{libc}. |
| 1824 | |
| 1825 | @noindent |
| 1826 | A little piece of example code should show how the programmer is supposed |
| 1827 | to work: |
| 1828 | |
| 1829 | @smallexample |
| 1830 | @{ |
| 1831 | setlocale (LC_ALL, ""); |
| 1832 | textdomain ("test-package"); |
| 1833 | bindtextdomain ("test-package", "/usr/local/share/locale"); |
| 1834 | puts (gettext ("Hello, world!")); |
| 1835 | @} |
| 1836 | @end smallexample |
| 1837 | |
| 1838 | At the program start the default domain is @code{messages}, and the |
| 1839 | default locale is @code{"C"}. The @code{setlocale} call sets the locale |
| 1840 | according to the user's environment variables; remember that correct |
| 1841 | functioning of @code{gettext} relies on the correct setting of the |
| 1842 | @code{LC_MESSAGES} locale (for looking up the message catalog) and |
| 1843 | of the @code{LC_CTYPE} locale (for the character set conversion). |
| 1844 | The @code{textdomain} call changes the default domain to |
| 1845 | @code{test-package}. The @code{bindtextdomain} call specifies that |
| 1846 | the message catalogs for the domain @code{test-package} can be found |
| 1847 | below the directory @file{/usr/local/share/locale}. |
| 1848 | |
| 1849 | If the user now sets the variable @code{LANGUAGE} in her/his environment |
| 1850 | to @code{de}, the @code{gettext} function will try to use the |
| 1851 | translations from the file |
| 1852 | |
| 1853 | @smallexample |
| 1854 | /usr/local/share/locale/de/LC_MESSAGES/test-package.mo |
| 1855 | @end smallexample |
| 1856 | |
| 1857 | From the above descriptions it should be clear which component of this |
| 1858 | filename is determined by which source. |
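
To make the composition explicit, the following sketch assembles the
name from its four components.  The helper @code{catalog_path} is
hypothetical and takes the full category name (such as
@code{LC_MESSAGES}) as one argument:

```c
#include <stdio.h>

/* Hypothetical helper: composes DIR_NAME/LOCALE/CATEGORY/DOMAIN_NAME.mo
   in BUF, where CATEGORY is the full category name such as
   "LC_MESSAGES".  Returns 0 on success and -1 if BUF is too small.  */
static int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
{
  int len = snprintf (buf, size, "%s/%s/%s/%s.mo",
                      dir_name, locale, category, domain_name);
  return len >= 0 && (size_t) len < size ? 0 : -1;
}
```

Called with the directory given to @code{bindtextdomain}, the locale
@code{de}, the category @code{LC_MESSAGES}, and the domain
@code{test-package}, it produces the file name shown above.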
| 1859 | |
| 1860 | In the above example we assumed that the @code{LANGUAGE} environment |
| 1861 | variable is set to @code{de}. This might be an appropriate selection but what |
| 1862 | happens if the user wants to use @code{LC_ALL} because of the wider |
| 1863 | usability and here the required value is @code{de_DE.ISO-8859-1}? We |
| 1864 | already mentioned above that a situation like this is not infrequent. |
| 1865 | E.g., a person might prefer reading a dialect and if this is not |
| 1866 | available fall back on the standard language. |
| 1867 | |
| 1868 | The @code{gettext} functions know about situations like this and can |
| 1869 | handle them gracefully. The functions recognize the format of the value |
| 1870 | of the environment variable. They can split the value into different pieces |
| 1871 | and, by leaving out one or the other part, construct new |
| 1872 | values. This happens of course in a predictable way. To understand |
| 1873 | this one must know the format of the environment variable value. There |
| 1874 | is one more or less standardized form, originally from the X/Open |
| 1875 | specification: |
| 1876 | |
| 1877 | @code{language[_territory[.codeset]][@@modifier]} |
| 1878 | |
| 1879 | To obtain less specific locale names, parts are stripped off in the order of the |
| 1880 | following list: |
| 1881 | |
| 1882 | @enumerate |
| 1883 | @item |
| 1884 | @code{codeset} |
| 1885 | @item |
| 1886 | @code{normalized codeset} |
| 1887 | @item |
| 1888 | @code{territory} |
| 1889 | @item |
| 1890 | @code{modifier} |
| 1891 | @end enumerate |
| 1892 | |
| 1893 | The @code{language} field will never be dropped for obvious reasons. |
| 1894 | |
| 1895 | The only new thing is the @code{normalized codeset} entry. This is |
| 1896 | another goodie which is introduced to help reduce the chaos which |
| 1897 | derives from the inability of people to standardize the names of |
| 1898 | character sets. Instead of @w{ISO-8859-1} one can often see @w{8859-1}, |
| 1899 | @w{88591}, @w{iso8859-1}, or @w{iso_8859-1}. The @code{normalized |
| 1900 | codeset} value is generated from the user-provided character set name by |
| 1901 | applying the following rules: |
| 1902 | |
| 1903 | @enumerate |
| 1904 | @item |
| 1905 | Remove all characters other than digits and letters. |
| 1906 | @item |
| 1907 | Fold letters to lowercase. |
| 1908 | @item |
| 1909 | If the name contains only digits, prepend the string @code{"iso"}. |
| 1910 | @end enumerate |
| 1911 | |
| 1912 | @noindent |
| 1913 | So all of the above names will be normalized to @code{iso88591}. This |
| 1914 | gives the program user much more freedom in choosing the locale name. |
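
The three rules can be sketched in C as follows.  The function
@code{normalize_codeset} below is a hypothetical illustration written
for this text, not the library's internal routine, and it classifies
characters according to the current locale's @code{LC_CTYPE} rules:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes the normalized form of NAME to OUT
   (capacity SIZE).  Returns 0 on success, -1 if OUT is too small.  */
static int
normalize_codeset (const char *name, char *out, size_t size)
{
  size_t len = 0;
  int only_digits = 1;

  if (size == 0)
    return -1;

  /* Rules 1 and 2: keep only letters and digits, fold to lowercase.  */
  for (const char *p = name; *p != '\0'; ++p)
    {
      unsigned char c = (unsigned char) *p;
      if (!isalnum (c))
        continue;
      if (!isdigit (c))
        only_digits = 0;
      if (len + 1 >= size)
        return -1;
      out[len++] = (char) tolower (c);
    }
  out[len] = '\0';

  /* Rule 3: a name consisting only of digits gets the prefix "iso".  */
  if (only_digits)
    {
      if (len + 3 + 1 > size)
        return -1;
      memmove (out + 3, out, len + 1);
      memcpy (out, "iso", 3);
    }
  return 0;
}
```

Applied to any of the spellings listed above, the function yields
@code{iso88591}.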
| 1915 | |
| 1916 | Even this extended functionality still does not help to solve the |
| 1917 | problem that completely different names can be used to denote the same |
| 1918 | locale (e.g., @code{de} and @code{german}). To be of help in this |
| 1919 | situation the locale implementation and also the @code{gettext} |
| 1920 | functions know about aliases. |
| 1921 | |
| 1922 | The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with |
| 1923 | whatever prefix you used for configuring the C library) contains a |
| 1924 | mapping of alternative names to more regular names. The system manager |
| 1925 | is free to add new entries to meet her/his own needs. The selected |
| 1926 | locale from the environment is compared with the entries in the first |
| 1927 | column of this file, ignoring case. If they match, the value of the |
| 1928 | second column is used instead for the further handling. |
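
The lookup described above might be sketched like this, assuming the
alias file has already been read into an in-memory table. The type
@code{struct alias_entry} and the function @code{resolve_alias} are
hypothetical names chosen for this example; the C library's own
implementation is organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* One line of the alias file: first column and second column.
   Hypothetical type used only for this illustration.  */
struct alias_entry
{
  const char *alias;
  const char *value;
};

/* Compare NAME with the first column of each entry, ignoring case.
   Return the second column of the first match, or NAME itself if
   no entry matches.  */
const char *
resolve_alias (const struct alias_entry *table, size_t n,
               const char *name)
{
  size_t i;

  assert (table != NULL && name != NULL);

  for (i = 0; i < n; ++i)
    if (strcasecmp (table[i].alias, name) == 0)
      return table[i].value;

  return name;
}
```

With a table holding the illustrative entry @code{german} mapped to
@code{de_DE.ISO-8859-1}, a request for @code{German} would resolve to
@code{de_DE.ISO-8859-1}, while a name without an entry is returned
unchanged.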
| 1929 | |
| 1930 | In the description of the format of the environment variables we already |
| 1931 | mentioned the character set as a factor in the selection of the message |
| 1932 | catalog. In fact, only catalogs which contain text written using the |
| 1933 | character set of the system/program can be used (directly; there will |
| 1934 | come a solution for this some day). This means for the user that s/he |
| 1935 | will always have to take care of this. If in the collection of the |
| 1936 | message catalogs there are files for the same language but coded using |
| 1937 | different character sets the user has to be careful. |
| 1938 | |
| 1939 | |
| 1940 | @node Helper programs for gettext |
| 1941 | @subsection Programs to handle message catalogs for @code{gettext} |
| 1942 | |
| 1943 | @Theglibc{} does not contain the source code for the programs to |
| 1944 | handle message catalogs for the @code{gettext} functions. As part of |
| 1945 | the GNU project the GNU gettext package contains everything the |
| 1946 | developer needs. The functionality provided by the tools in this |
| 1947 | package by far exceeds the abilities of the @code{gencat} program |
| 1948 | described above for the @code{catgets} functions. |
| 1949 | |
| 1950 | The @code{msgfmt} program is the equivalent of the |
| 1951 | @code{gencat} program. It generates from the human-readable and |
| 1952 | -editable form of the message catalog a binary file which can be used by |
| 1953 | the @code{gettext} functions. But there are several more programs |
| 1954 | available. |
| 1955 | |
| 1956 | The @code{xgettext} program can be used to automatically extract the |
| 1957 | translatable messages from a source file. I.e., the programmer need not |
| 1958 | take care of the translations and the list of messages which have to be |
| 1959 | translated. S/He will simply wrap the translatable strings in calls to |
| 1960 | @code{gettext} et al.@: and the rest will be done by @code{xgettext}. This |
| 1961 | program has a lot of options which help to customize the output or |
| 1962 | help to understand the input better. |
| 1963 | |
| 1964 | Other programs help to manage the development cycle when new messages appear |
| 1965 | in the source files or when a new translation of the messages appears. |
| 1966 | Here it should only be noted that using all the tools in GNU gettext it |
| 1967 | is possible to @emph{completely} automate the handling of message |
| 1968 | catalogs. Besides marking the translatable strings in the source code and |
| 1969 | generating the translations, the developers do not have anything to do |
| 1970 | themselves. |