iconv.tex @ 444

Last change on this file since 444 was 444, checked in by satin@…, 6 years ago
add newlib,libalmos-mkh, restructure shared_syscalls.h and mini-libc
1	@node Iconv
2	@chapter Encoding conversions (@file{iconv.h})
3
4	This chapter describes the Newlib iconv library.
5	The iconv function declarations are in
6	@file{iconv.h}.
7
8	@menu
9	* iconv:: Encoding conversion routines
10	* Introduction to iconv:: Introduction to iconv and encodings
11	* Supported encodings:: The list of currently supported encodings
12	* iconv design decisions:: General iconv library design issues
13	* iconv configuration:: iconv-related configure script options
14	* Encoding names:: How encodings are named.
15	* CCS tables:: CCS tables format and 'mktbl.pl' Perl script
16	* CES converters:: CES converters description
17	* The encodings description file:: The 'encoding.deps' file and 'mkdeps.pl'
18	* How to add new encoding:: The steps to add new encoding support
19	* The locale support interfaces:: Locale-related iconv interfaces
20	* Contact:: The author contact
21	@end menu
22
23	@page
24	@include iconv/iconv.def
25
26	@page
27	@node Introduction to iconv
28	@section Introduction to iconv
29	@findex encoding
30	@findex character set
31	@findex charset
32	@findex CES
33	@findex CCS
34	@*
35	The iconv library is intended to convert characters from one encoding to
36	another. It implements iconv(), iconv_open() and iconv_close()
37	calls, which are defined by the Single Unix Specification.
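
As an illustration of how the three calls fit together, the following is a minimal sketch that converts one string in a single pass. The helper name @code{convert_string} is hypothetical and is not part of the library; only @code{iconv_open}, @code{iconv} and @code{iconv_close} are the standard calls. The sketch assumes that the whole result fits into the output buffer and that both encodings are enabled in the library configuration.

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from the
 * encoding `from` to the encoding `to`. Returns the number of bytes stored
 * in `out`, or (size_t)-1 if the descriptor could not be opened or the
 * conversion failed. */
size_t convert_string(const char *to, const char *from,
                      const char *in, char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, from);  /* the target encoding comes first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;             /* iconv() takes non-const pointers */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outsize - outleft;
}
```

A caller might pass "UTF-8" and "ISO-8859-1" as the two encoding names together with an output buffer of a few bytes per input character.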
38
39	@*
40	In addition to these user-level interfaces, the iconv library also has
41	several useful interfaces which are needed to support coding
42	capabilities of the Newlib Locale infrastructure. Since Locale
43	support also needs to
44	convert various character sets to and from the @emph{wide character
45	set}, the iconv library shares its capabilities with the Newlib Locale
46	subsystem. Moreover, the iconv library supports several features which are
47	only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).
48
49	@*
50	The Newlib iconv library was created using concepts from another iconv
51	library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
52	was rewritten from scratch and contains a lot of improvements with respect to
53	the original iconv library.
54
55	@*
56	Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
57	are often used with various meanings. The following are the definitions of terms
58	which are used in this documentation as well as in the iconv library
59	implementation:
60
61	@itemize @bullet
62	@item
63	@dfn{encoding} - a machine representation of characters by means of bits;
64
65	@item
66	@dfn{Character Set} or @dfn{Charset} - just a collection of
67	characters, i.e. the encoding is the machine representation of the character set;
68
69	@item
70	@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
71	set of integers (@dfn{character codes});
72
73	@item
74	@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
75	codes to a sequence of bytes;
76	@end itemize
77
78	@*
79	Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
80	ASCII, etc. Encodings are formed by the following chain of steps:
81
82	@enumerate
83	@item
84	The user has a set of characters which are specific to his or her language (the character set).
85
86	@item
87	Each character from this set is uniquely numbered, resulting in a CCS.
88
89	@item
90	Each number from the CCS is converted to a sequence of bits or bytes by means
91	of a CES, which forms some encoding. Thus, a CES may be considered as a
92	function of a CCS which produces some encoding. Note that a CES may be
93	applied to more than one CCS.
94	@end enumerate
95
96	@*
97	Thus, an encoding may be considered as one or more CCS + CES.
98
99	@*
100	Sometimes there is no CES; in such cases the encoding is equivalent
101	to the CCS, e.g. KOI8-R or ASCII.
102
103	@*
104	An example of a more complicated encoding is UTF-8 which is the UCS
105	(or Unicode) CCS plus the UTF-8 CES.
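
To make the CCS/CES split concrete, here is a minimal sketch of the UTF-8 CES as a function that takes one UCS code (the CCS part) and produces its byte sequence. The function name @code{utf8_encode} is chosen for this illustration only and does not refer to any Newlib interface; the bit layout follows RFC 3629.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: apply the UTF-8 CES to one UCS code. Stores the byte
 * sequence in `out` (which must hold at least 4 bytes) and returns its
 * length, or 0 if the code cannot be encoded. */
size_t utf8_encode(uint32_t ucs, unsigned char *out)
{
    if (ucs < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {                    /* 3 bytes */
        if (ucs >= 0xD800 && ucs <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;                               /* outside the UCS range */
}
```

For example, the UCS code 0x0416 (CYRILLIC CAPITAL LETTER ZHE) is turned into the two bytes 0xD0 0x96.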
106
107	@*
108	The following is a brief list of iconv library features:
109	@itemize
110	@item
111	Generic architecture;
112	@item
113	Locale infrastructure support;
114	@item
115	Automatic generation of the program code which handles
116	CES/CCS/Encoding/Names/Aliases dependencies;
117	@item
118	The ability to choose a size- or speed-optimized
119	configuration;
120	@item
121	The ability to exclude a lot of unneeded code and data from the linking step.
122	@end itemize
123
124
125
126
127	@page
128	@node Supported encodings
129	@section Supported encodings
130	@findex big5
131	@findex cp775
132	@findex cp850
133	@findex cp852
134	@findex cp855
135	@findex cp866
136	@findex euc_jp
137	@findex euc_kr
138	@findex euc_tw
139	@findex iso_8859_1
140	@findex iso_8859_10
141	@findex iso_8859_11
142	@findex iso_8859_13
143	@findex iso_8859_14
144	@findex iso_8859_15
145	@findex iso_8859_2
146	@findex iso_8859_3
147	@findex iso_8859_4
148	@findex iso_8859_5
149	@findex iso_8859_6
150	@findex iso_8859_7
151	@findex iso_8859_8
152	@findex iso_8859_9
153	@findex iso_ir_111
154	@findex koi8_r
155	@findex koi8_ru
156	@findex koi8_u
157	@findex koi8_uni
158	@findex ucs_2
159	@findex ucs_2_internal
160	@findex ucs_2be
161	@findex ucs_2le
162	@findex ucs_4
163	@findex ucs_4_internal
164	@findex ucs_4be
165	@findex ucs_4le
166	@findex us_ascii
167	@findex utf_16
168	@findex utf_16be
169	@findex utf_16le
170	@findex utf_8
171	@findex win_1250
172	@findex win_1251
173	@findex win_1252
174	@findex win_1253
175	@findex win_1254
176	@findex win_1255
177	@findex win_1256
178	@findex win_1257
179	@findex win_1258
180	@*
181	The following is the list of currently supported encodings. The first column
182	corresponds to the encoding name, the second column is the list of aliases,
183	the third column is its CES and CCS components names, and the fourth column
184	is a short description.
185
186	@multitable @columnfractions .20 .26 .24 .30
187	@item
188	Name
189	@tab
190	Aliases
191	@tab
192	CES/CCS
193	@tab
194	Short description
195	@item
196	@tab
197	@tab
198	@tab
199
200
201	@item
202	big5
203	@tab
204	csbig5, big_five, bigfive, cn_big5, cp950
205	@tab
206	table_pcs / big5, us_ascii
207	@tab
208	The encoding for the Traditional Chinese.
209
210
211	@item
212	cp775
213	@tab
214	ibm775, cspc775baltic
215	@tab
216	table / cp775
217	@tab
218	The updated version of CP 437 that supports the Baltic languages.
219
220
221	@item
222	cp850
223	@tab
224	ibm850, 850, cspc850multilingual
225	@tab
226	table / cp850
227	@tab
228	IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
229	added instead of some less-often used characters like the line-drawing
230	and the Greek ones.
231
232
233	@item
234	cp852
235	@tab
236	ibm852, 852, cspcp852
237	@tab table / cp852
238	@tab
239	IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
240	instead of some less-often used characters like the line-drawing and the Greek ones.
241
242
243	@item
244	cp855
245	@tab
246	ibm855, 855, csibm855
247	@tab
248	table / cp855
249	@tab
250	IBM 855 - the updated version of CP 437 that supports Cyrillic.
251
252
253	@item
254	cp866
255	@tab
256	866, IBM866, CSIBM866
257	@tab
258	table / cp866
259	@tab
260	IBM 866 - the updated version of CP 855 which more closely follows the logical Russian alphabet
261	ordering of the alternative variant preferred by many Russian users.
262
263
264	@item
265	euc_jp
266	@tab
267	eucjp
268	@tab
269	euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
270	@tab
271	EUC-JP - The EUC for Japanese.
272
273
274	@item
275	euc_kr
276	@tab
277	euckr
278	@tab
279	euc / ksx1001
280	@tab
281	EUC-KR - The EUC for Korean.
282
283
284	@item
285	euc_tw
286	@tab
287	euctw
288	@tab
289	euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
290	@tab
291	EUC-TW - The EUC for Traditional Chinese.
292
293
294	@item
295	iso_8859_1
296	@tab
297	iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
298	@tab
299	table / iso_8859_1
300	@tab
301	ISO 8859-1:1987 - Latin 1, West European.
302
303
304	@item
305	iso_8859_10
306	@tab
307	iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
308	@tab
309	table / iso_8859_10
310	@tab
311	ISO 8859-10:1992 - Latin 6, Nordic.
312
313
314	@item
315	iso_8859_11
316	@tab
317	iso8859_11, iso885911
318	@tab
319	table / iso_8859_11
320	@tab
321	ISO 8859-11 - Thai.
322
323
324	@item
325	iso_8859_13
326	@tab
327	iso_8859_13:1998, iso8859_13, iso885913
328	@tab
329	table / iso_8859_13
330	@tab
331	ISO 8859-13:1998 - Latin 7, Baltic Rim.
332
333
334	@item
335	iso_8859_14
336	@tab
337	iso_8859_14:1998, iso885914, iso8859_14
338	@tab
339	table / iso_8859_14
340	@tab
341	ISO 8859-14:1998 - Latin 8, Celtic.
342
343
344	@item
345	iso_8859_15
346	@tab
347	iso885915, iso_8859_15:1998, iso8859_15
348	@tab
349	table / iso_8859_15
350	@tab
351	ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
352
353
354	@item
355	iso_8859_2
356	@tab
357	iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
358	@tab
359	table / iso_8859_2
360	@tab
361	ISO 8859-2:1987 - Latin 2, East European.
362
363
364	@item
365	iso_8859_3
366	@tab
367	iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
368	@tab
369	table / iso_8859_3
370	@tab
371	ISO 8859-3:1988 - Latin 3, South European.
372
373
374	@item
375	iso_8859_4
376	@tab
377	iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
378	@tab
379	table / iso_8859_4
380	@tab
381	ISO 8859-4:1988 - Latin 4, North European.
382
383
384	@item
385	iso_8859_5
386	@tab
387	iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
388	@tab
389	table / iso_8859_5
390	@tab
391	ISO 8859-5:1988 - Cyrillic.
392
393
394	@item
395	iso_8859_6
396	@tab
397	iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
398	@tab
399	table / iso_8859_6
400	@tab
401	ISO 8859-6:1987 - Arabic.
402
403
404	@item
405	iso_8859_7
406	@tab
407	iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
408	@tab
409	table / iso_8859_7
410	@tab
411	ISO 8859-7:1987 - Greek.
412
413
414	@item
415	iso_8859_8
416	@tab
417	iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
418	@tab
419	table / iso_8859_8
420	@tab
421	ISO 8859-8:1988 - Hebrew.
422
423
424	@item
425	iso_8859_9
426	@tab
427	iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
428	@tab
429	table / iso_8859_9
430	@tab
431	ISO 8859-9:1989 - Latin 5, Turkish.
432
433
434	@item
435	iso_ir_111
436	@tab
437	ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
438	@tab
439	table / iso_ir_111
440	@tab
441	ISO IR 111/ECMA Cyrillic.
442
443
444	@item
445	koi8_r
446	@tab
447	cskoi8r, koi8r, koi8
448	@tab
449	table / koi8_r
450	@tab
451	RFC 1489 Cyrillic.
452
453
454	@item
455	koi8_ru
456	@tab
457	koi8ru
458	@tab
459	table / koi8_ru
460	@tab
461	The obsolete Ukrainian.
462
463
464	@item
465	koi8_u
466	@tab
467	koi8u
468	@tab
469	table / koi8_u
470	@tab
471	RFC 2319 Ukrainian.
472
473
474	@item
475	koi8_uni
476	@tab
477	koi8uni
478	@tab
479	table / koi8_uni
480	@tab
481	KOI8 Unified.
482
483
484	@item
485	ucs_2
486	@tab
487	ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
488	@tab
489	ucs_2 / (UCS)
490	@tab
491	ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
492
493
494	@item
495	ucs_2_internal
496	@tab
497	ucs2_internal, ucs_2internal, ucs2internal
498	@tab
499	ucs_2_internal / (UCS)
500	@tab
501	ISO-10646-UCS-2 in system byte order.
502	NBSP is always interpreted as NBSP (BOM isn't supported).
503
504
505	@item
506	ucs_2be
507	@tab
508	ucs2be
509	@tab
510	ucs_2 / (UCS)
511	@tab
512	Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
513	Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
514
515
516	@item
517	ucs_2le
518	@tab
519	ucs2le
520	@tab
521	ucs_2 / (UCS)
522	@tab
523	Little Endian version of ISO-10646-UCS-2.
524	Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
525
526
527	@item
528	ucs_4
529	@tab
530	ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
531	@tab
532	ucs_4 / (UCS)
533	@tab
534	ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
535
536
537	@item
538	ucs_4_internal
539	@tab
540	ucs4_internal, ucs_4internal, ucs4internal
541	@tab
542	ucs_4_internal / (UCS)
543	@tab
544	ISO-10646-UCS-4 in system byte order.
545	NBSP is always interpreted as NBSP (BOM isn't supported).
546
547
548	@item
549	ucs_4be
550	@tab
551	ucs4be
552	@tab
553	ucs_4 / (UCS)
554	@tab
555	Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
556	Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
557
558
559	@item
560	ucs_4le
561	@tab
562	ucs4le
563	@tab
564	ucs_4 / (UCS)
565	@tab
566	Little Endian version of ISO-10646-UCS-4.
567	Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
568
569
570	@item
571	us_ascii
572	@tab
573	ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
574	@tab
575	us_ascii / (ASCII)
576	@tab
577	7-bit ASCII.
578
579
580	@item
581	utf_16
582	@tab
583	utf16
584	@tab
585	utf_16 / (UCS)
586	@tab
587	RFC 2781 UTF-16. The very first NBSP code in stream is interpreted as BOM.
588
589
590	@item
591	utf_16be
592	@tab
593	utf16be
594	@tab
595	utf_16 / (UCS)
596	@tab
597	Big Endian version of RFC 2781 UTF-16.
598	NBSP is always interpreted as NBSP (BOM isn't supported).
599
600
601	@item
602	utf_16le
603	@tab
604	utf16le
605	@tab
606	utf_16 / (UCS)
607	@tab
608	Little Endian version of RFC 2781 UTF-16.
609	NBSP is always interpreted as NBSP (BOM isn't supported).
610
611
612	@item
613	utf_8
614	@tab
615	utf8
616	@tab
617	utf_8 / (UCS)
618	@tab
619	RFC 3629 UTF-8.
620
621
622	@item
623	win_1250
624	@tab
625	cp1250
626	@tab table / win_1250
627	@tab
628	Win-1250 Croatian.
629
630
631	@item
632	win_1251
633	@tab
634	cp1251
635	@tab
636	table / win_1251
637	@tab
638	Win-1251 - Cyrillic.
639
640
641	@item
642	win_1252
643	@tab
644	cp1252
645	@tab
646	table / win_1252
647	@tab
648	Win-1252 - Latin 1.
649
650
651	@item
652	win_1253
653	@tab
654	cp1253
655	@tab
656	table / win_1253
657	@tab
658	Win-1253 - Greek.
659
660
661	@item
662	win_1254
663	@tab
664	cp1254
665	@tab
666	table / win_1254
667	@tab
668	Win-1254 - Turkish.
669
670
671	@item
672	win_1255
673	@tab
674	cp1255
675	@tab
676	table / win_1255
677	@tab
678	Win-1255 - Hebrew.
679
680
681	@item
682	win_1256
683	@tab
684	cp1256
685	@tab
686	table / win_1256
687	@tab
688	Win-1256 - Arabic.
689
690
691	@item
692	win_1257
693	@tab
694	cp1257
695	@tab
696	table / win_1257
697	@tab
698	Win-1257 - Baltic.
699
700
701	@item
702	win_1258
703	@tab
704	cp1258
705	@tab
706	table / win_1258
707	@tab
708	Win-1258 - Vietnamese.
709	@end multitable
710
711
712
713
714
715	@page
716	@node iconv design decisions
717	@section iconv design decisions
718	@findex CCS table
719	@findex CES converter
720	@findex Speed-optimized tables
721	@findex Size-optimized tables
722	@*
723	The first iconv library design issue arises when considering the
724	following two design approaches:
725
726	@enumerate
727	@item
728	Have modules which implement conversion from the encoding A to the encoding B
729	and vice versa, i.e., one conversion module relates to a pair of encodings.
730	@item
731	Have modules which implement conversion from the encoding A to the fixed
732	encoding C and vice versa, i.e., one conversion module relates to any
733	one encoding A and one fixed encoding C. In this case, to convert from
734	the encoding A to the encoding B, two modules are needed (in order to convert
735	from A to C and then from C to B).
736	@end enumerate
737
738	@*
739	Obviously, there is a tradeoff between generality/flexibility and
740	efficiency: the first method is more efficient since it converts
741	directly; however, it is less flexible since a distinct module is
742	needed for each encoding pair.
743
744	@*
745	The Newlib iconv model uses the second method and always converts through the 32-bit
746	UCS, but its design also allows one to write specialized conversion
747	modules if the conversion speed is critical.
748
749	@*
750	The second design issue is how to break down (decompose) encodings.
751	The Newlib iconv library uses the fact that any encoding may be
752	considered as one or more CCS plus a CES. It also decomposes its
753	conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
754	tables}. CCS tables map CCS to UCS and vice versa; the CES converters
755	map CCS to the encoding and vice versa.
756
757	@*
758	As an example, let's consider the conversion from the big5 encoding to
759	the EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and BIG5
760	CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
761	and CNS11643_PLANE14 CCS-es plus the EUC CES.
762
763	@*
764	The euc_tw -> big5 conversion is performed as follows:
765
766	@enumerate
767	@item
768	The EUC converter performs the EUC-TW encoding to the corresponding CCS-es
769	transformation (CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
770	CCS-es);
771	@item
772	The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
773	CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
774	@item
775	The resulting UCS codes are transformed to the ASCII and BIG5 codes using
776	the corresponding CCS tables;
777	@item
778	The obtained CCS codes are transformed to the big5 encoding using the corresponding
779	CES converter.
780	@end enumerate
781
782	@*
783	Analogously, the backward conversion is performed as follows:
784
785	@enumerate
786	@item
787	The BIG5 converter performs the big5 encoding to the corresponding CCS-es transformation
788	(the ASCII and BIG5 CCS-es);
789	@item
790	The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
791	@item
792	The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2
793	and CNS11643_PLANE14 codes using the corresponding CCS tables;
794	@item
795	The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
796	CES converter.
797	@end enumerate
798
799	@*
800	Note that the above is just an example; the real names of the CES converters
801	and the CCS tables implemented in the Newlib iconv are slightly different.
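
The idea of converting through UCS can be sketched with two toy 8-bit CCS tables. This is only an illustration: the structure, the function names and the two-entry table excerpts below are invented for the sketch, and the real CCS tables use the formats described in the CCS tables section. No CES converter is involved, since both sample encodings are plain 8-bit CCS-es.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a toy CCS table: a CCS code and the matching UCS code. */
struct ccs_entry {
    uint8_t  code;
    uint16_t ucs;
};

/* Tiny excerpts of two 8-bit CCS-es (a full table has up to 256 entries). */
static const struct ccs_entry koi8_r_sample[] = {
    { 0xC1, 0x0430 },   /* CYRILLIC SMALL LETTER A */
    { 0xE1, 0x0410 },   /* CYRILLIC CAPITAL LETTER A */
};
static const struct ccs_entry cp866_sample[] = {
    { 0xA0, 0x0430 },   /* CYRILLIC SMALL LETTER A */
    { 0x80, 0x0410 },   /* CYRILLIC CAPITAL LETTER A */
};

/* "CCS -> UCS" step; returns -1 if the code is not in the table. */
long ccs_to_ucs(const struct ccs_entry *tbl, size_t n, uint8_t code)
{
    if (code < 0x80)
        return code;                /* both samples share the ASCII half */
    for (size_t i = 0; i < n; i++)
        if (tbl[i].code == code)
            return tbl[i].ucs;
    return -1;
}

/* "UCS -> CCS" step; returns -1 if the UCS code has no counterpart. */
int ucs_to_ccs(const struct ccs_entry *tbl, size_t n, long ucs)
{
    if (ucs >= 0 && ucs < 0x80)
        return (int)ucs;
    for (size_t i = 0; i < n; i++)
        if (tbl[i].ucs == ucs)
            return tbl[i].code;
    return -1;
}

/* Convert one code from the source CCS to the destination CCS via UCS. */
int convert_via_ucs(const struct ccs_entry *src, size_t nsrc,
                    const struct ccs_entry *dst, size_t ndst, uint8_t code)
{
    long ucs = ccs_to_ucs(src, nsrc, code);
    return ucs < 0 ? -1 : ucs_to_ccs(dst, ndst, ucs);
}
```

With these samples, the KOI8-R code 0xC1 goes to the UCS code 0x0430 and then to the CP866 code 0xA0. Adding a third encoding needs only one more table, not a module for every pair.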
802
803	@*
804	The third design issue also relates to flexibility. Obviously, it isn't
805	desirable to always link all the CES converters and the CCS tables to the library;
806	instead, we want to be able to load the needed converters and tables
807	dynamically on demand. This isn't a problem on "big" machines such as
808	a PC, but it may be very problematic within "small" embedded systems.
809
810	@*
811	Since the CCS tables are just data, it is possible to load them
812	dynamically from external files. The CES converters, on the other hand,
813	are algorithms with some code, so a dynamic library loading
814	capability is required.
815
816	@*
817	Apart from possible restrictions applied by embedded systems (small
818	RAM for example), Newlib itself has no dynamic library support and
819	therefore, all the CES converters which will ever be used must be linked into
820	the library. However, loading of the dynamic CCS tables is possible and is
821	implemented in the Newlib iconv library. It may be enabled via the Newlib
822	configure script options.
823
824	@*
825	The next design issue is fine-tuning the iconv library
826	configuration. One important ability is for iconv to not link all its
827	converters and tables (if dynamic loading is not enabled) but instead
828	enable only those encodings which are specified at configuration
829	time (see the section about the configure script options).
830
831	@*
832	In addition, the Newlib iconv library configure options distinguish between
833	conversion directions. This means that not only are supported encodings
834	selectable, the conversion direction is as well. For example, if a user wants
835	a configuration which allows conversions from UTF-8 to UTF-16 and
836	doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
837	enable only
838	this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
839	be included), thus saving some memory (note that this technique allows
840	excluding one half of a CCS table, which may be fairly large, from linking).
841
842	@*
843	One more design aspect is the speed- and size-optimized tables. Users can
844	select between them using configure script options. The
845	speed-optimized CCS tables are the same as the size-optimized ones in
846	case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
847	CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
848	other hand, conversion with speed tables is several times faster.
849
850	@*
851	It is worth stressing that new encoding support can't be
852	dynamically added into an already compiled Newlib library, even if it
853	needs only an additional CCS table and iconv is configured to use
854	the external files with CCS tables (this isn't a fundamental restriction,
855	and the possibility to add new table-based encoding support dynamically, by
856	means of just adding a new .cct file, may be easily added).
857
858	@*
859	Theoretically, the compiled-in CCS tables should be more appropriate for
860	embedded systems than dynamically loaded CCS tables. This is because the compiled-in tables are read-only and can be placed in ROM
861	whereas dynamic loading requires RAM. Moreover, in the current iconv
862	implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding.
863	This means, for example, that if two iconv descriptors for
864	"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
865	the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
866	of these files). On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.
867
868	@page
869	@node iconv configuration
870	@section iconv configuration
871	@findex iconv configuration
872	@findex --enable-newlib-iconv-encodings
873	@findex --enable-newlib-iconv-from-encodings
874	@findex --enable-newlib-iconv-to-encodings
875	@findex --enable-newlib-iconv-external-ccs
876	@findex NLSPATH
877	@*
878	To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
879	script option should be used. This option accepts a comma-separated list
880	of @emph{encodings} that should be enabled. The option enables each encoding in both
881	("to" and "from") directions.
882
883	@*
884	The @option{--enable-newlib-iconv-from-encodings} configure script option enables
885	"from" support for each encoding that was passed to it.
886
887	@*
888	The @option{--enable-newlib-iconv-to-encodings} configure script option enables
889	"to" support for each encoding that was passed to it.
890
891	@*
892	Example: if a user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
893	"KOI8-R -> UCS-2" conversions, the optimal way (minimal iconv
894	code and data will be linked) is to configure Newlib with the following
895	options:
896	@*
897	@code{--enable-newlib-iconv-encodings=UTF-8
898	--enable-newlib-iconv-from-encodings=KOI8-R
899	--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
900	@*
901	which is the same as
902	@*
903	@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
904	--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
905	@*
906	The user may also just use the
907	@*
908	@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
909	@*
910	configure script option, but it is less optimal since some
911	unneeded data and code will be linked.
912
913	@*
914	The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
915	capabilities to work with the external CCS files.
916
917	@*
918	The @option{--enable-target-optspace} Newlib configure script option also affects
919	the iconv library. If this option is present, the library uses the size-optimized
920	CCS tables. This means that only the size-optimized CCS
921	tables will be linked or, if the
922	@option{--enable-newlib-iconv-external-ccs} configure script option was used,
923	the iconv library will load the size-optimized tables. If the
924	@option{--enable-target-optspace} configure script option is disabled,
925	the speed-optimized CCS tables are used.
926
927	@*
928	Note: iconv_open searches for .cct files in the $NLSPATH/iconv_data/ directory.
929	Thus, the NLSPATH environment variable should be set.
930
931
932
933
934
935	@page
936	@node Encoding names
937	@section Encoding names
938	@findex encoding name
939	@findex encoding alias
940	@findex normalized name
941	@*
942	Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
943	a user works with the iconv library (i.e., when the @code{iconv_open} call
944	is used), either the name or any of the aliases may be used. The same applies
945	when encoding names are used in configure script options.
946
947	@*
948	Names and aliases may be specified in any case (small or capital
949	letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
950
951	@*
952	Internally the Newlib iconv library always converts aliases to names. It
953	also converts names and aliases to the @dfn{normalized} form, which means
954	that all capital letters are converted to small letters and the @kbd{-}
955	symbols are converted to @kbd{_} symbols.
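
The normalization rule can be sketched in a few lines of C. The function name @code{normalize_name} is hypothetical and only mirrors the rule stated above (lower-case letters, @kbd{-} replaced by @kbd{_}); it does not resolve aliases to names and is not the routine used inside the library.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: write the normalized form of `name` into `out`,
 * truncating if `outsize` is too small. The result is NUL-terminated
 * whenever `outsize` is not zero. */
void normalize_name(const char *name, char *out, size_t outsize)
{
    size_t i = 0;

    if (outsize == 0)
        return;
    for (; name[i] != '\0' && i + 1 < outsize; i++) {
        unsigned char c = (unsigned char)name[i];
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    out[i] = '\0';
}
```

Under this rule "KOI8-R", "koi8-r" and "Koi8_R" all normalize to the same string, koi8_r.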
956
957
958
959
960	@page
961	@node CCS tables
962	@section CCS tables
963	@findex Size-optimized CCS table
964	@findex Speed-optimized CCS table
965	@findex mktbl.pl Perl script
966	@findex .cct files
967	@findex The CCT tables source files
968	@findex CCS source files
969	@*
970	The iconv library stores files with CCS tables in the @emph{ccs/}
971	subdirectory. The CCS tables for any CCS may be kept in two forms: in the binary form
972	(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
973	of compilable .c source files. The .cct files are only used when the
974	@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
975	The .c files are linked to the Newlib library if the corresponding
976	encoding is enabled.
977
978	@*
979	As stated earlier, the Newlib iconv library performs all
980	conversions through the 32-bit UCS, but the codes which are used
981	in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
982	Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
983	used instead of the 32-bit UCS-4.
984
985	@*
986	CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to
987	16-bit UCS-2 and vice versa while 16-bit CCS tables map
988	16-bit CCS to 16-bit UCS-2 and vice versa.
989	8-bit tables are small (in size) while 16-bit tables may be quite large.
990	Because of this, 16-bit CCS tables may be
991	either speed- or size-optimized. Size-optimized CCS tables are
992	smaller than speed-optimized ones, but the conversion process is
993	slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
994	the speed-optimized variant.
995
996	Each CCS table (both speed- and size-optimized) consists of
997	@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
998	UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
999	UCS-2 codes.
1000
1001	@*
1002	Almost all 16-bit CCS tables contain fewer than 0xFFFF codes, and
1003	a lot of gaps exist.
1004
1005	@subsection Speed-optimized tables format
1006	@*
1007	In case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
1008	trivial - it is just an array of 256 16-bit UCS codes. Therefore, the
1009	UCS-2 code @emph{Y} corresponding to the CCS code @emph{X} is calculated
1010	as @emph{Y = to_ucs[X]}.
1011
1012	@*
1013	Obviously, the simplest way to create the "from_ucs" table or the
1014	16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
1015	of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
1016	fewer than 0xFFFF code maps and this fact may be exploited to reduce
1017	the size of the CCS tables.
1018
1019	@*
1020	In this section the "UCS-2 -> CCS" 8-bit CCS table format is described. The
1021	16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
1022	direction and the number of CCS bits.
1023
1024	@*
1025	In case of the 8-bit speed-optimized table the "from_ucs" subtable
1026	corresponds to the "from_ucs" array and has the following layout:
1027
1028	@*
1029	from_ucs array:
1030	@*
1031	-------------------------------------
1032	@*
1033	0xFF mapping (2 bytes) (only for
1034	8-bit table).
1035	@*
1036	-------------------------------------
1037	@*
1038	Heading block
1039	@*
1040	-------------------------------------
1041	@*
1042	Block 1
1043	@*
1044	-------------------------------------
1045	@*
1046	Block 2
1047	@*
1048	-------------------------------------
1049	@*
1050	...
1051	@*
1052	-------------------------------------
1053	@*
1054	Block N
1055	@*
1056	-------------------------------------
1057
1058	@*
1059	The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
1060	subrange is represented by a 256-element @dfn{block} (256 1-byte
1061	elements or 256 2-byte elements in case of a 16-bit CCS table) with
1062	elements which are equivalent to the CCS codes of this subrange.
1063	If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
1064	absent and there will be fewer than 256 blocks.
1065
1066	@*
1067	Any element number @emph{m} of @dfn{the heading block} (which contains
1068	256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
1069	If the subrange contains some codes, the value of the @emph{m}-th element of
1070	the heading block contains the offset of the corresponding block in the
1071	"from_ucs" array. If there are no codes in the subrange, the heading
1072	block element contains 0xFFFF.
1073
1074	@*
1075	If there are some gaps in a block, the corresponding block elements have
1076	the 0xFF value. If there is a 0xFF code present in the CCS, its mapping
1077	is defined in the first 2-byte element of the "from_ucs" array.
1078
1079	@*
1080	Having such a table format, the algorithm of searching the CCS code
1081	@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.
1082
1083	@*
1084	@enumerate
1085	@item If @emph{Y} is equivalent to the value of the first 2-byte element
1086	of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.
1087
1088	@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.
1089
1090	@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
1091	is no corresponding CCS code (error, wrong input data). Else, fetch the
1092	"from_ucs" array index @emph{BlkOff} of the @emph{BlkN}-th block.
1093
1094	@item Calculate the offset of the @emph{X} code in its block:
1095	@emph{Xindex = Y & 0xFF}
1096
1097	@item If the value of the @emph{Xindex}-th element of the block (which is
1098	@emph{from_ucs[BlkOff+Xindex]}) is 0xFF, there is no corresponding
1099	CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkOff+Xindex]}.
1100	@end enumerate
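
The lookup steps above can be sketched in C as follows. This is an illustration under simplifying assumptions, not the library's code: the single "from_ucs" array is modelled as a structure with separate fields, block offsets are counted from the start of the @code{blocks} member, and the names @code{from_ucs_table}, @code{from_ucs_lookup} and @code{build_sample_table} are invented for the sketch. The sample data covers only ASCII and three KOI8-R letters.

```c
#include <stdint.h>
#include <string.h>

enum { NO_BLOCK = 0xFFFF, NO_CODE = 0xFF };

/* Simplified model of an 8-bit speed-optimized "from_ucs" subtable. */
struct from_ucs_table {
    uint16_t ff_mapping;        /* UCS-2 code that maps to CCS code 0xFF */
    uint16_t heading[256];      /* block offset per subrange, or NO_BLOCK */
    uint8_t  blocks[2 * 256];   /* room for two blocks in this toy table */
};

/* Return the CCS code for the UCS-2 code `y`, or -1 if there is none. */
int from_ucs_lookup(const struct from_ucs_table *t, uint16_t y)
{
    if (y == t->ff_mapping)                 /* step 1: the 0xFF special case */
        return 0xFF;

    unsigned blkn = (y & 0xFF00) >> 8;      /* step 2: block number */
    uint16_t off = t->heading[blkn];        /* step 3: block offset */
    if (off == NO_BLOCK)
        return -1;

    unsigned xindex = y & 0xFF;             /* step 4: offset in the block */
    uint8_t x = t->blocks[off + xindex];    /* step 5: fetch the code */
    return x == NO_CODE ? -1 : x;
}

/* Fill a toy table: ASCII identity plus three KOI8-R Cyrillic letters. */
void build_sample_table(struct from_ucs_table *t)
{
    memset(t->heading, 0xFF, sizeof t->heading);    /* all NO_BLOCK */
    memset(t->blocks, NO_CODE, sizeof t->blocks);   /* all gaps */

    t->ff_mapping = 0x042A;     /* CAPITAL HARD SIGN is 0xFF in KOI8-R */
    t->heading[0x00] = 0;       /* block for 0x0000-0x00FF */
    t->heading[0x04] = 256;     /* block for 0x0400-0x04FF */
    for (unsigned c = 0; c < 0x80; c++)
        t->blocks[c] = (uint8_t)c;
    t->blocks[256 + 0x30] = 0xC1;   /* 0x0430 SMALL A   -> 0xC1 */
    t->blocks[256 + 0x10] = 0xE1;   /* 0x0410 CAPITAL A -> 0xE1 */
}
```

Each lookup costs two array accesses regardless of the table contents, which is why this format is the fast one. The price is a full 256-element block for every subrange that contains at least one code.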
1101
1102	@subsection Size-optimized tables format
1103	@*
1104	As stated above, size-optimized tables exist only for 16-bit CCS-es.
1105	This is because the difference between the speed-optimized
1106	and the size-optimized table sizes is too small in case of 8-bit CCS-es.
1107
1108	@*
1109	Formats of the "to_ucs" and "from_ucs" subtables are equivalent in case of
1110	size-optimized tables.
1111
1112	This section describes the format of the "UCS-2 -> CCS" size-optimized
1113	CCS table. The format of "CCS -> UCS-2" table is the same.
1114
1115	The idea of the size-optimized tables is to split the UCS-2 codes
1116	("from" codes) into @dfn{ranges} (a @dfn{range} is a number of consecutive UCS-2 codes).
1117	Then CCS codes ("to" codes) are stored only for the codes from these
1118	ranges. Distinct "from" codes which have no range (@dfn{unranged codes}) are stored
1119	together with the corresponding "to" codes.
1120
1121	@*
1122	The following is the layout of the size-optimized table array:
1123
1124	@*
1125	size_arr array:
1126	@*
1127	-------------------------------------
1128	@*
1129	Ranges number (2 bytes)
1130	@*
1131	-------------------------------------
1132	@*
1133	Unranged codes number (2 bytes)
1134	@*
1135	-------------------------------------
1136	@*
1137	Unranged codes array index (2 bytes)
1138	@*
1139	-------------------------------------
1140	@*
1141	Ranges indexes (triads)
1142	@*
1143	-------------------------------------
1144	@*
1145	Ranges
1146	@*
1147	-------------------------------------
1148	@*
1149	Unranged codes array
1150	@*
1151	-------------------------------------
1152
1153	@*
1154	The @dfn{Ranges indexes} @emph{size_arr} section helps to find
1155	the offset of the needed range in the @emph{size_arr} and has
1156	the following format (triads):
1157	@*
1158	the first code in range, the last code in range, range offset.
1159
1160	@*
1161	The array of these triads is sorted by the first element, therefore it is
1162	possible to quickly find the needed range index.
1163
1164	@*
1165	Each range has the corresponding sub-array containing the "to" codes. These
1166	sub-arrays are stored in the place marked as "Ranges" in the layout
1167	diagram.
1168
1169	@*
1170	The "Unranged codes array" contains pairs ("from" code, "to" code) for
1171	each unranged code. The array of these pairs is sorted by "from" code
1172	values, therefore it is possible to find the needed pair quickly.
1173
1174	@*
1175	Note that each range requires 6 bytes to form its index. If, for
1176	example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
1177	(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
1178	code (total 16). But it is better to join both ranges as 1 - 10 and
1179	mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
1180	range index and 4 bytes to mark codes 6 and 8 as absent are needed
1181	(total 10 bytes). This optimization is done in the size-optimized tables.
1182	Thus, ranges may contain small gaps. The absent codes in ranges are marked
1183	as 0xFFFF.
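The byte counts in the example above can be checked with a few lines of C. This is a sketch of the arithmetic only; the function and constant names are hypothetical, and the "to" codes of the present codes are left out because they cost the same in both layouts.

```c
#include <stddef.h>

enum {
    RANGE_INDEX_BYTES   = 6, /* one triad of 2-byte values */
    UNRANGED_PAIR_BYTES = 4, /* "from" code plus "to" code */
    ABSENT_MARK_BYTES   = 2  /* one 0xFFFF filler inside a range */
};

/* Overhead in bytes for a given number of ranges, unranged codes and gaps. */
static size_t overhead_bytes(size_t ranges, size_t unranged, size_t gaps)
{
    return ranges * RANGE_INDEX_BYTES
         + unranged * UNRANGED_PAIR_BYTES
         + gaps * ABSENT_MARK_BYTES;
}
```

With these constants, two ranges and one unranged code give 16 bytes, while one joined range with two gaps gives 10 bytes, matching the figures in the text.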
1184
1185	@*
1186	Note that a pair of consecutive "from" codes is stored by means of unranged codes since
1187	the number of 2-byte elements needed to form the range (a triad plus two "to" codes)
1188	is greater than the number needed to store two unranged pairs (5 against 4).
1189
1190	@*
1191	The algorithm of searching of the CCS code
1192	@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
1193	CCS" size-optimized table is as follows.
1194
1195	@*
1196	@enumerate
1197	@item Try to find the corresponding triad in the "Ranges indexes"
1198	section. Since we are searching in the sorted array, we can do it quickly
1199	using binary search (divide by 2, compare, etc).
1200
1201	@item If the triad is found, fetch the @emph{X} code from the corresponding
1202	range array. If it is 0xFFFF, return an error.
1203
1204	@item If there is no corresponding triad, search the @emph{X} code among the
1205	sorted unranged codes. Return an error if nothing was found.
1206	@end enumerate
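The search described above can be sketched in C as two binary searches. This is an illustration under stated assumptions rather than the Newlib code: the function name is hypothetical, the whole table is modelled as one array of 16-bit elements, and all offsets are assumed to be element indexes into that array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABSENT 0xFFFFu /* gap marker inside a range */

/* Returns the CCS code for the UCS-2 code y, or -1 if there is none. */
static int32_t size_opt_lookup(const uint16_t *size_arr, uint16_t y)
{
    assert(size_arr != NULL);

    unsigned nranges   = size_arr[0];     /* ranges number */
    unsigned nunranged = size_arr[1];     /* unranged codes number */
    unsigned unr_idx   = size_arr[2];     /* unranged codes array index */
    const uint16_t *triads = size_arr + 3; /* (first, last, offset) */

    /* Steps 1 and 2: binary search among the range triads. */
    unsigned lo = 0, hi = nranges;
    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;
        const uint16_t *t = triads + 3 * mid;
        if (y < t[0]) {
            hi = mid;
        } else if (y > t[1]) {
            lo = mid + 1;
        } else {
            uint16_t x = size_arr[t[2] + (y - t[0])];
            return x == ABSENT ? -1 : (int32_t)x;
        }
    }

    /* Step 3: binary search among the sorted unranged pairs. */
    lo = 0;
    hi = nunranged;
    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;
        const uint16_t *p = size_arr + unr_idx + 2 * mid;
        if (y < p[0])
            hi = mid;
        else if (y > p[0])
            lo = mid + 1;
        else
            return p[1];
    }
    return -1;
}
```

Both searches are logarithmic in the number of entries, which is the price paid for the smaller table compared with the direct indexing of the speed-optimized format.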
1207
1208	@subsection .cct and .c CCS Table files
1209	@*
1210	The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
1211	speed-optimized tables. The .c source files for 16-bit CCS tables have
1212	"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
1213	tables.
1214
1215	@*
1216	When .c files are compiled and used, all the 16-bit and 32-bit values
1217	have the native endian format (Big Endian for the BE systems and Little
1218	Endian for the LE systems) since they are compiled for the system before
1219	they are used.
1220
1221	@*
1222	In case of .cct files, which are intended for dynamic CCS tables
1223	loading, the CCS tables are stored either in LE or BE format. Since the
1224	.cct files are generated by the 'mktbl.pl' Perl script, it is possible
1225	to choose the endianness of the tables. It is also possible to store two
1226	copies (both LE and BE) of the CCS tables in one .cct file. The default
1227	.cct files (which come with the Newlib sources) have both LE and BE CCS
1228	tables. The Newlib iconv library automatically chooses the needed CCS tables
1229	(with appropriate endianness).
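How a loader might decide which copy of a table matches the host can be sketched in C as below. This is not taken from the Newlib sources; all names are hypothetical, and the two decoding helpers are shown only to make the difference between the LE and BE layouts concrete.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1); /* look at the lowest-addressed byte */
    return first == 1;
}

/* Decode a 16-bit value stored in little-endian byte order. */
static uint16_t load_u16_le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode a 16-bit value stored in big-endian byte order. */
static uint16_t load_u16_be(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

A table stored in the host's own byte order can be used in place without decoding each element, which is presumably why keeping both copies in one file is attractive.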
1230
1231	@*
1232	Note that the .cct files are only used when the
1233	@option{--enable-newlib-iconv-external-ccs} configure option is used.
1234
1235	@subsection The 'mktbl.pl' Perl script
1236	@*
1237	The 'mktbl.pl' script is intended to generate .cct and .c CCS table
1238	files from the @dfn{CCS source files}.
1239
1240	@*
1241	The CCS source files are just text files which have one or more columns
1242	with CCS <-> UCS-2 codes mapping. To see an example of a CCS table
1243	source file, follow one of the URLs given below.
1244
1245	@*
1246	The following table describes where the source files for CCS table files
1247	provided by the Newlib distribution are located.
1248
1249	@multitable @columnfractions .25 .75
1250	@item
1251	Name
1252	@tab
1253	URL
1254
1255	@item
1256	@tab
1257
1258	@item
1259	big5
1260	@tab
1261	http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
1262
1263	@item
1264	cns11643_plane1
1265	cns11643_plane14
1266	cns11643_plane2
1267	@tab
1268	http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
1269
1270	@item
1271	cp775
1272	cp850
1273	cp852
1274	cp855
1275	cp866
1276	@tab
1277	http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1278
1279	@item
1280	iso_8859_1
1281	iso_8859_2
1282	iso_8859_3
1283	iso_8859_4
1284	iso_8859_5
1285	iso_8859_6
1286	iso_8859_7
1287	iso_8859_8
1288	iso_8859_9
1289	iso_8859_10
1290	iso_8859_11
1291	iso_8859_13
1292	iso_8859_14
1293	iso_8859_15
1294	@tab
1295	http://www.unicode.org/Public/MAPPINGS/ISO8859/
1296
1297	@item
1298	iso_ir_111
1299	@tab
1300	http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
1301
1302	@item
1303	jis_x0201_1976
1304	jis_x0208_1990
1305	jis_x0212_1990
1306	@tab
1307	http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
1308
1309	@item
1310	koi8_r
1311	@tab
1312	http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
1313
1314	@item
1315	koi8_ru
1316	@tab
1317	http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
1318
1319	@item
1320	koi8_u
1321	@tab
1322	http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
1323
1324	@item
1325	koi8_uni
1326	@tab
1327	http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
1328
1329	@item
1330	ksx1001
1331	@tab
1332	http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
1333
1334	@item
1335	win_1250
1336	win_1251
1337	win_1252
1338	win_1253
1339	win_1254
1340	win_1255
1341	win_1256
1342	win_1257
1343	win_1258
1344	@tab
1345	http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1346	@end multitable
1347
1348	The CCS source files aren't distributed with Newlib because of license
1349	restrictions on most of the Unicode.org files.
1350
1351	The following are the 'mktbl.pl' options which were used to generate the .cct
1352	files. Note that to generate the CCS table source files, the @option{-s} option
1353	should be added.
1354
1355	@enumerate
1356	@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
1357	iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
1358	iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
1359	iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
1360	win_1256.cct, win_1258.cct, win_1251.cct,
1361	win_1253.cct, win_1255.cct, win_1257.cct,
1362	koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
1363	big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
1364	files, only the @option{-i <SRC_FILE_NAME>} option was used.
1365
1366	@item To generate the jis_x0208_1990.cct file, the
1367	@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.
1368
1369	@item To generate the cns11643_plane1.cct file, the
1370	@option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct}
1371	options were used.
1372
1373	@item To generate the cns11643_plane2.cct file, the
1374	@option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct}
1375	options were used.
1376
1377	@item To generate the cns11643_plane14.cct file, the
1378	@option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct}
1379	options were used.
1380	@end enumerate
1381
1382	@*
1383	For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.
1384
1385	@*
1386	It is assumed that CCS codes are 16 or fewer bits wide. If there are wider CCS codes
1387	in the CCS source file, the bits above the lower 16 define the plane (see the
1388	cns11643.txt CCS source file).
1389
1390	@*
1391	Sometimes, it is impossible to map some CCS codes to the 16-bit UCS if, for example,
1392	several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
1393	the pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
1394	codes}) aren't rejected; instead, they are mapped to the default
1395	UCS-2 code (which is currently the @kbd{?} character's code).
1396
1397
1398
1399
1400
1401	@page
1402	@node CES converters
1403	@section CES converters
1404	@findex PCS
1405	@*
1406	Similar to the CCS tables, CES converters are also split into "from UCS"
1407	and "to UCS" parts. Depending on the iconv library configuration, these
1408	parts are enabled or disabled.
1409
1410	@*
1411	The following is the list of CES converters which are currently present
1412	in the Newlib iconv library.
1413
1414	@itemize @bullet
1415	@item
1416	@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
1417	encodings. The @emph{euc} CES converter uses the @emph{table} and the
1418	@emph{us_ascii} CES converters.
1419
1420	@item
1421	@emph{table} - this CES converter corresponds to "null" and just performs
1422	tables-based conversion using 8- and 16-bit CCS tables. This converter
1423	is also used by any other CES converter which needs the CCS table-based
1424	conversions. The @emph{table} converter is also responsible for .cct files
1425	loading.
1426
1427	@item
1428	@emph{table_pcs} - this is the wrapper over the @emph{table} converter
1429	which is intended for 16-bit encodings which also use the @dfn{Portable
1430	Character Set} (@dfn{PCS}) which is the same as the @emph{US-ASCII}.
1431	This means that if the first byte of the CCS code is in the range [0x00-0x7f],
1432	this is the 7-bit PCS code. Else, this is the 16-bit CCS code. Of course,
1433	the 16-bit codes must not contain bytes in the range of [0x00-0x7f].
1434	The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
1435	@emph{table_pcs} CES converter depends on the @emph{table} CES converter.
1436
1437	@item
1438	@emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
1439	@emph{ucs_2le} encodings support.
1440
1441	@item
1442	@emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
1443	@emph{ucs_4le} encodings support.
1444
1445	@item
1446	@emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.
1447
1448	@item
1449	@emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.
1450
1451	@item
1452	@emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
1453	principle, the most natural way to support the @emph{us_ascii} encoding
1454	is to define the @emph{us_ascii} CCS and use the @emph{table} CES
1455	converter. But for the optimization purposes, the specialized
1456	@emph{us_ascii} CES converter was created.
1457
1458	@item
1459	@emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
1460	@emph{utf_16le} encodings support.
1461
1462	@item
1463	@emph{utf_8} - intended for the @emph{utf_8} encoding support.
1464	@end itemize
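The rule used by the @emph{table_pcs} converter in the list above (a first byte in [0x00-0x7f] is a 7-bit PCS code, anything else starts a 16-bit CCS code) can be sketched in C as follows. The function name and signature are hypothetical, and combining the two bytes with the first byte as the high byte is an assumption made for the illustration.

```c
#include <stddef.h>

/*
 * Reads one character from `in`. Returns the number of bytes consumed
 * (1 or 2), or 0 if the input is empty or ends in the middle of a
 * 16-bit code. *is_pcs is set to 1 for a PCS (US-ASCII) code.
 */
static size_t table_pcs_next(const unsigned char *in, size_t len,
                             unsigned *code, int *is_pcs)
{
    if (len == 0)
        return 0;
    if (in[0] <= 0x7F) {          /* 7-bit PCS code */
        *code = in[0];
        *is_pcs = 1;
        return 1;
    }
    if (len < 2)                  /* truncated 16-bit code */
        return 0;
    *code = ((unsigned)in[0] << 8) | in[1];
    *is_pcs = 0;
    return 2;
}
```

After this split, a PCS code needs no table at all, while a 16-bit code would be handed to the table-based conversion.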
1465
1466
1467
1468
1469
1470	@page
1471	@node The encodings description file
1472	@section The encodings description file
1473	@findex encoding.deps description file
1474	@findex mkdeps.pl Perl script
1475	@*
1476	The encodings description file simplifies the process of adding support for
1477	new encodings by allowing a lot of "glue" files to be generated automatically.
1478
1479	@*
1480	There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
1481	is used to describe encoding's properties. The 'mkdeps.pl' Perl script
1482	uses 'encoding.deps' to generate the "glue" files.
1483
1484	@*
1485	The 'encoding.deps' file is composed of sections, each section consists
1486	of entries, each entry contains some encoding/CES/CCS description.
1487
1488	@*
1489	The 'encoding.deps' file's syntax is very simple. Currently only two
1490	sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.
1491
1492	@*
1493	Each @emph{ENCODINGS} section's entry describes one encoding and
1494	contains the following information.
1495
1496	@itemize @bullet
1497	@item
1498	Encoding name (the @emph{ENCODING} field). The name should
1499	be unique and only one name is possible.
1500
1501	@item
1502	The encoding's CES converter name (the @emph{CES} field). Only one CES
1503	converter is allowed.
1504
1505	@item
1506	The whitespace-separated list of CCS table names which are used by the
1507	encoding (the @emph{CCS} field).
1508
1509	@item
1510	The whitespace-separated list of alias names (the @emph{ALIASES}
1511	field).
1512	@end itemize
1513
1514	@*
1515	Note that all names in the 'encoding.deps' file must be in the normalized
1516	form.
1517
1518	@*
1519	Each @emph{CES_DEPENDENCIES} section's entry describes dependencies of
1520	one CES converter. For example, the @emph{euc} CES converter depends on
1521	the @emph{table} and the @emph{us_ascii} CES converters since the
1522	@emph{euc} CES converter uses them. This means that both @emph{table}
1523	and @emph{us_ascii} CES converters should be linked if the @emph{euc}
1524	CES converter is enabled.
1525
1526	@*
1527	The @emph{CES_DEPENDENCIES} section defines the following:
1528
1529	@itemize @bullet
1530	@item
1531	the CES converter name for which the dependencies are defined in this
1532	entry (the @emph{CES} field);
1533
1534	@item
1535	the whitespace-separated list of CES converters which are needed for
1536	this CES converter (the @emph{USED_CES} field).
1537	@end itemize
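The effect of a CES dependency entry can be pictured with C preprocessor logic along the following lines. The macro names here are hypothetical and are not the ones used in the generated Newlib headers; the sketch only shows how enabling one converter can pull in the converters it uses.

```c
/* Hypothetical switch: pretend the user enabled an euc-based encoding. */
#define DEMO_CES_EUC 1

/* The euc converter uses table and us_ascii, so enable them as well. */
#if defined(DEMO_CES_EUC)
#  ifndef DEMO_CES_TABLE
#    define DEMO_CES_TABLE 1
#  endif
#  ifndef DEMO_CES_US_ASCII
#    define DEMO_CES_US_ASCII 1
#  endif
#endif

/* Helpers that report what ended up enabled at compile time. */
static int demo_ces_table_enabled(void)
{
#ifdef DEMO_CES_TABLE
    return 1;
#else
    return 0;
#endif
}

static int demo_ces_us_ascii_enabled(void)
{
#ifdef DEMO_CES_US_ASCII
    return 1;
#else
    return 0;
#endif
}
```

Resolving dependencies in the preprocessor means that converters nobody needs are never compiled or linked.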
1538
1539	@*
1540	The 'mkdeps.pl' Perl script automatically solves the following tasks.
1541
1542	@itemize @bullet
1543	@item
1544	User works with the iconv library in terms of encodings and doesn't know
1545	anything about CES converters and CCS tables. The script automatically
1546	generates code which enables all needed CES converters and CCS tables
1547	for all encodings, which were enabled by the user.
1548
1549	@item
1550	The CES converters may have dependencies and the script automatically
1551	generates the code which handles these dependencies.
1552
1553	@item
1554	The list of encoding's aliases is also automatically generated.
1555
1556	@item
1557	The script uses a lot of macros in order to enable only the minimum set
1558	of code/data which is needed to support the requested encodings in the
1559	requested directions.
1560	@end itemize
1561
1562	@*
1563	The 'mkdeps.pl' Perl script is intended to interpret the 'encoding.deps'
1564	file and generate the following files.
1565
1566	@itemize @bullet
1567	@item
1568	@emph{lib/encnames.h} - this header file contains macro definitions for all
1569	encoding names
1570
1571	@item
1572	@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
1573	is used to find the name of requested encoding by its alias.
1574
1575	@item
1576	@emph{ces/cesbi.c} - this file defines two arrays
1577	(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
1578	description of enabled "to UCS" and "from UCS" CES converters and the
1579	names of encodings which are supported by these CES converters.
1580
1581	@item
1582	@emph{ces/cesbi.h} - this file contains the set of macros which define
1583	the set of CES converters which should be enabled if only the set of
1584	enabled encodings is given (through macros defined in the
1585	@emph{newlib.h} file). Note, that one CES converter may handle several
1586	encodings.
1587
1588	@item
1589	@emph{ces/cesdeps.h} - the CES converters dependencies are handled in
1590	this file.
1591
1592	@item
1593	@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
1594	here.
1595
1596	@item
1597	@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
1598	CCS names.
1599
1600	@item
1601	@emph{encoding.aliases} - the list of supported encodings and their
1602	aliases which is intended for the Newlib configure scripts in order to
1603	handle the iconv-related configure script options.
1604	@end itemize
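The alias lookup mentioned for @emph{lib/aliasesbi.c} in the list above can be sketched in C as below. The data layout, the names and the sample entries are hypothetical and do not reproduce the generated file; the sketch assumes the alias passed in is already in normalized form.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: one (alias, encoding name) pair per entry. */
struct demo_alias {
    const char *alias;
    const char *name;
};

/* Illustrative entries only; every name is also listed as its own alias. */
static const struct demo_alias demo_aliases[] = {
    { "big5",       "big5" },
    { "iso_8859_1", "iso_8859_1" },
    { "latin1",     "iso_8859_1" },
    { "koi8_r",     "koi8_r" },
};

/* Returns the encoding name for a normalized alias, or NULL if unknown. */
static const char *demo_find_encoding(const char *alias)
{
    size_t n = sizeof(demo_aliases) / sizeof(demo_aliases[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(demo_aliases[i].alias, alias) == 0)
            return demo_aliases[i].name;
    }
    return NULL;
}
```

A linear scan is enough for a sketch; a generated table could equally be sorted and searched by bisection.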
1605
1606
1607
1608
1609
1610	@page
1611	@node How to add new encoding
1612	@section How to add new encoding
1613	@*
1614	First, the new encoding should be broken down into CCS and CES. Then,
1615	the process of adding the new encoding is split into the following activities.
1616
1617	@enumerate
1618	@item Generate the .cct CCS file and the .c source file for the new
1619	encoding's CCS (if it isn't already present). To do this, the CCS source
1620	file must be available and the 'mktbl.pl' script should be used.
1621
1622	@item Write the corresponding CES converter (if it isn't already
1623	present). Use the existing CES converters as an example.
1624
1625	@item
1626	Add the corresponding entries to the 'encoding.deps' file and regenerate
1627	the autogenerated "glue" files using the 'mkdeps.pl' script.
1628
1629	@item
1630	Don't forget to add entries to the newlib/newlib.hin file.
1631
1632	@item
1633	Of course, the 'Makefile.am'-s should also be updated (if new files were
1634	added) and the 'Makefile.in'-s should be regenerated using the correct
1635	version of 'automake'.
1636
1637	@item
1638	Don't forget to update the documentation (the list of
1639	supported encodings and CES converters).
1640	@end enumerate
1641
1642	In case a new encoding doesn't fit the CES/CCS decomposition model or
1643	it is desired to add the specialized (non UCS-based) conversion support,
1644	the Newlib iconv library code should be upgraded.
1645
1646
1647
1648
1649
1650	@page
1651	@node The locale support interfaces
1652	@section The locale support interfaces
1653	@*
1654	The newlib iconv library also has some interface functions (besides the
1655	@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
1656	are intended for the Locale subsystem. All the locale-related code is
1657	placed in the @emph{lib/iconvnls.c} file.
1658
1659	@*
1660	The following is the description of the locale-related interfaces:
1661
1662	@itemize @bullet
1663	@item
1664	@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
1665	wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
1666	passed in the function parameters. The @emph{wchar_t} characters encoding is
1667	either ucs_2_internal or ucs_4_internal depending on size of
1668	@emph{wchar_t}.
1669
1670	@item
1671	@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
1672	function, but if there is no character in the output encoding which
1673	corresponds to the character in the input encoding, the default
1674	conversion isn't performed (the @code{iconv} function sets such output
1675	characters to the @kbd{?} symbol and this is the behavior, which is
1676	specified in SUSv3).
1677
1678	@item
1679	@code{_iconv_nls_get_state} - returns the current encoding's shift state
1680	(the @code{mbstate_t} object).
1681
1682	@item
1683	@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
1684	@code{mbstate_t} object).
1685
1686	@item
1687	@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
1688	or stateless.
1689
1690	@item
1691	@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
1692	maximum bytes number) of the encoding's characters.
1693	@end itemize
1694
1695
1696
1697
1698	@page
1699	@node Contact
1700	@section Contact
1701	@*
1702	The author of the original BSD iconv library (Alexander Chuguev) no longer
1703	supports that code.
1704
1705	@*
1706	Any questions regarding the iconv library may be forwarded to
1707	Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
1708	well as to the public Newlib mailing list.
1709
