File: libmoe.shtml

package info (click to toggle)
libmoe 1.5.8-1
  • links: PTS
  • area: main
  • in suites: jessie, jessie-kfreebsd, squeeze, wheezy
  • size: 6,780 kB
  • ctags: 267,602
  • sloc: ansic: 478,515; perl: 2,318; makefile: 201; sh: 33
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html">
    <meta http-equiv="Pragma" content="no-cache">
    <title>Kiyokazu in (hopefully) hacker mode (multi octet character encoding handling library)</title>
  </head>
  <body>
    <a href="/">&lt;Top of this site&gt;</a>
    <a href="/prog/">&lt;Top of programming pages in this site&gt;</a>

    <hr>

    <h3>
      Functions for handling multi-octet character encoding schemes
    </h3>

    <hr>

    <p><a name="stable"></a>
      <blockquote>
	<!--#include virtual="/cgi-perl/showfile?/prog/pub/libmoe-[0-9]*.tar.gz"-->
      </blockquote>
      is a gzipped tarball of a collection of functions
      for handling sequences of characters that consist of multiple octets.
      It includes
      <a href="mbconv.html">a character encoding conversion tool</a>
      which was originally written to debug this library.
      Despite that original purpose,
      I believe it is a very useful tool in its own right.
      You can view the ChangeLog:
      <blockquote>
	<!--#include virtual="/cgi-perl/showfile?/prog/pub/libmoe-[0-9]*-ChangeLog.txt"-->
      </blockquote>
      which is included in the above tarball.
    </p>

    <p><a name="working"></a>
      The development version:
      <blockquote>
	<!--#include virtual="/cgi-perl/showfile?/prog/pub/libmoe-devel.tar.gz"-->
      </blockquote>
      and its ChangeLog
      <blockquote>
	<!--#include virtual="/cgi-perl/showfile?/prog/pub/libmoe-devel-ChangeLog.txt"-->
      </blockquote>
      are also available.
    </p>

    <p>
      The library has two main functions.
      The first is to compute, from a character encoded in multiple octets,
      a non-negative integer
      that carries complete information about both
      the coded character set containing the character and
      the codepoint of the character within that set.
      The second is to reproduce the original octet sequence from that integer.
      For convenience, this document calls the integer a Universal Code Point (UCP).
    </p>

    <hr>

    <h3><a name="requirement">Requirement</a></h3>

    <p>
      To build and install this library,
      you need a C compiler and libraries conforming to the ANSI C standard.
      In addition:

      <ul>
      <li>the &quot;int&quot; type of your C compiler must be at least 32 bits wide,</li>
      <li>your stdio library must provide the functions &quot;fileno()&quot; and &quot;fdopen()&quot;,</li>
      <li>
	if you are going to use the included Makefile,
	you need
	GNU Make,
	GNU fileutils,
	GNU binutils, and a GNU C compiler
	that supports shared objects.
      </li>
      </ul>

      I strongly recommend using the GNU C compiler and GNU Make.
    </p>

    <p>
      If you build with the included Makefile,
      you need to tell your dynamic linker
      which directory (/usr/local/lib) the shared library is installed in.
    </p>

    <p>
      For example, if you are installing on a Linux box,
      add the line
<pre>
/usr/local/lib
</pre>
      to the file /etc/ld.so.conf,
      unless it already contains such a line,
      and then run the command
<pre>
/sbin/ldconfig
</pre>
    </p>

    <hr>

    <h3><a name="encodings">Acceptable encodings</a></h3>

    <p>
      This library can handle the following subset of the ISO 2022 escape sequences:

      <ul>
      <li>designating an ISO 2022 registered character set on an intermediate buffer,</li>
      <li>designating UTF-8,</li>
      <li>return from UTF-8,</li>
      <li>locking shift,</li>
      <li>
	7-bit single shift by
	1/11 4/14 (ESC N)
	or
	1/11 4/15 (ESC O),
      </li>
      <li>
	8-bit single shift by
	8/14 (SS2)
	or
	8/15 (SS3).
      </li>
      </ul>

      It can also handle the following non-ISO 2022 encodings:

      <ul>
      <li>UTF-8, UTF-16, UTF-16BE, UTF-16LE,</li>
      <li><a href="#x-moe-internal">X-MOE-INTERNAL</a>,</li>
      <li>Shift_JIS,</li>
      <li>Big Five,</li>
      <li>EUC-TW,</li>
      <li>GBK, GB 18030-2000 (a.k.a. GBK2K),</li>
      <li>Johab,</li>
      <li>Unified Hangul,</li>
      <li>KOI8-R,</li>
      <li>KOI8-U,</li>
      <li>Microsoft Windows Codepages 1250 to 1258.</li>
      </ul>

      Characters in these encodings can be embedded in ISO 2022 encoded character sequences
      with the leading escape sequence

      <blockquote>
	1/11 2/5 2/1 2/X 3/Y
      </blockquote>

      and the trailing escape sequence

      <blockquote>
	1/11 2/5 4/0
      </blockquote>

      where X * 0x10 + Y is the integer that the library assigns to the encoding.
    </p>
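    <p>
      As an illustration of the escape sequences above,
      here is a minimal sketch that builds the leading and trailing octets.
      The function names are hypothetical and are not part of this library's API.
      The sketch assumes the usual ISO 2022 column/row notation,
      in which c/r denotes the octet c * 16 + r,
      and assumes that X and Y each range over 0 to 15.
      The encoding number passed in must be one actually assigned by the library;
      the sketch does not know those assignments.
    </p>
<pre>
```c
/* Hypothetical helpers for illustration; not libmoe's own API.
 * ISO 2022 column/row notation c/r denotes the octet c * 16 + r. */

/* Write the leading sequence 1/11 2/5 2/1 2/X 3/Y for the encoding
 * number id = X * 0x10 + Y (assumed range 0..255).
 * Returns the number of octets written, or 0 if id is out of range. */
int moe_sketch_leading_escape(unsigned int id, unsigned char out[5])
{
    if (id > 0xFF)
        return 0;
    out[0] = 0x1B;                                 /* 1/11 (ESC) */
    out[1] = 0x25;                                 /* 2/5 */
    out[2] = 0x21;                                 /* 2/1 */
    out[3] = (unsigned char)(0x20 | (id >> 4));    /* 2/X */
    out[4] = (unsigned char)(0x30 | (id & 0x0F));  /* 3/Y */
    return 5;
}

/* Write the trailing sequence 1/11 2/5 4/0, which returns to ISO 2022.
 * Returns the number of octets written. */
int moe_sketch_trailing_escape(unsigned char out[3])
{
    out[0] = 0x1B;  /* 1/11 (ESC) */
    out[1] = 0x25;  /* 2/5 */
    out[2] = 0x40;  /* 4/0 */
    return 3;
}
```
</pre>
    <p>
      A caller would write the leading octets, then the characters in the embedded encoding,
      then the trailing octets.
    </p>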

    <hr>

    <h3><a name="ucp">Universal Code Point</a></h3>

    <p>
      The library classifies coded character sets (CCS) into six categories:

      <ul>
      <li>Unicode,</li>
      <li>94 set in ISO 2022,</li>
      <li>96 set in ISO 2022,</li>
      <li>7 bit set not in ISO 2022,</li>
      <li>94x94 set, and</li>
      <li>15 bit set not in ISO 2022.</li>
      </ul>

      Characters in Unicode
      are assigned UCPs equal to their Unicode codepoints.
    </p>

    <p>
      For a character in any other CCS,
      it is somewhat difficult to describe in natural language how the UCP is determined.
      Roughly speaking,
      we arrange all the codepoints into one big sequence,
      ordered first by the categories above and then by the final octet of the escape sequence designating the CCS.
      The UCP is the bitwise OR of the character's index (starting from 0) in this big sequence and 1U &lt;&lt; 21.
    </p>
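    <p>
      The last step above can be sketched in a few lines of C.
      The names below are hypothetical and do not come from the library.
      Computing the index itself is not shown,
      because it depends on the library's internal ordering of character sets.
      The Unicode test assumes that Unicode codepoints do not exceed 0x10FFFF,
      so that bit 21 is never set for them.
    </p>
<pre>
```c
/* Hypothetical names for illustration; not libmoe's own API. */

/* Bit 21 marks a UCP that refers to a non-Unicode character set.
 * The text writes 1U << 21; unsigned long is used here so the
 * sketch does not depend on the width of int. */
#define MOE_SKETCH_NON_UNICODE_BIT (1UL << 21)

/* Combine an index (starting from 0) in the big sequence of
 * non-Unicode codepoints with the marker bit. */
unsigned long moe_sketch_ucp_from_index(unsigned long index)
{
    return index | MOE_SKETCH_NON_UNICODE_BIT;
}

/* Nonzero if the UCP is a plain Unicode codepoint.
 * Assumption: Unicode codepoints are at most 0x10FFFF. */
int moe_sketch_ucp_is_unicode(unsigned long ucp)
{
    return (ucp & MOE_SKETCH_NON_UNICODE_BIT) == 0;
}
```
</pre>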

    <hr>

    <h3><a name="x-moe-internal">Internal multiple octet encoding</a></h3>

    <p>
      The library supports a stateless encoding scheme, which we call &quot;x-moe-internal&quot;,
      that can represent every UCP within a single document:

      <dl>
      <dt>UCP less than 0x80:</dt>
      <dd>0xxxxxxx,</dd>
      <dt>UCP greater than or equal to 0x80:</dt>
      <dd>11xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.</dd>
      </dl>
    </p>
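    <p>
      The two forms above can be sketched as a pair of C functions.
      The names are hypothetical and are not the library's own.
      The description does not state how the 24 payload bits are ordered,
      so the sketch assumes the most significant bits come first, as in UTF-8.
    </p>
<pre>
```c
/* Hypothetical helpers for illustration; not libmoe's own API.
 * Assumption: the 24 payload bits are stored most significant first. */

/* Encode ucp into out (room for 4 octets).
 * Returns the number of octets written, or 0 if ucp needs more than 24 bits. */
int moe_sketch_internal_encode(unsigned long ucp, unsigned char *out)
{
    if (ucp < 0x80) {
        out[0] = (unsigned char)ucp;                        /* 0xxxxxxx */
        return 1;
    }
    if (ucp > 0xFFFFFFUL)
        return 0;
    out[0] = (unsigned char)(0xC0 | ((ucp >> 18) & 0x3F));  /* 11xxxxxx */
    out[1] = (unsigned char)(0x80 | ((ucp >> 12) & 0x3F));  /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | ((ucp >> 6) & 0x3F));   /* 10xxxxxx */
    out[3] = (unsigned char)(0x80 | (ucp & 0x3F));          /* 10xxxxxx */
    return 4;
}

/* Decode one character from in. The caller must guarantee that 4 octets
 * are readable whenever in[0] is 0x80 or greater.
 * Returns the number of octets consumed, or 0 if the octets are malformed. */
int moe_sketch_internal_decode(const unsigned char *in, unsigned long *ucp)
{
    if (in[0] < 0x80) {
        *ucp = in[0];
        return 1;
    }
    if ((in[0] & 0xC0) != 0xC0 || (in[1] & 0xC0) != 0x80 ||
        (in[2] & 0xC0) != 0x80 || (in[3] & 0xC0) != 0x80)
        return 0;
    *ucp = ((unsigned long)(in[0] & 0x3F) << 18) |
           ((unsigned long)(in[1] & 0x3F) << 12) |
           ((unsigned long)(in[2] & 0x3F) << 6) |
           (unsigned long)(in[3] & 0x3F);
    return 4;
}
```
</pre>
    <p>
      Because every character occupies either one or four octets
      and the first octet alone tells which,
      a decoder needs no state carried over from earlier characters.
    </p>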
    <!--#include virtual="/signature.html"-->
  </body>
</html>