fix some encoding stuff and add some documentation.
git-svn-id: https://yap.svn.sf.net/svnroot/yap/trunk@1863 b08c6af1-5177-4d33-ba66-4b1c6b8b522a
197
docs/yap.tex
@@ -138,6 +138,7 @@ Subnodes of Running
Subnodes of Syntax
* Formal Syntax:: Syntax of Terms
* Tokens:: Syntax of Prolog tokens
* Encoding:: How characters are encoded and Wide Character Support

Subnodes of Tokens
* Numbers:: Integer and Floating-Point Numbers
@@ -151,6 +152,10 @@ Subnodes of Numbers
* Integers:: How Integers are read and represented
* Floats:: Floating Point Numbers

Subnodes of Encoding
* Stream Encoding:: How Prolog Streams can be coded
* BOM:: The Byte Order Mark

Subnodes of Loading Programs
* Compiling:: Program Loading and Updating
* Setting the Compiler:: Changing the compiler's parameters
@@ -1029,6 +1034,7 @@ built.
@menu
* Formal Syntax:: Syntax of terms
* Tokens:: Syntax of Prolog tokens
* Encoding:: How characters are encoded and Wide Character Support
@end menu

@node Formal Syntax, Tokens, ,Syntax
@@ -1116,7 +1122,7 @@ dot with single quotes.

@end itemize

@node Tokens, , Formal Syntax, Syntax
@node Tokens, Encoding, Formal Syntax, Syntax
@section Prolog Tokens
@cindex token
@@ -1362,6 +1368,159 @@ layout characters, the YAP parser behaves as if it had found a
single blank character. The end of a file also counts as a blank
character for this purpose.

@node Encoding, , Tokens, Syntax
@section Wide Character Support
@cindex encodings

@menu
* Stream Encoding:: How Prolog Streams can be coded
* BOM:: The Byte Order Mark
@end menu

@cindex UTF-8
@cindex Unicode
@cindex UCS
@cindex internationalization
YAP now implements a SWI-Prolog compatible interface to wide
characters and the Universal Character Set (UCS). The following text
was adapted from the SWI-Prolog manual.

YAP now supports wide characters, that is, characters with character
codes above 255 that cannot be represented in a single byte.
@emph{Universal Character Set} (UCS) is the ISO/IEC 10646 standard
that specifies a unique 31-bit unsigned integer for any character in
any language. It is a superset of 16-bit Unicode, which in turn is
a superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII. UCS
can handle strings holding characters from multiple languages, and
character classification (uppercase, lowercase, digit, etc.) and
operations such as case-conversion are unambiguously defined.

For this reason YAP, following SWI-Prolog, has two representations for
atoms. If the text fits in ISO Latin-1, it is represented as an array
of 8-bit characters. Otherwise the text is represented as an array of
wide chars, which may take 16 or 32 bits. This representational issue
is completely transparent to the Prolog user. Users of the foreign
language interface, however, sometimes need to be aware of it.
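
As an illustration only, the choice between the two atom representations can be sketched as a check on the largest character code. The sketch is written in Python rather than in YAP's C sources, and @code{pick_representation} is a hypothetical name, not a YAP function.

```python
def pick_representation(text: str) -> str:
    """Return 'latin1' if every character code fits in 8 bits, else 'wide'.

    Hypothetical helper: it only mirrors the rule described in the text
    and says nothing about how YAP stores atoms internally.
    """
    # ISO Latin-1 covers character codes 0..255.
    if all(ord(ch) <= 0xFF for ch in text):
        return "latin1"
    return "wide"


print(pick_representation("café"))    # é is code 233, so 8-bit storage suffices
print(pick_representation("λ-term"))  # λ is code 955, so wide chars are needed
```

A single character above code 255 is enough to force the wide representation for the whole text.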

Character coding comes into view when the characters of strings need to be
read from or written to a file, or when they have to be communicated to
other software components using the foreign language interface. In this
section we only deal with I/O through streams, which includes file I/O
as well as I/O through network sockets.


@node Stream Encoding, BOM, , Encoding
@subsection Wide character encodings on streams



Although characters are uniquely coded using the UCS standard
internally, streams and files are byte (8-bit) oriented and there are a
variety of ways to represent the larger UCS codes in an 8-bit octet
stream. The most popular one, especially in the context of the web, is
UTF-8. Bytes 0...127 simply represent the corresponding US-ASCII
character, while bytes 128...255 are used for multi-byte
encoding of characters placed higher in the UCS space. Especially on
MS-Windows, the 16-bit Unicode standard, represented by pairs of bytes, is
also popular.
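
The byte layouts described above can be inspected with a short Python sketch. It uses Python's standard codecs, not YAP, and @code{show_bytes} is a hypothetical helper; the dictionary keys are labels chosen for this example.

```python
def show_bytes(ch: str) -> dict:
    """Encode one character under UTF-8 and big/little-endian UTF-16.

    Hypothetical helper for illustration; returns hex strings of the bytes.
    """
    return {
        "utf8": ch.encode("utf-8").hex(),
        "utf16_be": ch.encode("utf-16-be").hex(),
        "utf16_le": ch.encode("utf-16-le").hex(),
    }


# One ASCII character, one Latin-1 character, one character above code 255.
for ch in ("A", "é", "€"):
    print(repr(ch), hex(ord(ch)), show_bytes(ch))
```

The ASCII character keeps its single byte in UTF-8, while the other two use only bytes in the 128...255 range. In the 16-bit encodings every character here takes exactly two bytes, in an order that depends on endianness.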

Prolog I/O streams have a property called @emph{encoding}, which
specifies the encoding used and influences @code{get_code/2} and
@code{put_code/2} as well as all the other text I/O predicates.

The default encoding for files is derived from the Prolog flag
@code{encoding}, which is initialised from the environment. If the
environment variable @env{LANG} ends in "UTF-8", this encoding is
assumed. Otherwise the default is @code{text} and the translation is
left to the wide-character functions of the C-library (note that the
Prolog native UTF-8 mode is considerably faster than the generic
@code{mbrtowc()} one). The encoding can be specified explicitly in
@code{load_files/2} for loading Prolog source with an alternative
encoding, in @code{open/4} when opening files, or using @code{set_stream/2} on
any open stream (not yet implemented). For Prolog source files we also
provide the @code{encoding/1} directive, which can be used to switch
between encodings that are compatible with US-ASCII (@code{ascii},
@code{iso_latin_1}, @code{utf8} and many locales).
@c See also
@c \secref{intsrcfile} for writing Prolog files with non-US-ASCII
@c characters and \secref{unicodesyntax} for syntax issues.
For additional information and Unicode resources, please visit
@uref{http://www.unicode.org/}.

YAP currently defines and supports the following encodings:

@table @code
@item octet
Default encoding for @emph{binary} streams. This causes
the stream to be read and written fully untranslated.

@item ascii
7-bit encoding in 8-bit bytes. Equivalent to @code{iso_latin_1},
but generates errors and warnings on encountering values above
127.

@item iso_latin_1
8-bit encoding supporting many western languages. This causes
the stream to be read and written fully untranslated.

@item text
C-library default locale encoding for text files. Files are read and
written using the C-library functions @code{mbrtowc()} and
@code{wcrtomb()}. This may be the same as one of the other encodings,
notably it may be the same as @code{iso_latin_1} for western
languages and @code{utf8} in a UTF-8 context.

@item utf8
Multi-byte encoding of full UCS, compatible with @code{ascii}.
See above.

@item unicode_be
Unicode Big Endian. Reads input in pairs of bytes, most
significant byte first. Can only represent 16-bit characters.

@item unicode_le
Unicode Little Endian. Reads input in pairs of bytes, least
significant byte first. Can only represent 16-bit characters.
@end table

Note that not all encodings can represent all characters. This implies
that writing text to a stream may cause errors because the stream
cannot represent these characters. The behaviour of a stream on these
errors can be controlled using @code{open/4} or @code{set_stream/2} (not
implemented). Initially the terminal streams write such characters using
Prolog escape sequences while other streams generate an I/O exception.
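
The three behaviours on unrepresentable characters have close analogues in Python's codec error handlers, which makes for a quick experiment. This is an analogy only: @code{write_with_policy} is a hypothetical helper, and Python's backslash escape is spelled differently from the Prolog escape sequence that YAP writes.

```python
# Mapping from the mode names used in this manual to Python error handlers.
# The mapping is an assumption made for illustration, not part of YAP.
_HANDLERS = {
    "error": "strict",             # raise an exception
    "prolog": "backslashreplace",  # write a backslash escape instead
    "xml": "xmlcharrefreplace",    # write an &#...; character entity
}


def write_with_policy(text: str, encoding: str, policy: str) -> bytes:
    """Encode text, handling unrepresentable characters per the policy."""
    return text.encode(encoding, errors=_HANDLERS[policy])


sample = "price: €"  # the euro sign is not part of ISO Latin-1
print(write_with_policy(sample, "latin-1", "prolog"))
print(write_with_policy(sample, "latin-1", "xml"))
try:
    write_with_policy(sample, "latin-1", "error")
except UnicodeEncodeError as exc:
    print("encoding error:", exc.reason)
```

With a UTF-8 target none of the policies is triggered, since UTF-8 can represent every UCS character.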


@node BOM, , Stream Encoding, Encoding
@subsection BOM: Byte Order Mark

@cindex BOM
@cindex Byte Order Mark
From @ref{Stream Encoding}, you may have got the impression that text files are
complicated. This section deals with a related topic that often makes life
easier for the user, but gives the programmer another thing to worry about.
@strong{BOM} or @emph{Byte Order Mark} is a technique for
identifying Unicode text files as well as the encoding they use. Such
files start with the Unicode character @code{0xFEFF}, a non-breaking,
zero-width space character. This is a fairly distinctive sequence that is not
likely to be the start of a non-Unicode file and uniquely distinguishes
the various Unicode file formats. As it is a zero-width blank, it does not
even produce any output. This solves all problems, or does it?

Some formats start off as US-ASCII and may contain an encoding mark to
switch to UTF-8, such as the @code{encoding="UTF-8"} in an XML header.
Such formats often explicitly forbid the use of a UTF-8 BOM. In
other cases additional information already identifies the encoding, making
the use of a BOM redundant or even illegal.

The BOM is handled by the @code{open/4} predicate. By default, text files are
probed for the BOM when opened for reading. If a BOM is found, the
encoding is set accordingly and the property @code{bom(true)} is
available through @code{stream_property/2}. When opening a file for
writing, writing a BOM can be requested using the option
@code{bom(true)} with @code{open/4}.
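
The probing step can be sketched in a few lines of Python. This is not YAP's implementation: @code{probe_bom} is a hypothetical helper, and pairing each BOM with the encoding names @code{utf8}, @code{unicode_be} and @code{unicode_le} is an assumption based on the encodings listed in this manual.

```python
import codecs

# BOM byte sequences paired with the encoding names used in this manual.
# The pairing is an assumption for illustration only.
_BOMS = [
    (codecs.BOM_UTF8, "utf8"),            # EF BB BF
    (codecs.BOM_UTF16_BE, "unicode_be"),  # FE FF
    (codecs.BOM_UTF16_LE, "unicode_le"),  # FF FE
]


def probe_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


head = b"\xef\xbb\xbf:- initialization(main)."
encoding, skip = probe_bom(head)
print(encoding, head[skip:])  # the BOM itself is not part of the text
```

The sketch ignores 32-bit encodings, whose little-endian BOM also begins with the bytes FF FE, so a fuller probe would have to test the longer marks first.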

@node Loading Programs, Modules, Syntax, Top
@chapter Loading Programs
@@ -3381,6 +3540,24 @@ concerning the stream.
The operation will fail and give an error if the alias name is already
in use. YAP allows several aliases for the same file, but only
one is returned by @code{stream_property/2}

@item bom(+@var{Bool})
If present and @code{true}, a BOM (@emph{Byte Order Mark}) was
detected while opening the file for reading or a BOM was written while
opening the stream. See @ref{BOM} for details.

@item encoding(+@var{Encoding})
Set the encoding used for text. See @ref{Encoding} for an overview of
wide character and encoding issues.

@item representation_errors(+@var{Mode})
Change the behaviour when writing characters to the stream that cannot
be represented by the encoding. The behaviour is one of @code{error}
(throw an I/O error exception), @code{prolog} (write a @code{\u...\}
escape code) or @code{xml} (write a @code{&#...;} XML character entity).
The initial mode is @code{prolog} for the user streams and
@code{error} for all other streams. See also @ref{Encoding}.

@end table

@item close(+@var{S}) [ISO]
@@ -3550,6 +3727,24 @@ seekable.
@item type(@var{T})
Whether the stream is a @code{text} stream or a @code{binary} stream.

@item bom(+@var{Bool})
If present and @code{true}, a BOM (@emph{Byte Order Mark}) was
detected while opening the file for reading or a BOM was written while
opening the stream. See @ref{BOM} for details.

@item encoding(+@var{Encoding})
Query the encoding used for text. See @ref{Encoding} for an
overview of wide character and encoding issues in YAP.

@item representation_errors(+@var{Mode})
Behaviour when writing characters to the stream that cannot be
represented by the encoding. The behaviour is one of @code{error}
(throw an I/O error exception), @code{prolog} (write a @code{\u...\}
escape code) or @code{xml} (write a @code{&#...;} XML character entity).
The initial mode is @code{prolog} for the user streams and
@code{error} for all other streams. See also @ref{Encoding} and
@code{open/4}.

@end table

@end table