fix some encoding stuff and add some documentation.

git-svn-id: https://yap.svn.sf.net/svnroot/yap/trunk@1863 b08c6af1-5177-4d33-ba66-4b1c6b8b522a
2007-04-03 15:03:11 +00:00
parent 917c777381
commit 35174e0901
6 changed files with 321 additions and 11 deletions
--- a/docs/yap.tex
+++ b/docs/yap.tex
@@ -138,6 +138,7 @@ Subnodes of Running
 Subnodes of Syntax
 * Formal Syntax:: Syntax of Terms
 * Tokens:: Syntax of Prolog tokens
+* Encoding:: How characters are encoded and Wide Character Support

 Subnodes of Tokens
 * Numbers:: Integer and Floating-Point Numbers
@@ -151,6 +152,10 @@ Subnodes of Numbers
 * Integers:: How Integers are read and represented
 * Floats:: Floating Point Numbers

+Subnodes of Encoding
+* Stream Encoding:: How Prolog Streams can be coded
+* BOM:: The Byte Order Mark
+
 Subnodes of Loading Programs
 * Compiling:: Program Loading and Updating
 * Setting the Compiler:: Changing the compiler's parameters
@@ -1029,6 +1034,7 @@ built.
@menu
 * Formal Syntax:: Syntax of terms 
 * Tokens:: Syntax of Prolog tokens
+* Encoding:: How characters are encoded and Wide Character Support
@end menu

@node Formal Syntax, Tokens, ,Syntax
@@ -1116,7 +1122,7 @@ dot with single quotes.

@end itemize

-@node Tokens, , Formal Syntax, Syntax
+@node Tokens, Encoding, Formal Syntax, Syntax
@section Prolog Tokens
@cindex token

@@ -1362,6 +1368,159 @@ layout characters, the YAP parser behaves as if it had found a
 single blank character. The end of a file also counts as a blank
 character for this purpose.

+@node Encoding, , Tokens, Syntax
+@section Wide Character Support
+@cindex encodings
+
+@menu
+* Stream Encoding:: How Prolog Streams can be coded
+* BOM:: The Byte Order Mark
+@end menu
+
+@cindex UTF-8
+@cindex Unicode
+@cindex UCS
+@cindex internationalization
+YAP now implements a SWI-Prolog compatible interface to wide
+characters and the Universal Character Set (UCS). The following text
+was adapted from the SWI-Prolog manual.
+
+YAP now supports wide characters, that is, characters with character
+codes above 255 that cannot be represented in a single byte.
+@emph{Universal Character Set} (UCS) is the ISO/IEC 10646 standard
+that specifies a unique 31-bit unsigned integer for any character in
+any language.  It is a superset of 16-bit Unicode, which in turn is
+a superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII.  UCS
+can handle strings holding characters from multiple languages, and
+character classification (uppercase, lowercase, digit, etc.) and
+operations such as case conversion are unambiguously defined.
+
+For this reason YAP, following SWI-Prolog, has two representations for
+atoms. If the text fits in ISO Latin-1, it is represented as an array
+of 8-bit characters.  Otherwise the text is represented as an array of
+wide chars, which may take 16 or 32 bits.  This representational issue
+is completely transparent to the Prolog user.  Users of the foreign
+language interface sometimes need to be aware of these issues though.
+
+Character coding comes into view when characters of strings need to be
+read from or written to a file, or when they have to be communicated to
+other software components using the foreign language interface. In this
+section we only deal with I/O through streams, which includes file I/O
+as well as I/O through network sockets.
+
+
+@node Stream Encoding, BOM, , Encoding
+@subsection Wide character encodings on streams
+
+
+
+Although characters are uniquely coded using the UCS standard
+internally, streams and files are byte (8-bit) oriented and there are a
+variety of ways to represent the larger UCS codes in an octet
+stream. The most popular one, especially in the context of the web, is
+UTF-8. Bytes 0...127 simply represent the corresponding US-ASCII
+characters, while bytes 128...255 are used for the multi-byte
+encoding of characters placed higher in the UCS space. The 16-bit
+Unicode standard, represented by pairs of bytes, is also popular,
+especially on MS-Windows.
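+
+As a rough illustration of the multi-byte scheme, the following sketch
+computes the UTF-8 octets for a character code. The predicate names
+@code{utf8_bytes/2} and @code{utf8_seq/2} are hypothetical and are not
+part of YAP; only codes up to @code{0x10FFFF} (at most four bytes) are
+handled.
+
+@example
+% utf8_bytes(+Code, -Bytes): Bytes is the list of UTF-8 octets for Code.
+% Hypothetical helper for illustration only, not a YAP built-in.
+utf8_bytes(C, Bytes) :-
+    integer(C), C >= 0, C =< 0x10FFFF,
+    utf8_seq(C, Bytes).
+
+utf8_seq(C, [C]) :-                 % 0..127: the US-ASCII byte itself
+    C < 0x80, !.
+utf8_seq(C, [B1,B2]) :-             % 110xxxxx 10xxxxxx
+    C < 0x800, !,
+    B1 is 0xC0 \/ (C >> 6),
+    B2 is 0x80 \/ (C /\ 0x3F).
+utf8_seq(C, [B1,B2,B3]) :-          % 1110xxxx 10xxxxxx 10xxxxxx
+    C < 0x10000, !,
+    B1 is 0xE0 \/ (C >> 12),
+    B2 is 0x80 \/ ((C >> 6) /\ 0x3F),
+    B3 is 0x80 \/ (C /\ 0x3F).
+utf8_seq(C, [B1,B2,B3,B4]) :-       % 11110xxx followed by three 10xxxxxx
+    B1 is 0xF0 \/ (C >> 18),
+    B2 is 0x80 \/ ((C >> 12) /\ 0x3F),
+    B3 is 0x80 \/ ((C >> 6) /\ 0x3F),
+    B4 is 0x80 \/ (C /\ 0x3F).
+@end example
+
+For instance, @code{utf8_bytes(0xE9, Bs)} gives @code{Bs = [195,169]} and
+@code{utf8_bytes(0x20AC, Bs)} gives @code{Bs = [226,130,172]}. Every byte
+of a multi-byte sequence is above 127, so US-ASCII text is left unchanged.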
+
+Prolog I/O streams have a property called @emph{encoding}, which
+specifies the encoding in use and influences @code{get_code/2} and
+@code{put_code/2} as well as all the other text I/O predicates.
+
+The default encoding for files is derived from the Prolog flag
+@code{encoding}, which is initialised from the environment.  If the
+environment variable @env{LANG} ends in "UTF-8", this encoding is
+assumed. Otherwise the default is @code{text} and the translation is
+left to the wide-character functions of the C-library (note that the
+Prolog native UTF-8 mode is considerably faster than the generic
+@code{mbrtowc()} one).  The encoding can be specified explicitly in
+@code{load_files/2} for loading Prolog source with an alternative
+encoding, in @code{open/4} when opening files, or using @code{set_stream/2}
+on any open stream (not yet implemented). For Prolog source files we also
+provide the @code{encoding/1} directive, which can be used to switch
+between encodings that are compatible with US-ASCII (@code{ascii},
+@code{iso_latin_1}, @code{utf8} and many locales).
+@c See also
+@c \secref{intsrcfile} for writing Prolog files with non-US-ASCII
+@c characters and \secref{unicodesyntax} for syntax issues. 
+For
+additional information and Unicode resources, please visit
+@uref{http://www.unicode.org/}.
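+
+A minimal sketch of the explicit forms mentioned above. The predicate
+names @code{write_utf8/2} and @code{load_latin1/1} are hypothetical and
+only serve as illustration; the options are the ones described in this
+section.
+
+@example
+% At the top of a source file: read the rest of the file as UTF-8.
+:- encoding(utf8).
+
+% Write Term followed by a newline to File, encoded as UTF-8.
+write_utf8(File, Term) :-
+    open(File, write, Stream, [encoding(utf8)]),
+    write(Stream, Term),
+    nl(Stream),
+    close(Stream).
+
+% Load a source file that is known to be written in ISO Latin-1.
+load_latin1(File) :-
+    load_files(File, [encoding(iso_latin_1)]).
+@end example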
+
+YAP currently defines and supports the following encodings:
+
+@table @code
+@item  octet
+Default encoding for @emph{binary} streams.  This causes
+the stream to be read and written fully untranslated.
+
+@item  ascii
+7-bit encoding in 8-bit bytes.  Equivalent to @code{iso_latin_1},
+but generates errors and warnings on encountering values above
+127.
+
+@item  iso_latin_1
+8-bit encoding supporting many western languages.  This causes
+the stream to be read and written fully untranslated.
+
+@item  text
+C-library default locale encoding for text files.  Files are read and
+written using the C-library functions @code{mbrtowc()} and
+@code{wcrtomb()}.  This may be the same as one of the other encodings;
+notably, it may be the same as @code{iso_latin_1} for western
+languages and @code{utf8} in a UTF-8 context.
+
+@item  utf8
+Multi-byte encoding of full UCS, compatible with @code{ascii}.
+See above.
+
+@item  unicode_be
+Unicode Big Endian.  Reads input in pairs of bytes, most
+significant byte first.  Can only represent 16-bit characters.
+
+@item  unicode_le
+Unicode Little Endian.  Reads input in pairs of bytes, least
+significant byte first.  Can only represent 16-bit characters.
+@end table 
+
+Note that not all encodings can represent all characters. This implies
+that writing text to a stream may cause errors because the stream
+cannot represent these characters. The behaviour of a stream on these
+errors can be controlled using @code{open/4} or @code{set_stream/2} (not
+implemented). Initially the terminal streams write such characters using
+Prolog escape sequences, while other streams generate an I/O exception.
+
+
+@node BOM, , Stream Encoding, Encoding
+@subsection BOM: Byte Order Mark
+
+@cindex BOM
+@cindex Byte Order Mark
+From @ref{Stream Encoding}, you may have got the impression that text-files
+are complicated. This section deals with a related topic, which often makes
+life easier for the user but gives the programmer another worry.
+@strong{BOM} or @emph{Byte Order Mark} is a technique for
+identifying Unicode text-files as well as the encoding they use. Such
+files start with the Unicode character @code{0xFEFF}, a non-breaking,
+zero-width space character. This is a pretty unique sequence that is not
+likely to be the start of a non-Unicode file and uniquely distinguishes
+the various Unicode file formats. As it is a zero-width blank, it does
+not even produce any output. This solves all problems, or ...
+
+Some formats start off as US-ASCII and may contain an encoding mark to
+switch to UTF-8, such as the @code{encoding="UTF-8"} in an XML header.
+Such formats often explicitly forbid the use of a UTF-8 BOM. In
+other cases there is additional information that specifies the encoding,
+making the use of a BOM redundant or even illegal.
+
+The BOM is handled by the @code{open/4} predicate. By default, text-files are
+probed for the BOM when opened for reading. If a BOM is found, the
+encoding is set accordingly and the property @code{bom(true)} is
+available through @code{stream_property/2}. When opening a file for
+writing, writing a BOM can be requested using the option
+@code{bom(true)} with @code{open/4}.
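+
+The following sketch assumes the options behave as described above. It
+writes a file with a BOM, reopens it for reading and queries what was
+detected. The predicate name @code{bom_roundtrip/3} is hypothetical.
+
+@example
+% bom_roundtrip(+File, -Enc, -Bom): write File with a BOM, then report
+% the encoding and BOM status found when reading it back.
+bom_roundtrip(File, Enc, Bom) :-
+    open(File, write, Out, [encoding(unicode_be), bom(true)]),
+    write(Out, hello),
+    nl(Out),
+    close(Out),
+    open(File, read, In, []),
+    stream_property(In, encoding(Enc)),
+    (   stream_property(In, bom(Bom)) -> true ; Bom = false ),
+    close(In).
+@end example
+
+If the BOM is probed as described, one would expect @code{Enc} to be
+@code{unicode_be} and @code{Bom} to be @code{true}.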
+
@node Loading Programs, Modules, Syntax, Top
@chapter Loading Programs

@@ -3381,6 +3540,24 @@ concerning the stream.
 The operation will fail and give an error if the alias name is already
 in use. YAP allows several aliases for the same file, but only
 one is returned by @code{stream_property/2}
+
+@item bom(+@var{Bool})
+If @code{true}, check for a BOM (@emph{Byte Order Mark}) when opening
+the file for reading, or write a BOM when opening the file for
+writing. See @ref{BOM} for details.
+
+@item encoding(+@var{Encoding})
+Set the encoding used for text.  See @ref{Encoding} for an overview of
+wide character and encoding issues.
+
+@item representation_errors(+@var{Mode})
+Change the behaviour when writing characters to the stream that cannot
+be represented by the encoding.  The behaviour is one of @code{error}
+(throw an I/O error exception), @code{prolog} (write a @code{\u...\}
+escape code) or @code{xml} (write a @code{&#...;} XML character entity).
+The initial mode is @code{prolog} for the user streams and
+@code{error} for all other streams. See also @ref{Encoding}.
+
@end table

@item close(+@var{S}) [ISO]
@@ -3550,6 +3727,24 @@ seekable.
@item type(@var{T})
 Whether the stream is a @code{text} stream or a @code{binary} stream.

+@item bom(+@var{Bool})
+If present and @code{true}, a BOM (@emph{Byte Order Mark}) was
+detected while opening the file for reading or a BOM was written while
+opening the stream. See @ref{BOM} for details.
+
+@item encoding(+@var{Encoding})
+Query the encoding used for text.  See @ref{Encoding} for an
+overview of wide character and encoding issues in YAP.
+
+@item representation_errors(+@var{Mode})
+Behaviour when writing characters to the stream that cannot be
+represented by the encoding.  The behaviour is one of @code{error}
+(throw an I/O error exception), @code{prolog} (write a @code{\u...\}
+escape code) or @code{xml} (write a @code{&#...;} XML character entity).
+The initial mode is @code{prolog} for the user streams and
+@code{error} for all other streams. See also @ref{Encoding} and
+@code{open/4}.
+
@end table

@end table