doc updates

2014-04-09 12:50:27 +01:00
parent 250099cfe8
commit ca0c646793
1 changed files with 493 additions and 0 deletions
--- a/docs/syntax.tex
+++ b/docs/syntax.tex
@@ -0,0 +1,493 @@
+@c -*- mode: texinfo; coding: utf-8; -*-
+
+
+@node Syntax, Loading Programs, Run, Top
+@chapter Syntax
+
+We describe the syntax of YAP at two levels. First, we describe the
+syntax of Prolog terms. Second, we describe the @i{tokens} from which
+Prolog @i{terms} are built.
+
+@menu
+* Formal Syntax:: Syntax of terms 
+* Tokens:: Syntax of Prolog tokens
+* Encoding:: How characters are encoded and Wide Character Support
+@end menu
+
+@node Formal Syntax, Tokens, ,Syntax
+@section Syntax of Terms
+@cindex syntax
+
+Below, we describe the syntax of YAP terms, built from the different
+classes of tokens defined in @ref{Tokens}. The formalism used is @emph{BNF},
+extended where necessary with attributes denoting integer precedence or
+operator type.
+
+@example
+ term       ---->     subterm(1200)   end_of_term_marker
+
+ subterm(N) ---->     term(M)         [M <= N]
+
+ term(N)    ---->     op(N, fx) subterm(N-1)
+             |        op(N, fy) subterm(N)
+             |        subterm(N-1) op(N, xfx) subterm(N-1)
+             |        subterm(N-1) op(N, xfy) subterm(N)
+             |        subterm(N) op(N, yfx) subterm(N-1)
+             |        subterm(N-1) op(N, xf)
+             |        subterm(N) op(N, yf)
+
+ term(0)   ---->      atom '(' arguments ')'
+             |        '(' subterm(1200)  ')'
+             |        '@{' subterm(1200)  '@}'
+             |        list
+             |        string
+             |        number
+             |        atom
+             |        variable
+
+ arguments ---->      subterm(999)
+             |        subterm(999) ',' arguments
+
+ list      ---->      '[]'
+             |        '[' list_expr ']'
+
+ list_expr ---->      subterm(999)
+             |        subterm(999) list_tail
+
+ list_tail ---->      ',' list_expr
+             |        ',..' subterm(999)
+             |        '|' subterm(999)
+@end example
+
+@noindent
+Notes:
+
+@itemize @bullet
+
+@item
+@i{op(N,T)} denotes an atom which has been previously declared with type
+@i{T} and base precedence @i{N}.
+
+@item
+Since ',' is itself a pre-declared operator with type @i{xfy} and
+precedence 1000, if @i{subterm} starts with a '(', @i{op} must be
+followed by a space to avoid ambiguity with the case of a functor
+followed by arguments, e.g.:
+
+@example
++ (a,b)        [the same as '+'(','(a,b)) of arity one]
+@end example
+versus
+@example
++(a,b)         [the same as '+'(a,b) of arity two]
+@end example
+
+@item
+In the first rule for term(0) no blank space should exist between
+@i{atom} and '('.
+
+@item
+@cindex end of term
+Each term to be read by the YAP parser must end with a single
+dot, followed by a blank (in the sense described in @ref{Layout}).
+When a name consisting of a single dot could be taken for
+the end of term marker, the ambiguity should be avoided by surrounding the
+dot with single quotes.
+
+@end itemize
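+
+@noindent
+As a minimal sketch of how these rules interact, the following queries
+inspect parsed terms with the standard predicates @code{functor/3} and
+@code{=../2}. They assume the default operator table, where @code{+} is
+both a prefix and an infix operator and @code{*} binds tighter than
+@code{+}; the bindings shown in comments follow from the rules above and
+are not captured YAP output.
+
+@example
+% Infix operators: * (precedence 400) binds tighter than + (500).
+?- X = a+b*c, X = +(L, R).              % L = a, R = b*c
+
+% A clause: ',' (1000) binds tighter than ':-' (1200).
+?- X = (h :- b1, b2), X =.. [F|Args].   % F = (:-), Args = [h, (b1,b2)]
+
+% Space before '(': prefix operator applied to the single term (a,b).
+?- X = + (a,b), functor(X, N, A).       % N = +, A = 1
+
+% No space before '(': ordinary functor notation with two arguments.
+?- X = +(a,b), functor(X, N, A).        % N = +, A = 2
+@end example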
+
+@node Tokens, Encoding, Formal Syntax, Syntax
+@section Prolog Tokens
+@cindex token
+
+Prolog tokens are grouped into the following categories:
+
+@menu
+* Numbers:: Integer and Floating-Point Numbers
+* Strings:: Sequences of Characters
+* Atoms:: Atomic Constants
+* Variables:: Logical Variables
+* Punctuation Tokens:: Tokens that separate other tokens
+* Layout:: Comments and Other Layout Rules
+@end menu
+
+@node Numbers, Strings, ,Tokens
+@subsection Numbers
+@cindex number
+
+Numbers can be further subdivided into integer and floating-point numbers.
+
+@menu
+* Integers:: How Integers are read and represented
+* Floats:: Floating Point Numbers
+@end menu
+
+@node Integers, Floats, ,Numbers
+@subsubsection Integers
+@cindex integer
+
+Integer numbers
+are described by the following regular expression:
+
+@example
+<integer> := @{<digit>+<single-quote>|0@{xXo@}@}<alpha_numeric_char>+
+@end example
+@noindent
+where @{...@} stands for optionality, @i{+} for repetition (one or
+more times), @i{<digit>} denotes one of the characters 0 ... 9, @i{|}
+denotes or, and @i{<single-quote>} denotes the character "'". The digits
+before the @i{<single-quote>} character, when present, form the number
+base, which can range from 0 up to 36. Letters from @code{A} to
+@code{Z} are used as digits when the base is larger than 10.
+
+Note that if no base is specified then base 10 is assumed. Note also
+that the last digit of an integer token cannot be immediately followed
+by one of the characters 'e', 'E', or '.'.
+
+Following the ISO standard, YAP also accepts the prefix
+@code{0x} for numbers in hexadecimal base and the prefix
+@code{0o} for numbers in octal base. For convenience,
+YAP also accepts the prefix @code{0X} for
+numbers in hexadecimal base.
+
+Example:
+the following tokens all denote the same integer:
+@example
+@code{10  2'1010  3'101  8'12  16'a  36'a  0xa  0o12}
+@end example
+
+Numbers of the form @code{0'a} are used to represent character
+constants. So, the following tokens denote the same integer:
+@example
+@code{0'd  100}
+@end example
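+
+@noindent
+A short sketch of these notations in use, assuming a standard top level;
+the values in comments follow from the definitions above (for instance,
+@code{0'd} is the character code of @code{d}, which is 100 in ASCII).
+
+@example
+?- X is 2'1010 + 16'a + 0xa + 0o12.   % X = 40, four ways to write 10
+?- X is 36'z.                         % X = 35, z is the last base-36 digit
+?- X = 0'd.                           % X = 100
+?- atom_codes(A, [0'y, 0'a, 0'p]).    % A = yap
+@end example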
+
+YAP (version @value{VERSION}) supports integers that fit in
+the word size of the machine. This is 32 bits on most current machines,
+but 64 on some others, such as the Alpha running Linux or Digital
+Unix. The scanner reads integers outside this range erroneously.
+
+@node Floats, , Integers,Numbers
+@subsubsection Floating-point Numbers
+@cindex floating-point number
+
+Floating-point numbers are described by:
+
+@example
+   <float> := <digit>+@{<dot><digit>+@}
+               <exponent-marker>@{<sign>@}<digit>+
+            |<digit>+<dot><digit>+
+               @{<exponent-marker>@{<sign>@}<digit>+@}
+@end example
+
+@noindent
+where @i{<dot>} denotes the decimal-point character '.',
+@i{<exponent-marker>} denotes one of 'e' or 'E', and @i{<sign>} denotes
+one of '+' or '-'.
+
+Examples:
+@example
+@code{10.0   10e3   10e-3   3.1415e+3}
+@end example
+
+Floating-point numbers are represented as a @code{double} on the target
+machine. This is usually a 64-bit number.
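+
+@noindent
+For illustration, the following queries use the standard type-checking
+predicates to contrast the two kinds of number. This is a sketch; how a
+float is printed back may vary between systems, so no printed values are
+shown for the arithmetic.
+
+@example
+?- float(10.0).        % succeeds: a dot followed by digits makes a float
+?- integer(10).        % succeeds: no dot and no exponent
+?- float(10).          % fails: 10 is an integer token
+?- X is 3.1415e+3.     % X is 3.1415 times one thousand, as a float
+?- 1.0e1 =:= 10.       % succeeds: arithmetic comparison, equal values
+@end example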
+
+@node Strings, Atoms, Numbers,Tokens
+@subsection Character Strings
+@cindex string
+
+Strings are described by the following rules:
+@example
+  string --> '"' string_quoted_characters '"'
+
+  string_quoted_characters --> '"' '"' string_quoted_characters
+  string_quoted_characters --> '\'
+                          escape_sequence string_quoted_characters
+  string_quoted_characters -->
+                          string_character string_quoted_characters
+
+  escape_sequence --> 'a' | 'b' | 'r' | 'f' | 't' | 'n' | 'v'
+  escape_sequence --> '\' | '"' | ''' | '`'
+  escape_sequence --> at_most_3_octal_digit_seq_char '\'
+  escape_sequence --> 'x' at_most_2_hexa_digit_seq_char '\'
+@end example
+where @code{string_character} is any character except the double quote
+and escape characters.
+
+Examples:
+@example
+@code{""   "a string"   "a double-quote:""" }
+@end example
+
+The first string is the empty string; the last string shows how to
+include a double quote by doubling it. YAP represents strings as
+lists of integers. Since YAP 4.3.0 there is no static limit on string
+size.
+
+Escape sequences can be used to include the non-printable characters
+@code{a} (alert), @code{b} (backspace), @code{r} (carriage return),
+@code{f} (form feed), @code{t} (horizontal tabulation), @code{n} (new
+line), and @code{v} (vertical tabulation). Escape sequences can also
+include the meta-characters @code{\}, @code{"}, @code{'}, and
+@code{`}. Finally, escape sequences can denote a character by its
+octal or hexadecimal code.
+
+The next examples demonstrate the use of escape sequences in YAP:
+
+@example
+@code{"\x0c\" "\014\" "\f" "\\" }
+@end example
+
+The first three examples return a list including only character 12 (form
+feed). The last example escapes the escape character.
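+
+@noindent
+As a sketch of how these escapes behave, the queries below use the
+standard @code{atom_codes/2} on quoted atoms, which follow the same
+escape rules as strings. They assume character escapes are enabled; the
+codes in comments are the ASCII values.
+
+@example
+?- atom_codes('\f', L).      % L = [12]
+?- atom_codes('\x0c\', L).   % L = [12], hexadecimal escape
+?- atom_codes('\014\', L).   % L = [12], octal escape
+?- atom_codes('\\', L).      % L = [92], the backslash itself
+?- atom_codes('a\tb', L).    % L = [97,9,98]
+@end example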
+
+Escape sequences were not available in C-Prolog and in original
+versions of YAP up to 4.2.0. Escape sequences can be disabled by using:
+@example
+@code{:- yap_flag(character_escapes,false).}
+@end example
+
+
+@node Atoms, Variables, Strings, Tokens
+@subsection Atoms
+@cindex atom
+
+Atoms are defined by one of the following rules:
+@example
+   atom --> solo-character
+   atom --> lower-case-letter name-character*
+   atom --> symbol-character+
+   atom --> single-quote  single-quote
+   atom --> ''' atom_quoted_characters '''
+
+
+  atom_quoted_characters --> ''' ''' atom_quoted_characters
+  atom_quoted_characters --> '\' escape_sequence atom_quoted_characters
+  atom_quoted_characters --> character atom_quoted_characters
+@end example
+where:
+@example
+   <solo-character>     denotes one of:    ! ;
+   <symbol-character>   denotes one of:    # & * + - . / : < 
+                                           = > ? @@ \ ^ ~ `
+   <lower-case-letter>  denotes one of:    a...z
+   <name-character>     denotes one of:    _ a...z A...Z 0...9
+   <single-quote>       denotes:           '
+@end example
+
+and @code{character} denotes any character except the single quote
+and escape characters. Note that escape sequences in strings and atoms
+follow the same rules.
+
+Examples:
+@example
+@code{a   a12x   '$a'   !   =>  '1 2'}
+@end example
+
+Version @code{4.2.0} of YAP removed the previous limit of 256
+characters on an atom. The size of an atom is now limited only by the
+space available in the system.
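+
+@noindent
+A few queries sketching these rules, using the standard predicates
+@code{atom/1} and @code{atom_length/2}:
+
+@example
+?- atom(a12x), atom('$a'), atom('1 2').   % succeeds
+?- abc == 'abc'.                % succeeds: quotes are not part of the name
+?- atom_length('don''t', N).    % N = 5, the doubled quote counts once
+?- atom_length('', N).          % N = 0, the empty atom
+?- atom(Abc).                   % fails: Abc is a variable
+@end example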
+
+@node Variables, Punctuation Tokens, Atoms, Tokens
+@subsection Variables
+@cindex variable
+
+Variables are described by:
+@example
+   <variable-starter><variable-character>*
+@end example
+where
+@example
+  <variable-starter>   denotes one of:    _ A...Z
+  <variable-character> denotes one of:    _ a...z A...Z 0...9
+@end example
+
+@cindex anonymous variable
+If a variable is referred to only once in a term, it need not be named
+and one can use the character @code{_} to represent the variable. These
+variables are known as anonymous variables. Note that different
+occurrences of @code{_} in the same term represent @emph{different}
+anonymous variables.
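+
+@noindent
+The difference between named and anonymous variables can be sketched
+with two unifications:
+
+@example
+?- f(_, _) = f(1, 2).    % succeeds: each _ is a distinct variable
+?- f(X, X) = f(1, 2).    % fails: X cannot be both 1 and 2
+@end example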
+
+@node Punctuation Tokens, Layout, Variables, Tokens
+@subsection Punctuation Tokens
+@cindex punctuation token
+
+Punctuation tokens consist of one of the following characters:
+@example
+( ) , [ ] @{ @} |
+@end example
+
+These characters are used to group terms.
+
+@node Layout, ,Punctuation Tokens, Tokens
+@subsection Layout
+@cindex comment
+Any characters with ASCII code less than or equal to 32 appearing before
+a token are ignored.
+
+All the text appearing in a line after the character @i{%} is taken to
+be a comment and ignored (including @i{%}).  Comments can also be
+inserted by using the sequence @code{/*} to start the comment and
+the sequence @code{*/} to finish it. In the presence of any sequence of
+comments or layout characters, the YAP parser behaves as if it had found a
+single blank character. The end of a file also counts as a blank
+character for this purpose.
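+
+@noindent
+A small source fragment sketching these layout rules; the predicate
+@code{colour/1} is a made-up example:
+
+@example
+% A line comment runs to the end of the line.
+/* A block comment
+   may span several lines. */
+colour(red).   /* a comment counts as a blank */   colour(green).
+colour(
+    blue       % layout before a token is ignored
+).
+@end example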
+
+@node Encoding, , Tokens, Syntax
+@section Wide Character Support
+
+@cindex encodings
+
+@menu
+* Stream Encoding:: How Prolog Streams can be coded
+* BOM:: The Byte Order Mark
+@end menu
+
+@cindex UTF-8
+@cindex Unicode
+@cindex UCS
+@cindex internationalization
+
+YAP now implements a SWI-Prolog compatible interface to wide
+characters and the Universal Character Set (UCS). The following text
+was adapted from the SWI-Prolog manual.
+
+YAP now supports wide characters, that is, characters with character
+codes above 255 that cannot be represented in a single byte.
+@emph{Universal Character Set} (UCS) is the ISO/IEC 10646 standard
+that specifies a unique 31-bit unsigned integer for any character in
+any language.  It is a superset of 16-bit Unicode, which in turn is
+a superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII.  UCS
+can handle strings holding characters from multiple languages and
+character classification (uppercase, lowercase, digit, etc.) and
+operations such as case-conversion are unambiguously defined.
+
+For this reason YAP, following SWI-Prolog, has two representations for
+atoms. If the text fits in ISO Latin-1, it is represented as an array
+of 8-bit characters.  Otherwise the text is represented as an array of
+wide chars, which may take 16 or 32 bits.  This representational issue
+is completely transparent to the Prolog user.  Users of the foreign
+language interface sometimes need to be aware of these issues though.
+
+Character coding comes into view when characters of strings need to be
+read from or written to file or when they have to be communicated to
+other software components using the foreign language interface. In this
+section we only deal with I/O through streams, which includes file I/O
+as well as I/O through network sockets.
+
+
+@node Stream Encoding, , BOM, Encoding
+@subsection Wide character encodings on streams
+
+
+
+Although characters are uniquely coded using the UCS standard
+internally, streams and files are byte (8-bit) oriented and there are a
+variety of ways to represent the larger UCS codes in an 8-bit octet
+stream. The most popular one, especially in the context of the web, is
+UTF-8. Bytes 0...127 simply represent the corresponding US-ASCII
+character, while bytes 128...255 are used for multi-byte
+encoding of characters placed higher in the UCS space. Especially on
+MS-Windows, the 16-bit Unicode standard, represented by pairs of bytes, is
+also popular.
+
+Prolog I/O streams have a property called @emph{encoding}, which
+specifies the encoding used and influences @code{get_code/2} and
+@code{put_code/2} as well as all the other text I/O predicates.
+
+The default encoding for files is derived from the Prolog flag
+@code{encoding}, which is initialised from the environment.  If the
+environment variable @env{LANG} ends in "UTF-8", this encoding is
+assumed. Otherwise the default is @code{text} and the translation is
+left to the wide-character functions of the C-library (note that the
+Prolog native UTF-8 mode is considerably faster than the generic
+@code{mbrtowc()} one).  The encoding can be specified explicitly in
+@code{load_files/2} for loading Prolog source with an alternative
+encoding, @code{open/4} when opening files or using @code{set_stream/2} on
+any open stream (not yet implemented). For Prolog source files we also
+provide the @code{encoding/1} directive that can be used to switch
+between encodings that are compatible with US-ASCII (@code{ascii},
+@code{iso_latin_1}, @code{utf8} and many locales).  
+@c See also
+@c \secref{intsrcfile} for writing Prolog files with non-US-ASCII
+@c characters and \secref{unicodesyntax} for syntax issues. 
+For
+additional information and Unicode resources, please visit
+@uref{http://www.unicode.org/}.
+
+YAP currently defines and supports the following encodings:
+
+@itemize @bullet
+@item  octet
+Default encoding for @emph{binary} streams.  This causes
+the stream to be read and written fully untranslated.
+
+@item  ascii
+7-bit encoding in 8-bit bytes.  Equivalent to @code{iso_latin_1},
+but generates errors and warnings on encountering values above
+127.
+
+@item  iso_latin_1
+8-bit encoding supporting many western languages.  This causes
+the stream to be read and written fully untranslated.
+
+@item  text
+C-library default locale encoding for text files.  Files are read and
+written using the C-library functions @code{mbrtowc()} and
+@code{wcrtomb()}.  This may be the same as one of the other locales,
+notably it may be the same as @code{iso_latin_1} for western
+languages and @code{utf8} in a UTF-8 context.
+
+@item  utf8
+Multi-byte encoding of full UCS, compatible with @code{ascii}.
+See above.
+
+@item  unicode_be
+Unicode Big Endian.  Reads input in pairs of bytes, most
+significant byte first.  Can only represent 16-bit characters.
+
+@item  unicode_le
+Unicode Little Endian.  Reads input in pairs of bytes, least
+significant byte first.  Can only represent 16-bit characters.
+@end itemize 
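+
+@noindent
+As a hedged sketch of selecting an encoding, the fragment below assumes
+that @code{open/4} takes the SWI-Prolog style option
+@code{encoding(Encoding)}, as the text above suggests; the file name
+@file{notes.txt} and the predicate @code{first_code/1} are hypothetical.
+
+@example
+% Declare that the rest of this source file is written in UTF-8.
+:- encoding(utf8).
+
+% Read the first character code of a UTF-8 encoded file.
+first_code(Code) :-
+    open('notes.txt', read, Stream, [encoding(utf8)]),
+    get_code(Stream, Code),
+    close(Stream).
+@end example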
+
+Note that not all encodings can represent all characters. This implies
+that writing text to a stream may cause errors because the stream
+cannot represent these characters. The behaviour of a stream on these
+errors can be controlled using @code{open/4} or @code{set_stream/2} (not
+implemented). Initially the terminal stream writes the characters using
+Prolog escape sequences while other streams generate an I/O exception.
+
+
+@node BOM, Stream Encoding, , Encoding
+@subsection BOM: Byte Order Mark
+
+@cindex BOM
+@cindex Byte Order Mark
+From @ref{Stream Encoding}, you may have got the impression that text files
+are complicated. This section deals with a related topic, often making life
+easier for the user, but providing another worry to the programmer.
+@strong{BOM} or @emph{Byte Order Mark} is a technique for
+identifying Unicode text-files as well as the encoding they use. Such
+files start with the Unicode character @code{0xFEFF}, a non-breaking,
+zero-width space character. This is a pretty unique sequence that is not
+likely to be the start of a non-Unicode file and uniquely distinguishes
+the various Unicode file formats. As it is a zero-width blank, it does
+not even produce any output. This solves all problems, or ...
+
+Some formats start off as US-ASCII and may contain some encoding mark to
+switch to UTF-8, such as the @code{encoding="UTF-8"} in an XML header.
+Such formats often explicitly forbid the use of a UTF-8 BOM. In
+other cases there is additional information telling the encoding, making
+the use of a BOM redundant or even illegal.
+
+The BOM is handled by the @code{open/4} predicate. By default, text-files are
+probed for the BOM when opened for reading. If a BOM is found, the
+encoding is set accordingly and the property @code{bom(true)} is
+available through @code{stream_property/2}. When opening a file for
+writing, writing a BOM can be requested using the option
+@code{bom(true)} with @code{open/4}.
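+
+@noindent
+A minimal sketch of the behaviour described above, using only the
+options named in this section; the file name @file{bom.txt} and the
+predicate @code{bom_round_trip/1} are hypothetical, and the
+@code{encoding(utf8)} option is assumed to follow SWI-Prolog.
+
+@example
+bom_round_trip(Bom) :-
+    % Write a UTF-8 file that starts with a BOM.
+    open('bom.txt', write, Out, [encoding(utf8), bom(true)]),
+    write(Out, hello), nl(Out),
+    close(Out),
+    % Reopen it; the BOM is probed by default when reading.
+    open('bom.txt', read, In, []),
+    stream_property(In, bom(Bom)),
+    close(In).
+@end example
+
+@noindent
+If the BOM is detected, @code{Bom} should be bound to @code{true}.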
+