diff --git a/docs/syntax.tex b/docs/syntax.tex
new file mode 100644
index 000000000..478582f2f
--- /dev/null
+++ b/docs/syntax.tex
@@ -0,0 +1,493 @@
+@c -*- mode: texinfo; coding: utf-8; -*-
+
+
+@node Syntax, Loading Programs, Run, Top
+@chapter Syntax
+
+We describe the syntax of YAP at two levels. We first describe the
+syntax of Prolog terms. We then describe the @i{tokens} from which
+Prolog @i{terms} are built.
+
+@menu
+* Formal Syntax::       Syntax of terms
+* Tokens::              Syntax of Prolog tokens
+* Encoding::            How characters are encoded and Wide Character Support
+@end menu
+
+@node Formal Syntax, Tokens, ,Syntax
+@section Syntax of Terms
+@cindex syntax
+
+Below, we describe how YAP terms are built from the different
+classes of tokens defined in the next section. The formalism used is
+@emph{BNF}, extended where necessary with attributes denoting integer
+precedence or operator type.
+
+@example
+   term        ---->  subterm(1200) end_of_term_marker
+
+   subterm(N)  ---->  term(M)        [M <= N]
+
+   term(N)     ---->  op(N, fx) subterm(N-1)
+                |     op(N, fy) subterm(N)
+                |     subterm(N-1) op(N, xfx) subterm(N-1)
+                |     subterm(N-1) op(N, xfy) subterm(N)
+                |     subterm(N) op(N, yfx) subterm(N-1)
+                |     subterm(N-1) op(N, xf)
+                |     subterm(N) op(N, yf)
+
+   term(0)     ---->  atom '(' arguments ')'
+                |     '(' subterm(1200) ')'
+                |     '@{' subterm(1200) '@}'
+                |     list
+                |     string
+                |     number
+                |     atom
+                |     variable
+
+   arguments   ---->  subterm(999)
+                |     subterm(999) ',' arguments
+
+   list        ---->  '[]'
+                |     '[' list_expr ']'
+
+   list_expr   ---->  subterm(999)
+                |     subterm(999) list_tail
+
+   list_tail   ---->  ',' list_expr
+                |     ',..' subterm(999)
+                |     '|' subterm(999)
+@end example
+
+@noindent
+Notes:
+
+@itemize @bullet
+
+@item
+@i{op(N,T)} denotes an atom which has been previously declared with type
+@i{T} and base precedence @i{N}.
+
+@item
+Since ',' is itself a pre-declared operator with type @i{xfy} and
+precedence 1000, if @i{subterm} starts with a '(', @i{op} must be
+followed by a space to avoid ambiguity with the case of a functor
+followed by arguments, e.g.:
+
+@example
++ (a,b)        [the same as '+'(','(a,b)) of arity one]
+@end example
+versus
+@example
++(a,b)         [the same as '+'(a,b) of arity two]
+@end example
+
+@item
+In the first rule for term(0) no blank space should exist between
+@i{atom} and '('.
+
+@item
+@cindex end of term
+Each term to be read by the YAP parser must end with a single
+dot, followed by a blank (in the sense described in @ref{Layout}).
+When a name consisting of a single dot could be taken for
+the end of term marker, the ambiguity should be avoided by surrounding the
+dot with single quotes.
+
+@end itemize
+
+@node Tokens, Encoding, Formal Syntax, Syntax
+@section Prolog Tokens
+@cindex token
+
+Prolog tokens are grouped into the following categories:
+
+@menu
+* Numbers::             Integer and Floating-Point Numbers
+* Strings::             Sequences of Characters
+* Atoms::               Atomic Constants
+* Variables::           Logical Variables
+* Punctuation Tokens::  Tokens that separate other tokens
+* Layout::              Comments and Other Layout Rules
+@end menu
+
+@node Numbers, Strings, ,Tokens
+@subsection Numbers
+@cindex number
+
+Numbers can be further subdivided into integer and floating-point numbers.
+
+@menu
+* Integers::            How Integers are read and represented
+* Floats::              Floating Point Numbers
+@end menu
+
+@node Integers, Floats, ,Numbers
+@subsubsection Integers
+@cindex integer
+
+Integer numbers
+are described by the following regular expression:
+
+@example
+<integer> := @{<digit>+<quote>|0@{xXo@}@}<alpha_numeric_char>+
+@end example
+@noindent
+where @{...@} stands for optionality, @i{+} denotes repetition (one or
+more times), @i{<digit>} denotes one of the characters 0 ... 9, @i{|}
+denotes or, and @i{<quote>} denotes the character "'".
+The digits
+before the @i{<quote>} character, when present, form the number
+base, which can range from 0 to 36. Letters from @code{A} to
+@code{Z} are used as digits when the base is larger than 10.
+
+Note that if no base is specified then base 10 is assumed. Note also
+that the last digit of an integer token cannot be immediately followed
+by one of the characters 'e', 'E', or '.'.
+
+Following the ISO standard, YAP also accepts the prefix @code{0x}
+to represent numbers in hexadecimal base and the prefix
+@code{0o} to represent numbers in octal base. For convenience,
+YAP also accepts the prefix @code{0X} to represent
+numbers in hexadecimal base.
+
+Example:
+the following tokens all denote the same integer
+@example
+@code{10  2'1010  3'101  8'12  16'a  36'a  0xa  0o12}
+@end example
+
+Numbers of the form @code{0'a} are used to represent character
+constants. So, the following tokens denote the same integer:
+@example
+@code{0'd  100}
+@end example
+
+YAP (version @value{VERSION}) supports integers that can fit
+the word size of the machine. This is 32 bits in most current machines,
+but 64 in some others, such as the Alpha running Linux or Digital
+Unix. The scanner reads integers outside this range incorrectly.
+
+@node Floats, , Integers,Numbers
+@subsubsection Floating-point Numbers
+@cindex floating-point number
+
+Floating-point numbers are described by:
+
+@example
+   <float> := <digit>+@{<dot><digit>+@}
+               <exponent-marker>@{<sign>@}<digit>+
+            |<digit>+<dot><digit>+
+               @{<exponent-marker>@{<sign>@}<digit>+@}
+@end example
+
+@noindent
+where @i{<dot>} denotes the decimal-point character '.',
+@i{<exponent-marker>} denotes one of 'e' or 'E', and @i{<sign>} denotes
+one of '+' or '-'.
+
+Examples:
+@example
+@code{10.0   10e3   10e-3   3.1415e+3}
+@end example
+
+Floating-point numbers are represented as a double in the target
+machine. This is usually a 64-bit number.
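+
+@noindent
+As an informal illustration of the number notations described in this
+subsection, the queries below are a sketch that assumes a standard
+top-level and the evaluation predicate @code{is/2}. The variable names
+are arbitrary.
+@example
+% 2'1010, 16'a, 0xa and 0o12 all denote the integer 10.
+?- X is 2'1010, Y is 16'a, Z is 0xa, W is 0o12.
+
+% 0'd denotes the character code of d, that is, 100.
+?- C is 0'd.
+
+% Floating-point tokens, with and without a sign in the exponent.
+?- F is 10e-3, G is 3.1415e+3.
+@end example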
+
+@node Strings, Atoms, Numbers,Tokens
+@subsection Character Strings
+@cindex string
+
+Strings are described by the following rules:
+@example
+  string --> '"' string_quoted_characters '"'
+
+  string_quoted_characters --> '"' '"' string_quoted_characters
+  string_quoted_characters --> '\'
+                          escape_sequence string_quoted_characters
+  string_quoted_characters -->
+                          string_character string_quoted_characters
+
+  escape_sequence --> 'a' | 'b' | 'r' | 'f' | 't' | 'n' | 'v'
+  escape_sequence --> '\' | '"' | ''' | '`'
+  escape_sequence --> at_most_3_octal_digit_seq_char '\'
+  escape_sequence --> 'x' at_most_2_hexa_digit_seq_char '\'
+@end example
+where @code{string_character} is any character except the double quote
+and escape characters.
+
+Examples:
+@example
+@code{""   "a string"   "a double-quote:""" }
+@end example
+
+The first string is the empty string; the last string shows the use of
+double-quoting. The implementation of YAP represents strings as
+lists of integers. Since YAP 4.3.0 there is no static limit on string
+size.
+
+Escape sequences can be used to include the non-printable characters
+@code{a} (alert), @code{b} (backspace), @code{r} (carriage return),
+@code{f} (form feed), @code{t} (horizontal tabulation), @code{n} (new
+line), and @code{v} (vertical tabulation). Escape sequences can also
+include the meta-characters @code{\}, @code{"}, @code{'}, and
+@code{`}. Finally, an escape sequence can denote a character by its
+octal or hexadecimal code.
+
+The next examples demonstrate the use of escape sequences in YAP:
+
+@example
+@code{"\x0c\"   "\14\"   "\f"  "\\" }
+@end example
+
+The first three examples return a list including only character 12 (form
+feed). The last example escapes the escape character.
+
+Escape sequences were not available in C-Prolog and in original
+versions of YAP up to 4.2.0.
+Escape sequences can be disabled by using:
+@example
+@code{:- yap_flag(character_escapes,false).}
+@end example
+
+
+@node Atoms, Variables, Strings, Tokens
+@subsection Atoms
+@cindex atom
+
+Atoms are defined by one of the following rules:
+@example
+   atom --> solo-character
+   atom --> lower-case-letter name-character*
+   atom --> symbol-character+
+   atom --> single-quote  single-quote
+   atom --> ''' atom_quoted_characters '''
+
+
+   atom_quoted_characters --> ''' ''' atom_quoted_characters
+   atom_quoted_characters --> '\' escape_sequence atom_quoted_characters
+   atom_quoted_characters --> atom_character atom_quoted_characters
+@end example
+where:
+@example
+   <solo-character>     denotes one of:    ! ;
+   <symbol-character>   denotes one of:    # & * + - . / : <
+                                           = > ? @@ \ ^ ~ `
+   <lower-case-letter>  denotes one of:    a...z
+   <name-character>     denotes one of:    _ a...z A...Z 0....9
+   <single-quote>       denotes:           '
+@end example
+
+and @code{atom_character} denotes any character except the single quote
+and escape characters. Note that escape sequences in strings and atoms
+follow the same rules.
+
+Examples:
+@example
+@code{a   a12x   '$a'   !   =>  '1 2'}
+@end example
+
+Version @code{4.2.0} of YAP removed the previous limit of 256
+characters on an atom. The size of an atom is now limited only by the
+space available in the system.
+
+@node Variables, Punctuation Tokens, Atoms, Tokens
+@subsection Variables
+@cindex variable
+
+Variables are described by:
+@example
+   <variable-starter><variable-character>+
+@end example
+where
+@example
+  <variable-starter>   denotes one of:    _ A...Z
+  <variable-character> denotes one of:    _ a...z A...Z
+@end example
+
+@cindex anonymous variable
+If a variable occurs only once in a term, it need not be named
+and one can use the character @code{_} to represent the variable. These
+variables are known as anonymous variables. Note that different
+occurrences of @code{_} in the same term represent @emph{different}
+anonymous variables.
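+
+@noindent
+The following queries sketch how anonymous variables behave. The
+functor @code{f} and the atoms @code{a} and @code{b} are arbitrary names
+chosen for the illustration, and @code{=/2} is assumed to be standard
+unification.
+@example
+% Each _ is a distinct variable, so this unification should succeed.
+?- f(_, _) = f(a, b).
+
+% A named variable must take a single value, so this one should fail.
+?- f(X, X) = f(a, b).
+@end example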
+
+@node Punctuation Tokens, Layout, Variables, Tokens
+@subsection Punctuation Tokens
+@cindex punctuation token
+
+Punctuation tokens consist of one of the following characters:
+@example
+( ) , [ ] @{ @} |
+@end example
+
+These characters are used to group terms.
+
+@node Layout, ,Punctuation Tokens, Tokens
+@subsection Layout
+@cindex comment
+Any characters with ASCII code less than or equal to 32 appearing before
+a token are ignored.
+
+All the text appearing in a line after the character @i{%} is taken to
+be a comment and ignored (including @i{%}). Comments can also be
+inserted by using the sequence @code{/*} to start the comment and
+@code{*} followed by @code{/} to finish it. In the presence of any
+sequence of comments or
+layout characters, the YAP parser behaves as if it had found a
+single blank character. The end of a file also counts as a blank
+character for this purpose.
+
+@node Encoding, , Tokens, Syntax
+@section Wide Character Support
+
+@cindex encodings
+
+@menu
+* Stream Encoding::             How Prolog Streams can be coded
+* BOM::                         The Byte Order Mark
+@end menu
+
+@cindex UTF-8
+@cindex Unicode
+@cindex UCS
+@cindex internationalization
+
+YAP now implements a SWI-Prolog compatible interface to wide
+characters and the Universal Character Set (UCS). The following text
+was adapted from the SWI-Prolog manual.
+
+YAP now supports wide characters, that is, characters with character
+codes above 255 that cannot be represented in a single byte.
+@emph{Universal Character Set} (UCS) is the ISO/IEC 10646 standard
+that specifies a unique 31-bit unsigned integer for any character in
+any language. It is a superset of 16-bit Unicode, which in turn is
+a superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII. UCS
+can handle strings holding characters from multiple languages, and
+character classification (uppercase, lowercase, digit, etc.) and
+operations such as case-conversion are unambiguously defined.
+
+For this reason YAP, following SWI-Prolog, has two representations for
+atoms. If the text fits in ISO Latin-1, it is represented as an array
+of 8-bit characters. Otherwise the text is represented as an array of
+wide characters, which may take 16 or 32 bits. This representational issue
+is completely transparent to the Prolog user. Users of the foreign
+language interface sometimes need to be aware of these issues though.
+
+Character coding comes into view when characters of strings need to be
+read from or written to a file or when they have to be communicated to
+other software components using the foreign language interface. In this
+section we only deal with I/O through streams, which includes file I/O
+as well as I/O through network sockets.
+
+
+@node Stream Encoding, , BOM, Encoding
+@subsection Wide character encodings on streams
+
+
+
+Although characters are uniquely coded using the UCS standard
+internally, streams and files are byte (8-bit) oriented and there are a
+variety of ways to represent the larger UCS codes in an 8-bit octet
+stream. The most popular one, especially in the context of the web, is
+UTF-8. Bytes 0...127 simply represent the corresponding US-ASCII
+character, while bytes 128...255 are used for multi-byte
+encoding of characters placed higher in the UCS space. Especially on
+MS-Windows, the 16-bit Unicode standard, represented by pairs of bytes, is
+also popular.
+
+Prolog I/O streams have a property called @emph{encoding}, which
+specifies the encoding used and influences @code{get_code/2} and
+@code{put_code/2} as well as all the other text I/O predicates.
+
+The default encoding for files is derived from the Prolog flag
+@code{encoding}, which is initialised from the environment. If the
+environment variable @env{LANG} ends in "UTF-8", this encoding is
+assumed.
+Otherwise the default is @code{text} and the translation is
+left to the wide-character functions of the C-library (note that the
+Prolog native UTF-8 mode is considerably faster than the generic
+@code{mbrtowc()} one). The encoding can be specified explicitly in
+@code{load_files/2} for loading Prolog source with an alternative
+encoding, in @code{open/4} when opening files, or using
+@code{set_stream/2} on
+any open stream (not yet implemented). For Prolog source files we also
+provide the @code{encoding/1} directive that can be used to switch
+between encodings that are compatible with US-ASCII (@code{ascii},
+@code{iso_latin_1}, @code{utf8} and many locales).
+@c See also
+@c \secref{intsrcfile} for writing Prolog files with non-US-ASCII
+@c characters and \secref{unicodesyntax} for syntax issues.
+For
+additional information and Unicode resources, please visit
+@uref{http://www.unicode.org/}.
+
+YAP currently defines and supports the following encodings:
+
+@itemize @bullet
+@item octet
+Default encoding for @emph{binary} streams. This causes
+the stream to be read and written fully untranslated.
+
+@item ascii
+7-bit encoding in 8-bit bytes. Equivalent to @code{iso_latin_1},
+but generates errors and warnings on encountering values above
+127.
+
+@item iso_latin_1
+8-bit encoding supporting many western languages. This causes
+the stream to be read and written fully untranslated.
+
+@item text
+C-library default locale encoding for text files. Files are read and
+written using the C-library functions @code{mbrtowc()} and
+@code{wcrtomb()}. This may be the same as one of the other encodings;
+notably, it may be the same as @code{iso_latin_1} for western
+languages and @code{utf8} in a UTF-8 context.
+
+@item utf8
+Multi-byte encoding of full UCS, compatible with @code{ascii}.
+See above.
+
+@item unicode_be
+Unicode Big Endian. Reads input in pairs of bytes, most
+significant byte first. Can only represent 16-bit characters.
+
+@item unicode_le
+Unicode Little Endian.
+Reads input in pairs of bytes, least
+significant byte first. Can only represent 16-bit characters.
+@end itemize
+
+Note that not all encodings can represent all characters. This implies
+that writing text to a stream may cause errors because the stream
+cannot represent these characters. The behaviour of a stream on these
+errors can be controlled using @code{open/4} or @code{set_stream/2} (not
+implemented). Initially the terminal stream writes the characters using
+Prolog escape sequences while other streams generate an I/O exception.
+
+
+@node BOM, Stream Encoding, , Encoding
+@subsection BOM: Byte Order Mark
+
+@cindex BOM
+@cindex Byte Order Mark
+From @ref{Stream Encoding}, you may have got the impression that text
+files are
+complicated. This section deals with a related topic, often making life
+easier for the user, but providing another worry to the programmer.
+@strong{BOM} or @emph{Byte Order Mark} is a technique for
+identifying Unicode text files as well as the encoding they use. Such
+files start with the Unicode character @code{0xFEFF}, a non-breaking,
+zero-width space character. This is a pretty unique sequence that is not
+likely to be the start of a non-Unicode file and uniquely distinguishes
+the various Unicode file formats. As it is a zero-width blank, it
+does not even produce any output. This solves all problems, or ...
+
+Some formats start off as US-ASCII and may contain some encoding mark to
+switch to UTF-8, such as the @code{encoding="UTF-8"} in an XML header.
+Such formats often explicitly forbid the use of a UTF-8 BOM. In
+other cases there is additional information specifying the encoding, making
+the use of a BOM redundant or even illegal.
+
+The BOM is handled by the @code{open/4} predicate. By default, text files are
+probed for the BOM when opened for reading. If a BOM is found, the
+encoding is set accordingly and the property @code{bom(true)} is
+available through @code{stream_property/2}.
+When opening a file for
+writing, writing a BOM can be requested using the option
+@code{bom(true)} with @code{open/4}.
+
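+@noindent
+As a hedged sketch of the BOM handling described in this subsection,
+the queries below write a file and then reopen it. The file name
+@file{data.txt} is hypothetical, and the @code{encoding(utf8)} option
+is an assumption that follows the SWI-Prolog compatible interface
+described above.
+@example
+% Write a UTF-8 file that starts with a BOM.
+?- open('data.txt', write, S, [encoding(utf8), bom(true)]),
+   write(S, hello), nl(S), close(S).
+
+% Reopen it; text files are probed for a BOM by default when read.
+?- open('data.txt', read, S, []),
+   stream_property(S, bom(B)), close(S).
+@end example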