@c -*- mode: texinfo; coding: utf-8; -*-

@node Syntax, Loading Programs, Run, Top
@chapter Syntax

We describe the syntax of YAP at two levels. We first describe the
syntax of Prolog terms. We then describe the @i{tokens} from which
Prolog @i{terms} are built.

@menu
* Formal Syntax:: Syntax of terms
* Tokens:: Syntax of Prolog tokens
* Encoding:: How characters are encoded and Wide Character Support
@end menu

@node Formal Syntax, Tokens, ,Syntax
@section Syntax of Terms
@cindex syntax

Below, we describe the syntax of YAP terms, built from the different
classes of tokens defined in @ref{Tokens}. The formalism used is
@emph{BNF}, extended where necessary with attributes denoting integer
precedence or operator type.

@example
term       ---->     subterm(1200)   end_of_term_marker

subterm(N) ---->     term(M)         [M <= N]

term(N)    ---->     op(N, fx) subterm(N-1)
             |       op(N, fy) subterm(N)
             |       subterm(N-1) op(N, xfx) subterm(N-1)
             |       subterm(N-1) op(N, xfy) subterm(N)
             |       subterm(N) op(N, yfx) subterm(N-1)
             |       subterm(N-1) op(N, xf)
             |       subterm(N) op(N, yf)

term(0)    ---->     atom '(' arguments ')'
             |       '(' subterm(1200) ')'
             |       '@{' subterm(1200) '@}'
             |       list
             |       string
             |       number
             |       atom
             |       variable

arguments  ---->     subterm(999)
             |       subterm(999) ',' arguments

list       ---->     '[]'
             |       '[' list_expr ']'

list_expr  ---->     subterm(999)
             |       subterm(999) list_tail

list_tail  ---->     ',' list_expr
             |       ',..' subterm(999)
             |       '|' subterm(999)
@end example

@noindent
Notes:

@itemize @bullet

@item
@i{op(N,T)} denotes an atom which has been previously declared with type
@i{T} and base precedence @i{N}.

@item
Since ',' is itself a pre-declared operator with type @i{xfy} and
precedence 1000, if @i{subterm} starts with a '(', @i{op} must be
followed by a space to avoid ambiguity with the case of a functor
followed by arguments, e.g.:

@example
+ (a,b)        [the same as '+'(','(a,b)) of arity one]
@end example
@noindent
versus

@example
+(a,b)         [the same as '+'(a,b) of arity two]
@end example

@item
In the first rule for term(0) no blank space should exist between
@i{atom} and '('.

@item
@cindex end of term
Each term to be read by the YAP parser must end with a single dot,
followed by a blank (in the sense described in @ref{Layout}). When a
name consisting of a single dot could be taken for the end of term
marker, the ambiguity should be avoided by surrounding the dot with
single quotes.

@end itemize

@node Tokens, Encoding, Formal Syntax, Syntax
@section Prolog Tokens
@cindex token

Prolog tokens are grouped into the following categories:

@menu
* Numbers:: Integer and Floating-Point Numbers
* Strings:: Sequences of Characters
* Atoms:: Atomic Constants
* Variables:: Logical Variables
* Punctuation Tokens:: Tokens that separate other tokens
* Layout:: Comments and Other Layout Rules
@end menu

@node Numbers, Strings, ,Tokens
@subsection Numbers
@cindex number

Numbers can be further subdivided into integer and floating-point numbers.

@menu
* Integers:: How Integers are read and represented
* Floats:: Floating Point Numbers
@end menu

@node Integers, Floats, ,Numbers
@subsubsection Integers
@cindex integer

Integer numbers are described by the following regular expression:

@example
<integer> := @{<digit>+<quote>|0@{xXo@}@}<alpha_numeric_digit>+
@end example
@noindent
where @{...@} stands for optionality, @i{+} denotes repetition (one or
more times), @i{<digit>} denotes one of the characters 0 ... 9, @i{|}
denotes or, and @i{<quote>} denotes the character "'". The digits
before the @i{<quote>} character, when present, form the number
basis, which can range from 0 up to 36.
Letters from @code{A} to @code{Z} are used when the basis is larger
than 10.

Note that if no basis is specified then base 10 is assumed. Note also
that the last digit of an integer token cannot be immediately followed
by one of the characters 'e', 'E', or '.'.

Following the ISO standard, YAP also accepts tokens that start with
@code{0x} to represent numbers in hexadecimal base and tokens that
start with @code{0o} to represent numbers in octal base. For
convenience, YAP also accepts tokens that start with @code{0X} to
represent numbers in hexadecimal base.

Example: the following tokens all denote the same integer

@example
10  2'1010  3'101  8'12  16'a  36'a  0xa  0o12
@end example

Numbers of the form @code{0'a} are used to represent character
constants. So, the following tokens denote the same integer:

@example
0'd  100
@end example

YAP (version @value{VERSION}) supports integers that can fit the word
size of the machine. This is 32 bits on most current machines, but 64 on
some others, such as the Alpha running Linux or Digital Unix. The
scanner will read integers outside this range erroneously.

@node Floats, , Integers,Numbers
@subsubsection Floating-point Numbers
@cindex floating-point number

Floating-point numbers are described by:

@example
<float> := <digit>+@{<dot><digit>+@}
               <exponent-marker>@{<sign>@}<digit>+
          |<digit>+<dot><digit>+
               @{<exponent-marker>@{<sign>@}<digit>+@}
@end example

@noindent
where @i{<dot>} denotes the decimal-point character '.',
@i{<exponent-marker>} denotes one of 'e' or 'E', and @i{<sign>} denotes
one of '+' or '-'.

Examples:

@example
10.0   10e3   10e-3   3.1415e+3
@end example

Floating-point numbers are represented as a double on the target
machine. This is usually a 64-bit number.

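
The notations above can be checked at the top level. The queries below
are a minimal sketch, assuming a YAP top level that reads integers and
floats as described in this section; each goal is expected to succeed.

@example
% Every integer token from the example above denotes the value 10.
?- 10 =:= 2'1010, 10 =:= 3'101, 10 =:= 8'12, 10 =:= 16'a,
   10 =:= 36'a, 10 =:= 0xa, 10 =:= 0o12.

% Base 0 gives the code of the character after the quote.
?- 0'd =:= 100.

% A token with a decimal point or an exponent is read as a float.
?- integer(10), float(10.0), float(10e3), float(3.1415e+3).
@end example
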
@node Strings, Atoms, Numbers,Tokens
@subsection Character Strings
@cindex string

Strings are described by the following rules:

@example
string --> '"' string_quoted_characters '"'

string_quoted_characters --> '"' '"' string_quoted_characters
string_quoted_characters --> '\' escape_sequence string_quoted_characters
string_quoted_characters --> string_character string_quoted_characters

escape_sequence --> 'a' | 'b' | 'r' | 'f' | 't' | 'n' | 'v'
escape_sequence --> '\' | '"' | ''' | '`'
escape_sequence --> at_most_3_octal_digit_seq_char '\'
escape_sequence --> 'x' at_most_2_hexa_digit_seq_char '\'
@end example

@noindent
where @code{string_character} is any character except the double quote
and escape characters.

Examples:

@example
""   "a string"   "a double-quote:"""
@end example

The first string is an empty string; the last string shows the use of
double-quoting. The implementation of YAP represents strings as
lists of integers. Since YAP 4.3.0 there is no static limit on string
size.

Escape sequences can be used to include the non-printable characters
@code{a} (alert), @code{b} (backspace), @code{r} (carriage return),
@code{f} (form feed), @code{t} (horizontal tabulation), @code{n} (new
line), and @code{v} (vertical tabulation). Escape sequences can also
include the meta-characters @code{\}, @code{"}, @code{'}, and
@code{`}. Last, one can use escape sequences to include a character
given as an octal or hexadecimal number.

The next examples demonstrate the use of escape sequences in YAP:

@example
"\x0c\"   "\014\"   "\f"   "\\"
@end example

The first three examples return a list including only character 12 (form
feed). The last example escapes the escape character.

Escape sequences were not available in C-Prolog and in original
versions of YAP up to 4.2.0. Escape sequences can be disabled by using:

@example
:- yap_flag(character_escapes,false).
@end example

@node Atoms, Variables, Strings, Tokens
@subsection Atoms
@cindex atom

Atoms are defined by one of the following rules:

@example
atom --> solo-character
atom --> lower-case-letter name-character*
atom --> symbol-character+
atom --> single-quote single-quote
atom --> ''' atom_quoted_characters '''

atom_quoted_characters --> ''' ''' atom_quoted_characters
atom_quoted_characters --> '\' escape_sequence atom_quoted_characters
atom_quoted_characters --> character atom_quoted_characters
@end example

@noindent
where:

@example
<solo-character>     denotes one of:    ! ;
<symbol-character>   denotes one of:    # & * + - . / : < = > ? @@ \ ^ ~ `
<lower-case-letter>  denotes one of:    a...z
<name-character>     denotes one of:    _ a...z A...Z 0....9
<single-quote>       denotes:           '
@end example

@noindent
and @code{character} denotes any character except the single quote
and escape characters. Note that escape sequences in strings and atoms
follow the same rules.

Examples:

@example
a   a12x   '$a'   !   =>   '1 2'
@end example

Version @code{4.2.0} of YAP removed the previous limit of 256
characters on an atom. The size of an atom is now limited only by the
space available in the system.

@node Variables, Punctuation Tokens, Atoms, Tokens
@subsection Variables
@cindex variable

Variables are described by:

@example
<variable-starter><variable-character>+
@end example

@noindent
where

@example
<variable-starter>   denotes one of:    _ A...Z
<variable-character> denotes one of:    _ a...z A...Z
@end example

@cindex anonymous variable
If a variable is referred to only once in a term, it need not be named
and one can use the character @code{_} to represent the variable. These
variables are known as anonymous variables. Note that different
occurrences of @code{_} in the same term represent @emph{different}
anonymous variables.

@node Punctuation Tokens, Layout, Variables, Tokens
@subsection Punctuation Tokens
@cindex punctuation token

Punctuation tokens consist of one of the following characters:

@example
( ) , [ ] @{ @} |
@end example

These characters are used to group terms.

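
The token classes described in the preceding subsections can be told
apart with the standard type-checking predicates. The queries below are
a minimal sketch, assuming a standard top level; each goal is expected
to succeed.

@example
% Atoms: alphanumeric, quoted, solo-character and symbol-character forms.
?- atom(a12x), atom('$a'), atom('1 2'), atom(!).
?- X = (=>), atom(X).

% A token starting with an upper-case letter or _ is a variable.
?- var(Count), var(_count).

% Each _ is a distinct anonymous variable, so both unifications succeed.
?- _ = 1, _ = 2.
@end example

@noindent
By contrast, @code{X = 1, X = 2} fails, because both occurrences of
@code{X} name the same variable.
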
@node Layout, ,Punctuation Tokens, Tokens
@subsection Layout
@cindex comment

Any characters with ASCII code less than or equal to 32 appearing before
a token are ignored.

All the text appearing in a line after the character @i{%} is taken to
be a comment and ignored (including @i{%}). Comments can also be
inserted by using the sequence @code{/*} to start the comment and
@code{*} followed by @code{/} to finish it. In the presence of any
sequence of comments or layout characters, the YAP parser behaves as if
it had found a single blank character. The end of a file also counts as
a blank character for this purpose.

@node Encoding, , Tokens, Syntax
@section Wide Character Support
@cindex encodings

@menu
* Stream Encoding:: How Prolog Streams can be coded
* BOM:: The Byte Order Mark
@end menu

@cindex UTF-8
@cindex Unicode
@cindex UCS
@cindex internationalization

YAP now implements a SWI-Prolog compatible interface to wide
characters and the Universal Character Set (UCS). The following text
was adapted from the SWI-Prolog manual.

YAP now supports wide characters, that is, characters with character
codes above 255 that cannot be represented in a single byte.
@emph{Universal Character Set} (UCS) is the ISO/IEC 10646 standard
that specifies a unique 31-bit unsigned integer for any character in
any language. It is a superset of 16-bit Unicode, which in turn is
a superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII. UCS
can handle strings holding characters from multiple languages, and
character classification (uppercase, lowercase, digit, etc.) and
operations such as case-conversion are unambiguously defined.

For this reason YAP, following SWI-Prolog, has two representations for
atoms. If the text fits in ISO Latin-1, it is represented as an array
of 8-bit characters. Otherwise the text is represented as an array of
wide chars, which may take 16 or 32 bits. This representational issue
is completely transparent to the Prolog user.
Users of the foreign language interface sometimes need to be aware of
these issues though.

Character coding comes into view when characters or strings need to be
read from or written to a file, or when they have to be communicated to
other software components using the foreign language interface. In this
section we only deal with I/O through streams, which includes file I/O
as well as I/O through network sockets.

@node Stream Encoding, , BOM, Encoding
@subsection Wide character encodings on streams

Although characters are uniquely coded using the UCS standard
internally, streams and files are byte (8-bit) oriented and there are a
variety of ways to represent the larger UCS codes in an 8-bit octet
stream. The most popular one, especially in the context of the web, is
UTF-8. Bytes 0...127 simply represent the corresponding US-ASCII
character, while bytes 128...255 are used for multi-byte
encoding of characters placed higher in the UCS space. Especially on
MS-Windows, the 16-bit Unicode standard, represented by pairs of bytes,
is also popular.

Prolog I/O streams have a property called @emph{encoding}, which
specifies the encoding used and influences @code{get_code/2} and
@code{put_code/2} as well as all the other text I/O predicates.

The default encoding for files is derived from the Prolog flag
@code{encoding}, which is initialised from the environment. If the
environment variable @env{LANG} ends in "UTF-8", this encoding is
assumed. Otherwise the default is @code{text} and the translation is
left to the wide-character functions of the C-library (note that the
Prolog native UTF-8 mode is considerably faster than the generic
@code{mbrtowc()} one). The encoding can be specified explicitly in
@code{load_files/2} for loading Prolog source with an alternative
encoding, in @code{open/4} when opening files, or using
@code{set_stream/2} on any open stream (not yet implemented).
For Prolog source files we also provide the @code{encoding/1}
directive, which can be used to switch between encodings that are
compatible with US-ASCII (@code{ascii}, @code{iso_latin_1},
@code{utf8} and many locales).
@c See also
@c \secref{intsrcfile} for writing Prolog files with non-US-ASCII
@c characters and \secref{unicodesyntax} for syntax issues.
For additional information and Unicode resources, please visit
@uref{http://www.unicode.org/}.

YAP currently defines and supports the following encodings:

@itemize @bullet
@item octet
Default encoding for @emph{binary} streams. This causes
the stream to be read and written fully untranslated.

@item ascii
7-bit encoding in 8-bit bytes. Equivalent to @code{iso_latin_1},
but generates errors and warnings on encountering values above
127.

@item iso_latin_1
8-bit encoding supporting many western languages. This causes
the stream to be read and written fully untranslated.

@item text
C-library default locale encoding for text files. Files are read and
written using the C-library functions @code{mbrtowc()} and
@code{wcrtomb()}. This may be the same as one of the other encodings;
notably, it may be the same as @code{iso_latin_1} for western
languages and @code{utf8} in a UTF-8 context.

@item utf8
Multi-byte encoding of full UCS, compatible with @code{ascii}.
See above.

@item unicode_be
Unicode Big Endian. Reads input in pairs of bytes, most
significant byte first. Can only represent 16-bit characters.

@item unicode_le
Unicode Little Endian. Reads input in pairs of bytes, least
significant byte first. Can only represent 16-bit characters.
@end itemize

Note that not all encodings can represent all characters. This implies
that writing text to a stream may cause errors because the stream
cannot represent these characters. The behaviour of a stream on these
errors can be controlled using @code{open/4} or @code{set_stream/2} (not
implemented).
Initially the terminal streams write such characters using
Prolog escape sequences, while other streams generate an I/O exception.

@node BOM, Stream Encoding, , Encoding
@subsection BOM: Byte Order Mark
@cindex BOM
@cindex Byte Order Mark

From @ref{Stream Encoding}, you may have got the impression that
text-files are complicated. This section deals with a related topic,
which often makes life easier for the user but gives the programmer
another thing to worry about. @strong{BOM} or @emph{Byte Order Marker}
is a technique for identifying Unicode text-files as well as the
encoding they use. Such files start with the Unicode character
@code{0xFEFF}, a non-breaking, zero-width space character. This is a
fairly distinctive sequence that is unlikely to be the start of a
non-Unicode file, and it uniquely distinguishes the various Unicode
file formats. As it is a zero-width blank, it does not even produce any
output. This solves all problems, or ...

Some formats start off as US-ASCII and may contain some encoding mark to
switch to UTF-8, such as the @code{encoding="UTF-8"} in an XML header.
Such formats often explicitly forbid the use of a UTF-8 BOM. In
other cases there is additional information giving the encoding, which
makes the use of a BOM redundant or even illegal.

The BOM is handled by the @code{open/4} predicate. By default, text-files are
probed for the BOM when opened for reading. If a BOM is found, the
encoding is set accordingly and the property @code{bom(true)} is
available through @code{stream_property/2}. When opening a file for
writing, writing a BOM can be requested using the option
@code{bom(true)} with @code{open/4}.
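
The BOM handling described above can be exercised as follows. This is a
minimal sketch, not a definitive recipe: the predicate names
@code{write_with_bom/1} and @code{has_bom/1} are hypothetical, and the
@code{encoding(utf8)} option of @code{open/4} is an assumption that
follows SWI-Prolog.

@example
% Write a UTF-8 text file that starts with a BOM.
% encoding(utf8) is an assumed option name, as in SWI-Prolog.
write_with_bom(File) :-
    open(File, write, Out, [encoding(utf8), bom(true)]),
    write(Out, hello), nl(Out),
    close(Out).

% Succeed if a BOM was detected when File was opened for reading.
has_bom(File) :-
    open(File, read, In, []),
    (   stream_property(In, bom(true))
    ->  Found = true
    ;   Found = false
    ),
    close(In),
    Found == true.
@end example

@noindent
With these definitions, calling @code{write_with_bom/1} and then
@code{has_bom/1} on the same file name should succeed.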