doc updates
This commit is contained in:
parent
250099cfe8
commit
ca0c646793
493
docs/syntax.tex
Normal file
493
docs/syntax.tex
Normal file
@ -0,0 +1,493 @@
|
||||
@c -*- mode: texinfo; coding: utf-8; -*-
|
||||
|
||||
|
||||
@node Syntax, Loading Programs, Run, Top
|
||||
@chapter Syntax
|
||||
|
||||
We will describe the syntax of YAP at two levels. We first will
|
||||
describe the syntax for Prolog terms. In a second level we describe
|
||||
the @i{tokens} from which Prolog @i{terms} are
|
||||
built.
|
||||
|
||||
@menu
|
||||
* Formal Syntax:: Syntax of terms
|
||||
* Tokens:: Syntax of Prolog tokens
|
||||
* Encoding:: How characters are encoded and Wide Character Support
|
||||
@end menu
|
||||
|
||||
@node Formal Syntax, Tokens, ,Syntax
|
||||
@section Syntax of Terms
|
||||
@cindex syntax
|
||||
|
||||
Below, we describe the syntax of YAP terms from the different
|
||||
classes of tokens defined above. The formalism used will be @emph{BNF},
|
||||
extended where necessary with attributes denoting integer precedence or
|
||||
operator type.
|
||||
|
||||
@example
|
||||
term ----> subterm(1200) end_of_term_marker
|
||||
|
||||
subterm(N) ----> term(M) [M <= N]
|
||||
|
||||
term(N) ----> op(N, fx) subterm(N-1)
|
||||
| op(N, fy) subterm(N)
|
||||
| subterm(N-1) op(N, xfx) subterm(N-1)
|
||||
| subterm(N-1) op(N, xfy) subterm(N)
|
||||
| subterm(N) op(N, yfx) subterm(N-1)
|
||||
| subterm(N-1) op(N, xf)
|
||||
| subterm(N) op(N, yf)
|
||||
|
||||
term(0) ----> atom '(' arguments ')'
|
||||
| '(' subterm(1200) ')'
|
||||
| '@{' subterm(1200) '@}'
|
||||
| list
|
||||
| string
|
||||
| number
|
||||
| atom
|
||||
| variable
|
||||
|
||||
arguments ----> subterm(999)
|
||||
| subterm(999) ',' arguments
|
||||
|
||||
list ----> '[]'
|
||||
| '[' list_expr ']'
|
||||
|
||||
list_expr ----> subterm(999)
|
||||
| subterm(999) list_tail
|
||||
|
||||
list_tail ----> ',' list_expr
|
||||
| ',..' subterm(999)
|
||||
| '|' subterm(999)
|
||||
@end example
|
||||
|
||||
@noindent
|
||||
Notes:
|
||||
|
||||
@itemize @bullet
|
||||
|
||||
@item
|
||||
@i{op(N,T)} denotes an atom which has been previously declared with type
|
||||
@i{T} and base precedence @i{N}.
|
||||
|
||||
@item
|
||||
Since ',' is itself a pre-declared operator with type @i{xfy} and
|
||||
precedence 1000, is @i{subterm} starts with a '(', @i{op} must be
|
||||
followed by a space to avoid ambiguity with the case of a functor
|
||||
followed by arguments, e.g.:
|
||||
|
||||
@example
|
||||
+ (a,b) [the same as '+'(','(a,b)) of arity one]
|
||||
@end example
|
||||
versus
|
||||
@example
|
||||
+(a,b) [the same as '+'(a,b) of arity two]
|
||||
@end example
|
||||
|
||||
@item
|
||||
In the first rule for term(0) no blank space should exist between
|
||||
@i{atom} and '('.
|
||||
|
||||
@item
|
||||
@cindex end of term
|
||||
Each term to be read by the YAP parser must end with a single
|
||||
dot, followed by a blank (in the sense mentioned in the previous
|
||||
paragraph). When a name consisting of a single dot could be taken for
|
||||
the end of term marker, the ambiguity should be avoided by surrounding the
|
||||
dot with single quotes.
|
||||
|
||||
@end itemize
|
||||
|
||||
@node Tokens, Encoding, Formal Syntax, Syntax
|
||||
@section Prolog Tokens
|
||||
@cindex token
|
||||
|
||||
Prolog tokens are grouped into the following categories:
|
||||
|
||||
@menu
|
||||
* Numbers:: Integer and Floating-Point Numbers
|
||||
* Strings:: Sequences of Characters
|
||||
* Atoms:: Atomic Constants
|
||||
* Variables:: Logical Variables
|
||||
* Punctuation Tokens:: Tokens that separate other tokens
|
||||
* Layout:: Comments and Other Layout Rules
|
||||
@end menu
|
||||
|
||||
@node Numbers, Strings, ,Tokens
|
||||
@subsection Numbers
|
||||
@cindex number
|
||||
|
||||
Numbers can be further subdivided into integer and floating-point numbers.
|
||||
|
||||
@menu
|
||||
* Integers:: How Integers are read and represented
|
||||
* Floats:: Floating Point Numbers
|
||||
@end menu
|
||||
|
||||
@node Integers, Floats, ,Numbers
|
||||
@subsubsection Integers
|
||||
@cindex integer
|
||||
|
||||
Integer numbers
|
||||
are described by the following regular expression:
|
||||
|
||||
@example
|
||||
<integer> := @{<digit>+<single-quote>|0@{xXo@}@}<alpha_numeric_char>+
|
||||
@end example
|
||||
@noindent
|
||||
where @{...@} stands for optionality, @i{+} optional repetition (one or
|
||||
more times), @i{<digit>} denotes one of the characters 0 ... 9, @i{|}
|
||||
denotes or, and @i{<single-quote>} denotes the character "'". The digits
|
||||
before the @i{<single-quote>} character, when present, form the number
|
||||
basis, that can go from 0, 1 and up to 36. Letters from @code{A} to
|
||||
@code{Z} are used when the basis is larger than 10.
|
||||
|
||||
Note that if no basis is specified then base 10 is assumed. Note also
|
||||
that the last digit of an integer token can not be immediately followed
|
||||
by one of the characters 'e', 'E', or '.'.
|
||||
|
||||
Following the ISO standard, YAP also accepts directives of the
|
||||
form @code{0x} to represent numbers in hexadecimal base and of the form
|
||||
@code{0o} to represent numbers in octal base. For usefulness,
|
||||
YAP also accepts directives of the form @code{0X} to represent
|
||||
numbers in hexadecimal base.
|
||||
|
||||
Example:
|
||||
the following tokens all denote the same integer
|
||||
@example
|
||||
@code{10 2'1010 3'101 8'12 16'a 36'a 0xa 0o12}
|
||||
@end example
|
||||
|
||||
Numbers of the form @code{0'a} are used to represent character
|
||||
constants. So, the following tokens denote the same integer:
|
||||
@example
|
||||
@code{0'd 100}
|
||||
@end example
|
||||
|
||||
YAP (version @value{VERSION}) supports integers that can fit
|
||||
the word size of the machine. This is 32 bits in most current machines,
|
||||
but 64 in some others, such as the Alpha running Linux or Digital
|
||||
Unix. The scanner will read larger or smaller integers erroneously.
|
||||
|
||||
@node Floats, , Integers,Numbers
|
||||
@subsubsection Floating-point Numbers
|
||||
@cindex floating-point number
|
||||
|
||||
Floating-point numbers are described by:
|
||||
|
||||
@example
|
||||
<float> := <digit>+@{<dot><digit>+@}
|
||||
<exponent-marker>@{<sign>@}<digit>+
|
||||
|<digit>+<dot><digit>+
|
||||
@{<exponent-marker>@{<sign>@}<digit>+@}
|
||||
@end example
|
||||
|
||||
@noindent
|
||||
where @i{<dot>} denotes the decimal-point character '.',
|
||||
@i{<exponent-marker>} denotes one of 'e' or 'E', and @i{<sign>} denotes
|
||||
one of '+' or '-'.
|
||||
|
||||
Examples:
|
||||
@example
|
||||
@code{10.0 10e3 10e-3 3.1415e+3}
|
||||
@end example
|
||||
|
||||
Floating-point numbers are represented as a double in the target
|
||||
machine. This is usually a 64-bit number.
|
||||
|
||||
@node Strings, Atoms, Numbers,Tokens
|
||||
@subsection Character Strings
|
||||
@cindex string
|
||||
|
||||
Strings are described by the following rules:
|
||||
@example
|
||||
string --> '"' string_quoted_characters '"'
|
||||
|
||||
string_quoted_characters --> '"' '"' string_quoted_characters
|
||||
string_quoted_characters --> '\'
|
||||
escape_sequence string_quoted_characters
|
||||
string_quoted_characters -->
|
||||
string_character string_quoted_characters
|
||||
|
||||
escape_sequence --> 'a' | 'b' | 'r' | 'f' | 't' | 'n' | 'v'
|
||||
escape_sequence --> '\' | '"' | ''' | '`'
|
||||
escape_sequence --> at_most_3_octal_digit_seq_char '\'
|
||||
escape_sequence --> 'x' at_most_2_hexa_digit_seq_char '\'
|
||||
@end example
|
||||
where @code{string_character} in any character except the double quote
|
||||
and escape characters.
|
||||
|
||||
Examples:
|
||||
@example
|
||||
@code{"" "a string" "a double-quote:""" }
|
||||
@end example
|
||||
|
||||
The first string is an empty string, the last string shows the use of
|
||||
double-quoting. The implementation of YAP represents strings as
|
||||
lists of integers. Since YAP 4.3.0 there is no static limit on string
|
||||
size.
|
||||
|
||||
Escape sequences can be used to include the non-printable characters
|
||||
@code{a} (alert), @code{b} (backspace), @code{r} (carriage return),
|
||||
@code{f} (form feed), @code{t} (horizontal tabulation), @code{n} (new
|
||||
line), and @code{v} (vertical tabulation). Escape sequences also be
|
||||
include the meta-characters @code{\}, @code{"}, @code{'}, and
|
||||
@code{`}. Last, one can use escape sequences to include the characters
|
||||
either as an octal or hexadecimal number.
|
||||
|
||||
The next examples demonstrates the use of escape sequences in YAP:
|
||||
|
||||
@example
|
||||
@code{"\x0c\" "\01\" "\f" "\\" }
|
||||
@end example
|
||||
|
||||
The first three examples return a list including only character 12 (form
|
||||
feed). The last example escapes the escape character.
|
||||
|
||||
Escape sequences were not available in C-Prolog and in original
|
||||
versions of YAP up to 4.2.0. Escape sequences can be disable by using:
|
||||
@example
|
||||
@code{:- yap_flag(character_escapes,false).}
|
||||
@end example
|
||||
|
||||
|
||||
@node Atoms, Variables, Strings, Tokens
|
||||
@subsection Atoms
|
||||
@cindex atom
|
||||
|
||||
Atoms are defined by one of the following rules:
|
||||
@example
|
||||
atom --> solo-character
|
||||
atom --> lower-case-letter name-character*
|
||||
atom --> symbol-character+
|
||||
atom --> single-quote single-quote
|
||||
atom --> ''' atom_quoted_characters '''
|
||||
|
||||
|
||||
atom_quoted_characters --> ''' ''' atom_quoted_characters
|
||||
atom_quoted_characters --> '\' atom_sequence string_quoted_characters
|
||||
atom_quoted_characters --> character string_quoted_characters
|
||||
@end example
|
||||
where:
|
||||
@example
|
||||
<solo-character> denotes one of: ! ;
|
||||
<symbol-character> denotes one of: # & * + - . / : <
|
||||
= > ? @@ \ ^ ~ `
|
||||
<lower-case-letter> denotes one of: a...z
|
||||
<name-character> denotes one of: _ a...z A...Z 0....9
|
||||
<single-quote> denotes: '
|
||||
@end example
|
||||
|
||||
and @code{string_character} denotes any character except the double quote
|
||||
and escape characters. Note that escape sequences in strings and atoms
|
||||
follow the same rules.
|
||||
|
||||
Examples:
|
||||
@example
|
||||
@code{a a12x '$a' ! => '1 2'}
|
||||
@end example
|
||||
|
||||
Version @code{4.2.0} of YAP removed the previous limit of 256
|
||||
characters on an atom. Size of an atom is now only limited by the space
|
||||
available in the system.
|
||||
|
||||
@node Variables, Punctuation Tokens, Atoms, Tokens
|
||||
@subsection Variables
|
||||
@cindex variable
|
||||
|
||||
Variables are described by:
|
||||
@example
|
||||
<variable-starter><variable-character>+
|
||||
@end example
|
||||
where
|
||||
@example
|
||||
<variable-starter> denotes one of: _ A...Z
|
||||
<variable-character> denotes one of: _ a...z A...Z
|
||||
@end example
|
||||
|
||||
@cindex anonymous variable
|
||||
If a variable is referred only once in a term, it needs not to be named
|
||||
and one can use the character @code{_} to represent the variable. These
|
||||
variables are known as anonymous variables. Note that different
|
||||
occurrences of @code{_} on the same term represent @emph{different}
|
||||
anonymous variables.
|
||||
|
||||
@node Punctuation Tokens, Layout, Variables, Tokens
|
||||
@subsection Punctuation Tokens
|
||||
@cindex punctuation token
|
||||
|
||||
Punctuation tokens consist of one of the following characters:
|
||||
@example
|
||||
( ) , [ ] @{ @} |
|
||||
@end example
|
||||
|
||||
These characters are used to group terms.
|
||||
|
||||
@node Layout, ,Punctuation Tokens, Tokens
|
||||
@subsection Layout
|
||||
@cindex comment
|
||||
Any characters with ASCII code less than or equal to 32 appearing before
|
||||
a token are ignored.
|
||||
|
||||
All the text appearing in a line after the character @i{%} is taken to
|
||||
be a comment and ignored (including @i{%}). Comments can also be
|
||||
inserted by using the sequence @code{/*} to start the comment and
|
||||
@code{*} followed by @code{/} to finish it. In the presence of any sequence of comments or
|
||||
layout characters, the YAP parser behaves as if it had found a
|
||||
single blank character. The end of a file also counts as a blank
|
||||
character for this purpose.
|
||||
|
||||
@node Encoding, , Tokens, Syntax
|
||||
@section Wide Character Support
|
||||
|
||||
@cindex encodings
|
||||
|
||||
@menu
|
||||
* Stream Encoding:: How Prolog Streams can be coded
|
||||
* BOM:: The Byte Order Mark
|
||||
@end menu
|
||||
|
||||
@cindex UTF-8
|
||||
@cindex Unicode
|
||||
@cindex UCS
|
||||
@cindex internationalization
|
||||
|
||||
YAP now implements a SWI-Prolog compatible interface to wide
|
||||
characters and the Universal Character Set (UCS). The following text
|
||||
was adapted from the SWI-Prolog manual.
|
||||
|
||||
YAP now supports wide characters, characters with character
|
||||
codes above 255 that cannot be represented in a single byte.
|
||||
@emph{Universal Character Set} (UCS) is the ISO/IEC 10646 standard
|
||||
that specifies a unique 31-bits unsigned integer for any character in
|
||||
any language. It is a superset of 16-bit Unicode, which in turn is
|
||||
a superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII. UCS
|
||||
can handle strings holding characters from multiple languages and
|
||||
character classification (uppercase, lowercase, digit, etc.) and
|
||||
operations such as case-conversion are unambiguously defined.
|
||||
|
||||
For this reason YAP, following SWI-Prolog, has two representations for
|
||||
atoms. If the text fits in ISO Latin-1, it is represented as an array
|
||||
of 8-bit characters. Otherwise the text is represented as an array of
|
||||
wide chars, which may take 16 or 32 bits. This representational issue
|
||||
is completely transparent to the Prolog user. Users of the foreign
|
||||
language interface sometimes need to be aware of these issues though.
|
||||
|
||||
Character coding comes into view when characters of strings need to be
|
||||
read from or written to file or when they have to be communicated to
|
||||
other software components using the foreign language interface. In this
|
||||
section we only deal with I/O through streams, which includes file I/O
|
||||
as well as I/O through network sockets.
|
||||
|
||||
|
||||
@node Stream Encoding, , BOM, Encoding
|
||||
@subsection Wide character encodings on streams
|
||||
|
||||
|
||||
|
||||
Although characters are uniquely coded using the UCS standard
|
||||
internally, streams and files are byte (8-bit) oriented and there are a
|
||||
variety of ways to represent the larger UCS codes in an 8-bit octet
|
||||
stream. The most popular one, especially in the context of the web, is
|
||||
UTF-8. Bytes 0...127 represent simply the corresponding US-ASCII
|
||||
character, while bytes 128...255 are used for multi-byte
|
||||
encoding of characters placed higher in the UCS space. Especially on
|
||||
MS-Windows the 16-bit Unicode standard, represented by pairs of bytes is
|
||||
also popular.
|
||||
|
||||
Prolog I/O streams have a property called @emph{encoding} which
|
||||
specifies the used encoding that influence @code{get_code/2} and
|
||||
@code{put_code/2} as well as all the other text I/O predicates.
|
||||
|
||||
The default encoding for files is derived from the Prolog flag
|
||||
@code{encoding}, which is initialised from the environment. If the
|
||||
environment variable @env{LANG} ends in "UTF-8", this encoding is
|
||||
assumed. Otherwise the default is @code{text} and the translation is
|
||||
left to the wide-character functions of the C-library (note that the
|
||||
Prolog native UTF-8 mode is considerably faster than the generic
|
||||
@code{mbrtowc()} one). The encoding can be specified explicitly in
|
||||
@code{load_files/2} for loading Prolog source with an alternative
|
||||
encoding, @code{open/4} when opening files or using @code{set_stream/2} on
|
||||
any open stream (not yet implemented). For Prolog source files we also
|
||||
provide the @code{encoding/1} directive that can be used to switch
|
||||
between encodings that are compatible to US-ASCII (@code{ascii},
|
||||
@code{iso_latin_1}, @code{utf8} and many locales).
|
||||
@c See also
|
||||
@c \secref{intsrcfile} for writing Prolog files with non-US-ASCII
|
||||
@c characters and \secref{unicodesyntax} for syntax issues.
|
||||
For
|
||||
additional information and Unicode resources, please visit
|
||||
@uref{http://www.unicode.org/}.
|
||||
|
||||
YAP currently defines and supports the following encodings:
|
||||
|
||||
@itemize @bullet
|
||||
@item octet
|
||||
Default encoding for @emph{binary} streams. This causes
|
||||
the stream to be read and written fully untranslated.
|
||||
|
||||
@item ascii
|
||||
7-bit encoding in 8-bit bytes. Equivalent to @code{iso_latin_1},
|
||||
but generates errors and warnings on encountering values above
|
||||
127.
|
||||
|
||||
@item iso_latin_1
|
||||
8-bit encoding supporting many western languages. This causes
|
||||
the stream to be read and written fully untranslated.
|
||||
|
||||
@item text
|
||||
C-library default locale encoding for text files. Files are read and
|
||||
written using the C-library functions @code{mbrtowc()} and
|
||||
@code{wcrtomb()}. This may be the same as one of the other locales,
|
||||
notably it may be the same as @code{iso_latin_1} for western
|
||||
languages and @code{utf8} in a UTF-8 context.
|
||||
|
||||
@item utf8
|
||||
Multi-byte encoding of full UCS, compatible to @code{ascii}.
|
||||
See above.
|
||||
|
||||
@item unicode_be
|
||||
Unicode Big Endian. Reads input in pairs of bytes, most
|
||||
significant byte first. Can only represent 16-bit characters.
|
||||
|
||||
@item unicode_le
|
||||
Unicode Little Endian. Reads input in pairs of bytes, least
|
||||
significant byte first. Can only represent 16-bit characters.
|
||||
@end itemize
|
||||
|
||||
Note that not all encodings can represent all characters. This implies
|
||||
that writing text to a stream may cause errors because the stream
|
||||
cannot represent these characters. The behaviour of a stream on these
|
||||
errors can be controlled using @code{open/4} or @code{set_stream/2} (not
|
||||
implemented). Initially the terminal stream write the characters using
|
||||
Prolog escape sequences while other streams generate an I/O exception.
|
||||
|
||||
|
||||
@node BOM, Stream Encoding, , Encoding
|
||||
@subsection BOM: Byte Order Mark
|
||||
|
||||
@cindex BOM
|
||||
@cindex Byte Order Mark
|
||||
From @ref{Stream Encoding}, you may have got the impression text-files are
|
||||
complicated. This section deals with a related topic, making live often
|
||||
easier for the user, but providing another worry to the programmer.
|
||||
@strong{BOM} or @emph{Byte Order Marker} is a technique for
|
||||
identifying Unicode text-files as well as the encoding they use. Such
|
||||
files start with the Unicode character @code{0xFEFF}, a non-breaking,
|
||||
zero-width space character. This is a pretty unique sequence that is not
|
||||
likely to be the start of a non-Unicode file and uniquely distinguishes
|
||||
the various Unicode file formats. As it is a zero-width blank, it even
|
||||
doesn't produce any output. This solves all problems, or ...
|
||||
|
||||
Some formats start of as US-ASCII and may contain some encoding mark to
|
||||
switch to UTF-8, such as the @code{encoding="UTF-8"} in an XML header.
|
||||
Such formats often explicitly forbid the the use of a UTF-8 BOM. In
|
||||
other cases there is additional information telling the encoding making
|
||||
the use of a BOM redundant or even illegal.
|
||||
|
||||
The BOM is handled by the @code{open/4} predicate. By default, text-files are
|
||||
probed for the BOM when opened for reading. If a BOM is found, the
|
||||
encoding is set accordingly and the property @code{bom(true)} is
|
||||
available through @code{stream_property/2}. When opening a file for
|
||||
writing, writing a BOM can be requested using the option
|
||||
@code{bom(true)} with @code{open/4}.
|
||||
|
Reference in New Issue
Block a user