495 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
			
		
		
	
	
			495 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
| @c -*- mode: texinfo; coding: utf-8; -*-
 | |
| 
 | |
| 
 | |
| @node Syntax, Loading Programs, Run, Top
 | |
| @chapter Syntax
 | |
| 
 | |
| We will describe the syntax of YAP at two levels. We first will
 | |
| describe the syntax for Prolog terms. In a second level we describe
 | |
| the @i{tokens} from which Prolog @i{terms} are
 | |
| built.
 | |
| 
 | |
| @menu
 | |
| * Formal Syntax:: Syntax of terms 
 | |
| * Tokens:: Syntax of Prolog tokens
 | |
| * Encoding:: How characters are encoded and Wide Character Support
 | |
| @end menu
 | |
| 
 | |
| @node Formal Syntax, Tokens, ,Syntax
 | |
| @section Syntax of Terms
 | |
| @cindex syntax
 | |
| 
 | |
| Below, we describe the syntax of YAP terms from the different
 | |
| classes of tokens defined above. The formalism used will be @emph{BNF},
 | |
| extended where necessary with attributes denoting integer precedence or
 | |
| operator type.
 | |
| 
 | |
| @example
 | |
|  term       ---->     subterm(1200)   end_of_term_marker
 | |
| 
 | |
|  subterm(N) ---->     term(M)         [M <= N]
 | |
| 
 | |
|  term(N)    ---->     op(N, fx) subterm(N-1)
 | |
|              |        op(N, fy) subterm(N)
 | |
|              |        subterm(N-1) op(N, xfx) subterm(N-1)
 | |
|              |        subterm(N-1) op(N, xfy) subterm(N)
 | |
|              |        subterm(N) op(N, yfx) subterm(N-1)
 | |
|              |        subterm(N-1) op(N, xf)
 | |
|              |        subterm(N) op(N, yf)
 | |
| 
 | |
|  term(0)   ---->      atom '(' arguments ')'
 | |
|              |        '(' subterm(1200)  ')'
 | |
|              |        '@{' subterm(1200)  '@}'
 | |
|              |        list
 | |
|              |        string
 | |
|              |        number
 | |
|              |        atom
 | |
|              |        variable
 | |
| 
 | |
|  arguments ---->      subterm(999)
 | |
|              |        subterm(999) ',' arguments
 | |
| 
 | |
|  list      ---->      '[]'
 | |
|              |        '[' list_expr ']'
 | |
| 
 | |
|  list_expr ---->      subterm(999)
 | |
|              |        subterm(999) list_tail
 | |
| 
 | |
|  list_tail ---->      ',' list_expr
 | |
|              |        ',..' subterm(999)
 | |
|              |        '|' subterm(999)
 | |
| @end example
 | |
| 
 | |
| @noindent
 | |
| Notes:
 | |
| 
 | |
| @itemize @bullet
 | |
| 
 | |
| @item
 | |
| @i{op(N,T)} denotes an atom which has been previously declared with type
 | |
| @i{T} and base precedence @i{N}.
 | |
| 
 | |
| @item
 | |
| Since ',' is itself a pre-declared operator with type @i{xfy} and
 | |
| precedence 1000, is @i{subterm} starts with a '(', @i{op} must be
 | |
| followed by a space to avoid ambiguity with the case of a functor
 | |
| followed by arguments, e.g.:
 | |
| 
 | |
| @example
 | |
| + (a,b)        [the same as '+'(','(a,b)) of arity one]
 | |
| @end example
 | |
| versus
 | |
| @example
 | |
| +(a,b)         [the same as '+'(a,b) of arity two]
 | |
| @end example
 | |
| 
 | |
| @item
 | |
| In the first rule for term(0) no blank space should exist between
 | |
| @i{atom} and '('.
 | |
| 
 | |
| @item
 | |
| @cindex end of term
 | |
| Each term to be read by the YAP parser must end with a single
 | |
| dot, followed by a blank (in the sense mentioned in the previous
 | |
| paragraph). When a name consisting of a single dot could be taken for
 | |
| the end of term marker, the ambiguity should be avoided by surrounding the
 | |
| dot with single quotes.
 | |
| 
 | |
| @end itemize
 | |
| 
 | |
| @node Tokens, Encoding, Formal Syntax, Syntax
 | |
| @section Prolog Tokens
 | |
| @cindex token
 | |
| 
 | |
| Prolog tokens are grouped into the following categories:
 | |
| 
 | |
| @menu
 | |
| * Numbers:: Integer and Floating-Point Numbers
 | |
| * Strings:: Sequences of Characters
 | |
| * Atoms:: Atomic Constants
 | |
| * Variables:: Logical Variables
 | |
| * Punctuation Tokens:: Tokens that separate other tokens
 | |
| * Layout:: Comments and Other Layout Rules
 | |
| @end menu
 | |
| 
 | |
| @node Numbers, Strings, ,Tokens
 | |
| @subsection Numbers
 | |
| @cindex number
 | |
| 
 | |
| Numbers can be further subdivided into integer and floating-point numbers.
 | |
| 
 | |
| @menu
 | |
| * Integers:: How Integers are read and represented
 | |
| * Floats:: Floating Point Numbers
 | |
| @end menu
 | |
| 
 | |
| @node Integers, Floats, ,Numbers
 | |
| @subsubsection Integers
 | |
| @cindex integer
 | |
| 
 | |
| Integer numbers
 | |
| are described by the following regular expression:
 | |
| 
 | |
| @example
 | |
| <integer> := @{<digit>+<single-quote>|0@{xXo@}@}<alpha_numeric_char>+
 | |
| @end example
 | |
| @noindent
 | |
| where @{...@} stands for optionality, @i{+} optional repetition (one or
 | |
| more times), @i{<digit>} denotes one of the characters 0 ... 9, @i{|}
 | |
| denotes or, and @i{<single-quote>} denotes the character "'". The digits
 | |
| before the @i{<single-quote>} character, when present, form the number
 | |
| basis, that can go from 0, 1 and up to 36. Letters from @code{A} to
 | |
| @code{Z} are used when the basis is larger than 10.
 | |
| 
 | |
| Note that if no basis is specified then base 10 is assumed. Note also
 | |
| that the last digit of an integer token can not be immediately followed
 | |
| by one of the characters 'e', 'E', or '.'.
 | |
| 
 | |
| Following the ISO standard, YAP also accepts directives of the
 | |
| form @code{0x} to represent numbers in hexadecimal base and of the form
 | |
| @code{0o} to represent numbers in octal base. For usefulness,
 | |
| YAP also accepts directives of the form @code{0X} to represent
 | |
| numbers in hexadecimal base.
 | |
| 
 | |
| Example:
 | |
| the following tokens all denote the same integer
 | |
| @example
 | |
| 10  2'1010  3'101  8'12  16'a  36'a  0xa  0o12
 | |
| @end example
 | |
| 
 | |
| Numbers of the form @code{0'a} are used to represent character
 | |
| constants. So, the following tokens denote the same integer:
 | |
| @example
 | |
| 0'd  100
 | |
| @end example
 | |
| 
 | |
| YAP (version @value{VERSION}) supports integers that can fit
 | |
| the word size of the machine. This is 32 bits in most current machines,
 | |
| but 64 in some others, such as the Alpha running Linux or Digital
 | |
| Unix. The scanner will read larger or smaller integers erroneously.
 | |
| 
 | |
| @node Floats, , Integers,Numbers
 | |
| @subsubsection Floating-point Numbers
 | |
| @cindex floating-point number
 | |
| 
 | |
| Floating-point numbers are described by:
 | |
| 
 | |
| @example
 | |
|    <float> := <digit>+@{<dot><digit>+@}
 | |
|                <exponent-marker>@{<sign>@}<digit>+
 | |
|             |<digit>+<dot><digit>+
 | |
|                @{<exponent-marker>@{<sign>@}<digit>+@}
 | |
| @end example
 | |
| 
 | |
| @noindent
 | |
| where @i{<dot>} denotes the decimal-point character '.',
 | |
| @i{<exponent-marker>} denotes one of 'e' or 'E', and @i{<sign>} denotes
 | |
| one of '+' or '-'.
 | |
| 
 | |
| Examples:
 | |
| @example
 | |
| 10.0   10e3   10e-3   3.1415e+3
 | |
| @end example
 | |
| 
 | |
| Floating-point numbers are represented as a double in the target
 | |
| machine. This is usually a 64-bit number.
 | |
| 
 | |
| @node Strings, Atoms, Numbers,Tokens
 | |
| @subsection Character Strings
 | |
| @cindex string
 | |
| 
 | |
| Strings are described by the following rules:
 | |
| @example
 | |
|   string --> '"' string_quoted_characters '"'
 | |
| 
 | |
|   string_quoted_characters --> '"' '"' string_quoted_characters
 | |
|   string_quoted_characters --> '\'
 | |
|                           escape_sequence string_quoted_characters
 | |
|   string_quoted_characters -->
 | |
|                           string_character string_quoted_characters
 | |
| 
 | |
|   escape_sequence --> 'a' | 'b' | 'r' | 'f' | 't' | 'n' | 'v'
 | |
|   escape_sequence --> '\' | '"' | ''' | '`'
 | |
|   escape_sequence --> at_most_3_octal_digit_seq_char '\'
 | |
|   escape_sequence --> 'x' at_most_2_hexa_digit_seq_char '\'
 | |
| @end example
 | |
| where @code{string_character} in any character except the double quote
 | |
| and escape characters.
 | |
| 
 | |
| Examples:
 | |
| @example
 | |
| ""   "a string"   "a double-quote:""" 
 | |
| @end example
 | |
| 
 | |
| The first string is an empty string, the last string shows the use of
 | |
| double-quoting. The implementation of YAP represents strings as
 | |
| lists of integers. Since YAP 4.3.0 there is no static limit on string
 | |
| size.
 | |
| 
 | |
| Escape sequences can be used to include the non-printable characters
 | |
| @code{a} (alert), @code{b} (backspace), @code{r} (carriage return),
 | |
| @code{f} (form feed), @code{t} (horizontal tabulation), @code{n} (new
 | |
| line), and @code{v} (vertical tabulation). Escape sequences also be
 | |
| include the meta-characters @code{\}, @code{"}, @code{'}, and
 | |
| @code{`}. Last, one can use escape sequences to include the characters
 | |
| either as an octal or hexadecimal number.
 | |
| 
 | |
| The next examples demonstrates the use of escape sequences in YAP:
 | |
| 
 | |
| @example
 | |
| "\x0c\" "\01\" "\f" "\\" 
 | |
| @end example
 | |
| 
 | |
| The first three examples return a list including only character 12 (form
 | |
| feed). The last example escapes the escape character.
 | |
| 
 | |
| Escape sequences were not available in C-Prolog and in original
 | |
| versions of YAP up to 4.2.0. Escape sequences can be disable by using:
 | |
| @example
 | |
| :- yap_flag(character_escapes,false).
 | |
| @end example
 | |
| 
 | |
| 
 | |
| @node Atoms, Variables, Strings, Tokens
 | |
| @subsection Atoms
 | |
| @cindex atom
 | |
| 
 | |
| Atoms are defined by one of the following rules:
 | |
| @example
 | |
|    atom --> solo-character
 | |
|    atom --> lower-case-letter name-character*
 | |
|    atom --> symbol-character+
 | |
|    atom --> single-quote  single-quote
 | |
|    atom --> ''' atom_quoted_characters '''
 | |
| 
 | |
| 
 | |
|   atom_quoted_characters --> ''' ''' atom_quoted_characters
 | |
|   atom_quoted_characters --> '\' atom_sequence string_quoted_characters
 | |
|   atom_quoted_characters --> character string_quoted_characters
 | |
| @end example
 | |
| where:
 | |
| @example
 | |
|    <solo-character>     denotes one of:    ! ;
 | |
|    <symbol-character>   denotes one of:    # & * + - . / : < 
 | |
|                                            = > ? @@ \ ^ ~ `
 | |
|    <lower-case-letter>  denotes one of:    a...z
 | |
|    <name-character>     denotes one of:    _ a...z A...Z 0....9
 | |
|    <single-quote>       denotes:           '
 | |
| @end example
 | |
| 
 | |
| and @code{string_character} denotes any character except the double quote
 | |
| and escape characters. Note that escape sequences in strings and atoms
 | |
| follow the same rules.
 | |
| 
 | |
| Examples:
 | |
| @example
 | |
| a   a12x   '$a'   !   =>  '1 2'
 | |
| @end example
 | |
| 
 | |
| Version @code{4.2.0} of YAP removed the previous limit of 256
 | |
| characters on an atom. Size of an atom is now only limited by the space
 | |
| available in the system.
 | |
| 
 | |
| @node Variables, Punctuation Tokens, Atoms, Tokens
 | |
| @subsection Variables
 | |
| @cindex variable
 | |
| 
 | |
| Variables are described by:
 | |
| @example
 | |
|    <variable-starter><variable-character>+
 | |
| @end example
 | |
| where
 | |
| @example
 | |
|   <variable-starter>   denotes one of:    _ A...Z
 | |
|   <variable-character> denotes one of:    _ a...z A...Z
 | |
| @end example
 | |
| 
 | |
| @cindex anonymous variable
 | |
| If a variable is referred only once in a term, it needs not to be named
 | |
| and one can use the character @code{_} to represent the variable. These
 | |
| variables are known as anonymous variables. Note that different
 | |
| occurrences of @code{_} on the same term represent @emph{different}
 | |
| anonymous variables. 
 | |
| 
 | |
| @node Punctuation Tokens, Layout, Variables, Tokens
 | |
| @subsection Punctuation Tokens
 | |
| @cindex punctuation token
 | |
| 
 | |
| Punctuation tokens consist of one of the following characters:
 | |
| @example
 | |
| ( ) , [ ] @{ @} |
 | |
| @end example
 | |
| 
 | |
| These characters are used to group terms.
 | |
| 
 | |
| @node Layout, ,Punctuation Tokens, Tokens
 | |
| @subsection Layout
 | |
| @cindex comment
 | |
| Any characters with ASCII code less than or equal to 32 appearing before
 | |
| a token are ignored.
 | |
| 
 | |
| All the text appearing in a line after the character @i{%} is taken to
 | |
| be a comment and ignored (including @i{%}).  Comments can also be
 | |
| inserted by using the sequence @code{/*} to start the comment and
 | |
| @code{*} followed by @code{/} to finish it. In the presence of any sequence of comments or
 | |
| layout characters, the YAP parser behaves as if it had found a
 | |
| single blank character. The end of a file also counts as a blank
 | |
| character for this purpose.
 | |
| 
 | |
| @node Encoding, , Tokens, Syntax
 | |
| @section Wide Character Support
 | |
| 
 | |
| @cindex encodings
 | |
| 
 | |
| @menu
 | |
| * Stream Encoding:: How Prolog Streams can be coded
 | |
| * BOM:: The Byte Order Mark
 | |
| @end menu
 | |
| 
 | |
| @cindex UTF-8
 | |
| @cindex Unicode
 | |
| @cindex UCS
 | |
| @cindex internationalization
 | |
| 
 | |
| YAP now implements a SWI-Prolog compatible interface to wide
 | |
| characters and the Universal Character Set (UCS). The following text
 | |
| was adapted from the SWI-Prolog manual.
 | |
| 
 | |
| YAP now  supports wide characters, characters with character
 | |
| codes above 255 that cannot be represented in a single byte.
 | |
| @emph{Universal Character Set} (UCS) is the ISO/IEC 10646 standard
 | |
| that specifies a unique 31-bits unsigned integer for any character in
 | |
| any language.  It is a superset of 16-bit Unicode, which in turn is
 | |
| a superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII.  UCS
 | |
| can handle strings holding characters from multiple languages and
 | |
| character classification (uppercase, lowercase, digit, etc.) and
 | |
| operations such as case-conversion are unambiguously defined.
 | |
| 
 | |
| For this reason YAP, following SWI-Prolog, has two representations for
 | |
| atoms. If the text fits in ISO Latin-1, it is represented as an array
 | |
| of 8-bit characters.  Otherwise the text is represented as an array of
 | |
| wide chars, which may take 16 or 32 bits.  This representational issue
 | |
| is completely transparent to the Prolog user.  Users of the foreign
 | |
| language interface sometimes need to be aware of these issues though.
 | |
| 
 | |
| Character coding comes into view when characters of strings need to be
 | |
| read from or written to file or when they have to be communicated to
 | |
| other software components using the foreign language interface. In this
 | |
| section we only deal with I/O through streams, which includes file I/O
 | |
| as well as I/O through network sockets.
 | |
| 
 | |
| 
 | |
| @node Stream Encoding, , BOM, Encoding
 | |
| @subsection Wide character encodings on streams
 | |
| 
 | |
| 
 | |
| 
 | |
| Although characters are uniquely coded using the UCS standard
 | |
| internally, streams and files are byte (8-bit) oriented and there are a
 | |
| variety of ways to represent the larger UCS codes in an 8-bit octet
 | |
| stream. The most popular one, especially in the context of the web, is
 | |
| UTF-8. Bytes 0...127 represent simply the corresponding US-ASCII
 | |
| character, while bytes 128...255 are used for multi-byte
 | |
| encoding of characters placed higher in the UCS space. Especially on
 | |
| MS-Windows the 16-bit Unicode standard, represented by pairs of bytes is
 | |
| also popular.
 | |
| 
 | |
| Prolog I/O streams have a property called @emph{encoding} which
 | |
| specifies the used encoding that influence @code{get_code/2} and
 | |
| @code{put_code/2} as well as all the other text I/O predicates.
 | |
| 
 | |
| The default encoding for files is derived from the Prolog flag
 | |
| @code{encoding}, which is initialised from the environment.  If the
 | |
| environment variable @env{LANG} ends in "UTF-8", this encoding is
 | |
| assumed. Otherwise the default is @code{text} and the translation is
 | |
| left to the wide-character functions of the C-library (note that the
 | |
| Prolog native UTF-8 mode is considerably faster than the generic
 | |
| @code{mbrtowc()} one).  The encoding can be specified explicitly in
 | |
| @code{load_files/2} for loading Prolog source with an alternative
 | |
| encoding, @code{open/4} when opening files or using @code{set_stream/2} on
 | |
| any open stream (not yet implemented). For Prolog source files we also
 | |
| provide the @code{encoding/1} directive that can be used to switch
 | |
| between encodings that are compatible to US-ASCII (@code{ascii},
 | |
| @code{iso_latin_1}, @code{utf8} and many locales).  
 | |
| @c See also
 | |
| @c \secref{intsrcfile} for writing Prolog files with non-US-ASCII
 | |
| @c characters and \secref{unicodesyntax} for syntax issues. 
 | |
| For
 | |
| additional information and Unicode resources, please visit
 | |
| @uref{http://www.unicode.org/}.
 | |
| 
 | |
| YAP currently defines and supports the following encodings:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item  octet
 | |
| Default encoding for @emph{binary} streams.  This causes
 | |
| the stream to be read and written fully untranslated.
 | |
| 
 | |
| @item  ascii
 | |
| 7-bit encoding in 8-bit bytes.  Equivalent to @code{iso_latin_1},
 | |
| but generates errors and warnings on encountering values above
 | |
| 127.
 | |
| 
 | |
| @item  iso_latin_1
 | |
| 8-bit encoding supporting many western languages.  This causes
 | |
| the stream to be read and written fully untranslated.
 | |
| 
 | |
| @item  text
 | |
| C-library default locale encoding for text files.  Files are read and
 | |
| written using the C-library functions @code{mbrtowc()} and
 | |
| @code{wcrtomb()}.  This may be the same as one of the other locales,
 | |
| notably it may be the same as @code{iso_latin_1} for western
 | |
| languages and @code{utf8} in a UTF-8 context.
 | |
| 
 | |
| @item  utf8
 | |
| Multi-byte encoding of full UCS, compatible to @code{ascii}.
 | |
| See above.
 | |
| 
 | |
| @item  unicode_be
 | |
| Unicode Big Endian.  Reads input in pairs of bytes, most
 | |
| significant byte first.  Can only represent 16-bit characters.
 | |
| 
 | |
| @item  unicode_le
 | |
| Unicode Little Endian.  Reads input in pairs of bytes, least
 | |
| significant byte first.  Can only represent 16-bit characters.
 | |
| @end itemize 
 | |
| 
 | |
| Note that not all encodings can represent all characters. This implies
 | |
| that writing text to a stream may cause errors because the stream
 | |
| cannot represent these characters. The behaviour of a stream on these
 | |
| errors can be controlled using @code{open/4} or @code{set_stream/2} (not
 | |
| implemented). Initially the terminal stream write the characters using
 | |
| Prolog escape sequences while other streams generate an I/O exception.
 | |
| 
 | |
| 
 | |
| @node BOM, Stream Encoding, , Encoding
 | |
| @subsection BOM: Byte Order Mark
 | |
| 
 | |
| @cindex BOM
 | |
| @cindex Byte Order Mark
 | |
| From @ref{Stream Encoding}, you may have got the impression that
 | |
| text-files are complicated. This section deals with a related topic,
 | |
| making live often easier for the user, but providing another worry to
 | |
| the programmer.  @strong{BOM} or @emph{Byte Order Marker} is a technique
 | |
| for identifying Unicode text-files as well as the encoding they
 | |
| use. Such files start with the Unicode character @code{0xFEFF}, a
 | |
| non-breaking, zero-width space character. This is a pretty unique
 | |
| sequence that is not likely to be the start of a non-Unicode file and
 | |
| uniquely distinguishes the various Unicode file formats. As it is a
 | |
| zero-width blank, it even doesn't produce any output. This solves all
 | |
| problems, or ...
 | |
| 
 | |
| Some formats start of as US-ASCII and may contain some encoding mark to
 | |
| switch to UTF-8, such as the @code{encoding="UTF-8"} in an XML header.
 | |
| Such formats often explicitly forbid the the use of a UTF-8 BOM. In
 | |
| other cases there is additional information telling the encoding making
 | |
| the use of a BOM redundant or even illegal.
 | |
| 
 | |
| The BOM is handled by the @code{open/4} predicate. By default, text-files are
 | |
| probed for the BOM when opened for reading. If a BOM is found, the
 | |
| encoding is set accordingly and the property @code{bom(true)} is
 | |
| available through @code{stream_property/2}. When opening a file for
 | |
| writing, writing a BOM can be requested using the option
 | |
| @code{bom(true)} with @code{open/4}.
 | |
| 
 |