\documentclass[11pt]{article}
\usepackage{pl}
\usepackage{html}
\usepackage{times}
\onefile
\htmloutput{html} % Output directory
\htmlmainfile{index} % Main document file
\bodycolor{white} % Page colour
\newcommand{\elem}[1]{{\tt\string<#1\string>}}
\begin{document}
\title{SWI-Prolog RDF parser}
\author{Jan Wielemaker \\
HCS, \\
University of Amsterdam \\
The Netherlands \\
E-mail: \email{jan@swi-prolog.org}}
\maketitle
\begin{abstract}
\url[RDF]{http://www.w3.org/RDF/} ({\bf R}esource {\bf D}escription {\bf
F}ramework) is a \url[W3C]{http://www.w3.org/} standard for expressing
meta-data about web-resources. It has two representations providing
the same semantics. RDF documents are normally transferred as XML
documents using the RDF-XML syntax. This format is unsuitable for
processing. The parser defined here converts an RDF-XML document into
the \jargon{triple} notation. The library \pllib{rdf_write} creates
an RDF/XML document from a list of triples.
\end{abstract}
\vfill
\tableofcontents
\vfill
\vfill
\newpage
\section{Introduction}
RDF is a promising standard for representing meta-data about documents
on the web as well as exchanging frame-based data (e.g. ontologies). RDF
is often associated with `semantics on the web'. It consists of a formal
data-model defined in terms of \jargon{triples}. In addition, a
\jargon{graph} model is defined for visualisation and an XML application
is defined for exchange.

`Semantics on the web' is also associated with the Prolog programming
language. It is assumed that Prolog is a suitable vehicle to reason with
the data expressed in RDF models. Most of the related web infrastructure
(e.g.\ XML parsers, DOM implementations) is implemented in Java,
Perl, C or C+{+}.

Various routes are available to the Prolog user. Low-level XML parsing
is, due to its nature, best done in C or C+{+}. These languages produce
fast code. As XML/SGML are at the basis of most of the other web-related
formats we will benefit most here. XML and SGML, being very stable
specifications, make fast compiled languages even more attractive.

But what about RDF? RDF-XML is defined in XML and, given a
Prolog term representing the XML document, processing it according to the
RDF syntax is quick and easy in Prolog. The alternative, getting yet
another library and language attached to the system, is less
attractive. In this document we explore the suitability of Prolog for
processing XML documents in general and RDF in particular.
\section{Parsing RDF in Prolog}
We realised an RDF compiler in Prolog on top of the {\bf sgml2pl}
package (providing a name-space sensitive XML parser). The
transformation is realised in two passes.
The first pass rewrites the XML term into a Prolog term conveying the
same information in a more friendly manner. This transformation is
defined in a high-level pattern matching language defined on top of
Prolog with properties similar to DCG (Definite Clause Grammar).
The source of this translation is very close to the BNF notation used by
the \url[specification]{http://www.w3.org/TR/REC-rdf-syntax/}, so
correctness is `obvious'. Below is a part of the definition for RDF
containers. Note that XML elements are represented using a term of the
format:
\begin{quote}
\term{element}{Name, [AttrName = Value...], [Content ...]}
\end{quote}
\begin{code}
memberElt(LI) ::=
        \referencedItem(LI).
memberElt(LI) ::=
        \inlineItem(LI).

referencedItem(LI) ::=
        element(\rdf(li),
                [ \resourceAttr(LI) ],
                []).

inlineItem(literal(LI)) ::=
        element(\rdf(li),
                [ \parseLiteral ],
                LI).
inlineItem(description(description, _, _, Properties)) ::=
        element(\rdf(li),
                [ \parseResource ],
                \propertyElts(Properties)).
inlineItem(LI) ::=
        element(\rdf(li),
                [],
                [\rdf_object(LI)]), !.  % inlined object
inlineItem(literal(LI)) ::=
        element(\rdf(li),
                [],
                [LI]).                  % string value
\end{code}
Expressions in the rule that are prefixed by the \verb$\$ operator act
as invocations of another rule-set. The body-term is converted into
a term where all rule-references are replaced by variables. The
resulting term is matched and translation of the arguments is achieved
by calling the appropriate rule. Below is the Prolog code for the
{\bf referencedItem} rule:
\begin{code}
referencedItem(A, element(B, [C], [])) :-
        rdf(li, B),
        resourceAttr(A, C).
\end{code}
Additional code can be added using a notation close to the Prolog
DCG notation. Here is the rule for a description, producing
properties both using {\bf propAttrs} and {\bf propertyElts}.
\begin{code}
description(description, About, BagID, Properties) ::=
        element(\rdf('Description'),
                \attrs([ \?idAboutAttr(About),
                         \?bagIdAttr(BagID)
                       | \propAttrs(PropAttrs)
                       ]),
                \propertyElts(PropElts)),
        { !, append(PropAttrs, PropElts, Properties)
        }.
\end{code}
\section{Predicates}
The parser is designed to operate in various environments and therefore
provides interfaces at various levels. First we describe the top level
defined in \pllib{rdf}, simply parsing an RDF-XML file into a list of
triples. Please note these are {\em not} asserted into the database
because it is not necessarily the final format the user wishes to reason
with and it is not clear how the user wants to deal with multiple RDF
documents. Some options are using global URIs in one pool, in Prolog
modules or using an additional argument.
\begin{description}
\predicate{load_rdf}{2}{+File, -Triples}
Same as \term{load_rdf}{File, Triples, []}.
\predicate{load_rdf}{3}{+File, -Triples, +Options}
Read the RDF-XML file \arg{File} and return a list of \arg{Triples}.
\arg{Options} defines additional processing options. Currently defined
options are:
\begin{description}
\termitem{base_uri}{BaseURI}
If provided, local identifiers and identifier-references are globalised
using this URI. If omitted or the atom \verb$[]$, local identifiers are
not tagged.
\termitem{blank_nodes}{Mode}
If \arg{Mode} is \const{share} (default), blank-node properties (i.e.\
complex properties without identifier) are reused if they result in
exactly the same triple-set. Two descriptions are shared if their
intermediate description is the same. This means they should produce the
same set of triples in the same order. The value \const{noshare} creates
a new resource for each blank node.
\termitem{expand_foreach}{Boolean}
If \arg{Boolean} is \const{true}, expand \const{rdf:aboutEach} into
a set of triples. By default the parser generates
\term{rdf}{each(Container), Predicate, Subject}.
\termitem{lang}{Lang}
Define the initial language (i.e.\ pretend there is an \const{xml:lang}
declaration in an enclosing element).
\termitem{ignore_lang}{Bool}
If \const{true}, \const{xml:lang} declarations in the document are
ignored. This is mostly for compatibility with older versions of
this library that did not support language identifiers.
\termitem{convert_typed_literal}{:ConvertPred}
If the parser finds a literal with the \const{rdf:datatype}=\arg{Type}
attribute, call \term{ConvertPred}{+Type, +Content, -Literal}.
\arg{Content} is the XML element content as returned by the XML
parser (a list). The predicate must unify \arg{Literal}
with a Prolog representation of \arg{Content} according to
\arg{Type} or throw an exception if the conversion cannot be made.
This option serves two purposes. First of all it can be used
to ignore type declarations for backward compatibility of this
library. Second it can be used to convert typed literals to
a meaningful Prolog representation. E.g.\ convert '42' to the
Prolog integer 42 if the type is \const{xsd:int} or a related
type.
\termitem{namespaces}{-List}
Unify \arg{List} with a list of \arg{NS}=\arg{URL} for each
encountered \const{xmlns}:\arg{NS}=\arg{URL} declaration found
in the source.
\termitem{entity}{+Name, +Value}
Overrule an entity declaration in the file. As it is common practice
to declare namespaces using entities in RDF/XML, this option
allows for changing the namespace without changing the file.
This option may be given multiple times.
\end{description}
The \arg{Triples} list is a list of \term{rdf}{Subject, Predicate,
Object} triples. \arg{Subject} is either a plain resource (an atom),
or one of the terms \term{each}{URI} or \term{prefix}{URI} with the
obvious meaning. \arg{Predicate} is either a plain atom for
explicitly non-qualified names or a term
\mbox{\arg{NameSpace}{\bf :}\arg{Name}}. If \arg{NameSpace} is the
defined RDF name space it is returned as the atom \const{rdf}.
Finally, \arg{Object} is a URI, a \arg{Predicate} or a term of the
format \term{literal}{Value} for literal values. \arg{Value} is
either a plain atom or a parsed XML term (list of atoms and elements).
\end{description}
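
The options described above can be combined in a single call. The
following is a minimal sketch and not part of the library: the predicates
typed_literal/3, integer_type/1 and load_with_integers/2 are hypothetical
names introduced for illustration. It assumes that the parser passes the
datatype to the \const{convert_typed_literal} hook as a full URI atom and
the literal content as a one-element list holding an atom. Returning
\term{type}{Type, Content} for all other literals is a choice made by
this sketch.

\begin{code}
:- use_module(library(rdf)).

% typed_literal(+Type, +Content, -Literal)
% Hypothetical hook for the convert_typed_literal option.
% Integer-typed literals become Prolog integers; all other
% literals are kept as type(Type, Content).
typed_literal(Type, [Text], Int) :-
        integer_type(Type),
        atom(Text),
        atom_number(Text, Int),
        integer(Int),
        !.
typed_literal(Type, Content, type(Type, Content)).

% Assumed datatype URIs; extend as needed.
integer_type('http://www.w3.org/2001/XMLSchema#int').
integer_type('http://www.w3.org/2001/XMLSchema#integer').

% load_with_integers(+File, -Triples)
% Load File, giving each blank node its own resource and
% converting integer literals using the hook above.
load_with_integers(File, Triples) :-
        load_rdf(File, Triples,
                 [ blank_nodes(noshare),
                   convert_typed_literal(typed_literal)
                 ]).
\end{code}

The hook can be tried in isolation, e.g.\
\verb$typed_literal('http://www.w3.org/2001/XMLSchema#int', ['42'], L)$
binds \arg{L} to the integer 42.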
\subsection{RDF Object representation} \label{sec:rdfobject}
The \emph{Object} (3rd) part of a triple can have several different
types. If the object is a resource it is returned as either a plain
atom or a term \mbox{\arg{NameSpace}{\bf :}\arg{Name}}. If it is a
literal it is returned as \term{literal}{Value}, where \arg{Value}
takes one of the formats defined below.
\begin{itemlist}
\item [An atom]
If the literal \arg{Value} is a plain atom, it is a literal value not
subject to a datatype or \const{xml:lang} qualifier.
\item [\term{lang}{LanguageID, Atom}]
If the literal is subject to an \const{xml:lang} qualifier,
\arg{LanguageID} specifies the language and \arg{Atom} the
actual text.
\item [A list]
If the literal is an XML literal as created by
\mbox{parseType="Literal"}, the raw output of the XML parser for the
content of the element is returned. This content is a list of
\term{element}{Name, Attributes, Content} and atoms for CDATA parts as
described with the SWI-Prolog \url[SGML/XML
parser]{http://www.swi-prolog.org/packages/sgml2pl.html}.
\item [\term{type}{Type, StringValue}]
If the literal has an \verb$rdf:datatype=$\arg{Type} attribute, a term of
this format is returned.
\end{itemlist}
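
Code that consumes triples typically needs to distinguish these cases.
The sketch below shows one way to do so by pattern matching on the
object. The predicate object_kind/2 and the names it returns are
hypothetical and introduced here only for illustration.

\begin{code}
% object_kind(+Object, -Kind)
% Classify the third argument of an rdf/3 triple according to
% the representations listed above.
object_kind(literal(lang(_, _)), lang_literal)  :- !.
object_kind(literal(type(_, _)), typed_literal) :- !.
object_kind(literal(Value), xml_literal) :-
        is_list(Value), !.
object_kind(literal(Value), plain_literal) :-
        atom(Value), !.
object_kind(_NameSpace:_Name, resource) :- !.
object_kind(Object, resource) :-
        atom(Object).
\end{code}

The clauses for \term{lang}{LanguageID, Atom} and
\term{type}{Type, StringValue} come first, so these compound values are
recognised before the tests for lists and plain atoms.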
\subsection{Name spaces}
XML name spaces are identified using a URI. Unfortunately various URIs
are in common use to refer to RDF. The \file{rdf_parser.pl} module
therefore defines the namespace as a multifile/1 predicate, that can be
extended by the user. For example, to parse the \url[Netscape
OpenDirectory]{http://www.mozilla.org/rdf/doc/inference.html}
\file{structure.rdf} file, the following declarations are used:
\begin{code}
:- multifile
        rdf_parser:rdf_name_space/1.

rdf_parser:rdf_name_space('http://www.w3.org/TR/RDF/').
rdf_parser:rdf_name_space('http://directory.mozilla.org/rdf').
rdf_parser:rdf_name_space('http://dmoz.org/rdf').
\end{code}
The initial definition of this predicate is given below.
\begin{code}
rdf_name_space('http://www.w3.org/1999/02/22-rdf-syntax-ns#').
rdf_name_space('http://www.w3.org/TR/REC-rdf-syntax').
\end{code}
\subsection{Low-level access}
The above defined load_rdf/[2,3] is not always suitable. For example, it
cannot deal with documents where the RDF statement is embedded in an XML
document. It also cannot deal with really large documents (e.g.\ the
Netscape OpenDirectory project, currently about 90 MBytes), without huge
amounts of memory.
For really large documents, the {\bf sgml2pl} parser can be programmed
to handle the content of a specific element (i.e. \elem{rdf:RDF})
element-by-element. The parsing primitives defined in this section
can be used to process these one-by-one.
\begin{description}
\predicate{xml_to_rdf}{3}{+XML, +BaseURI, -Triples}
Process an XML term produced by load_structure/3 using the
\term{dialect}{xmlns} output option. \arg{XML} is either
a complete \elem{rdf:RDF} element, a list of RDF-objects
(container or description) or a single description or container.
\predicate{process_rdf}{3}{+Input, :OnTriples, +Options}
Exploits the call-back interface of {\bf sgml2pl}, calling
\term{\arg{OnTriples}}{Triples, File:Line} with the list of triples
resulting from a single top level RDF object for each RDF element in the
input as well as the source-location where the description started.
\arg{Input} is either a file name or term \term{stream}{Stream}. When
using a stream all triples are associated to the value of the
\const{base_uri} option. This predicate can be used to process arbitrarily
large RDF files as the file is processed object-by-object. The example
below simply asserts all triples into the database:
\begin{code}
assert_list([], _).
assert_list([H|T], Source) :-
        assert(H),
        assert_list(T, Source).

?- process_rdf('structure.rdf', assert_list, []).
\end{code}
\arg{Options} are described with load_rdf/3. The option
\const{expand_foreach} is not supported as the container may be in a
different description. Additionally, it provides \const{embedded}:
\begin{description}
\termitem{embedded}{Boolean}
The predicate process_rdf/3 processes arbitrary XML documents, only
interpreting the content of \const{rdf:RDF} elements. If this option
is \const{false} (default), it gives a warning on elements that are
not processed. The option \term{embedded}{true} can be used to
process RDF embedded in \jargon{xhtml} without warnings.
\end{description}
\end{description}
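
As a further illustration of the call-back interface, the sketch below
stores each triple together with the source location passed to the
call-back, and uses \term{embedded}{true} to process RDF embedded in
another XML document. The predicates store_triples/2, load_embedded/1
and the dynamic predicate triple/4 are hypothetical names chosen for
this example.

\begin{code}
:- use_module(library(rdf)).
:- dynamic triple/4.

% store_triples(+Triples, +Source)
% Call-back for process_rdf/3. Source is the File:Line term
% describing where the RDF object started.
store_triples(Triples, Source) :-
        forall(member(rdf(S, P, O), Triples),
               assertz(triple(S, P, O, Source))).

% load_embedded(+File)
% Process all rdf:RDF elements found in File without warnings
% about the surrounding XML.
load_embedded(File) :-
        process_rdf(File, store_triples, [embedded(true)]).
\end{code}

Keeping the source as a fourth argument is one way to deal with triples
from multiple documents in a single database.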
\section{Writing RDF graphs}
The library \pllib{rdf_write} provides the inverse of load_rdf/2 using
the predicate rdf_write_xml/2. In most cases the RDF parser is used in
combination with the Semweb package providing \pllib{semweb/rdf_db}.
This library defines rdf_save/2 to save a named RDF graph from the
database to a file. The \pllib{rdf_write} library, in contrast, writes a
list of rdf terms to a stream.
It has been developed for the SeRQL server which computes an RDF graph
that needs to be transmitted in an HTTP request. As we see this as a
typical use-case scenario the library only provides writing to a stream.
\begin{description}
\predicate{rdf_write_xml}{2}{+Stream, +Triples}
Write an RDF/XML document to \arg{Stream} from the list of \arg{Triples}.
\arg{Stream} must use one of the following Prolog stream encodings:
\const{ascii}, \const{iso_latin_1} or \const{utf8}. Characters that
cannot be represented in the encoding are represented as XML entities.
Using ASCII is a good idea for documents that can be represented almost
completely in ASCII. For more international documents using UTF-8 creates
a more compact document that is easier to read.
\begin{code}
rdf_write(File, Triples) :-
        open(File, write, Out, [encoding(utf8)]),
        call_cleanup(rdf_write_xml(Out, Triples),
                     close(Out)).
\end{code}
\end{description}
\section{Testing the RDF translator}
A test-suite and driver program are provided by \file{rdf_test.pl} in
the source directory. To run these tests, load this file into Prolog in
the distribution directory. The test files are in the directory
\file{suite} and the proper output in \file{suite/ok}. Predicates
provided by \file{rdf_test.pl}:
\begin{description}
\predicate{suite}{1}{+N}
Run test \arg{N} using the file \file{suite/tN.rdf} and display the
RDF source, the intermediate Prolog representation and the resulting
triples.
\predicate{passed}{1}{+N}
Process \file{suite/tN.rdf} and store the resulting triples in
\file{suite/ok/tN.pl} for later validation by test/0.
\predicate{test}{0}{}
Run all tests and classify the result.
\end{description}
\appendix
\section{Metrics}
It took three days to write and one to document the Prolog RDF parser.
A significant part of the time was spent understanding the RDF
specification.
The size of the source (including comments) is given in the table
below.
\begin{center}
\begin{tabular}{|rrr|l|l|}
\hline
\bf lines & \bf words & \bf bytes & \bf file & \bf function \\
\hline
109 & 255 & 2663 & rdf.pl & Driver program \\
312 & 649 & 6416 & rdf_parser.pl & 1-st phase parser \\
246 & 752 & 5852 & rdf_triple.pl & 2-nd phase parser \\
126 & 339 & 2596 & rewrite.pl & rule-compiler \\
\hline
793 & 1995 & 17527 & total & \\
\hline
\end{tabular}
\end{center}
We also compared the performance using an RDF-Schema file generated by
\url[Protege-2000]{http://www.smi.stanford.edu/projects/protege/} and
interpreted as RDF. This file contains 162 descriptions in 50 Kbytes,
resulting in 599 triples. Environment: Intel Pentium-II/450 with
384 Mbytes memory running SuSE Linux 6.3.
The parser described here requires 0.15 seconds excluding 0.13 seconds
Prolog startup time to process this file. The \url[Pro
Solutions]{http://www.pro-solutions.com/rdfdemo/} parser (written in
Perl) requires 1.5 seconds excluding 0.25 seconds startup time.
\section{Installation}
\subsection{Unix systems}
Installation on Unix systems uses the commonly found {\em configure},
{\em make} and {\em make install} sequence. SWI-Prolog should be
installed before building this package. If SWI-Prolog is not installed
as \program{pl}, the environment variable \env{PL} must be set to the
name of the SWI-Prolog executable. Installation is now accomplished
using:
\begin{code}
% ./configure
% make
% make install
\end{code}
This installs the Prolog library files in \file{$PLBASE/library}, where
\file{$PLBASE} refers to the SWI-Prolog `home-directory'.
\subsection{Windows}
Run the file \file{setup.pl} by double clicking it. This will install
the required files into the SWI-Prolog directory and update the
library directory.
\end{document}