476 lines
17 KiB
Plaintext
476 lines
17 KiB
Plaintext
|
\documentclass[11pt]{article}
|
||
|
\usepackage{pl}
|
||
|
\usepackage{html}
|
||
|
\usepackage{times}
|
||
|
|
||
|
\onefile
|
||
|
\htmloutput{html} % Output directory
|
||
|
\htmlmainfile{index} % Main document file
|
||
|
\bodycolor{white} % Page colour
|
||
|
|
||
|
\newcommand{\elem}[1]{{\tt\string<#1\string>}}
|
||
|
|
||
|
\begin{document}
|
||
|
|
||
|
\title{SWI-Prolog RDF parser}
|
||
|
\author{Jan Wielemaker \\
|
||
|
HCS, \\
|
||
|
University of Amsterdam \\
|
||
|
The Netherlands \\
|
||
|
E-mail: \email{jan@swi-prolog.org}}
|
||
|
|
||
|
\maketitle
|
||
|
|
||
|
\begin{abstract}
|
||
|
\url[RDF]{http://www.w3.org/RDF/} ({\bf R}esource {\bf D}escription {\bf
|
||
|
F}ormat) is a \url[W3C]{http://www.w3.org/} standard for expressing
|
||
|
meta-data about web-resources. It has two representations providing
|
||
|
the same semantics. RDF documents are normally transferred as XML
|
||
|
documents using the RDF-XML syntax. This format is unsuitable for
|
||
|
processing. The parser defined here converts an RDF-XML document into
|
||
|
the \jargon{triple} notation. The library \pllib{rdf_write} creates
|
||
|
an RDF/XML document from a list of triples.
|
||
|
\end{abstract}
|
||
|
|
||
|
\vfill
|
||
|
|
||
|
\tableofcontents
|
||
|
|
||
|
\vfill
|
||
|
\vfill
|
||
|
|
||
|
\newpage
|
||
|
|
||
|
\section{Introduction}
|
||
|
|
||
|
RDF is a promising standard for representing meta-data about documents
|
||
|
on the web as well as exchanging frame-based data (e.g. ontologies). RDF
|
||
|
is often associated with `semantics on the web'. It consists of a formal
|
||
|
data-model defined in terms of \jargon{triples}. In addition, a
|
||
|
\jargon{graph} model is defined for visualisation and an XML application
|
||
|
is defined for exchange.
|
||
|
|
||
|
`Semantics on the web' is also associated with the Prolog programming
|
||
|
language. It is assumed that Prolog is a suitable vehicle to reason with
|
||
|
the data expressed in RDF models. Most of the related web-infra
|
||
|
structure (e.g. XML parsers, DOM implementations) are defined in Java,
|
||
|
Perl, C or C+{+}.
|
||
|
|
||
|
Various routes are available to the Prolog user. Low-level XML parsing
|
||
|
is due to its nature best done in C or C+{+}. These languages produce
|
||
|
fast code. As XML/SGML are at the basis of most of the other web-related
|
||
|
formats we will benefit most here. XML and SGML, being very stable
|
||
|
specifications, make fast compiled languages even more attractive.
|
||
|
|
||
|
But what about RDF? RDF-XML is defined in XML, and provided with a
|
||
|
Prolog term representing the XML document processing it according to the
|
||
|
RDF syntax is quick and easy in Prolog. The alternative, getting yet
|
||
|
another library and language attached to the system, is getting less
|
||
|
attractive. In this document we explore the suitability of Prolog for
|
||
|
processing XML documents in general and into RDF in particular.
|
||
|
|
||
|
|
||
|
\section{Parsing RDF in Prolog}
|
||
|
|
||
|
We realised an RDF compiler in Prolog on top of the {\bf sgml2pl}
|
||
|
package (providing a name-space sensitive XML parser). The
|
||
|
transformation is realised in two passes.
|
||
|
|
||
|
The first pass rewrites the XML term into a Prolog term conveying the
|
||
|
same information in a more friendly manner. This transformation is
|
||
|
defined in a high-level pattern matching language defined on top of
|
||
|
Prolog with properties similar to DCG (Definite Clause Grammar).
|
||
|
|
||
|
The source of this translation is very close to the BNF notation used by
|
||
|
the \url[specification]{http://www.w3.org/TR/REC-rdf-syntax/}, so
|
||
|
correctness is `obvious'. Below is a part of the definition for RDF
|
||
|
containers. Note that XML elements are represented using a term of the
|
||
|
format:
|
||
|
|
||
|
\begin{quote}
|
||
|
\term{element}{Name, [AttrName = Value...], [Content ...]}
|
||
|
\end{quote}
|
||
|
|
||
|
\begin{code}
|
||
|
memberElt(LI) ::=
|
||
|
\referencedItem(LI).
|
||
|
memberElt(LI) ::=
|
||
|
\inlineItem(LI).
|
||
|
|
||
|
referencedItem(LI) ::=
|
||
|
element(\rdf(li),
|
||
|
[ \resourceAttr(LI) ],
|
||
|
[]).
|
||
|
|
||
|
inlineItem(literal(LI)) ::=
|
||
|
element(\rdf(li),
|
||
|
[ \parseLiteral ],
|
||
|
LI).
|
||
|
inlineItem(description(description, _, _, Properties)) ::=
|
||
|
element(\rdf(li),
|
||
|
[ \parseResource ],
|
||
|
\propertyElts(Properties)).
|
||
|
inlineItem(LI) ::=
|
||
|
element(\rdf(li),
|
||
|
[],
|
||
|
[\rdf_object(LI)]), !. % inlined object
|
||
|
inlineItem(literal(LI)) ::=
|
||
|
element(\rdf(li),
|
||
|
[],
|
||
|
[LI]). % string value
|
||
|
\end{code}
|
||
|
|
||
|
Expression in the rule that are prefixed by the \verb$\$ operator acts
|
||
|
as invocation of another rule-set. The body-term is converted into
|
||
|
a term where all rule-references are replaced by variables. The
|
||
|
resulting term is matched and translation of the arguments is achieved
|
||
|
by calling the appropriate rule. Below is the Prolog code for the
|
||
|
{\bf referencedItem} rule:
|
||
|
|
||
|
\begin{code}
|
||
|
referencedItem(A, element(B, [C], [])) :-
|
||
|
rdf(li, B),
|
||
|
resourceAttr(A, C).
|
||
|
\end{code}
|
||
|
|
||
|
Additional code can be added using a notation close to the Prolog
|
||
|
DCG notation. Here is the rule for a description, producing
|
||
|
properties both using {\bf propAttrs} and {\bf propertyElts}.
|
||
|
|
||
|
\begin{code}
|
||
|
description(description, About, BagID, Properties) ::=
|
||
|
element(\rdf('Description'),
|
||
|
\attrs([ \?idAboutAttr(About),
|
||
|
\?bagIdAttr(BagID)
|
||
|
| \propAttrs(PropAttrs)
|
||
|
]),
|
||
|
\propertyElts(PropElts)),
|
||
|
{ !, append(PropAttrs, PropElts, Properties)
|
||
|
}.
|
||
|
\end{code}
|
||
|
|
||
|
|
||
|
\section{Predicates}
|
||
|
|
||
|
The parser is designed to operate in various environments and therefore
|
||
|
provides interfaces at various levels. First we describe the top level
|
||
|
defined in \pllib{rdf}, simply parsing a RDF-XML file into a list of
|
||
|
triples. Please note these are {\em not} asserted into the database
|
||
|
because it is not necessarily the final format the user wishes to reason
|
||
|
with and it is not clean how the user wants to deal with multiple RDF
|
||
|
documents. Some options are using global URI's in one pool, in Prolog
|
||
|
modules or using an additional argument.
|
||
|
|
||
|
\begin{description}
|
||
|
\predicate{load_rdf}{2}{+File, -Triples}
|
||
|
Same as \term{load_rdf}{File, Triples, []}.
|
||
|
|
||
|
\predicate{load_rdf}{3}{+File, -Triples, +Options}
|
||
|
Read the RDF-XML file \arg{File} and return a list of \arg{Triples}.
|
||
|
\arg{Options} defines additional processing options. Currently defined
|
||
|
options are:
|
||
|
|
||
|
\begin{description}
|
||
|
\termitem{base_uri}{BaseURI}
|
||
|
If provided local identifiers and identifier-references are globalised
|
||
|
using this URI. If omited or the atom \verb$[]$, local identifiers are
|
||
|
not tagged.
|
||
|
|
||
|
\termitem{blank_nodes}{Mode}
|
||
|
If \arg{Mode} is \const{share} (default), blank-node properties (i.e.\
|
||
|
complex properties without identifier) are reused if they result in
|
||
|
exactly the same triple-set. Two descriptions are shared if their
|
||
|
intermediate description is the same. This means they should produce the
|
||
|
same set of triples in the same order. The value \const{noshare} creates
|
||
|
a new resource for each blank node.
|
||
|
|
||
|
\termitem{expand_foreach}{Boolean}
|
||
|
If \arg{Boolean} is \const{true}, expand \const{rdf:aboutEach} into
|
||
|
a set of triples. By default the parser generates
|
||
|
\term{rdf}{each(Container), Predicate, Subject}.
|
||
|
|
||
|
\termitem{lang}{Lang}
|
||
|
Define the initial language (i.e.\ pretend there is an \const{xml:lang}
|
||
|
declaration in an enclosing element).
|
||
|
|
||
|
\termitem{ignore_lang}{Bool}
|
||
|
If \const{true}, \const{xml:lang} declarations in the document are
|
||
|
ignored. This is mostly for compatibility with older versions of
|
||
|
this library that did not support language identifiers.
|
||
|
|
||
|
\termitem{convert_typed_literal}{:ConvertPred}
|
||
|
If the parser finds a literal with the \const{rdf:datatype}=\arg{Type}
|
||
|
attribute, call \term{ConvertPred}{+Type, +Content, -Literal}.
|
||
|
\arg{Content} is the XML element contentas returned by the XML
|
||
|
parser (a list). The predicate must unify \arg{Literal}
|
||
|
with a Prolog representation of \arg{Content} according to
|
||
|
\arg{Type} or throw an exception if the conversion cannot be made.
|
||
|
|
||
|
This option servers two purposes. First of all it can be used
|
||
|
to ignore type declarations for backward compatibility of this
|
||
|
library. Second it can be used to convert typed literals to
|
||
|
a meaningful Prolog representation. E.g.\ convert '42' to the
|
||
|
Prolog integer 42 if the type is \const{xsd:int} or a related
|
||
|
type.
|
||
|
|
||
|
\termitem{namespaces}{-List}
|
||
|
Unify \arg{List} with a list of \arg{NS}=\arg{URL} for each
|
||
|
encountered \const{xmlns}:\arg{NS}=\arg{URL} declaration found
|
||
|
in the source.
|
||
|
|
||
|
\termitem{entity}{+Name, +Value}
|
||
|
Overrule entity declaration in file. As it is common practice
|
||
|
to declare namespaces using entities in RDF/XML, this option
|
||
|
allows for changing the namespace without changing the file.
|
||
|
Multiple of these options are allowed.
|
||
|
\end{description}
|
||
|
|
||
|
The \arg{Triples} list is a list of \term{rdf}{Subject, Predicate,
|
||
|
Object} triples. \arg{Subject} is either a plain resource (an atom),
|
||
|
or one of the terms \term{each}{URI} or \term{prefix}{URI} with the
|
||
|
obvious meaning. \arg{Predicate} is either a plain atom for
|
||
|
explicitely non-qualified names or a term
|
||
|
\mbox{\arg{NameSpace}{\bf :}\arg{Name}}. If \arg{NameSpace} is the
|
||
|
defined RDF name space it is returned as the atom \const{rdf}.
|
||
|
Finally, \arg{Object} is a URI, a \arg{Predicate} or a term of the
|
||
|
format \term{literal}{Value} for literal values. \arg{Value} is
|
||
|
either a plain atom or a parsed XML term (list of atoms and elements).
|
||
|
\end{description}
|
||
|
|
||
|
|
||
|
\subsection{RDF Object representation} \label{sec:rdfobject}
|
||
|
|
||
|
The \emph{Object} (3rd) part of a triple can have several different
|
||
|
types. If the object is a resource it is returned as either a plain
|
||
|
atom or a term \mbox{\arg{NameSpace}{\bf :}\arg{Name}}. If it is a
|
||
|
literal it is returned as \term{literal}{Value}, where \arg{Value}
|
||
|
takes one of the formats defined below.
|
||
|
|
||
|
\begin{itemlist}
|
||
|
\item [An atom]
|
||
|
If the literal \arg{Value} is a plain atom is a literal value not
|
||
|
subject to a datatype or \const{xml:lang} qualifier.
|
||
|
|
||
|
\item [\term{lang}{LanguageID, Atom}]
|
||
|
If the literal is subject to an \const{xml:lang} qualifier
|
||
|
\arg{LanguageID} specifies the language and \arg{Atom} the
|
||
|
actual text.
|
||
|
|
||
|
\item [A list]
|
||
|
If the literal is an XML literal as created by
|
||
|
\mbox{parseType="Literal"}, the raw output of the XML parser for the
|
||
|
content of the element is returned. This content is a list of
|
||
|
\term{element}{Name, Attributes, Content} and atoms for CDATA parts as
|
||
|
described with the SWI-Prolog \url[SGML/XML
|
||
|
parser]{http://www.swi-prolog.org/packages/sgml2pl.html}
|
||
|
|
||
|
\item [\term{type}{Type, StringValue}]
|
||
|
If the literal has an \verb$rdf:datatype=$\arg{Type} a term of this
|
||
|
format is returned.
|
||
|
\end{itemlist}
|
||
|
|
||
|
|
||
|
\subsection{Name spaces}
|
||
|
|
||
|
XML name spaces are identified using a URI. Unfortunately various URI's
|
||
|
are in common use to refer to RDF. The \file{rdf_parser.pl} module
|
||
|
therefore defines the namespace as a multifile/1 predicate, that can be
|
||
|
extended by the user. For example, to parse the \url[Netscape
|
||
|
OpenDirectory]{http://www.mozilla.org/rdf/doc/inference.html}
|
||
|
\file{structure.rdf} file, the following declarations are used:
|
||
|
|
||
|
\begin{code}
|
||
|
:- multifile
|
||
|
rdf_parser:rdf_name_space/1.
|
||
|
|
||
|
rdf_parser:rdf_name_space('http://www.w3.org/TR/RDF/').
|
||
|
rdf_parser:rdf_name_space('http://directory.mozilla.org/rdf').
|
||
|
rdf_parser:rdf_name_space('http://dmoz.org/rdf').
|
||
|
\end{code}
|
||
|
|
||
|
The initial definition of this predicate is given below.
|
||
|
|
||
|
\begin{code}
|
||
|
rdf_name_space('http://www.w3.org/1999/02/22-rdf-syntax-ns#').
|
||
|
rdf_name_space('http://www.w3.org/TR/REC-rdf-syntax').
|
||
|
\end{code}
|
||
|
|
||
|
|
||
|
\subsection{Low-level access}
|
||
|
|
||
|
The above defined load_rdf/[2,3] is not always suitable. For example, it
|
||
|
cannot deal with documents where the RDF statement is embedded in an XML
|
||
|
document. It also cannot deal with really large documents (e.g.\ the
|
||
|
Netscape OpenDirectory project, currently about 90 MBytes), without huge
|
||
|
amounts of memory.
|
||
|
|
||
|
For really large documents, the {\bf sgml2pl} parser can be programmed
|
||
|
to handle the content of a specific element (i.e. \elem{rdf:RDF})
|
||
|
element-by-element. The parsing primitives defined in this section
|
||
|
can be used to process these one-by-one.
|
||
|
|
||
|
\begin{description}
|
||
|
\predicate{xml_to_rdf}{3}{+XML, +BaseURI, -Triples}
|
||
|
Process an XML term produced by load_structure/3 using the
|
||
|
\term{dialect}{xmlns} output option. \arg{XML} is either
|
||
|
a complete \elem{rdf:RDF} element, a list of RDF-objects
|
||
|
(container or description) or a single description of container.
|
||
|
|
||
|
\predicate{process_rdf}{3}{+Input, :OnTriples, +Options}
|
||
|
|
||
|
Exploits the call-back interface of {\bf sgml2pl}, calling
|
||
|
\term{\arg{OnTriples}}{Triples, File:Line} with the list of triples
|
||
|
resulting from a single top level RDF object for each RDF element in the
|
||
|
input as well as the source-location where the description started.
|
||
|
\arg{Input} is either a file name or term \term{stream}{Stream}. When
|
||
|
using a stream all triples are associated to the value of the
|
||
|
\const{base_uri} option. This predicate can be used to process arbitrary
|
||
|
large RDF files as the file is processed object-by-object. The example
|
||
|
below simply asserts all triples into the database:
|
||
|
|
||
|
\begin{code}
|
||
|
assert_list([], _).
|
||
|
assert_list([H|T], Source) :-
|
||
|
assert(H),
|
||
|
assert_list(T, Source).
|
||
|
|
||
|
?- process_rdf('structure,rdf', assert_list, []).
|
||
|
\end{code}
|
||
|
|
||
|
\arg{Options} are described with load_rdf/3. The option
|
||
|
\const{expand_foreach} is not supported as the container may be in a
|
||
|
different description. Additional it provides \const{embedded}:
|
||
|
|
||
|
\begin{description}
|
||
|
\termitem{embedded}{Boolean}
|
||
|
The predicate process_rdf/3 processes arbitrary XML documents, only
|
||
|
interpreting the content of \const{rdf:RDF} elements. If this option
|
||
|
is \const{false} (default), it gives a warning on elements that are
|
||
|
not processed. The option \term{embedded}{true} can be used to
|
||
|
process RDF embedded in \jargon{xhtml} without warnings.
|
||
|
\end{description}
|
||
|
|
||
|
\end{description}
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
\section{Writing RDF graphs}
|
||
|
|
||
|
The library \pllib{rdf_write} provides the inverse of load_rdf/2 using
|
||
|
the predicate rdf_write_xml/2. In most cases the RDF parser is used in
|
||
|
combination with the Semweb package providing \pllib{semweb/rdf_db}.
|
||
|
This library defines rdf_save/2 to save a named RDF graph from the
|
||
|
database to a file. This library writes a list of rdf terms to a stream.
|
||
|
It has been developed for the SeRQL server which computes an RDF graph
|
||
|
that needs to be transmitted in an HTTP request. As we see this as a
|
||
|
typical use-case scenario the library only provides writing to a stream.
|
||
|
|
||
|
\begin{description}
|
||
|
\predicate{rdf_write_xml}{2}{+Stream, +Triples}
|
||
|
Write an RDF/XML document to \arg{Stream} from the list of \arg{Triples}.
|
||
|
\arg{Stream} must use one of the following Prolog stream encodings:
|
||
|
\const{ascii}, \const{iso_latin_1} or \const{utf8}. Characters that
|
||
|
cannot be represented in the encoding are represented as XML entities.
|
||
|
Using ASCII is a good idea for documents that can be represented almost
|
||
|
completely in ASCII. For more international documents using UTF-8 creates
|
||
|
a more compact document that is easier to read.
|
||
|
|
||
|
\begin{code}
|
||
|
rdf_write(File, Triples) :-
|
||
|
open(File, write, Out, [encoding(utf8)]),
|
||
|
call_cleanup(rdf_write_xml(Out, Triples),
|
||
|
close(Out)).
|
||
|
\end{code}
|
||
|
\end{description}
|
||
|
|
||
|
|
||
|
\section{Testing the RDF translator}
|
||
|
|
||
|
A test-suite and driver program are provided by \file{rdf_test.pl} in
|
||
|
the source directory. To run these tests, load this file into Prolog in
|
||
|
the distribution directory. The test files are in the directory
|
||
|
\file{suite} and the proper output in \file{suite/ok}. Predicates
|
||
|
provided by \file{rdf_test.pl}:
|
||
|
|
||
|
\begin{description}
|
||
|
\predicate{suite}{1}{+N}
|
||
|
Run test \arg{N} using the file \file{suite/tN.rdf} and display the
|
||
|
RDF source, the intermediate Prolog representation and the resulting
|
||
|
triples.
|
||
|
\predicate{passed}{1}{+N}
|
||
|
Process \file{suite/tN.rdf} and store the resulting triples in
|
||
|
\file{suite/ok/tN.pl} for later validation by test/0.
|
||
|
\predicate{test}{0}{}
|
||
|
Run all tests and classify the result.
|
||
|
\end{description}
|
||
|
|
||
|
\appendix
|
||
|
|
||
|
\section{Metrics}
|
||
|
|
||
|
It took three days to write and one to document the Prolog RDF parser.
|
||
|
A significant part of the time was spent understanding the RDF
|
||
|
specification.
|
||
|
|
||
|
The size of the source (including comments) is given in the table
|
||
|
below.
|
||
|
|
||
|
\begin{center}
|
||
|
\begin{tabular}{|rrr|l|l|}
|
||
|
\hline
|
||
|
\bf lines & \bf words & \bf bytes & \bf file & \bf function \\
|
||
|
\hline
|
||
|
109 & 255 & 2663 & rdf.pl & Driver program \\
|
||
|
312 & 649 & 6416 & rdf_parser.pl & 1-st phase parser \\
|
||
|
246 & 752 & 5852 & rdf_triple.pl & 2-nd phase parser \\
|
||
|
126 & 339 & 2596 & rewrite.pl & rule-compiler \\
|
||
|
\hline
|
||
|
793 & 1995 & 17527 & total & \\
|
||
|
\hline
|
||
|
\end{tabular}
|
||
|
\end{center}
|
||
|
|
||
|
|
||
|
We also compared the performance using an RDF-Schema file generated by
|
||
|
\url[Protege-2000]{http://www.smi.stanford.edu/projects/protege/} and
|
||
|
interpreted as RDF. This file contains 162 descriptions in 50 Kbytes,
|
||
|
resulting in 599 triples. Environment: Intel Pentium-II/450 with
|
||
|
384 Mbytes memory running SuSE Linux 6.3.
|
||
|
|
||
|
The parser described here requires 0.15 seconds excluding 0.13 seconds
|
||
|
Prolog startup time to process this file. The \url[Pro
|
||
|
Solutions]{http://www.pro-solutions.com/rdfdemo/} parser (written in
|
||
|
Perl) requires 1.5 seconds exluding 0.25 seconds startup time.
|
||
|
|
||
|
|
||
|
\section{Installation}
|
||
|
|
||
|
\subsection{Unix systems}
|
||
|
|
||
|
Installation on Unix system uses the commonly found {\em configure},
|
||
|
{\em make} and {\em make install} sequence. SWI-Prolog should be
|
||
|
installed before building this package. If SWI-Prolog is not installed
|
||
|
as \program{pl}, the environment variable \env{PL} must be set to the
|
||
|
name of the SWI-Prolog executable. Installation is now accomplished
|
||
|
using:
|
||
|
|
||
|
\begin{code}
|
||
|
% ./configure
|
||
|
% make
|
||
|
% make install
|
||
|
\end{code}
|
||
|
|
||
|
This installs the Prolog library files in \file{$PLBASE/library}, where
|
||
|
\file{$PLBASE} refers to the SWI-Prolog `home-directory'.
|
||
|
|
||
|
\subsection{Windows}
|
||
|
|
||
|
Run the file \file{setup.pl} by double clicking it. This will install
|
||
|
the required files into the SWI-Prolog directory and update the
|
||
|
library directory.
|
||
|
|
||
|
\end{document}
|
||
|
|
||
|
|