Revised up to Section 7.

git-svn-id: https://yap.svn.sf.net/svnroot/yap/trunk@1831 b08c6af1-5177-4d33-ba66-4b1c6b8b522a
kostis 2007-03-11 19:28:35 +00:00
parent 7afc0fdd07
commit 9ed8306415


@ -48,6 +48,19 @@
\newcommand{\pta}{\bench{pta}\xspace}
\newcommand{\tea}{\bench{tea}\xspace}
%------------------------------------------------------------------------------
\newcommand{\BreastCancer}{\bench{BreastCancer}\xspace}
\newcommand{\Carcinogenesis}{\bench{Carcinogenesis}\xspace}
\newcommand{\Choline}{\bench{Choline}\xspace}
\newcommand{\GeneExpression}{\bench{GeneExpression}\xspace}
\newcommand{\IEProtein}{\bench{IE-Protein\_Extraction}\xspace}
\newcommand{\Krki}{\bench{Krki}\xspace}
\newcommand{\KrkiII}{\bench{Krki~II}\xspace}
\newcommand{\Mesh}{\bench{Mesh}\xspace}
\newcommand{\Mutagenesis}{\bench{Mutagenesis}\xspace}
\newcommand{\Pyrimidines}{\bench{Pyrimidines}\xspace}
\newcommand{\Susi}{\bench{Susi}\xspace}
\newcommand{\Thermolysin}{\bench{Thermolysin}\xspace}
%------------------------------------------------------------------------------
\newenvironment{SmallProg}{\begin{tt}\begin{small}\begin{tabular}[b]{l}}{\end{tabular}\end{small}\end{tt}}
\newenvironment{ScriptProg}{\begin{tt}\begin{scriptsize}\begin{tabular}[b]{l}}{\end{tabular}\end{scriptsize}\end{tt}}
\newenvironment{FootProg}{\begin{tt}\begin{footnotesize}\begin{tabular}[c]{l}}{\end{tabular}\end{footnotesize}\end{tt}}
@ -120,7 +133,7 @@ For example, first argument indexing is sufficient for many Prolog
applications. However, it is clearly sub-optimal for applications
accessing large databases; for a long time now, the database community
has recognized that good indexing is the basis for fast query
processing.
As logic programming applications grow in size, Prolog systems need to
efficiently access larger and larger data sets and the need for any-
@ -144,7 +157,7 @@ the method needs to cater for code updates during runtime. Where our
schemes radically depart from current practice is that they generate
new byte code during runtime, in effect doing a form of just-in-time
compilation. In our experience these schemes pay off. We have
implemented \JITI in two different Prolog systems (YAP and XXX) and
have obtained non-trivial speedups, ranging from a few percent to
orders of magnitude, across a wide range of applications. Given these
results, we see very little reason for Prolog systems not to
@ -226,14 +239,14 @@ systems currently do not provide the type of indexing that
applications require. Even in systems like Ciao~\cite{Ciao@SCP-05},
which do come with built-in static analysis and more or less force
such a discipline on the programmer, mode information is not used for
multi-argument indexing.
% The grand finale:
The situation is actually worse for certain types of Prolog
applications. For example, consider applications in the area of
inductive logic programming. These applications on the one hand have
high demands for effective indexing since they need to efficiently
access large datasets, and on the other they are unfit for static
analysis since queries are often ad hoc and generated only during
runtime as new hypotheses are formed or refined.
%
@ -241,11 +254,11 @@ Our thesis is that the Prolog abstract machine should be able to adapt
automatically to the runtime requirements of such or, even better, of
all applications by employing increasingly aggressive forms of dynamic
compilation. As a concrete example of what this means in practice, in
this paper we will attack the problem of satisfying the indexing needs
of applications during runtime. Naturally, we will base our technique
on the existing support for indexing that the WAM provides, but we
will extend this support with the technique of \JITI that we describe
in the next sections.
\section{Indexing in the WAM} \label{sec:prelims}
@ -271,7 +284,7 @@ equivalently, \instr{N} is the size of the hash table). In each bucket
of this hash table and also in the bucket for the variable case of
\switchONterm the code performs a sequential backtracking search of
the clauses using a \TryRetryTrust chain of instructions. The \try
instruction sets up a choice point, the \retry instructions (if~any)
update certain fields of this choice point, and the \trust instruction
removes it.
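To make this dispatch concrete, here is a small illustrative predicate
(the predicate name and data are ours, not taken from the paper)
annotated with the behaviour described above:
\begin{small}
\begin{verbatim}
% Toy predicate; name and data are purely illustrative.
has_property(mercury, metal).
has_property(oxygen,  gas).
has_property(oxygen,  reactive).
has_property(helium,  gas).

% ?- has_property(oxygen, P).
%    switch_on_term dispatches on the first argument; the hash lookup
%    for the constant oxygen selects only the two matching clauses,
%    which are then tried via a try/trust pair.
% ?- has_property(X, gas).
%    The first argument is unbound, so the variable case is taken and
%    a try/retry/trust chain over all four clauses is followed.
\end{verbatim}
\end{small}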
@ -529,13 +542,14 @@ heuristically decide that some arguments are more likely than others
to be used in the \code{in} mode. Then we can simply place the
\jitiONconstant instructions for these arguments \emph{before} the
instructions for other arguments. This is possible since all indexing
instructions take the argument register number as an argument; their
order does not matter.
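As an illustrative sketch (the predicate, its data, and the calling
pattern below are invented for this example), consider a predicate
whose callers usually bind the second argument; under the scheme just
described, the demanded index is built on first use:
\begin{small}
\begin{verbatim}
% Invented example: edge/2 facts whose callers usually bind the
% second argument.
edge(n1, n2).
edge(n1, n3).
edge(n2, n3).
edge(n3, n4).

% ?- edge(X, n3).
%    With first-argument indexing alone this call walks a
%    try/retry/trust chain over all clauses.  With a jiti instruction
%    emitted for argument register 2, the first such call triggers
%    construction of a hash index on the second argument, and later
%    ?- edge(_, Key) calls go straight to the matching clauses.
\end{verbatim}
\end{small}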
\subsection{From any argument indexing to multi-argument indexing}
%-----------------------------------------------------------------
The scheme of the previous section gives us only single argument
indexing. However, all the infrastructure we need is already in place.
We can use it to obtain any fixed-order multi-argument \JITI in a
straightforward way.
Note that the compiler knows exactly the set of clauses that need to
@ -650,7 +664,7 @@ requires the following extensions:
indexing will be based. Writing such a code walking procedure is not
hard.\footnote{In many Prolog systems, a procedure with similar
functionality often exists for the disassembler, the debugger, etc.}
\item Indexing on a position that contains unconstrained variables
for some clauses is tricky. The WAM needs to group clauses in this
case and without special treatment creates two choice points for
this argument (one for the variables and one for each group of
@ -658,7 +672,7 @@ requires the following extensions:
by now. Possible solutions to it are described in a 1987 paper by
Carlsson~\cite{FreezeIndexing@ICLP-87} and can be readily adapted to
\JITI. Alternatively, in a simple implementation, we can skip \JITI
for positions with variables in some clauses.
\end{enumerate}
Before describing \JITI more formally, we remark on the following
design decisions whose rationale may not be immediately obvious:
@ -800,26 +814,25 @@ to a \switchSTAR WAM instruction.
%-------------------------------------------------------------------------
\paragraph*{Complexity properties.}
Index construction during runtime does not change the complexity of
query execution. First, note that each demanded index table will be
constructed at most once. Also, a \jitiSTAR instruction will be
encountered only in cases where execution would examine all clauses in
the \TryRetryTrust chain.\footnote{This statement is possibly not
valid in the presence of Prolog cuts.} The construction visits these
clauses \emph{once} and then creates the index table in time linear in
the number of clauses, as one pass over the list of $\langle c, L
\rangle$ pairs suffices. After index construction, execution will
visit a subset of these clauses as the index table will be consulted.
%% Finally, note that the maximum number of \jitiSTAR instructions
%% that will be visited for each query is bounded by the maximum
%% number of index positions (symbols) in the clause heads of the
%% predicate.
Thus, in cases where \JITI is not effective, execution of a query will
at most double due to dynamic index construction. In fact, this worst
case is pessimistic and extremely unlikely in practice. On the other
hand, \JITI can change the complexity of query evaluation from $O(n)$
to $O(1)$ where $n$ is the number of clauses.
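A hypothetical back-of-the-envelope example may help; the predicate
name and sizes below are arbitrary, and whether an index is actually
built for such a call depends on the system and its settings:
\begin{small}
\begin{verbatim}
% Arbitrary predicate and sizes, only to make the cost model concrete.
:- dynamic fact/2.

make_facts(N) :-
    forall(between(1, N, I),
           ( Key is I mod 1000,
             assertz(fact(I, Key)) )).

% ?- make_facts(100000).
% ?- findall(I, fact(I, 500), Is).  % first call with argument 2 bound:
%                                   % one linear pass may build the index
% ?- findall(I, fact(I, 500), Is).  % later calls: a hash lookup plus the
%                                   % ~100 clauses in the bucket, i.e.
%                                   % constant in the total clause count
\end{verbatim}
\end{small}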
\subsection{More implementation choices}
%---------------------------------------
@ -857,9 +870,9 @@ instructions can either become inactive when this limit is reached, or
better yet we can recover the space of some tables. To do so, we can
employ any standard recycling algorithm (e.g., least recently used)
and reclaim the space of index tables that are no longer in use. This is
easy to do by reverting the corresponding \switchSTAR instructions
back to \jitiSTAR instructions. If the indices are demanded again at a
time when memory is available, they can simply be regenerated.
\section{Demand-Driven Indexing of Dynamic Predicates} \label{sec:dynamic}
@ -893,9 +906,9 @@ arguments. As optimizations, we can avoid indexing for predicates with
only one clause (these are often used to simulate global variables)
and we can exclude arguments where some clause has a variable.
Under logical update semantics, calls to dynamic predicates execute in a
``snapshot'' of the corresponding predicate. In other words, each call
sees the clauses that existed at the time when the call was made, even if
some of the clauses were later deleted or new clauses were asserted.
If several calls are alive in the stack, several snapshots will be
alive at the same time. The standard solution to this problem is to
@ -903,8 +916,8 @@ use time stamps to tell which clauses are \emph{live} for which calls.
%
This solution complicates freeing index tables because (1) an index
table holds references to clauses, and (2) the table may be in use,
that is, it may be accessible from the execution stacks. An index
table is thus killed in several steps:
\begin{enumerate}
\item Detach the index table from the indexing tree.
\item Recursively \emph{kill} every child of the current table:
@ -920,6 +933,7 @@ is killed in several steps:
%% the \emph{itemset-node}, so the emulator reads all the instruction's
%% arguments before executing the instruction.
\section{Implementation in XXX and in YAP} \label{sec:impl}
%==========================================================
The implementation of \JITI in XXX follows a variant of the scheme
@ -927,7 +941,7 @@ presented in Sect.~\ref{sec:static}. The compiler uses heuristics to
determine the best argument to index on (i.e., this argument is not
necessarily the first) and employs \switchSTAR instructions for this
task. It also statically generates \jitiONconstant instructions for
other arguments that are good candidates for \JITI.
Currently, an argument is considered a good candidate if it has only
constants or only structure symbols in all clauses. Thus, XXX uses
only \jitiONconstant and \jitiONstructure instructions, never a
@ -935,11 +949,11 @@ only \jitiONconstant and \jitiONstructure instructions, never a
symbols.\footnote{Instead, it prompts its user to request unification
factoring for predicates that look likely to benefit from indexing
inside compound terms. The user can then use the appropriate compiler
directive for these predicates.} For dynamic predicates, \JITI is
employed only if they consist of Datalog facts; if a clause that is
not a Datalog fact is asserted, all dynamically created index tables
for the predicate are simply killed and the \jitiONconstant
instruction becomes a \instr{noop}. All this is done automatically,
but the user can disable \JITI in compiled code using an appropriate
compiler option.
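The following sketch illustrates this policy for dynamic predicates;
the predicate names are invented, and the comments restate the
behaviour described above rather than showing output of either system:
\begin{small}
\begin{verbatim}
% Invented dynamic predicate consisting of Datalog facts.
:- dynamic visited/2.

% While only Datalog facts (constant arguments) are asserted, a call
% such as ?- visited(_, 3) can demand an index on the second argument.
add_facts :-
    assertz(visited(node1, 3)),
    assertz(visited(node2, 7)).

% Asserting a clause that is not a Datalog fact -- here a rule --
% causes the index tables of visited/2 to be dropped and the
% corresponding jiti instruction to become a no-op.
add_rule :-
    assertz((visited(X, Cost) :- shortcut(X, Cost))).
\end{verbatim}
\end{small}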
@ -957,7 +971,8 @@ very much the same algorithm as static indexing: the key idea is that
most nodes in the index tree must be allocated separately so that they
can grow or contract independently. YAP can index arguments where some
clauses have unconstrained variables, but only for static predicates,
as in dynamic code this would complicate support for logical update
semantics.
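To illustrate what such an argument looks like, consider the following
static predicate (ours, purely for illustration):
\begin{small}
\begin{verbatim}
% Illustrative static predicate: the first argument of the last
% clause is an unconstrained variable.
price(coffee, 2).
price(tea,    1).
price(_Item,  0).      % default price for anything else

% ?- price(coffee, P).
%    An index on argument 1 must include the variable clause in every
%    bucket, so this call still sees both the coffee clause and the
%    default clause; YAP builds such indices for static predicates only.
\end{verbatim}
\end{small}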
YAP uses the term JITI (Just-In-Time Indexing) to refer to \JITI. In
the next section we will take the liberty to use this term as a
@ -1099,63 +1114,62 @@ this benchmark.
\end{verbatim}
\end{small}
%% Our experience with the indexing algorithm described here shows a
%% significant performance improvement over the previous indexing code in
%% our system. Quite often, this has allowed us to tackle applications
%% which previously would not have been feasible.
\subsection{Performance of \JITI on ILP applications} \label{sec:perf:ILP}
%-------------------------------------------------------------------------
The need for \JITI was originally motivated by ILP applications.
Table~\ref{tab:ilp:time} shows JITI performance on some learning tasks
using the ALEPH system~\cite{ALEPH}. The dataset \Krki tries to
learn rules from a small database of chess end-games;
\GeneExpression learns rules for yeast gene activity given a
database of genes, their interactions, and micro-array gene expression
data; \BreastCancer processes real-life patient reports towards
predicting whether an abnormality may be malignant;
\IEProtein performs information extraction from
paper abstracts to search for proteins; \Susi learns from shopping
patterns; and \Mesh learns rules for finite-element mesh
design. The datasets \Carcinogenesis, \Choline,
\Mutagenesis, \Pyrimidines, and \Thermolysin try to
predict chemical properties of compounds. The first three
datasets store properties of interest as tables, but
\Thermolysin learns from the 3D-structure of a molecule's
conformations. Several of these datasets are standard across the Machine
Learning literature. \GeneExpression~\cite{} and
\BreastCancer~\cite{} were partly developed by some of the
paper's authors. Most datasets perform simple queries in an
extensional database. The exception is \Mutagenesis, where
several predicates are defined intensionally, requiring extensive
computation.
%------------------------------------------------------------------------------
\begin{table}[t]
\centering
\caption{Machine Learning (ILP) datasets: running times (in seconds)
for standard indexing (no indexing on dynamic predicates) versus the
\JITI implementation}
\label{tab:ilp:time}
\setlength{\tabcolsep}{3pt}
\begin{tabular}{|l||r|r|r|} \hline %\cline{1-3}
& \multicolumn{3}{|c|}{Time (in secs)} \\
\cline{2-4}
Benchmark & 1st & JITI & {\bf ratio} \\
\hline
\BreastCancer   &     1,450 &      88 &  16 \\
\Carcinogenesis &    17,705 &     192 &  92 \\
\Choline        &    14,766 &   1,397 &  11 \\
\GeneExpression &   193,283 &   7,483 &  26 \\
\IEProtein      & 1,677,146 &   2,909 & 577 \\
\Krki           &       0.3 &     0.3 &   1 \\
\KrkiII         &       1.3 &     1.3 &   1 \\
\Mesh           &         4 &       3 & 1.3 \\
\Mutagenesis    &    51,775 &  27,746 & 1.9 \\
\Pyrimidines    &   487,545 & 253,235 & 1.9 \\
\Susi           &   105,091 &     307 & 342 \\
\Thermolysin    &    50,279 &   5,213 &  10 \\
\hline
\end{tabular}
\end{table}
@ -1163,30 +1177,30 @@ computation.
We compare times for 10 runs of the saturation/refinement cycle of the
ILP system. Table~\ref{tab:ilp:time} shows time results. The
\Krki datasets have small search spaces and small databases, so
they achieve the same performance under both versions:
there is no slowdown. The \Mesh, \Mutagenesis, and
\Pyrimidines applications do not benefit much from indexing in
the database, but they do benefit from indexing in the dynamic
representation of the search space, as their running times halve.
The \BreastCancer and \GeneExpression applications use data in
1NF (that is, unstructured data). The benefit here is mostly from
multiple-argument indexing. \BreastCancer is particularly
interesting. It consists of 40 binary relations with 65k elements
each, where the first argument is the key, like in \sgCyl. We know
that most calls have the first argument bound, hence indexing was not
expected to matter very much. Instead, the results show \JITI running
time to improve by an order of magnitude. As in \sgCyl, this
suggests that even a small percentage of badly indexed calls can end
up dominating runtime.
\IEProtein and \Thermolysin are example
applications that manipulate structured data.
\IEProtein is the largest dataset we consider,
and indexing is absolutely critical: it is not possible to run the
application in reasonable time with first argument
indexing. \Thermolysin is smaller and performs some
computation per query: even so, indexing improves performance by an
order of magnitude.
@ -1201,79 +1215,81 @@ order of magnitude.
Benchmark & \textbf{Clause} & \textbf{Index} & \textbf{Clause} & \textbf{Index} \\
% \textbf{Benchmarks} & & Total & T & W & S & & Total & T & C & W & S \\
\hline
\BreastCancer
 & 60,940 & 46,887
% & 46242 & 3126 & 125
 & 630 & 14
% &42 & 18& 57 &6
\\
\Carcinogenesis
 & 1801 & 2678
%&1225 & 587 & 865
 & 13,512 & 942
%& 291 & 91 & 457 & 102
\\
\Choline
 & 666 & 174
% &67 & 48 & 58
 & 3172 & 174
% & 76 & 4 & 48 & 45
\\
\GeneExpression
 & 46,726 & 22,629
% &6780 & 6473 & 9375
 & 116,463 & 9015
%& 2703 & 932 & 3910 & 1469
\\
\IEProtein
 & 146,033 & 129,333
%&39279 & 24322 & 65732
 & 53,423 & 1531
%& 467 & 108 & 868 & 86
\\
\Krki
 & 678 & 117
%&52 & 24 & 40
 & 2047 & 24
%& 10 & 2 & 10 & 1
\\
\KrkiII
 & 1866 & 715
%&180 & 233 & 301
 & 2055 & 26
%& 11 & 2 & 11 & 1
\\
\Mesh
 & 802 & 161
%&49 & 18 & 93
 & 2149 & 109
%& 46 & 4 & 35 & 22
\\
\Mutagenesis
 & 1412 & 1848
%&1045 & 291 & 510
 & 4302 & 595
%& 156 & 114 & 264 & 61
\\
\Pyrimidines
 & 774 & 218
%&76 & 63 & 77
 & 25,840 & 12,291
%& 4847 & 43 & 3510 & 3888
\\
\Susi
 & 5007 & 2509
%&855 & 578 & 1076
 & 4497 & 759
%& 324 & 58 & 256 & 120
\\
\Thermolysin
 & 2317 & 929
%&429 & 184 & 315
 & 116,129 & 7064
%& 3295 & 1438 & 2160 & 170
\\
\hline
\end{tabular}
\end{table*}
@ -1287,12 +1303,12 @@ usage on \emph{static} predicates. Static data-base sizes range from
146MB (\IEProtein) to less than a MB
(\Choline, \Krki, \Mesh). Indexing code can take
more space than the original code, as in \Mutagenesis, or almost as
much, e.g., \IEProtein. In most cases the YAP \JITI
adds at least a third and often a half to the original data-base. A
more detailed analysis shows the source of overhead to be very
different from dataset to dataset. In \IEProtein
the problem is that hash tables are very large. Hash tables are also
where most space is spent in \Susi. In \BreastCancer
hash tables are actually small, so most space is spent in
\TryRetryTrust chains. \Mutagenesis is similar: even though YAP
spends a large effort in indexing it still generates long