change ILP text

git-svn-id: https://yap.svn.sf.net/svnroot/yap/trunk@1825 b08c6af1-5177-4d33-ba66-4b1c6b8b522a
2007-03-10 19:39:52 +00:00 · 570bce634d
commit 570bce634d
parent 63a4ae736d
1 changed files with 135 additions and 477 deletions
--- a/docs/index/iclp07.tex
+++ b/docs/index/iclp07.tex
@@ -936,6 +936,12 @@ and report times in seconds.

 \subsection{JITI Speedups} \label{sec:perf:speedups}
 %---------------------------------------------------
+% Our experience with the indexing algorithm described here shows a
+% significant performance improvement over the previous indexing code in
+% our system. Quite often, this has allowed us to tackle applications
+% which previously would not have been feasible. We next present some
+% results that show how useful the algorithms can be.
+
       Here I already have "compress", "mutagenesis" and "sg\_cyl"
       The "sg\_cyl" has a really impressive speedup (2 orders of
       magnitude).  We should keep the explanation in your text.
@@ -943,217 +949,6 @@ and report times in seconds.
       If time permits, we should also add some FSA benchmarks
       (e.g. "k963", "dg5" and "tl3" from PLDI)

-\subsection{JITI in ILP} \label{sec:perf:ILP}
-%--------------------------------------------
-The need for just-in-time indexing was originally motivated by ILP
-applications.  Table~\ref{tab:aleph} shows JITI performance on some
-learning tasks using the ALEPH system~\cite{}. The dataset
-\texttt{Krki} tries to learn rules for chess end-games;
-\texttt{GeneExpression} learns rules for yeast gene activity given a
-database of genes, their interactions, plus micro-array data;
-\texttt{BreastCancer} processes real-life patient reports towards
-predicting whether an abnormality may be malignant;
-\texttt{IE-Protein\_Extraction} processes information extraction from
-paper abstracts to search proteins; \texttt{Susi} learns from shopping
-patterns; and \texttt{Mesh} learns rules for finite-methods mesh
-design. The datasets \texttt{Carcinogenesis}, \texttt{Choline},
-\texttt{Mutagenesis}, \texttt{Pyrimidines}, and \texttt{Thermolysin}
-are about predicting chemical properties of compounds. The first three
-datasets present properties of interest as boolean attributes, but
-\texttt{Thermolysin} learns from the 3D structure of a molecule.
-Several of these datasets are standard across Machine Learning
-literature.  \texttt{GeneExpression}~\cite{} and
-\texttt{BreastCancer}~\cite{} were partly developed by some of the
-authors.  Most datasets perform simple queries in an extensional
-database. The exception is \texttt{Mutagenesis} where several
-predicates are defined intensionally, requiring extensive computation.
-
-
-\begin{table}[ht]
-%\vspace{-\intextsep}
-%\begin{table}[htbp] 
-%\centering
-  \centering
-  \begin {tabular}{|l|r|r|r|r|} \hline %\cline{1-3}                             
-    &  \multicolumn{2}{|c|}{\bf Time in sec.}  & \bf \JITI \\
-    {\bf Benchs.}  & \bf $A1$   & \bf JITI & \bf Ratio \\
-    \hline
-    \texttt{BreastCancer}      & 1450    & 88 & 16\\
-    \texttt{Carcinogenesis}    & 17,705    & 192  &92\\
-    \texttt{Choline}           & 14,766    & 1,397  & 11  \\
-    \texttt{GeneExpression}    & 193,283     & 7,483    & 26    \\
-    \texttt{IE-Protein\_Extraction} &  1,677,146      & 2,909    & 577    \\
-    \texttt{Krki}              & 0.3        & 0.3      & 1      \\
-    \texttt{Krki II}           & 1.3     & 1.3     & 1     \\
-    \texttt{Mesh}              & 4    & 3  & 1.3  \\
-    \texttt{Mutagenesis}       & 51,775  & 27,746 & 1.9\\
-    \texttt{Pyrimidines}       & 487,545     & 253,235  & 1.9    \\
-    \texttt{Susi}              & 105,091    & 307    & 342  \\
-    \texttt{Thermolysin}       & 50,279      &  5,213     & 10      \\
-    \hline
-\end{tabular}
-\caption{Machine Learning (ILP) Datasets: Times are given in Seconds,
-  we give time for standard indexing with no indexing on dynamic
-  predicates versus the \JITI implementation}
-\label{tab:aleph}
-\end{table}
-
-We compare times for 10 runs of the saturation/refinement cycle of the
-ILP system.  Table~\ref{tab:aleph} shows results. The \texttt{Krki}
-datasets have small search spaces and small databases, so they
-essentially maintain performance. The \texttt{Mesh},
-\texttt{Mutagenesis}, and \texttt{Pyrimides} applications do not
-benefit much from indexing in the database, but they do benefit from
-indexing in the dynamic representation of the search space, as their
-running times halve.
-
-The \texttt{BreastCancer} and \texttt{GeneExpression} applications use
-1NF data (that is, unstructured data). The benefit here is from
-multiple-argument indexing.  \texttt{BreastCancer} is particularly
-interesting. It consists of 40 binary relations with 65k elements
-each, where the first argument is the key, like in \texttt{sg\_cyl}. We
-know that most calls have the first argument bound, hence indexing was
-not expected to matter very much. Instead, the results show \JITI
-running time to improve by an order of magnitude. Like in
-\texttt{sg\_cyl}, this suggests that relatively small numbers of badly
-indexed calls can dominate running time.
-
-\texttt{IE-Protein\_Extraction} and \texttt{Thermolysin} are example
-applications that manipulate structured data.
-\texttt{IE-Protein\_Extraction} is the largest dataset we considered,
-and indexing is simply critical: we could not run the application in
-reasonable time without JITI. \texttt{Thermolysin} is smaller and
-performs some computation per query: even so, indexing is very
-important.
-
-\begin{table*}[ht]
-  \centering
-  \begin {tabular}{|l|r|r|r|r|r||r|r|r|r|r|r|} \hline %\cline{1-3}
-    &  \multicolumn{5}{|c||}{\bf Static Code}  & \multicolumn{6}{|c|}{\bf Dynamic Code \& IDB} \\
-    &  \textbf{Clause} & \multicolumn{4}{|c||}{\bf Indexing Code}  & \textbf{Clause} & \multicolumn{5}{|c|}{\bf Indexing Code} \\
-    \textbf{Benchmarks} &   & Total & T & W & S &  & Total & T & C & W & S  \\
-    \hline
-    \texttt{BreastCancer}      & 60940 & 46887 & 46242 &
-    3126 & 125  & 630 & 14 &42 & 18& 57 &6 \\
-
-    \texttt{Carcinogenesis}    & 1801 & 2678
-    &1225 & 587 & 865  & 13512 & 942     & 291 & 91 & 457 & 102
- \\
-
-    \texttt{Choline}  & 666 & 174
-    &67 & 48 & 58 & 3172 & 174
-    & 76 & 4 & 48 & 45
- \\
-    \texttt{GeneExpression}    &  46726 & 22629
-    &6780 & 6473 & 9375 & 116463 & 9015
-    & 2703 & 932 & 3910 & 1469
- \\
-
-    \texttt{IE-Protein\_Extraction}    &146033 & 129333
-    &39279 & 24322 & 65732 & 53423 & 1531
-    & 467 & 108 & 868 & 86
- \\
-
-    \texttt{Krki}              & 678 & 117
-    &52 & 24 & 40 & 2047 & 24
-    & 10 & 2 & 10 & 1
- \\
-
-    \texttt{Krki II}           & 1866 & 715
-    &180 & 233 & 301 & 2055 & 26
-    & 11 & 2 & 11 & 1
- \\
-
-    \texttt{Mesh}              & 802 & 161
-    &49 & 18 & 93 & 2149 & 109
-    & 46 & 4 & 35 & 22
- \\
-
-    \texttt{Mutagenesis}       & 1412 & 1848
-    &1045 & 291 & 510 & 4302 & 595
-    & 156 & 114 & 264 & 61
- \\
-
-    \texttt{Pyrimidines}       & 774 & 218
-    &76 & 63 & 77 & 25840 & 12291
-    & 4847 & 43 & 3510 & 3888
- \\
-
-    \texttt{Susi}              & 5007 & 2509
-    &855 & 578 & 1076 & 4497 & 759
-    & 324 & 58 & 256 & 120
- \\
-
-    \texttt{Thermolysin}       & 2317 & 929
-    &429 & 184 & 315 & 116129 & 7064
-    & 3295 & 1438 & 2160 & 170
- \\
-
-    \hline
-\end{tabular}
-\caption{Memory Performance on Machine Learning (ILP) Datasets: memory
-  usage is given in KB}
-\label{tab:ilpmem}
-\end{table*}
-
-
-We have seen that using the \JITI does not impose a significant
-overhead. Table~\ref{tab:ilpmem} discusses the memory cost . It
-measures memory being spend at a point near the end of execution.
-Because dynamic memory expands and contracts, we chose a point where
-memory usage should be at a maximum. The first five columns show data
-usage on static predicates.  The leftmost sub-column represents the
-code used for clauses; the next sub-columns represent space used in
-indices for static predicates: the first column gives total usage,
-which consists of space used in the main tree, the expanded
-wait-nodes, and hash-tables.
-
-Static data-base sizes range from 146MB to 666KB, the latter mostly in
-system libraries. The impact of indexing code varies widely: it is
-more than the original code for \texttt{Mutagenesis}, almost as much
-for \texttt{IE-Protein\_Extraction}, and in most cases it adds at
-least a third and often a half to the original data-base. It is
-interesting to check the source of the space overhead: if the source
-are hash-tables, we can expect this is because of highly-complex
-indices. If overhead is in \emph{wait-nodes}, this again suggests a
-sophisticated indexing structure. Overhead in the main tree may be
-caused by a large number of nodes, or may be caused by \texttt{try}
-nodes.
-
-One first conclusion is that \emph{wait-nodes} are costly space-wise,
-even if they are needed to achieve sensible compilation times.  On the
-other hand, whether the space is allocated to the tree or to the
-hashes varies widely. \texttt{IE-Protein\_Extraction} is an example
-where the indices seem very useful: most space was spent in the
-hash-tables, although we still are paying much for \emph{wait-nodes}.
-\texttt{BreastCancer} has very small hash-tables, because it
-attributes range over small domains, but indexing is useful (we
-believe this is because we are only interested in the first solution
-in this case).
-
-This version of the ILP system stores most dynamic data in the IDB.
-The size of reflects the search space, and is largely independent of
-the program's static data (notice that small applications such as
-\texttt{Krki} do tend to have a small search space). Aleph's author
-very carefully designed the system to work around overheads in
-accessing the data-base, so indexing should not be as important.  In
-fact, indexing has a much lower space overhead in this case,
-suggesting it is not so critical.  On the other hand, looking at the
-actual results shows that indexing is working well: most space is
-spent on hashes and the tree, little space is spent on \texttt{try}
-instructions. It is hard to separate the contributions of JITI on
-static and dynamic data, but the results for \texttt{Mesh} and
-\texttt{Mutagenesis}, where the JITI probably has little impact on
-static code, suggest a factor of two from indexing on the IDB in this
-case.
-
-
-% Our experience with the indexing algorithm described here shows a
-% significant performance improvement over the previous indexing code in
-% our system. Quite often, this has allowed us to tackle applications
-% which previously would not have been feasible. We next present some
-% results that show how useful the algorithms can be.
-
 Next, we present performance results for demand-driven indexing on a
 number of benchmarks and real-life applications. Throughout, we
 compare performance with single argument indexing. We use YAP-5.1.2
@@ -1216,128 +1011,40 @@ improves performance in the latter case only, but this does make a
 large difference, as the WAM code has to visit all thousand clauses if
 the second argument is unbound.

-The graph reachability datasets because they both use the same
-program, but on different databases. The t-test does not show a
-significant difference
-
-} the database
-itself.  The JITI brings little benefits on the linear graphs if we
-call the \texttt{path/3} predicates with left or right recursion. On
-the other hand, it always improves performance when using the doubly
-recursive version, and it always improves performance on the tree
-graph.
-
-To understand why, we first consider the simplest execution pattern,
-given by the left-recursive procedure. The code for the LRF is:
-
-\begin{verbatim}
-path1(X,Y,[X,Y]) :- arc(X,Y).
-path1(X,Y,[X|P]) :- arc(X,Z), 
-                    path1(Z,Y,P), 
-                    not_member(X,P).
-\end{verbatim}
-\noindent
-Careful inspection of the program shows that \texttt{arc/3} can be
-accessed with different modes. First, given the top-level goal
-$path1(X,Y,\_)$ the two clauses for \texttt{path1/3} call
-\texttt{arc/3} with both arguments free.  Second, the recursive call
-to \texttt{path1/3} can call \texttt{arc/3} in the base clause with
-\emph{both arguments bound}. If the graph is linear, the second
-argument is functionally dependent on the first, and indexing on the
-first argument is sufficient. But, if the graph has a branching factor
-$> 1$, WAM style first argument indexing will lead to backtracking,
-whereas the JITI can perform direct lookup through the hash tables.
-This explains the performance improvement for the \texttt{tree}
-graphs.
-
-Do such improvements hold for real applications? An interesting
-application of tabled Prolog is in program analysis, often based in
-Anderson's points-to analysis~\cite{anderson-phd}. In this framework,
-imperative programs are encoded as a set of facts, and properties of
-interest are encoded rules. Program properties can be verified by
-checking the closure of the rules. Such programs therefore have
-similar properties to the \texttt{path} benchmarks, and should
-generate similar performance. Table~\ref{tab:pa} shows such
-applications. The first analyses a smallish program and the second the
-\texttt{javac} benchmark.
-
-\begin{table}[ht]
-  \centering
-  \begin {tabular}{|l|r|r||r|r|r||} \hline %\cline{1-3}
-    &  \multicolumn{2}{|c||}{\bf Time in sec.}  &
-    \multicolumn{3}{|c||}{\bf Static Space in KB.} \\
-    {\bf Benchs.}  & \bf $A_1$   & \bf JITI & \bf Clause & \multicolumn{2}{|c||}{\bf Indices} \\
-    &    &  &  & \bf $A_1$   & \bf JITI \\
-    \hline
-    \texttt{pta}    & 14  & 1.7  & 845   & 318  & 351 \\
-    \texttt{tea}    & 800 & 36.9 & 36781 & 1793 & 2848 \\
-    \hline
-\end{tabular}
-\caption{Program Analysis}
-\label{tab:pa}
-\end{table}
-
-Table~\ref{tab:pa} shows total running times, and size of static
-data-base in KB for a YAP run. The first column shows the size in
-clauses, the other two show the size of the indices when using
-single-argument indexing and the JITI.
-
-
-\begin{table}[ht]
-%\vspace{-\intextsep}
-%\begin{table}[htbp] 
-%\centering
-  \centering
-  \begin {tabular}{|l|r|r|r|r|} \hline %\cline{1-3}
-    &  \multicolumn{2}{|c|}{\bf Time in sec.}  & \bf JITI \\
-    {\bf Benchs.}  & \bf $A1$   & \bf JITI & \bf Ratio \\
-    \hline
-    \texttt{BreastCancer}      & 1450    & 88 & 16\\
-    \texttt{Carcinogenesis}    & 17,705    & 192  &92\\
-    \texttt{Choline}           & 14,766    & 1,397  & 11  \\
-    \texttt{GeneExpression}    & 193,283     & 7,483    & 26    \\
-    \texttt{IE-Protein\_Extraction} &  1,677,146      & 2,909    & 577    \\
-    \texttt{Krki}              & 0.3        & 0.3      & 1      \\
-    \texttt{Krki II}           & 1.3     & 1.3     & 1     \\
-    \texttt{Mesh}              & 4    & 3  & 1.3  \\
-    \texttt{Mutagenesis}       & 51,775  & 27,746 & 1.9\\
-    \texttt{Pyrimidines}       & 487,545     & 253,235  & 1.9    \\
-    \texttt{Susi}              & 105,091    & 307    & 342  \\
-    \texttt{Thermolysin}       & 50,279      &  5,213     & 10      \\
-    \hline
-\end{tabular}
-\caption{Machine Learning (ILP) Datasets}
-\label{tab:aleph}
-\end{table}
-
-
-
-JITI was originally motivated by applications in the area of Machine
-Learning that try to learn rules from databases (our compiler is used
-on a number of such systems). Table~\ref{tab:aleph} shows performance
-for one of the most popular such systems in some detail.  The datasets
+\subsection{JITI in ILP} \label{sec:perf:ILP}
+%--------------------------------------------
+The need for just-in-time indexing was originally motivated by ILP
+applications.  Table~\ref{tab:aleph} shows JITI performance on some
+learning tasks using the ALEPH system~\cite{}. The dataset
+\texttt{Krki} tries to learn rules from a small database of chess
+end-games; \texttt{GeneExpression} learns rules for yeast gene
+activity given a database of genes, their interactions, and
+micro-array gene expression data; \texttt{BreastCancer} processes
+real-life patient reports to predict whether an abnormality
+may be malignant; \texttt{IE-Protein\_Extraction} performs
+information extraction on paper abstracts to search for proteins;
+\texttt{Susi} learns from shopping patterns; and \texttt{Mesh} learns
+rules for finite-methods mesh design. The datasets
 \texttt{Carcinogenesis}, \texttt{Choline}, \texttt{Mutagenesis},
 \texttt{Pyrimidines}, and \texttt{Thermolysin} are about predicting
-chemical properties of compounds. Most queries perform very simple
-queries in an extensional database; \texttt{Mutagenesis} includes
-several predicates defined as rules; and \texttt{Thermolysin} performs
-simple 3D distance computations.  \texttt{Krki} are chess end-games.
-\texttt{GeneExpression} processes micro-array data,
-\texttt{BreastCancer} real-life patient reports,
-\texttt{IE-Protein\_Extraction} information extraction from paper
-abstracts that mention proteins, \texttt{Susi} shopping patterns, and
-\texttt{Mesh} finite-methods mesh design. Several of these datasets
-are standard across Machine Learning literature.
-\texttt{GeneExpression} and \texttt{BreastCancer} were partly
-developed by the authors.
+chemical properties of compounds. The first three datasets store
+properties of interest as tables, but \texttt{Thermolysin} learns from
+the 3D-structure of a molecule's conformations.  Several of these
+datasets are standard in the Machine Learning literature.
+\texttt{GeneExpression}~\cite{} and \texttt{BreastCancer}~\cite{} were
+partly developed by some of the authors.  Most datasets perform simple
+queries in an extensional database. The exception is
+\texttt{Mutagenesis} where several predicates are defined
+intensionally, requiring extensive computation.
+

 \begin{table}[ht]
 %\vspace{-\intextsep}
 %\begin{table}[htbp] 
 %\centering
  \centering
-  \begin {tabular}{|l|r|r|r|r|} \hline %\cline{1-3}
-    &  \multicolumn{2}{|c|}{\bf Time in sec.}  & \bf JITI \\
+  \begin {tabular}{|l|r|r|r|r|} \hline %\cline{1-3}                             
+    &  \multicolumn{2}{|c|}{\bf Time in sec.}  & \bf \JITI \\
    {\bf Benchs.}  & \bf $A1$   & \bf JITI & \bf Ratio \\
    \hline
    \texttt{BreastCancer}      & 1450    & 88 & 16\\
@@ -1354,210 +1061,161 @@ developed by the authors.
    \texttt{Thermolysin}       & 50,279      &  5,213     & 10      \\
    \hline
 \end{tabular}
-\caption{Machine Learning (ILP) Datasets}
+\caption{Machine Learning (ILP) datasets: times are given in seconds
+  for standard indexing, with no indexing on dynamic predicates,
+  versus the \JITI implementation}
 \label{tab:aleph}
 \end{table}

 We compare times for 10 runs of the saturation/refinement cycle of the
-ILP system.  Table~\ref{tab:aleph} shows very clearly the advantages
-of JITI: speedups range up to two orders of magnitude.  Applications
-such as \texttt{BreastCancer} and \texttt{GeneExpression} manipulate
-1NF data (that is, unstructured data). The first benefit is from
-multiple-argument indexing. Multi-argument is available in other
-Prolog systems~\cite{BIM,xsb-manual,ZhTaUs-small,SWI}), but using
-it would require extra user information that would be hard to most ILP
-users: the JITI provides that for free.  Just multi-argument indexing
-does not explain everything.  \texttt{BreastCancer} results were of
-particular interest to us because the dataset was to a large extent
-developed by the authors. It consists of 40 binary relations which are
-most often used with the first argument as a key (it is almost
-propositional learning). We did not expect a huge speedup, but the
-results show the opposite: calls with both arguments bound, or with
-the second argument bound may not be very frequent, but they are
-frequent enough to justify indexing.  This would be difficult to
-predict beforehand, even to experienced Prolog programmers.
+ILP system.  Table~\ref{tab:aleph} shows results. The \texttt{Krki}
+datasets have small search spaces and small databases, so they
+essentially achieve the same performance under both versions: there is
+no slowdown. The \texttt{Mesh}, \texttt{Mutagenesis}, and
+\texttt{Pyrimidines} applications do not benefit much from indexing in
+the database, but they do benefit from indexing in the dynamic
+representation of the search space, as their running times halve.
+
+The \texttt{BreastCancer} and \texttt{GeneExpression} applications use
+1NF data (that is, unstructured data). The benefit here is mostly from
+multiple-argument indexing.  \texttt{BreastCancer} is particularly
+interesting. It consists of 40 binary relations with 65k elements
+each, where the first argument is the key, as in
+\texttt{sg\_cyl}. We know that most calls have the first argument
+bound, hence indexing was not expected to matter very much. Instead,
+the results show that the \JITI improves running time by an order of
+magnitude. As in \texttt{sg\_cyl}, this suggests that even a small
+percentage of badly indexed calls can come to dominate running time.

 \texttt{IE-Protein\_Extraction} and \texttt{Thermolysin} are example
 applications that manipulate structured data.
-\texttt{IE-Protein\_Extraction} is a large dataset, therefore indexing
-is simply critical: we could not run the application in reasonable
-time without JITI. \texttt{Thermolysin} is smaller and performs
-significant computation per query: even so, indexing is very
-important.
-
-Indexing is no magical bullet. On the flip side, \texttt{Mutagenesis}
-is an example where indexing does help, but not by much. The problem
-is that most time is spent on recursive predicates that were built to
-use the first argument. \texttt{Mutagenesis} also shows a concern with
-JITI: we generate large indices but we do not benefit very much.
+\texttt{IE-Protein\_Extraction} is the largest dataset we consider,
+and indexing is simply critical: it is not possible to run the
+application in reasonable time with single-argument
+indexing. \texttt{Thermolysin} is smaller and performs some
+computation per query: even so, indexing improves performance by an
+order of magnitude.

 \begin{table*}[ht]
  \centering
-  \begin {tabular}{|l|r|r|r|r|r||r|r|r|r|r|r|} \hline %\cline{1-3}
-    &  \multicolumn{5}{|c||}{\bf Static Code}  & \multicolumn{6}{|c|}{\bf Dynamic Code \& IDB} \\
-    &  \textbf{Clause} & \multicolumn{4}{|c||}{\bf Indexing Code}  & \textbf{Clause} & \multicolumn{5}{|c|}{\bf Indexing Code} \\
-    \textbf{Benchmarks} &   & Total & T & W & S &  & Total & T & C & W & S  \\
+  \begin {tabular}{|l|r|r||r|r|} \hline %\cline{1-3}
+    &  \multicolumn{2}{|c||}{\bf Static Code}  & \multicolumn{2}{|c|}{\bf Dynamic Code} \\
+    \textbf{Benchmarks}    &  \textbf{Clause} & {\bf Index}  & \textbf{Clause} & {\bf Index} \\
+%    \textbf{Benchmarks} &   & Total & T & W & S &  & Total & T & C & W & S  \\
    \hline
-    \texttt{BreastCancer}      & 60940 & 46887 & 46242 &
-    3126 & 125  & 630 & 14 &42 & 18& 57 &6 \\
+    \texttt{BreastCancer}
+    & 60940 & 46887 
+    % & 46242 & 3126  & 125
+    & 630  & 14
+    % &42 & 18& 57 &6
+    \\

-    \texttt{Carcinogenesis}    & 1801 & 2678
-    &1225 & 587 & 865  & 13512 & 942     & 291 & 91 & 457 & 102
- \\
+    \texttt{Carcinogenesis} 
+    & 1801 & 2678
+    % &1225 & 587 & 865
+    & 13512 & 942
+    %& 291 & 91 & 457 & 102
+    \\

    \texttt{Choline}  & 666 & 174
-    &67 & 48 & 58 & 3172 & 174
-    & 76 & 4 & 48 & 45
+    % &67 & 48 & 58
+    & 3172 & 174
+    % & 76 & 4 & 48 & 45
 \\
    \texttt{GeneExpression}    &  46726 & 22629
-    &6780 & 6473 & 9375 & 116463 & 9015
-    & 2703 & 932 & 3910 & 1469
+    % &6780 & 6473 & 9375
+    & 116463 & 9015
+    %& 2703 & 932 & 3910 & 1469
 \\

    \texttt{IE-Protein\_Extraction}    &146033 & 129333
-    &39279 & 24322 & 65732 & 53423 & 1531
-    & 467 & 108 & 868 & 86
+    %&39279 & 24322 & 65732
+    & 53423 & 1531
+    %& 467 & 108 & 868 & 86
 \\

    \texttt{Krki}              & 678 & 117
-    &52 & 24 & 40 & 2047 & 24
-    & 10 & 2 & 10 & 1
+    %&52 & 24 & 40
+    & 2047 & 24
+    %& 10 & 2 & 10 & 1
 \\

    \texttt{Krki II}           & 1866 & 715
-    &180 & 233 & 301 & 2055 & 26
-    & 11 & 2 & 11 & 1
+    %&180 & 233    & 301
+    & 2055 & 26
+    %& 11 & 2 & 11 & 1
 \\

    \texttt{Mesh}              & 802 & 161
-    &49 & 18 & 93 & 2149 & 109
-    & 46 & 4 & 35 & 22
+    %&49 & 18 & 93
+    & 2149 & 109
+    %& 46 & 4 & 35 & 22
 \\

    \texttt{Mutagenesis}       & 1412 & 1848
-    &1045 & 291 & 510 & 4302 & 595
-    & 156 & 114 & 264 & 61
+    %&1045 & 291 & 510
+    & 4302 & 595
+    %& 156 & 114 & 264 & 61
 \\

    \texttt{Pyrimidines}       & 774 & 218
-    &76 & 63 & 77 & 25840 & 12291
-    & 4847 & 43 & 3510 & 3888
+    %&76 & 63 & 77
+    & 25840 & 12291
+    %& 4847 & 43 & 3510 & 3888
 \\

    \texttt{Susi}              & 5007 & 2509
-    &855 & 578 & 1076 & 4497 & 759
-    & 324 & 58 & 256 & 120
+    %&855 & 578 & 1076
+    & 4497 & 759
+    %& 324 & 58 & 256 & 120
 \\

    \texttt{Thermolysin}       & 2317 & 929
-    &429 & 184 & 315 & 116129 & 7064
-    & 3295 & 1438 & 2160 & 170
+    %&429 & 184 & 315
+    & 116129 & 7064
+    %& 3295 & 1438 & 2160 & 170
 \\

    \hline
 \end{tabular}
-\caption{Memory Performance on Machine Learning (ILP) Datasets}
+\caption{Memory Performance on Machine Learning (ILP) Datasets: memory
+  usage is given in KB}
 \label{tab:ilpmem}
 \end{table*}


-In general, one would wonder whether the benefits in time correspond
-to costs in space. Figure~\ref{tab:ilpmem} shows memory performance at
-a point near the end of execution. Numbers are given in KB. Because
-dynamic memory expands and contracts, we chose a point where dynamic
-memory should be at maximum usage. The first five columns show data
-usage on static predicates.  The leftmost sub-column represents the
-code used for clause; the next sub-columns represent space used in
-indices for static predicates: the first column gives total usage,
-which consists of space used in the main tree, the expanded
-wait-nodes, and hash-tables.
+Table~\ref{tab:ilpmem} shows the memory cost paid in using
+\JITI. The table presents data obtained at a point near the end of
+execution.  Because dynamic memory expands and contracts, we chose a
+point where memory usage should be at a maximum. The first two columns
+show data usage on \emph{static} predicates. Static data-base sizes
+range from 146MB (\texttt{IE-Protein\_Extraction}) to less than 1MB
+(\texttt{Choline}, \texttt{Krki}, \texttt{Mesh}). Indexing code can be
+more than the original code, as in \texttt{Mutagenesis}, or almost as
+much, e.g., \texttt{IE-Protein\_Extraction}. In most cases the YAP \JITI
+adds at least a third and often a half to the original data-base. A
+more detailed analysis shows the source of overhead to be very
+different from dataset to dataset. In \texttt{IE-Protein\_Extraction}
+the problem is that hash tables are very large. Hash tables are also
+where most space is spent in \texttt{Susi}. In \texttt{BreastCancer}
+hash tables are actually small, so most space is spent in
+\TryRetryTrust chains. \texttt{Mutagenesis} is similar: even though
+YAP spends a large effort in indexing it still generates long
+\TryRetryTrust chains. Storing sets of matching clauses at \jitiSTAR
+nodes usually takes over 10\% of total memory usage, but is never dominant.

-Static data-base sizes range from 146MB to 666KB, the latter mostly in
-system libraries. The impact of indexing code varies widely: it is
-more than the original code for \texttt{Mutagenesis}, almost as much
-for \texttt{IE-Protein\_Extraction}, and in most cases it adds at
-least a third and often a half to the original data-base. It is
-interesting to check the source of the space overhead: if the source
-are hash-tables, we can expect this is because of highly-complex
-indices. If overhead is in \emph{wait-nodes}, this again suggests a
-sophisticated indexing structure. Overhead in the main tree may be
-caused by a large number of nodes, or may be caused by \texttt{try}
-nodes.
-
-One first conclusion is that \emph{wait-nodes} are costly space-wise,
-even if they are needed to achieve sensible compilation times.  On the
-other hand, whether the space is allocated to the tree or to the
-hashes varies widely. \texttt{IE-Protein\_Extraction} is an example
-where the indices seem very useful: most space was spent in the
-hash-tables, although we still are paying much for \emph{wait-nodes}.
-\texttt{BreastCancer} has very small hash-tables, because it
-attributes range over small domains, but indexing is useful (we
-believe this is because we are only interested in the first solution
-in this case).
-
-This version of the ILP system stores most dynamic data in the IDB.
-The size of reflects the search space, and is largely independent of
-the program's static data (notice that small applications such as
-\texttt{Krki} do tend to have a small search space). Aleph's author
-very carefully designed the system to work around overheads in
+This version of ALEPH uses the internal data-base to store the IDB.
+Its size reflects the search space, and is to some extent
+independent of the program's static data, although small applications
+such as \texttt{Krki} do tend to have a small search space. ALEPH's
+author very carefully designed the system to work around overheads in
 accessing the data-base, so indexing should not be as important.  In
 fact, indexing has a much lower space overhead in this case,
-suggesting it is not so critical.  On the other hand, looking at the
-actual results shows that indexing is working well: most space is
-spent on hashes and the tree, little space is spent on \texttt{try}
-instructions. It is hard to separate the contributions of JITI on
-static and dynamic data, but the results for \texttt{Mesh} and
-\texttt{Mutagenesis}, where the JITI probably has little impact on
-static code, suggest a factor of two from indexing on the IDB in this
-case.
+suggesting it is not so critical. A more detailed analysis shows that
+indexing is working well: most space is spent on hash tables and on
+internal nodes of the tree, and relatively little space is spent on
+\TryRetryTrust chains.

-Last, we discuss a natural language application, Van Noord's FSA
-toolbox. This is an implementation of a set of finite state automata
-for natural language tasks. The system includes a test suite with 150
-tasks. We selected the 10 tasks with longer-running times in the
-single argument version.
-
-\begin{table}[ht]
-  \centering
-  \begin {tabular}{|l|r|r||r|r|r||} \hline %\cline{1-3}
-    &  \multicolumn{2}{|c||}{\bf Time in msec.}  &
-    \multicolumn{3}{|c||}{\bf Dynamic Space in KB.} \\
-    {\bf Benchs.}  & \bf $A_1$   & \bf JITI & \bf Clause & \multicolumn{2}{|c||}{\bf Indices} \\
-    &    &  &  & \bf $A_1$   & \bf JITI \\
-    \hline
-    \texttt{k963}   & 1944 & 684  & 1348   & 26  & 40 \\
-    \texttt{k961}   & 1972 & 652  & 1348 & 26 & 40 \\
-    \texttt{k962}   & 1996 & 668  & 1350 & 26 & 40 \\
-    \texttt{drg3}   & 3532 & 3641 & 649 & 19 & 35 \\
-    \texttt{d2ph}   & 3612 & 3667 & 649 & 19 & 35 \\
-    \texttt{d2m}    & 3952 & 3668 & 649 & 19 & 35 \\
-    \texttt{ld1}    & 4084 & 4016 & 649 & 19 & 35 \\
-    \texttt{dg5}    & 6084 & 1352 & 3305 & 39 & 61 \\
-    \texttt{g2p}    & 25212& 14120 & 10373 & 47 & 67 \\
-    \texttt{tl3}    & 74476& 14925 & 14306 & 70 & 49 \\
-    \hline
-\end{tabular}
-\caption{Performance on a Natural Language Application}
-\label{tab:fsa}
-\end{table}
-
-FSA is very different from the two previous examples. These are
-relatively complex algorithms, and there is relatively little
-``data''. Even so, Table~\ref{tab:fsa} shows significant speedups from
-using JITI. Note that Table~\ref{tab:fsa} only shows memory
-performance on dynamic data: static data does not show very
-significant differences.  The results show two different types of
-tasks. In cases such as \texttt{tl3} or \texttt{dg5} JITI gives a
-significant speedup; in tasks such as \texttt{drg3} the difference
-does not seem to be significant, and it some cases JITI is slower.
-Analysis show that the tasks that do well are the tasks that use
-dynamic predicates. In this case, indexing is beneficial. Although
-there is an increase in total code, the indices are good: there is a
-reduction in the code for \texttt{try} instructions, and an increase
-in code for hash-tables, which indicates dynamic predicates are
-indexing well. In tasks such as \texttt{drg3} and friends, the JITI
-does not bring much benefits, whereas it spends extra time compiling
-and takes extra space.


 \section{Concluding Remarks}