diff --git a/docs/index/iclp07.tex b/docs/index/iclp07.tex index 948bb1fe5..59e8f54a9 100644 --- a/docs/index/iclp07.tex +++ b/docs/index/iclp07.tex @@ -49,14 +49,13 @@ \newcommand{\tea}{\bench{tea}\xspace} %------------------------------------------------------------------------------ \newcommand{\BreastCancer}{\bench{BreastCancer}\xspace} -\newcommand{\Carcinogenesis}{\bench{Carcinogenesis}\xspace} +\newcommand{\Carcino}{\bench{Carcinogenesis}\xspace} \newcommand{\Choline}{\bench{Choline}\xspace} -\newcommand{\GeneExpression}{\bench{GeneExpression}\xspace} +\newcommand{\GeneExpr}{\bench{GeneExpression}\xspace} \newcommand{\IEProtein}{\bench{IE-Protein\_Extraction}\xspace} \newcommand{\Krki}{\bench{Krki}\xspace} \newcommand{\KrkiII}{\bench{Krki~II}\xspace} \newcommand{\Mesh}{\bench{Mesh}\xspace} -\newcommand{\Mutagenesis}{\bench{Mutagenesis}\xspace} \newcommand{\Pyrimidines}{\bench{Pyrimidines}\xspace} \newcommand{\Susi}{\bench{Susi}\xspace} \newcommand{\Thermolysin}{\bench{Thermolysin}\xspace} @@ -1013,8 +1012,8 @@ in parentheses. For each variant of transitive closure, we issue two queries: one with mode \code{(in,out)} and one with mode \code{(out,out)}. % -For YAP, indices on the first argument and \TryRetryTrust are built on -all benchmarks under \JITI. +For YAP, indices on the first argument and \TryRetryTrust chains are +built on all benchmarks under \JITI. % For XXX, \JITI triggers on no benchmark but the \jitiONconstant instructions are executed for the three \bench{tc\_?\_oo} benchmarks. @@ -1069,8 +1068,9 @@ columns separately. %-------------------------------------------------------------------------- On the other hand, when \JITI is effective, it can significantly improve time performance. 
We use the following programs and -applications:\TODO{If time permits, we should also add FSA benchmarks -(\bench{k963}, \bench{dg5} and \bench{tl3})} +applications: +%% \TODO{For the journal version we should also add FSA benchmarks +%% (\bench{k963}, \bench{dg5} and \bench{tl3})} \begin{description} \item[\sgCyl] The same generation DB benchmark on a $24 \times 24 \times 2$ cylinder. We issue the open query. @@ -1122,52 +1122,80 @@ difference in this benchmark. \subsection{Performance of \JITI on ILP applications} \label{sec:perf:ILP} %------------------------------------------------------------------------- The need for \JITI was originally noticed in inductive logic -programming applications. Table~\ref{tab:ilp:time} shows \JITI -performance on some learning tasks using the ALEPH -system~\cite{ALEPH}. The dataset \Krki tries to learn rules from a -small database of chess end-games; \GeneExpression learns rules for +programming applications, which tend to issue ad hoc queries at +runtime, so their indexing requirements cannot be determined at +compile time. On the other hand, these applications operate on large amounts of +data, so memory consumption is a reasonable concern. We evaluate +JITI's time and space performance on some learning tasks using the +ALEPH system~\cite{ALEPH}. We use the following datasets: +% +% Table~\ref{tab:ilp:time} shows JITI performance. +The dataset \Krki tries to learn rules from a +small database of chess end-games; \GeneExpr learns rules for yeast gene activity given a database of genes, their interactions, and micro-array gene expression data; \BreastCancer processes real-life patient reports towards predicting whether an abnormality may be malignant; \IEProtein processes information extraction from paper abstracts to search proteins; \Susi learns from shopping patterns; and \Mesh learns rules for finite-methods mesh design. 
The datasets -\Carcinogenesis, \Choline, \Pyrimidines, and +\Carcino, \Choline, \Pyrimidines, and \Thermolysin try to predict chemical properties of compounds. The first three datasets store properties of interest as tables, but \Thermolysin learns from the 3D-structure of a molecule's conformations. Several of these datasets are standard across the -Machine Learning literature. \GeneExpression~\cite{ilp-regulatory06} +Machine Learning literature. \GeneExpr~\cite{ilp-regulatory06} and \BreastCancer~\cite{DBLP:conf/ijcai/DavisBDPRCS05} were partly -developed by some of the paper's authors. Most datasets perform simple +developed by an author of this paper. Most datasets perform simple queries in an extensional database. %------------------------------------------------------------------------------ \begin{table}[t] \centering - \caption{Machine Learning (ILP) Datasets: Times are given in Seconds, - we give time for standard indexing with no indexing on dynamic - predicates versus the \JITI implementation} - \label{tab:ilp:time} + \caption{Time and space performance on Machine Learning (ILP) Datasets} + \label{tab:ilp} \setlength{\tabcolsep}{3pt} - \begin{tabular}{|l||r|r|r|} \hline %\cline{1-3} - & \multicolumn{3}{|c|}{Time (in secs)} \\ + \subfigure[Time (in seconds)]{\label{tab:ilp:time} + \begin{tabular}{|l||r|r|r||} \hline + & \multicolumn{3}{|c||}{Time (in secs)} \\ \cline{2-4} - Benchmark & 1st & JITI &{\bf ratio} \\ + Benchmark & 1st & JITI &{\bf ratio} \\ \hline - \BreastCancer & 1450 & 88 & 16 \\ - \Carcinogenesis & 17,705 & 192 & 92 \\ - \Choline & 14,766 & 1,397 & 11 \\ - \GeneExpression & 193,283 & 7,483 & 26 \\ - \IEProtein & 1,677,146 & 2,909 & 577 \\ - \bench{Krki} & 0.3 & 0.3 & 1 \\ - \bench{Krki II} & 1.3 & 1.3 & 1 \\ - \Mesh & 4 & 3 & 1.3 \\ - \Pyrimidines & 487,545 & 253,235 & 1.9 \\ - \Susi & 105,091 & 307 & 342 \\ - \Thermolysin & 50,279 & 5,213 & 10 \\ + \BreastCancer & 1,450 & 88 & 16 \\ + \Carcino & 17,705 & 192 & 92 \\ + \Choline & 14,766 & 
1,397 & 11 \\ + \GeneExpr & 193,283 & 7,483 & 26 \\ + \IEProtein & 1,677,146 & 2,909 & 577 \\ + \Krki & 0.3 & 0.3 & 1 \\ + \KrkiII & 1.3 & 1.3 & 1 \\ + \Mesh & 4 & 3 & 1.3 \\ + \Pyrimidines & 487,545 & 253,235 & 1.9 \\ + \Susi & 105,091 & 307 & 342 \\ + \Thermolysin & 50,279 & 5,213 & 10 \\ \hline -\end{tabular} + \end{tabular} + } + \subfigure[Memory usage (in KB)]{\label{tab:ilp:memory} + \begin{tabular}{||r|r|r|r||} \hline + \multicolumn{2}{||c|}{Static code} + & \multicolumn{2}{|c||}{Dynamic code} \\ + \hline + \multicolumn{1}{||c|}{Clauses} & \multicolumn{1}{c}{Index} + & \multicolumn{1}{|c|}{Clauses} & \multicolumn{1}{c||}{Index}\\ + \hline + 60,940 & 46,887 & 630 & 14 \\ + 1,801 & 2,678 & 13,512 & 942 \\ + 666 & 174 & 3,172 & 174 \\ + 46,726 & 22,629 & 116,463 & 9,015 \\ + 146,033 & 129,333 & 53,423 & 1,531 \\ + 678 & 117 & 2,047 & 24 \\ + 1,866 & 715 & 2,055 & 26 \\ + 802 & 161 & 2,149 & 109 \\ + 774 & 218 & 25,840 & 12,291 \\ + 5,007 & 2,509 & 4,497 & 759 \\ + 2,317 & 929 & 116,129 & 7,064 \\ + \hline + \end{tabular} + } \end{table} %------------------------------------------------------------------------------ @@ -1179,7 +1207,7 @@ the same performance under both versions: there is no slowdown. The in the database, but they do benefit from indexing in the dynamic representation of the search space, as their running times halve. -The \BreastCancer and \GeneExpression applications use data in +The \BreastCancer and \GeneExpr applications use data in 1NF (that is, unstructured data). The benefit here is mostly from multiple-argument indexing. \BreastCancer is particularly interesting. It consists of 40 binary relations with 65k elements @@ -1199,90 +1227,6 @@ indexing. \Thermolysin is smaller and performs some computation per query: even so, indexing improves performance by an order of magnitude. 
-\begin{table*}[ht] - \centering - \caption{Memory Performance on Machine Learning (ILP) Datasets: memory - usage is given in KB} - \label{tab:ilp:memory} - \setlength{\tabcolsep}{3pt} - \begin {tabular}{|l|r|r||r|r|} \hline %\cline{1-3} - & \multicolumn{2}{|c||}{\bf Static Code} & \multicolumn{2}{|c|}{\bf Dynamic Code} \\ - Benchmark & \textbf{Clause} & {\bf Index} & \textbf{Clause} & {\bf Index} \\ -% \textbf{Benchmarks} & & Total & T & W & S & & Total & T & C & W & S \\ - \hline - \BreastCancer - & 60,940 & 46,887 - % & 46242 & 3126 & 125 - & 630 & 14 - % &42 & 18& 57 &6 - \\ - - \Carcinogenesis - & 1801 & 2678 - % &1225 & 587 & 865 - & 13,512 & 942 - %& 291 & 91 & 457 & 102 - \\ - - \Choline & 666 & 174 - % &67 & 48 & 58 - & 3172 & 174 - % & 76 & 4 & 48 & 45 - \\ - - \GeneExpression - & 46,726 & 22,629 - % &6780 & 6473 & 9375 - & 116,463 & 9015 - %& 2703 & 932 & 3910 & 1469 - \\ - - \bench{IE-Protein\_Extraction} - & 146,033 & 129,333 - %&39279 & 24322 & 65732 - & 53,423 & 1531 - %& 467 & 108 & 868 & 86 - \\ - - \bench{Krki} & 678 & 117 - %&52 & 24 & 40 - & 2047 & 24 - %& 10 & 2 & 10 & 1 - \\ - - \bench{Krki II} & 1866 & 715 - %&180 & 233 & 301 - & 2055 & 26 - %& 11 & 2 & 11 & 1 - \\ - - \bench{Mesh} & 802 & 161 - %&49 & 18 & 93 - & 2149 & 109 - %& 46 & 4 & 35 & 22 - \\ - - \bench{Pyrimidines} & 774 & 218 - %&76 & 63 & 77 - & 25,840 & 12,291 - %& 4847 & 43 & 3510 & 3888 - \\ - - \bench{Susi} & 5007 & 2509 - %&855 & 578 & 1076 - & 4497 & 759 - %& 324 & 58 & 256 & 120 - \\ - - \bench{Thermolysin} & 2317 & 929 - %&429 & 184 & 315 - & 116,129 & 7064 - %& 3295 & 1438 & 2160 & 170 - \\ - \hline -\end{tabular} -\end{table*} - Table~\ref{tab:ilp:memory} shows the memory cost paid for \JITI. The table presents data obtained at a point near the end of execution. @@ -1291,7 +1235,7 @@ memory usage should be at a maximum. The first two numbers show data usage on \emph{static} predicates. 
Static data-base sizes range from 146MB (\bench{IE-Protein\_Extraction} to less than a MB (\bench{Choline}, \bench{Krki}, \bench{Mesh}). Indexing code can grow -to be as large as than the original code, as in \Carcinogenesis, or +to be as large as the original code, as in \Carcino, or almost as much, e.g., \bench{IE-Protein\_Extraction}. In most cases the YAP \JITI adds at least a third and often a half to the original data-base. A more detailed analysis shows the source of overhead to be @@ -1306,16 +1250,15 @@ usage, but is never dominant. This version of ALEPH uses the internal data-base to store the IDB. The size of reflects the search space, and is to some extent independent of the program's static data, although small applications -such as \bench{Krki} do tend to have a small search space. ALEPH's +such as \bench{Krki} tend to have a small search space. ALEPH's author very carefully designed the system to work around overheads in -accessing the data-base, so indexing should not be as critical. The -low overheads suggest that the \JITI is working well, as confirmed in -a more detailed analysis: most space is spent on hashes tables and on +accessing the database, so indexing should not be as critical. The +low overheads suggest that \JITI is working well, as confirmed in +a more detailed analysis: most space is spent on hash tables and on internal nodes of tree, and relatively little space is spent on \TryRetryTrust chains. - \section{Concluding Remarks} %=========================== \begin{itemize}