SIGMOD 2010 Repeatability & Workability Evaluation for paper #396

"Continuous Sampling for Online Aggregation Over Multiple Queries"
by Sai Wu, Beng Chin Ooi, Kian-Lee Tan
School of Computing, National University of Singapore, Singapore, 117417
{wusai, ooibc, tankl}@comp.nus.edu.sg


Hardware & Software environment
===============================

       | Paper                   | Review
-------+-------------------------+---------------------------------
class  | server                  | desktop
CPU    | AMD Opteron 8356        | Intel Core2Quad Q6600
cores  | 4                       | 4
GHz    | 2.3 (AMD)               | 2.4
       | 2.6 (authors)           |
RAM    | 128 GB                  | 8 GB
       |                         | (but: `java -Xmx2000m`)
OS     | RedHat Enterprise 4.7   | Fedora 12
       | (Linux 2.6.9)           | (Linux 2.6.31)
Java   | Sun JDK v6              | OpenJDK 1.6.0_18 (IcedTea6 1.8)


Submission
==========

The authors provided

- their Java source code and the respective pre-compiled byte code;

- the required data sets (6 columns of the TPC-H "lineitem" table and
  8 columns of the materialized foreign-key join between TPC-H tables
  "lineitem" and "orders" on attribute "orderkey") as CSV text files
  for scale factors 1, 2, 3, 4, 5 --- scaled down by a factor of 2
  compared to the paper to reduce the submission size (a voluntary
  choice of the authors, not required by RWE);

- basic ("one-liner") shell scripts to individually run the
  experiments for each figure in the paper; these scripts basically
  call the generic "BatchRun" class with the figure number as
  parameter (see the sketch at the end of the "Process" subsection
  below).


Repeatability Evaluation
========================

Process
-------

The reviewer chose to run the pre-compiled byte code as provided by
the authors. Though the authors also provided their source code, they
did not provide a ready-to-use build environment, nor any instructions
more detailed than "import the sources into an IDE like Eclipse".
Setting up such an environment using an IDE like Eclipse or a build
tool like ANT was not an option for the reviewer. The reviewer also
did not review the source code in detail.

Running the originally provided shell scripts to execute the
experiments worked without errors. The total run took about 30 hours.
However, it then turned out that all experiments merely appended their
results in "human-readable" textual form to a single output file. The
authors had originally not provided any scripts to re-create the
graphs in their paper from these results. Hence, an assessment of the
results by comparing them "visually" to the ones presented in the
paper was not (easily) feasible. Extracting the numerical results and
creating 16 performance graphs by hand was not an option for the
reviewer.

On request, the authors modified their code to produce
"machine-readable" data files and gnuplot scripts to re-create their
graphs. Over a period of 2 weeks, various iterations of bug fixes were
required to get the new output format and gnuplot scripts working
correctly. For Figure 15, the reviewer eventually fixed the shell &
gnuplot scripts to get correct results and graphs.

The authors did not include a setup to repeat their reference baseline
experiments with PostgreSQL in Figures 6 & 7. Time constraints and the
lack of detailed instructions made it impossible for the reviewer to
set up such experiments with PostgreSQL or an alternative DBMS.
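For illustration, the following is a minimal sketch of what one
per-figure run, followed by the later graph re-creation step, might
look like. Only the "BatchRun" entry point, its figure-number
parameter, and the `-Xmx2000m` setting are taken from the submission;
the class path, all file names, and the gnuplot script contents are
hypothetical assumptions of the reviewer, not the authors' actual
artifacts:

    #!/bin/sh
    # Run the experiment for one figure (here: Figure 8) by calling
    # the generic BatchRun class with the figure number as parameter.
    # "aqp.jar" is a hypothetical archive name.
    java -Xmx2000m -cp aqp.jar BatchRun 8

    # The revised code is assumed to write a machine-readable data
    # file, hypothetically "fig8.dat"; a gnuplot script along these
    # lines would then re-create the graph (column layout, labels,
    # and output format are likewise assumptions).
    gnuplot <<'EOF'
    set terminal postscript eps
    set output 'fig8.eps'
    set xlabel 'error bound'
    set ylabel 'execution time (s)'
    plot 'fig8.dat' using 1:2 title 'AQP-Baseline' with linespoints, \
         'fig8.dat' using 1:3 title 'AQP-Direct'   with linespoints, \
         'fig8.dat' using 1:4 title 'AQP-Graph'    with linespoints
    EOF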
Detailed Results
----------------

* Figures 6 & 7 "Effect of Data Size" (for query templates 1 & 2,
  resp.)

  The repeated graphs (and hence results) appear to be similar to
  those in the paper. For the proposed techniques AQP-Baseline,
  AQP-Direct & AQP-Graph, the graphs basically show three
  indistinguishable horizontal (constant) lines close to 0. The
  scaling of the graphs to accommodate the linearly growing
  PostgreSQL and AQP-Complete results (while not using a logarithmic
  scale on the y-axis) prohibits any more detailed inspection.

* Figures 8 & 9 "Effect of Error Bound" (for query templates 1 & 2,
  resp.)

  In Figure 8, the repeated results match the original results quite
  well for AQP-Graph & AQP-Direct. The results for AQP-Baseline are
  about a factor 1.3 to 2 slower, increasing the gap between
  AQP-Baseline & AQP-Direct. In Figure 9, the repeated results match
  the original results quite well for AQP-Baseline, while the results
  for AQP-Direct & AQP-Graph are about a factor 2 to 4 faster than in
  the paper, again increasing the gap between AQP-Baseline &
  AQP-Direct. While we have no explanation for these differences, we
  do not consider them crucial.

* Figures 10 & 11 "Effect of Confidence" (for query templates 1 & 2,
  resp.)

  Similar observations as for Figures 8 & 9 above.

* Figure 15 "Effect of Result Sharing"

  The repeated results differ considerably from the ones in the
  paper. As opposed to almost reaching 1 for Progress 0-85, 0-90 &
  0-95, the "Improved Ratio" merely reaches ~0.75 and already drops
  below 0.70 as of Progress 0-90 (T1) resp. 0-95 (T2). For Progress
  0-98, it is only ~0.64 for both templates, while it is 0.71 in the
  paper. We do consider these differences crucial.

* Figure 16 "Effect of Concurrency"

  The repeated results match the original ones, except that the curve
  for AQP-Graph does not show the "dent", but rather is an almost
  straight, slightly linearly increasing line.

* Figure 18 "Effect of Partition Numbers"

  The results for AQP-Baseline & AQP-Graph show the same tendencies
  as the original ones. However, the results for AQP-Direct remain
  almost identical to those of AQP-Graph with a growing number of
  partitions, while they become significantly worse in the paper. We
  do consider these differences crucial.

* Figure 21 "Scan Vs Sampling (Error Rate)"

  The results for AQP-Direct & AQP-Graph show the same tendencies as
  the original ones, except that the cross-over point is slightly
  shifted to the right (from ~0.01 to ~0.025). For estimated error
  rates <= 0.03, the real error rate for AQP-Baseline is much higher
  than in the paper (and hence also much higher than with AQP-Direct
  & AQP-Graph). We cannot judge how crucial these differences are.

* Figures 12, 13, 14, 17, 19, 20

  The repeated results match the original ones, with differences not
  exceeding usual variation and hardware differences.

Summary
-------

All experiments could be repeated. The repeated results match the
original ones quite well (with differences not exceeding usual
variation and hardware differences) for 8 of the 16 figures. 6
figures show partly considerable, but (as far as we can judge)
non-crucial differences. 2 figures show (potentially) crucial
differences.

Workability Evaluation
======================

Due to lack of time and lack of detailed documentation and
instructions, the reviewer was not able to perform any workability
evaluation, like, e.g., using different error or confidence
parameters, different query templates, different data sets, or a
comparison to other DBMSs or online aggregation proposals from the
database research literature.