SIGMOD 2010 Repeatability & Workability Evaluation for paper #396

"Continuous Sampling for Online Aggregation Over Multiple Queries"
by Sai Wu, Beng Chin Ooi, Kian-Lee Tan
School of Computing, National University of Singapore, Singapore, 117417
{wusai, ooibc, tankl}@comp.nus.edu.sg


Hardware & Software environment
===============================

       | Paper                   | Review
-------+-------------------------+---------------------------------
class  | server                  | desktop
CPU    | AMD Opteron 8356        | Intel Core2Quad Q6600
cores  | 4                       | 4
GHz    | 2.3 (AMD)               | 2.4
       | 2.6 (authors)           |
RAM    | 128 GB                  | 8 GB
       |                         | (but: `java -Xmx2000m`)
OS     | RedHat Enterprise 4.7   | Fedora 12
       | (Linux 2.6.9)           | (Linux 2.6.31)
Java   | Sun JDK v6              | OpenJDK 1.6.0_18 (IcedTea6 1.8)


Submission
==========

The authors provided

- their Java source code and the respective pre-compiled byte code;

- the required data sets (6 columns of the TPC-H "lineitem" table and
  8 columns of the materialized foreign-key join between TPC-H tables
  "lineitem" and "orders" on attribute "orderkey") as CSV text files
  for scale factors 1, 2, 3, 4, 5 --- scaled down by a factor of 2
  compared to the paper to reduce the submission size (a voluntary
  choice of the authors, not required by RWE);

- basic ("one-liner") shell scripts to individually run the
  experiments for each figure in the paper; these scripts basically
  call the generic "BatchRun" class with the figure number as
  parameter (see the sketch at the end of the "Process" subsection
  below).


Repeatability Evaluation
========================

Process
-------

The reviewer chose to run the pre-compiled byte code as provided by
the authors. Though the authors also provided their source code, they
did not provide a ready-to-use build environment, nor any instructions
more detailed than "import the sources into an IDE like Eclipse".
Setting up such an environment using an IDE like Eclipse or a build
tool like ANT was not an option for the reviewer. The reviewer also
did not review the source code in detail.

Running the originally provided shell scripts to execute the
experiments worked without errors. The total run took about 30 hours.
However, it then turned out that all experiments merely appended their
results in "human-readable" textual form to a single output file. The
authors had originally not provided any scripts to re-create the
graphs in their paper from these results. Hence, an assessment of the
results by comparing them "visually" to the ones presented in the
paper was not (easily) feasible. Extracting the numerical results and
creating 16 performance graphs by hand was not an option for the
reviewer.

On request, the authors modified their code to produce
"machine-readable" data files and gnuplot scripts to re-create their
graphs. Over a period of 2 weeks, various iterations of bug fixes were
required to get the new output format and gnuplot scripts working
correctly. For Figure 15, the reviewer eventually fixed the shell &
gnuplot scripts to get correct results and graphs.

The authors did not include a setup to repeat their reference baseline
experiments with PostgreSQL in Figures 6 & 7. Time constraints and the
lack of detailed instructions made it impossible for the reviewer to
set up such experiments with PostgreSQL or an alternative DBMS.
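For illustration, the following is a minimal sketch of what one
per-figure run, followed by the later graph re-creation step, might
look like. Only the "BatchRun" entry point, its figure-number
parameter, and the `-Xmx2000m` setting are taken from the submission;
the class path, all file names, and the gnuplot script contents are
hypothetical assumptions of the reviewer, not the authors' actual
artifacts:

    #!/bin/sh
    # Run the experiment for one figure (here: Figure 8) by calling
    # the generic BatchRun class with the figure number as parameter.
    # "aqp.jar" is a hypothetical archive name.
    java -Xmx2000m -cp aqp.jar BatchRun 8

    # The revised code is assumed to write a machine-readable data
    # file, hypothetically "fig8.dat"; a gnuplot script along these
    # lines would then re-create the graph (column layout, labels,
    # and output format are likewise assumptions).
    gnuplot <<'EOF'
    set terminal postscript eps
    set output 'fig8.eps'
    set xlabel 'error bound'
    set ylabel 'execution time (s)'
    plot 'fig8.dat' using 1:2 title 'AQP-Baseline' with linespoints, \
         'fig8.dat' using 1:3 title 'AQP-Direct'   with linespoints, \
         'fig8.dat' using 1:4 title 'AQP-Graph'    with linespoints
    EOF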
Detailed Results
----------------

* Figures 6 & 7 "Effect of Data Size" (for query templates 1 & 2,
  resp.)

  The repeated graphs (and hence results) appear to be similar to
  those in the paper. For the proposed techniques AQP-Baseline,
  AQP-Direct & AQP-Graph, the graphs basically show three
  indistinguishable horizontal (constant) lines close to 0. The
  scaling of the graphs to accommodate the linearly growing
  PostgreSQL and AQP-Complete results (while not using a logarithmic
  scale on the y-axis) prohibits any more detailed inspection.

* Figures 8 & 9 "Effect of Error Bound" (for query templates 1 & 2,
  resp.)

  In Figure 8, the repeated results match the original results quite
  well for AQP-Graph & AQP-Direct. The results for AQP-Baseline are
  about a factor 1.3 to 2 slower, increasing the gap between
  AQP-Baseline & AQP-Direct. In Figure 9, the repeated results match
  the original results quite well for AQP-Baseline, while the results
  for AQP-Direct & AQP-Graph are about a factor 2 to 4 faster than in
  the paper, again increasing the gap between AQP-Baseline &
  AQP-Direct. While we have no explanation for these differences, we
  do not consider them crucial.

* Figures 10 & 11 "Effect of Confidence" (for query templates 1 & 2,
  resp.)

  Similar observations as for Figures 8 & 9 above.

* Figure 15 "Effect of Result Sharing"

  The repeated results differ considerably from the ones in the
  paper. As opposed to almost reaching 1 for Progress 0-85, 0-90 &
  0-95, the "Improved Ratio" merely reaches ~0.75 and already drops
  below 0.70 as of Progress 0-90 (T1) resp. 0-95 (T2). For Progress
  0-98, it is only ~0.64 for both templates, while it is 0.71 in the
  paper. We do consider these differences crucial.

* Figure 16 "Effect of Concurrency"

  The repeated results match the original ones, except that the curve
  for AQP-Graph does not show the "dent", but rather is an almost
  straight, slightly linearly increasing line.

* Figure 18 "Effect of Partition Numbers"

  The results for AQP-Baseline & AQP-Graph show the same tendencies
  as the original ones. However, the results for AQP-Direct remain
  almost identical to those of AQP-Graph with a growing number of
  partitions, while they become significantly worse in the paper. We
  do consider these differences crucial.

* Figure 21 "Scan Vs Sampling (Error Rate)"

  The results for AQP-Direct & AQP-Graph show the same tendencies as
  the original ones, except that the cross-over point is slightly
  shifted to the right (from ~0.01 to ~0.025). For estimated error
  rates <= 0.03, the real error rate for AQP-Baseline is much higher
  than in the paper (and hence also much higher than with AQP-Direct
  & AQP-Graph). We cannot judge how crucial these differences are.

* Figures 12, 13, 14, 17, 19, 20

  The repeated results match the original ones, with differences not
  exceeding usual variation and hardware differences.

Summary
-------

All experiments could be repeated. The repeated results match the
original ones quite well (with differences not exceeding usual
variation and hardware differences) for 8 of the 16 figures. 6
figures show partly considerable, but (as far as we can judge)
non-crucial differences. 2 figures show (potentially) crucial
differences.

Workability Evaluation
======================

Due to lack of time and lack of detailed documentation and
instructions, the reviewer was not able to perform any workability
evaluation, like, e.g., using different error or confidence
parameters, different query templates, different data sets, or a
comparison to other DBMSs or online aggregation proposals from the
database research literature.