SIGMOD 2010 Repeatability & Workability Evaluation for paper #12

Bed-Tree: An All-Purpose Index Structure for String Similarity Search
Based on Edit Distance
by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava

Hardware & Software environment
===============================

       | Paper                      | Review 1
-------+----------------------------+------------------------------------------------------------
class  | laptop                     | desktop
CPU    | AMD Quad-Core Opteron 8356 | AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ at 1000 MHz
cores  | 4                          | 2
RAM    | 128 GB                     | 2 GB
HDD    | (not specified)            | 250 GB SATA WD Caviar SE 3.5" 7200 rpm ATA300, 8 MB cache
OS     | Red Hat Linux              | Fedora 11, Linux 2.6.30.9-96.fc11.x86_64

Submission
==========

The authors provided
- source code for all of their own methods and for the competitor techniques;
- data generators;
- scripts to re-run all experiments and collect the output of the proposed
  methods in a text file.

Repeatability Evaluation
========================

Process
-------

The given instructions are clear and simple to follow. The code did not
compile initially, but could be fixed easily. The code had some bugs when
dealing with the protein dataset, which led to segfaults; these were
(partially) fixed quickly by the authors. In the original submission, the
bug was avoided by setting, in the benchmark scripts, a different page size
from the one used in the paper.

Detailed Results
----------------

The paper reports on an impressive number of experiments (29 in total).
Each experiment is very labour-intensive (it creates a new index structure
and runs 100 queries) and took roughly 24 hours to run on our machines.
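For context when reading the figure-by-figure results below: a range query
with threshold theta retrieves all strings within edit distance theta of
the query string, and a top-k query retrieves the k closest strings. The
following is a minimal sketch of the standard dynamic-programming edit
distance with a theta cutoff; it is our own illustration (function name and
signature included), not the authors' B^ed-tree code.

    // Illustration only; this is NOT part of the authors' submission.
    // Standard dynamic-programming (Levenshtein) edit distance with a
    // threshold cutoff: once every entry of a DP row exceeds theta, the
    // final distance must also exceed theta, so we can stop early.
    #include <algorithm>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    // Returns the edit distance of a and b, capped at theta + 1.
    int edit_distance(const std::string& a, const std::string& b, int theta) {
        const int n = static_cast<int>(a.size());
        const int m = static_cast<int>(b.size());
        if (std::abs(n - m) > theta) return theta + 1;   // length filter
        std::vector<int> prev(m + 1), cur(m + 1);
        for (int j = 0; j <= m; ++j) prev[j] = j;        // row for the empty prefix
        for (int i = 1; i <= n; ++i) {
            cur[0] = i;
            int row_min = cur[0];
            for (int j = 1; j <= m; ++j) {
                cur[j] = std::min({prev[j] + 1,          // deletion
                                   cur[j - 1] + 1,       // insertion
                                   prev[j - 1] + (a[i - 1] != b[j - 1])});
                row_min = std::min(row_min, cur[j]);
            }
            if (row_min > theta) return theta + 1;       // early termination
            std::swap(prev, cur);
        }
        return std::min(prev[m], theta + 1);
    }

    int main() {
        std::cout << edit_distance("kitten", "sitting", 3) << "\n";  // prints 3
    }

The row-minimum check stops the verification as soon as the distance
provably exceeds theta; this is the standard way threshold queries are
verified once an index has produced candidate strings.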
* Table 9
  We consistently measured visibly better times for BD, while the times for
  BGC and BGL stay close to those reported in the paper.

* Figures 4a,b,c
  Reproducible.

* Figure 4d
  BGC times can be reproduced, and thus the discussion in the paper could be
  verified. BD times can also be reproduced, except for theta=16, where our
  time is roughly 40 times worse (78 seconds vs. 2 seconds). BGL is also
  consistently worse than BD, which contradicts the figure in the paper: it
  needs between 7.5 and 8 seconds for all values of theta. Flamingo is
  better than reported.

* Figures 5a,b
  Reproducible.

* Figure 5c
  The IO times for BD and BGL can be reproduced very accurately, but not
  those for BGC, which are better in the repeated results by about one order
  of magnitude. The authors confirmed that the BGC curve is slightly off on
  the first two points.

* Figure 5d
  The repeated results vary considerably from those reported. This has also
  been confirmed by the authors.

* Figures 6a,b,c
  The trends for all of BD, BGC, and BGL are reproducible (also relative to
  each other).

* Figure 6d
  Overall, the trends and the relative positioning of BD, BGC, and BGL are
  reproducible. However, the repeated results (query IO times divided by
  1,000) are worse than those reported by a factor of 2-3. In contrast, the
  repeated IO times obtained for the other experiments (those not on protein
  data) are very close to those reported in the paper.

* Figures 7a,b,c
  The processing times for BD, BGC, and BGL can be reproduced very
  accurately.

* Figure 7d
  The trends are reproducible; the timings and the relative positioning of
  the three B^ed-tree methods are not: regardless of the value of k (for
  top-k), BD needs about 95 seconds, BGC about 15 seconds, and BGL about 28
  seconds. This differs from the roughly 0.1 seconds reported in the figure
  for all three methods. The relative performance order of the three methods
  in the repeated results is exactly the reverse of the order reported in
  the paper (where all three methods perform almost identically). None of
  the Flamingo experiments worked; they were aborted after running for more
  than 1,000 minutes each.

* Figure 8a
  BD and BGC performed as reported in the paper. BGL performed slightly
  better than BGC in the repeated experiments, whereas in the paper BGL is
  reported to perform worse.

* Figure 8b
  BD and BGC performed as reported in the paper, with the exception that,
  for 128 MB of available memory, we obtained much better IO performance
  (900 vs. 40,000, and 14,000 vs. 38,000). BGL performed better in our
  experiments for 8 MB and 16 MB than reported in the paper (85,000 vs.
  100,000).

* Figure 8c
  BD performed as reported in the paper; BGC performed consistently worse in
  our experiments (an extra IO time of about 10,000, except for 128 MB,
  where the times are the same); BGL performed as reported in the paper.

* Figure 8d
  The trends and the relative positioning of BD, BGC, and BGL are
  reproducible. However, the repeated results (query IO times divided by
  1,000) are worse than those reported in the paper by a factor of 2-3: 120
  vs. 60 for BD, 35 vs. 12 for BGC, and 100 vs. 38 for BGL.

* Figures 9a,b,c,d
  Repeatable.

* Figure 10a
  Repeatable.

* Figure 10b
  The trend is reproducible. The repeated results are, however, nearly three
  orders of magnitude worse (for k = 1, 2, 4, 8, 16): repeated: 149.3,
  246.3, 284.0, 328.3, 362.3 seconds; reported: 0.2, 0.35, 0.41, 0.51, 0.6
  seconds.

* Figure 11a
  Repeatable for BGC.

* Figure 11b
  This experiment could not be run for page sizes of 8 KB or 4 KB, even with
  the newly provided code.

Summary
-------

Overall, most of the experiments are reproducible. We consistently found
problems with the reported experiments on the protein data set. Some other
experiments have repeatability problems, but these do not seriously affect
the claims made in the paper about the performance of the three proposed
methods. The results for the competitors Flamingo and Mismatch are not
reproducible.

Workability Evaluation
======================

Workability was evaluated by varying the number of queries used in the
experiments and the page size (4 KB, 8 KB, and 16 KB). We found that
workability is mostly achieved, with a few exceptions. For Figure 10b, we
tried a page size of 8 KB, but obtained very large CPU and IO times (in
contrast to those reported in the paper). The authors suggested that this
can be explained by an internal setting of the BGC algorithm (12 bits were
used instead of 10), but this was not pursued further. Also, the experiment
of Figure 11b could not be run for page sizes of 8 KB or 4 KB, with either
the original or the newly provided code.