SIGMOD 2010 Repeatability & Workability Evaluation for paper #12

Bed-Tree: An All-Purpose Index Structure for String Similarity Search
Based on Edit Distance
by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava

Hardware & Software environment
===============================

       | Paper                      | Review 1
-------+----------------------------+------------------------------------------------------------
class  | laptop                     | desktop
CPU    | AMD Quad-Core Opteron 8356 | AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ at 1000 MHz
cores  | 4                          | 2
RAM    | 128 GB                     | 2 GB
HDD    | (not specified)            | 250 GB SATA WD Caviar SE 3.5" 7200 rpm ATA300, 8 MB cache
OS     | Red Hat Linux              | Fedora 11, Linux 2.6.30.9-96.fc11.x86_64

Submission
==========

The authors provided
- source code for all of their own methods and for the competitor techniques;
- data generators;
- scripts to re-run all experiments and collect the output of the proposed
  methods in a text file.

Repeatability Evaluation
========================

Process
-------

The given instructions are clear and simple to follow. The code did not
compile initially, but could be fixed easily. The code had some bugs when
dealing with the protein dataset, which led to segfaults; these were
(partially) fixed quickly by the authors. In the original submission, the
bug was avoided by setting, in the benchmark scripts, a different page size
from the one used in the paper.

Detailed Results
----------------

The paper reports on an impressive number of experiments (29 in total).
Each experiment is very labour-intensive (it creates a new index structure
and runs 100 queries) and took roughly 24 hours to run on our machines.
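For context when reading the figure-by-figure results below: a range query
with threshold theta retrieves all strings within edit distance theta of
the query string, and a top-k query retrieves the k closest strings. The
following is a minimal sketch of the standard dynamic-programming edit
distance with a theta cutoff; it is our own illustration (function name and
signature included), not the authors' B^ed-tree code.

    // Illustration only; this is NOT part of the authors' submission.
    // Standard dynamic-programming (Levenshtein) edit distance with a
    // threshold cutoff: once every entry of a DP row exceeds theta, the
    // final distance must also exceed theta, so we can stop early.
    #include <algorithm>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    // Returns the edit distance of a and b, capped at theta + 1.
    int edit_distance(const std::string& a, const std::string& b, int theta) {
        const int n = static_cast<int>(a.size());
        const int m = static_cast<int>(b.size());
        if (std::abs(n - m) > theta) return theta + 1;   // length filter
        std::vector<int> prev(m + 1), cur(m + 1);
        for (int j = 0; j <= m; ++j) prev[j] = j;        // row for the empty prefix
        for (int i = 1; i <= n; ++i) {
            cur[0] = i;
            int row_min = cur[0];
            for (int j = 1; j <= m; ++j) {
                cur[j] = std::min({prev[j] + 1,          // deletion
                                   cur[j - 1] + 1,       // insertion
                                   prev[j - 1] + (a[i - 1] != b[j - 1])});
                row_min = std::min(row_min, cur[j]);
            }
            if (row_min > theta) return theta + 1;       // early termination
            std::swap(prev, cur);
        }
        return std::min(prev[m], theta + 1);
    }

    int main() {
        std::cout << edit_distance("kitten", "sitting", 3) << "\n";  // prints 3
    }

The row-minimum check stops the verification as soon as the distance
provably exceeds theta; this is the standard way threshold queries are
verified once an index has produced candidate strings.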
* Table 9
  We consistently measured visibly better times for BD, while the times for
  BGC and BGL stay close to those reported in the paper.

* Figures 4a,b,c
  Reproducible.

* Figure 4d
  BGC times can be reproduced, and thus the discussion in the paper could be
  verified. BD times can also be reproduced, except for theta=16, where our
  time is roughly 40 times worse (78 seconds vs. 2 seconds). BGL is also
  consistently worse than BD, which contradicts the figure in the paper: it
  needs between 7.5 and 8 seconds for all values of theta. Flamingo is
  better than reported.

* Figures 5a,b
  Reproducible.

* Figure 5c
  The IO times for BD and BGL can be reproduced very accurately, but not
  those for BGC, which are better in the repeated results by about one order
  of magnitude. The authors confirmed that the BGC curve is slightly off on
  the first two points.

* Figure 5d
  The repeated results vary considerably from those reported. This has also
  been confirmed by the authors.

* Figures 6a,b,c
  The trends for all of BD, BGC, and BGL are reproducible (also relative to
  each other).

* Figure 6d
  Overall, the trends and the relative positioning of BD, BGC, and BGL are
  reproducible. However, the repeated results (query IO times divided by
  1,000) are worse than those reported by a factor of 2-3. In contrast, the
  repeated IO times obtained for the other experiments (those not on protein
  data) are very close to those reported in the paper.

* Figures 7a,b,c
  The processing times for BD, BGC, and BGL can be reproduced very
  accurately.

* Figure 7d
  The trends are reproducible; the timings and the relative positioning of
  the three B^ed-tree methods are not: regardless of the value of k (for
  top-k), BD needs about 95 seconds, BGC about 15 seconds, and BGL about 28
  seconds. This differs from the roughly 0.1 seconds reported in the figure
  for all three methods. The relative performance order of the three methods
  in the repeated results is exactly the reverse of the order reported in
  the paper (where all three methods perform almost identically). None of
  the Flamingo experiments worked; they were aborted after running for more
  than 1,000 minutes each.

* Figure 8a
  BD and BGC performed as reported in the paper. BGL performed slightly
  better than BGC in the repeated experiments, whereas in the paper BGL is
  reported to perform worse.

* Figure 8b
  BD and BGC performed as reported in the paper, with the exception that,
  for 128 MB of available memory, we obtained much better IO performance
  (900 vs. 40,000, and 14,000 vs. 38,000). BGL performed better in our
  experiments for 8 MB and 16 MB than reported in the paper (85,000 vs.
  100,000).

* Figure 8c
  BD performed as reported in the paper; BGC performed consistently worse in
  our experiments (an extra IO time of about 10,000, except for 128 MB,
  where the times are the same); BGL performed as reported in the paper.

* Figure 8d
  The trends and the relative positioning of BD, BGC, and BGL are
  reproducible. However, the repeated results (query IO times divided by
  1,000) are worse than those reported in the paper by a factor of 2-3: 120
  vs. 60 for BD, 35 vs. 12 for BGC, and 100 vs. 38 for BGL.

* Figures 9a,b,c,d
  Repeatable.

* Figure 10a
  Repeatable.

* Figure 10b
  The trend is reproducible. The repeated results are, however, nearly three
  orders of magnitude worse (for k = 1, 2, 4, 8, 16): repeated: 149.3,
  246.3, 284.0, 328.3, 362.3 seconds; reported: 0.2, 0.35, 0.41, 0.51, 0.6
  seconds.

* Figure 11a
  Repeatable for BGC.

* Figure 11b
  This experiment could not be run for page sizes of 8 KB or 4 KB, even with
  the newly provided code.

Summary
-------

Overall, most of the experiments are reproducible. We consistently found
problems with the reported experiments on the protein data set. Some other
experiments have repeatability problems, but these do not seriously affect
the claims made in the paper about the performance of the three proposed
methods. The results for the competitors Flamingo and Mismatch are not
reproducible.

Workability Evaluation
======================

Workability was evaluated by varying the number of queries used in the
experiments and the page size (4 KB, 8 KB, and 16 KB). We found that
workability is mostly achieved, with a few exceptions. For Figure 10b, we
tried a page size of 8 KB, but obtained very large CPU and IO times (in
contrast to those reported in the paper). The authors suggested that this
can be explained by an internal setting of the BGC algorithm (12 bits were
used instead of 10), but this was not pursued further. Also, the experiment
of Figure 11b could not be run for page sizes of 8 KB or 4 KB, with either
the original or the newly provided code.