GENERAL OBSERVATION: the results I obtained locally support the paper's conclusions even more strongly than those reported in the paper (though they were also noisier). In other words, the base case performed worse than reported, while the paper's approach always worked at least as well as advertised.

SETUP: worked as advertised out of the box. Some rough runtime estimates for the different experiments would have been nice, though.

REPEATABILITY:
- Contention in the unmanaged case was significantly higher on my machine. Fig 4.a peaked at 100k elements with 780k clones, and all the peaks in Fig 4.b were about twice as high.
- At small group-by cardinalities (< 100) the local clone method consistently outperformed global clones by 10-50% (Fig 5).
- Measurements seem fairly noisy, especially in Fig 5. The trends are the same, but most of the spikes and dips differed significantly between my results and those in the paper. This is probably a combination of short runtimes (< 500 ms) and few samples per data point (4, I believe). Thread creation/synchronization time is probably the most significant source of variability here, as it reduces overall parallelism (threads start or end late; cf. the TODO in aggregation2.c, and the first sketch under SKETCHES below). Fig 8.a was affected especially strongly.
- In Fig 8.b the performance of "without prime increase" never recovered after the initial performance cliff at x=1000, while the performance of "with prime increase" was unaffected (see the second sketch below for my reading of "prime increase").

WORKABILITY:
- It is somewhat surprising that a paper focusing on scalability includes no graphs that actually demonstrate it. I reran the experiments from Fig 5 (for all three contention-management approaches), fixing the group-by cardinality at 16, 16k, and 16M, and varying the thread count instead; I then modified the existing scripts to also generate the resulting 9 graphs (the sweep's structure is sketched below). Local and global performed remarkably similarly (and well) for group-by cardinalities of 16k and 16M, regardless of the input distribution. For the small cardinality (16), sorted input was a "bad" case which consistently ran 20-30% slower than the other distributions. With no contention management in place the behavior was about what I would expect: zipfian and self-similar never scale past 8 and 32 threads, respectively, even for a cardinality of 16M; sorted and repeated runs do well even for G=16, and the other distributions fall in between.
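SKETCHES:

To make the thread start/end variability concrete, here is a minimal sketch of a start barrier that rules out late-starting threads, assuming a pthreads-based harness like aggregation2.c. All names (worker, run_aggregation, NTHREADS) are my own placeholders, not identifiers from the artifact:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8   /* placeholder thread count */

static pthread_barrier_t start_line;

static void run_aggregation(long id) {
    /* stand-in for the per-thread aggregation work */
    printf("thread %ld running\n", id);
}

static void *worker(void *arg) {
    long id = (long)arg;
    /* every thread blocks here until all NTHREADS exist, so none
     * enters the measured region late */
    pthread_barrier_wait(&start_line);
    run_aggregation(id);
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];
    pthread_barrier_init(&start_line, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    pthread_barrier_destroy(&start_line);
    return 0;
}

Timing would then start after the barrier rather than at thread creation, which should shrink the run-to-run spread noted above.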
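My reading of "prime increase" (not verified against the artifact's source, so treat this as an assumption) is that on overflow the hash table grows to the next prime above roughly double its capacity, rather than to a power of two; prime sizes avoid clustering when hash values share common factors with the table size. A self-contained illustration:

#include <stdbool.h>
#include <stddef.h>

static bool is_prime(size_t n) {
    if (n < 2) return false;
    for (size_t d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* smallest prime >= n; e.g. a table of capacity c would grow to
 * next_prime(2 * c + 1) on overflow (my illustration, not artifact code) */
static size_t next_prime(size_t n) {
    while (!is_prime(n)) n++;
    return n;
}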
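The thread-count sweep behind the 9 extra graphs was structurally just a nested loop over approach x cardinality x thread count. In the sketch below, run_experiment() is a hypothetical stand-in for invoking the actual benchmark and recording its timing output, and the thread-count range is only an example:

#include <stdio.h>

static void run_experiment(const char *approach, long groups, int threads) {
    /* would dispatch the benchmark and collect timings for plotting */
    printf("%s G=%ld T=%d\n", approach, groups, threads);
}

int main(void) {
    const char *approaches[] = { "none", "local", "global" };
    /* 16, 16k, 16M; whether k/M are decimal or binary here is a guess */
    const long cardinalities[] = { 16, 16L << 10, 16L << 20 };
    for (int a = 0; a < 3; a++)
        for (int g = 0; g < 3; g++)
            for (int t = 1; t <= 64; t *= 2)
                run_experiment(approaches[a], cardinalities[g], t);
    return 0;
}

Each (approach, cardinality) pair then yields one graph of throughput over thread count, giving the 9 graphs mentioned above.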