Practical Algorithms for Self Scaling Histograms, or Better than Average Data Collection

Michael Greenwald

Stanford DSG

Practical Algorithms for Self Scaling Histograms, or Better than Average Data Collection

Background

Problem: Accurately Recovering Distribution

Example

Example (continued)

Example (continued)

Solution: Minimize Available Error

Error computation

How Histograms recover the distribution

Histograms and CDF

Interpolating the CDF

Converting Interpolated CDF to buckets

Effect of bucket boundary on Available Error

Algorithm for newDataPoint

Equi-Error buckets

What if we are willing to trade off some accuracy for performance?

Multi-Modal Histograms

Computing Multi-modal parameters

Discrete Distributions

Results

Accuracy as a function of Buckets

Accuracy/Buckets vs. Time

Conclusions

Related work