A unifying framework for ℓ0-sampling algorithms

Distributed and Parallel Databases

Abstract

The problem of building an ℓ0-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an ℓ0-sampler, based on sampling, recovery and selection. We analyze the implementation of an ℓ0-sampler within this framework, and show how prior constructions of ℓ0-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of ℓ0-samplers.
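The three steps named in the abstract (sampling, recovery, selection) can be illustrated with a minimal Python sketch. It is not the construction analyzed in the paper: the class names, the BLAKE2-based hash, the number of levels, and the use of 1-sparse recovery (sum, index-weighted sum, and a polynomial fingerprint) are all assumptions chosen for brevity, and this simplified variant succeeds only with constant probability.

```python
import hashlib

P = (1 << 61) - 1  # prime modulus for fingerprints (an assumed choice)


class OneSparse:
    """Recovery step: exact when the tracked sub-vector has one non-zero entry."""

    def __init__(self, z):
        self.z = z
        self.weight = 0       # sum of counts
        self.index_sum = 0    # sum of index * count
        self.fingerprint = 0  # sum of count * z**index, mod P

    def update(self, i, delta):
        self.weight += delta
        self.index_sum += i * delta
        self.fingerprint = (self.fingerprint + delta * pow(self.z, i, P)) % P

    def recover(self):
        if self.weight == 0 or self.index_sum % self.weight != 0:
            return None
        i = self.index_sum // self.weight
        if i < 0 or (self.weight * pow(self.z, i, P)) % P != self.fingerprint:
            return None  # fingerprint mismatch: not 1-sparse
        return i, self.weight


class L0Sampler:
    """Hypothetical simplified sampler over non-negative integer items."""

    def __init__(self, levels=32, seed=0):
        self.levels = levels
        self.seed = seed
        z = 1 + self._hash("z") % (P - 1)
        self.sketches = [OneSparse(z) for _ in range(levels)]

    def _hash(self, key):
        data = f"{self.seed}:{key}".encode()
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _depth(self, i):
        # Sampling step: item i survives to level j with probability about 2**-j.
        h, depth = self._hash(i), 0
        while depth < self.levels - 1 and (h >> depth) & 1 == 0:
            depth += 1
        return depth

    def update(self, i, delta):
        for level in range(self._depth(i) + 1):
            self.sketches[level].update(i, delta)

    def sample(self):
        # Selection step: report the first level whose recovery succeeds.
        for sketch in self.sketches:
            result = sketch.recover()
            if result is not None:
                return result
        return None  # empty vector, or no level happened to be 1-sparse


sampler = L0Sampler(levels=16, seed=7)
for item in range(1, 101):
    sampler.update(item, +1)
for item in range(1, 101):
    if item != 42:
        sampler.update(item, -1)  # deletions cancel earlier insertions
print(sampler.sample())
```

Because every counter is a sum over updates, deletions cancel insertions exactly, which is what lets the structure follow a dynamic multiset. A practical sampler would replace the 1-sparse structure with s-sparse recovery and use hash functions with provable independence guarantees.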

Notes

  1. More generally, we also seek solutions so that, given sketches of vectors a and b, we can form a sketch of (a+b) and sample from the ℓ0-distribution on (a+b). All the algorithms that we discuss have this property.

  2. We note that tighter bounds are possible via a similar construction and a more involved analysis: adapting the approach of [11] improves the log term from $\log(s/\delta_r)$ to $\log(1/\delta_r)$, and the analysis of [26] further improves it to $\log_s(1/\delta_r)$.

  3. Jowhari et al. [18] first present their algorithm assuming a random oracle, and then they remove this assumption through the use of the pseudo-random generator of Nisan [23].

  4. This level is ⌈log(2N/k)⌉ for the ℓ0-sampler with k-wise independence, and ⌈log N/ϵ⌉ for the variant with pairwise independence.
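The property in Note 1 holds for any sketch that is a linear function of the input vector. As an illustration only, the three-counter sketch below (weight, index-weighted sum, and polynomial fingerprint) is an assumed stand-in rather than one of the structures from the paper; the names `sketch`, `merge`, `add_vectors`, and the constants `P` and `Z` are hypothetical.

```python
P = (1 << 61) - 1  # prime modulus (assumed)
Z = 123456789      # evaluation point; sketches must share it to be mergeable


def sketch(vector):
    """Linear sketch of a sparse vector given as {index: count}."""
    weight = sum(vector.values())
    index_sum = sum(i * c for i, c in vector.items())
    fingerprint = sum(c * pow(Z, i, P) for i, c in vector.items()) % P
    return (weight, index_sum, fingerprint)


def merge(sa, sb):
    """Sketch of (a + b), computed from the sketches of a and b alone."""
    return (sa[0] + sb[0], sa[1] + sb[1], (sa[2] + sb[2]) % P)


def add_vectors(a, b):
    """Entry-wise sum of two sparse vectors, dropping zero entries."""
    out = dict(a)
    for i, c in b.items():
        out[i] = out.get(i, 0) + c
    return {i: c for i, c in out.items() if c != 0}


a = {3: 2, 7: 1}
b = {7: -1, 9: 4}
assert merge(sketch(a), sketch(b)) == sketch(add_vectors(a, b))
```

Each component is a sum over entries, so adding two sketches component-wise gives the same result as sketching the summed vector, including when entries cancel as index 7 does here.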

References

  1. Achlioptas, D.: Database-friendly random projections. In: ACM Principles of Database Systems, pp. 274–281 (2001)

  2. Ahn, K.J., Guha, S., McGregor, A.: Analyzing graph structure via linear measurements. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 459–467 (2012)

  3. Barkay, N., Porat, E., Shalem, B.: Feasible Sampling of Non-strict Turnstile Data Streams (2012). arXiv:1209.5566

  4. Beyer, K., Gemulla, R., Haas, P.J., Reinwald, B., Sismanis, Y.: Distinct-value synopses for multiset operations. Commun. ACM 52(10), 87–95 (2009)

  5. Cormode, G., Firmani, D.: On unifying the space of ℓ0 sampling algorithms. In: Meeting on Algorithm Engineering & Experiments, pp. 163–172 (2013)

  6. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: International Conference on Very Large Data Bases, pp. 3–20 (2008)

  7. Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: ACM SIGMOD International Conference on Management of Data, pp. 35–46 (2004)

  8. Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: International Conference on Very Large Data Bases, pp. 25–36 (2005)

  9. Cormode, G., Garofalakis, M., Haas, P., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets and Sketches. Now Publishers, Hanover (2012)

  10. Dasgupta, S., Gupta, A.: An Elementary Proof of the Johnson–Lindenstrauss Lemma. International Computer Science Institute, Berkeley (1999). Tech. Rep. TR-99-006

  11. Eppstein, D., Goodrich, M.T.: Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters. In: Workshop on Algorithms and Data Structures, pp. 637–648 (2007)

  12. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Symposium on Computational Geometry, pp. 142–149 (2005)

  13. Ganguly, S.: Counting distinct items over update streams. Theor. Comput. Sci. 378(3), 211–222 (2007)

  14. Gilbert, A.C., Strauss, M.J., Tropp, J.A., Vershynin, R.: One sketch for all: fast algorithms for compressed sensing. In: ACM Symposium on Theory of Computing, pp. 237–246 (2007)

  15. Indyk, P.: A small approximately min-wise independent family of hash functions. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 454–456 (1999)

  16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM Symposium on Theory of Computing, pp. 604–613 (1998)

  17. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mapping into Hilbert space. Contemp. Math. 26, 189–206 (1984)

  18. Jowhari, H., Sağlam, M., Tardos, G.: Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In: ACM Principles of Database Systems, pp. 49–58 (2011)

  19. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: ACM Principles of Database Systems, pp. 41–52 (2010)

  20. Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)

  21. Metwally, A., Agrawal, D., El Abbadi, A.: Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In: EDBT, pp. 618–629 (2008)

  22. Monemizadeh, M., Woodruff, D.P.: 1-pass relative-error Lp-sampling with applications. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1143–1160 (2010)

  23. Nisan, N.: Pseudorandom generators for space-bounded computations. In: ACM Symposium on Theory of Computing, pp. 204–212 (1990)

  24. Patrascu, M., Thorup, M.: The power of simple tabulation hashing. In: ACM Symposium on Theory of Computing, pp. 1–10 (2011)

  25. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with Sawzall. Sci. Program. 13(4), 277–298 (2005)

  26. Price, E.: Efficient sketches for the set query problem. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 41–56 (2011)

  27. Schmidt, J.P., Siegel, A., Srinivasan, A.: Chernoff–Hoeffding bounds for applications with limited independence. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 331–340 (1993)

Author information

Corresponding author

Correspondence to Graham Cormode.

Additional information

Communicated by: Feifei Li and Suman Nath.

This paper is an extended version of [5].

About this article

Cite this article

Cormode, G., Firmani, D. A unifying framework for ℓ0-sampling algorithms. Distrib Parallel Databases 32, 315–335 (2014). https://doi.org/10.1007/s10619-013-7131-9

  • DOI: https://doi.org/10.1007/s10619-013-7131-9
