research-article
Open access

Subspace Exploration: Bounds on Projected Frequency Estimation

Published: 20 June 2021

Abstract

    Given an $n \times d$ dataset $A$, a projection query specifies a subset $C \subseteq [d]$ of columns, which yields a new $n \times |C|$ array. We study the space complexity of computing data analysis functions over such subspaces, including heavy hitters and norms, when the subspaces are revealed only after observing the data. We show that this important class of problems is typically hard: for many problems, we show $2^{\Omega(d)}$ lower bounds. However, we present upper bounds which demonstrate space dependency better than $2^d$. That is, for $c, c' \in (0,1)$ and a parameter $N = 2^d$, an $N^c$-approximation can be obtained in space $\min(N^{c'}, n)$, showing that it is possible to improve on the naïve approach of keeping information for all $2^d$ subsets of $d$ columns. Our results are based on careful constructions of instances using coding theory and novel combinatorial reductions that exhibit such space-approximation tradeoffs.
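
To make the setting concrete, the following is a minimal Python sketch of a projection query and of the naïve baseline the abstract mentions, which stores an exact summary for every one of the $2^d$ column subsets. It is an illustration only, not the paper's algorithm: the function names (`project`, `distinct_count`, `heavy_hitters`, `naive_summary`), the toy dataset `A`, and the heavy-hitter threshold parameter `phi` are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def project(rows, cols):
    """Restrict every row to the column indices in `cols` (0-based)."""
    cols = sorted(cols)
    return [tuple(row[j] for j in cols) for row in rows]


def projected_frequencies(rows, cols):
    """Exact frequency of each distinct projected row."""
    return Counter(project(rows, cols))


def distinct_count(rows, cols):
    """Number of distinct rows after projecting onto `cols`."""
    return len(projected_frequencies(rows, cols))


def heavy_hitters(rows, cols, phi):
    """Projected rows occurring in at least a `phi` fraction of all rows."""
    freq = projected_frequencies(rows, cols)
    threshold = phi * len(rows)
    return {pattern: count for pattern, count in freq.items() if count >= threshold}


def naive_summary(rows):
    """Naive baseline: precompute frequencies for all 2^d column subsets."""
    d = len(rows[0])
    return {
        cols: projected_frequencies(rows, cols)
        for k in range(d + 1)
        for cols in combinations(range(d), k)
    }


# Hypothetical toy dataset with n = 4 rows and d = 3 binary columns.
A = [
    (0, 1, 0),
    (0, 1, 1),
    (1, 1, 0),
    (0, 0, 0),
]

print(distinct_count(A, [0, 1]))      # distinct rows on columns {0, 1}
print(heavy_hitters(A, [0, 1], 0.5))  # patterns covering at least half the rows
print(len(naive_summary(A)))          # one entry per column subset
```

The table built by `naive_summary` has one entry per column subset, so its size grows as $2^d$. That exponential cost is what the paper's upper bounds aim to beat, and the lower bounds indicate it cannot be avoided entirely for many problems.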


    Published In

    PODS'21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
    June 2021
    440 pages
    ISBN:9781450383813
    DOI:10.1145/3452021

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. distinct elements
    2. frequency moments
    3. projection queries

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '21

    Acceptance Rates

    Overall Acceptance Rate 642 of 2,707 submissions, 24%
