Sampling for big data: a tutorial

Published: 24 August 2014
  • Abstract

    One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault-tolerant storage architectures and parallel and graphical computation models such as MapReduce, Pregel, and Giraph. However, not all environments can support this scale of resources, and not all queries need an exact response. This motivates the use of sampling to generate summary datasets that support rapid queries and prolong the useful life of the data in storage. To be effective, sampling must mediate the tensions between resource constraints, data characteristics, and the required query accuracy. The state of the art in sampling goes far beyond simple uniform selection of elements in order to maximize the usefulness of the resulting sample. This tutorial reviews progress in sample design for large datasets, including streaming and graph-structured data. Applications to sampling network traffic and social networks are discussed.
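
As a point of reference for the "simple uniform selection of elements" that the abstract treats as a baseline, the following is a minimal sketch of reservoir sampling, a standard way to keep a fixed-size uniform sample from a stream of unknown length. It is an illustration only and is not taken from the tutorial; the function name `reservoir_sample` and its parameters `k` and `seed` are assumptions chosen for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    The stream is read once and only k items are held in memory,
    so its total length does not need to be known in advance.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)  # hypothetical seed parameter, for repeatable runs
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k/n by drawing a slot
            # uniformly from [0, n) and replacing only if it falls in [0, k).
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 10 record identifiers from a simulated stream of 1,000,000.
    print(reservoir_sample(range(1_000_000), 10, seed=42))
```

Each item ends up in the sample with equal probability, which is the property the more advanced designs surveyed in the tutorial relax when items differ in importance to the queries being answered.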

    Supplementary Material

    Part 1 of 3 (p1975-sidebyside1.mp4)
    Part 2 of 3 (p1975-sidebyside2.mp4)
    Part 3 of 3 (p1975-sidebyside3.mp4)

      Published In

      KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2014
      2028 pages
      ISBN: 9781450329569
      DOI: 10.1145/2623330
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tag

      1. random sampling

      Qualifiers

      • Tutorial

      Funding Sources

      • Yahoo! Research

      Conference

      KDD '14

      Acceptance Rates

      KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Article Metrics

      • Downloads (Last 12 months): 19
      • Downloads (Last 6 weeks): 1

      Cited By

      • (2023) Survey of Distributed Computing Frameworks for Supporting Big Data Analysis. Big Data Mining and Analytics 6:2 (154-169). DOI: 10.26599/BDMA.2022.9020014. Online publication date: Jun-2023
      • (2023) Context-aware Big Data Quality Assessment: A Scoping Review. Journal of Data and Information Quality 15:3 (1-33). DOI: 10.1145/3603707. Online publication date: 22-Aug-2023
      • (2023) BIGQA: Declarative Big Data Quality Assessment. Journal of Data and Information Quality 15:3 (1-30). DOI: 10.1145/3603706. Online publication date: 22-Aug-2023
      • (2023) Design and Analysis of a Digital Media Course Case Study in an Art University. 2023 13th International Conference on Information Technology in Medicine and Education (ITME) (360-369). DOI: 10.1109/ITME60234.2023.00079. Online publication date: 24-Nov-2023
      • (2022) A comprehensive study of data intelligence in the context of big data analytics. Web Intelligence 20:1 (53-66). DOI: 10.3233/WEB-210480. Online publication date: 17-May-2022
      • (2022) Fast Error-Bounded Distance Distribution Computation. IEEE Transactions on Knowledge and Data Engineering 34:11 (5364-5377). DOI: 10.1109/TKDE.2021.3058241. Online publication date: 1-Nov-2022
      • (2022) Case Design and Analysis of Digital Media Courses Incorporating Aesthetic Education. 2022 12th International Conference on Information Technology in Medicine and Education (ITME) (756-760). DOI: 10.1109/ITME56794.2022.00158. Online publication date: Nov-2022
      • (2022) A Comprehensive Review on Data Partitioning and Sampling Techniques for Processing Big Data. 2022 International Conference on Power, Energy, Control and Transmission Systems (ICPECTS) (1-6). DOI: 10.1109/ICPECTS56089.2022.10047766. Online publication date: 8-Dec-2022
      • (2021) Big data quality framework: a holistic approach to continuous quality management. Journal of Big Data 8:1. DOI: 10.1186/s40537-021-00468-0. Online publication date: 29-May-2021
      • (2020) Estimating the number of connected components in a graph via subgraph sampling. Bernoulli 26:3. DOI: 10.3150/19-BEJ1147. Online publication date: 1-Aug-2020
