Conference Papers for summer 2019

The following papers have been accepted for publication in 2019:

Independent sets in vertex-arrival streams (G. Cormode, J. Dark, and C. Konrad). In International Colloquium on Automata, Languages and Programming (ICALP), 2019.

We consider the maximal and maximum independent set problems in three models of graph streams:
In the edge model we see a stream of edges which collectively define a graph; this model is well-studied for a variety of problems. We show that the space complexity for a one-pass streaming algorithm to find a maximal independent set is quadratic (i.e. we must store all edges). We further show that it is not much easier if we only require approximate maximality. This contrasts strongly with the other two vertex-based models, where one can greedily find an exact solution in only the space needed to store the independent set.
In the “explicit” vertex model, the input stream is a sequence of vertices making up the graph. Every vertex arrives along with its incident edges that connect to previously arrived vertices. Various graph problems require substantially less space to solve in this setting than in edge-arrival streams. We show that every one-pass c-approximation streaming algorithm for maximum independent set (MIS) on explicit vertex streams requires Ω((n2)/(c6)) bits of space, where n is the number of vertices of the input graph. It is already known that Θ~((n2)/(c2)) bits of space are necessary and sufficient in the edge arrival model (Halldórsson et al. 2012), thus the MIS problem is not significantly easier to solve under the explicit vertex arrival order assumption. Our result is proved via a reduction from a new multi-party communication problem closely related to pointer jumping.
In the “implicit” vertex model, the input stream consists of a sequence of objects, one per vertex. The algorithm is equipped with a function that maps pairs of objects to the presence or absence of edges, thus defining the graph. This model captures, for example, geometric intersection graphs such as unit disc graphs. Our final set of results consists of several improved upper and lower bounds for interval and square intersection graphs, in both explicit and implicit streams. In particular, we show a gap between the hardness of the explicit and implicit vertex models for interval graphs.

Answering range queries under local differential privacy (G. Cormode, T. Kulkarni, and D. Srivastava). In International Conference on Very Large Data Bases (VLDB), 2019

Counting the fraction of a population having an input within a specified interval i.e. a range query, is a fundamental data analysis primitive. Range queries can also be used to compute other core statistics such as quantiles, and to build prediction models. However, frequently the data is subject to privacy concerns when it is drawn from individuals, and relates for example to their financial, health, religious or political status. In this paper, we introduce and analyze methods to support range queries under the local variant of differential privacy, an emerging standard for privacy-preserving data analysis.

The local model requires that each user releases a noisy view of her private data under a privacy guarantee. While many works address the problem of range queries in the trusted aggregator setting, this problem has not been addressed specifically under untrusted aggregation (local DP) model even though many primitives have been developed recently for estimating a discrete distribution. We describe and analyze two classes of approaches for range queries, based on hierarchical histograms and the Haar wavelet transform. We show that both have strong theoretical accuracy guarantees on variance. In practice, both methods are fast and require minimal computation and communication resources. Our experiments show that the wavelet approach is most accurate in high privacy settings, while the hierarchical approach dominates for weaker privacy requirements.

Streaming algorithms for bin packing and vector scheduling (G. Cormode and P. Veselý). In Workshop on Approximation and Online Algorithms, 2019.

Problems involving the efficient arrangement of simple objects, as captured by bin packing and makespan scheduling, are fundamental tasks in combinatorial optimization. These are well understood in the traditional online and offline cases, but have been less well-studied when the volume of the input is truly massive, and cannot even be read into memory. This is captured by the streaming model of computation, where the aim is to approximate the cost of the solution in one pass over the data, using small space. As a result, streaming algorithms produce concise input summaries that approximately preserve the optimum value.

We design the first efficient streaming algorithms for these fundamental problems in combinatorial optimization. For Bin Packing, we provide a streaming asymptotic 1+-approximation with O~(1/ε) memory, where O~ hides logarithmic factors. Moreover, such a space bound is essentially optimal. Our algorithm implies a streaming d+ε-approximation for Vector Bin Packing in d dimensions, running in space O~((d)/(ε)). For the related Vector Scheduling problem, we show how to construct an input summary in space O~(d2·m / ε2) that preserves the optimum value up to a factor of 2 – (1)/(m) +ε, where m is the number of identical machines.

 Efficient interactive proofs for linear algebra (G. Cormode and C. Hickey). In Proceedings of International Symposium on Algorithms and Computation (ISAAC), 2019.

Motivated by the growth in outsourced data analysis, we describe methods for verifying basic linear algebra operations performed by a cloud service without having to recalculate the entire result. We provide novel protocols in the streaming setting for inner product, matrix multiplication and vector-matrix-vector multiplication where the number of rounds of interaction can be adjusted to tradeoff space, communication, and duration of the protocol. Previous work suggests that the costs of these interactive protocols are optimized by choosing O(logn) rounds. However, we argue that we can reduce the number of rounds without incurring a significant time penalty by considering the total end-to-end time, so fewer rounds and larger messages are preferable. We confirm this claim with an experimental study that shows that a constant number of rounds gives the fastest protocol.

Towards a theory of parameterized streaming algorithms (R. Chitnis and G. Cormode). In International Symposium on Parameterized and Exact Computation, 2019.

Parameterized complexity attempts to give a more fine-grained analysis of the complexity of problems: instead of measuring the running time as a function of only the input size, we analyze the running time with respect to additional parameters. This approach has proven to be highly successful in delineating our understanding of NP-hard problems. Given this success with the TIME resource, it seems but natural to use this approach for dealing with the SPACE resource. First attempts in this direction have considered a few individual problems, with some success: Fafianie and Kratsch [MFCS’14] and Chitnis et al. [SODA’15] introduced the notions of streaming kernels and parameterized streaming algorithms respectively. For example, the latter shows how to refine the Ω(n2) bit lower bound for finding a minimum Vertex Cover (VC) in the streaming setting by designing an algorithm for the parameterized k-VC problem which uses O(k2logn) bits. In this paper, we initiate a systematic study of graph problems from the paradigm of parameterized streaming algorithms. We first define a natural hierarchy of space complexity classes of FPS, SubPS, SemiPS, SupPS and BPS, and then obtain tight classifications for several well-studied graph problems such as Longest Path, Feedback Vertex Set, Dominating Set, Girth, Treewidth, etc. into this hierarchy. On the algorithmic side, our parameterized streaming algorithms use techniques from the FPT world such as bidimensionality, iterative compression and bounded-depth search trees. On the hardness side, we obtain lower bounds for the parameterized streaming complexity of various problems via novel reductions from problems in communication complexity. We also show a general (unconditional) lower bound for space complexity of parameterized streaming algorithms for a large class of problems inspired by the recently developed frameworks for showing (conditional) kernelization lower bounds. Parameterized algorithms and streaming algorithms are approaches to cope with TIME and SPACE intractability respectively. It is our hope that this work on parameterized streaming algorithms leads to two-way flow of ideas between these two previously separated areas of theoretical computer science.

 

Keynote talk at PODC

Invited keynote talk at Principles of Distributed Computation (PODC) 2018 in July 2018 on “Data summarization and distributed computation

The notion of summarization is to provide a compact representation of data which approximately captures its essential characteristics. If such summaries can be created, they can lead to efficient distributed algorithms which exchange summaries in order to compute a desired function. In this talk, I’ll describe recent efforts in this direction for problems inspired by machine learning: building graphical models over evolving, distributed training examples, and solving robust regression problems over large, distributed data sets.

Workshop on Algorithms for data summarization, March 2018

There is a workshop on algorithms for data summarization (streaming, sampling, sketching, property testing, sublinear algorithms and loosely related topics beyond) to be held at the University of Warwick, UK during March 2018 (19th – 22nd), organized by Graham Cormode and Artur Czumaj. The format of the workshop will be talks from experts, with plenty of opportunities for discussion and collaboration. Funding supporting the workshop is from European Research Council and UK EPSRC.

For more information, please contact the organizers.

Conference and Journal publications

A number of conference and journal papers have been accepted for publication over the winter break.  These include:

G. Cormode and J. Dark. Fast sketch-based recovery of correlation outliers. In International Conference on Database Theory, 2018.

Many data sources can be interpreted as time-series, and a key problem is to identify which pairs out of a large collection of signals are highly correlated. We expect that there will be few, large, interesting correlations, while most signal pairs do not have any strong correlation. We abstract this as the problem of identifying the highly correlated pairs in a collection of n mostly pairwise uncorrelated random variables, where observations of the variables arrives as a stream. Dimensionality reduction can remove dependence on the number of observations, but further techniques are required to tame the quadratic (in n) cost of a search through all possible pairs. We develop a new algorithm for rapidly finding large correlations based on sketch techniques with an added twist: we quickly generate sketches of random combinations of signals, and use these in concert with ideas from coding theory to decode the identity of correlated pairs. We prove correctness and compare performance and effectiveness with the best LSH (locality sensitive hashing) based approach.

G. Cormode and C. Hickey. Cheap checking for cloud computing: Statistical analysis via annotated data streams. In AISTATS, 2018.

As the popularity of outsourced computation increases, questions of accuracy and trust between the client and the cloud computing services become ever more relevant. Our work aims to provide fast and practical methods to verify analysis of large data sets, where the client’s computation and memory and costs are kept to a minimum. Our verification protocols are based on defining “proofs” which are easy to create and check. These add only a small overhead to reporting the result of the computation itself. We build up a series of protocols for elementary statistical methods, to create more complex protocols for Ordinary Least Squares, Principal Component Analysis and Linear Discriminant Analysis. We show that these are very efficient in practice.

G. Cormode, T. Kulkarni, and D. Srivastava. Constrained differential privacy for count data. In International Conference on Data Engineering (ICDE), 2018.

Concern about how to aggregate sensitive user data without compromising individual privacy is a major barrier to greater availability of data. The model of differential privacy has emerged as an accepted model to release sensitive information while giving a statistical guarantee for privacy. Many different algorithms are possible to address different target functions. We focus on the core problem of count queries, and seek to design mechanisms to release data associated with a group of n individuals. Prior work has focused on designing mechanisms by raw optimization of a loss function, without regard to the consequences on the results. This can leads to mechanisms with undesirable properties, such as never reporting some outputs (gaps), and overreporting others (spikes). We tame these pathological behaviors by introducing a set of desirable properties that mechanisms can obey. Any combination of these can be satisfied by solving a linear program (LP) which minimizes a cost function, with constraints enforcing the properties. We focus on a particular cost function, and provide explicit constructions that are optimal for certain combinations of properties, and show a closed form for their cost. In the end, there are only a handful of distinct optimal mechanisms to choose between: one is the well-known (truncated) geometric mechanism; the second a novel mechanism that we introduce here, and the remainder are found as the solution to particular LPs. These all avoid the bad behaviors we identify. We demonstrate in a set of experiments on real and synthetic data which is preferable in practice, for different combinations of data distributions, constraints, and privacy parameters.

G. Cormode, A. Dasgupta, A. Goyal, and C. H. Lee. An evaluation of multi-probe locality sensitive hashing for computing similarities over web-scale query logs. PLOS ONE, 2018.

Many modern applications of AI such as web search, mobile browsing, image processing, and natural language processing rely on finding similar items from a large database of complex objects. Due to the very large scale of data involved (e.g., users’ queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (a.k.a, LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop). We identify several optimizations which improve performance, suitable for deployment in very large scale settings. The experimental results demonstrate our variants of LSH achieve the robust performance with better recall compared with “vanilla” LSH, even when using the same amount of space.

The three conference presentations will take place over the coming months.

Streaming algorithms for matching size estimation in sparse graphs at ESA 2017

The paper “Streaming algorithms for matching size estimation in sparse graphs” by G. Cormode, H. Jowhari, M. Monemizadeh, and S. Muthukrishnan has been selected for publication in ESA 2017, in Vienna in September.

The abstract is as follows:

Estimating the size of the maximum matching is a canonical problem in graph analysis, and one that has attracted extensive study over a range of different computational models. We present improved streaming algorithms for approximating the size of maximum matching with sparse (bounded arboricity) graphs.

(Insert-Only Streams) We present a one-pass algorithm that takes O(clogn) space and approximates the size of the maximum matching in graphs with arboricity c within a factor of O(c). This improves significantly upon the state-of-the-art O~(c n2/3)-space streaming algorithms, and is the first poly-logarithmic space algorithm for this problem.

(Dynamic Streams) Given a dynamic graph stream (i.e., inserts and deletes) of edges of an underlying c-bounded arboricity graph, we present an one-pass algorithm that uses space O~(c10/3n2/3) and returns an O(c)-estimator for the size of the maximum matching on the condition that the number edge deletions in the stream is bounded by O(c n). For this class of inputs, our algorithm improves the state-of-the-art O~(c n4/5)-space algorithms, where the O~(.) notation hides logarithmic in n dependencies.

In contrast to prior work, our results take more advantage of the streaming access to the input and characterize the matching size based on the ordering of the edges in the stream in addition to the degree distributions and structural properties of the sparse graphs.

Keynote in Symposium on Experimental Algorithms

Engineering streaming algorithms, June 2017.
Invited talk at Symposium on Experimental Algorithms.

Streaming algorithms must process a large quantity of small updates quickly to allow queries about the input to be answered from a small summary. Initial work on streaming algorithms laid out theoretical results, and subsequent efforts have involved engineering these for practical use. Informed by experiments, streaming algorithms have been widely implemented and used in practice. This talk will survey this line of work, and identify some lessons learned.

Adams Prize 2017

Professor Graham Cormode has been awarded the 2017 Adams Prize by the Cambridge Faculty of Mathematics. The award recognizes his work on “Statistical Analysis of Big Data”, and is awarded jointly with Professor Richard Samworth of Cambridge. Professor Cormode says,

My work, in common with Prof Samworth’s, is about finding mathematical representations of data that allow useful information to be extracted effectively and accurately. These techniques allow ever larger quantities of data to be handled on ordinary computers.

Professor Cormode’s work on “data sketches” has been used in companies such as Netflix, Yahoo, Twitter, Google, AT&T and Sprint. He is currently leading Warwick’s involvement in the Alan Turing Institute at London, and working on questions to do with verification of machine learning, and privacy.

The prize is worth £15,000 and will be split equally between the two recipients.

 

ERC Consolidator Grant

Graham Cormode has been awarded a prestigious ERC consolidator grant worth 1.5M euro, to support his research.  The grant is for a project entitled “Small Summaries for Big Data”. The project focuses on the area of the design and analysis of compact summaries: data structures which capture key features of the data, and which can be created effectively over distributed data sets. The project will substantially advance the state of the art in data summarization, to the point where accurate and effective summaries are available for a wide array of problems, and can be used seamlessly in applications that process big data.

Microsoft EMEA Scholarship in Algorithms for Massive Data Analysis (closed)

This position has been filled.

A Microsoft Research scholarship place is available to study algorithms for massive data analysis, leading to a PhD in Computer Science. Increasingly, we are faced with larger and larger volumes of data from which to extract insights and intelligence. Of particular interest is data that can be represented as a graph or (adjacency) matrix. A promising approach is to look for ways to “sketch” such structures: to build a representation that is much more compact than the input, but which allows some function of interest on the original data to be approximated accurately using the sketch. Such sketches are well-known and widely used for data that can be represented as a vector (such as to identify the most frequent elements, or to count the number of distinct items). The goal of this scholarship project is to develop new algorithms for sketching of massive graphs and matrices, and to demonstrate their usefulness via theoretical analysis and empirical evaluation. The hope is to advance our knowledge of the theory in this area, and design algorithms which can be used in practice, such as for querying data represented as a (massive) graph, clustering/partitioning graph structured data, and optimization problems over large graphs.

The scholarship will support tuition fees and stipend to study at University of Warwick, under the guidance of Professor Graham Cormode and Dr. Milan Vojnovic of Microsoft Research. The Microsoft PhD Scholarship Programme recognises and supports exceptional students who show the potential to make an outstanding contribution to science and computing. Each Microsoft scholarship consists of an annual bursary for up to a maximum of three years.

During the course of their PhD, Scholars are invited to Microsoft Research in Cambridge for an annual Summer School that includes a series of talks of academic interest and posters sessions, which provides the Scholars the opportunity to present their work to Microsoft researchers and a number of Cambridge academics. There is also the possibility of internships at Microsoft Research. Applicants require a first-class Honours degree or equivalent in Computer Science, Mathematics or Computer Engineering, experience in programming and aptitude for mathematical subjects. Knowledge of algorithms, linear algebra, graph theory and probability is essential. A Masters degree is desirable. Before the Scholarship can be awarded the candidate must also undergo the formal admission procedure to the university of Warwick, and approval from Microsoft Research. The scholarship covers fees for students from European Union countries. In exceptional cases, it may be possible to support students from outside the EU.

To apply, please contact Graham Cormode or Milan Vojnovic directly with a CV and description of your experience relevant to this project. Please apply by January 31 2015 for full consideration. Further details and suggested reading is available from Prof. Graham Cormode (G.Cormode@Warwick.ac.uk).

Summer School on Hashing and Applications

Hashing is used everywhere in computing and is getting increasingly important with the exploding amount of data. A summer school at the University of Copenhagen provided an in-depth introduction to hashing, both theory and applications. The topics ranged from modern theory of hashing, to actual implementations of hash functions that are both efficient and provide the necessary probabilistic guarantees. Application areas studied, included sketching and data bases, similarity estimation, and machine learning.

Slides and video recordings from the summer school are available now.  Neustar has a nice write up of the event.