Hashing plays a fundamental role in many probabilistic data structures, from basic hash tables to more sophisticated schemes such as Bloom filters. In these applications, a \emph{hash function} $h$ maps a \emph{universe} of possible values, typically large, to a set of \emph{buckets}, typically much smaller. Hash-based data structures satisfy a variety of probabilistic guarantees. For instance, we may be interested in the \emph{false positive rate}: what is the probability that a data structure mistakenly identifies an element as being stored in a collection, when it was never stored before? We may also be interested in \emph{load properties}, such as: what is the chance that some bucket in the data structure overflows? A typical way to prove these properties is to treat random hash functions as a \emph{balls-into-bins} process: randomly throw $N$ balls into $B$ bins one at a time, where each bin is chosen from some known distribution, and report the number of balls in each bin at the end of the process. For example, hashing $N$ unique elements into $B$ bins can be modeled as throwing $N$ balls into $B$ bins, where each bin is drawn uniformly at random.

While this modeling is convenient, one complication is that the counts of the elements in the different buckets are not probabilistically independent: one bin containing many elements makes it more likely that other bins contain few elements. The lack of independence makes it difficult to reason about multiple bins, for instance bounding the number of empty bins. Moreover, many common tools for analyzing probabilistic processes, like concentration bounds, usually require independence. This subtlety has also been a source of problems in pen-and-paper analyses of probabilistic data structures. For instance, the standard analysis of the Bloom filter bounds the number of occupied bins in order to bound the false positive rate. The original version of this analysis~\citep{DBLP:journals/cacm/Bloom70}, and also repeated in many papers, assumes that the bin counts are independent. However, \citet{DBLP:journals/ipl/BoseGKMMMST08} pointed out that this assumption is incorrect, and in fact the claimed upper-bound on the false-positive rate is actually a \emph{lower} bound; proving a correct bound on the false-positive rate required a substantially more complicated argument. Recently, \citet{DBLP:conf/cav/GopinathanS20} mechanized a correct, but complex proof in Coq.

We aim to develop a simpler method to formally reason about hash-based data structures and balls-into-bins processes, drawing on a key concept in probability theory: \emph{negative dependence}.

Wed 19 Jan

Displayed time zone: Eastern Time (US & Canada) change

16:40 - 17:30
Reasoning about Probabilistic Programs and AlgorithmsPOPL at Salon III
Chair(s): Armando Solar-Lezama Massachusetts Institute of Technology
16:40
25m
Research paper
A Separation Logic for Negative DependenceRemote
POPL
Jialu Bao Cornell University, Marco Gaboardi Boston University, Justin Hsu Cornell University, Joseph Tassarotti Boston College
DOI Media Attached
17:05
25m
Research paper
Reasoning about “Reasoning about Reasoning”: Semantics and Contextual Equivalence for Probabilistic Programs with Nested Queries and RecursionRemote
POPL
Yizhou Zhang University of Waterloo, Nada Amin Harvard University
DOI Media Attached