Locality sensitive hashing

Locality Sensitive Hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

Definition
A locality sensitive hashing scheme is defined with respect to a universe of items $$U$$ and a similarity function $$\phi : U \times U \to [0,1]$$, where larger values mean that two items are more alike (so $$\phi$$ measures similarity rather than distance). An LSH scheme is a family of hash functions $$H$$ coupled with a distribution $$D$$ over the functions such that a function $$h \in H$$ chosen according to $$D$$ satisfies the property that $$Pr[h(a) = h(b)] = \phi(a,b)$$ for any $$a,b \in U$$.

Applications
LSH has been applied to several problem domains, including:
 * Near Duplicate Detection
 * Image Similarity Identification
 * Gene Expression Similarity Identification
 * Audio Similarity Identification

Methods
The construction of an LSH scheme depends on the universe $$U$$ and on the nature of $$\phi$$.

Min-Wise Independent Permutations
Suppose $$U$$ is composed of subsets of some ground set of enumerable items $$S$$ and the similarity measure of interest is the Jaccard index $$J$$. If $$\pi$$ is a permutation on the indices of $$S$$, for non-empty $$A \subseteq S$$ let $$h(A) = \min_{a \in A} \{ \pi(a) \}$$. Each possible choice of $$\pi$$ defines a single hash function $$h$$ mapping input sets to integers.

Define the function family $$H$$ to be the set of all such functions and let $$D$$ be the uniform distribution. Given two sets $$A,B \subseteq S$$, the event that $$h(A) = h(B)$$ corresponds exactly to the event that the element of $$A \cup B$$ minimizing $$\pi$$ lies inside $$A \cap B$$. As $$\pi$$ was chosen uniformly at random, every element of $$A \cup B$$ is equally likely to be that minimizer, so $$Pr[h(A) = h(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A,B)\,$$ and $$(H,D)\,$$ define an LSH scheme for the Jaccard index.
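The collision property above can be checked empirically. The following is a minimal sketch, assuming a small ground set so that truly random permutations are affordable; the function names (`jaccard`, `random_minhash`, `estimate_collision_probability`) and the example sets are illustrative choices, not part of any particular library.

```python
import random

def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two non-empty sets."""
    return len(a & b) / len(a | b)

def random_minhash(ground_set, rng):
    """Draw one hash function h by picking a uniformly random permutation pi."""
    items = list(ground_set)
    rng.shuffle(items)
    # rank[x] plays the role of pi(x): the position of x after shuffling.
    rank = {item: position for position, item in enumerate(items)}
    return lambda subset: min(rank[x] for x in subset)

def estimate_collision_probability(a, b, ground_set, trials=20000, seed=0):
    """Fraction of independently drawn hash functions with h(A) == h(B)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = random_minhash(ground_set, rng)
        hits += h(a) == h(b)
    return hits / trials

S = range(20)                      # hypothetical ground set
A = {0, 1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7, 8, 9}

exact = jaccard(A, B)              # |A ∩ B| / |A ∪ B| = 3 / 10
estimate = estimate_collision_probability(A, B, S)
print(exact, estimate)
```

With enough trials the estimated collision rate should settle close to the exact Jaccard index, which is what makes the minimum under a random permutation usable as a compact signature.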

Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. For this reason, there has been significant work on finding smaller families of permutations that are "min-wise independent": families for which, for every subset $$X$$ of the domain, each element of $$X$$ has equal probability of being the minimum of $$X$$ under a randomly chosen $$\pi$$. No polynomially-sized family with this exact property is known; Broder et al. show that exactly min-wise independent families must be exponentially large in n, so practical schemes settle for approximate min-wise independence.

Random Projection
The random projection method of LSH is designed to approximate the cosine distance between vectors. The basic idea of this technique is to choose a random hyperplane (defined by a normal unit vector $$r$$) at the outset and use the hyperplane to hash input vectors.

Given an input vector $$v$$ and a hyperplane defined by $$r$$, we let $$h(v) = \operatorname{sgn}(v \cdot r)$$. That is, $$h(v) = \pm 1$$ depending on which side of the hyperplane $$v$$ lies on.

Each possible choice of $$r$$ defines a single function. Let $$H$$ be the set of all such functions and let $$D$$ be the uniform distribution over unit vectors $$r$$. It is not difficult to prove that, for two vectors $$u,v$$, $$Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}$$, where $$\theta(u,v)$$ is the angle between $$u$$ and $$v$$ (here $$\pi$$ is the constant, not a permutation). The quantity $$1 - \frac{\theta(u,v)}{\pi}$$ is closely related to $$\cos(\theta(u,v))$$: both decrease monotonically as the angle grows from 0 to $$\pi$$.

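In this instance hashing produces only a single bit. The bits of two vectors match with probability $$1 - \frac{\theta(u,v)}{\pi}$$, which falls linearly from 1 to 0 as the angle between them grows from 0 to $$\pi$$; it is a monotone function of the cosine of the angle rather than proportional to it. Concatenating the bits from several independently chosen hyperplanes gives a longer signature from which the angle, and hence the cosine, can be estimated.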
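The single-bit hash and its match probability can be illustrated with a short simulation. This is a sketch under stated assumptions: the normal vector is drawn with independent Gaussian entries (its direction is then uniform on the sphere, and its length does not affect the sign), and the helper names and example vectors are made up for illustration.

```python
import math
import random

def random_hyperplane(dim, rng):
    """Normal vector with i.i.d. Gaussian entries; only its direction matters."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def sign_hash(v, r):
    """h(v) = sgn(v . r), returned as +1 or -1 (a zero dot product maps to +1)."""
    return 1 if sum(x * y for x, y in zip(v, r)) >= 0 else -1

def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

def estimate_match_probability(u, v, trials=20000, seed=0):
    """Fraction of random hyperplanes that place u and v on the same side."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r = random_hyperplane(len(u), rng)
        hits += sign_hash(u, r) == sign_hash(v, r)
    return hits / trials

u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]                           # 45 degrees away from u

expected = 1 - angle(u, v) / math.pi          # 1 - (pi/4)/pi = 0.75
estimate = estimate_match_probability(u, v)
estimated_angle = math.pi * (1 - estimate)    # invert the formula to recover theta
print(expected, estimate, math.cos(estimated_angle))
```

Inverting the match rate gives an estimate of the angle, and taking its cosine gives an estimate of the cosine similarity; the accuracy depends on how many hyperplanes (bits) are used.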

p-Stable Distribution
The hash function $$h_{\mathbf{a},b} (\mathbf{v}) : \mathbb{R}^d \to \mathbb{Z} $$ maps a d-dimensional vector $$\mathbf{v}$$ to an integer. Each hash function in the family is indexed by a choice of random $$\mathbf{a}$$ and $$b$$, where $$\mathbf{a}$$ is a d-dimensional vector with entries chosen independently from a p-stable distribution (for example, the Cauchy distribution for p = 1 or the Gaussian distribution for p = 2) and $$b$$ is a real number chosen uniformly from the range [0,r]. Here r is a fixed positive bucket width, unrelated to the hyperplane normal $$r$$ of the previous section. For a fixed $$\mathbf{a},b$$ the hash function $$h_{\mathbf{a},b}$$ is given by $$h_{\mathbf{a},b} (\mathbf{v}) = \left \lfloor \frac{\mathbf{a}\cdot \mathbf{v}+b}{r} \right \rfloor $$
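The formula above can be sketched directly in code. The example below assumes p = 2, for which the standard Gaussian distribution is 2-stable; the bucket width `r = 4.0`, the test points, and the helper names are arbitrary illustrative choices.

```python
import math
import random

def make_p_stable_hash(dim, r, rng):
    """Draw one h_{a,b} for p = 2: Gaussian entries for a, b uniform on [0, r]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, r)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / r)   # integer bucket index

    return h

def collision_rate(u, v, r, trials=5000, seed=0):
    """Fraction of independently drawn hash functions that put u and v in one bucket."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_p_stable_hash(len(u), r, rng)
        hits += h(u) == h(v)
    return hits / trials

origin = [0.0, 0.0]
near = [0.5, 0.0]    # Euclidean distance 0.5 from origin
far = [8.0, 0.0]     # Euclidean distance 8 from origin

near_rate = collision_rate(origin, near, r=4.0)
far_rate = collision_rate(origin, far, r=4.0)
print(near_rate, far_rate)
```

Points that are close in Euclidean distance should share a bucket noticeably more often than distant ones, which is the locality-sensitive behaviour the scheme is built for. The bucket width trades off the two rates: a larger r raises both, a smaller r lowers both.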

Related Papers

 * Charikar, M.S. Similarity Estimation Techniques From Rounding Algorithms.
 * Broder, A., Charikar, M.S., Frieze, A., Mitzenmacher, M. Min-Wise Independent Permutations.
 * Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. LSH Scheme Based on p-Stable Distributions.