Nearest neighbor search

Nearest neighbor search (NNS), also known as proximity search, similarity search, or closest-point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the point in S closest to q. In many cases, M is taken to be d-dimensional Euclidean space and distance is measured by the Euclidean or Manhattan metric.

Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning a residence to the nearest post office.

Applications
The Nearest Neighbor Search problem arises in several fields including
 * Pattern recognition - in particular for optical character recognition
 * Statistical classification- see k-nearest neighbor algorithm
 * Computer vision
 * Databases - e.g. content-based image retrieval
 * Coding theory - see maximum likelihood decoding
 * Data compression - see MPEG-2 standard
 * Recommendation systems
 * Internet marketing - see contextual advertising and behavioral targeting
 * DNA sequencing
 * Spell checking - suggesting correct spelling
 * Plagiarism detection
 * Contact searching algorithms in FEA

Methods
Various solutions to the NNS problem have been proposed. The quality and usefulness of an algorithm are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. An informal observation, usually referred to as the curse of dimensionality, states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.

Linear search
The simplest solution to the NNS problem is to compute the distance from the query point to every point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, works for small databases but quickly becomes intractable as either the size or the dimensionality of the problem grows. Linear search has a running time of O(Nd), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so linear search has no space complexity beyond the storage of the database itself.
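The linear scan described above can be sketched in a few lines of Python (a minimal illustration; the function and variable names are chosen for this example, and Euclidean distance is assumed):

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def linear_search(points, query):
    # Scan every point, keeping the "best so far": O(N*d) time.
    best, best_dist = None, float("inf")
    for p in points:
        d = euclidean(p, query)
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist

pts = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
best, dist = linear_search(pts, (2.0, 2.0))
print(best)  # → (1.0, 1.0)
```

Despite its simplicity, this scan is often competitive in practice for small N or very high d, where tree-based structures lose their advantage.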

Space partitioning
Starting in the 1970s, the branch-and-bound methodology was applied to the problem. In the case of Euclidean space, this approach is known as a spatial index or spatial access method. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions, each containing half of the points of the parent region. Queries are performed by traversing the tree from the root to a leaf, evaluating the query point against the splitting coordinate at each node. For constant dimension, the query time complexity is O(log N) on average. The R-tree data structure was designed to support nearest neighbor search in a dynamic context; it has efficient algorithms for insertions and deletions.
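A k-d tree query of the kind described above can be sketched as follows (a simplified, unbalanced-on-updates illustration with hypothetical helper names; production code would use an established library such as SciPy's `cKDTree`):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_kdtree(points, depth=0):
    # Recursively bisect the points along alternating coordinate axes,
    # storing the median point at each internal node.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, depth=0, best=None):
    # Descend toward the query's region first, then backtrack into the
    # far subtree only if the splitting plane could hide a closer point
    # (the branch-and-bound pruning step).
    if node is None:
        return best
    axis = depth % len(query)
    point = node["point"]
    if best is None or dist(point, query) < dist(best, query):
        best = point
    if query[axis] < point[axis]:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, depth + 1, best)
    if abs(query[axis] - point[axis]) < dist(best, query):
        best = nearest(far, query, depth + 1, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # → (8, 1)
```

The pruning test on the splitting plane is what distinguishes this from a full scan: whole subtrees are skipped whenever the current best distance is smaller than the distance to the plane.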

In the case of a general metric space, the branch-and-bound approach is known under the name of metric trees. Particular examples include the vp-tree and the BK-tree.

Locality sensitive hashing
Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.
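One common LSH family for cosine similarity hashes a point by the signs of its dot products with random hyperplanes (a minimal sketch under that assumption; the helper names are chosen for this example, and a real system would use many hash tables rather than one):

```python
import random

random.seed(0)  # deterministic hyperplanes for the demo

def random_hyperplanes(dim, n_planes):
    # Each hyperplane is a random Gaussian normal vector; the sign of
    # the dot product with a point contributes one hash bit.
    return [[random.gauss(0, 1) for _ in range(dim)]
            for _ in range(n_planes)]

def lsh_hash(point, planes):
    # Concatenate one sign bit per hyperplane into a bucket key.
    return tuple(1 if sum(w * x for w, x in zip(h, point)) >= 0 else 0
                 for h in planes)

planes = random_hyperplanes(2, 8)
buckets = {}
for p in [(1.0, 1.1), (1.0, 0.9), (-1.0, -1.0)]:
    buckets.setdefault(lsh_hash(p, planes), []).append(p)
```

Points at a small angle to each other agree on most sign bits, so they land in the same bucket with high probability, while distant points are unlikely to collide; a query then only scans the points in its own bucket.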

Nearest neighbor search in spaces with small intrinsic dimension
The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c¹² log n), where c is the expansion constant of the dataset.

Variants
There are numerous variants of the NNS problem; the two best-known are the k-Nearest Neighbor Search and the ε-Approximate Nearest Neighbor Search.

 * k-Nearest Neighbor Search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors.

 * ε-Approximate Nearest Neighbor Search is becoming an increasingly popular tool for fighting the curse of dimensionality: instead of the exact nearest neighbor, it returns a point whose distance from the query is at most (1 + ε) times the distance to the true nearest neighbor.

 * Nearest Neighbor Distance Ratio does not apply the threshold on the direct distance from the original point to the challenger neighbor, but on a ratio of it depending on the distance to the previous neighbor. It is used in CBIR to retrieve pictures through a "query by example" using the similarity between local features. More generally, it is involved in several matching problems.
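The distance-ratio criterion can be sketched as follows (a minimal illustration with hypothetical names and an assumed ratio threshold of 0.8, as popularized by Lowe's ratio test for local feature matching):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ratio_test(query, candidates, ratio=0.8):
    # Accept the nearest candidate only if it is markedly closer than
    # the second nearest; otherwise reject the match as ambiguous.
    ranked = sorted(candidates, key=lambda c: euclidean(query, c))
    d1 = euclidean(query, ranked[0])
    d2 = euclidean(query, ranked[1])
    return ranked[0] if d1 < ratio * d2 else None

# A clear winner passes; two near-ties are rejected.
print(ratio_test((0.0, 0.0), [(1.0, 0.0), (5.0, 0.0)]))  # → (1.0, 0.0)
print(ratio_test((0.0, 0.0), [(1.0, 0.0), (1.1, 0.0)]))  # → None
```

Filtering on the ratio rather than an absolute distance makes the test scale-invariant, which is why it works well for matching local image features whose descriptor distances vary widely between scenes.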