MinHash
In computer science and data mining, MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 conference paper,[1] and was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.[2] It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.[1]
Jaccard similarity and minimum hash values
The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and A and B be subsets of U; then the Jaccard index is defined to be the ratio of the number of elements of their intersection to the number of elements of their union:

J(A, B) = |A ∩ B| / |A ∪ B|.

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A, B) quickly, without explicitly computing the intersection and union.
Let h be a hash function that maps the members of U to distinct integers, let π be a random permutation of the elements of the set U, and for any subset S of U define h_min(S) to be the minimal member of S with respect to h ∘ π; that is, the member x of S with the minimum value of h(π(x)). (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation is not used.)

Now, applying h_min to both A and B, and assuming no hash collisions, the values are equal (h_min(A) = h_min(B)) if and only if, among all elements of A ∪ B, the element with the minimum hash value lies in the intersection A ∩ B. The probability of this being true is exactly the Jaccard index, therefore:

Pr[h_min(A) = h_min(B)] = J(A, B).

That is, the probability that h_min(A) = h_min(B) is equal to the similarity J(A, B), assuming h is drawn from a uniform distribution. In other words, if r is the random variable that is one when h_min(A) = h_min(B) and zero otherwise, then r is an unbiased estimator of J(A, B). However, r has too high a variance to be a useful estimator for the Jaccard similarity on its own, because it is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.
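The identity Pr[h_min(A) = h_min(B)] = J(A, B) can be checked empirically. The following sketch uses two hypothetical example sets and a truly random permutation of the universe standing in for the hash function h, and estimates the collision probability by simulation:

```python
import random

def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Hypothetical example sets over a small universe.
A = {"apple", "banana", "cherry", "date"}
B = {"banana", "cherry", "fig"}
universe = sorted(A | B)

random.seed(42)
trials = 100_000
matches = 0
for _ in range(trials):
    # A fresh random permutation of the universe plays the role of h:
    # rank[x] is the position of x under the permutation.
    rank = {x: i for i, x in enumerate(random.sample(universe, len(universe)))}
    if min(A, key=rank.get) == min(B, key=rank.get):
        matches += 1

estimate = matches / trials  # should be close to jaccard(A, B) = 0.4
```

Over many random permutations, the fraction of trials in which the two minima coincide converges to the Jaccard index of the two sets.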
Algorithm
Variant with many hash functions
The simplest version of the MinHash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of h_min(S) for these k functions.

To estimate J(A, B) using this version of the scheme, let y be the number of hash functions for which h_min(A) = h_min(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when h_min(A) = h_min(B) and zero otherwise, and each of which is an unbiased estimator of J(A, B). Therefore, their average is also an unbiased estimator, and by the standard deviation for sums of 0-1 random variables, its expected error is O(1/√k).[3]

Therefore, for any constant ε > 0 there is a constant k = O(1/ε²) such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A, B) with an expected error less than or equal to .05.
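A minimal sketch of this variant follows. Salting a single SHA-1 digest with the function index stands in for k independent hash functions; this is an illustrative assumption, and production implementations typically use cheaper hash families:

```python
import hashlib

def minhash_signature(s, k):
    """For each of k hash functions, record the minimum hash value over s.
    Salting one SHA-1 hash with the function index stands in for k
    independent hash functions (an assumption, not part of the scheme)."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{i}:{x}".encode()).digest()[:8], "big")
            for x in s)
        for i in range(k)
    ]

def estimate_jaccard(sig_a, sig_b):
    """y/k: the fraction of the k coordinates on which the signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"apple", "banana", "cherry", "date"}
B = {"banana", "cherry", "fig"}
est = estimate_jaccard(minhash_signature(A, 400), minhash_signature(B, 400))
# True Jaccard index is 0.4; with k = 400 the expected error is about 0.05.
```

Once signatures are precomputed, any pair of sets can be compared in O(k) time without revisiting the original elements.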
Variant with a single hash function
It may be computationally expensive to compute multiple hash functions, but a related version of the MinHash scheme avoids this penalty by using only a single hash function and using it to select multiple values from each set, rather than selecting only a single minimum value per hash function. Let h be a hash function, and let k be a fixed integer. If S is any set of k or more values in the domain of h, define h_(k)(S) to be the subset of the k members of S that have the smallest values of h. This subset h_(k)(S) is used as a signature for the set S, and the similarity of any two sets is estimated by comparing their signatures.

Specifically, let A and B be any two sets. Then X = h_(k)(h_(k)(A) ∪ h_(k)(B)) = h_(k)(A ∪ B) is a set of k elements of A ∪ B, and if h is a random function then any subset of k elements is equally likely to be chosen; that is, X is a simple random sample of A ∪ B. The subset Y = X ∩ A ∩ B is the set of members of X that belong to the intersection A ∩ B. Therefore, |Y|/k is an unbiased estimator of J(A, B). The difference between this estimator and the one produced by multiple hash functions is that X always has exactly k members, whereas the multiple hash functions may lead to a smaller number of sampled elements, because two different hash functions may have the same minimum. However, when k is small relative to the sizes of the sets, this difference is negligible.

By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/√k), matching the performance of the multiple-hash-function scheme.
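The single-hash variant can be sketched as follows; the use of SHA-1 as the hash function and the particular set sizes are illustrative assumptions:

```python
import hashlib
import heapq

def h(x):
    """A single hash function; SHA-1 is a stand-in with pseudo-random behaviour."""
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def bottom_k(s, k):
    """h_(k)(S): the k members of S with the smallest hash values."""
    return set(heapq.nsmallest(k, s, key=h))

def estimate_jaccard(a, b, k):
    # X = h_(k)(h_(k)(A) ∪ h_(k)(B)) = h_(k)(A ∪ B), a size-k random sample
    # of the union (assuming no hash collisions).
    x = bottom_k(bottom_k(a, k) | bottom_k(b, k), k)
    y = x & a & b  # Y: sampled elements that lie in the intersection
    return len(y) / k

A = set(range(1000))
B = set(range(500, 1500))
est = estimate_jaccard(A, B, 200)
# True Jaccard index is 500/1500 ≈ 0.333.
```

Note that the union of the two signatures suffices to recover h_(k)(A ∪ B), so the estimate never touches the full sets.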
Time analysis
The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme. Therefore, when ε and k are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in time linear in the size of the set, so when many pairwise similarities need to be estimated, this method can lead to substantial savings in running time compared to a full comparison of the members of each set. Specifically, for set size n the many-hash variant takes O(nk) time. The single-hash variant is generally faster, requiring O(n log k) time to maintain the queue of minimum hash values, assuming n ≫ k.[1]
Incorporating weights
A variety of techniques to introduce weights into the computation of MinHashes have been developed. The simplest extends the scheme to integer weights.[4] Extend the hash function h to accept both a set member and an integer, then generate multiple hashes for each item according to its weight: if item x occurs t times, generate the hashes h(x, 1), h(x, 2), ..., h(x, t). Run the original algorithm on this expanded set of hashes. Doing so yields the weighted Jaccard index as the collision probability.
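A sketch of this integer-weight reduction, again with a salted SHA-1 digest standing in for the family of hash functions (an illustrative assumption):

```python
import hashlib

def weighted_minhash(weights, i):
    """Minimum over all items x and replicas j = 1..weight of a salted hash
    h(x, j); the salt i stands in for the i-th hash function (an assumption)."""
    def h(x, j):
        return int.from_bytes(
            hashlib.sha1(f"{i}:{x}:{j}".encode()).digest()[:8], "big")
    return min(h(x, j) for x, w in weights.items() for j in range(1, w + 1))

# Hypothetical integer-weighted sets, e.g. word counts in two documents.
a = {"the": 3, "cat": 1}
b = {"the": 2, "dog": 2}
k = 300
est = sum(weighted_minhash(a, i) == weighted_minhash(b, i) for i in range(k)) / k
# Weighted Jaccard index = sum of min weights / sum of max weights = 2/6 = 1/3.
```

Expanding each item into as many copies as its weight reduces the weighted problem to plain MinHash over the expanded multiset, at the cost of runtime proportional to the total weight.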
Further extensions that achieve this collision probability on real weights with better runtime have been developed, one for dense data,[5] and another for sparse data.[6]
Another family of extensions uses exponentially distributed hashes. A uniformly random hash between 0 and 1 can be converted to follow an exponential distribution by CDF inversion. This method exploits the convenient properties of the minimum of a set of exponential variables, and yields as its collision probability the probability Jaccard index:[7]

J_P(x, y) = Σ_{i : x_i ≠ 0, y_i ≠ 0} 1 / Σ_j max(x_j / x_i, y_j / y_i).
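A sketch of the exponential-hash construction, under the same illustrative SHA-1 hashing assumption. Each element draws an exponential variable whose rate is its weight, and the underlying uniform is keyed by the element alone so that the same element receives the same uniform in both sets:

```python
import hashlib
import math

def sample(weights, salt):
    """Pick one element: each element draws an Exponential(rate = weight) hash
    and the smallest draw wins.  The uniform u depends only on (salt, element),
    so the same element gets the same u in both sets; this consistency is what
    makes collisions meaningful."""
    def exp_hash(x, w):
        raw = int.from_bytes(hashlib.sha1(f"{salt}:{x}".encode()).digest()[:8], "big")
        u = (raw + 0.5) / 2**64   # uniform in (0, 1)
        return -math.log(u) / w   # CDF inversion: uniform to Exponential(w)
    return min(weights, key=lambda x: exp_hash(x, weights[x]))

# Hypothetical real-weighted sets.
x_w = {"a": 1.0, "b": 2.0}
y_w = {"a": 2.0, "b": 1.0}
k = 300
est = sum(sample(x_w, i) == sample(y_w, i) for i in range(k)) / k
# The probability Jaccard index J_P(x, y) for these weights works out to 2/3.
```

Because the minimum of independent exponentials with rates w_i is itself exponential with rate Σ w_i, and element i achieves the minimum with probability proportional to w_i, the collision probability of the two samples matches the probability Jaccard index.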
Min-wise independent permutations
In order to implement the MinHash scheme as described above, one needs the hash function h to define a random permutation on n elements, where n is the total number of distinct elements in the union of all of the sets to be compared. But because there are n! different permutations, it would require Ω(n log n) bits just to specify a truly random permutation, an infeasibly large number for even moderate values of n. Because of this fact, by analogy to the theory of universal hashing, there has been significant work on finding a family of permutations that is "min-wise independent", meaning that for any subset of the domain, any element is equally likely to be the minimum. It has been established that a min-wise independent family of permutations must include at least

lcm(1, 2, ..., n) ≥ e^{n − o(n)}

different permutations, and therefore that Ω(n) bits are needed to specify a single permutation, still infeasibly large.[2]
Practical min-wise independent hash functions
Because of the above impracticality, two variant notions of min-wise independence have been introduced: restricted min-wise independent permutation families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k.[8] Approximate min-wise independence differs from full independence by at most a fixed probability ε.[9]
In 1999 Piotr Indyk proved[10] that any k-wise independent family of hash functions is also approximately min-wise independent for k large enough. In particular, there is a constant c > 0 such that if h is drawn from a k-wise independent family with k ≥ c log(1/ε), then

Pr[h(x) = min h(A)] = (1 ± ε) / |A|

for all sets A and all elements x ∈ A. (Note, here (1 ± ε) means the probability lies between (1 − ε)/|A| and (1 + ε)/|A|.)

This guarantee is, among other things, sufficient to give the Jaccard bound required by the MinHash algorithm. That is, if A and B are sets, then

Pr[min h(A) = min h(B)] = J(A, B) ± ε.

Since k-wise independent hash functions can be specified using just O(k log n) bits, this approach is much more practical than using completely min-wise independent permutations.
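One standard k-wise independent family, sketched here for illustration, is a uniformly random polynomial of degree k − 1 over a prime field; the choice of prime is an assumption for the example:

```python
import random

# A uniformly random polynomial of degree k - 1 modulo a prime p, evaluated
# at the input, is k-wise independent.  One function is fully specified by
# its k coefficients, i.e. O(k log p) bits.
P = 2**61 - 1  # a Mersenne prime, assumed larger than the universe size

def random_k_wise_hash(k, rng=None):
    rng = rng or random.Random()
    coeffs = [rng.randrange(P) for _ in range(k)]
    def h(x):
        acc = 0
        for c in coeffs:  # Horner's rule: evaluate the polynomial mod P
            acc = (acc * x + c) % P
        return acc
    return h

h = random_k_wise_hash(8, random.Random(0))
```

Drawing the coefficients once fixes a single deterministic hash function from the family, which can then be reused across all sets being compared.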
Another practical family of hash functions that gives approximate min-wise independence is tabulation hashing.
Applications
The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents.[1][2][11] Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.[12]
In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning. Given a database in which each entry has multiple attributes (viewed as a 0-1 matrix with a row per database entry and a column per attribute), they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those pairs to determine the ones whose frequencies of co-occurrence are below a given strict threshold.[13]
The MinHash algorithm has been adapted for bioinformatics, where the problem of comparing genome sequences has a similar theoretical underpinning to that of comparing documents on the web. MinHash-based tools[14][15] allow rapid comparison of whole-genome sequencing data with reference genomes (around 3 minutes to compare one genome with the 90,000 reference genomes in RefSeq), and are suitable for speciation and perhaps a limited degree of microbial sub-typing. There are also applications for metagenomics[14] and the use of MinHash-derived algorithms for genome alignment and genome assembly.[16] Accurate average nucleotide identity (ANI) values can be generated very efficiently with MinHash-based algorithms.[17]
Other uses
The MinHash scheme may be seen as an instance of locality-sensitive hashing, a collection of techniques for using hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing has important applications in nearest neighbor search algorithms.[18] For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.[19]
Evaluation and benchmarks
A large-scale evaluation was conducted by Google in 2006[20] to compare the performance of the MinHash and SimHash[21] algorithms. In 2007 Google reported using SimHash for duplicate detection in web crawling[22] and using MinHash and LSH for Google News personalization.[23]
See also
References
- [1] Template:Citation.
- [2] Template:Citation.
- [3] Template:Citation.
- [4] Template:Citation.
- [5] Template:Cite arXiv.
- [6] Template:Cite book.
- [7] Template:Cite book.
- [8] Template:Citation.
- [9] Template:Citation.
- [10] Indyk, Piotr. "A Small Approximately Min-wise Independent Family of Hash Functions". Journal of Algorithms 38 (1): 84–90, 2001.
- [11] Template:Cite book.
- [12] Template:Citation; Template:Citation.
- [13] Template:Citation.
- [14] Template:Cite journal.
- [15] Template:Cite web.
- [16] Template:Cite journal.
- [17] Template:Cite journal.
- [18] Template:Citation.
- [19] Template:Cite arXiv.
- [20] Template:Citation.
- [21] Template:Citation.
- [22] Template:Citation.
- [23] Template:Citation.