org.clojars.punit-naik.clj-ml.lsh
band-hash
(band-hash band-size minhash-list)
Takes the minhash signature of a string and partitions it into bands of `band-size` values.
Each "band" (partition) is then hashed, as similar strings will tend to have at least one matching hashed band.
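A minimal sketch of the banding idea, not the library's implementation. It assumes bands are consecutive, non-overlapping slices of the signature and uses Clojure's built-in `hash` as a stand-in for whatever hash function the library applies. The names `band-hash-sketch`, `sig-a`, and `sig-b` are hypothetical.

```clojure
;; Hypothetical sketch: split a minhash signature into bands and hash each band.
(defn band-hash-sketch
  [band-size signature]
  ;; partition-all keeps a shorter trailing band if the signature
  ;; length is not a multiple of band-size
  (map hash (partition-all band-size signature)))

;; Two made-up signatures that agree in their first and last bands
(def sig-a [3 7 1 9 4 4])
(def sig-b [3 7 5 2 4 4])

;; Compare the hashed bands position by position; the first and last
;; bands are equal, so these two signatures share at least one band.
(map = (band-hash-sketch 2 sig-a) (band-hash-sketch 2 sig-b))
```

Two strings whose signatures share any hashed band would be treated as candidate duplicates and compared more carefully afterwards.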
compare-records
(compare-records records)
Compares the records (strings) in a list with each other using `org.clojars.punit-naik.clj-ml.utils.string/reversed-levenstein-distance`
find-possible-duplicates
(find-possible-duplicates shingle-size hash-count band-size match-threshold data)
Takes a collection of strings (`data`) and finds the strings in it that are similar to one another
hash-n-times
(hash-n-times sh-list n)
Hashes a list of shingles `n` times
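One way this could look, as a sketch only. The output shape (one list of `n` hash values per shingle) and the seeding scheme (hashing a `[seed shingle]` pair with Clojure's built-in `hash`) are assumptions, and `hash-n-times-sketch` is a hypothetical name rather than the library's code.

```clojure
;; Hypothetical sketch: hash every shingle with n different "hash functions",
;; simulated here by mixing a seed into the value being hashed.
(defn hash-n-times-sketch
  [shingles n]
  (for [sh shingles]
    (for [seed (range n)]
      (hash [seed sh]))))

;; Three made-up shingles hashed 4 times each:
;; yields 3 lists, each holding 4 hash values
(hash-n-times-sketch ["ab" "bc" "cd"] 4)
```

Under this assumed shape every inner list has the same length `n`, which is the kind of input `min-hash` expects.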
merge-candidates
(merge-candidates candidate-list)
merge-candidates-recursive
(merge-candidates-recursive candidate-list)
min-hash
(min-hash hash-values)
Takes the lists of hashed values (where all of them have the same size)
and finds the minimum hash value at position `i` across all the lists,
thereby generating a single list of hash values, which is the minhash signature of that string.
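The position-wise minimum described above can be sketched as follows. This is an illustration under the assumption that the input is a sequence of equal-length numeric lists; `min-hash-sketch` is a hypothetical name, not the library function.

```clojure
;; Hypothetical sketch: take the minimum at each position across all lists.
(defn min-hash-sketch
  [hash-values]
  ;; (apply map min lists) walks the lists in lockstep and calls
  ;; min on the values found at the same position
  (apply map min hash-values))

(min-hash-sketch [[5 2 9]
                  [1 8 3]
                  [4 4 7]])
;; => (1 2 3)
```

Position 0 holds 5, 1, and 4, so its minimum is 1; positions 1 and 2 give 2 and 3 in the same way.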