org.clojars.punit-naik.clj-ml.lsh

band-hash

(band-hash band-size minhash-list)
Takes the minhash signature of a string and partitions it according to `band-size`.
Each "band" (partition) is then hashed, as similar strings will tend to have at least one matching hashed band.
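
The banding idea can be illustrated with a minimal sketch. It is not the library's implementation: `band-hash-sketch`, `sig-a`, and `sig-b` are hypothetical names, `band-size` is assumed to mean the number of signature values per band, and Clojure's built-in `hash` stands in for whatever hash function the library uses.

```clojure
;; Hypothetical sketch: split a signature into bands and hash each band.
;; Assumes `band-size` is the number of values per band.
(defn band-hash-sketch
  [band-size signature]
  (map hash (partition-all band-size signature)))

;; Two made-up signatures that agree in their first and last bands.
(def sig-a [3 7 1 9 4 2])
(def sig-b [3 7 5 8 4 2])

;; Compare the band hashes position by position; equal bands
;; always produce equal hashes.
(map = (band-hash-sketch 2 sig-a) (band-hash-sketch 2 sig-b))
```

Here the first and last bands are identical in both signatures, so their hashes are guaranteed to match, which is what would mark the two strings as candidate duplicates.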

compare-records

(compare-records records)
Compares the records/strings in a list with each other using `org.clojars.punit-naik.clj-ml.utils.string/reversed-levenstein-distance`

find-possible-duplicates

(find-possible-duplicates shingle-size hash-count band-size match-threshold data)
Takes a collection of strings (`data`) and finds the similar strings in the collection
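
As a rough illustration of how the five parameters could fit together, the following self-contained sketch runs a generic MinHash/LSH pipeline. It is an assumption-laden stand-in, not the library's code: every function name is hypothetical, `hash` with a seed stands in for the library's hash family, `match-threshold` is assumed to be a similarity in the range 0 to 1, and the final check uses Jaccard similarity of shingle sets rather than the library's `reversed-levenstein-distance`. Every input string is assumed to be at least `shingle-size` characters long.

```clojure
(require '[clojure.set :as set])

;; All names below are hypothetical; this is a generic LSH sketch.

(defn shingles
  "Set of all substrings of length k in s."
  [k s]
  (set (map #(apply str %) (partition k 1 s))))

(defn signature
  "One minimum per seeded hash; assumes shingle-set is non-empty."
  [hash-count shingle-set]
  (for [seed (range hash-count)]
    (apply min (map #(hash [seed %]) shingle-set))))

(defn band-keys
  "Pairs of [band-index band-hash] for a signature."
  [band-size sig]
  (map-indexed (fn [i band] [i (hash band)])
               (partition-all band-size sig)))

(defn jaccard [a b]
  (/ (count (set/intersection a b)) (count (set/union a b))))

(defn possible-duplicates-sketch
  [shingle-size hash-count band-size match-threshold data]
  (let [sh      (into {} (map (juxt identity #(shingles shingle-size %)) data))
        ;; Bucket strings by each of their band keys.
        buckets (reduce (fn [acc s]
                          (reduce (fn [a k] (update a k (fnil conj #{}) s))
                                  acc
                                  (band-keys band-size
                                             (signature hash-count (sh s)))))
                        {}
                        data)]
    ;; Strings sharing a bucket are candidates; keep pairs above the threshold.
    (set (for [group (vals buckets)
               a group
               b group
               :when (and (neg? (compare a b))
                          (>= (jaccard (sh a) (sh b)) match-threshold))]
           [a b]))))

(possible-duplicates-sketch 3 20 2 0.5
                            ["the quick brown fox"
                             "the quick brown fax"
                             "lorem ipsum dolor"])
```

The result is a set of string pairs. Because candidate selection depends on hash collisions between bands, a truly similar pair can occasionally be missed; smaller bands make such misses less likely at the cost of more comparisons.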

hash-n-times

(hash-n-times sh-list n)
Hashes a list of shingles `n` times
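
One way to picture this step is the sketch below. It is only an assumption about the shape of the result: `hash-n-times-sketch` is a hypothetical name, and pairing each shingle with a seed before calling Clojure's `hash` is a simple stand-in for `n` independent hash functions, not necessarily what the library does.

```clojure
;; Hypothetical sketch: produce n hash values per shingle by
;; combining the shingle with n different seeds.
(defn hash-n-times-sketch
  [sh-list n]
  (map (fn [shingle]
         (map (fn [seed] (hash [seed shingle])) (range n)))
       sh-list))

;; Three shingles, four hashes each: the result is three lists of four values.
(hash-n-times-sketch ["ab" "bc" "cd"] 4)
```

Under this assumption every inner list has the same size `n`, which is the shape a minhash step needs as its input.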

merge-candidates

(merge-candidates candidate-list)

merge-candidates-recursive

(merge-candidates-recursive candidate-list)

min-hash

(min-hash hash-values)
Takes the lists of hashed values (all of the same size)
and finds the minimum hash value at position `i` across every list,
thereby generating a single list of hash values, which is the minhash signature of that string
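
The position-wise minimum described above can be sketched in one line. `min-hash-sketch` is a hypothetical name used for illustration, and the input values are made up; the library's own function may differ in details such as the return type.

```clojure
;; Hypothetical sketch: take the minimum at each position across
;; equally sized lists of hash values.
(defn min-hash-sketch
  [hash-values]
  (apply map min hash-values))

(min-hash-sketch [[5 2 9]
                  [3 8 1]
                  [7 4 6]])
;; => (3 2 1)
```

Each output position is the smallest value found at that position in any input list: 3 from the first column, 2 from the second, and 1 from the third.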