Near Duplicate Detection

This package finds clusters of near-duplicate documents in a large corpus. Each document is represented as the set of n-grams it contains, and the similarity between two documents is the Jaccard coefficient of their n-gram sets. Explicit pairwise similarity computation is avoided: hashing is used to estimate the Jaccard coefficient probabilistically. For details, refer to the paper by Broder et al. [1].
Here is a README that explains usage and the input data format.

