We want to calculate the pairwise similarity of several thousand texts. The number of texts can go up to 100,000. Each text is stored in its own .txt file, and each file is named with a number: [login to view URL], [login to view URL], etc.
After that, we want to extract the least similar texts, with 2 options: extract the x least similar texts, or extract all texts with a maximum similarity of n%.
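To make the two extraction options concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the similarity measure (Jaccard similarity on word sets), the function names, and the greedy selection strategy are placeholders, not the required algorithm, which is still to be defined.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' lowercase word sets, in [0, 1].

    Assumption: the real similarity measure is not specified yet.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def select_below_threshold(texts: dict, max_sim: float) -> list:
    """Option 2: greedily keep texts whose similarity to every kept text is <= max_sim."""
    kept = []
    for name in sorted(texts):
        if all(jaccard(texts[name], texts[k]) <= max_sim for k in kept):
            kept.append(name)
    return kept


def select_x_least_similar(texts: dict, x: int) -> list:
    """Option 1: rank texts by their highest similarity to any other text, return the x lowest."""
    peak = {name: 0.0 for name in texts}
    for a, b in combinations(texts, 2):
        s = jaccard(texts[a], texts[b])
        peak[a] = max(peak[a], s)
        peak[b] = max(peak[b], s)
    return sorted(peak, key=lambda n: (peak[n], n))[:x]


# Hypothetical sample data: keys stand in for the numbered file names.
texts = {
    "1": "the cat sat on the mat",
    "2": "the cat sat on a mat",
    "3": "stock prices fell sharply today",
}
print(select_below_threshold(texts, 0.5))  # ['1', '3']
print(select_x_least_similar(texts, 1))    # ['3']
```

This sketch compares every pair of texts, which means roughly 5 billion comparisons for 100,000 texts, so the real tool will likely need a more scalable approach. The greedy selection also does not guarantee the largest possible set of texts under a given threshold.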
A table must be generated showing how many texts can be extracted at each maximum similarity ratio x%, with x going from 0 to 100 in increments of 1.
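As an illustration of the expected table, the sketch below builds it from a precomputed similarity matrix. The matrix layout (a symmetric list of lists with values between 0 and 1), the function names, and the greedy counting rule are assumptions made for this example only.

```python
def count_extractable(sim: list, max_sim: float) -> int:
    """Greedily count texts that can be kept so that no kept pair exceeds max_sim.

    Assumption: sim[i][j] is the similarity of text i and text j, in [0, 1].
    """
    kept = []
    for i in range(len(sim)):
        if all(sim[i][k] <= max_sim for k in kept):
            kept.append(i)
    return len(kept)


def extraction_table(sim: list) -> list:
    """Return (x, count) rows for x = 0..100, where x is the maximum similarity in percent."""
    return [(x, count_extractable(sim, x / 100)) for x in range(101)]


# Hypothetical similarity matrix for three texts.
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
for x, count in extraction_table(sim):
    if x % 25 == 0:
        print(f"{x:3d}%  {count}")
```

Because the selection is greedy, each count is a lower bound on what could be extracted: finding the true maximum is a maximum independent set problem, which is expensive to solve exactly at this scale.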
The tool must run on demand on an HPC (high-performance computing) system.
We are open to hiring several people to achieve this goal if necessary: a mathematician to write the calculation algorithm, a computational linguist, and someone experienced with HPC.