I'm looking for a solution, idea, theory, or method I can use to find and "cluster" similar documents. I have a text document (in Hebrew) from source1 that I want to compare against hundreds of thousands of documents from other sources. Currently I'm squeezing CPU power to find the 30 most frequent words in each document and comparing them to the 30 most frequent words of the other documents. Obviously, this simple approach does not scale well. Any ideas on where I should look?
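To make the current approach concrete, here is a minimal JS sketch of it, not the actual code. The function names (`topWords`, `jaccard`, `rankBySimilarity`) are hypothetical, splitting words on runs of Unicode letters is an assumption about tokenization, and scoring two word sets by Jaccard overlap (shared words divided by total distinct words) is an assumption about how the two 30-word lists get compared.

```javascript
// Hypothetical helper: the k most frequent words of a text, as a Set.
function topWords(text, k = 30) {
  const counts = new Map();
  // \p{L}+ matches runs of letters in any script, including Hebrew.
  for (const word of text.toLowerCase().match(/\p{L}+/gu) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Sort by count (descending), breaking ties alphabetically so the
  // result is deterministic.
  const ranked = [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0)
  );
  return new Set(ranked.slice(0, k).map(([word]) => word));
}

// Assumed similarity measure: Jaccard overlap of two word sets.
function jaccard(a, b) {
  let shared = 0;
  for (const word of a) {
    if (b.has(word)) shared++;
  }
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Compare one document against every document in the corpus and
// return the corpus indexes ordered from most to least similar.
function rankBySimilarity(queryText, corpus, k = 30) {
  const queryWords = topWords(queryText, k);
  return corpus
    .map((text, index) => ({
      index,
      score: jaccard(queryWords, topWords(text, k)),
    }))
    .sort((x, y) => y.score - x.score);
}
```

Each call to `rankBySimilarity` walks the whole corpus once, so comparing one document against hundreds of thousands of others costs one word count and one set comparison per corpus document. This is the part that does not scale.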
## Deliverables
1) Complete and fully functional working program(s) in executable form, as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Complete ownership and distribution copyrights to all work purchased.
## Platform
If code is provided, preferably as C/C++, Perl, or JS code.