We have a MySQL database several gigabytes in size. It contains tens of thousands of chat dialogs. We need this database analysed to extract the following results:
- Every word will be compared with every other word in the database to list every possible typo made while typing that word.
Take the word "abcde" as an example. The whole database will be analysed to find words such as "abcd", "abdce", "bcde", "absde", etc., which together form the set of "abcde typos". We need as many candidates as can be extracted, so no limit should be placed on the kinds of typos considered.
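To illustrate the kind of comparison we have in mind, here is a minimal sketch in Python. It is only an assumption about one possible approach, not a required implementation: it uses the optimal string alignment variant of edit distance (insertions, deletions, substitutions and swaps of adjacent letters), and the function names and the `max_distance` threshold are hypothetical. Comparing every word with every other word this way grows quadratically with the vocabulary, so a real solution would likely need something smarter.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Two adjacent letters typed in the wrong order.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def typo_candidates(word, vocabulary, max_distance=1):
    """Return the words in vocabulary within max_distance edits of word.

    max_distance is an assumed tuning knob; raise it to widen the net.
    """
    return sorted(
        w for w in set(vocabulary)
        if w != word and osa_distance(word, w) <= max_distance
    )


vocabulary = ["abcde", "abcd", "abdce", "bcde", "absde", "fghj"]
print(typo_candidates("abcde", vocabulary))
```

With the example words above, all four typos are a single edit away from "abcde", while an unrelated word such as "fghj" is not.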
- All word groups will be extracted using a similar comparison algorithm, this time comparing sentences instead of words.
Let's say we have the following word groups:
abcde fghj
abcde klmno
abcde prs
We need to identify all two-word and three-word groups that match this pattern.
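As a rough illustration of the word-group extraction, the sketch below collects every two-word and three-word group from a list of sentences and then indexes them by their first word. This is an assumption about how the pattern could be read (groups that share a leading word), and the function names, the whitespace tokenisation and the lowercasing are all hypothetical choices rather than requirements.

```python
from collections import Counter, defaultdict


def word_groups(sentences, sizes=(2, 3)):
    """Count every run of consecutive words of the given sizes."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # naive tokenisation (assumption)
        for n in sizes:
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def groups_by_first_word(counts):
    """Index the extracted groups by the word they start with."""
    index = defaultdict(list)
    for group in sorted(counts):
        index[group[0]].append(" ".join(group))
    return dict(index)


sentences = ["abcde fghj", "abcde klmno", "abcde prs"]
counts = word_groups(sentences)
print(groups_by_first_word(counts)["abcde"])
```

For the three example lines above, this lists "abcde fghj", "abcde klmno" and "abcde prs" under "abcde". The counts could also be used to rank groups by how often they occur.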
We'll provide the database in XML format.