Hello, I am a software developer with experience in many languages and techniques.
Regarding your project, please note that PHP has no built-in functions for reading PDF and Word files. It can be done with third-party solutions, but they have to be installed on the hosting server. Generally this is not a problem, but additional installation steps on the web server will be needed.
Regarding string comparison: cosine distance is really a measure for comparing numerical vectors. Strings can be converted to vectors in one way or another, for example by counting how often each word occurs, but this approach does not take into account typing errors or words that share the same root.
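To illustrate the limitation, here is a minimal sketch in Python (chosen only for brevity; the function name and the sample strings are my own, not part of your project). It converts each text to a word-count vector and computes the cosine similarity, so a single typo or a different word ending makes two words count as completely unrelated.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Identical texts score about 1.0; one typo ("aple") drops the score to about 0.5.
print(cosine_similarity("red apple", "red apple"))
print(cosine_similarity("red apple", "red aple"))
# Words with the same root share no exact token, so the score is 0.0.
print(cosine_similarity("connect", "connected"))
```

The whitespace split is the simplest possible tokenization; a real implementation would probably also strip punctuation.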
Have you considered edit distance, in particular Levenshtein distance? This method compares words by the number of single-character operations (insertion, deletion, or substitution) needed to transform one word into another. It was developed specifically for comparing words, so I think it will suit your needs better.
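As a rough sketch of how this works, here is the standard dynamic-programming version in Python (again only for illustration; the function name and sample words are my own). It keeps one row of the distance table at a time and counts insertions, deletions, and substitutions as one operation each.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] = distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute ch_a with ch_b (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("apple", "aple"))         # 1: one missing letter
print(levenshtein("connect", "connected"))  # 2: two added letters
print(levenshtein("kitten", "sitting"))     # 3
```

For the project itself this would not need to be written by hand, since PHP ships a built-in `levenshtein()` function.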
Please let me know what you think about this.
Looking forward to hearing from you,
Denis