Create a Java-based text-mining application that crawls the web for content relevant to given search terms. For example:
Search term: "australian home design"
Example process: after querying the major search engines and other relevant sites such as Wikipedia and [login to view URL] (for example), the software would extract the most relevant paragraphs for the given search term from the web pages it deems most applicable. It would then generate an XHTML document that looks somewhat similar to a Google results page, but simpler and with more text.
Example output:
File name: [login to view URL]
<html>
<head>
<title>Australian home design</title>
</head>
<body>
<h1>Australian home design: best references</h1>
<h2><a href="[login to view URL]">Home site Australia</a></h2>
<p>some relevant text from the site</p>
<p>more relevant text from the site</p>
<h2><a href="[login to view URL]">Home designs of Australia</a></h2>
<p>some relevant text from the site</p>
<p>more relevant text from the site</p>
<!-- further results follow the same pattern -->
</body>
</html>
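To make the expected output concrete, here is a minimal sketch of the report-generation step, assuming the crawler has already produced a list of scraped pages with their relevant paragraphs. `Result` and `buildPage` are hypothetical names, not part of any specified API.

```java
import java.util.List;

// Sketch of the XHTML report generator. Result is a hypothetical holder for
// one scraped page: its URL, its title, and the paragraphs judged relevant.
public class ReportBuilder {

    public record Result(String url, String title, List<String> paragraphs) {}

    // Minimal HTML escaping so scraped text cannot break the markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;")
                   .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static String buildPage(String searchTerm, List<Result> results) {
        StringBuilder page = new StringBuilder();
        page.append("<html>\n<head>\n<title>").append(escape(searchTerm))
            .append("</title>\n</head>\n<body>\n");
        page.append("<h1>").append(escape(searchTerm))
            .append(": best references</h1>\n");
        for (Result result : results) {
            page.append("<h2><a href=\"").append(escape(result.url()))
                .append("\">").append(escape(result.title()))
                .append("</a></h2>\n");
            for (String paragraph : result.paragraphs()) {
                page.append("<p>").append(escape(paragraph)).append("</p>\n");
            }
        }
        page.append("</body>\n</html>\n");
        return page.toString();
    }

    public static void main(String[] args) {
        List<Result> results = List.of(
            new Result("http://example.com", "Home site Australia",
                       List.of("some relevant text", "more relevant text")));
        System.out.println(buildPage("Australian home design", results));
    }
}
```

A real implementation would likely use a templating library rather than string concatenation, but the escaping step shown here is essential either way, since scraped text can contain arbitrary markup.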
I would fully expect the creator of this software to use relevant open-source tools and Java libraries.
I suspect Nutch/Lucene might come in handy, as might GATE ([login to view URL]). Reading Wikipedia's article on text mining might also turn up useful tools and algorithms. There are probably heaps of tools out there that I know nothing about.
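In a real build, Lucene would handle the paragraph-relevance ranking with proper TF-IDF/BM25 scoring. As a rough illustration of that step only, here is a toy stand-in that ranks paragraphs by length-normalised query-term frequency; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy relevance scorer: ranks paragraphs by how densely they contain the
// query terms. A real implementation would delegate this to Lucene's
// BM25 scoring; this stand-in only illustrates the ranking step.
public class ParagraphRanker {

    // Lower-cases and splits on non-word characters.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Fraction of the paragraph's tokens that are query terms.
    static double score(String paragraph, Set<String> queryTerms) {
        double hits = 0;
        String[] tokens = paragraph.toLowerCase().split("\\W+");
        for (String token : tokens) {
            if (queryTerms.contains(token)) hits++;
        }
        // Normalise by length so long paragraphs don't win by default.
        return hits / Math.max(1, tokens.length);
    }

    // Returns the paragraphs sorted best-first by score.
    public static List<String> rank(List<String> paragraphs, String query) {
        Set<String> queryTerms = tokenize(query);
        List<String> ranked = new ArrayList<>(paragraphs);
        ranked.sort(Comparator.comparingDouble(
                (String p) -> score(p, queryTerms)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of(
            "Weather in Melbourne today.",
            "Australian home design trends favour open-plan living.");
        System.out.println(rank(paragraphs, "australian home design").get(0));
    }
}
```

Swapping this scorer for a Lucene index over the crawled paragraphs should not change the surrounding pipeline, only the quality of the ranking.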
The Java source code must be delivered, and A&A DesignTek (my company) must be able to use it as it sees fit. The code must be well documented with Javadoc (including package-level documentation), and all variable names must be meaningful, written in camelCase (e.g. simpleVariableName), and use English terms (the traditional i, j and k for loop counters are fine, but Swahili variable names are not).
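To illustrate the expected documentation and naming style, a small hypothetical utility class might look like this (the class and method are examples only, not required components):

```java
/**
 * Utilities for cleaning scraped text before it is written to the report.
 * Illustrates the expected Javadoc and camelCase conventions.
 */
public class TextCleaner {

    /**
     * Collapses runs of whitespace in {@code rawText} into single spaces
     * and trims leading and trailing whitespace.
     *
     * @param rawText the text as extracted from a web page
     * @return the normalised text
     */
    public static String normaliseWhitespace(String rawText) {
        return rawText.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normaliseWhitespace("  some   scraped \n text "));
    }
}
```

Package-level documentation would go in a `package-info.java` file alongside the classes.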