I have a list of local company web sites.
To save a lot of manual effort, I need a spider program that can visit each web site on the list. The spider will need to explore the home page and some other pages on each site, but it must not follow links to other sites.
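As a rough illustration of the "stay on the same site" rule, the sketch below filters the links found on a page down to those on the same host. The function name `same_site_links` is hypothetical, and it assumes "same site" means an exact host name match, so `www.example.co.uk` and `example.co.uk` would count as different sites unless that rule is relaxed.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Return absolute URLs from hrefs that stay on page_url's host."""
    site_host = urlparse(page_url).hostname
    kept = []
    for href in hrefs:
        # Resolve relative links such as "about/" against the current page.
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        # Skip mailto:, javascript:, ftp: and anything on another host.
        if parts.scheme in ("http", "https") and parts.hostname == site_host:
            kept.append(absolute)
    return kept
```

A crawler built on this would queue only the returned URLs for further visits, and could still record the off-site links for the link count and broken-link checks without following them.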
I want to sell Web design and Web hosting services to these companies, so I'd like to collect some information to help identify good targets.
I want the spider to create a record for each web site it visits and extract the following data.
Sometimes the data may not be on the home page but on a page linked from it.
Data required:
Postal address
Telephone number
Fax number
Email address (NB: may appear as text or as an image link)
If there is no email address, start an SMTP session and establish whether one of admin@, info@, sales@, etc. is a valid address
Can the home page be found?
Any http error codes - 403, etc?
Any home page meta tags, including description, keywords, generator, content-type, etc.
Home page title
Frame set on home page?
Any html forms on the site?
How many links off the home page?
Is there a [login to view URL] file?
Any broken links?
Also calculate the Google PageRank (PR) of the home page of each site
Any scripting used on the site, and what type (PHP, ASP, etc.)?
Any flash movies on the site?
IP address of server
Name of hosting company - perhaps the Autonomous System Number
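Several of the home page checks above (title, meta tags, frame set, forms, link count, Flash) can be read from a single pass over the HTML. The following is only a minimal sketch using Python's standard `html.parser`; the class name `HomePageScan` and its attribute names are hypothetical, and the Flash test is a crude assumption that simply looks for `.swf` references in `embed`, `object` and `param` tags.

```python
from html.parser import HTMLParser


class HomePageScan(HTMLParser):
    """Collect title, meta tags, frameset/form/Flash flags and links."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}            # e.g. {"generator": "...", "description": "..."}
        self.has_frameset = False
        self.form_count = 0
        self.links = []           # raw href values from <a> tags
        self.flash_found = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)       # tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("http-equiv")
            if key:
                self.meta[key.lower()] = attrs.get("content") or ""
        elif tag == "frameset":
            self.has_frameset = True
        elif tag == "form":
            self.form_count += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("embed", "object", "param"):
            # Assumption: a reference to a .swf file means a Flash movie.
            source = attrs.get("src") or attrs.get("data") or attrs.get("value") or ""
            if source.lower().split("?")[0].endswith(".swf"):
                self.flash_found = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan_home_page(html_text):
    """Parse one page of HTML and return the filled-in scanner."""
    scan = HomePageScan()
    scan.feed(html_text)
    return scan
```

After `scan = scan_home_page(html_text)`, `len(scan.links)` gives the number of links on the home page and `scan.meta` holds whatever meta tags were present. Fetching the page, recording HTTP error codes and testing each link for breakage would be separate steps not shown here.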
Domain name and whois information:
Domain type - .uk, .com, etc
Registrant
Registrant's agent
Renewal date
Registration status
Name servers
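For the domain and whois items, one possible starting point is sketched below. `domain_type` just takes the last label of the domain name, and `whois_query` sends a plain WHOIS request over TCP port 43 as described in RFC 3912. Both function names are hypothetical, and the default server `whois.iana.org` is an assumption: it normally points to the registry's own whois server, which would then need a second query. The reply format differs between registries, so pulling out the registrant, agent, renewal date, status and name servers would need per-registry parsing that is not shown.

```python
import socket


def domain_type(domain):
    """Return the last label of a domain name, e.g. '.uk' for 'example.co.uk'."""
    return "." + domain.rstrip(".").rsplit(".", 1)[-1].lower()


def whois_query(domain, server="whois.iana.org", timeout=10):
    """Send one WHOIS query and return the raw text reply."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        # The protocol is a single line of text terminated by CRLF.
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

The raw reply could be stored in the site record as-is, so that the individual fields can be extracted later once the reply formats for the relevant registries are known.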
What else do you suggest the script should check for?
Please detail what your bid includes and the date by which you could complete the job.
I don't mind if the tool runs on my Windows PC or is a script to run on my web server.