Website crawler scraper

Completed Posted Feb 24, 2014 Paid on delivery

+Start crawling from a list of the URLs specified by user;
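A minimal sketch of how crawling from a user-supplied URL list could work, assuming a breadth-first queue. The names `crawl`, `fetch_links`, and `max_pages` are hypothetical and not part of the posting; `fetch_links` stands in for whatever code downloads a page and returns its raw `href` values.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the user's seed URLs.

    fetch_links(url) is a caller-supplied (hypothetical) function that
    returns the href values found on that page.
    """
    queue = deque(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and strip "#fragment" parts.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

The `seen` set keeps the same URL from being queued twice, and `max_pages` is only a safety limit for the sketch.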

+Supports a wide range of character sets, with automated character set and language [login to view URL] character sets [login to view URL] phrase segmenting (tokenizing) for Chinese, Japanese, Korean and [login to view URL] segmenting for Chinese, Japanese, Korean, and [login to view URL] SGML entities like 'à' and ISO-Latin-1 characters can be indexed and [login to view URL] problem to crawl any Unicode or legacy character encoding (Chinese, Japanese, Korean, Arabic, Hebrew, Turkish, Thai, Greek, Baltic, Cyrillic; UTF-8, windows-12xx)

+Spider picture and video sources from the page source code and extract them to a correct MySQL file (with CREATE TABLE statements)

+Checks website source code and returns: site title, site meta description, site keywords, site page size, search term site URL, and much more
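As an illustration of the fields listed above, here is a minimal sketch using only the Python standard library. The class `PageMetaParser` and function `extract_page_info` are hypothetical names, and page size is assumed to mean the size of the downloaded HTML in bytes.

```python
from html.parser import HTMLParser


class PageMetaParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_info(url, html_bytes, encoding="utf-8"):
    """Return title, meta description, keywords, and page size for one page."""
    parser = PageMetaParser()
    parser.feed(html_bytes.decode(encoding, errors="replace"))
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        "page_size": len(html_bytes),  # assumption: size in bytes as downloaded
    }
```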

+Reasonable duplicate domain and duplicate content detection to avoid re-crawling of identical sites on different domains. ([login to view URL] vs [login to view URL], and a million other sites that use multiple domains for the same content.)
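One simple way to approach duplicate content detection is to fingerprint each page's text and remember which URL produced it first. This is only a sketch under that assumption: `content_fingerprint` and `DuplicateFilter` are hypothetical names, and an exact hash catches only identical text (after whitespace and case normalization), not near-duplicates.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers the first URL seen for each content fingerprint."""

    def __init__(self):
        self._first_url = {}  # fingerprint -> URL that first produced it

    def check(self, url, text):
        """Return the earlier URL with identical content, or None if new."""
        fingerprint = content_fingerprint(text)
        if fingerprint in self._first_url:
            return self._first_url[fingerprint]
        self._first_url[fingerprint] = url
        return None
```

If several pages of one domain map to pages already seen on another domain, the second domain can reasonably be treated as a mirror and skipped.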

+Understanding GET parameters, and what counts as a "search result" across many site-specific search engines. For example, a page may link to a search result page on another site's internal search with some GET parameters. The crawler should not crawl these result pages.
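A rough heuristic for spotting search-result URLs could look like the sketch below. The parameter names in `SEARCH_PARAMS` and the path hints are assumptions chosen for illustration; real sites vary, so the lists would need tuning.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed, non-exhaustive lists of common search parameters and paths.
SEARCH_PARAMS = {"q", "query", "search", "s", "keyword", "keywords"}
SEARCH_PATH_HINTS = ("/search", "/results")


def looks_like_search_result(url):
    """Guess whether a URL points at a site's internal search results."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    if any(name.lower() in SEARCH_PARAMS for name in params):
        return True
    # Fall back to the path, but only when GET parameters are present.
    path = parts.path.lower()
    return bool(parts.query) and any(hint in path for hint in SEARCH_PATH_HINTS)
```

URLs for which this returns `True` would simply not be added to the crawl queue.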

+Block the unwanted [login to view URL] and manage cookies for anonymous access, and cache crawled [login to view URL] caching gives a significant time reduction in search [login to view URL] cleaning algorithm

+Detect broken links (and automatically ignore them). Detect and remove duplicate data. Use duplicate detection to stop web scraping when old data is reached.

+Crawling rules and multithreaded downloading (up to 50 threads). Can perform parallel and multi-threaded indexing for faster updating.
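The thread limit could be enforced with a thread pool, as in this minimal sketch. `download_all`, `fetch`, and the `fetcher` parameter are hypothetical names; the cap of 50 comes from the requirement above, and failed downloads are collected separately so they can be ignored.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

MAX_THREADS = 50  # upper limit stated in the requirement


def fetch(url, timeout=10):
    """Download one URL and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetcher=fetch, threads=10):
    """Download URLs in parallel; return (results, errors) keyed by URL."""
    workers = max(1, min(threads, MAX_THREADS))
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # unreachable or failing URL
                errors[url] = exc
    return results, errors
```

Passing a custom `fetcher` makes it possible to add caching or cookie handling without changing the pool logic.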

+Apply Regular Expressions (RegEx) on the text or HTML source of web pages and scrape the matching portion. Also extract using XPath.
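Both extraction styles can be sketched with the Python standard library. Note the assumptions: `PRICE_RE` is just an example pattern, the function names are hypothetical, and `xml.etree.ElementTree` supports only a limited XPath subset and requires well-formed markup, so real-world HTML would likely need a more tolerant third-party parser.

```python
import re
import xml.etree.ElementTree as ET

# Example pattern only: dollar amounts such as $5 or $19.99.
PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")


def scrape_regex(source, pattern=PRICE_RE):
    """Return every portion of the source that matches the pattern."""
    return pattern.findall(source)


def scrape_xpath(xhtml, path):
    """Return the text of elements matched by a (limited) XPath expression.

    Works only on well-formed markup; ElementTree raises ParseError otherwise.
    """
    root = ET.fromstring(xhtml)
    return [(element.text or "").strip() for element in root.findall(path)]
```

For example, `scrape_xpath(page, ".//span[@class='price']")` selects every `span` element whose `class` attribute is exactly `price`.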

+Update every N minutes: a setting that specifies how often the program will scrape the target website

+Extract data from documents such as PDF or DOCX files by using 3rd party document converters.

+Script should be able to return these results in less than 5 seconds

+Scrape data from social media (Twitter, Facebook, and other social sites) and blogs in real time

+Export a chosen number of results per file (100; 1,000; 10,000; 100,000; ...)
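Splitting the export into files of a fixed size might look like the sketch below. CSV is assumed here purely for illustration (the posting does not name a format for this item), and `chunked`, `export_in_files`, and the file naming scheme are hypothetical.

```python
import csv
from itertools import islice
from pathlib import Path


def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def export_in_files(rows, out_dir, per_file=1000, prefix="results"):
    """Write rows to numbered CSV files, `per_file` rows each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, batch in enumerate(chunked(rows, per_file), start=1):
        path = out_dir / f"{prefix}_{index:04d}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(batch)
        written.append(path)
    return written
```

Because `chunked` reads the input lazily, the full result set never has to be held in memory at once.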

+Export crawled information to SQL and MySQL files (automatic MySQL CREATE TABLE and INSERT INTO ... VALUES statements for title, meta, keywords, page size, search term site URL, etc., and much more functionality in SQL)
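A minimal sketch of generating such a MySQL file as text follows. The table name `pages`, the column names and types, and the helper names are assumptions for illustration. The escaping is deliberately simple (backslashes and single quotes only) and is meant for a dump file; code that talks to a live database should use parameterized queries instead.

```python
# Assumed schema; the posting does not specify table or column names.
CREATE_TABLE = """CREATE TABLE IF NOT EXISTS pages (
  id INT AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(2048) NOT NULL,
  title TEXT,
  description TEXT,
  keywords TEXT,
  page_size INT
) DEFAULT CHARSET=utf8mb4;"""

COLUMNS = ("url", "title", "description", "keywords", "page_size")


def sql_literal(value):
    """Render a Python value as a MySQL literal (simplified escaping)."""
    if value is None:
        return "NULL"
    if isinstance(value, int):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "''")
    return f"'{escaped}'"


def to_sql_dump(pages):
    """Build a dump: one CREATE TABLE plus one INSERT per crawled page."""
    lines = [CREATE_TABLE]
    column_list = ", ".join(COLUMNS)
    for page in pages:
        values = ", ".join(sql_literal(page.get(column)) for column in COLUMNS)
        lines.append(f"INSERT INTO pages ({column_list}) VALUES ({values});")
    return "\n".join(lines) + "\n"
```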

C Programming, C# Programming, C++ Programming, MySQL, PHP

Project ID: #5482126

About the project

4 proposals Remote project Active Feb 27, 2014

Awarded to:

cliscwang

Hi. I'm very interested in this scraping project. I'm an expert in C Programming, C# Programming, C++ Programming, MySQL, PHP, and HTML5. I can help you in a short time and at a low price. Your project will be completed clearly...

$80 USD in 3 days
(25 Reviews)
4.6

4 freelancers are bidding on average $106 for this job

kshapawi

I have built a crawler before that started from random URLs taken from Google and collected email addresses from the pages. It managed to collect more than 100,000 email addresses from all over the world. It could go more as I s...

$120 USD in 10 days
(0 Reviews)
0.0