The goal of the project is to build a scalable web scraper that initially scrapes data from more than a dozen different websites. Later on, it should be possible to scale the scraper up to a few thousand websites.
Those websites are known and should be added to the scraper iteratively. Each website has a different structure, which is why the development and maintenance costs per site need to stay as small as possible. The aim is to scrape the websites on a weekly basis at first. Later on, the scraping interval should be shortened to daily or even less. The scraped data needs to be stored in a useful and efficient way in a cloud database. Furthermore, the scraping must be tolerant of changes in the designs of the websites and must avoid being blocked.
Currently, a simple Python scraper exists that can scrape a few websites using the Selenium library. However, it does not need to be continued at all costs.
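To illustrate how the cost per site could stay small while tolerating design changes, here is a minimal sketch, not a finished design. It assumes each website is described by a small declarative config, and that every field has an ordered list of fallback patterns, so a layout change only requires adding a pattern. The names SiteConfig, FieldRule and extract, the example URL and the patterns are all hypothetical. Regular expressions are used only to keep the sketch dependency-free; a real scraper would more likely use an HTML parser.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class FieldRule:
    """One field to extract, with patterns tried in order (hypothetical)."""
    name: str
    patterns: list[str]


@dataclass
class SiteConfig:
    """All site-specific knowledge lives here; the engine below is shared."""
    site: str
    url: str
    fields: list[FieldRule] = field(default_factory=list)


def extract(html: str, config: SiteConfig) -> dict[str, str | None]:
    """Return one record; a field is None if none of its patterns match."""
    record: dict[str, str | None] = {}
    for rule in config.fields:
        record[rule.name] = None
        for pattern in rule.patterns:
            match = re.search(pattern, html, re.DOTALL)
            if match:
                record[rule.name] = match.group(1).strip()
                break  # first matching pattern wins
    return record


# Hypothetical example site: the second title pattern is a fallback
# that still works if the <h1> markup changes.
EXAMPLE_SITE = SiteConfig(
    site="example-shop",
    url="https://example.com/products",
    fields=[
        FieldRule("title", [r'<h1 class="title">(.*?)</h1>', r"<title>(.*?)</title>"]),
        FieldRule("price", [r'data-price="([\d.]+)"']),
    ],
)
```

With this layout, adding a website means adding one SiteConfig entry, and a field that comes back as None can be used as a monitoring signal that the site's design has changed.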
The following tasks are part of your engagement for the project:
o Developing a modular and scalable software architecture for the web scraping project (preferably with Python)
o Containerizing the program in Docker
o Deploying and managing the containers in the cloud, probably with AWS and Kafka
o Implementing different measures to prevent blacklisting and being blocked
o Setting up a SQL database, probably PostgreSQL with AWS
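As a rough illustration of the measures against blocking mentioned in the tasks above, the following sketch combines two common techniques: rotating the User-Agent header and retrying with exponential backoff plus random jitter when a site answers with a throttling status. It is a sketch under assumptions, not a complete solution. The fetch callable is a hypothetical placeholder for whatever transport is eventually used (an HTTP client, Selenium, etc.), the user agent strings are placeholders, and the set of status codes treated as "blocked" is an assumption.

```python
import itertools
import random
import time

# Placeholder values; real deployments would maintain their own list.
USER_AGENTS = [
    "ExampleScraper/1.0 (variant A)",
    "ExampleScraper/1.0 (variant B)",
    "ExampleScraper/1.0 (variant C)",
]

# Assumption: these statuses indicate throttling or temporary blocking.
RETRY_STATUSES = {403, 429, 503}


def backoff_delay(attempt, base=2.0, cap=300.0):
    """Exponential backoff with full jitter, in seconds.

    Returns a random value between 0 and min(cap, base * 2**attempt).
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retries(fetch, url, max_attempts=4, sleep=time.sleep):
    """Call fetch(url, headers) -> (status, body) until it succeeds.

    `fetch` is a hypothetical, injected callable so that this logic
    stays independent of the HTTP library or browser driver in use.
    """
    agents = itertools.cycle(USER_AGENTS)
    for attempt in range(max_attempts):
        headers = {"User-Agent": next(agents)}
        status, body = fetch(url, headers)
        if status == 200:
            return body
        if status in RETRY_STATUSES:
            sleep(backoff_delay(attempt))  # wait longer after each failure
            continue
        raise RuntimeError(f"unexpected status {status} for {url}")
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

Injecting fetch and sleep keeps the retry logic testable without network access. Other measures such as proxy rotation or per-domain rate limits would sit alongside this and are not shown here.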
The following tasks might be part of a further engagement:
o Implementing the web scrapers for a large number of different websites
o Maintaining and monitoring the scrapers for the websites
o Adding a web crawler to find additional websites
o Parsing the stored data and processing them into a more useful format
The following skills are relevant for the project:
o Web Scraping (Importance: 9/10)
o Python (Importance: 7/10)
o Docker (Importance: 8/10)
o AWS (Importance: 5/10)
o Kafka or other Pipelining/Queuing Tools (Importance: 8/10)
o Cloud Databases (Importance: 6/10)
o PostgreSQL (Importance: 10/10)
You are expected to work closely together with our developer in Germany. The tasks above need to be coordinated and done in cooperation with him. Therefore, a willingness to work between 10 AM and 10 PM Central European Time is required.
We wish to get to know you first by working together in a limited project scope. If you are a fit for our team, we are willing to intensify our cooperation with you and hire you for future projects.
Hi Sir, nice to meet you. I am an expert in Python with high-level web scraping experience. I agree with your time zone and am confident in the skills you listed above. Please come to chat and show me the details.