
Open
Ends in 1 day
Paid on delivery
I run a regional event directory that follows more than 400 venue URLs. At the moment I rely on Manus AI, which only captures about 60-70% of what is published. I want a purpose-built scraper that raises that coverage to virtually 100% while preventing duplicates from slipping into the database.

The sources you will have to handle are varied: Google Calendar feeds, PDFs posted on the venues’ sites, Facebook event widgets, and a mix of other custom formats. Some pages are JavaScript-heavy, others serve flat files, and a few hide the information in weekly PDF flyers, so the solution will likely combine techniques such as headless browsing, HTML parsing, PDF text extraction/OCR, and selective API use.

The tool must plug directly into my existing site workflow (a Laravel backend with a MySQL database). I already have endpoints for create/update actions; the scraper just needs to push normalized JSON to them and include a simple hash or fingerprint system so the same event is never imported twice.

Deliverables
• A modular scraper (Python preferred, but I’m open) with separate handlers for Google Calendar, PDFs (OCR where needed), Facebook events, and a catch-all HTML parser for custom formats.
• A lightweight deduplication module that compares new events against the existing table by title, date, venue, and hash.
• Deployment script or Dockerfile so I can spin the service up on an Ubuntu VPS.
• Setup notes and commented code so I can extend it to new venues later on.

Acceptance criteria
• Test run across the current 400 URLs shows ≥95% capture of unique events.
• No duplicate entries created during the same run or across consecutive runs.
• All fields (title, date/time, venue name, source URL) populate correctly in my database.
• Runtime per full scan stays under two hours on a 2-vCPU VPS.

If you have deep experience with Python scraping frameworks (Scrapy, Selenium, Playwright), PDF parsing libraries, and API integration, I’d love to see an outline of how you’d tackle the mix of formats and keep the data clean.
Project ID: 40488908
137 proposals
Open for bidding
Remote project
Active 2 days ago
137 freelancers are bidding on average $2,156 USD for this job

Hi — Elias here from Miami. I see you're looking to enhance your event directory by scraping data from over 400 venue URLs. This is crucial for keeping your platform up-to-date and competitive. The real technical challenge lies in managing the complexity of scraping multiple sources reliably. What usually matters most is ensuring the system is scalable and can handle changes in website structures without frequent breakdowns. A common issue with such integrations is maintaining data consistency while minimizing server load. My approach would involve setting up a modular architecture using Python with Scrapy for efficient scraping, alongside a robust MySQL database for storage. I would implement error handling and logging to monitor performance and quickly address issues, ensuring long-term maintainability. I've worked on similar projects where data integrity and reliability were paramount, equipping me with insights on preventing common pitfalls. A few questions to better understand the scope: Q1 – Are there specific venues or data points that are more critical to scrape? Q2 – What are your expectations regarding update frequency for this data? Q3 – How do you envision handling potential changes in venue website structures? Happy to go through the details and suggest the best technical approach. Looking forward to hearing from you.
$2,500 USD in 10 days
8.4

⭐⭐⭐⭐⭐ Create a Reliable Event Scraper for Your Regional Directory ❇️ Hi My Friend, I hope you're doing well. I've reviewed your project requirements and see you are looking for a custom web scraper to boost event coverage. Look no further; Zohaib is here to help you! My team has successfully completed 50+ similar projects for web scraping. I will develop a modular scraper that captures data from various sources and prevents duplicates effectively. ➡️ Why Me? I can easily create your event scraper as I have 5 years of experience in web scraping and data extraction. My expertise includes Python, HTML parsing, and API integration. Not only this, but I also have a strong grip on PDF parsing and deduplication techniques, ensuring a clean and efficient data flow for your project. ➡️ Let's have a quick chat to discuss your project in detail and let me show you samples of my previous work. Looking forward to discussing with you in chat. ➡️ Skills & Experience: ✅ Python Programming ✅ Web Scraping ✅ HTML Parsing ✅ PDF Text Extraction ✅ API Integration ✅ Data Deduplication ✅ OCR Techniques ✅ Laravel Integration ✅ MySQL Database ✅ Docker Deployment ✅ Event Data Normalization ✅ JavaScript Handling Waiting for your response! Best Regards, Zohaib
$1,500 USD in 2 days
8.2

Scraping event data requires a robust architecture to handle site-specific structures, dynamic content, and potential anti-bot measures. I can build a reliable script that pulls this data accurately and stores it in your database while maintaining the necessary performance standards to avoid detection. My technical background spans over 15 years in web development, including extensive work with PHP, Linux environments, and database management. I frequently handle complex data integrations, utilizing my background in software architecture to ensure the scraping process is not only effective but scalable and easy to maintain long-term. My bid is $1916.56 with delivery in 1 day. Send me a message so we can discuss the target sources and get this integration moving.
$1,916.56 USD in 1 day
8.1

Hi, I will build a modular Python scraper with dedicated handlers for Google Calendar feeds, Facebook event widgets, PDF flyers (using OCR where needed), and a catch-all HTML parser — all pushing normalized JSON to your existing Laravel endpoints. I will also deliver the deduplication module, Dockerfile, and documented code for adding new venues. For deduplication, I will generate a composite hash from normalized title + date + venue ID, then check it against your MySQL table before every insert. Normalizing strings — lowercasing, stripping whitespace, removing filler words — before hashing catches the subtle variations that cause duplicates when venues repost the same event with slightly different formatting. Ready to start whenever you are. Kamran
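The composite hash described in this bid could be sketched roughly as follows. This is a minimal illustration, not the bidder's code: the choice of SHA-256, the accent stripping, the filler-word list, and the names `normalize_text` and `event_fingerprint` are all assumptions.

```python
import hashlib
import re
import unicodedata

# Hypothetical filler-word list; it would need tuning against real venue data.
FILLER_WORDS = {"the", "a", "an", "presents", "live"}


def normalize_text(value: str) -> str:
    """Lowercase, strip accents and punctuation, drop filler words, collapse spaces."""
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    value = re.sub(r"[^a-z0-9\s]", " ", value.lower())
    words = [w for w in value.split() if w not in FILLER_WORDS]
    return " ".join(words)


def event_fingerprint(title: str, date_iso: str, venue_id: int) -> str:
    """Return a SHA-256 hex digest of normalized title + ISO date + venue ID."""
    key = "|".join([normalize_text(title), date_iso.strip(), str(venue_id)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two differently formatted titles for the same event yield one fingerprint.
    print(event_fingerprint("The Jazz Night!", "2025-03-01", 7))
    print(event_fingerprint("  jazz   night ", "2025-03-01", 7))
```

Storing the digest in a uniquely indexed MySQL column would let the database itself reject a repeat insert, which is one way to cover both same-run and cross-run duplicates.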
$1,667 USD in 25 days
8.5

Hi, ★★★ Python Expert ★★★ 6+ Years of Experience ★★★ I can build a modular scraper that captures nearly 100% of events with robust deduplication. This will include: - Separate handlers for Google Calendar, PDFs, Facebook events, and custom HTML formats. - A deduplication module to ensure unique entries. - Deployment script or Dockerfile for easy setup. - Comprehensive setup notes and commented code for future extensions. I will utilize Python scraping frameworks like Scrapy and Selenium, along with PDF parsing libraries, to handle the varied sources efficiently. Ready to start once you provide access to the existing site workflow and any specific requirements for the scraper. Thanks!
$1,500 USD in 7 days
8.4

Hello, We've thoroughly reviewed your project on Advanced Event Scraping Integration, and we're excited about the opportunity to collaborate. We understand the intricacies of capturing comprehensive event data and ensuring seamless integration with your existing Laravel workflow. Having successfully delivered a similar project involving complex data sources like Google Calendar, PDFs, and varied web formats, we're equipped to enhance your coverage to virtually 100% and eliminate duplicates. Our expertise in Python, web scraping frameworks like Scrapy and Selenium, and OCR for PDFs will ensure efficient and accurate data capture. With over 8 years of experience in AI-first product development and automation, our team specializes in developing robust, scalable solutions. We've built intelligent systems that integrate seamlessly with different APIs and databases, ensuring unmatched precision and efficiency. We invite you to share more details so we can tailor a comprehensive proposal for you within 24 hours. Let's elevate your event directory to new heights. Best regards, Puru Gupta Top 1% Freelancer | All-in-One Solution Provider
$3,000 USD in 30 days
7.9

Hi there, I understand you need a purpose-built data pipeline to replace Manus AI. The system will cycle through your 400+ venue URLs, detect the source type (JS-heavy page, PDF, GCal feed, FB widget), and route it to a specialized handler. Each handler will extract event data, normalize it into a clean JSON object, generate a unique hash for deduplication, and push it to your existing Laravel API endpoints.

Technical approach: This will be a Python application. We'll use Playwright for JS-heavy sites, pdfplumber and Tesseract (OCR) for PDFs, and direct feed/API parsing for calendars. A central orchestrator will manage tasks, and the system will be containerized with Docker. Deduplication will use a SHA256 hash of normalized event title, date, and venue.

Core modules:
- URL Dispatcher: Routes URLs to the correct extraction module.
- Pluggable Extractors: Handlers for HTML/JS, PDF (text/OCR), iCal, and Facebook.
- Normalization Engine: Standardizes raw data into your target JSON schema.
- Deduplication Service: Checks event fingerprints before API submission.

Implementation strategy: We'll begin with the most common source type for a quick win. We'll develop and test each module iteratively in a staging environment against a copy of your database to ensure data integrity and prevent duplicates. Dockerization from day one ensures a smooth deployment and easy maintenance.

Questions:
1. What's the approximate distribution of source types (e.g., % of PDFs, % of JS-heavy sites) across the 400 URLs?
2. How should the system handle a source that fails to scrape after multiple retries: log and skip, or send an alert?
3. For the PDFs, do the event layouts tend to be consistent, or do they vary significantly between publications?

Regards, Rohit
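One possible shape for the URL Dispatcher named in this bid is sketched below. The classification rules, handler names, and stub handlers are assumptions made for illustration; a real dispatcher would probably also inspect response headers and page content, since a JS-heavy page cannot be recognized from its URL alone.

```python
from urllib.parse import urlparse


def classify_source(url: str) -> str:
    """Guess a handler name for a venue URL from its host and path alone."""
    parsed = urlparse(url.lower())
    host, path = parsed.netloc, parsed.path
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith(".ics") or "calendar.google.com" in host:
        return "ical"
    if host.endswith("facebook.com"):
        return "facebook"
    return "html"  # catch-all; JS-heavy pages need a content-based check


# Stub handlers (hypothetical): each real handler would return extracted events.
def handle_pdf(url: str):
    return ("pdf", url)


def handle_ical(url: str):
    return ("ical", url)


def handle_facebook(url: str):
    return ("facebook", url)


def handle_html(url: str):
    return ("html", url)


HANDLERS = {
    "pdf": handle_pdf,
    "ical": handle_ical,
    "facebook": handle_facebook,
    "html": handle_html,
}


def dispatch(url: str):
    """Route a URL to the handler chosen by classify_source."""
    return HANDLERS[classify_source(url)](url)
```

Keeping the mapping in a plain dictionary means a new venue format only needs one new handler function and one new entry.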
$2,000 USD in 25 days
8.0

Hi, We’ve built similar web scrapers that extract data from multiple sources, including Google Calendar, Facebook, and Eventbrite, and we’ve developed custom solutions for PDF parsing and OCR. We also have extensive experience with Laravel, so we can seamlessly integrate the scraper with your existing backend. For this project, we’d use a combination of headless browsers and dedicated libraries to handle different types of sources. We’d also implement a robust deduplication system to ensure only unique events are saved. Let’s schedule a 10-minute call to discuss your project in more detail and see if I’m the right fit. I usually respond within 10 minutes. Best regards, Adil
$2,270.20 USD in 21 days
7.5

Hi, this is exactly the kind of scraping project where reliability comes from treating each source type separately instead of forcing everything through one generic crawler. I would build this as a modular Python scraper with dedicated handlers for Google Calendar feeds, PDF flyers/OCR, Facebook/event widgets where accessible, and custom HTML/JavaScript-heavy venue pages. Each event would be normalized into a consistent JSON shape, fingerprinted for duplicate prevention, then pushed into your existing Laravel/MySQL workflow through your create/update endpoints. I would focus on: - Source-specific parsers with clear logs and failure reporting - Event fingerprinting/deduplication across 400+ venues - PDF/OCR handling for weekly flyers and non-standard pages - API-ready JSON output plus retry/error handling so coverage gaps are visible A sensible first milestone is a pilot across a representative set of venues, then scale the handlers once accuracy and duplicate rules are proven. Best, Dr. Syafiq
$3,000 USD in 21 days
7.6

Hello, I have reviewed your requirement for building a high-accuracy event scraping system across 400+ diverse sources and fully understand the need for near 100% capture with strict deduplication and clean Laravel integration. I have 10+ years of experience in Python-based scraping systems using Playwright, Selenium, Scrapy, OCR (Tesseract), PDF parsing, and API-driven pipelines. I have also built large-scale data aggregation systems with deduplication logic and normalized data structures. My approach: • Hybrid scraping engine (Playwright + Scrapy + API + PDF/OCR handlers) • Source-specific modules for Google Calendar, Facebook events, PDFs, and dynamic websites • Intelligent deduplication using hash + semantic matching (title/date/venue) • Normalized JSON pipeline directly integrated with your Laravel endpoints • Dockerized deployment for easy VPS setup and scaling • Optimized scheduling to complete full 400+ URL scan under performance constraints WE WILL PROVIDE COMPLETE SOURCE CODE, DOCKER SETUP, AND FULL DOCUMENTATION. WE WILL WORK USING AGILE MILESTONES AND ENSURE EASY EXTENSIBILITY FOR FUTURE VENUES. WE WILL ALSO PROVIDE 2 YEARS OF FREE ONGOING TECHNICAL SUPPORT AND ASSISTANCE FROM SETUP TO PRODUCTION DEPLOYMENT. I am confident I can significantly improve your current 60–70% capture rate to near-complete coverage with clean, duplicate-free data flow. I eagerly await your positive response. Thanks.
$2,250 USD in 7 days
7.6

Hi, This is exactly the type of large-scale data aggregation and automation system my team specializes in. We have extensive experience building high-volume scraping platforms that combine Playwright, Selenium, Scrapy, OCR pipelines, PDF extraction, API integrations, and intelligent deduplication workflows. For your event directory, I would build a modular Python-based architecture with dedicated handlers for Google Calendar feeds, Facebook events, PDF flyers (including OCR for scanned documents), and custom venue websites. The system would normalize all event data into a unified schema before pushing JSON directly into your Laravel API endpoints. To prevent duplicates, I would implement a fingerprinting engine using title, venue, datetime, source URL, and fuzzy matching to catch near-duplicate events across different sources. The platform would be fully Dockerized for Ubuntu deployment, scalable to 400+ venues, and designed for easy extension as new formats emerge. Relevant Projects: https://www.freelancer.com/projects/php/Sharepoint-RAG-SQL-GPT-agent/reviews https://www.freelancer.com/projects/selenium/Weekly-Golf-Tee-Time-Booking?frm=ludiac&sb=t https://www.freelancer.com/projects/php/SQL-RAG-GPT-Agent-with/details https://www.freelancer.com/projects/gpt-agent/Data-Analyst-Required/reviews https://www.freelancer.com/projects/php/OpenAI-Prompts-for-Telco-Support/reviews Thanks
$2,250 USD in 14 days
7.7

Hi there, I can help you with the event scraper to get that coverage up past 95%. I've done similar work before, mixing Playwright for the JS-heavy pages, PyMuPDF for PDF text extraction with Tesseract fallback for scanned flyers, and a hash-based dedup system keyed on title+date+venue+source URL. I'd build it modular so each source type has its own handler, and push normalized JSON straight to your Laravel endpoints. The whole thing would be Dockerized for easy deployment on your VPS. I'm Edward, been doing this kind of scraping work for years. Happy to chat more about the approach.
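The "text extraction with OCR fallback" idea in this bid comes down to a simple decision rule, sketched here with the two extractors passed in as callables. The 30-character threshold and the function name are assumptions; in practice `read_text_layer` might wrap PyMuPDF's `page.get_text()` and `run_ocr` might wrap `pytesseract.image_to_string` on a rendered page image.

```python
from typing import Any, Callable


def extract_page_text(
    page: Any,
    read_text_layer: Callable[[Any], str],
    run_ocr: Callable[[Any], str],
    min_chars: int = 30,
) -> str:
    """Use the embedded text layer when it looks usable, otherwise fall back to OCR."""
    text = read_text_layer(page)
    # Scanned flyers usually have an empty or near-empty text layer,
    # so count only letters and digits and ignore stray whitespace.
    if sum(ch.isalnum() for ch in text) >= min_chars:
        return text
    return run_ocr(page)
```

Running OCR only on pages that fail the check keeps the expensive step off the many PDFs that already carry a text layer, which matters for the two-hour runtime limit.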
$1,500 USD in 7 days
7.6

What you’re really missing isn’t scraping capability, it’s source-specific handling and normalization. A single parser will always struggle when one venue exposes a Google Calendar feed, another publishes scanned PDF flyers, and another renders events entirely through JavaScript. I’d build this as a modular Python service where each source type has its own extraction pipeline. First I’d classify the 400 URLs and route them through dedicated handlers (Calendar, PDF/OCR, Facebook, custom HTML/JS). Then everything gets normalized into a common event schema before being pushed to your Laravel endpoints. That separation usually makes coverage jump significantly because each format is treated on its own terms rather than through generic scraping. For deduplication, I’d combine a deterministic fingerprint with fuzzy matching on title, venue, and event date so recurring scans don’t create duplicates even when source formatting changes slightly. I built a similar data-ingestion system in Python for aggregating content from inconsistent third-party sources, where reliability mattered more than raw scraping volume. I’d also add source-level monitoring so failed venues are immediately visible instead of silently reducing coverage. I’m ready to start by auditing a sample of the current 400 URLs and identifying the highest-loss sources first.
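The fuzzy-matching half of this approach could look something like the sketch below, which uses the standard library's `difflib`. Requiring an exact venue and date match and comparing only the titles fuzzily is one interpretation of the bid, and the 0.85 threshold and the dictionary keys are assumptions.

```python
from difflib import SequenceMatcher


def _clean(title: str) -> str:
    """Lowercase and collapse whitespace before comparing titles."""
    return " ".join(title.lower().split())


def is_probable_duplicate(new: dict, existing: dict, threshold: float = 0.85) -> bool:
    """Treat two events as duplicates when venue and date match and titles are close."""
    if new["venue_id"] != existing["venue_id"] or new["date"] != existing["date"]:
        return False
    ratio = SequenceMatcher(None, _clean(new["title"]), _clean(existing["title"])).ratio()
    return ratio >= threshold
```

A check like this would run only when the deterministic fingerprint finds no exact match, so the slower comparison is limited to events at the same venue on the same date.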
$2,250 USD in 7 days
6.9

As a seasoned AI Full Stack Developer with 10+ years' experience in Software Development, especially in Advanced Data Mining and Web Scraping, I am confident that I can build the advanced Event Scraping Integration you need. My proficiency in Python, particularly in popular scraping frameworks like Scrapy and Selenium, aligns seamlessly with your project requirements. The fact that your existing backend is built on Laravel, which I am well-versed in, is an added advantage. Moreover, my competence in building modular and efficient solutions with selective API usage perfectly suits the complex task of scraping and integrating data from a wide array of sources like Google Calendar feeds, PDFs, and Facebook event widgets. Furthermore, the deployment script or Dockerfile I will provide will ensure easy creation and management of instances for your service on Ubuntu VPS.
$2,500 USD in 14 days
6.8

As the CEO and founder of Web Crest, I lead a team that specializes in building scalable and impactful digital solutions, and our mastery includes Python and PHP, both of which are key for the seamless implementation of your preferred technology stack: Laravel, MySQL, and more. Our major strengths lie in AI & Automation, which aligns perfectly with your objective for an advanced scraper. We've built intelligent systems like chatbots, data extraction tools, and web platforms capable of handling similar complex digital demands. Your project touches on areas that we have executed numerous times before: scraping with precision, working with various data formats including PDF text extraction/OCR, API integration, and ensuring clean databases with no duplicate entries. We have used Python libraries such as Scrapy and Selenium (a browser automation tool comparable to Playwright) countless times to scrape all sorts of websites and gather data efficiently.
$2,000 USD in 7 days
6.7

Hello! This is James from Hollywood. I've carefully reviewed your project on advanced event scraping integration, and I understand the importance of efficiently gathering data from over 400 venue URLs. With 15 years of experience in PHP, Python, web scraping, and API integration, I'm confident in delivering a robust solution tailored to your needs. I’ve been part of the Shopify partners program since 2016, building numerous apps and themes, which has honed my skills in data handling and automation. My approach includes: 1. Assessing the current manual processes to identify areas for automation. 2. Designing a scalable scraping architecture using Scrapy and Selenium. 3. Setting up a MySQL database for efficient data storage and retrieval. 4. Implementing a user-friendly interface for easy access to the scraped data. Could you please clarify the following questions to help me better understand the project? 1. Are there specific data points you want to prioritize from the venues? 2. What is your preferred frequency for scraping updates? 3. Do you have any existing infrastructure or tools you'd like to integrate with? I look forward to the opportunity to discuss how we can make your event directory even more powerful!
$2,000 USD in 10 days
6.3

Your scraper challenge is clear. I’ve built multi-source scrapers that combine headless browsing, PDF OCR, and API pulls for event-heavy sites before, handling JS-rendered content and tricky PDF flyers with solid results.

Here’s how I’d approach your project:
- Use Playwright to handle JS-heavy pages and Facebook widgets, ensuring all dynamic content loads.
- Build a PDF parser module with OCR (Tesseract or similar) to extract event details from flyers, combined with text heuristics to identify dates and venue.
- Collect Google Calendar feeds via their API or direct ICS parsing for structured data.
- Create a catch-all HTML parser fallback using BeautifulSoup for custom formats or flat files.
- Normalize all event data into JSON matching your endpoint schema.
- Implement a deduplication system: a hash fingerprint based on title+date+venue, plus fuzzy string matching to catch slight variations before pushing updates.
- Dockerize the whole pipeline; provide scripts for deployment on your Ubuntu VPS.
- Write setup notes that include how to add new venue handlers without an overhaul.

Two quick thoughts:
1) How often do events repeat or update on the venues' sites, and are incremental updates needed?
2) Could you share a few sample PDFs or example event URLs to verify the OCR approach early?

With a modular design, this will be scalable and keep the full scan within the 2-hour runtime limit on modest VPS hardware. Ready to start building this solution now.
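The "direct ICS parsing" option for Google Calendar feeds can be illustrated with a small standard-library sketch. It handles only line unfolding and three properties, does not unescape text values, and would mis-split a parameter containing a quoted colon, so a dedicated iCalendar library is likely the safer choice for production; the output field names are assumptions.

```python
# Map iCalendar property names to the (assumed) normalized field names.
WANTED = {"SUMMARY": "title", "DTSTART": "start", "LOCATION": "venue"}


def parse_ics_events(ics_text: str) -> list[dict]:
    """Extract title, start, and venue from each VEVENT block in an ICS feed."""
    # Unfold continuation lines: a line starting with a space or tab
    # continues the previous line (RFC 5545 content-line folding).
    lines: list[str] = []
    for raw in ics_text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)

    events: list[dict] = []
    current = None
    for line in lines:
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            name, value = line.split(":", 1)
            prop = name.split(";", 1)[0].upper()  # drop parameters such as TZID
            if prop in WANTED:
                current[WANTED[prop]] = value
    return events
```

Because a feed already separates title, start time, and location, events from this handler need far less heuristic cleanup than those scraped from HTML or PDFs.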
$3,000 USD in 7 days
5.9

Hello! It sounds like you're looking to enhance your event directory with a more efficient scraping solution. To address the need for reliable data collection from over 400 venue URLs, I suggest utilizing a combination of Scrapy for its effective crawling capabilities and Selenium for handling dynamic content. This approach will allow us to gather data consistently while managing the intricacies of different website structures. I'd start by setting up a Scrapy project to outline the scraping logic, then integrate Selenium for any JavaScript-heavy pages that require it. Q1: What specific data points are you looking to extract from each venue? Q2: Do you have any existing infrastructure for storing the scraped data, or would you like recommendations for that? Q3: Are there particular anti-scraping measures on the target sites that we should be aware of? Let’s get this project moving smoothly.
$2,250 USD in 17 days
6.6

Hi, I can build a modular event scraping system that reliably extracts near-complete coverage from mixed sources and pushes clean, deduplicated events into your Laravel + MySQL pipeline. I’ve built Node/Python webhook and crawler systems with Playwright, Scrapy-style pipelines, PDF parsing, and API normalization logic focused on data accuracy and idempotency at scale. I’ll design a Playwright-based crawler with separate handlers for calendars, PDFs (OCR), Facebook embeds, and HTML parsing, plus a fingerprint-based deduplication layer before syncing to your existing endpoints. Are your current Laravel endpoints already accepting idempotency keys or should I design the full duplicate-control logic end-to-end on the scraper side? Best Regards, Fizza Nadeem K
$1,500 USD in 10 days
5.8

Hi, I can help you. You want a tool that grabs almost all events from 400 venue links, no repeats, and sends clean event data into your site. It needs to read many kinds of pages, some with scripts, some with PDFs, some with feeds, and some with widgets. It should be quick, reliable, and easy to run on your server, and fill in all fields the same way every time. This will take a few days; I've been doing this type of work for years. I have short walkthrough videos on my Freelancer profile showing similar work. 1) What do you have now for the list of venue links and your create/update endpoints? 2) What should the final event record look like and which fields are required? Ideally, we have a call and go through the details together so I can make sure I understand everything correctly, address any questions, and give you a quote and timeline. Would that work? Best, Nicolas
$2,250 USD in 7 days
5.6

Leesburg, United States
Payment method verified
Member since Jun 6, 2024