
Closed
Paid on delivery
I am an independent professional reaching out to enquire about your data extraction and Data-as-a-Service (DaaS) capabilities. I am looking for an expert partner to process approximately 48,000 specific project URLs from the MahaRERA (Maharashtra Real Estate Regulatory Authority) portal. I do not require the scraping scripts or source code; my goal is simply to procure the final, clean dataset (Excel/CSV) and a structured folder of the processed PDF documents.

Below is a detailed overview of the project requirements:

Pipeline & Workflow Requirements:
- Captcha Bypass: Each URL is protected by a simple alphanumeric captcha. You will need to automate solving this (OCR/Proxies/Sessions) to access the project pages.
- Data Extraction: Scrape specific structured data points from the HTML tables on each page.
- Download & Merge PDFs: Under the "Promoter Documents" section, locate multi-part files labeled "Land Ownership Document" (e.g., REGISTERED EXCHANGE DEED Part 1, Part 2). Download all parts for each project, merge them into a single PDF, and name the final file [Registration_Number]_Land_Document.pdf.
- AI/NLP Document Analysis (Critical): Run the merged legal documents (which may be in English or Marathi) through an AI/NLP model to extract the "Consideration" or "Deal Structure" between the multiple parties.

Required Data Points:
- Primary ID: Registration Number & Date of Registration.
- Basic Details: Project Name, Project Type, and Proposed Completion Date.
- Area Details: Land Area for Project Applied (Sq. Mts.), Permissible Built-up Area, Sanctioned Built-up Area, and Aggregate Area of Recreational Open Space.
- Legal & Promoter Details: CC Date, Landowner Type, GSTIN Number, Promoter Name, and all individual names listed in the "Member Details" table.
- Joint Venture Flag: A True/False column (Mark 'True' if the "Promoter Name" contains a comma or lists multiple entities).
- Unit Details: Total Residential & Non-Residential Units.
- AI Consideration Category: Categorize the deal structure from the PDF as: Pure Monetary, Barter, Constructed Area Share, Revenue Share, or Mixed.
- AI Consideration Summary: A 1-2 sentence English summary of the commercial terms extracted from the PDF.

Project Deliverables:
- One clean .xlsx or .csv file containing all extracted data points and AI summaries for all 48,000 URLs.
- A .zip folder containing the correctly named, merged Land Ownership Document PDFs for every project.

Proposed Milestone Structure:
To ensure quality and alignment, I propose dividing this contract into two milestones:
- Milestone 1 (Proof of Concept): I will provide 100 URLs. You will deliver the CSV (including the AI extraction) and the 100 merged PDFs to prove the pipeline works accurately.
- Milestone 2 (Final Delivery): Upon approval of the sample, you will process the remaining URLs and deliver the final bulk files.

Request for Proposal:
If your team is capable of handling this workflow, please reply to this email with a proposal. To ensure you have read through the requirements, please start your response with the word "MahaRERA". Kindly include the following in your proposal:
- Your estimated total cost for the project in INR.
- Your expected turnaround time.
- A brief explanation of the AI/NLP stack you would utilize to read and summarize the Marathi/English property deeds.

I look forward to hearing from you.
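The Joint Venture Flag rule and the merged-PDF naming convention from the brief could be sketched roughly as follows. This is a minimal illustration, not a specification: the function names, the `extra_separators` parameter, and the file-name sanitizing step are assumptions, and only the comma rule and the `[Registration_Number]_Land_Document.pdf` pattern come from the brief. The phrase "lists multiple entities" is left to the caller to define, since the brief does not say which separators count.

```python
import re


def joint_venture_flag(promoter_name, extra_separators=()):
    """Return True if the promoter name contains a comma.

    extra_separators is a hypothetical hook for other markers of
    multiple entities (for example "/" or ";"); the brief names only
    the comma explicitly.
    """
    name = promoter_name or ""
    return "," in name or any(sep in name for sep in extra_separators)


def land_document_filename(registration_number):
    """Build the name [Registration_Number]_Land_Document.pdf.

    Replacing whitespace and characters that are unsafe in file names
    with "_" is an assumption; the brief does not cover that case.
    """
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", registration_number.strip())
    return f"{safe}_Land_Document.pdf"
```

Names such as "ABC Builders & Developers" show why separators other than the comma are opt-in here: treating "&" as a multi-entity marker by default would flag many single promoters.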
Project ID: 40409928
14 proposals
Remote project
Active 2 days ago
14 freelancers are bidding on average ₹23,331 INR for this job

Youssef, Full-Time Python Developer specializing in data extraction and AI automation. Your goal to extract data from 48,000 MahaRERA URLs, process PDFs, and perform AI analysis is clear. My approach uses Playwright for robust navigation and captcha handling, Scrapy for efficient data extraction, and a custom AI/NLP stack for analyzing Marathi/English documents to categorize deal structures. I have successfully completed similar large-scale data extraction and document processing projects. To proceed with the proof of concept, could you confirm the specific data points you need from the HTML tables beyond those listed? Ready to start immediately.
₹37,500 INR in 1 day
7.3

MahaRERA - Your pipeline will fail if the captcha solver cannot maintain session persistence across 48,000 requests. Most OCR tools break after 500-1000 consecutive attempts because the portal starts serving distorted images or rate-limiting IPs. I've built similar extraction systems for legal document processing at scale, and the real bottleneck isn't the scraping - it's ensuring your NLP model can accurately parse unstructured Marathi legal clauses where "consideration" is buried in 40-page deeds using inconsistent terminology.

Before architecting the solution, I need clarity on two things: First, what's your acceptable error rate for the AI extraction? If the model misclassifies 5% of deal structures as "Mixed" when they're actually "Revenue Share," does that break your downstream analysis? Second, are these PDFs searchable text or scanned images? If they're image-based, we'll need OCR preprocessing before NLP, which adds 3-4 seconds per document and changes the timeline significantly.

Here's the architectural approach:
- CAPTCHA BYPASS: Deploy rotating residential proxies with Tesseract OCR fine-tuned on MahaRERA's font patterns, maintaining session cookies to avoid detection across batch requests.
- PDF PROCESSING: Use PyPDF2 for merging multi-part files with error handling for corrupted downloads, then validate file integrity before naming convention application.
- NLP STACK: Fine-tune a multilingual BERT model (mBERT or IndicBERT) on Marathi legal corpus to extract consideration clauses, then use GPT-4 API for structured summarization into your 5 categories with confidence scoring.
- DATA VALIDATION: Implement checksum verification on all 48K records to flag missing fields, duplicate registration numbers, or failed PDF merges before final delivery.

I've processed 120K+ legal documents for a proptech client where accuracy was non-negotiable - we achieved 94% classification accuracy on deal structures after two rounds of model retraining.

I don't take on projects where the NLP requirements are vague. Let's discuss edge cases like handwritten annotations in scanned deeds and how you want the system to handle extraction failures before I commit to a fixed timeline.
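As a rough illustration of the DATA VALIDATION step, the sketch below flags rows with missing fields and duplicate registration numbers. It uses plain field checks rather than checksums, and it does not cover failed PDF merges. The column names in `REQUIRED_FIELDS` are hypothetical; the real ones would depend on the final CSV layout.

```python
from collections import Counter

# Hypothetical column names, chosen for illustration only.
REQUIRED_FIELDS = ("registration_number", "project_name", "promoter_name")


def validate_records(records):
    """Report rows with empty required fields and duplicated registration numbers.

    records is a list of dicts, one per project row.
    """
    missing = {}
    for index, row in enumerate(records):
        empty = [f for f in REQUIRED_FIELDS if not str(row.get(f) or "").strip()]
        if empty:
            missing[index] = empty

    # Count each non-empty registration number to find repeats.
    counts = Counter(
        str(row.get("registration_number") or "").strip() for row in records
    )
    duplicates = sorted(key for key, n in counts.items() if key and n > 1)
    return {"missing": missing, "duplicates": duplicates}
```

Rows reported under `missing` or `duplicates` would presumably be re-scraped or reviewed by hand before delivery.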
₹22,500 INR in 7 days
6.0

I’ll automate this end-to-end using Python (Playwright + asyncio + pandas + OpenCV/Tesseract + LLM API) to reliably handle CAPTCHA, scraping, PDF merging, and AI extraction at scale. I’ve built similar 50k+ record pipelines with OCR + multilingual NLP (English/Marathi via OpenAI/Claude + Indic models) ensuring structured outputs and high accuracy.
₹12,500 INR in 3 days
5.3

Hi, I can deliver the complete DaaS pipeline for 48,000 MahaRERA URLs including captcha handling, structured data extraction, PDF merging, and AI-based deal analysis.

Approach:
- Scalable scraping pipeline with session handling + OCR-based captcha solving
- Distributed workers (Python, asyncio/queue-based) to process high volume reliably
- Automated PDF download, stitching, and standardized naming
- AI/NLP layer using LLMs (GPT + Indic language support) with preprocessing (OCR cleanup, Marathi→English normalization)
- Classification + structured extraction for “Consideration Type” and concise summaries
- Data validation + deduplication to ensure clean final dataset

Deliverables:
- Clean CSV/XLSX with all required fields + AI outputs
- Zipped folder of merged, correctly named PDFs
- Milestone 1: 100 URLs fully processed for validation
- Milestone 2: Full dataset delivery

Estimate:
- Cost: ₹28,000 INR
- Timeline: POC (100 URLs): 2–3 days; Full delivery: 8–10 days after approval

AI/NLP Stack:
- OCR + text normalization for scanned Marathi/English docs
- LLM-based extraction (prompt + schema guided)
- Classification layer for deal types (Monetary, Barter, Revenue Share, etc.)
- Post-processing rules to improve accuracy and consistency

I have experience building large-scale scraping + AI enrichment pipelines (millions of records), so this can be delivered reliably with high accuracy. Ready to begin with the 100-URL milestone immediately.
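One possible form of the post-processing rules mentioned above is a step that maps whatever label the model returns onto the five categories named in the client's brief. This is a sketch under assumptions: `normalize_category` is a hypothetical helper, and treating the shorthand "Monetary" as "Pure Monetary" is a guess based on how the bid abbreviates the category list.

```python
# The five categories as listed in the client's brief.
CATEGORIES = (
    "Pure Monetary",
    "Barter",
    "Constructed Area Share",
    "Revenue Share",
    "Mixed",
)


def normalize_category(raw_label):
    """Map a free-text label to one of CATEGORIES, or None if it does not match."""
    text = str(raw_label or "").replace("-", " ").replace("_", " ")
    cleaned = " ".join(text.split()).lower()
    for category in CATEGORIES:
        if cleaned == category.lower():
            return category
    # Assumed shorthand for "Pure Monetary".
    if cleaned == "monetary":
        return "Pure Monetary"
    # No match: the caller can route the row to manual review.
    return None
```

Returning `None` instead of guessing keeps unexpected labels visible, so they can be counted and reviewed during the 100-URL milestone.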
₹25,000 INR in 7 days
4.8

Hi, I can convert your three static HTML layouts into clean Elementor templates using native Elementor widgets, not pasted HTML blocks. I understand the main goal is easy future editing for non-technical users, so I’ll rebuild each section with Elementor Pro widgets wherever possible, keeping custom CSS only for small effects that cannot be matched natively. After the three master pages are approved, I’ll duplicate them across the 10 WordPress sites and replace the content/images from the Word documents while keeping styling consistent. I’ll also test responsiveness on desktop, tablet, and mobile so the layouts hold up properly before publishing. My focus will be clean Elementor structure, reusable sections, consistent design, and pages your team can edit later without touching code. Please share the HTML files, Word docs, and site logins, and I can start by converting one master page first for your review. Best regards Ankit
₹12,500 INR in 3 days
2.9

Mumbai, India
Member since Sep 7, 2024
₹12500-37500 INR