Web Scraping

24 September 2024

I scraped 500+ real estate projects from private portals


Introduction

I scraped over 500 real estate projects from private portals. Here’s an overview of the technical challenges encountered and how this data can turn into business opportunities for my client.

The technical challenges of scraping private portals

Access to private portals

The first challenge was getting access to these portals. Luckily, with a few good contacts, we managed to get accounts for 6 major French real estate developers’ portals without too much trouble.

Outdated technologies and lack of API

Next, we had to deal with the tech behind these portals: old systems with no public API available, so I had to write some "homemade" scripts. But that’s also what makes scraping quite interesting.

Slow sites and periods of downtime

On top of that, the sites were often slow and would go "offline" at certain times, like nights and weekends. It’s a bit frustrating when you want to make progress and everything is at a standstill.

Technical solutions applied

HTTP requests vs. headless browsers

I always prefer direct HTTP requests over using a headless browser for scraping.
Why? Simply because headless browsers (like Puppeteer or Selenium) are very resource-intensive. They fully simulate a real user’s navigation, which is useful but often overkill for most scraping tasks. In this case, I managed to do everything with HTTP requests, avoiding the heaviness of a headless browser.

Spoofing the client to bypass limitations

One of the first steps was to structure my HTTP requests to mimic the behavior of a regular user on the site. You connect via the browser, check the console to see how it works (tokens, cookies, etc.), then reproduce all of that in the HTTP requests, injecting the right headers like User-Agent and cookies.

HTTP Request to fetch programs
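Here is a minimal sketch, in Python with the requests library, of what such a request to fetch the programs can look like. The portal URL, cookie name, and header values below are placeholders; the real ones come from what you observe in the browser’s console:

```python
import requests

# Minimal sketch: the URL, cookie and header values are placeholders,
# reproduced from what the browser sends after a manual login.
session = requests.Session()
session.headers.update({
    # Copy the exact User-Agent your browser sends.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://portal.example.com/programs",
})
# Replay the session cookie captured after logging in via the browser.
session.cookies.set("PHPSESSID", "value-copied-from-the-browser")

# Fetch the programs listing exactly as the browser would.
response = session.get("https://portal.example.com/programs", timeout=30)
response.raise_for_status()
html = response.text
```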

Adding delays to avoid detection

Since the scraping happened behind a login, it was easy to detect. To avoid drawing attention and getting blocked, I added random delays between each request. The idea was not to spam and to space the requests out over several hours. This made the scraping more discreet and respectful of the server’s rhythm, especially when the infrastructure was a bit fragile.
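In practice, a small wrapper around each request is enough; something along these lines (the delay bounds are arbitrary examples):

```python
import random
import time

def polite_get(session, url, min_delay=5.0, max_delay=45.0):
    """Fetch a URL, then sleep for a random duration to space out requests."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    # A random pause between requests: over hundreds of pages, this spreads
    # the crawl across several hours instead of hammering a fragile server.
    time.sleep(random.uniform(min_delay, max_delay))
    return response
```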

Using XPath to extract data


For some portals (the more modern ones), I found external API calls (used by the client) via the browser’s console. In those cases, it was easy: I just retrieved the data directly from the API, which greatly simplified the process. The main work then was formatting the data to match our fields.
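When such an internal API exists, a call like this rough sketch is usually all it takes (the endpoint, parameters, and field names are invented for the example; the real ones show up in the browser’s network tab):

```python
import requests

session = requests.Session()  # same spoofed session as in the earlier sketch

# Hypothetical endpoint discovered via the browser console.
api_response = session.get(
    "https://portal.example.com/api/programs",
    params={"page": 1, "pageSize": 50},
    timeout=30,
)
api_response.raise_for_status()
payload = api_response.json()

# The remaining work is mostly reshaping fields to match our own schema.
for program in payload.get("items", []):
    print(program.get("id"), program.get("name"))
```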

For the other portals without an API, I used XPath to extract the specific elements from each page. In my opinion, this is the best solution to precisely target elements on a webpage. So, for each crawler that didn’t have an API, I created a dedicated XPath file with all the necessary elements. This allowed me to efficiently automate the collection of data from these static or semi-dynamic pages.
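As an illustration, here is a simplified version of that setup with lxml; the XPath expressions and field names are made up for the example, the real ones live in each crawler’s dedicated file:

```python
from lxml import html

# Hypothetical per-portal "XPath file": one mapping of field name -> XPath,
# kept separate for each crawler so selectors stay easy to maintain.
PROGRAM_XPATHS = {
    "name": "//h1[@class='program-title']/text()",
    "city": "//span[@class='program-city']/text()",
    "lot_rows": "//table[@id='lots']//tr[td]",
}

def extract_program(page_html: str) -> dict:
    """Pull the raw values out of a program page; formatting comes later."""
    tree = html.fromstring(page_html)
    return {field: tree.xpath(xpath) for field, xpath in PROGRAM_XPATHS.items()}
```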

Data formatting and structuring

This is where the bulk of the work begins: fitting the square pegs into the square holes and the triangles into the triangle holes. Each field must match our own data structure (type and name), and that sometimes requires quite a bit of finesse to get everything aligned.

Identifying the data structure

In this type of portal, the data structure is fairly standard. You have real estate programs, and for each program, a list of lots that varies in length depending on the project size. Each lot or program has associated documents and photos. It was obvious I’d have to scrape multiple pages for each program and link them together. This meant clearly identifying the relationships between these pages and flagging the data correctly to connect everything in the end.
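To make that concrete, here is a rough sketch of the structure, with illustrative names rather than the platform’s actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Attachment:
    url: str
    kind: str        # e.g. "document" or "photo"

@dataclass
class Lot:
    reference: str
    surface_m2: Optional[float] = None
    attachments: List[Attachment] = field(default_factory=list)

@dataclass
class Program:
    portal_id: str   # ID on the source portal, used to link the pages together
    name: str
    lots: List[Lot] = field(default_factory=list)
    attachments: List[Attachment] = field(default_factory=list)
```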

Handling files and photos

Extracting documents and photos wasn’t a big problem, except for some portals where MIME types weren’t properly configured. For example, I received a PDF file that was identified as a PNG MIME type. Needless to say, that was a bit tricky to handle. I had to add specific conditions to process these types of cases.
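One simple way to handle it is to sniff the file’s first bytes instead of trusting the declared MIME type; something along these lines:

```python
def detect_real_type(content: bytes, declared_mime: str) -> str:
    """Guess the actual file type from its magic bytes, since the declared
    Content-Type is sometimes wrong (e.g. a PDF served as image/png)."""
    if content.startswith(b"%PDF"):
        return "application/pdf"
    if content.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if content.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # Otherwise fall back to whatever the server declared.
    return declared_mime
```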

Other than that, file extraction was mostly a matter of patience. With slow servers, it was necessary to wait calmly for everything to download.

Data formatting

The most tedious but essential part was restructuring the data, especially after extraction via XPath. My method is to write a first script that retrieves everything possible, even the seemingly unnecessary details. Once I have all the data, I sort through it, group it, and format each element. This includes formatting numbers, identifying enumerations, and matching the fields with our own data structure.

Field uniformity

A simple example: formatting the lot surface areas. Sometimes you find "42m²", other times "42m2", and sometimes just "42", or even "42 m²" with a space. This example is basic, but I also encountered much more complex cases, with many variations like this that had to be managed.
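As an illustration, a small helper along these lines normalizes all of those variants to a single numeric value:

```python
import re
from typing import Optional

def parse_surface(raw: str) -> Optional[float]:
    """Turn '42m²', '42m2', '42 m²' or '42' into 42.0; return None if no number."""
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*m?\s*[²2]?", raw.strip())
    if not match:
        return None
    return float(match.group(1).replace(",", "."))
```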


Results: actionable data and competitive advantage

Volume of data collected

In the end, after running the 6 crawlers, I scraped over 500 real estate programs, with hundreds of thousands of lots, documents, and photos. My client directly integrated this data into their platform, which now allows them to resell this information to their own clients, giving them privileged access to data they wouldn’t have been able to obtain otherwise.

Business impact

The data extracted isn’t just a bunch of numbers or listings: for companies in the real estate sector, it’s a real lever. For example, knowing in advance which real estate projects are in the pipeline allows you to position yourself before tenders, prepare bids, or even contact developers directly to anticipate their needs. Exploiting this information can represent millions of euros in opportunities. It’s a huge competitive advantage in an ultra-competitive market.

Lessons learned and best practices

  • Favor HTTP requests: When possible, avoid headless browsers. It’s lighter and faster.
  • Take the time to properly format the data: It’s sometimes tedious, but it makes data processing much smoother later on and opens up more possibilities for exploitation.

Conclusion: scraping, a major competitive advantage

Each scraping project has its own specificities and challenges, but when done right, it allows access to crucial information for businesses. Scraping isn’t just about extracting data: it’s a strategic lever for generating new business opportunities and positioning yourself in competitive markets.


Anthony LEDRU

Web Scraping & SaaS Development

About me

I specialize in the development of SaaS solutions and large-scale web scraping.