A need for speed
This article describes how I used aiohttp and rotating proxies to scrape data extremely fast. The entire code can be found here.
Recently I scraped some data off a backend API for a client. He needed roughly 10,000 data points with scores from darts tournaments. I was able to reverse engineer the API by analyzing network traffic and recreating requests with Insomnia. I then set up a script to loop through each page. A pretty standard procedure for web scraping.
I found that the API had quite a bit of latency: each request was taking up to 10 seconds to complete. At that rate, I would have been waiting for days to get my data.
I was able to scrape the entire dataset in less than 30 minutes using the techniques described here.
The solution
It was time to pull out the big guns: aiohttp. Aiohttp is an asynchronous HTTP client/server library for Python with many features. Asynchronous (async) programming allows you to run a multitude of I/O-bound tasks concurrently. I also used Python's asyncio module, which provides a high-level interface for asynchronous programming.
Asyncio introduces an entirely different programming paradigm, so it can be challenging to learn at first. Superfastpython has some good tutorials on it.
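Before diving into the scraper, here is a minimal sketch (not from the original project) of what fetching a single page with aiohttp and asyncio looks like; the URL is just a placeholder:

import asyncio
import aiohttp

async def fetch(url: str) -> str:
    """Fetch a single page and return its body as text."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

# asyncio.run() starts an event loop and runs the coroutine to completion
html = asyncio.run(fetch("https://example.com"))
print(len(html))

The await keyword hands control back to the event loop while waiting on the network, which is what lets many requests overlap instead of blocking one after another.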
Getting started
Import the following libraries to get started:
import aiohttp # Async HTTP library
import asyncio # Python asynchronous programming library
import json # For saving the scraped data to disk
from contextlib import contextmanager # Useful decorator for context management
import random # For randomly selecting a proxy
Proxy allocation
To avoid being blocked for sending thousands of simultaneous requests, proxies are necessary. I accomplished this with a list of datacenter proxies from Smartproxy.
First, read your proxy list from a text file.
# Read your proxy list from a .txt file
with open("./datacenter_proxies.txt") as f:
    proxy_list = [line.strip() for line in f.readlines()]
When rotating proxies, we could simply use random.choice() to pick one for each request. The problem is that the same proxy may end up serving several requests at once, which could result in an IP ban. To avoid this, we use Python's context manager protocol to "allocate" a proxy to a task.
proxies_in_use = []

@contextmanager
def allocate_proxy():
    """Get a free proxy and hold it as in use"""
    # Select proxies that are not currently in use
    available_proxies = [p for p in proxy_list if p not in proxies_in_use]
    if available_proxies:
        proxy = random.choice(available_proxies)
    else:
        # If no proxy is free, fall back to random.choice over the entire list
        proxy = random.choice(proxy_list)
    try:
        proxies_in_use.append(proxy)
        yield proxy
    finally:
        proxies_in_use.remove(proxy)
Because of the @contextmanager decorator, we can use this function in a "with" statement. This makes for clean, manageable code when scaling up.
with allocate_proxy() as proxy:
    # do stuff with proxy...
Preparing scraping functions
With proxy allocation set up, I can now write the scraping code. First, define an async function that scrapes a URL using a pre-existing aiohttp.ClientSession object.
async def scrape_page(url: str, session: aiohttp.ClientSession):
    """Scrape the JSON from a page"""
    with allocate_proxy() as proxy:
        async with session.get(url, proxy=proxy, timeout=5000) as resp:
            print("Success!")
            return await resp.json()
Firing off thousands of requests at once runs into concurrency limits: aiohttp's default connection pool allows 100 simultaneous connections per session, and the operating system also caps the number of open sockets (often around 1024). The workaround is an asyncio.Semaphore. Semaphores are an old-school synchronization primitive, normally intended to restrict access to a shared resource; here it gives us explicit control over how many requests are in flight at once. More details on this can be found here.
async def bound_scrape(url, session, semaphore: asyncio.Semaphore):
    """Scrape the page, bounded by the semaphore"""
    async with semaphore:
        return await scrape_page(url, session)
Scraping the data
Now that we have our scraping code ready to go, we can write a main() function to scrape the data.
async def main():
    base_url = "https://api-igamedc.igamemedia.com/api/MssWeb/fixtures/" # The API url
    sem = asyncio.Semaphore(25000) # Create a Semaphore
    empty_page_count = 0
    page_range = (1000, 10244)
    tasks = []
    async with aiohttp.ClientSession() as session:
        # Schedule a scraping task for each page
        for page in range(*page_range):
            url = base_url + str(page)
            tasks.append(asyncio.create_task(bound_scrape(url, session, sem)))
            await asyncio.sleep(0.2) # Small delay between tasks to avoid completely slamming the server
            print("Request sent.")
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results
We can now run our main() scraping function using asyncio.run(). After that, we filter out any exceptions and save our data.
raw_data = asyncio.run(main())
data = [d for d in raw_data if not isinstance(d, Exception)] # Remove any exceptions from list of data
with open("data.json", "w") as f: # Save the data
json.dump(data, f)
Conclusion
Writing this scraper would have been much quicker, and taken fewer lines of code, if I had stuck with the requests module. It would have been a 10 minute job at most to write it that way, whereas writing and debugging this code took me about an hour. However, due to the amount of data and the latency, getting my data synchronously would have taken days. Whether asynchronous programming is right for the job depends on the use case. In my case, it was very useful.
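For comparison, a blocking version built on requests might have looked roughly like this (a sketch, not the code I actually ran; the endpoint and page range mirror those above):

import json
import requests

base_url = "https://api-igamedc.igamemedia.com/api/MssWeb/fixtures/"
data = []
for page in range(1000, 10244):
    # Each call blocks for the full round trip before the next one starts
    resp = requests.get(base_url + str(page), timeout=10)
    if resp.ok:
        data.append(resp.json())

with open("data.json", "w") as f:
    json.dump(data, f)

Simpler to write, but with each request taking up to 10 seconds, the total runtime is what made the async version worth the extra effort.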