Scraping TikTok Profiles At Scale With Asynchronous Python

Why use asynchronous web scraping

When it comes to efficient data extraction, not all scraping methods are created equal. This guide will show you how to scale up your TikTok data collection with concurrency.

The techniques presented in this article have the potential to cut a scraping job from an overnight run down to a matter of minutes. All of the code for this project can be found on our GitHub.

Traditional scraping methods

Most guides will point you towards one of two methods: Selenium or Python Requests. Used in the usual way, both have the major pitfall of operating synchronously. In other words, you can only send a single request at a time and must wait for each one to complete before sending the next. Additionally, Selenium has to drive a full browser, which gives it a high computational overhead and makes automation scripts fragile.

Choosing our weapons: aiohttp and proxies

While the methods listed above work just fine in many cases, they are not ideal when you need to scrape a lot of data, fast. Today, we will be using an asynchronous library called aiohttp to accomplish this task.

TikTok hands out IP bans like candy to web scrapers, so we need some good proxies to make sure our requests get through. I use a rotating datacenter proxy service from Bright Data, and I cannot recommend them highly enough.
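To make the difference concrete, here is a minimal sketch that compares waiting on five requests one-by-one with running them concurrently. No real requests are sent: fake_request is a hypothetical stand-in that just sleeps for 0.1 seconds to imitate network latency, and the names in NAMES are placeholders.

```python
import asyncio
import time


async def fake_request(name: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an HTTP request: sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return name


async def run_sequentially(names):
    # Each request must finish before the next one starts.
    return [await fake_request(name) for name in names]


async def run_concurrently(names):
    # All requests are in flight at once; gather returns results in input order.
    return await asyncio.gather(*(fake_request(name) for name in names))


def timed(coroutine_factory):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine_factory())
    return result, time.perf_counter() - start


NAMES = ["a", "b", "c", "d", "e"]  # placeholder names
seq_result, seq_time = timed(lambda: run_sequentially(NAMES))
con_result, con_time = timed(lambda: run_concurrently(NAMES))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run takes roughly the sum of all the delays, while the concurrent run takes roughly as long as the slowest single request. Real requests behave the same way as long as the time is spent waiting on the network.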

Inspecting the TikTok profile data

As with any scraping task, the first step is to inspect our target. Navigate to this profile on TikTok and open the browser's HTML inspector. In the "Search HTML" bar within the inspector, type "SIGI_STATE". This should take you to a script tag with an ID of "SIGI_STATE" and a type of application/json. You should see something like the following screenshot:

[Screenshot: TikTok profile SIGI_STATE data]

This JSON contains all the profile data, along with data for the profile's latest 30 videos. We will simply extract the contents of this script tag using bs4 and parse the JSON using Python's json module.

I first saw this method implemented by David Teather in TikTok-API, an unofficial API for TikTok scraping.

Sending an HTTP request to TikTok

First things first, we will send our HTTP request to fetch the data from TikTok. Start by importing aiohttp, BeautifulSoup, asyncio, and json.

import aiohttp # run pip install aiohttp to install
from bs4 import BeautifulSoup # run pip install bs4 to install
import asyncio
import json

Next, send an HTTP request and read the response as text. The syntax here is a little more complex than with the requests module because it uses the async context manager protocol.

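url = "https://www.tiktok.com/@therock"
proxy = "my_rotating_proxy_url_here" # Enter your entire proxy URL - authentication and all.

async def fetch_html(url: str) -> str:
    """Fetch a page through the proxy and return the response body as text"""
    async with aiohttp.ClientSession() as session:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()

text = asyncio.run(fetch_html(url)) # "async with" is only valid inside a coroutine, so we run one here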

If your proxies are good, the request should go through without a problem. You can wrap your requests in a retry block for stability, just in case.
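A retry block can be as simple as a loop around the request. The sketch below is one possible shape for it, not the only one: with_retries and FlakyRequest are hypothetical names, the growing delay between attempts is my own assumption, and the "request" is simulated so the example runs without a network connection.

```python
import asyncio


async def with_retries(make_attempt, attempts: int = 5, base_delay: float = 0.01):
    """Await make_attempt() until it succeeds or the attempts (at least 1) run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await make_attempt()
        except Exception as exc:  # in real code, catch narrower exception types
            last_error = exc
            if attempt < attempts - 1:
                # Wait a little longer after each failure before trying again.
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise last_error


class FlakyRequest:
    """Simulated request that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated failure")
        return "ok"


flaky = FlakyRequest(failures=2)
result = asyncio.run(with_retries(flaky))
print(result, "after", flaky.calls, "calls")
```

In the scraper, make_attempt would be a function that sends the aiohttp request and returns the response text.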

Extracting the profile JSON data

The next step is to extract the profile JSON data from the script tag with ID SIGI_STATE. To do this, we will simply parse the HTML using BeautifulSoup and select the tag by ID.

soup = BeautifulSoup(text, "html.parser") # Parse the HTML with an explicitly chosen parser
script_tag = soup.select_one("script#SIGI_STATE") # Select the SIGI_STATE tag (None if the tag is missing)
tag_contents = script_tag.contents[0] # Extract the contents
sigi_json = json.loads(tag_contents) # Load it into a python dictionary

There is a lot of data in this JSON, but we are most interested in the profile. To extract this, we simply need to look up the UserModule key in the dictionary.

profile_data = sigi_json.get("UserModule")
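If you want to try the extraction step without hitting TikTok at all, the sketch below runs the same idea against a made-up HTML string. Note that it swaps bs4 for Python's built-in html.parser module so that it has no dependencies, and that the contents of SAMPLE_HTML (including the "example" key) are placeholders rather than real TikTok data. ScriptById and extract_user_module are hypothetical helper names.

```python
import json
from html.parser import HTMLParser


class ScriptById(HTMLParser):
    """Collects the text inside the <script> tag that has the given id."""

    def __init__(self, target_id: str):
        super().__init__()
        self.target_id = target_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.target_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_user_module(html: str):
    """Return the UserModule entry, or None if the SIGI_STATE tag is absent."""
    parser = ScriptById("SIGI_STATE")
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    if not raw:
        return None
    return json.loads(raw).get("UserModule")


# Placeholder page: the JSON here is invented for illustration only.
SAMPLE_HTML = (
    '<html><body><script id="SIGI_STATE" type="application/json">'
    '{"UserModule": {"example": 1}}'
    "</script></body></html>"
)
print(extract_user_module(SAMPLE_HTML))
```

Returning None when the tag is missing is a deliberate choice here: a blocked or redirected request often comes back as a page without the SIGI_STATE tag, and it is easier to filter out None than to debug an AttributeError.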

Speeding up with concurrency

That wasn't too bad, but the whole idea behind using aiohttp is to send concurrent requests. To do that, we will take a list of TikTok usernames and create an asyncio task to scrape each one of them. We will let the tasks run concurrently, and then gather the results. For this example, I will use a small list of 5 profiles, but this method scales easily into the thousands.

To do this, we will wrap the scraping code we wrote earlier into an async function. I am wrapping this code in a retry block in case some requests don't get through the first time.

async def scrape_profile(username: str):
    """Scrape a single TikTok profile by username. Returns None if every attempt fails."""
    url = "https://tiktok.com/@" + username
    for i in range(5): # Retry up to 5 times
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, proxy=proxy) as response:
                    text = await response.text()
                    soup = BeautifulSoup(text, "html.parser") # Parse the HTML
                    script_tag = soup.select_one("script#SIGI_STATE") # Select the SIGI_STATE tag
                    tag_contents = script_tag.contents[0] # Extract the contents
                    sigi_json = json.loads(tag_contents) # Load it into a python dictionary
                    return sigi_json.get("UserModule")
        except Exception as e:
            print(f"Request {i} failed with error {str(e)}")
    return None # All attempts failed

Now, let's run this function on each of the target profiles in our list. To do this we will create a "task" to run our scrape_profile function for each profile, and then "gather" the results. We will pass return_exceptions=True to the asyncio.gather() function so that any exceptions are returned alongside the results instead of being raised, which lets us handle them ourselves.

async def main():
    """Function from which we call our scraping function"""
    usernames = ["therock", "nyjah.huston", "kingrygarcia", "codescope", "khabylame"]
    tasks = [asyncio.create_task(scrape_profile(user)) for user in usernames] # Create scrape_profile() task for each profile in our list of usernames.
    results = await asyncio.gather(*tasks, return_exceptions=True) # Wait for all tasks and collect their results
    return [res for res in results if res is not None and not isinstance(res, Exception)] # Keep only the profiles that were scraped successfully
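To see what return_exceptions=True actually does, here is a small self-contained sketch with a simulated scraper. fake_scrape and the usernames are hypothetical, and one of them is made to fail on purpose. Because gather returns results in the same order as its inputs, we can also pair each exception with the username that caused it.

```python
import asyncio


async def fake_scrape(username: str) -> dict:
    """Hypothetical stand-in for scrape_profile that fails for one username."""
    await asyncio.sleep(0)
    if username == "bad_user":
        raise ValueError("simulated failure")
    return {"username": username}


async def scrape_all(usernames):
    # Exceptions are returned in the results list instead of being raised.
    results = await asyncio.gather(
        *(fake_scrape(user) for user in usernames), return_exceptions=True
    )
    succeeded = [res for res in results if not isinstance(res, Exception)]
    # Results keep the input order, so zip lines each one up with its username.
    failed = [user for user, res in zip(usernames, results) if isinstance(res, Exception)]
    return succeeded, failed


succeeded, failed = asyncio.run(scrape_all(["alice", "bad_user", "bob"]))
print("succeeded:", succeeded)
print("failed:", failed)
```

Keeping the list of failed usernames makes it easy to retry just those profiles in a second pass.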

Finally, run the main() function and save your result as a JSON file.

data = asyncio.run(main()) # Call main() to create the coroutine, then run it to completion
with open("output_file.json", "w") as f:
    json.dump(data, f)
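When the list of usernames grows into the thousands, you may not want every request in flight at the same moment. One common way to cap this, sketched below, is an asyncio.Semaphore. This is an addition of mine rather than part of the project code: scrape_with_limit and CountingWorker are hypothetical names, the limit of 3 is arbitrary, and the worker only simulates a request so that the example can show the cap being respected.

```python
import asyncio


async def scrape_with_limit(usernames, worker, max_concurrency: int = 3):
    """Run worker(username) for every username, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(username):
        async with semaphore:  # waits here while max_concurrency workers are busy
            return await worker(username)

    return await asyncio.gather(
        *(limited(user) for user in usernames), return_exceptions=True
    )


class CountingWorker:
    """Simulated scraper that records how many calls were active at once."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def __call__(self, username: str) -> str:
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # pretend to wait on the network
        self.active -= 1
        return username.upper()


worker = CountingWorker()
usernames = [f"user{i}" for i in range(10)]  # placeholder usernames
results = asyncio.run(scrape_with_limit(usernames, worker, max_concurrency=3))
print("peak concurrency:", worker.peak)
```

In the real scraper you would pass scrape_profile as the worker and tune max_concurrency to whatever your proxy plan and machine can comfortably handle.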

Conclusion

While using aiohttp adds a little more complexity and a few more lines of code, it is an extremely effective way to speed up and scale your scraping. This is one method Toughdata uses to scrape profiles in our TikTok follower scraper. All of the code for this project can be found on our GitHub.