Fast scraping in python with asyncio

Web scraping is one of those subjects that often appears in python discussions. There are many ways to do this, and there doesn't seem to be one best way. There are fully fledged frameworks like scrapy and more lightweight libraries like mechanize. Do-it-yourself solutions are also popular: one can go a long way by using requests and beautifulsoup or pyquery.

The reason for this diversity is that "scraping" actually covers multiple problems: you don't need the same tool to extract data from hundreds of pages and to automate some web workflow (like filling a few forms and getting some data back). I like the do-it-yourself approach because it's flexible, but it's not well-suited for massive data extraction, because requests performs its HTTP requests synchronously, so many requests mean a long wait.

In this blog post, I'll present an alternative to requests based on the new asyncio library: aiohttp. I use it to write small scrapers that are really fast, and I'll show you how.

Basics of asyncio

asyncio is the asynchronous IO library that was introduced in python 3.4. You can also get it from pypi on python 3.3. It's quite complex and I won't go into too much detail. Instead, I'll explain what you need to know to write asynchronous code with it. If you want to know more about it, I invite you to read its documentation.

To make it simple, there are two things you need to know about: coroutines and event loops. Coroutines are like functions, but they can be suspended and resumed at certain points in the code. This is used to pause a coroutine while it waits for IO (an HTTP request, for example) and execute another one in the meantime. We use the yield from syntax to state that we want the return value of a coroutine. An event loop is used to orchestrate the execution of the coroutines.

There is much more to asyncio, but that's all we need to know for now. It might be a little unclear for now, so let's look at some code.
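
Before bringing in aiohttp, here is a minimal sketch of the two ideas above using only the standard library. It is written with the async def / await syntax that later python versions (3.5+) provide in place of asyncio.coroutine and yield from, and asyncio.sleep stands in for the IO wait. The worker and main names are made up for the illustration.

```python
import asyncio

async def worker(name, delay, log):
    # Note the start, then pause: while this coroutine "waits for IO"
    # (simulated by asyncio.sleep), the event loop runs the other one.
    log.append('{} started'.format(name))
    await asyncio.sleep(delay)
    log.append('{} finished'.format(name))
    return name

async def main():
    log = []
    # gather() schedules both coroutines on the event loop and waits for
    # both; results come back in argument order, not completion order.
    results = await asyncio.gather(worker('slow', 0.2, log),
                                   worker('fast', 0.1, log))
    return log, results

log, results = asyncio.run(main())
print(log)
print(results)
```

The log shows 'fast' starting before 'slow' has finished, which is exactly the suspend-and-resume behaviour described above.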

aiohttp

aiohttp is a library designed to work with asyncio, with an API that looks like requests'. It's not very well documented for now, but there are some very useful examples. We'll first show its basic usage.

First, we'll define a coroutine to get a page and print it. We use asyncio.coroutine to decorate a function as a coroutine. aiohttp.request is a coroutine, and so is the read method, so we'll need to use yield from to call them. Apart from that, the code looks pretty straightforward:

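import asyncio
import aiohttp

@asyncio.coroutine
def print_page(url):
    # Both the request and reading the body are coroutines, hence yield from.
    response = yield from aiohttp.request('GET', url)
    body = yield from response.read_and_close(decode=True)
    print(body)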

As we have seen, we can call a coroutine from another coroutine with yield from. To call a coroutine from synchronous code, we'll need an event loop. We can get the standard one with asyncio.get_event_loop() and run the coroutine on it using its run_until_complete() method. So, all we have to do to run the previous coroutine is:

loop = asyncio.get_event_loop()
loop.run_until_complete(print_page('http://example.com'))

A useful function is asyncio.wait, which takes a list of coroutines and returns a single coroutine that wraps them all, so we can write:

loop.run_until_complete(asyncio.wait([print_page('http://example.com/foo'),
                                      print_page('http://example.com/bar')]))
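
To see what asyncio.wait gives back, here is a small sketch that doesn't touch the network. The fetch coroutine is a hypothetical stand-in for print_page that sleeps instead of doing a request, and the code uses async/await plus explicit tasks, which is what recent python versions expect asyncio.wait to receive.

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for print_page: sleeps instead of doing an HTTP request.
    await asyncio.sleep(delay)
    return name

async def main():
    # Wrap the coroutines in tasks before handing them to asyncio.wait.
    tasks = [asyncio.ensure_future(fetch('foo', 0.05)),
             asyncio.ensure_future(fetch('bar', 0.01))]
    done, pending = await asyncio.wait(tasks)
    # done is a set, so its order is arbitrary; sort the results for display.
    return sorted(task.result() for task in done), len(pending)

names, n_pending = asyncio.run(main())
print(names, n_pending)
```

By default asyncio.wait only returns once everything has finished, so the pending set comes back empty here.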

Another one is asyncio.as_completed, which takes a list of coroutines and returns an iterator that yields them in the order in which they complete, so that when you iterate on it, you get each result as soon as it's available.
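
A quick sketch of that ordering, again with a hypothetical fetch coroutine that sleeps instead of downloading anything, and written with async/await:

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for a download that takes `delay` seconds.
    await asyncio.sleep(delay)
    return name

async def main():
    tasks = [asyncio.ensure_future(fetch('slow', 0.15)),
             asyncio.ensure_future(fetch('fast', 0.01)),
             asyncio.ensure_future(fetch('medium', 0.08))]
    order = []
    # Each item from as_completed is an awaitable; awaiting it gives the
    # result of whichever task finishes next.
    for next_done in asyncio.as_completed(tasks):
        order.append(await next_done)
    return order

order = asyncio.run(main())
print(order)
```

The results arrive by increasing delay, not in the order the tasks were listed.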

Scraping

Now that we know how to do asynchronous HTTP requests, we can write a scraper. The only other part we need is something to read the html. I use beautifulsoup for that, but others like pyquery or lxml would work too.

For this example, we'll write a small scraper to get the torrent links for various linux distributions from the pirate bay.

First of all, a helper coroutine to perform GET requests:

@asyncio.coroutine
def get(*args, **kwargs):
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close(decode=True))

The parsing part. This post is not about beautifulsoup, so I'll keep it dumb and simple: we get the first magnet link on the page:

import bs4

def first_magnet(page):
    soup = bs4.BeautifulSoup(page)
    # The first matching link is the top search result.
    a = soup.find('a', title='Download this torrent using magnet')
    return a['href']
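
If you want to try the parsing step without any network access or third-party package, here is a rough equivalent built on the standard library's html.parser instead of beautifulsoup. The HTML snippet is made up for the example (it is not real pirate bay markup), and FirstMagnetParser and first_magnet_stdlib are hypothetical names. Unlike first_magnet above, which would raise a TypeError on a page with no magnet link, this sketch returns None in that case.

```python
from html.parser import HTMLParser

MAGNET_TITLE = 'Download this torrent using magnet'

class FirstMagnetParser(HTMLParser):
    """Remembers the href of the first <a> tag with the magnet title."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only keep the first match; later links are ignored.
        if self.href is None and tag == 'a' and attrs.get('title') == MAGNET_TITLE:
            self.href = attrs.get('href')

def first_magnet_stdlib(page):
    parser = FirstMagnetParser()
    parser.feed(page)
    return parser.href  # None when the page has no magnet link

# Made-up page with two results; we expect the first magnet link back.
sample = '''
<div><a href="/torrent/1">Details</a>
<a href="magnet:?xt=urn:btih:aaa" title="Download this torrent using magnet">m</a>
<a href="magnet:?xt=urn:btih:bbb" title="Download this torrent using magnet">m</a></div>
'''
print(first_magnet_stdlib(sample))
```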

The coroutine. With this url, results are sorted by number of seeders, so the first result is actually the most seeded:

@asyncio.coroutine
def print_magnet(query):
    url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
    page = yield from get(url, compress=True)
    magnet = first_magnet(page)
    print('{}: {}'.format(query, magnet))

Finally, the code to call all of this:

distros = ['archlinux', 'ubuntu', 'debian']
loop = asyncio.get_event_loop()
f = asyncio.wait([print_magnet(d) for d in distros])
loop.run_until_complete(f)

Conclusion

And there you go, you have a small scraper that works asynchronously. That means the various pages are being downloaded at the same time, so with three pages this example takes roughly a third of the time the same code would take with requests. You should now be able to write your own scrapers in the same way.
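
To get a feel for where the speed-up comes from, here is a sketch that replaces the downloads with a fixed 0.1 second sleep. fake_download, sequential, concurrent and timed are hypothetical helpers; the sequential version awaits one download at a time, which is how a synchronous library behaves, and the code uses async/await. Real timings depend on the network and the server, so take the numbers as an illustration only.

```python
import asyncio
import time

async def fake_download(distro):
    # Pretend every page takes 0.1 seconds to download.
    await asyncio.sleep(0.1)
    return distro

async def sequential(distros):
    # One download at a time: total time is the sum of the waits.
    return [await fake_download(d) for d in distros]

async def concurrent(distros):
    # All downloads in flight at once: total time is about the longest wait.
    return await asyncio.gather(*(fake_download(d) for d in distros))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

distros = ['archlinux', 'ubuntu', 'debian']
seq_result, t_seq = timed(sequential(distros))
con_result, t_con = timed(concurrent(distros))
print('sequential: {:.2f}s, concurrent: {:.2f}s'.format(t_seq, t_con))
```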

You can find the resulting code, including the bonus tracks, in this gist.

Once you are comfortable with all this, I recommend you take a look at asyncio's documentation and aiohttp examples, which will show you all the potential asyncio has.

One limitation of this approach (in fact, any hand-made approach) is that there doesn't seem to be a standalone library to handle forms. Mechanize and scrapy have nice helpers to easily submit forms, but if you don't use them, you'll have to do it yourself. This is something that bugs me, so I might write such a library at some point (but don't count on it for now).

Bonus track: don't hammer the server

Doing 3 requests at the same time is cool; doing 5000, however, is not so nice. If you try to do too many requests at the same time, connections might start to get closed, or you might even get banned from the website.

To avoid this, you can use a semaphore. It is a synchronization tool that can be used to limit how many coroutines run a given piece of code at the same time. We'll just create the semaphore before creating the loop, passing as an argument the number of simultaneous requests we want to allow:

sem = asyncio.Semaphore(5)

Then, we just replace:

page = yield from get(url, compress=True)

by the same thing, but protected by a semaphore:

with (yield from sem):
    page = yield from get(url, compress=True)

This will ensure that at most 5 requests can be done at the same time.
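
You can check the limit without a server by counting how many coroutines are inside the protected block at once. This is a sketch with made-up names (limited_fetch, stats), a sleep in place of the request, and the async with form of the semaphore from the async/await syntax. Since it relies on asyncio.run, the semaphore is created inside main so that it belongs to the running loop.

```python
import asyncio

async def limited_fetch(sem, stats):
    async with sem:
        # Inside the semaphore: count ourselves in and track the peak.
        stats['running'] += 1
        stats['peak'] = max(stats['peak'], stats['running'])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        stats['running'] -= 1

async def main(n_requests, limit):
    sem = asyncio.Semaphore(limit)
    stats = {'running': 0, 'peak': 0}
    await asyncio.gather(*(limited_fetch(sem, stats)
                           for _ in range(n_requests)))
    return stats

stats = asyncio.run(main(50, 5))
print('peak concurrency:', stats['peak'])
```

Even with 50 coroutines started together, the peak never goes above the value passed to the semaphore.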

Bonus track: progress bar

This one comes for free: tqdm is a nice library to make progress bars. This coroutine works just like asyncio.wait, but displays a progress bar indicating the completion of the coroutines passed to it:

import tqdm

@asyncio.coroutine
def wait_with_progress(coros):
    # as_completed has no length, so tqdm needs the total passed explicitly.
    for f in tqdm.tqdm(asyncio.as_completed(coros), total=len(coros)):
        yield from f
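
If tqdm isn't installed, the same idea works with a plain counter. This is a sketch in async/await syntax with hypothetical names (wait_with_counter, report, job); unlike the version above it also collects the results, in completion order.

```python
import asyncio

async def wait_with_counter(coros, report=print):
    tasks = [asyncio.ensure_future(c) for c in coros]
    results = []
    # Report progress each time one more task has completed.
    for done, f in enumerate(asyncio.as_completed(tasks), start=1):
        results.append(await f)
        report('{}/{} done'.format(done, len(tasks)))
    return results

async def job(n):
    # Stand-in for a download; a larger n takes longer.
    await asyncio.sleep(0.02 * n)
    return n

async def main():
    lines = []
    results = await wait_with_counter([job(3), job(1), job(2)],
                                      report=lines.append)
    return lines, results

lines, results = asyncio.run(main())
print(lines)
print(results)
```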