Get the IMDb Top 250 list

Datetime:2016-08-22 22:27:18          Topic: HTML  Python           Share

Problem

From IMDb you want to get the list of the Top 100 movies.

Solution

There is a Top 250 list here: http://akas.imdb.com/chart/top . To access IMDb info, I use the excellent imdbpy package. It has a get_top250_movies() function but it returns an empty list :)

During my research I found this post on SO. It suggests that one should download the official IMDb dump from here . The Top 250 list is in the file ratings.list.gz . However, this file doesn’t contain the IMDb IDs of the movies, so it’s good for nothing :(

There was only one solution left: let’s do some scraping. Here is the Python code that did the job for me. I didn’t use BeautifulSoup just plain ol’ regular expressions:

import requests
import re

top250_url = "http://akas.imdb.com/chart/top"

def get_top250():
    r = requests.get(top250_url)
    html = r.text.split("\n")
    result = []
    for line in html:
        line = line.rstrip("\n")
        m = re.search(r'data-titleid="tt(\d+?)">', line)
        if m:
            _id = m.group(1)
            result.append(_id)
    #
    return result

It returns the IMDb IDs of the Top 250 movies. Then, using the imdbpy package you can ask all the information about a movie, since you have the movie ID.

Links





About List