S3 datasets

IMDb distributes some of its data as downloadable datasets. Cinemagoer can import this data into a database and make it accessible through its API. [1]

For this, you will first need to install SQLAlchemy and the libraries that are needed for the database server you want to use. Check out the SQLAlchemy dialects documentation for more detail.

Then, follow these steps:

1. Download the files from the following address and put all of them in the same directory: https://datasets.imdbws.com/
2. Create a database. Use a collation like utf8_unicode_ci.
3. Import the data using the s32cinemagoer.py script:

s32cinemagoer.py /path/to/the/tsv.gz/files/ URI

URI is the identifier used to access the SQL database. For example:

s32cinemagoer.py ~/Download/imdb-s3-dataset-2018-02-07/ \
    postgresql://user:password@localhost/imdb
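
Before starting a long import, it may be worth confirming that the directory given as the first argument really contains readable .tsv.gz files. The following is a minimal sketch using only the standard library. The function name `list_dataset_headers` is hypothetical and not part of Cinemagoer, and the code assumes the files are UTF-8, tab-separated, and start with a header row.

```python
import csv
import gzip
from pathlib import Path


def list_dataset_headers(directory):
    """Map each .tsv.gz file name in `directory` to its header columns.

    Hypothetical helper for illustration; not part of Cinemagoer.
    """
    headers = {}
    for path in sorted(Path(directory).expanduser().glob("*.tsv.gz")):
        # Open the gzipped file in text mode and read only the first row.
        with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
            # QUOTE_NONE: treat quote characters in the data as plain text.
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            headers[path.name] = next(reader, [])
    return headers
```

Calling `list_dataset_headers('/path/to/the/tsv.gz/files/')` returns a dictionary keyed by file name. An empty dictionary means that no .tsv.gz files were found in that directory.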

Please note that for some database engines (like MySQL and MariaDB) you may need to specify the charset in the URI, and sometimes also the dialect, with something like 'mysql+mysqldb://username:password@localhost/imdb?charset=utf8'.
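
Passwords that contain characters such as "@" or "/" have to be percent-encoded before they are placed in the URI, otherwise they can be mistaken for delimiters. The following is a minimal sketch of assembling such a URI with the standard library. The helper `build_db_uri` and its parameters are hypothetical and not part of Cinemagoer or SQLAlchemy.

```python
from urllib.parse import quote, urlencode


def build_db_uri(dialect, user, password, host, database, **options):
    """Assemble a database URI (hypothetical helper, for illustration only)."""
    # Percent-encode the credentials so that characters such as "@" or "/"
    # cannot be confused with the URI delimiters.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    uri = f"{dialect}://{credentials}@{host}/{database}"
    if options:
        # Extra options, such as the charset, go into the query string.
        uri += "?" + urlencode(options)
    return uri


print(build_db_uri("postgresql", "user", "p@ss/word", "localhost", "imdb"))
print(build_db_uri("mysql+mysqldb", "username", "password", "localhost",
                   "imdb", charset="utf8"))
```

The resulting string can be passed as the URI argument to the script.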

Once the import is finished, which should take about an hour or less on a modern system, you will have a SQL database with all the information, and you can use the normal Cinemagoer API:

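from imdb import Cinemagoer

# Connect to the imported database through the 's3' access system.
ia = Cinemagoer('s3', 'postgresql://user:password@localhost/imdb')

# Search by title and list the matches.
results = ia.search_movie('the matrix')
for result in results:
    print(result.movieID, result)

# Load the full record for the first match and show the available keys.
matrix = results[0]
ia.update(matrix)
print(matrix.keys())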

Note

Running the script again will drop the current tables and import the data again.

Note

If the tqdm package is installed and the --verbose argument is used, a progress bar is shown while the database is populated.
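
This kind of behaviour is commonly implemented with an optional import. The following is a minimal sketch of that general pattern, not the actual code of s32cinemagoer.py. The names `with_progress` and `verbose` are hypothetical.

```python
try:
    from tqdm import tqdm  # optional dependency
except ImportError:
    tqdm = None  # fall back to plain iteration


def with_progress(iterable, verbose=False):
    """Wrap `iterable` in a progress bar only if tqdm is installed and verbose is set."""
    if verbose and tqdm is not None:
        return tqdm(iterable)
    return iterable


# The loop body is the same whether or not a progress bar is shown.
for row in with_progress(range(3), verbose=True):
    pass
```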

[1]	Until the end of 2017, IMDb used to distribute a more comprehensive subset of its data in a different format. Cinemagoer can also import that data but note that the data is not being updated anymore. For more information, see Old data files.