Using Rodeo To Transform Olympics Data Into GIFs

Datetime:2016-08-23 01:45:11          Topic:          Share

Going For Gold

Like many people around the globe, I've spent the past week and a half pretty captivated by the biggest sporting spectacle in the world, the Olympics.

As I watched each genetic superhuman rack up medals while consuming some combination of pizza and beer on my couch, I noticed that certain countries seemed to stockpile wins each year while leaving others in the dust, and wondered how that had changed over time. Even better, would it be possible to find data on this and create some kind of visual from it? Turns out - It is!

Here's how I used pandas , cartopy , matplotlib.pyplot and images2gif to visualize olympic medal count data from past to present on a map of the globe, and then turn those images into a GIF.

All of this was done usingRodeo - a data science IDE created right here at Yhat.

A Preview Of What We'll Be Making!

We Need Olympics Data

To get started, we need data. Luckily, The Guardian's DataBlog posted an article a few years ago which provides a spreadsheet with all Summer Olympic medal winners from 1896 to 2008.

Here's what that dataset looks like:

City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold
Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver
Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze
Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men 100m freestyle for sailors M Gold
Athens 1896 Aquatics Swimming CHASAPIS, Spiridon GRE Men 100m freestyle for sailors M Silver

We're going to download the Guardian DataBlog file and use the "ALL MEDALISTS" and "Country Codes" tabs for our data. pandas will help us organize and sort our data before start mapping.

Our final dataframe is going to need both country codes and country names, so let's first merge that "country_codes" CSV with our olympics data:

import os
import pandas as pd
filename = os.path.join("./olympics.csv")
df = pd.read_csv(filename)

cc_file = os.path.join("./country_codes.csv")
cc_df = pd.read_csv(cc_file)

df = df.merge(cc_df, on=['NOC'], how='left').rename(columns = {'country':'Country'})

The Olympics Started Out Weird

If at this point you're mostly distracted by "100m freestyle for sailors", and wondering what that was/why Greece seemed to be AMAZING at it -- have no fear.

We're on the same page, and I've already sailed down this rabbit hole for you.

The men's sailors 100 metre freestyle was one of the four swimming events on the Swimming at the 1896 Summer Olympics programme. The second swimming race was open only to sailors of the Greek Royal Navy. Eleven entered but only three actually took part in the event.

wikipedia.org

That's right. Greece won all 3 medals for this event because literally only Greek sailors could enter, and exactly 3 of them.

Some would call this an unfair swimming advantage.

On the other hand, the USA is currently allowed to compete with 2 actual live dolphins on their team, so we'll just call it even.

WHAT DO YOU TWO EVEN EAT?!

Anyway. Back to business.

Grouping And Sorting

This part was a little tricky, because we don't want to just plot every medal winner from every country - this would hugely throw off our data.

For example, if an individual American swimmer wins a gold medal, and the American rowing team wins silver, this should be counted as 1 gold (swimmer) and 1 silver (rowing team) for the country.

It should not be counted as 1 gold (swimmer), and n silvers (n = however many people fit on rowing boats. I feel like they're big boats. I may be thinking of vikings.)

To make sure our medal counts do their job, we're going to clip our data to just the columns we need, and then drop the duplicates to get rid of multi-player teams while keeping individual medals intact:

df = df[['Edition', 'City', 'Event','NOC','Country','Medal']]
df =  df.drop_duplicates()
df.head()
Edition City Event NOC Medal
1896 Athens 100m freestyle HUN Gold
1896 Athens 100m freestyle AUT Silver
1896 Athens 100m freestyle for sailors GRE Bronze
1896 Athens 100m freestyle for sailors GRE Gold
1896 Athens 100m freestyle for sailors GRE Silver

We're then going to set a "Count" column using the transform method, and then sort our data by year and medal count so it will be easier to work with once we get to plotting.

df['Count'] = df.groupby(['Edition','NOC','Country','Event', 'Medal'])['Medal'].transform('count')
df = df[['Edition', 'City', 'NOC', 'Country','Count']].groupby(['Edition', 'City','NOC','Country']).count().reset_index()
df = df.sort(['Edition', 'Count'], ascending=[1, 0])
df.head()

Success!:

Edition City NOC Country Count
1896 Athens GRE Greece 45
1896 Athens USA United States 19
1896 Athens GER Germany 13
1896 Athens FRA France 11
1896 Athens GBR United Kingdom 7

.... Weren't There Olympics In 2012?

Unfortunately for us, our data was posted in 2012 BEFORE the Summer Olympic games that year, so we will have to grab the 2012 data manually. Please hold...

OK! With a little help from a 2012 medal tally table via Topend Sports , Panda's read_html method, and some reformatting gymnastics, we're back in business:

import pandas as pd
last_ol = pd.read_html("http://www.topendsports.com/events/summer/medal-tally/2012.htm",header = 0)[0].rename(columns = {'Total':'Count'}).replace({'USA': 'United States'}, regex=True).replace({'Great Britain': 'United Kingdom'}, regex=True)
last_ol = last_ol.merge(cc_df, on=['Country'], how='left')
last_ol['Edition'] = pd.Series("2012", index=last_ol.index)
last_ol['City'] = pd.Series("London", index=last_ol.index)
last_ol = last_ol[['Edition', 'City', 'NOC', 'Country','Count']]
last_ol = last_ol.sort(['Edition', 'Count'], ascending=[1, 0])
Edition City NOC Country Count
2012 London USA United States 104
2012 London CHN China 88
2012 London RUS Russia 82
2012 London GBR United Kingdom 65
2012 London GER Germany 44

Now let's throw that onto our data set, and we should be good to go.

df = pd.concat([df, last_ol])

ONWARD!!

Time To Plot

We'll be using cartopy to visualize our data on a map, and matplotlib.pyplot for the actual plotting of the map.

Cartopy provides a huge selection of projections for easily creating maps. We're going to go with the simple PlateCarree method which gives us a nice, clear view of the globe:

Now for the tough part. We're going to cycle through each Olympic year, as well as each country in shpreader 's natural_earth shapefile, grabbing data and plotting along the way..

import matplotlib.pyplot as plt
import cartopy.io.shapereader as shpreader
import cartopy.crs as ccrs
maps = []
for _, olympics in df.groupby('Edition'):
    # set the figure size
    plt.figure(figsize=(12,16))
    # slice the top 10 medal winners from the group
    olympics = olympics.head(10)
    # create our globe projection, adding our boundaries and country borders along the way
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.add_feature(cartopy.feature.LAND)
    ax.set_extent([-145, 41, -5, 35])
    shpfilename = shpreader.natural_earth(resolution='110m',
                                          category='cultural',
                                          name='admin_0_countries')
    reader = shpreader.Reader(shpfilename)
    # loop through each country, and add some orange color if we find it in our Top 10 from that year. If we
    # don't find it, we'll turn it blue instead.
    countries = reader.records()
    for country in countries:
        print country.attributes['adm0_a3']
        if any(olympics.NOC == country.attributes['adm0_a3']) or any(olympics.Country == country.attributes['geounit']):
            ax.add_geometries(country.geometry, ccrs.PlateCarree(),
                              facecolor=(0.93, 0.325, 0.0666),
                              label=country.attributes['adm0_a3'])
        else:
            ax.add_geometries(country.geometry, ccrs.PlateCarree(),
                              facecolor=(0.21, 0.58, 0.84),
                              label=country.attributes['adm0_a3'])
    # add a blank row to keep consistent table size (1904 had only 9 competing countries)
    if year == "1904":
        olympics.loc[len(olympics)] = ['1904', 'St Louis', '', '', '']
    #add the Olympics Edition as our title
    year =  str(olympics['Edition'].iloc[0])
    plt.title(year,  fontsize=25)
    # add our table, format it, and place it on the bottom of the map
    table_vals= olympics[['Country','Count']].head(10).values
    col_labels=['Country','Medal Count']
    the_table = plt.table(cellText=table_vals,
                      colWidths = [0.5, 0.5],
                      colLabels=col_labels,
                      cellLoc='center',
                      loc='bottom',)
    the_table.set_fontsize(14)
    the_table.scale(1, 5)
    # save each figure as a png and add it to an array (we'll come back to this later)
    pic_file = "medal_count_" + year + ".png"
    maps.append(pic_file)
    plt.savefig(pic_file, bbox_inches='tight')
    # take a look at your map!
    plt.show()

medal_count_1896.png

Make It A GIF!

Converting all of these PNGs to a single GIF on our own is a little tricky, but luckily there is a module in the echomesh repo called images2gif that provides a helpful writeGif function, which takes an array of images and exports them as a GIF.

We'll use the Python Imaging Library (PIL), along with numpy to convert our files into a suitable "GIF-able" format, throw them into an array, and then pass that array into the writeGif function.

*There is one caveat with using the writeGif function. All of the images used to create the final GIF must have the exact same dimensions. Our 1904 winners only have 9 countries listed, which made the resulting PNG slightly shorter than the others, and therefore unpassable to the function. We fixed this by simply adding a blank row to that table.

from images2gif import writeGif
from PIL import Image
import numpy as np
gifs  = []
for file in maps:
    im = Image.open(file)
    im = np.asarray(im)
    imgs.append(im)
writeGif("olympics_globe.gif", gifs, duration=0.9, dither=0 )

TAH DAH!!

olympics_globe.gif