The 2022 World Cup in Qatar was the first to be held during the winter months, to avoid the intense heat of the Middle Eastern summer, so the Premier League season was put on hold to accommodate it. With only a week’s break between the end of the World Cup and the restart of the Premier League, it’s clear that some players and clubs will be impacted more than others, depending on how many games they played during the international tournament.

This post aims to explore the impact of the World Cup on the Premier League.

1 Data Sources

In a previous post exploring the strange fact about the 2019 Champions League Final, transfermarkt was used as the data source by scraping webpages for players and teams. The same data source and libraries will be used here, but some changes are necessary to get the level of granularity we need to extract insights.

You can check out the transfermarkt scraping code here.

2 Extracting World Cup data

The key data for this analysis lies within transfermarkt’s pages for the games played during the World Cup. Our previous scraping functionality was designed to only extract player availability so we’ll need to tweak some of the scraping code so we can get the minutes played for each game.

You can view the changes made in this commit on GitHub.

Once we have a list of all matches played by a player, we need to extract the minutes played. Unfortunately the data for each player is on a separate page, so to aggregate data for a club, we have to make a request for each player.

Scraping world cup minutes data

from datetime import datetime
import pandas as pd

def get_world_cup_minutes(
        team_url: str,
        world_cup_start=datetime(2022, 11, 20),
        world_cup_end=datetime(2022, 12, 18),
    ) -> pd.DataFrame:
    # teams and players are assumed to be modules from the transfermarkt scraping code linked above
    team_players = teams.get_players(team_url)
    # build url of player to scrape data from
    team_player_urls = {player: "https://www.transfermarkt.com" + url for player, url in team_players.items()}

    # grab all players' matches
    team_minutes_played = []

    for player, url in team_player_urls.items():
        print("grabbing data for: " + player)
        # world_cup_year is a module-level variable, set before this function is first called
        match_data = players.get_match_data(url, world_cup_year)
        # possibility of player not playing any games
        if match_data is None:
            continue
        min_played = players.get_minutes_played(match_data)
        # add player_name column so we can identify rows when we concatenate
        min_played['player_name'] = player
        team_minutes_played.append(min_played)

    # pd.concat raises on an empty list, so return early if no player had any matches
    if not team_minutes_played:
        return pd.DataFrame()

    all_minutes_played = pd.concat(team_minutes_played)

    # keep only the matches that fall within the World Cup dates
    world_cup_minutes = all_minutes_played.loc[all_minutes_played['Date'].between(world_cup_start, world_cup_end)]

    return world_cup_minutes

We can now use this function to extract data for each Premier League team:

# for some reason transfermarkt lists world cup on 2021 page
world_cup_year= "2021"

# get 2022 premier league clubs
prem_clubs_22 = leagues.get_prem_club_list(season="2022")

prem_world_cup_minutes = {}

for club,url in prem_clubs_22.items():
    team_minutes = get_world_cup_minutes(url)
    prem_world_cup_minutes[club] = team_minutes
    # add club name so we can identify rows by club when we concatenate rows
    team_minutes["club"] = club

prem_world_cup_minutes_df = pd.concat(prem_world_cup_minutes.values())

Finally we have our complete dataset that we can play around with!

prem_world_cup_minutes_df.sample(5)

	Date	Matchday	Home team.1	Away team.1	Result	min_played	subbed_on	subbed_off	player_name	club
0	2022-11-21	Group B	England	Iran	6:2	0	Groin Surgery	Groin Surgery	Emile Smith Rowe	Arsenal FC
3	2022-12-06	Round of 16	Morocco	Spain	3:0 on pens	0	on the bench	on the bench	David Raya	Brentford FC
5	2022-12-14	Semi-Finals	France	Morocco	2:0	0	Not in squad	Not in squad	Lucas Digne	Aston Villa
0	2022-11-21	Group B	England	Iran	6:2	0	on the bench	on the bench	Aaron Ramsdale	Arsenal FC
2	2022-11-29	Group B	Wales	England	0:3	77	NaN	77'	Daniel James	Fulham FC

3 Data processing

Now that the data is normalised, we can run some sense checks to ensure the data is what we expect.

prem_world_cup_minutes_df['player_name'].value_counts()

Lucas Digne            7
Mateo Kovacic          7
Raphaël Varane         7
N'Golo Kanté           7
Hakim Ziyech           7
                      ..
Thomas Partey          3
Tariq Lamptey          3
Wout Faes              3
Armel Bella-Kotchap    3
Philip Billing         3
Name: player_name, Length: 150, dtype: int64

This passes the eye test: France (Digne) and Croatia (Kovacic) each played seven games at the World Cup since they reached the final and the third-place playoff respectively, while unfortunately for Billing (Denmark) and Bella-Kotchap (Germany), their teams were knocked out at the group stage after only 3 games. Note that this counts the matches listed for each player, whether or not they actually made it onto the pitch.
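
The same check can be written as an assertion rather than an eye test. The sketch below uses a small made-up frame in place of prem_world_cup_minutes_df (the rows are illustrative, not scraped data) and relies on the fact that every team at the tournament played between 3 and 7 matches.

```python
import pandas as pd

# Toy stand-in for prem_world_cup_minutes_df: one row per (player, match).
# The rows are illustrative only, not real scraped data.
toy_minutes = pd.DataFrame({
    "player_name": ["Player A"] * 7 + ["Player B"] * 3,
    "min_played": [90, 90, 0, 45, 90, 120, 90, 0, 0, 12],
})

# Every World Cup team plays at least 3 group games and at most 7 matches,
# so each player's row count should fall within that range.
games_listed = toy_minutes["player_name"].value_counts()
out_of_range = games_listed[(games_listed < 3) | (games_listed > 7)]

print(games_listed)
assert out_of_range.empty, f"unexpected match counts:\n{out_of_range}"
```

Run against the real dataframe, a failing assertion would point at players whose pages were only partially scraped or whose dates fell outside the filter.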

3.1 Adding country name

This data doesn’t currently contain the name of each player’s country. We could get this by scraping more webpages, but that isn’t ideal: making more requests is slow, and writing more code to scrape the responses won’t be fun or productive.

We do have the data for each game played by the player, so a hacky way to get the team name is to calculate the mode of the teams in the subset of data for that player. For example, for Lucas Digne (France), here are all the games listed for him:

prem_world_cup_minutes_df[prem_world_cup_minutes_df['player_name'] =="Lucas Digne"]

	Date	Matchday	Home team.1	Away team.1	Result	subbed_on	subbed_off	player_name	club
0	2022-11-22	Group D	France	Australia	4:1	Not in squad	Not in squad	Lucas Digne	Aston Villa
1	2022-11-26	Group D	France	Denmark	2:1	Not in squad	Not in squad	Lucas Digne	Aston Villa
2	2022-11-30	Group D	Tunisia	France	1:0	Not in squad	Not in squad	Lucas Digne	Aston Villa
3	2022-12-04	Round of 16	France	Poland	3:1	Not in squad	Not in squad	Lucas Digne	Aston Villa
4	2022-12-10	Quarter-Finals	England	France	1:2	Not in squad	Not in squad	Lucas Digne	Aston Villa
5	2022-12-14	Semi-Finals	France	Morocco	2:0	Not in squad	Not in squad	Lucas Digne	Aston Villa
6	2022-12-18	Final	Argentina	France	7:5 on pens	Not in squad	Not in squad	Lucas Digne	Aston Villa

We can see that France show up the most in the Home team.1 and Away team.1 columns, so we can use this to calculate which team he plays for.

prem_world_cup_minutes_df[prem_world_cup_minutes_df['player_name'] =="Lucas Digne"][['Home team.1', 'Away team.1']].values

array([['France', 'Australia'],
       ['France', 'Denmark'],
       ['Tunisia', 'France'],
       ['France', 'Poland'],
       ['England', 'France'],
       ['France', 'Morocco'],
       ['Argentina', 'France']], dtype=object)

Since the team names are spread across two columns, .values gives us a 2D numpy array. We then flatten it into a 1D array (like a list):

# named player_teams so it doesn't shadow the teams module used in get_world_cup_minutes
player_teams = prem_world_cup_minutes_df[
    prem_world_cup_minutes_df['player_name'] == "Lucas Digne"
][['Home team.1', 'Away team.1']].values.flatten()
player_teams

array(['France', 'Australia', 'France', 'Denmark', 'Tunisia', 'France',
       'France', 'Poland', 'England', 'France', 'France', 'Morocco',
       'Argentina', 'France'], dtype=object)

With a flat 1D array, we can calculate the mode:

pd.DataFrame(player_teams).mode().iloc[0,0]

'France'

We can apply this to the full dataset with groupby and apply, and then merge the result with the original dataset so we have an extra country column.

player_countries = prem_world_cup_minutes_df.groupby("player_name").apply(lambda df: pd.DataFrame(df[['Home team.1', 'Away team.1']].values.flatten()).mode())
player_countries = player_countries.reset_index().drop("level_1", axis=1).rename({0:'country'}, axis=1)
prem_world_cup_minutes_df = prem_world_cup_minutes_df.merge(player_countries, how='left', left_on="player_name", right_on="player_name")
prem_world_cup_minutes_df.head()

	Date	Matchday	Home team.1	Away team.1	Result	min_played	subbed_on	subbed_off	player_name	club	country
0	2022-11-24	Group G	Switzerland	Cameroon	1:0	90	NaN	NaN	Manuel Akanji	Manchester City	Switzerland
1	2022-11-28	Group G	Brazil	Switzerland	1:0	90	NaN	NaN	Manuel Akanji	Manchester City	Switzerland
2	2022-12-02	Group G	Serbia	Switzerland	2:3	90	NaN	NaN	Manuel Akanji	Manchester City	Switzerland
3	2022-12-06	Round of 16	Portugal	Switzerland	6:1	90	NaN	NaN	Manuel Akanji	Manchester City	Switzerland
4	2022-11-24	Group H	Portugal	Ghana	3:2	90	NaN	NaN	João Cancelo	Manchester City	Portugal
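
One caveat with the mode trick: mode returns every value tied for most frequent, so a player with a tie would get more than one row in player_countries and the merge would duplicate their matches. That should not happen here, because a player's own country appears in every one of their matches and each player has at least 3 listed, but a quick check is cheap. A minimal sketch on made-up rows (the player names are hypothetical):

```python
import pandas as pd

# Toy stand-in for the match rows; player names are hypothetical.
toy = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player A", "Player B"],
    "Home team.1": ["France", "France", "Tunisia", "Wales"],
    "Away team.1": ["Australia", "Denmark", "France", "England"],
})

def candidate_countries(df: pd.DataFrame) -> list:
    """Return every team tied for most frequent across both team columns."""
    teams_flat = pd.Series(df[["Home team.1", "Away team.1"]].values.flatten())
    return teams_flat.mode().tolist()

candidates = {
    player: candidate_countries(rows)
    for player, rows in toy.groupby("player_name")
}

# Players with more than one candidate would be duplicated by the merge.
ambiguous = {p: c for p, c in candidates.items() if len(c) > 1}
print(ambiguous)
```

In this toy data Player B has a single match, so both teams tie and the player is flagged as ambiguous.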

4 Visualisation

The aim of this visualisation is to assess the impact on the various Premier League teams, so we can start by aggregating by club and adding up all the minutes.

club_minutes = prem_world_cup_minutes_df.groupby(['club']).sum(numeric_only=True)
_ = club_minutes.plot(kind='barh')

This plot seems a bit bare, and it misses context around the distribution of the minutes amongst the squad. For example, Bournemouth seems to have comparable minutes played with Brentford, but Bournemouth were represented by just one player (Philip Billing of Denmark) vs Brentford who had 4 different players at the World Cup. You could argue the impact on Brentford is greater since Bournemouth can just rest Billing for a game or two, whereas resting several players at once is a taller order for Brentford.
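
One way to surface that missing context is to count how many players contributed to each club's total. This is a minimal sketch on made-up rows, assuming the same club, player_name and min_played columns as the real dataframe; the club and player names are hypothetical.

```python
import pandas as pd

# Toy stand-in for prem_world_cup_minutes_df; clubs, players and minutes are made up.
toy = pd.DataFrame({
    "club": ["Club X", "Club X", "Club X", "Club Y", "Club Y", "Club Y"],
    "player_name": ["P1", "P1", "P1", "P2", "P3", "P4"],
    "min_played": [90, 90, 90, 90, 90, 90],
})

# Only count rows where the player actually got on the pitch.
played = toy[toy["min_played"] > 0]

club_summary = played.groupby("club").agg(
    total_minutes=("min_played", "sum"),
    players_used=("player_name", "nunique"),
)
club_summary["minutes_per_player"] = (
    club_summary["total_minutes"] / club_summary["players_used"]
)
print(club_summary)
```

Both toy clubs have the same total, but one is carried by a single player and the other is spread across three, which is the distinction the bar chart of totals hides.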

4.1 Grouping by country

Splitting each club’s minutes by country seems to commit the sin of way too much data in one plot: there just aren’t enough significantly different colours to differentiate between the countries, and even if there were, it’d be too much to take in for a single visualisation.
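
For reference, a per-country breakdown like this can be built with a pivot table. The sketch below is not the original plotting code; it uses made-up rows and assumes the club, country and min_played columns from the processed dataframe.

```python
import pandas as pd

# Toy stand-in for the processed dataframe; values are made up.
toy = pd.DataFrame({
    "club": ["Club X", "Club X", "Club Y", "Club Y", "Club Y"],
    "country": ["France", "France", "Wales", "England", "England"],
    "min_played": [90, 45, 77, 90, 30],
})

# One row per club, one column per country, cells hold total minutes.
minutes_by_country = toy.pivot_table(
    index="club", columns="country", values="min_played",
    aggfunc="sum", fill_value=0,
)
print(minutes_by_country)

# A stacked bar chart needs one colour per column.
print(minutes_by_country.shape[1], "countries to colour")
```

With matplotlib installed, minutes_by_country.plot(kind="barh", stacked=True) would draw one segment per country, so the number of columns is the number of colours the legend has to carry.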

4.2 Grouping by tournament progression

A more meaningful and accessible visualisation is to group together countries that made it to the same stage of the tournament. Progression through the World Cup rounds indicates how much rest the players have had and also helps explain the number of minutes: progressing to later rounds is likely to be the cause of a higher number of minutes played.
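
A rough sketch of how that grouping could be derived from the Matchday column follows. It is not the original code: the rows are made up, and the "Third Place Play-Off" label is an assumption that should be checked against the scraped data.

```python
import pandas as pd

# Rank each Matchday label by how deep into the tournament it is.
# "Third Place Play-Off" is an assumed label; verify it against the real data.
STAGE_RANK = {
    "Round of 16": 1,
    "Quarter-Finals": 2,
    "Semi-Finals": 3,
    "Third Place Play-Off": 4,
    "Final": 5,
}
RANK_TO_STAGE = {0: "Group stage", **{rank: stage for stage, rank in STAGE_RANK.items()}}

# Toy stand-in for the processed dataframe; values are made up.
toy = pd.DataFrame({
    "club": ["Club X", "Club X", "Club X", "Club Y", "Club Y"],
    "country": ["France", "France", "France", "Wales", "Wales"],
    "Matchday": ["Group D", "Round of 16", "Final", "Group B", "Group B"],
    "min_played": [90, 90, 120, 77, 13],
})

# Labels missing from STAGE_RANK (every "Group X", but also any unexpected
# label) fall back to rank 0, i.e. they are treated as group-stage games.
toy["stage_rank"] = toy["Matchday"].map(STAGE_RANK).fillna(0).astype(int)

# A country's furthest stage is the highest rank among its matches.
furthest = toy.groupby("country")["stage_rank"].max().map(RANK_TO_STAGE)
toy["furthest_stage"] = toy["country"].map(furthest)

# One row per club, one column per stage reached, cells hold total minutes.
minutes_by_stage = toy.pivot_table(
    index="club", columns="furthest_stage", values="min_played",
    aggfunc="sum", fill_value=0,
)
print(minutes_by_stage)
```

This leaves a handful of stage columns rather than one per country, which is far easier to colour and read in a stacked bar chart.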

4.3 Finishing touches

Since we are exploring the impact of fatigue from the World Cup, it makes sense to order the y axis (Premier League teams) by league position, since this is likely to be affected following the World Cup.

Adding premier league standings

from fuzzywuzzy import process
prem_standings = pd.read_html("https://www.bbc.co.uk/sport/football/tables")[0]
prem_standings = prem_standings.iloc[:-1, [0,2]].rename({'Unnamed: 0':'prem_position'}, axis=1).set_index('Team')

# taken from https://stackoverflow.com/a/56315491
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with boths keys and matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    
    return df_1

# df is presumably the club-indexed table of minutes built for the plot in the previous section
df = df.reset_index()
prem_standings = prem_standings.reset_index()
merged_standings = fuzzy_merge(df, prem_standings, key1='club', key2='Team', threshold=65, limit=1)
merged_standings.loc[merged_standings['club']=="Manchester United", 'matches'] = 'Man Utd'
merged_standings = merged_standings.merge(prem_standings, how='left', left_on='matches', right_on='Team').drop(['matches', 'Team'], axis=1)
merged_standings['prem_position'] = pd.to_numeric(merged_standings['prem_position'])
merged_standings.sort_values('prem_position', inplace=True, ascending=False)

df = merged_standings.drop("prem_position", axis=1).set_index("club")
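
To illustrate the name-matching step without any third-party dependency, here is a small sketch that swaps fuzzywuzzy for the standard library's difflib. The table names are hypothetical short forms, and match_club is an illustrative helper rather than part of the code above.

```python
from difflib import get_close_matches

# Hypothetical short names, similar in spirit to a league table's labels.
table_names = ["Arsenal", "Aston Villa", "Man City", "Man Utd", "Brentford"]

# Names whose short form is too different to rely on automatic matching.
manual_overrides = {"Manchester United": "Man Utd"}

def match_club(club, choices, cutoff=0.6):
    """Return the closest table name for a club, or None if nothing is close enough."""
    if club in manual_overrides:
        return manual_overrides[club]
    close = get_close_matches(club, choices, n=1, cutoff=cutoff)
    return close[0] if close else None

for club in ["Arsenal FC", "Brentford FC", "Manchester United"]:
    print(club, "->", match_club(club, table_names))
```

The override dictionary plays the same role as the hard-coded 'Man Utd' assignment above: heavily abbreviated names are handled explicitly instead of lowering the similarity threshold for every club.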

We can change the colour scheme so it’s more accessible (see ColorBrewer at colorbrewer2.org) and somewhat matches the World Cup theme, change the fonts, add some spacing and we end up with our final plot: