Practice Project - FIFA World Cup Analysis¶


Context¶


The FIFA World Cup, often simply called the World Cup, is an international association football competition contested by the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body. The championship has been contested every four years since the inaugural tournament in 1930, except in 1942 and 1946, when it was not held because of the Second World War. It is one of the most prestigious trophies in football.


Objective¶


A new football club named 'Brussels United FC' has just been inaugurated. As a member of this club, you have been assigned the task of carrying out an analysis of the World Cup data.


Data Dictionary¶


The World Cups dataset contains the following information about every World Cup up to and including 2014.

Year: Year in which the world cup was held

Country: Country where the world cup was held

Winner: Team that won the world cup

Runners-Up: Team that came second

Third: Team that came third

Fourth: Team that came fourth

GoalsScored: Total goals scored in the world cup

QualifiedTeams: Number of teams that qualified for the world cup

MatchesPlayed: Total matches played in the world cup

Attendance: Total attendance in the world cup

Q 1: Import the necessary libraries and briefly explain the use of each library¶

In [ ]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
from google.colab import drive
drive.mount('/content/drive')

Write your Answer here:¶

Ans 1:

Numpy:

NumPy is the fundamental package for array computing in Python. It provides n-dimensional arrays and fast numerical routines that operate on them.

Pandas:

Pandas is used to load and process tabular data. It provides data structures (Series and DataFrame) and data manipulation tools designed for data cleaning and analysis.

matplotlib.pyplot

Matplotlib is a visualization library whose plotting interface was modelled on that of MATLAB. We only need its pyplot module here, which provides the plotting functions used in this notebook (figure, title, xlabel, show, and so on).

Seaborn

Seaborn is another visualization library, aimed at statistical graphics such as distribution plots, pair plots, and heat maps. It is built on top of matplotlib and integrates closely with pandas data structures.

Q 2: Which library can be used to read the WorldCups dataset? Read the dataset.¶

In [ ]:
fifa=pd.read_csv("/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FIFA/WorldCups.csv")

Write your Answer here:¶

Ans 2:

The pandas library can be used to read the WorldCups dataset: pd.read_csv() loads the CSV file into a DataFrame.
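As a minimal sketch of what read_csv does, the snippet below parses a few rows from an in-memory string instead of the Drive path. The rows are copied from the output shown in this notebook, but the comma-separated layout of the file is an assumption, and `csv_text` and `sample` are illustrative names.

```python
import io

import pandas as pd

# Three rows copied from the notebook output; the real file has 20 rows.
# The comma-separated layout is an assumption about WorldCups.csv.
csv_text = """Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
"""

# Reading Attendance as text keeps values such as "363.000" unchanged,
# which mirrors the object dtype reported for this column later on.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"Attendance": str})

print(sample.shape)
print(sample.dtypes)
```

read_csv accepts a file path, a URL, or any file-like object, so the same call works for the file on Drive.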

Q3. Show the last 10 records of the dataset. How many columns are there?¶

In [ ]:
fifa.tail(10)
Out[ ]:
    Year  Country       Winner      Runners-Up      Third        Fourth          GoalsScored  QualifiedTeams  MatchesPlayed  Attendance
10  1978  Argentina     Argentina   Netherlands     Brazil       Italy           102          16              38             1.545.791
11  1982  Spain         Italy       Germany FR      Poland       France          146          24              52             2.109.723
12  1986  Mexico        Argentina   Germany FR      France       Belgium         132          24              52             2.394.031
13  1990  Italy         Germany FR  Argentina       Italy        England         115          24              52             2.516.215
14  1994  USA           Brazil      Italy           Sweden       Bulgaria        141          24              52             3.587.538
15  1998  France        France      Brazil          Croatia      Netherlands     171          32              64             2.785.100
16  2002  Korea/Japan   Brazil      Germany         Turkey       Korea Republic  161          32              64             2.705.197
17  2006  Germany       Italy       France          Germany      Portugal        147          32              64             3.359.439
18  2010  South Africa  Spain       Netherlands     Germany      Uruguay         145          32              64             3.178.856
19  2014  Brazil        Germany     Argentina       Netherlands  Brazil          171          32              64             3.386.810

Write your Answer here:¶

Ans 3:

There are 10 columns in the data

Q4. Show the first 10 records of the dataset.¶

In [ ]:
fifa.head(10)
Out[ ]:
    Year  Country       Winner      Runners-Up      Third        Fourth          GoalsScored  QualifiedTeams  MatchesPlayed  Attendance
0   1930  Uruguay       Uruguay     Argentina       USA          Yugoslavia      70           13              18             590.549
1   1934  Italy         Italy       Czechoslovakia  Germany      Austria         70           16              17             363.000
2   1938  France        Italy       Hungary         Brazil       Sweden          84           15              18             375.700
3   1950  Brazil        Uruguay     Brazil          Sweden       Spain           88           13              22             1.045.246
4   1954  Switzerland   Germany FR  Hungary         Austria      Uruguay         140          16              26             768.607
5   1958  Sweden        Brazil      Sweden          France       Germany FR      126          16              35             819.810
6   1962  Chile         Brazil      Czechoslovakia  Chile        Yugoslavia      89           16              32             893.172
7   1966  England       England     Germany FR      Portugal     Soviet Union    89           16              32             1.563.135
8   1970  Mexico        Brazil      Italy           Germany FR   Uruguay         95           16              32             1.603.975
9   1974  Germany       Germany FR  Netherlands     Poland       Brazil          97           16              38             1.865.753

Q5. What do you understand by the dimension of the dataset? Find the dimension of the fifa dataframe.¶

In [ ]:
fifa.shape
Out[ ]:
(20, 10)

Write your Answer here:¶

Ans 5:

The dimension (shape) of the dataset is a tuple of 2 elements. The first element is the number of rows and the second is the number of columns. The fifa dataframe has shape (20, 10), i.e. 20 rows and 10 columns.

Q6. What do you understand by the size of the dataset? Find the size of the fifa dataframe.¶

In [ ]:
fifa.size
Out[ ]:
200

Write your Answer here:¶

Ans 6:

The size of the dataset is the total number of elements in the data, i.e. the product of the number of rows and the number of columns. For the fifa dataframe this is 20 × 10 = 200.
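The relationship between shape and size can be checked on any DataFrame. The sketch below uses a small illustrative frame (`toy` is a hypothetical name, with values taken from the first three rows of the dataset) rather than the full data.

```python
import pandas as pd

# Small illustrative frame: 3 rows, 2 columns
toy = pd.DataFrame({"Year": [1930, 1934, 1938], "GoalsScored": [70, 70, 84]})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)

# size counts every cell, so it equals rows * columns
print(toy.size)

# ndim is the number of axes; a DataFrame always has 2
print(toy.ndim)
```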

Q7. What are the data types of all the variables in the data set?¶

Hint: Use the info() function to get all the information about the dataset.

In [ ]:
fifa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Year            20 non-null     int64 
 1   Country         20 non-null     object
 2   Winner          20 non-null     object
 3   Runners-Up      20 non-null     object
 4   Third           20 non-null     object
 5   Fourth          20 non-null     object
 6   GoalsScored     20 non-null     int64 
 7   QualifiedTeams  20 non-null     int64 
 8   MatchesPlayed   20 non-null     int64 
 9   Attendance      20 non-null     object
dtypes: int64(4), object(6)
memory usage: 1.7+ KB

Write your Answer here:¶

Ans 7:

  • There are two different data types: int64 (integer values) and object (here, text strings).
  • There are 4 numerical columns: Year, GoalsScored, QualifiedTeams, and MatchesPlayed.
  • Country, Winner, Runners-Up, Third, and Fourth are categorical columns stored as object.
  • Attendance is also stored as object even though it is a numerical quantity. The output of tail() shows why: values such as 1.545.791 use a dot as the thousands separator, so pandas cannot parse them as numbers.
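One way to make Attendance numeric is to strip the separator dots and cast to an integer type. This is a sketch on four values copied from the output above, not a step the notebook performs; `attendance` and `attendance_num` are illustrative names.

```python
import pandas as pd

# Attendance values as they appear in the dataset (dot = thousands separator)
attendance = pd.Series(
    ["590.549", "363.000", "1.045.246", "3.386.810"], name="Attendance"
)

# regex=False treats "." as a literal dot rather than "any character"
attendance_num = attendance.str.replace(".", "", regex=False).astype("int64")

print(attendance_num)
print(attendance_num.dtype)
```

On the full dataframe the same expression could be assigned back to fifa['Attendance'], after which describe() and corr() would include the column.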

Q8. What do you mean by missing values? Are there any missing values in the fifa dataframe?¶

In [ ]:
fifa.isnull().values.any()
Out[ ]:
False

Write your Answer here:¶

Ans 8:

A missing value is a cell in the dataset that is blank, i.e. the information for that cell was not recorded.

The output of the above code (False) implies that there are no missing values in the data.
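When the check does return True, a per-column count is usually more informative than a single flag. The sketch below uses a hypothetical frame in which one value has been deliberately replaced by NaN for illustration; the fifa data itself has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the NaN is introduced on purpose for this example
toy = pd.DataFrame(
    {"Year": [1930, 1934, 1938], "GoalsScored": [70.0, np.nan, 84.0]}
)

# Single flag: is anything missing anywhere?
has_missing = toy.isnull().values.any()
print(has_missing)

# Per-column count of missing cells
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```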

Q9. What do summary statistics of data represent? Find the summary statistics for the numerical variables (Dtype is int64) in the fifa data?¶

In [ ]:
fifa.describe()
Out[ ]:
       Year         GoalsScored  QualifiedTeams  MatchesPlayed
count  20.000000    20.000000    20.000000       20.000000
mean   1974.800000  118.950000   21.250000       41.800000
std    25.582889    32.972836    7.268352        17.218717
min    1930.000000  70.000000    13.000000       17.000000
25%    1957.000000  89.000000    16.000000       30.500000
50%    1976.000000  120.500000   16.000000       38.000000
75%    1995.000000  145.250000   26.000000       55.000000
max    2014.000000  171.000000   32.000000       64.000000

Write your Answer here:¶

Ans 9:

  • The minimum and maximum number of goals scored in a World Cup from 1930 to 2014 are 70 and 171, respectively.
  • The average number of goals scored in a World Cup is ~119.
  • The number of qualified teams ranges from 13 to 32 and the number of matches played from 17 to 64. Together with the head() and tail() output, this shows that later tournaments are larger than early ones. Rising popularity of the sport is a plausible explanation, but the summary statistics alone do not establish it.
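The summary table does not show order in time, so the growth claim is worth checking directly. A simple sketch, assuming an even split into the first ten and last ten tournaments (the split point is an arbitrary choice), compares the averages of the two halves. The values are copied from the head() and tail() output.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
             1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014],
    "QualifiedTeams": [13, 16, 15, 13, 16, 16, 16, 16, 16, 16,
                       16, 24, 24, 24, 24, 32, 32, 32, 32, 32],
    "MatchesPlayed": [18, 17, 18, 22, 26, 35, 32, 32, 32, 38,
                      38, 52, 52, 52, 52, 64, 64, 64, 64, 64],
})

df = df.sort_values("Year")
cols = ["QualifiedTeams", "MatchesPlayed"]

early = df.iloc[:10][cols].mean()  # 1930 to 1974
late = df.iloc[10:][cols].mean()   # 1978 to 2014

print(early)
print(late)
```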

Q 10. Plot the distribution plot for the variable 'MatchesPlayed'. Write detailed observations from the plot.¶

In [ ]:
sns.displot(fifa['MatchesPlayed'], kind='kde')
plt.show()
[Figure: KDE plot of MatchesPlayed]

Write your Answer here:¶

Ans 10:

  • The plot shows that most of the observations lie between 20 and 60, i.e. the majority of World Cups had 20 to 60 matches. The observed values range from 17 to 64; the KDE curve extends beyond that range only because of smoothing.
  • The distribution looks fairly symmetric, with two peaks in the plot around 25 and 60. A distribution with two peaks (modes) is called bimodal.
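A KDE curve depends on its smoothing bandwidth, so it can help to look at raw bin counts as well. The sketch below counts tournaments per ten-match bin; the bin edges are an arbitrary choice made for this illustration, and the values are copied from the dataset output above.

```python
import numpy as np

# MatchesPlayed for the 20 tournaments, in chronological order
matches = np.array([18, 17, 18, 22, 26, 35, 32, 32, 32, 38,
                    38, 52, 52, 52, 52, 64, 64, 64, 64, 64])

bin_edges = [10, 20, 30, 40, 50, 60, 70]
counts, _ = np.histogram(matches, bins=bin_edges)

for low, high, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{low}-{high}: {count}")
```

The empty 40-50 bin separates a lower group of tournaments (under 40 matches) from an upper group (52 or 64 matches), which is consistent with a two-peaked shape.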

Q 11. Which country has won the world cup maximum times?¶

Hint: Use value_counts() function

The value_counts() function returns a Series containing the count of each unique value. The result is sorted in descending order, so the first element is the most frequently occurring value.

In [ ]:
fifa['Winner'].value_counts()
Out[ ]:
Winner
Brazil        5
Italy         4
Germany FR    3
Uruguay       2
Argentina     2
England       1
France        1
Spain         1
Germany       1
Name: count, dtype: int64

Write your Answer here:¶

Ans 11:

Brazil has won the World Cup the most times: 5.

Q12.¶

12.1 What is the mean of the variable 'Qualified teams'?¶

12.2 What is the median of the variable 'Qualified teams'?¶

12.3 What is the mode of the variable 'Qualified teams'?¶

Explain your answer¶

In [ ]:
m1 = fifa['QualifiedTeams'].mean()
print(m1)
m2 = fifa['QualifiedTeams'].median()
print(m2)
m3 = fifa['QualifiedTeams'].mode()[0]
print(m3)
21.25
16.0
16

Write your Answer here:¶

Ans 12:

  • The mean, median, and mode of the variable QualifiedTeams are 21.25, 16, and 16, respectively.
  • The mean is greater than the median which implies that the distribution of QualifiedTeams might be skewed to the right.
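The mean-versus-median comparison is a rule of thumb; the sample skewness gives a direct check. The sketch below uses Series.skew() on the QualifiedTeams values copied from the dataset output (`qualified` is an illustrative name).

```python
import pandas as pd

# QualifiedTeams for the 20 tournaments, in chronological order
qualified = pd.Series([13, 16, 15, 13, 16, 16, 16, 16, 16, 16,
                       16, 24, 24, 24, 24, 32, 32, 32, 32, 32])

print(qualified.mean(), qualified.median(), qualified.mode()[0])

# A positive value indicates a longer right tail
skewness = qualified.skew()
print(skewness)
```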

Q13. How many countries are above the mean level of 'Qualified Teams'?¶

In [ ]:
fifa[fifa['QualifiedTeams']>m1]['Country']
Out[ ]:
11           Spain
12          Mexico
13           Italy
14             USA
15          France
16     Korea/Japan
17         Germany
18    South Africa
19          Brazil
Name: Country, dtype: object

Write your Answer here:¶

Ans 13:

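9 World Cups had more than the average number of qualified teams (21.25). Each row is one tournament and Country is its host, so the result lists the 9 host countries of those tournaments, from Spain (1982) to Brazil (2014).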
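Because each row is a tournament, counting rows and counting distinct hosts are different questions that happen to give the same answer here. A short sketch, using host and team counts copied from the dataset output:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Uruguay", "Italy", "France", "Brazil", "Switzerland",
                "Sweden", "Chile", "England", "Mexico", "Germany",
                "Argentina", "Spain", "Mexico", "Italy", "USA",
                "France", "Korea/Japan", "Germany", "South Africa", "Brazil"],
    "QualifiedTeams": [13, 16, 15, 13, 16, 16, 16, 16, 16, 16,
                       16, 24, 24, 24, 24, 32, 32, 32, 32, 32],
})

# Boolean mask: True for tournaments above the mean
above = df["QualifiedTeams"] > df["QualifiedTeams"].mean()

n_tournaments = int(above.sum())                   # rows above the mean
n_hosts = df.loc[above, "Country"].nunique()       # distinct hosts among them

print(n_tournaments, n_hosts)
```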

Q14. What is the median of variables 'GoalsScored' & 'MatchesPlayed'?¶

In [ ]:
GS_median = np.median(fifa['GoalsScored'])
print(GS_median)
MP_median = np.median(fifa['MatchesPlayed'])
print(MP_median)
120.5
38.0

Write your Answer here:¶

Ans 14:

The median number of goals scored and matches played are 120.5 and 38, respectively.

Q15. Which country scored the minimum number of goals?¶

In [ ]:
fifa[fifa['GoalsScored']==fifa['GoalsScored'].min()]['Country']
Out[ ]:
0    Uruguay
1      Italy
Name: Country, dtype: object

Write your Answer here:¶

Ans 15:

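There is a tie for the minimum number of goals scored. The World Cups hosted by Uruguay (1930) and Italy (1934) each had 70 goals in total. Note that GoalsScored is the total for the whole tournament and Country is the host, so this result does not say how many goals the host team itself scored.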
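The comparison against min() is what keeps both tied rows; idxmin() would return only the first one. A sketch on the first four rows of the dataset (`lowest` and `first_only` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950],
    "Country": ["Uruguay", "Italy", "France", "Brazil"],
    "GoalsScored": [70, 70, 84, 88],
})

# Keeps every row that ties for the minimum
lowest = df[df["GoalsScored"] == df["GoalsScored"].min()]
print(lowest[["Year", "Country", "GoalsScored"]])

# idxmin returns the label of the first minimum only, so the tie is lost
first_only = df.loc[df["GoalsScored"].idxmin(), "Country"]
print(first_only)
```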

Q16. Plot the pairplots of 'GoalsScored', 'QualifiedTeams', 'MatchesPlayed'.¶

In [ ]:
sns.pairplot(fifa[['GoalsScored', 'QualifiedTeams', 'MatchesPlayed']])
plt.show()
[Figure: pairplot of GoalsScored, QualifiedTeams, and MatchesPlayed]

Q17. Plot the scatterplot for variables 'Country' & 'Year'.¶

In [ ]:
sns.scatterplot(x=fifa['Year'], y=fifa['Country'])
plt.show()
[Figure: scatterplot of Country against Year]

Q18. Plot a countplot for the variable 'Winner' to understand the number of times a country won the world cup between 1930 and 2014.¶

In [ ]:
plt.figure(figsize=(10,6))

sns.countplot(x=fifa['Winner'])

plt.title('Number of World Cup titles per country, 1930 to 2014')

plt.xlabel('Country')

plt.ylabel('Number of titles')

plt.show()
[Figure: countplot of Winner]

Write your Answer here:¶

Ans 18:

  • As observed earlier, Brazil has won the World Cup the most times: 5.
  • Germany appears under two names, Germany FR (3 titles) and Germany (1 title). We can treat them as the same team.
  • With that merge, Italy and Germany share second place with 4 titles each.
  • Uruguay and Argentina have each won the World Cup twice.
  • England, France, and Spain each have 1 World Cup title.
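The merge of the two Germany labels can be done before counting. This is a sketch of one way to do it with Series.replace, using the Winner column copied from the dataset output; treating "Germany FR" and "Germany" as one team is the assumption stated in the answer above.

```python
import pandas as pd

# Winner column for the 20 tournaments, in chronological order
winners = pd.Series([
    "Uruguay", "Italy", "Italy", "Uruguay", "Germany FR",
    "Brazil", "Brazil", "England", "Brazil", "Germany FR",
    "Argentina", "Italy", "Argentina", "Germany FR", "Brazil",
    "France", "Brazil", "Italy", "Spain", "Germany",
], name="Winner")

# Map the older label onto the current one, then count titles
merged = winners.replace({"Germany FR": "Germany"})
titles = merged.value_counts()

print(titles)
```

Passing the merged Series to sns.countplot would then draw a single Germany bar.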

Q 19. Show boxplot and calculate the interquartile range for the variable 'GoalsScored'¶

In [ ]:
plt.boxplot(fifa['GoalsScored'])

plt.text(x=1.1,y=fifa['GoalsScored'].min(), s='min')
plt.text(x=1.1,y=fifa.GoalsScored.quantile(0.25), s='Q1')
plt.text(x=1.1,y=fifa['GoalsScored'].median(), s='median(Q2)')
plt.text(x=1.1,y=fifa.GoalsScored.quantile(0.75), s='Q3')
plt.text(x=1.1,y=fifa['GoalsScored'].max(), s='max')

plt.title('Boxplot of GoalsScored')
plt.ylabel('Goals')
plt.show()
[Figure: boxplot of GoalsScored]
In [ ]:
Q1 = fifa.quantile(q = .25, numeric_only = True)
Q3 = fifa.quantile(q = .75, numeric_only = True)
IQR = Q3 - Q1
print(IQR)
Year              38.00
GoalsScored       56.25
QualifiedTeams    10.00
MatchesPlayed     24.50
dtype: float64

Write your Answer here:¶

Ans 19:

  • The boxplot shows that there are no outliers for the number of goals scored.
  • The IQR for GoalsScored is 56.25 (Q1 = 89, Q3 = 145.25), so the middle half of the tournaments spans a wide range of goal totals. This is expected, since the number of matches per tournament ranges from 17 to 64.
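The "no outliers" reading can be confirmed numerically with the 1.5 × IQR rule, which is the default whisker rule in matplotlib's boxplot. The sketch below uses the GoalsScored values copied from the dataset output; the fence variable names are illustrative.

```python
import pandas as pd

# GoalsScored for the 20 tournaments, in chronological order
goals = pd.Series([70, 70, 84, 88, 140, 126, 89, 89, 95, 97,
                   102, 146, 132, 115, 141, 171, 161, 147, 145, 171])

q1 = goals.quantile(0.25)
q3 = goals.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences would be drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = goals[(goals < lower_fence) | (goals > upper_fence)]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(len(outliers))
```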

Q 20. Find and visualize the correlation relation among numeric variables¶

In [ ]:
corr_matrix = fifa.corr(numeric_only = True)

corr_matrix
Out[ ]:
                Year      GoalsScored  QualifiedTeams  MatchesPlayed
Year            1.000000  0.829886     0.895565        0.972473
GoalsScored     0.829886  1.000000     0.866201        0.876201
QualifiedTeams  0.895565  0.866201     1.000000        0.949164
MatchesPlayed   0.972473  0.876201     0.949164        1.000000
In [ ]:
sns.heatmap(corr_matrix, annot = True)

# display the plot
plt.show()
[Figure: heatmap of the correlation matrix]

Write your Answer here:¶

Ans 20:

  • Year has a strong positive correlation with all other numeric variables (0.83 to 0.97). This indicates that the number of goals scored, the number of qualified teams, and the number of matches played have all increased over the years, i.e. the tournament has grown over time.
  • The number of goals scored is positively correlated with the number of qualified teams and the number of matches played, which is expected: more teams and more matches give more opportunities to score.
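DataFrame.corr() computes the Pearson correlation coefficient by default. As a sketch of what that number means, the snippet below computes it by hand for Year and MatchesPlayed and compares the result with NumPy's corrcoef. The values are copied from the dataset output; `r_manual` and `r_numpy` are illustrative names.

```python
import numpy as np

year = np.array([1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
                 1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014],
                dtype=float)
matches = np.array([18, 17, 18, 22, 26, 35, 32, 32, 32, 38,
                    38, 52, 52, 52, 52, 64, 64, 64, 64, 64], dtype=float)

# Deviations from the mean
dx = year - year.mean()
dy = matches - matches.mean()

# Pearson r: sum of products of deviations, scaled by both spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity from NumPy for comparison
r_numpy = np.corrcoef(year, matches)[0, 1]

print(r_manual, r_numpy)
```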