Data manipulation and exploration project

May 2023

Web address to download the files in one zipped folder: https://www.kaggle.com/datasets/unsdsn/world-happiness

In [1]:
import pandas as pd

# Set the working directory to the folder containing the datasets
import os
os.chdir(r"C:\Users\Waneda\Documents\Personal documents\Projects\World happiness index_Python")
In [3]:
# Import the 2019.csv file as a Pandas dataframe
df2019 = pd.read_csv("2019.csv")
df2019.head()
Out[3]:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
In [4]:
# Import the 2018.csv file as a Pandas dataframe
df2018 = pd.read_csv("2018.csv")
df2018.head()
Out[4]:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.632 1.305 1.592 0.874 0.681 0.202 0.393
1 2 Norway 7.594 1.456 1.582 0.861 0.686 0.286 0.340
2 3 Denmark 7.555 1.351 1.590 0.868 0.683 0.284 0.408
3 4 Iceland 7.495 1.343 1.644 0.914 0.677 0.353 0.138
4 5 Switzerland 7.487 1.420 1.549 0.927 0.660 0.256 0.357

The 2018 and 2019 datasets appear to have exactly the same columns, so we should be able to merge them. Let us confirm this first.

In [5]:
# Check whether the columns and data types of the two dataframes are exactly the same.

# Compare the columns of the two dataframes
print("Columns in 2019 dataframe: ")
print(df2019.columns)

print("\nColumns in 2018 dataframe: ")
print(df2018.columns)

# Compare the data types of the two dataframes
print("\nData types in 2019 dataframe: ")
print(df2019.dtypes)

print("\nData types in 2018 dataframe: ")
print(df2018.dtypes)
Columns in 2019 dataframe: 
Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

Columns in 2018 dataframe: 
Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

Data types in 2019 dataframe: 
Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object

Data types in 2018 dataframe: 
Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object
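
The printed comparison above can also be done programmatically rather than by eye. This is a minimal sketch, assuming two small stand-in dataframes (`sample_2018` and `sample_2019` are illustrative names, built from the first rows shown earlier and limited to three columns); the same two checks would apply to `df2018` and `df2019`.

```python
import pandas as pd

# Tiny stand-ins for df2019 and df2018, built from the first rows shown above
# (only three of the nine columns, to keep the sketch short)
sample_2019 = pd.DataFrame({
    "Overall rank": [1, 2],
    "Country or region": ["Finland", "Denmark"],
    "Score": [7.769, 7.600],
})
sample_2018 = pd.DataFrame({
    "Overall rank": [1, 2],
    "Country or region": ["Finland", "Norway"],
    "Score": [7.632, 7.594],
})

# True only if the column names match, in the same order
same_columns = sample_2019.columns.equals(sample_2018.columns)

# True only if every column also has the same data type
same_dtypes = sample_2019.dtypes.equals(sample_2018.dtypes)

print(same_columns, same_dtypes)
```

A single boolean per check is easier to rely on than comparing two long printouts, especially if more yearly files are added later.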

Since the columns and data types are exactly the same in the two datasets, let us combine them into a new dataset, "happiness", with a "Year" column recording the year each row came from.

In [6]:
# Add the Year column to the df2018 and df2019 dataframes
df2018 = df2018.assign(Year=2018)
df2019 = df2019.assign(Year=2019)

# Stack the df2019 and df2018 dataframes into the new happiness dataframe
# (Year was already added above, so it does not need to be assigned again)
happiness = pd.concat([df2019, df2018], sort=False)

# Print the first few rows of the merged dataframe to verify that it has been merged correctly
happiness.head()
Out[6]:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Year
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393 2019
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410 2019
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341 2019
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118 2019
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298 2019
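
One detail worth knowing about `pd.concat`: by default it keeps each input's row labels, so the combined dataframe carries the labels 0 to 155 twice. The sketch below illustrates this on two small stand-in frames (`sample_2019` and `sample_2018` are illustrative names, with values taken from the first rows shown earlier) and shows `ignore_index=True` as one optional way to get unique labels. This notebook does not use that option, and the row labels in its outputs reflect the default behaviour.

```python
import pandas as pd

# Small stand-ins for df2019 and df2018, with the Year column already added
sample_2019 = pd.DataFrame({
    "Country or region": ["Finland", "Denmark"],
    "Score": [7.769, 7.600],
}).assign(Year=2019)
sample_2018 = pd.DataFrame({
    "Country or region": ["Finland", "Norway"],
    "Score": [7.632, 7.594],
}).assign(Year=2018)

# Default behaviour: row labels are kept, so 0 and 1 each appear twice
stacked = pd.concat([sample_2019, sample_2018])
print(stacked.index.is_unique)

# Optional alternative: renumber the rows from 0 to n-1
renumbered = pd.concat([sample_2019, sample_2018], ignore_index=True)
print(renumbered.index.tolist())
```

Repeated labels are harmless for the steps in this project, but they matter if rows are later selected by label with `.loc`.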
In [7]:
# Checking the information of the new merged dataset
happiness.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 312 entries, 0 to 155
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  312 non-null    int64  
 1   Country or region             312 non-null    object 
 2   Score                         312 non-null    float64
 3   GDP per capita                312 non-null    float64
 4   Social support                312 non-null    float64
 5   Healthy life expectancy       312 non-null    float64
 6   Freedom to make life choices  312 non-null    float64
 7   Generosity                    312 non-null    float64
 8   Perceptions of corruption     311 non-null    float64
 9   Year                          312 non-null    int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 26.8+ KB
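
The summary above reports 311 non-null values for "Perceptions of corruption" against 312 rows, so one value is missing. A minimal sketch of how such gaps could be located, assuming a small stand-in dataframe (`sample` is an illustrative name; its three rows are copied from outputs in this notebook, including the United Arab Emirates 2018 row, where the value is NaN):

```python
import numpy as np
import pandas as pd

# Stand-in for the happiness dataframe: three rows copied from this notebook's
# outputs, one of which has no 'Perceptions of corruption' value
sample = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "United Arab Emirates"],
    "Perceptions of corruption": [0.393, 0.410, np.nan],
    "Year": [2019, 2019, 2018],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```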

The happiness dataframe does not have a region column. I think this is an interesting column to have. However, it is only available in the 2015 and 2016 datasets. Let us import the 2016.csv dataset as a pandas dataframe.

In [8]:
# Import the 2016.csv file as a Pandas dataframe
df2016 = pd.read_csv("2016.csv")

# Display the first few rows of the 2016 dataframe
df2016.head(3)
Out[8]:
Country Region Happiness Rank Happiness Score Lower Confidence Interval Upper Confidence Interval Economy (GDP per Capita) Family Health (Life Expectancy) Freedom Trust (Government Corruption) Generosity Dystopia Residual
0 Denmark Western Europe 1 7.526 7.460 7.592 1.44178 1.16374 0.79504 0.57941 0.44453 0.36171 2.73939
1 Switzerland Western Europe 2 7.509 7.428 7.590 1.52733 1.14524 0.86303 0.58557 0.41203 0.28083 2.69463
2 Iceland Western Europe 3 7.501 7.333 7.669 1.42666 1.18326 0.86733 0.56624 0.14975 0.47678 2.83137
In [9]:
# Rename the 'Country or region' column to 'Country' in the happiness dataset
happiness.rename(columns={'Country or region': 'Country'}, inplace=True)

# Print the first few rows of the renamed dataframe to verify that the column name has been changed
happiness.head()
Out[9]:
Overall rank Country Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Year
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393 2019
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410 2019
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341 2019
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118 2019
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298 2019

Now the country column has the same name, "Country", in both the happiness dataset and the df2016 dataset. However, the happiness dataset does not have the Region column. We will now use the "Region" column in the df2016 dataset to assign the relevant regions to the countries in the "Country" column of the happiness dataset.

In [10]:
# Create a dictionary that maps countries to regions
region_dict = df2016.set_index('Country')['Region'].to_dict()

# Map the regions to the countries in the happiness dataframe
happiness['Region'] = happiness['Country'].map(region_dict)

# Print the first few rows of the updated dataframe to verify that the regions have been assigned correctly
happiness.head()
Out[10]:
Overall rank Country Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Year Region
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393 2019 Western Europe
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410 2019 Western Europe
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341 2019 Western Europe
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118 2019 Western Europe
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298 2019 Western Europe

After adding the regions using the 2016 dataset, some countries may still have no region assigned. Let's find the countries whose region is empty and list them.

In [11]:
# Find which countries have a missing region
missing_regions = happiness[happiness['Region'].isna()]['Country'].unique()

# Print the list of countries with missing regions
print("Countries with missing regions:")
for country in missing_regions:
    print(country)
Countries with missing regions:
Trinidad & Tobago
Northern Cyprus
North Macedonia
Gambia
Mozambique
Swaziland
Lesotho
Central African Republic
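
A country can end up without a region either because it is absent from the 2016 file or because its name is spelled differently there. One rough way to tell the two cases apart is to look for near matches with the standard-library `difflib` module. This is only a sketch: the `reference_names` list below is a hypothetical stand-in for `df2016['Country']`, and its spellings are assumptions for illustration, not values read from 2016.csv.

```python
import difflib

# Names that found no match, taken from the list printed above
unmatched = ["Trinidad & Tobago", "Northern Cyprus", "North Macedonia", "Gambia"]

# HYPOTHETICAL reference spellings, standing in for df2016['Country'];
# these are assumptions for illustration, not values read from 2016.csv
reference_names = ["Trinidad and Tobago", "North Cyprus", "Macedonia", "Finland"]

# For each unmatched name, keep the single closest reference spelling (if any)
suggestions = {
    name: difflib.get_close_matches(name, reference_names, n=1, cutoff=0.6)
    for name in unmatched
}

for name, candidates in suggestions.items():
    print(name, "->", candidates)
```

An empty list suggests the country is probably missing from the reference file altogether; a suggestion still needs a human check before it is trusted.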

We will look up the regions of the countries in the list above independently, then assign them here. But first, let us check the region names in the dataset to make sure that we assign the missing countries to existing regions:

In [12]:
region_counts = happiness.groupby('Region')['Country'].count()
print(region_counts)
Region
Australia and New Zealand           4
Central and Eastern Europe         57
Eastern Asia                       12
Latin America and Caribbean        41
Middle East and Northern Africa    38
North America                       4
Southeastern Asia                  18
Southern Asia                      14
Sub-Saharan Africa                 71
Western Europe                     40
Name: Country, dtype: int64
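
Note that the counts above add up to 299, not 312: grouping by "Region" leaves out the rows whose region is still missing. A minimal sketch of a tally that keeps those rows visible, assuming a small stand-in Series (`regions` is an illustrative name for something like `happiness['Region']`):

```python
import pandas as pd

# Stand-in for happiness['Region'], with one entry not yet assigned
regions = pd.Series(
    ["Western Europe", "Western Europe", "Sub-Saharan Africa", None],
    name="Region",
)

# dropna=False keeps the missing entries as their own row in the tally
counts_with_missing = regions.value_counts(dropna=False)
print(counts_with_missing)

# Number of rows that still lack a region
n_missing = int(regions.isna().sum())
print(n_missing)
```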

Now we'll make a dictionary to assign region names, then add these to the correct countries in the happiness dataset.

In [13]:
# Create a dictionary with appropriate region names for the missing countries
region_dict = {
    'Trinidad & Tobago': 'Latin America and Caribbean',
    'Northern Cyprus': 'Middle East and Northern Africa',
    'North Macedonia': 'Central and Eastern Europe',
    'Gambia': 'Sub-Saharan Africa',
    'Mozambique': 'Sub-Saharan Africa',
    'Swaziland': 'Sub-Saharan Africa',
    'Lesotho': 'Sub-Saharan Africa',
    'Central African Republic': 'Sub-Saharan Africa'
}

# Fill only the missing Region values, using map() to look up each country in region_dict
happiness['Region'] = happiness['Region'].fillna(happiness['Country'].map(region_dict))
In [ ]:
# Now to check if the assignment was done correctly, let us check which countries still have no region assigned:
In [14]:
# Find which countries have a missing region
missing_regions1 = happiness[happiness['Region'].isna()]['Country'].unique()

# Print the list of countries with missing regions
print("Countries with missing regions:")
for country in missing_regions1:
    print(country)
Countries with missing regions:
In [ ]:
# We can also check the new region counts:
In [15]:
region_counts1 = happiness.groupby('Region')['Country'].count()
print(region_counts1)
Region
Australia and New Zealand           4
Central and Eastern Europe         58
Eastern Asia                       12
Latin America and Caribbean        43
Middle East and Northern Africa    40
North America                       4
Southeastern Asia                  18
Southern Asia                      14
Sub-Saharan Africa                 79
Western Europe                     40
Name: Country, dtype: int64

Now that the dataset is finally ready for analysis, we will perform a few basic operations to explore the data.

In [19]:
# Display the last five rows
happiness.tail()
Out[19]:
Overall rank Country Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Year Region
151 152 Yemen 3.355 0.442 1.073 0.343 0.244 0.083 0.064 2018 Middle East and Northern Africa
152 153 Tanzania 3.303 0.455 0.991 0.381 0.481 0.270 0.097 2018 Sub-Saharan Africa
153 154 South Sudan 3.254 0.337 0.608 0.177 0.112 0.224 0.106 2018 Sub-Saharan Africa
154 155 Central African Republic 3.083 0.024 0.000 0.010 0.305 0.218 0.038 2018 Sub-Saharan Africa
155 156 Burundi 2.905 0.091 0.627 0.145 0.065 0.149 0.076 2018 Sub-Saharan Africa
In [24]:
# Let us sort the data in alphabetical order by the Country column.
happiness = happiness.sort_values(by='Country')
happiness.head()
Out[24]:
Overall rank Country Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Year Region
153 154 Afghanistan 3.203 0.350 0.517 0.361 0.000 0.158 0.025 2019 Southern Asia
144 145 Afghanistan 3.632 0.332 0.537 0.255 0.085 0.191 0.036 2018 Southern Asia
111 112 Albania 4.586 0.916 0.817 0.790 0.419 0.149 0.032 2018 Central and Eastern Europe
106 107 Albania 4.719 0.947 0.848 0.874 0.383 0.178 0.027 2019 Central and Eastern Europe
83 84 Algeria 5.295 0.979 1.154 0.687 0.077 0.055 0.135 2018 Middle East and Northern Africa
In [27]:
# Now let us sort by GDP per capita in descending order
happiness = happiness.sort_values(by='GDP per capita', ascending=False)
happiness.head()
Out[27]:
Overall rank Country Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Year Region
19 20 United Arab Emirates 6.774 2.096 0.776 0.670 0.284 0.186 NaN 2018 Middle East and Northern Africa
28 29 Qatar 6.374 1.684 1.313 0.871 0.555 0.220 0.167 2019 Middle East and Northern Africa
31 32 Qatar 6.374 1.649 1.303 0.748 0.654 0.256 0.171 2018 Middle East and Northern Africa
13 14 Luxembourg 7.090 1.609 1.479 1.012 0.526 0.194 0.316 2019 Western Europe
16 17 Luxembourg 6.910 1.576 1.520 0.896 0.632 0.196 0.321 2018 Western Europe
In [29]:
# Now let us display the information of my home country, Zimbabwe.
zimbabwe_data = happiness[happiness['Country'] == 'Zimbabwe']
print(zimbabwe_data)
     Overall rank   Country  Score  GDP per capita  Social support  \
145           146  Zimbabwe  3.663           0.366           1.114   
143           144  Zimbabwe  3.692           0.357           1.094   

     Healthy life expectancy  Freedom to make life choices  Generosity  \
145                    0.433                         0.361       0.151   
143                    0.248                         0.406       0.132   

     Perceptions of corruption  Year              Region  
145                      0.089  2019  Sub-Saharan Africa  
143                      0.099  2018  Sub-Saharan Africa  
In [ ]:
# Let's produce a table showing the GDP per capita and Healthy life expectancy of the 10 entries
# with the highest Perceptions of corruption score (a country can appear once for each year).
In [31]:
# Sort by Perceptions of corruption score in descending order
sorted_happiness = happiness.sort_values(by='Perceptions of corruption', ascending=False)

# Select the 10 rows with the highest Perceptions of corruption score
top_10 = sorted_happiness.head(10)

# Select only the columns we need
top_10_gdp_hle = top_10[['Country', 'GDP per capita', 'Healthy life expectancy']]

# Display the table
print(top_10_gdp_hle)
         Country  GDP per capita  Healthy life expectancy
33     Singapore           1.529                    1.008
33     Singapore           1.572                    1.141
150       Rwanda           0.332                    0.400
151       Rwanda           0.359                    0.614
1        Denmark           1.383                    0.996
2        Denmark           1.351                    0.868
0        Finland           1.305                    0.874
0        Finland           1.340                    0.986
7    New Zealand           1.268                    0.876
8         Sweden           1.355                    0.913
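
Because the dataframe holds one row per country per year, the table above lists several countries twice. If ten distinct countries are wanted, two possible approaches are sketched below on a small stand-in dataframe (`sample` is an illustrative name; its values come from the head() outputs shown earlier in this notebook). Which option fits depends on the question being asked, so treat this as a sketch rather than the intended method.

```python
import pandas as pd

# Stand-in for the happiness dataframe, using 'Perceptions of corruption'
# values from the head() outputs shown earlier in this notebook
sample = pd.DataFrame({
    "Country": ["Finland", "Denmark", "Norway", "Finland", "Norway", "Denmark"],
    "Perceptions of corruption": [0.393, 0.410, 0.341, 0.393, 0.340, 0.408],
    "Year": [2019, 2019, 2019, 2018, 2018, 2018],
})

# Option 1: rank the countries within a single year
top_2019 = (
    sample[sample["Year"] == 2019]
    .sort_values(by="Perceptions of corruption", ascending=False)
    .head(2)
)

# Option 2: keep each country's highest-scoring row across both years
top_distinct = (
    sample.sort_values(by="Perceptions of corruption", ascending=False)
    .drop_duplicates(subset="Country")
    .head(2)
)

print(top_2019)
print(top_distinct)
```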

That is all for this project! As a goodbye, let's make one colourful visualisation. The visualisation below shows the relationship between GDP per capita and happiness score, with each region shown in its own colour.

In [17]:
import matplotlib.pyplot as plt

# Create a dictionary to map region names to colors
colors = {
    'Western Europe': '#1f77b4',
    'North America': '#ff7f0e',
    'Australia and New Zealand': '#2ca02c',
    'Middle East and Northern Africa': '#d62728',
    'Latin America and Caribbean': '#9467bd',
    'Southeastern Asia': '#8c564b',
    'Southern Asia': '#17becf',
    'Central and Eastern Europe': '#e377c2',
    'Eastern Asia': '#7f7f7f',
    'Sub-Saharan Africa': '#bcbd22'
}

# Create a scatter plot of Happiness Score versus GDP per Capita, colored by Region
fig, ax = plt.subplots(figsize=(10, 6))
for region, group in happiness.groupby('Region'):
    color = colors.get(region, '#999999')
    ax.scatter(group['GDP per capita'], group['Score'], label=region, color=color, alpha=0.7)

# Add axis labels and title
ax.set_xlabel('GDP per Capita')
ax.set_ylabel('Happiness Score')
ax.set_title('Happiness Score vs GDP per Capita by Region')

# Add a legend
ax.legend()

# Display the plot
plt.show()

We will need this dataset for another project dedicated to data visualisation. Let's export it as a .csv file:

In [18]:
# Export the happiness dataframe to a CSV file
happiness.to_csv('happiness.csv', index=False)

# index=False avoids exporting the dataframe index as a separate column in the CSV file
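
A quick way to gain confidence in an export is to read it back and compare its shape and column names with the original. The sketch below does this on a small stand-in dataframe (`sample` is an illustrative name) and writes to an in-memory buffer rather than to disk; with a real file, the filename would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Small stand-in for the happiness dataframe
sample = pd.DataFrame({
    "Country": ["Finland", "Denmark"],
    "Score": [7.769, 7.600],
    "Year": [2019, 2019],
})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a dataframe
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With index=False there is no extra index column, so these should match
print(reloaded.shape == sample.shape)
print(list(reloaded.columns) == list(sample.columns))
```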

~END~