Palmer Penguins EDA with Python

Artwork by @allison_horst

Python
exploratory data analysis
Author

Olamide Adu

Published

December 6, 2024

Introduction

Python is often described as an easy, beginner-friendly language, and in some ways that is true. Python is not my first language; R is, and I have been using it for a while now, but it is about time I added another language to my arsenal. It’s been 3 months since I picked up Python as my next language, and I think a project is about due. I have mostly learned by reading textbooks. I did not finish any of them, but I covered roughly 50% to 80% of each. The books read are:

The Project

My first project is an exploratory data analysis of the Palmer Penguins data. The goal of this blog post is to document how much I have grown over the last 3 months. Without thinking much about it, I prompted ChatGPT to generate some EDA questions on the Palmer Penguins data.

Questions Generated

The generated questions are listed below:

  • How many missing values are there in the dataset, and in which columns?
  • What is the average body mass of penguins for each species?
  • Are there any differences in bill length among the three islands in the dataset?
  • What is the distribution of flipper length for each penguin species?
  • Is there any relationship between bill length and bill depth?
  • How does the average body mass compare between male and female penguins, both overall and within each species?
  • What is the proportion of penguin species found on each island?
  • How does flipper length vary across the different species and islands?
  • What are the maximum and minimum body masses recorded in the dataset?
  • Are there any outliers in the bill length measurements for each species?

Prep

First, I imported the packages used in this project. If any of them is not installed yet, install it first, for example with !pip install <package_name> from a notebook cell.

# Import library -----------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Next, I set the themes for graphs.

## Set style and themes for plots ---------------------
sns.set_style("whitegrid")
plt.style.use("tableau-colorblind10")

Solution

To begin, we import the data and get a quick preview.

# Import data --------------------------------
penguins = pd.read_csv("penguins.csv")
penguins.head()
   id  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0   0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1   1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2   2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3   3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4   4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
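The preview shows an id column that simply repeats the default row index. A minimal sketch of reading that column as the index instead, assuming the file stores it under the name id. The small in-memory CSV below is a stand-in for penguins.csv, with only a few columns and values copied from the preview:

```python
import io

import pandas as pd

# Tiny stand-in for penguins.csv (a subset of columns, values from the preview).
csv_text = """id,species,island,bill_length_mm,sex,year
0,Adelie,Torgersen,39.1,male,2007
1,Adelie,Torgersen,39.5,female,2007
3,Adelie,Torgersen,,,2007
"""

# index_col="id" uses the id column as the row index instead of keeping it as data.
sample = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(sample.shape)   # (rows, columns), with id no longer counted as a column
print(sample.dtypes)  # quick check that the measurements were parsed as numbers
```

With the real file, the equivalent call would be pd.read_csv("penguins.csv", index_col="id"), provided the column is really named id in the file.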

With the data imported, I will proceed to answer the questions. Some questions are answered with graphs, while others have tables as their output.

  1. How many missing values are there in the dataset, and in which columns?
# Count missing values in each column
penguins.isna().sum()
id                    0
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64
## Rows with one or more missing values

print(f"The total number of observations with missing data is: {penguins.isna().any(axis=1).sum()}")
The total number of observations with missing data is: 11
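Raw counts do not show how large the gaps are relative to the size of the data. A minimal sketch that reports both the count and the percentage of missing values per column. The toy frame is made up for illustration and is not the penguins data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the penguins data (values are illustrative only).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, np.nan, 4100.0, np.nan],
    "sex": ["male", np.nan, np.nan, np.nan],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

# Number of rows that have at least one missing value.
rows_with_na = int(toy.isna().any(axis=1).sum())

print(missing)
print(rows_with_na)
```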

The observations with missing data are shown below:

penguins[penguins.isna().any(axis=1)]
      id  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex  year
3      3   Adelie  Torgersen             NaN            NaN                NaN          NaN  NaN  2007
8      8   Adelie  Torgersen            34.1           18.1              193.0       3475.0  NaN  2007
9      9   Adelie  Torgersen            42.0           20.2              190.0       4250.0  NaN  2007
10    10   Adelie  Torgersen            37.8           17.1              186.0       3300.0  NaN  2007
11    11   Adelie  Torgersen            37.8           17.3              180.0       3700.0  NaN  2007
47    47   Adelie      Dream            37.5           18.9              179.0       2975.0  NaN  2007
178  178   Gentoo     Biscoe            44.5           14.3              216.0       4100.0  NaN  2007
218  218   Gentoo     Biscoe            46.2           14.4              214.0       4650.0  NaN  2008
256  256   Gentoo     Biscoe            47.3           13.8              216.0       4725.0  NaN  2009
268  268   Gentoo     Biscoe            44.5           15.7              217.0       4875.0  NaN  2009
271  271   Gentoo     Biscoe             NaN            NaN                NaN          NaN  NaN  2009

Before proceeding to the next question, I will remove observations with missing data. I think we can do without 11 observations.

penguins_cleaned = penguins.dropna()

penguins_cleaned.isna().sum() # rows with NAs removed
id                   0
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
year                 0
dtype: int64
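Calling dropna() with no arguments drops every row that has any missing value, yet 9 of the 11 rows listed earlier lack only sex. One alternative, sketched here on a made-up toy frame, is to pass subset= so that a row is dropped only when the columns a question actually needs are missing:

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 has no measurement at all, row 2 lacks only sex.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3750.0, np.nan, 4100.0],
    "sex": ["male", np.nan, np.nan],
})

# Drop a row only if one of the listed columns is missing.
measured = toy.dropna(subset=["body_mass_g"])

# Default behaviour for comparison: drop rows with any missing value.
complete = toy.dropna()

print(len(measured), len(complete))  # 2 1
```

Whether the extra rows are worth keeping depends on the question. Anything that groups by sex still needs the fully cleaned frame.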
  2. What is the average body mass of penguins for each species?
## Average Body Mass for Each Species
penguins_cleaned.groupby("species").agg(
    mean_weight = pd.NamedAgg(column="body_mass_g", aggfunc="mean")
)
           mean_weight
species
Adelie     3706.164384
Chinstrap  3733.088235
Gentoo     5092.436975
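pandas also accepts a shorter tuple form of named aggregation, (column, function), which makes it easy to report a few statistics side by side. A minimal sketch on made-up numbers (not the real body masses):

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3000.0, 4000.0, 5000.0, 5500.0],
})

# Each keyword becomes an output column: name=(input column, aggregation).
summary = toy.groupby("species").agg(
    mean_weight=("body_mass_g", "mean"),
    median_weight=("body_mass_g", "median"),
    n=("body_mass_g", "count"),
)

print(summary)
```

Reporting the group size next to the mean helps judge how much weight each average deserves.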
  3. Are there any differences in bill length among the three islands in the dataset?
penguins_cleaned.groupby("island").agg(
  mean_bill_length = pd.NamedAgg(column="bill_length_mm", aggfunc="mean")
).round(2)
           mean_bill_length
island
Biscoe                45.25
Dream                 44.22
Torgersen             39.04
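Island averages can reflect which species live on each island rather than the island itself, because not every species is found on every island. A minimal sketch that splits the mean by island and species with pivot_table. The numbers are made up for illustration:

```python
import pandas as pd

# Illustrative values only; the real analysis would use penguins_cleaned.
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
    "species": ["Adelie", "Gentoo", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.0, 47.0, 48.0, 38.0, 49.0, 39.0],
})

# Rows are islands, columns are species; empty cells mean the species
# was not recorded on that island.
by_island_species = toy.pivot_table(
    index="island", columns="species", values="bill_length_mm", aggfunc="mean"
).round(2)

print(by_island_species)
```

Comparing the same species across islands (reading down a column) separates island differences from species differences.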
  4. What is the distribution of flipper length for each penguin species?
adelie_species = penguins_cleaned[penguins_cleaned.species == "Adelie"]
gentoo_species = penguins_cleaned[penguins_cleaned.species == "Gentoo"]
chinstrap_species = penguins_cleaned[penguins_cleaned.species == "Chinstrap"]

plt.figure()
plt.hist(adelie_species.flipper_length_mm, label="Adelie")
plt.hist(gentoo_species.flipper_length_mm, label="Gentoo")
plt.hist(chinstrap_species.flipper_length_mm, label="Chinstrap")
plt.title(
    "Distribution of Flipper Length for the Three Penguin Species", 
    size=13, loc="left", fontweight="bold"
)
plt.xlabel("Flipper Length ($mm$)", size=9, fontweight="bold", loc="right")
plt.ylabel("Count", size=9, weight="bold", loc="top")
plt.legend()
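By default each plt.hist call chooses its own 10 bins from its own data, so the bars of the three species are not directly comparable, and opaque bars hide whatever lies behind them. A minimal sketch of one way around both, using shared bin edges and some transparency. The flipper lengths are synthetic random numbers, not the real measurements:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs as a script
import matplotlib.pyplot as plt
import numpy as np

# Synthetic flipper lengths (mm); NOT the real penguin measurements.
rng = np.random.default_rng(0)
samples = {
    "Adelie": rng.normal(190, 6, 100),
    "Chinstrap": rng.normal(196, 7, 60),
    "Gentoo": rng.normal(217, 6, 90),
}

# One set of bin edges computed from all values, so every species uses the same bins.
all_values = np.concatenate(list(samples.values()))
bin_edges = np.histogram_bin_edges(all_values, bins=20)

fig, ax = plt.subplots()
for name, values in samples.items():
    # alpha < 1 keeps overlapping bars visible.
    ax.hist(values, bins=bin_edges, alpha=0.6, label=name)
ax.set_xlabel("Flipper Length ($mm$)")
ax.set_ylabel("Count")
ax.legend()
```

In a notebook the matplotlib.use("Agg") line can be dropped so the figure displays inline.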

  5. Is there any relationship between bill length and bill depth?
plt.figure()
sns.scatterplot(
    x="bill_length_mm",
    y="bill_depth_mm",
    data=penguins_cleaned,
    hue="species",  
)
plt.title("Relationship between Bill length($mm$) and Bill depth($mm$)", size=13, weight="bold")
plt.xlabel("Bill length ($mm$)", size=9)
plt.ylabel("Bill depth ($mm$)", size=9)

Alternatively, I can fit a linear regression line for each species to investigate the relationship between bill length and bill depth.

rel_plt = sns.lmplot(
    x="bill_length_mm",
    y="bill_depth_mm",
    data=penguins_cleaned,
    hue="species",
    markers=["+", "o", "p"]
)
plt.tight_layout()
sns.move_legend(
    rel_plt,
    "upper right",
    frameon=True
)
plt.title("Relationship between Bill length($mm$) and Bill depth($mm$)", size=13, weight="bold")
plt.xlabel("Bill length ($mm$)", size=9)
plt.ylabel("Bill depth ($mm$)", size=9)
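A plot suggests a relationship, while a correlation coefficient puts a number on it. The sketch below computes Pearson's r for all rows together and then per group. The toy data are constructed on purpose so that the pooled correlation is negative while each group is positive, which shows why it is worth checking both. They say nothing about what the real penguin data give:

```python
import pandas as pd

# Made-up measurements, chosen so the pattern is easy to see.
toy = pd.DataFrame({
    "species": ["A"] * 4 + ["B"] * 4,
    "bill_length_mm": [36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 48.0, 50.0],
    "bill_depth_mm": [17.0, 18.0, 19.0, 20.0, 13.0, 14.0, 15.0, 16.0],
})

# Pearson correlation over all rows, ignoring species.
overall_r = toy["bill_length_mm"].corr(toy["bill_depth_mm"])

# The same correlation computed separately within each species.
per_species_r = {
    name: group["bill_length_mm"].corr(group["bill_depth_mm"])
    for name, group in toy.groupby("species")
}

print(round(overall_r, 3))
print(per_species_r)
```

On the real data, the same two calls on penguins_cleaned would show whether the per-species regression lines and the pooled trend agree.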

  6. How does the average body mass compare between male and female penguins, both overall and within each species?
penguins_cleaned.groupby("sex").agg(
    mean_weight = pd.NamedAgg(column="body_mass_g", aggfunc="mean")
).round(2)
        mean_weight
sex
female      3862.27
male        4545.68

Comparison of penguin weight by sex across the penguin species:

plt_dt = penguins_cleaned.groupby(["sex", "species"])["body_mass_g"].agg("mean")

plt_dt = plt_dt.reset_index()
plt_dt
      sex    species  body_mass_g
0  female     Adelie  3368.835616
1  female  Chinstrap  3527.205882
2  female     Gentoo  4679.741379
3    male     Adelie  4043.493151
4    male  Chinstrap  3938.970588
5    male     Gentoo  5484.836066
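The long table is easier to compare once the sexes sit side by side. A minimal sketch that pivots it to one row per species and adds the male minus female difference. The means are copied from the table above, rounded to one decimal place:

```python
import pandas as pd

# Means copied from the table above, rounded to one decimal place.
plt_dt = pd.DataFrame({
    "sex": ["female"] * 3 + ["male"] * 3,
    "species": ["Adelie", "Chinstrap", "Gentoo"] * 2,
    "body_mass_g": [3368.8, 3527.2, 4679.7, 4043.5, 3939.0, 5484.8],
})

# One row per species, one column per sex.
wide = plt_dt.pivot(index="species", columns="sex", values="body_mass_g")

# Gap between the male and female means within each species.
wide["male_minus_female"] = (wide["male"] - wide["female"]).round(1)

print(wide)
```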
  7. What is the proportion of penguin species found on each island?
# Proportion of penguin species found on each island?

penguins_cleaned.value_counts(
    subset=["island","species"],
     normalize=True, sort=False
     ).round(2).reset_index()
      island    species  proportion
0     Biscoe     Adelie        0.13
1     Biscoe     Gentoo        0.36
2      Dream     Adelie        0.17
3      Dream  Chinstrap        0.20
4  Torgersen     Adelie        0.14
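The proportions above are shares of the whole dataset: the five values together add up to 1. If the question is read as "what share of each island's penguins belongs to each species", a row-normalized cross-tabulation is one way to get it. A minimal sketch with made-up counts:

```python
import pandas as pd

# Illustrative rows only; the real analysis would use penguins_cleaned.
toy = pd.DataFrame({
    "island": ["Biscoe"] * 4 + ["Dream"] * 4 + ["Torgersen"] * 2,
    "species": ["Adelie", "Gentoo", "Gentoo", "Gentoo",
                "Adelie", "Adelie", "Chinstrap", "Chinstrap",
                "Adelie", "Adelie"],
})

# normalize="index" divides each row by its own total,
# so the proportions within every island sum to 1.
within_island = pd.crosstab(
    toy["island"], toy["species"], normalize="index"
).round(2)

print(within_island)
```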
  8. How does flipper length vary across the different species and islands?
# How does flipper length vary across the different species and islands?

plt.figure()
sns.boxplot(
    x="island",
    y="flipper_length_mm",
    hue="species",
    data=penguins_cleaned
)
plt.xlabel("Island", loc="right", size=9, weight=900, style="italic")
plt.ylabel("Flipper Length ($mm$)",size=9, loc="top", style="italic", weight=900)
plt.title(
    "Distribution of Flipper Length ($mm$), Penguins Species Across Different Islands",
    weight="heavy", size=14
)
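A boxplot shows the spread visually, and a small table can back it up with numbers. A minimal sketch that summarizes flipper length for every island and species combination. The values are made up for illustration:

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
    "species": ["Adelie", "Gentoo", "Gentoo", "Adelie", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [188.0, 215.0, 221.0, 190.0, 195.0, 199.0],
})

# Group size, median and range per island/species pair.
flipper_summary = (
    toy.groupby(["island", "species"])["flipper_length_mm"]
    .agg(["count", "median", "min", "max"])
)

print(flipper_summary)
```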

  9. What are the maximum and minimum body masses recorded in the dataset?
# Find max and min body_mass
## max
penguins_cleaned[penguins_cleaned.body_mass_g == penguins_cleaned.body_mass_g.max()]
      id  species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g   sex  year
169  169   Gentoo  Biscoe            49.2           15.2              221.0       6300.0  male  2007
penguins_cleaned.body_mass_g.max()
np.float64(6300.0)
## min
penguins_cleaned[penguins_cleaned.body_mass_g == penguins_cleaned.body_mass_g.min()]
      id    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
314  314  Chinstrap   Dream            46.9           16.6              192.0       2700.0  female  2008
penguins_cleaned.body_mass_g.min()
np.float64(2700.0)
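The same rows can also be pulled out with nlargest and nsmallest. Passing keep="all" returns every row that ties for the extreme value, which matters when more than one animal shares it. A minimal sketch on a made-up frame with a deliberate tie:

```python
import pandas as pd

# Toy frame with a deliberate tie at the top; not the real records.
toy = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Chinstrap", "Adelie"],
    "body_mass_g": [6300.0, 6300.0, 2700.0, 3750.0],
})

# keep="all" returns all rows tied for first place, not just one of them.
heaviest = toy.nlargest(1, "body_mass_g", keep="all")
lightest = toy.nsmallest(1, "body_mass_g", keep="all")

# Both extremes in one call, when only the values are needed.
extremes = toy["body_mass_g"].agg(["min", "max"])

print(heaviest)
print(lightest)
print(extremes)
```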
  10. Are there any outliers in the bill length measurements for each species?
sns.boxplot(
    x="bill_length_mm",
    y="species",
    data=penguins_cleaned,
    hue="species"
)
plt.xlabel("Bill Length ($mm$)", weight="bold", size=9, loc="right")
plt.ylabel("Species", weight="bold", size=9, loc="top")
plt.title("Distribution of Bill Length According to Species", size=14, weight="bold")
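The points a boxplot draws beyond its whiskers can also be listed as rows. A minimal sketch that applies the 1.5 × IQR rule (the default whisker rule for boxplots) within each species. The bill lengths are made up, with one value set deliberately high:

```python
import pandas as pd

# Made-up bill lengths; 52.0 is deliberately far from the other Adelie values.
toy = pd.DataFrame({
    "species": ["Adelie"] * 6 + ["Gentoo"] * 6,
    "bill_length_mm": [38.0, 39.0, 39.5, 40.0, 41.0, 52.0,
                       45.0, 46.0, 47.0, 47.5, 48.0, 49.0],
})

# Quartiles per species, broadcast back to one value per row.
grouped = toy.groupby("species")["bill_length_mm"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A value is flagged if it lies more than 1.5 * IQR outside its species' quartiles.
is_outlier = (toy["bill_length_mm"] < q1 - 1.5 * iqr) | (
    toy["bill_length_mm"] > q3 + 1.5 * iqr
)
outliers = toy[is_outlier]

print(outliers)
```

Swapping toy for penguins_cleaned would list the real flagged rows, if there are any.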

Conclusion

That's a wrap, and I think it's a good start to getting familiar with Python. In this project, I practised data aggregation, handling missing data, and visualization.