Palmer Penguins EDA with Python

Introduction

Python is often described as an easy, beginner-friendly language, and in some ways that is true. Python is not my first language; R is, and I have been using it for a while now, but it is about time I added another language to my arsenal. It has been 3 months since I picked up Python, and I think a project is due. So far I have learned mostly by reading textbooks. I have not finished them, but I have covered roughly 50% to 80% of each. The books are:

Python for Data Analysis 3E by Wes McKinney, the creator of the pandas package, and
Python Data Science Handbook 2E by [Jake VanderPlas](http://vanderplas.com/).

The Project

My first project is an exploratory data analysis (EDA) of the Palmer Penguins data. The goal of this blog post is to document how much I have learned over the last 3 months. Without thinking much about it, I prompted ChatGPT to generate some EDA questions on the Palmer Penguins data.

Questions Generated

The generated questions are listed below:

How many missing values are there in the dataset, and in which columns?
What is the average body mass of penguins for each species?
Are there any differences in bill length among the three islands in the dataset?
What is the distribution of flipper length for each penguin species?
Is there any relationship between bill length and bill depth?
How does the average body mass compare between male and female penguins, overall and within each species?
What is the proportion of penguin species found on each island?
How does flipper length vary across the different species and islands?
What are the maximum and minimum body masses recorded in the dataset?
Are there any outliers in the bill length measurements for each species?

Prep

First, I imported the packages used in this project. If any of them is not installed, install it first with !pip install <package_name>.
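A quick way to see which packages still need installing is to ask Python whether it can find them. This is a small sketch using the standard library's importlib; the package list simply mirrors the imports used in this post.

```python
import importlib.util

# Packages used in this post
required = ["pandas", "numpy", "matplotlib", "seaborn"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All packages are available.")
```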

# Import library -----------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Next, I set the themes for graphs.

## Set style and themes for plots ---------------------
sns.set_style("whitegrid")
plt.style.use("tableau-colorblind10")

Solution

To begin, we import the data and take a quick look at the first few rows.

# Import data --------------------------------
penguins = pd.read_csv("penguins.csv")
penguins.head()

	id	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007

Now that the data is imported, I will proceed with answering the questions. Some questions are answered with graphs, while others have tables as their output.

How many missing values are there in the dataset, and in which columns?

# Count missing values in each column
penguins.isna().sum()

id                    0
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64
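Raw counts are easier to judge next to the share of each column they represent. The sketch below rebuilds the five preview rows shown earlier (only a few columns, typed in by hand) and puts the count and the percentage side by side; the names `preview` and `missing_summary` are my own, chosen for illustration.

```python
import numpy as np
import pandas as pd

# First five rows of the preview table shown earlier (a few columns only)
preview = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["male", "female", "female", np.nan, "female"],
})

# isna().mean() is the fraction of missing values in each column
missing_summary = pd.DataFrame({
    "n_missing": preview.isna().sum(),
    "pct_missing": (preview.isna().mean() * 100).round(1),
})
print(missing_summary)
```

On the full data the same two lines would run on `penguins` instead of `preview`.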

## Rows with one or more missing data

print(f"Number of rows with at least one missing value: {penguins.isna().any(axis=1).sum()}")

Number of rows with at least one missing value: 11

The observations with missing data are shown below:

penguins[penguins.isna().any(axis=1)]

	id	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
3	3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
8	8	Adelie	Torgersen	34.1	18.1	193.0	3475.0	NaN	2007
9	9	Adelie	Torgersen	42.0	20.2	190.0	4250.0	NaN	2007
10	10	Adelie	Torgersen	37.8	17.1	186.0	3300.0	NaN	2007
11	11	Adelie	Torgersen	37.8	17.3	180.0	3700.0	NaN	2007
47	47	Adelie	Dream	37.5	18.9	179.0	2975.0	NaN	2007
178	178	Gentoo	Biscoe	44.5	14.3	216.0	4100.0	NaN	2007
218	218	Gentoo	Biscoe	46.2	14.4	214.0	4650.0	NaN	2008
256	256	Gentoo	Biscoe	47.3	13.8	216.0	4725.0	NaN	2009
268	268	Gentoo	Biscoe	44.5	15.7	217.0	4875.0	NaN	2009
271	271	Gentoo	Biscoe	NaN	NaN	NaN	NaN	NaN	2009

Before proceeding to the next question, I will remove the observations with missing data. I think we can do without 11 observations.

penguins_cleaned = penguins.dropna()

penguins_cleaned.isna().sum() # rows with NAs removed

id                   0
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
year                 0
dtype: int64
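Nine of the 11 rows listed earlier are missing only sex, so another option is to drop a row only when a measurement is missing. The sketch below compares the two approaches on four rows copied from the tables above (rows 0, 3, 8 and 9, a few columns only). The names `sample`, `strict` and `lenient` are illustrative, and the rest of this post keeps the simpler `penguins.dropna()`.

```python
import numpy as np
import pandas as pd

# Rows 0, 3, 8 and 9 from the tables above (a few columns only)
sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 4,
        "bill_length_mm": [39.1, np.nan, 34.1, 42.0],
        "body_mass_g": [3750.0, np.nan, 3475.0, 4250.0],
        "sex": ["male", np.nan, np.nan, np.nan],
    },
    index=[0, 3, 8, 9],
)

# Drop a row if anything is missing (what penguins.dropna() does)
strict = sample.dropna()

# Drop a row only if a measurement is missing; rows lacking just `sex` stay
measurements = ["bill_length_mm", "body_mass_g"]
lenient = sample.dropna(subset=measurements)

print(len(strict), len(lenient))
```

The lenient version keeps more rows for questions that do not involve sex, at the cost of having to handle the missing sex values later.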

What is the average body mass of penguins for each species?

## Average Body Mass for Each Species
penguins_cleaned.groupby("species").agg(
    mean_weight = pd.NamedAgg(column="body_mass_g", aggfunc="mean")
)

	mean_weight
species
Adelie	3706.164384
Chinstrap	3733.088235
Gentoo	5092.436975
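A mean on its own can hide skew or a small group size, so it can help to ask for several statistics in one `agg` call. In the sketch below, a `(column, aggfunc)` tuple is used as shorthand for `pd.NamedAgg`. The data are a handful of rows copied from the tables in this post, so the numbers are not the full-dataset values.

```python
import pandas as pd

# A few body masses copied from rows shown in this post (not the full data)
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0,
                    6300.0, 4100.0, 4650.0, 2700.0],
})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = sample.groupby("species").agg(
    mean_weight=("body_mass_g", "mean"),
    median_weight=("body_mass_g", "median"),
    n=("body_mass_g", "size"),
).round(2)
print(summary)
```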

Are there any differences in bill length among the three islands in the dataset?

penguins_cleaned.groupby("island").agg(
  mean_bill_length = pd.NamedAgg(column="bill_length_mm", aggfunc="mean")
).round(2)

	mean_bill_length
island
Biscoe	45.25
Dream	44.22
Torgersen	39.04
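Island averages mix different species, so a gap between islands may partly reflect which species live there. One way to check is to split the mean by island and species at once. The sketch below does that with `pivot_table` on a few rows copied from the tables in this post; it shows the shape of the result, not the full-dataset numbers.

```python
import pandas as pd

# A few rows copied from the tables in this post (not the full data)
sample = pd.DataFrame({
    "island": ["Torgersen", "Torgersen", "Torgersen", "Torgersen",
               "Dream", "Dream", "Biscoe", "Biscoe", "Biscoe"],
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 37.5, 46.9, 49.2, 44.5, 46.2],
})

# Rows are islands, columns are species; missing combinations show as NaN
by_island_species = sample.pivot_table(
    index="island",
    columns="species",
    values="bill_length_mm",
    aggfunc="mean",
).round(2)
print(by_island_species)
```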

What is the distribution of flipper length for each penguin species?

adelie_species = penguins_cleaned[penguins_cleaned.species == "Adelie"]
gentoo_species = penguins_cleaned[penguins_cleaned.species == "Gentoo"]
chinstrap_species = penguins_cleaned[penguins_cleaned.species == "Chinstrap"]

plt.figure()
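# alpha makes the bars semi-transparent so overlapping histograms stay visible
plt.hist(adelie_species.flipper_length_mm, label="Adelie", alpha=0.6)
plt.hist(gentoo_species.flipper_length_mm, label="Gentoo", alpha=0.6)
plt.hist(chinstrap_species.flipper_length_mm, label="Chinstrap", alpha=0.6)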
plt.title(
    "Distribution of Flipper Length for the Three Penguin Species", 
    size=13, loc="left", fontweight="bold"
)
plt.xlabel("Flipper Length ($mm$)", size=9, fontweight="bold", loc="right")
plt.ylabel("Count", size=9, weight="bold", loc="top")
plt.legend()
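When several histograms share one axis, each `plt.hist` call picks its own bins by default, which makes the bars hard to compare. A possible fix is to compute one set of bin edges from all the values and reuse it. The sketch below shows the idea with NumPy only, using a few flipper lengths copied from rows shown in this post; the choice of six bins is arbitrary.

```python
import numpy as np

# A few flipper lengths (mm) copied from rows shown in this post
flippers = {
    "Adelie": np.array([181.0, 186.0, 195.0, 193.0, 190.0, 179.0]),
    "Gentoo": np.array([221.0, 216.0, 214.0, 217.0]),
    "Chinstrap": np.array([192.0]),
}

# One common set of edges spanning every species
all_values = np.concatenate(list(flippers.values()))
edges = np.histogram_bin_edges(all_values, bins=6)

# Count each species against the same edges
counts = {
    species: np.histogram(values, bins=edges)[0]
    for species, values in flippers.items()
}

for species, species_counts in counts.items():
    print(species, species_counts)
```

The same `edges` array can be passed as `bins=edges` to each `plt.hist` call so that all three species are drawn on identical bins.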

Is there any relationship between bill length and bill depth?

plt.figure()
sns.scatterplot(
    x="bill_length_mm",
    y="bill_depth_mm",
    data=penguins_cleaned,
    hue="species",  
)
plt.title("Relationship between Bill length($mm$) and Bill depth($mm$)", size=13, weight="bold")
plt.xlabel("Bill length ($mm$)", size=9)
plt.ylabel("Bill depth ($mm$)", size=9)

Alternatively I can fit a linear regression line for each species to investigate the relationship between bill length and depth.

rel_plt = sns.lmplot(
    x="bill_length_mm",
    y="bill_depth_mm",
    data=penguins_cleaned,
    hue="species",
    markers=["+", "o", "p"]
)
plt.tight_layout()
sns.move_legend(
    rel_plt,
    "upper right",
    frameon=True
)
plt.title("Relationship between Bill length($mm$) and Bill depth($mm$)", size=13, weight="bold")
plt.xlabel("Bill length ($mm$)", size=9)
plt.ylabel("Bill depth ($mm$)", size=9)
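Fitting one line per species matters because a pooled trend and the within-group trends can point in opposite directions. The sketch below puts numbers on that idea with hypothetical toy values (two made-up groups, not the real penguin measurements), computing the pooled correlation and then a slope and correlation per group. The helper `slope_and_r` is my own name, not a pandas or seaborn function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT the real penguin measurements
toy = pd.DataFrame({
    "species": ["A"] * 4 + ["B"] * 4,
    "bill_length_mm": [36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 48.0, 50.0],
    "bill_depth_mm": [17.0, 18.0, 19.0, 20.0, 13.0, 14.0, 15.0, 16.0],
})

# Pearson correlation ignoring species
pooled_r = toy["bill_length_mm"].corr(toy["bill_depth_mm"])


def slope_and_r(group):
    """Least-squares slope and Pearson r for one group."""
    slope, _intercept = np.polyfit(
        group["bill_length_mm"], group["bill_depth_mm"], deg=1
    )
    r = group["bill_length_mm"].corr(group["bill_depth_mm"])
    return pd.Series({"slope": slope, "r": r})


cols = ["bill_length_mm", "bill_depth_mm"]
per_species = toy.groupby("species")[cols].apply(slope_and_r)

print(f"pooled r: {pooled_r:.2f}")
print(per_species.round(2))
```

In this toy example the pooled correlation is negative while both groups trend upward. Running the same code on `penguins_cleaned` would show whether the real data behaves similarly.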

How does the average body mass compare between male and female penguins, overall and within each species?

penguins_cleaned.groupby("sex").agg(
    mean_weight = pd.NamedAgg(column="body_mass_g", aggfunc="mean")
).round(2)

	mean_weight
sex
female	3862.27
male	4545.68

Comparison of penguin weight by sex across species

plt_dt = penguins_cleaned.groupby(["sex", "species"])["body_mass_g"].agg("mean")

plt_dt = plt_dt.reset_index()
plt_dt

	sex	species	body_mass_g
0	female	Adelie	3368.835616
1	female	Chinstrap	3527.205882
2	female	Gentoo	4679.741379
3	male	Adelie	4043.493151
4	male	Chinstrap	3938.970588
5	male	Gentoo	5484.836066
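The long table is easier to compare once the sexes sit side by side. The sketch below rebuilds the table above by hand, pivots it to one row per species, and adds the male minus female difference. The column name `male_minus_female` is my own.

```python
import pandas as pd

# The six group means from the table above, typed in by hand
plt_dt = pd.DataFrame({
    "sex": ["female"] * 3 + ["male"] * 3,
    "species": ["Adelie", "Chinstrap", "Gentoo"] * 2,
    "body_mass_g": [3368.835616, 3527.205882, 4679.741379,
                    4043.493151, 3938.970588, 5484.836066],
})

# One row per species, one column per sex
wide = plt_dt.pivot(index="species", columns="sex", values="body_mass_g")

# Gap between the sexes within each species
wide["male_minus_female"] = wide["male"] - wide["female"]
print(wide.round(1))
```

On these means, males are heavier in all three species, with the largest gap for Gentoo (about 805 g) and the smallest for Chinstrap (about 412 g).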

What is the proportion of penguin species found on each island?

# Proportion of penguin species found on each island?

penguins_cleaned.value_counts(
    subset=["island","species"],
     normalize=True, sort=False
     ).round(2).reset_index()

	island	species	proportion
0	Biscoe	Adelie	0.13
1	Biscoe	Gentoo	0.36
2	Dream	Adelie	0.17
3	Dream	Chinstrap	0.20
4	Torgersen	Adelie	0.14
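The table above gives each island and species pair as a share of all penguins, so the five values sum to 1. If the question is read as the species mix within each island, the proportions need to sum to 1 per island instead. A minimal sketch of that reading uses `pd.crosstab` with `normalize="index"`; the labels below are hypothetical toy rows, not the real counts.

```python
import pandas as pd

# Hypothetical toy labels, NOT the real counts
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Biscoe",
               "Dream", "Dream", "Torgersen", "Torgersen"],
    "species": ["Adelie", "Gentoo", "Gentoo", "Gentoo",
                "Adelie", "Chinstrap", "Adelie", "Adelie"],
})

# normalize="index" divides each row by its row total,
# so every island's proportions sum to 1
within_island = pd.crosstab(
    toy["island"], toy["species"], normalize="index"
).round(2)
print(within_island)
```

With the real data, the two arguments would be `penguins_cleaned["island"]` and `penguins_cleaned["species"]`.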

How does flipper length vary across the different species and islands?

# How does flipper length vary across the different species and islands?

plt.figure()
sns.boxplot(
    x="island",
    y="flipper_length_mm",
    hue="species",
    data=penguins_cleaned
)
plt.xlabel("Island", loc="right", size=9, weight=900, style="italic")
plt.ylabel("Flipper Length ($mm$)",size=9, loc="top", style="italic", weight=900)
plt.title(
    "Distribution of Flipper Length ($mm$), Penguins Species Across Different Islands",
    weight="heavy", size=14
)

What are the maximum and minimum body masses recorded in the dataset?

# Find max and min body_mass
## max
penguins_cleaned[penguins_cleaned.body_mass_g == penguins_cleaned.body_mass_g.max()]

	id	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
169	169	Gentoo	Biscoe	49.2	15.2	221.0	6300.0	male	2007

penguins_cleaned.body_mass_g.max()

np.float64(6300.0)

## min
penguins_cleaned[penguins_cleaned.body_mass_g == penguins_cleaned.body_mass_g.min()]

	id	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
314	314	Chinstrap	Dream	46.9	16.6	192.0	2700.0	female	2008

penguins_cleaned.body_mass_g.min()

np.float64(2700.0)
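A more compact way to fetch both extreme rows is to look up their index labels with `idxmax` and `idxmin`. The sketch below uses four rows copied from the tables in this post, including the two extremes. Note that `idxmax` and `idxmin` return only the first match when there are ties, whereas the boolean filter above returns every tied row.

```python
import pandas as pd

# Four rows copied from the tables in this post, including both extremes
sample = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
        "island": ["Torgersen", "Torgersen", "Biscoe", "Dream"],
        "body_mass_g": [3750.0, 3800.0, 6300.0, 2700.0],
    },
    index=[0, 1, 169, 314],
)

# Index labels of the heaviest and lightest penguins
heaviest = sample["body_mass_g"].idxmax()
lightest = sample["body_mass_g"].idxmin()

# Pull both rows in one lookup
extremes = sample.loc[[heaviest, lightest]]
print(extremes)
```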

Are there any outliers in the bill length measurements for each species?

sns.boxplot(
    x="bill_length_mm",
    y="species",
    data=penguins_cleaned,
    hue="species"
)
plt.xlabel("Bill Length ($mm$)", weight="bold", size=9, loc="right")
plt.ylabel("Species", weight="bold", size=9, loc="top")
plt.title("Distribution of Bill Length According to Species", size=14, weight="bold")
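The points a boxplot draws beyond its whiskers follow, by default, the 1.5 × IQR rule, and the same rule can be applied in code to list the flagged rows. The sketch below is one way to do it per group; the function `iqr_outliers` is my own helper and the values are hypothetical toy numbers, not the real bill lengths.

```python
import pandas as pd


def iqr_outliers(df, value_col, group_col, k=1.5):
    """Return rows lying more than k * IQR outside their group's quartiles."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    too_low = df[value_col] < q1 - k * iqr
    too_high = df[value_col] > q3 + k * iqr
    return df[too_low | too_high]


# Hypothetical toy values, NOT the real bill lengths
toy = pd.DataFrame({
    "species": ["A"] * 6 + ["B"] * 6,
    "bill_length_mm": [38.0, 39.0, 40.0, 41.0, 42.0, 58.0,
                       45.0, 46.0, 47.0, 48.0, 49.0, 50.0],
})

flagged = iqr_outliers(toy, "bill_length_mm", "species")
print(flagged)
```

Calling `iqr_outliers(penguins_cleaned, "bill_length_mm", "species")` would list the rows behind any points drawn outside the whiskers.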

Conclusion

That's a wrap, and I think it's a good start to getting familiar with Python. This project covered data aggregation, handling missing data, and visualization.