# Import library -----------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Introduction
Python is often described as an easy, beginner-friendly language, and in some ways that is true. Python is not my first language; R is, and I have been using it for a while now, but it is about time I added another language to my arsenal. It’s been 3 months since I picked up Python as my next language, and I think a project is about due. My method of learning has mostly been reading textbooks. I did not cover everything, but I got through roughly 50% to 80% of the books. The books I read are:
Python for Data Analysis 3e by the creator of the pandas package, Wes McKinney, and
Python Data Science Handbook 2e by [Jake VanderPlas](http://vanderplas.com/).
The Project
My first project is an exploratory data analysis (EDA) of the Palmer penguins data. The goal of this blog post is to document how much I have grown over the last 3 months. Without thinking much about it, I prompted ChatGPT to generate some EDA questions on the Palmer penguins data.
Questions Generated
The generated questions are listed below:
- How many missing values are there in the dataset, and in which columns?
- What is the average body mass of penguins for each species?
- Are there any differences in bill length among the three islands in the dataset?
- What is the distribution of flipper length for each penguin species?
- Is there any relationship between bill length and bill depth?
- How does the average body mass compare between male and female penguins and its weight for each sex across the species?
- What is the proportion of penguin species found on each island?
- How does flipper length vary across the different species and islands?
- What are the maximum and minimum body masses recorded in the dataset?
- Are there any outliers in the bill length measurements for each species?
Prep
First, I imported the packages needed for this project. If any of them are not installed yet, install them first with !pip install <package_name>.
Next, I set the themes for graphs.
## Set style and themes for plots ---------------------
sns.set_style("whitegrid")
plt.style.use("tableau-colorblind10")
Solution
To begin, we import the data and get a quick preview.
# Import data --------------------------------
penguins = pd.read_csv("penguins.csv")
penguins.head()
| | id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
Now that the data is imported, I will proceed with answering the questions. Some questions are answered with graphs, while others produce tables as output.
- How many missing values are there in the dataset, and in which columns?
# Find and Remove missing data
penguins.isna().sum()
id 0
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
year 0
dtype: int64
## Rows with one or more missing data
print(f"The total number of observations with missing data is: {len(penguins.loc[penguins.isna().any(axis=1)])}")
The total number of observations with missing data is: 11
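The two counts above measure different things: `isna().sum()` counts missing cells per column, while `isna().any(axis=1)` flags rows that have at least one missing cell. Because one row can be missing several columns at once, the per-column counts do not simply add up to the row count. Here is a minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from the penguins file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, np.nan, 36.7],
        "sex": ["male", np.nan, np.nan],
    }
)

# Missing cells in each column
missing_per_column = toy.isna().sum()

# Boolean mask: True where a row has at least one missing cell
rows_with_missing = toy.isna().any(axis=1)
n_rows_with_missing = int(rows_with_missing.sum())

print(missing_per_column)
print(n_rows_with_missing)
```

In the toy frame there are three missing cells in total but only two affected rows, because the second row is missing both columns.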
The observations with missing data are shown below:
penguins[penguins.isna().any(axis=1)]
| | id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
3 | 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
8 | 8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN | 2007 |
9 | 9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN | 2007 |
10 | 10 | Adelie | Torgersen | 37.8 | 17.1 | 186.0 | 3300.0 | NaN | 2007 |
11 | 11 | Adelie | Torgersen | 37.8 | 17.3 | 180.0 | 3700.0 | NaN | 2007 |
47 | 47 | Adelie | Dream | 37.5 | 18.9 | 179.0 | 2975.0 | NaN | 2007 |
178 | 178 | Gentoo | Biscoe | 44.5 | 14.3 | 216.0 | 4100.0 | NaN | 2007 |
218 | 218 | Gentoo | Biscoe | 46.2 | 14.4 | 214.0 | 4650.0 | NaN | 2008 |
256 | 256 | Gentoo | Biscoe | 47.3 | 13.8 | 216.0 | 4725.0 | NaN | 2009 |
268 | 268 | Gentoo | Biscoe | 44.5 | 15.7 | 217.0 | 4875.0 | NaN | 2009 |
271 | 271 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | 2009 |
Before proceeding to the next question, I will remove the observations with missing data. I think we can do without 11 observations.
penguins_cleaned = penguins.dropna()

penguins_cleaned.isna().sum() # rows with NAs removed
id 0
species 0
island 0
bill_length_mm 0
bill_depth_mm 0
flipper_length_mm 0
body_mass_g 0
sex 0
year 0
dtype: int64
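Dropping every row with any missing value is the simplest option. As an alternative sketch, `dropna` also accepts a `subset` argument, which keeps rows that are complete in just the columns a given question needs. The frame below is hypothetical and only illustrates the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame(
    {
        "body_mass_g": [3750.0, np.nan, 3250.0, 3450.0],
        "sex": ["male", "female", np.nan, "female"],
    }
)

# Drop rows with a missing value in any column
dropped_any = toy.dropna()

# Drop rows only when body_mass_g is missing; a missing sex is tolerated
dropped_mass = toy.dropna(subset=["body_mass_g"])

print(len(dropped_any), len(dropped_mass))
```

For this post I stay with the plain `dropna()` so that every question uses the same set of rows.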
- What is the average body mass of penguins for each species?
## Average Body Mass for Each Species
penguins_cleaned.groupby("species").agg(
    mean_weight = pd.NamedAgg(column="body_mass_g", aggfunc="mean")
)
species | mean_weight |
---|---|
Adelie | 3706.164384 |
Chinstrap | 3733.088235 |
Gentoo | 5092.436975 |
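Named aggregation can also be written with plain `(column, aggfunc)` tuples, which makes it easy to ask for several statistics in one call. A small sketch on hypothetical data (the species labels and weights are made up):

```python
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame(
    {
        "species": ["A", "A", "B", "B"],
        "body_mass_g": [3000.0, 4000.0, 5000.0, 5500.0],
    }
)

# Each keyword becomes an output column: name=(input column, aggregation)
summary = toy.groupby("species").agg(
    mean_weight=("body_mass_g", "mean"),
    n=("body_mass_g", "count"),
)

print(summary)
```

Reporting the group size next to the mean helps when judging how much weight to give each average.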
- Are there any differences in bill length among the three islands in the dataset?
penguins_cleaned.groupby("island").agg(
    mean_bill_length = pd.NamedAgg(column="bill_length_mm", aggfunc="mean")
).round(2)
island | mean_bill_length |
---|---|
Biscoe | 45.25 |
Dream | 44.22 |
Torgersen | 39.04 |
- What is the distribution of flipper length for each penguin species?
adelie_species = penguins_cleaned[penguins_cleaned.species == "Adelie"]
gentoo_species = penguins_cleaned[penguins_cleaned.species == "Gentoo"]
chinstrap_species = penguins_cleaned[penguins_cleaned.species == "Chinstrap"]

plt.figure()
plt.hist(adelie_species.flipper_length_mm, label="Adelie")
plt.hist(gentoo_species.flipper_length_mm, label="Gentoo")
plt.hist(chinstrap_species.flipper_length_mm, label="Chinstrap")

plt.title(
    "Distribution of Flipper Length for the Three Penguin Species",
    size=13, loc="left", fontweight="bold"
)
plt.xlabel("Flipper Length ($mm$)", size=9, fontweight="bold", loc="right")
plt.ylabel("Count", size=9, weight="bold", loc="top")
plt.legend()
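One thing to keep in mind with separate `plt.hist` calls is that each call picks its own bin edges, so bar widths can differ between groups. A possible refinement, sketched here with NumPy only and randomly generated stand-in values (not the penguins measurements), is to compute one shared set of edges first:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flipper lengths for two groups (random stand-in values)
group_a = rng.normal(190, 6, size=50)
group_b = rng.normal(215, 6, size=50)

# One set of bin edges spanning both groups
shared_edges = np.histogram_bin_edges(
    np.concatenate([group_a, group_b]), bins=15
)

# Counts per bin for each group, using the same edges
counts_a, _ = np.histogram(group_a, bins=shared_edges)
counts_b, _ = np.histogram(group_b, bins=shared_edges)

print(shared_edges.round(1))
print(counts_a, counts_b)
```

The same `shared_edges` array can be passed as `bins=` to each `plt.hist` call, optionally with `alpha=0.5` so overlapping bars stay visible.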
- Is there any relationship between bill length and bill depth?
plt.figure()
sns.scatterplot(
    x="bill_length_mm",
    y="bill_depth_mm",
    data=penguins_cleaned,
    hue="species",
)
plt.title("Relationship between Bill length($mm$) and Bill depth($mm$)", size=13, weight="bold")
plt.xlabel("Bill length ($mm$)", size=9)
plt.ylabel("Bill depth ($mm$)", size=9)
Text(0, 0.5, 'Bill depth ($mm$)')
Alternatively, I can fit a linear regression line for each species to investigate the relationship between bill length and depth.
rel_plt = sns.lmplot(
    x="bill_length_mm",
    y="bill_depth_mm",
    data=penguins_cleaned,
    hue="species",
    markers=["+", "o", "p"]
)

plt.tight_layout()
sns.move_legend(
    rel_plt,
    "upper right",
    frameon=True
)
plt.title("Relationship between Bill length($mm$) and Bill depth($mm$)", size=13, weight="bold")
plt.xlabel("Bill length ($mm$)", size=9)
plt.ylabel("Bill depth ($mm$)", size=9)
Text(41.303875868055556, 0.5, 'Bill depth ($mm$)')
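Fitting one line per species matters because a relationship measured on pooled data can differ in sign from the relationship inside each group. The sketch below uses deliberately constructed, hypothetical numbers (not penguin measurements) to show how a pooled correlation and per-group correlations can disagree:

```python
import pandas as pd

# Hypothetical toy data built so that the groups sit at different levels
toy = pd.DataFrame(
    {
        "species": ["A"] * 3 + ["B"] * 3,
        "bill_length_mm": [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],
        "bill_depth_mm": [10.0, 11.0, 12.0, 1.0, 2.0, 3.0],
    }
)

# Pearson correlation ignoring species
pooled_r = toy["bill_length_mm"].corr(toy["bill_depth_mm"])

# Pearson correlation computed separately within each species
per_species_r = pd.Series(
    {
        name: group["bill_length_mm"].corr(group["bill_depth_mm"])
        for name, group in toy.groupby("species")
    }
)

print(round(pooled_r, 2))
print(per_species_r.round(2))
```

Running the same two calculations on `penguins_cleaned` would put numbers on what the scatter plot and the per-species regression lines show visually.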
- How does the average body mass compare between male and female penguins and its weight for each sex across the species?
penguins_cleaned.groupby("sex").agg(
    mean_weight = pd.NamedAgg(column="body_mass_g", aggfunc="mean")
).round(2)
sex | mean_weight |
---|---|
female | 3862.27 |
male | 4545.68 |
Comparison of penguin weight by sex across the species:
plt_dt = penguins_cleaned.groupby(["sex", "species"])["body_mass_g"].agg("mean")

plt_dt = plt_dt.reset_index()
plt_dt
| | sex | species | body_mass_g |
---|---|---|---|
0 | female | Adelie | 3368.835616 |
1 | female | Chinstrap | 3527.205882 |
2 | female | Gentoo | 4679.741379 |
3 | male | Adelie | 4043.493151 |
4 | male | Chinstrap | 3938.970588 |
5 | male | Gentoo | 5484.836066 |
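A long table like this can be easier to compare after reshaping it so that each sex gets its own column. A minimal sketch with `DataFrame.pivot`, using hypothetical group means (the values and the `male_minus_female` column name are made up for illustration):

```python
import pandas as pd

# Hypothetical group means in long format, not the values from the post
long_means = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male"],
        "species": ["A", "B", "A", "B"],
        "body_mass_g": [3400.0, 4700.0, 4000.0, 5500.0],
    }
)

# One row per species, one column per sex
wide = long_means.pivot(index="species", columns="sex", values="body_mass_g")

# Difference between the sexes within each species
wide["male_minus_female"] = wide["male"] - wide["female"]

print(wide)
```

Applied to `plt_dt`, the same reshape would put the male and female means for each species side by side.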
- What is the proportion of penguin species found on each island?
# Proportion of penguin species found on each island?
penguins_cleaned.value_counts(
    subset=["island","species"],
    normalize=True, sort=False
).round(2).reset_index()
| | island | species | proportion |
---|---|---|---|
0 | Biscoe | Adelie | 0.13 |
1 | Biscoe | Gentoo | 0.36 |
2 | Dream | Adelie | 0.17 |
3 | Dream | Chinstrap | 0.20 |
4 | Torgersen | Adelie | 0.14 |
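The proportions above are shares of the whole dataset, so all five rows together sum to 1. If the question is read as "what share of each island's penguins belongs to each species", the counts need to be normalized within each island instead. One way to sketch that is `pd.crosstab` with `normalize="index"`, shown here on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame(
    {
        "island": ["X", "X", "X", "X", "Y", "Y"],
        "species": ["A", "A", "A", "B", "B", "B"],
    }
)

# Share of each species within an island: every row sums to 1
within_island = pd.crosstab(toy["island"], toy["species"], normalize="index")

# Share of each island/species pair in the whole table: all cells sum to 1
overall = pd.crosstab(toy["island"], toy["species"], normalize="all")

print(within_island)
print(overall)
```

Both readings are valid answers; they just use different denominators.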
- How does flipper length vary across the different species and islands?
# How does flipper length vary across the different species and islands?
plt.figure()
sns.boxplot(
    x="island",
    y="flipper_length_mm",
    hue="species",
    data=penguins_cleaned
)
plt.xlabel("Island", loc="right", size=9, weight=900, style="italic")
plt.ylabel("Flipper Length ($mm$)",size=9, loc="top", style="italic", weight=900)

plt.title(
    "Distribution of Flipper Length ($mm$), Penguins Species Across Different Islands",
    weight="heavy", size=14
)
Text(0.5, 1.0, 'Distribution of Flipper Length ($mm$), Penguins Species Across Different Islands')
- What are the maximum and minimum body masses recorded in the dataset?
# Find max and min body_mass
## max
penguins_cleaned[penguins_cleaned.body_mass_g == penguins_cleaned.body_mass_g.max()]
| | id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
169 | 169 | Gentoo | Biscoe | 49.2 | 15.2 | 221.0 | 6300.0 | male | 2007 |
penguins_cleaned.body_mass_g.max()
np.float64(6300.0)
## min
penguins_cleaned[penguins_cleaned.body_mass_g == penguins_cleaned.body_mass_g.min()]
| | id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
314 | 314 | Chinstrap | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | female | 2008 |
penguins_cleaned.body_mass_g.min()
np.float64(2700.0)
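Another way to pull out the extreme rows is `idxmax` and `idxmin`, which return the index label of the largest and smallest value. This is only a sketch on hypothetical data; note that, unlike the boolean filter used above, these methods return a single label even when several rows tie:

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {
        "species": ["A", "B", "C"],
        "body_mass_g": [3750.0, 6300.0, 2700.0],
    },
    index=[10, 20, 30],
)

# Look up the full row at the label of the maximum and minimum
heaviest = toy.loc[toy["body_mass_g"].idxmax()]
lightest = toy.loc[toy["body_mass_g"].idxmin()]

# Both extremes in one call
extremes = toy["body_mass_g"].agg(["min", "max"])

print(heaviest)
print(lightest)
print(extremes)
```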
- Are there any outliers in the bill length measurements for each species?
sns.boxplot(
    x="bill_length_mm",
    y="species",
    data=penguins_cleaned,
    hue="species"
)
plt.xlabel("Bill Length ($mm$)", weight="bold", size=9, loc="right")
plt.ylabel("Species", weight="bold", size=9, loc="top")
plt.title("Distribution of Bill Length According to Species", size=14, weight="bold")
Text(0.5, 1.0, 'Distribution of Bill Length According to Species')
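A boxplot marks a point as an outlier when it falls more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the default whisker setting. To list such points rather than read them off the plot, the same rule can be applied per species. The sketch below uses hypothetical values, and the column names `q1`, `q3`, and `is_outlier` are my own:

```python
import pandas as pd

# Hypothetical toy data: group A contains one unusually long bill
toy = pd.DataFrame(
    {
        "species": ["A"] * 6 + ["B"] * 6,
        "bill_length_mm": [38.0, 39.0, 40.0, 41.0, 42.0, 60.0,
                           45.0, 46.0, 47.0, 48.0, 49.0, 50.0],
    }
)

grouped = toy.groupby("species")["bill_length_mm"]

# Per-species quartiles, broadcast back to every row of the group
toy["q1"] = grouped.transform(lambda s: s.quantile(0.25))
toy["q3"] = grouped.transform(lambda s: s.quantile(0.75))
iqr = toy["q3"] - toy["q1"]

# 1.5 * IQR rule, evaluated within each species
toy["is_outlier"] = (
    (toy["bill_length_mm"] < toy["q1"] - 1.5 * iqr)
    | (toy["bill_length_mm"] > toy["q3"] + 1.5 * iqr)
)

outliers = toy[toy["is_outlier"]]
print(outliers[["species", "bill_length_mm"]])
```

Swapping `toy` for `penguins_cleaned` would give a table of the flagged rows for each species.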
Conclusion
That's a wrap, and I think it's a good start to getting familiar with Python. In this project, I covered data aggregation, handling missing data, and visualization.