In this tutorial we will plot different types of graphs using the ggplot2 package in R. ggplot2 is an excellent package that can be used to create beautiful data visualizations. It was created and developed by Dr. Hadley Wickham. At the heart of every ggplot graph are a dataset that we want to plot, a coordinate system that acts as the environment for the plot, and ‘geoms’, the geometric objects (bars, points, lines and so on) that draw the data and whose aesthetics, such as colour, fill and shape, can be tweaked.
This tutorial is aimed primarily at students who are beginners in R programming and want to quickly plot their data without much hassle. So without further ado, let us get started!
First we need to install the ggplot2 package, as it does not come with the standard distribution of R. To install packages we use the command install.packages() and to load packages we use the command library(). Therefore, to install and load the ggplot2 package we use the following commands.
install.packages("ggplot2")
library(ggplot2)
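Running install.packages() every time the script is executed downloads the package again. A minimal sketch of one way to avoid that, assuming a CRAN mirror is already configured for your R session, is to install only when the package is missing. The variable name pkgs is a hypothetical helper used just for this illustration.

```r
# Packages needed for this tutorial (extend the vector as required)
pkgs <- c("ggplot2")

for (pkg in pkgs) {
  # requireNamespace() returns FALSE if the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
}
```

The same loop works for the other packages used later if their names are added to pkgs.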
All right, we have the ggplot2 package loaded; now we just need some data to plot. Most R programming tutorials use the iris dataset for their examples. This tutorial won’t be like most tutorials. So let me introduce you to the lovely penguins from Palmer Station in Antarctica!
For this tutorial we will install the palmerpenguins package, which showcases body measurements taken from three different species of penguins from Antarctica. This package was made possible by the efforts of Dr. Allison Horst. The penguin data was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.
Let us install the palmerpenguins package and load it in R.
install.packages("palmerpenguins")
library(palmerpenguins)
There are two datasets in this package. We will use the penguins dataset, which is a simplified version of the raw data present in the package.
We will use head() to display the first few rows of the penguins dataset to see what it looks like.
library(palmerpenguins)
head(penguins)
We can see that there are 8 columns in the dataset, each representing a different variable. Now let us try plotting some graphs with this data.
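Besides head(), a few base R functions give a quick overview of the dataset before plotting. The sketch below is only an illustration; the name na_counts is a hypothetical helper, not something defined by the package.

```r
library(palmerpenguins)

# Number of rows and columns
dim(penguins)

# Column names
names(penguins)

# Number of missing (NA) values in each column
na_counts <- colSums(is.na(penguins))
na_counts
```

Knowing which columns contain missing values helps explain the warnings that some plots produce.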
We will plot a simple bar graph first. Bar graphs are used to represent categorical data, where the height of each rectangular bar represents the value for that category. The bar graph we plot will show the number of individuals recorded for each of the three penguin species.
We will use the geom_bar() command to plot the bar graph. Let us also use the command theme_bw() for a clean black-and-white theme.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = species, fill = species)) +
xlab("Species") + ylab("Frequency") +
ggtitle("Frequency of individuals for each species") +
geom_bar() + theme_bw()
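geom_bar() counts the rows in each category for us. As a rough sketch of what happens behind the scenes, the counts can be computed first with base R's table() and then drawn with geom_col(), which plots the heights given in the data. The name species_counts is a hypothetical helper introduced only for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Count individuals per species; the result has columns "species" and "Freq"
species_counts <- as.data.frame(table(species = penguins$species))
species_counts

# geom_col() uses the y values as bar heights instead of counting rows
p_counts <- ggplot(data = species_counts, aes(x = species, y = Freq, fill = species)) +
  xlab("Species") + ylab("Frequency") +
  geom_col() + theme_bw()
p_counts
```

This should give the same picture as the geom_bar() version, which is usually the shorter choice when the data is not yet summarised.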
Histograms look similar to bar graphs, but they are used to represent continuous data. The data range is divided into bins of equal width, and the height of each bar shows how many observations fall into that bin.
We will use geom_histogram() to plot a histogram.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
xlab("Body Mass (g)") + ylab("Frequency") +
ggtitle("Frequency of individuals for respective body mass") +
geom_histogram(bins = 25) + theme_bw()
## Warning: Removed 2 rows containing non-finite values (stat_bin).
The warning message indicates that two rows in the dataset have NA (missing) values for body mass, so they were left out of the plot. This is common with real-world data, as sometimes measurements cannot be collected for various reasons. So this is perfectly fine.
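If the warning is distracting, one option is to drop the missing values silently. The sketch below uses the na.rm argument of geom_histogram() and picks a binwidth of 200 g purely for illustration; neither appears in the example above, and the names n_missing and p_hist are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Count the missing body mass values that trigger the warning
n_missing <- sum(is.na(penguins$body_mass_g))
n_missing

# na.rm = TRUE removes missing values without a warning;
# binwidth = 200 sets the bin size in grams instead of the number of bins
p_hist <- ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  xlab("Body Mass (g)") + ylab("Frequency") +
  geom_histogram(binwidth = 200, na.rm = TRUE) + theme_bw()
p_hist
```

Setting binwidth rather than bins can be easier to interpret, since each bar then covers a known range of the measured variable.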
A line graph joins the data points with line segments, in order of their x values, to show how one variable changes with another.
We will use geom_line() for plotting a line graph.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = bill_length_mm,
y = bill_depth_mm, colour = species)) +
xlab("Bill length (mm)") + ylab("Bill depth (mm)") +
ggtitle("Bill length vs Bill depth") + geom_line() + theme_bw()
A scatter plot draws each observation in the dataset as a point positioned by its x and y values.
We will use geom_point() to plot a scatter plot.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm,
shape = species, colour = species)) +
xlab("Body mass (g)") + ylab("Flipper length (mm)") +
ggtitle("Body mass vs Flipper length") + geom_point(size = 2) + theme_bw()
Density plots are similar to histograms but show the overall distribution of the data as a smooth curve. We will get a bell-shaped curve if our data follows a normal distribution.
We will use geom_density() to plot a density plot.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
xlab("Body Mass (g)") + ylab("Density") + ggtitle("Body mass distribution") +
geom_density() + theme_bw()
Since we plotted all three species together, the graph looks cluttered. Let us try plotting the same graph for only Gentoo penguins. We will use the dplyr package to filter() the data for Gentoo penguins alone. The dplyr package does not come built in with R, so install it once with install.packages("dplyr") and then load it using the command library(). The dplyr package was also developed by Dr. Hadley Wickham.
library(ggplot2)
library(palmerpenguins)
library(dplyr)
penguins_gentoo <- penguins %>% filter(species == "Gentoo")
ggplot(data = penguins_gentoo, aes(x = body_mass_g)) +
xlab("Body Mass of Gentoo penguins (g)") + ylab("Density") +
ggtitle("Body mass distribution of Gentoo penguins") +
geom_density(fill = "red") + theme_bw()
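Filtering is one way to make the density plot easier to read. Another option, sketched here as an assumption rather than as part of the original example, is to keep all three species and make the fills semi-transparent with the alpha argument so that overlapping curves stay visible. The value 0.5 is an arbitrary choice and p_density is a hypothetical name.

```r
library(ggplot2)
library(palmerpenguins)

# alpha = 0.5 makes each filled curve half transparent;
# na.rm = TRUE drops the rows with missing body mass without a warning
p_density <- ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  xlab("Body Mass (g)") + ylab("Density") + ggtitle("Body mass distribution") +
  geom_density(alpha = 0.5, na.rm = TRUE) + theme_bw()
p_density
```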
A dot-plot is similar to a density plot, but it shows each data point in the distribution as a separate dot.
We will use geom_dotplot() to plot a dot-plot.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
xlab("Species") + ylab("Body mass (g)") +
ggtitle("Body mass in three different species of penguins") +
geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 100) + theme_bw()
A rug-plot is a simple way to visualize the distribution of data along the axis lines. It is often used in conjunction with other graphical representations.
We will use geom_rug() to plot a rug-plot.
library(ggplot2)
library(palmerpenguins)
library(dplyr)
penguins_gentoo <- penguins %>% filter(species == "Gentoo")
ggplot(data = penguins_gentoo, aes(x = body_mass_g, y = flipper_length_mm)) +
xlab("Body Mass of Gentoo penguins (g)") + ylab("Flipper length of Gentoo penguins (mm)") +
ggtitle("Body mass vs flipper length of Gentoo penguins") +
geom_point(colour = "darkred") + geom_rug() + theme_bw()
A box-plot is one of the better ways of showing data via quartiles: the box spans the first to the third quartile, and the line inside it marks the median.
We will use geom_boxplot() to plot a box-plot.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = species, y = body_mass_g, colour = species)) +
xlab("Species") + ylab("Body mass (g)") +
ggtitle("Body mass in three different species of penguins") + geom_boxplot() +
theme_bw()
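To connect the boxes to actual numbers, the quartiles can also be computed directly with base R. This is only a sketch for checking the plot by eye; the names mass_by_species and quartiles are hypothetical helpers, and the values should closely match the lower edge, middle line and upper edge of each box.

```r
library(palmerpenguins)

# Split body mass into one vector per species
mass_by_species <- split(penguins$body_mass_g, penguins$species)

# First quartile, median and third quartile for each species,
# ignoring missing values
quartiles <- sapply(mass_by_species, quantile,
                    probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
quartiles
```

The result is a small table with one column per species and one row per quartile.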
A violin-plot can be considered the best of both a box-plot and a density plot. It shows the quartile values, as in a box-plot, and also shows the distribution of the data, as in a density plot.
We will use geom_violin() in conjunction with geom_boxplot() to plot a violin-plot.
library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
xlab("Species") + ylab("Body mass (g)") +
ggtitle("Body mass in three different species of penguins") +
geom_violin(aes(colour = species), trim = TRUE) + geom_boxplot(width = 0.2) +
theme_bw()
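The box-plot and violin-plot examples share the same data, aesthetics and labels, and differ only in the geoms added. As a small sketch of that layering idea, the shared part can be stored in a variable and reused. The names p_base, p_box and p_violin are hypothetical and chosen only for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Shared base: data, aesthetic mappings, labels and theme (no geom yet)
p_base <- ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
  xlab("Species") + ylab("Body mass (g)") + theme_bw()

# Adding different geoms to the same base gives different graphs
p_box <- p_base + geom_boxplot(na.rm = TRUE)
p_violin <- p_base + geom_violin(na.rm = TRUE) +
  geom_boxplot(width = 0.2, na.rm = TRUE)

p_box
p_violin
```

Building plots this way keeps the repeated parts in one place when you want to compare several geoms on the same data.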
I hope this tutorial helped you get familiar with ggplot2 commands. There are many more types of graphs that you can plot using ggplot2; this tutorial only showed some of the commonly used ones. The best way to learn R is by actually doing it yourself. Try to recreate the examples given in this tutorial on your own, and then try what you learned with the different datasets available in R. Have a good day!
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.