Learn how to manipulate your data with the dplyr package. In the continuation with the previous chapter.
In the previous chapter we have seen quite a lot of functions from the dplyr
package. In this chapter, we will see the rest of the functions where we learn how to handle row names, how to join columns and rows and different set operations in the dplyr
package.
#loading neessary packages
library(dplyr)
Tidy data does not use row names. So use rownames_to_column()
command to convert row names to a new column to the data. The function column_to_rownames()
does the exact opposite of rownames_to_column()
as it converts a column into rownames but make sure that the column you are converting into rownames does not contain NA
values.
# mtcars dataset contains rownames
# creates new column called car_names which contains row names
%>% rownames_to_column(var = "car_names")
mtcars
# returns the original mtcars dataset
%>% rownames_to_column(var = "car_names") %>%
mtcars column_to_rownames(var = "car_names")
Joins columns with other columns. Similar function as that of cbind()
from base R.
<- tidytable::data.table(x = letters[1:5], y = c(1:5))
df1 <- tidytable::data.table(x = letters[3:7], y = c(6:10))
df2 bind_cols(df1,df2)
#similar functionality
cbind(df1,df2)
Joins rows with other rows. Similar function as that of rbind()
from base R.
<- tidytable::data.table(x = letters[1:5], y = c(1:5))
df1 <- tidytable::data.table(x = letters[3:7], y = c(6:10))
df2 bind_rows(df1,df2)
#similar functionality
rbind(df1,df2)
The functions that are described below have the same functionality as that of bind_cols()
but give you control over how the columns are joined.
Mutating joins include left_join()
, right_join()
, inner_join()
and full_join()
and filtering joins include semi_join()
and anti_join()
.
In the code below, matching variables of df2 are joined with df1. In the final data, you can see that only kevin and sam from df2 are matched with df1, and only those row values are joined with df1. For those variables which didn’t get a match, the row values for those are filled with NA
. You can interpret the variables with NA
values as; both john and chris are not present in df2.
If you are familiar with set theory in mathematics, what we are doing essentially is similar to \((df1 \cap df2) \cup df1\).
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df2
%>% left_join(df2) df1
Similar to left_join()
but here, you will be joining matching values from df1 to df2, the opposite of what we did earlier. As you can see only kevin and sam from the df1 is matched with df2, and only those row values are joined with df2. For the variables which didn’t get a match, the row values for those are filled with NA
. You can interpret the variables with NA
values as; bob is not present in df1.
This function, in the manner used here, is similar to \((df1 \cap df2) \cup df2\).
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df2
%>% right_join(df2) df1
The function inner_join()
compares both df1 and df2 variables and only joins rows with the same variables. Here only kevin and sam are common in both the dataframes so the row values of only those columns are joined and others are omitted.
This function is similar to \(df1 \cap df2\).
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df2
%>% inner_join(df2) df1
The function full_join()
compares both df1 and df2 variables and joins all possible matches while retaining both mistakes in df1 and df2 with NA
values.
This function is similar to \(df1 \cup df2\).
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df2
%>% full_join(df2) df1
This is an example of filtering join. The function anti_join()
compares df1 variables to and df2 variables and only outputs those variables of df1 which didn’t get a match with df2.
This function, in the manner used here, is similar to \(df1 \cap df2^c\).
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df2
%>% anti_join(df2) df1
This is an example of filtering join. The function semi_join()
is similar to inner_join()
but it only gives variables of df1 which has a match with df2.
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df2
%>% semi_join(df2) df1
Here is a nice graphical representation of the functions we just described now. Image source.
Additionally, you can specify which common columns to match.
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df2
# match with column 'x'
%>% left_join(df2, by = "x")
df1
<- tidytable::data.table(a = c("john","kevin","chris","sam","sam"), y = 1:5)
df3 <- tidytable::data.table(b = c("kevin","sam", "bob"), z = 10:12)
df4
# matching with column having different names, a and b in this case
%>% left_join(df4, by = c("a" = "b")) df3
Similar to the mutating join functions that we had seen, there are different functions related to set theory operations.
Outputs common rows in the dataset.
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"))
df2
intersect(df1, df2)
x
1: kevin
2: sam
Outputs rows in first data frame but not in second data frame.
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"))
df2
setdiff(df1, df2)
x
1: john
2: chris
Outputs all the rows in both dataframes
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"))
df2
union(df1, df2)
x
1: john
2: kevin
3: chris
4: sam
5: bob
Checks whether two datasets have same number of rows.
<- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df1 <- tidytable::data.table(x = c("kevin","sam", "bob"))
df2
setequal(df1, df2)
[1] FALSE
In this chapter, we have seen how to handle row names, how to combine columns and rows, what are mutating and filtering joins and various set operations. Thus to conclude this chapter, we have now learned almost all functions in the dplyr package and have seen how to manipulate data efficiently. With the knowledge of the pipe operator that we have seen in chapter 1, we are now equipped to write codes compactly and more clearly. I hope this chapter was useful for you and I will see you next time. Have a good day!
Previous chapter:
2: Data manipulation using dplyr (part 1)
dplyr
.[1] "2022-01-06 17:12:52 IST"
If you see mistakes or want to suggest changes, please create an issue on the source repository.