Press "Enter" to skip to content

How to merge data in R using R merge, dplyr, or data.table

R has several quick and elegant ways to join data frames using a common column. I would like to show you three of them:

  • base R merge() function
  • dplyrjoins the family of functions
  • data.tableparenthesis syntax

Get and import the data

For this example, I’ll use one of my favorite demo data sets: US Bureau of Transportation Statistics flight delay times. If you’d like to follow along, go to http://bit.ly/USFlightDelays and download data for the time period of your choice with columns Date of flight, Report_Airline, Origin, Destinationand ExitDelayMinutes. Also get the lookup table for Report_Airline.

Or, you can download these two data sets, plus my R code in a single file and a PowerPoint explaining the different types of data merges, here:

discharge

It includes R scripts, various data files, and a PowerPoint to accompany the InfoWorld tutorial. Sharon McLis

To read the file with R base, you would first unzip the flight delay file and then import the flight delay data and code lookup file with read.csv(). If you’re running the code, it’s likely that the delay file you downloaded has a different name than the code below. Also, note that the search file is unusual. .csv_ extension.


unzip("673598238_T_ONTIME_REPORTING.zip")
mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",
sep = ",", quote="\"")
mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",
quote="\"", sep = "," )

Next, I’ll take a look at both files with head():


head(mydf)
     FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW  X
1 2019-08-01                DL    ATL  DFW            31 NA
2 2019-08-01                DL    DFW  ATL             0 NA
3 2019-08-01                DL    IAH  ATL            40 NA
4 2019-08-01                DL    PDX  SLC             0 NA
5 2019-08-01                DL    SLC  PDX             0 NA
6 2019-08-01                DL    DTW  ATL            10 NA

head(mylookup) Code Description 1 02Q Titan Airways 2 04Q Tradewind Aviation 3 05Q Comlux Aviation, AG 4 06Q Master Top Linhas Aereas Ltd. 5 07Q Flair Airlines Ltd. 6 09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern

Fuses with base R

He mydf the delay data frame only has information from the airline by code. I would like to add a column with the names of the airlines of mylookup. A base R way to do this is with the merge() function, using the basic syntax merge(df1, df2). The order of data frame 1 and data frame 2 doesn’t matter, but whichever is first is considered x and whichever is second is y.

If the columns you want to join by don’t have the same name, you need to tell merge which columns you want to join: by.x for the data frame column name x,y by.y for him and, how merge(df1, df2, by.x = "df1ColName", by.y = "df2ColName").

You can also tell merge if you want all rows, including unmatched ones, or just matching rows, with the arguments all.x and all.y. In this case, I’d like to have all the rows of the lag data; if there is no airline code in the lookup table, I still want the information. But I don’t need lookup table rows that are not in the delay data (there are some codes for old airlines that don’t fly there anymore). So, all.x It does not matter TRUE but all.y It does not matter FALSE. Here is the code:


joined_df <- merge(mydf, mylookup, by.x = "OP_UNIQUE_CARRIER", 
by.y = "Code", all.x = TRUE, all.y = FALSE)

The new joined data frame includes a column called Description with the name of the airline based on the airline code:


head(joined_df)
  OP_UNIQUE_CARRIER    FL_DATE ORIGIN DEST DEP_DELAY_NEW  X       Description
1                9E 2019-08-12    JFK  SYR             0 NA Endeavor Air Inc.
2                9E 2019-08-12    TYS  DTW             0 NA Endeavor Air Inc.
3                9E 2019-08-12    ORF  LGA             0 NA Endeavor Air Inc.
4                9E 2019-08-13    IAH  MSP             6 NA Endeavor Air Inc.
5                9E 2019-08-12    DTW  JFK            58 NA Endeavor Air Inc.
6                9E 2019-08-12    SYR  JFK             0 NA Endeavor Air Inc.

Joins with dplyr

He dplyr The package uses SQL database syntax for its join functions. TO unite left means: Include everything to the left (what was the data frame x in merge()) and all rows matching the right data frame (y). If the join columns have the same name, all you need is left_join(x, y). If they don’t have the same name, you need a by argument, like left_join(x, y, by = c("df1ColName" = "df2ColName")).

Note the syntax of by: Is a named vector, with the names of the left and right columns enclosed in quotes.

Also Read:  What is Apache Spark? The big data platform that crushed Hadoop

Update: as of dplyr version 1.1.0 (on CRAN as of Jan 29, 2023), dplyr unions have an additional by syntax using join_by():


left_join(x, y, by = join_by(df1ColName == df2ColName))

The new join_by() helper function uses unquoted column names and the == boolean operator, which the package authors say makes more sense in an R context than c("col1" = "col2")from = it is meant to assign a value to a variable, not to test for equality.


unite left IDG

A left join keeps all rows in the left data frame and only matching rows in the right data frame.

The code to import and merge both data sets using left_join() Is under. Start loading the dplyr and readr packages, and then read the two files with read_csv(). when you use read_csv()I don’t need to unzip the file first.


library(dplyr)
library(readr)

mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")
mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")

joined_tibble <- left_join(mytibble, mylookup_tibble,
by = join_by(OP_UNIQUE_CARRIER == Code))

Note that dplyr is older by syntax without join_by() it still works


joined_tibble <- left_join(mytibble, mylookup_tibble,
by = c("OP_UNIQUE_CARRIER" = "Code"))

read_csv() creates tibbleswhich are a type of data frame with some additional features. left_join() merge the two. Take a look at the syntax: in this case, the order matters. left_join() half include all rows to the left, or the first data set, but only the rows that match the second. And, since I need to join two columns with different names, I included a by argument.

The new join syntax in the development-only version of dplyr I would be:


joined_tibble2 <- left_join(mytibble, mylookup_tibble, 
by = join_by(OP_UNIQUE_CARRIER == Code))

However, since most people probably have the CRAN version, I’ll use dplyroriginal named vector syntax in the rest of this article, until join_by() becomes part of the CRAN version.

We can see the structure of the result with dplyr‘s glimpse() function, which is another way to view the first elements of a data frame:


glimpse(joined_tibble)
Observations: 658,461
Variables: 7
$ FL_DATE           <date> 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01…
$ OP_UNIQUE_CARRIER <chr> "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…
$ ORIGIN            <chr> "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…
$ DEST              <chr> "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS…
$ DEP_DELAY_NEW     <dbl> 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0, …
$ X6                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Description       <chr> "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …

This joined data set now has a new column with the name of the airline. If you run a version of this code yourself, you’ll probably notice that dplyr it is much faster than base R.

Next, let’s look at a super-fast way to perform joins.

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *