Reading Multiple CSVs into Merged R Dataframe

The purpose of this article is to explain how to load and clean 100+ .csv files containing polling place data into a singular R dataframe. The data, which is available for download here, is structured as follows:
— Each state (32 in total) has its own folder
— Within each state (folder), there are a variable number of CSV files, one for each year that polling place data is available

Here is what the folder structure looks like:

Here is one example of the datasets that could be contained in a folder:

Generally, my approach to reading in these data files into 1 R Dataframe is to:
0. Initialize an empty dataframe to hold the results
1. Loop through each state folder
2. Determine the CSVs within that state folder

As a precautionary measure, let’s determine how many state folders there are. We’ll assume that any object in our working directory that does NOT end in .csv is a state folder. Because there’s no direct regex to look for “un-matches”, let’s find the set difference between all of our files and those ending in .csv.

0. Initialize an Empty Dataframe

Next, we’ll set up the structure for our dataframe. These are the columns and associated data types for each column that we expect to get.

1. Read in Data

In this next section, which constitutes the major lift for reading in the data, we will:
1. Loop through each of the state folders we discovered earlier (outer loop) and determine the number of files in the folder
2. Loop through each of the .csv files (inner loop) to:
3a. Determine if it’s a.csv
3b. If it is, read it in as a temporary dataframe
3. Create two new columns in the temporary dataframe:
— state_name <- same as the folder name in the outer loop
— year <- same as the year that appears in the name of the file
4. Append (union) the temporary dataframe created in Step 3 to the empty, initialized dataframe from earlier

1.1 Loop through State Folders

First, we start with the outer loop, which goes through all of the state folders, as defined above. Once in each state folder, we’ll identify all of the files that exist within it (usually a few .csv files and one .md file).

1.2 Loop through CSV Files

Next, let’s loop through each of the states to read our .csv files in. Remember that in this step we will:
— Determine if the file is.csv (otherwise, it’s .md)
— Determine the name of the file (a combination of our file path, the state name from the outer loop, and the value from the inner loop)
— Read the file in as a temporary dataframe

1.3 Create New Features

Now that we’ve read in our dataset, we’ll create two new columns in the temporary dataframe:
1. state_name | From the outer loop
2. year | From the name of the csv (also available as the iterator, i)

1.4 Union to Master Dataframe

Lastly, we’ll append the temporary dataframe to our master dataframe using a smart union (bind_rows), which doesn’t need all column headers to match.

Altogether Now!

What does all of this look like together? Well, here it is:

And that’s it! If you run the code above, make sure to uncomment the print/cat functions so you can get updates as the code runs. Let’s take a look at the final dataset:

Looks great!

Alternative Methods

It shouldn’t go without saying that this is not the only way to read in multiple data files from different folders! More experienced R users usually use techniques from the purrr package (i.e. map() function) to more efficiently (in terms of time and lines of code) perform the same task as above. While the purrr package does provide much facility and readability in code, I usually prefer to write out the process using for loops, as it allows me to better understand what’s happening “under the hood”.

Originally published at https://datacracy.netlify.app on December 31, 2020.

Strategy & Analytics Consultant | R Data Scientist

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store