purrr is a package that fills the gaps in R’s functional programming (FP) toolkit: it’s built to make your pure functions purrr. It enhances the toolkit by providing a comprehensive and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions, which let you replace many for loops with code that is both more concise and easier to read.
purrr is the tidyverse’s answer to the apply functions for iteration. It’s one of those packages you might have heard of but found too daunting to sit down and learn. Starting with the map functions and moving on to take advantage of the list’s strengths, this presentation will have you purrring in no time.
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------ tidyverse 1.3.0 --
## √ ggplot2 3.3.2 √ dplyr 1.0.2
## √ tibble 3.0.3 √ stringr 1.4.0
## √ tidyr 1.1.2 √ forcats 0.5.0
## √ purrr 0.3.4
## -- Conflicts --------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# pattern is a regular expression: escape the dot and anchor it at the end of the name
files <- list.files(path = "C:/Users/froze/OneDrive/Masaüstü/RÖDEVİ/",
                    pattern = "\\.csv$", full.names = TRUE)
Imagine reading in hundreds of files with a similar structure and performing the same action on each of them. We don’t want to write hundreds of repetitive lines of code to read every file or run the operation on it. Instead, we want to iterate over them. Iteration is the process of doing the same thing to many inputs. It is essential for keeping our code efficient, and it is especially powerful when working with lists.
The names of the 16 CSV files for this exercise have been loaded into an object called files. In our own work, we can use the list.files() function to build it. The readr library is already loaded.
# Initialize list
all_files <- list()
# For loop to read files into a list
for(i in seq_along(files)){
all_files[[i]] <- read.csv(file = files[[i]])
}
# Output size of list object
length(all_files)
## [1] 0
We’ve made a nice loop, but it takes a lot of code to do something as basic as reading a set of files into a list. This is where purrr comes in. With purrr::map(), we can do the same thing as the for loop in one line of code. The map() function iterates over a list or vector and applies the function supplied as the .f argument to each element.
map() takes two arguments:
- The first is the list that will be iterated over.
- The second is a function that will act on each element of the list.
The readr library is already loaded.
# Load purrr library
library(purrr)
# Use map to iterate
all_files_purrr <- map(files, read_csv)
# Output size of list object
length(all_files_purrr)
## [1] 0
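To try the same pattern without access to the original files, here is a minimal, self-contained sketch. The scratch directory, file names, and toy data are hypothetical stand-ins for the CSV files used above, and base R's read.csv() is used so that the sketch depends only on purrr.

```r
library(purrr)

# Hypothetical setup: a scratch directory holding three tiny CSV files
tmp_dir <- file.path(tempdir(), "purrr_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = i * 10),
            file = file.path(tmp_dir, paste0("file_", i, ".csv")),
            row.names = FALSE)
}

# pattern is a regular expression, so escape the dot and anchor the end
demo_files <- list.files(path = tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# One call replaces the whole for loop: read every file into a list
demo_data <- map(demo_files, read.csv)

# Number of data frames read
length(demo_data)
```

If length() returns 0 in a setup like this, the usual cause is that list.files() found no files matching the path and pattern, so it is worth printing the file vector before mapping over it.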
But iteration is not just for reading files; it can be used to perform other operations on objects as well. First, we’re going to iterate with a for loop.
We will convert each element of a list to a numeric data type and then put it back in the same position in the same list.
For this exercise, we will iterate with a for loop over list_of_df, which is a list of character vectors whose characters are actually numbers. We need to convert the character vectors to numeric before we can perform mathematical operations on them; the base R function as.numeric() does that.
char_vector <- c("1", "2","3", "4")
list_of_df <- list()
for(i in 1:10){
list_of_df[[i]] <- char_vector
}
# Check the class type of the first element
class(list_of_df[[1]])
## [1] "character"
# Change each element from a character to a number
for(i in seq_along(list_of_df)){
list_of_df[[i]] <- as.numeric(list_of_df[[i]])
}
# Check the class type of the first element
class(list_of_df[[1]])
## [1] "numeric"
# Print out the list
list_of_df
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] 1 2 3 4
##
## [[4]]
## [1] 1 2 3 4
##
## [[5]]
## [1] 1 2 3 4
##
## [[6]]
## [1] 1 2 3 4
##
## [[7]]
## [1] 1 2 3 4
##
## [[8]]
## [1] 1 2 3 4
##
## [[9]]
## [1] 1 2 3 4
##
## [[10]]
## [1] 1 2 3 4
Now we will again convert each element of a list to a numeric data type and store the results in the same list, but this time with map() instead of a for loop. purrr’s map() function iterates over the list for us, so the conversion takes one line rather than a whole loop. Note that list_of_df was already converted by the for loop above, so the first class check below already reports "numeric"; on a fresh list of character vectors, the same map() call would perform the conversion.
# Check the class type of the first element
class(list_of_df[[1]])
## [1] "numeric"
# Change each character element to a number
list_of_df <- map(list_of_df, as.numeric)
# Check the class type of the first element again
class(list_of_df[[1]])
## [1] "numeric"
# Print out the list
list_of_df %>%
glimpse()
## List of 10
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
## $ : num [1:4] 1 2 3 4
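To make the comparison concrete, the following sketch runs both approaches on a fresh list and checks that they agree. The object names (char_list, loop_result, map_result) are made up for this illustration.

```r
library(purrr)

# Toy input: a list of character vectors whose contents are really numbers
char_list <- list(c("1", "2"), c("3", "4"), c("5", "6"))

# Version 1: for loop, writing each converted vector into a preallocated list
loop_result <- vector("list", length(char_list))
for (i in seq_along(char_list)) {
  loop_result[[i]] <- as.numeric(char_list[[i]])
}

# Version 2: map() applies as.numeric() to every element
map_result <- map(char_list, as.numeric)

# Compare the two results
identical(loop_result, map_result)
```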
Recall how pipes chain functions together. We can also use pipes inside map() to iterate a whole pipeline of tasks over a list of inputs. Here we work with a list of numbers rather than the repurrrsive datasets so that we can do mathematical operations.
# Create a list of values from 1 through 10
numlist <- list(1,2,3,4,5,6,7,8,9,10)
# Iterate over the numlist
map(numlist, ~.x %>% sqrt() %>% sin())%>%
glimpse()
## List of 10
## $ : num 0.841
## $ : num 0.988
## $ : num 0.987
## $ : num 0.909
## $ : num 0.787
## $ : num 0.638
## $ : num 0.476
## $ : num 0.308
## $ : num 0.141
## $ : num -0.0207
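The formula shorthand used above can be read as an ordinary anonymous function. As a small sketch (with a shorter, made-up input list), the two calls below compute the same thing: in the ~ form, .x stands for the current element.

```r
library(purrr)

numlist <- list(1, 4, 9)

# Formula shorthand: .x is the current element, piped through sqrt() then sin()
with_formula <- map(numlist, ~ .x %>% sqrt() %>% sin())

# The same computation written as an ordinary anonymous function
with_function <- map(numlist, function(x) sin(sqrt(x)))

# Compare the two results
identical(with_formula, with_function)
```

Which form to use is mostly a matter of taste; the formula form is shorter, while the explicit function makes the argument name visible.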
When trying to solve a problem with data, many of us first need to build some simulated data to see whether our idea is even feasible. In this exercise, we will see how this works in purrr by simulating data for two populations, a and b, from three sites: “north”, “east”, and “west”. The two populations will be randomly drawn from normal distributions with different means and standard deviations.
# List of sites north, east, and west
sites <- list("north","east","west")
# Create a list of dataframes, each with a sites, a, and b column
list_of_df <- map(sites,
~data.frame(sites = .x,
a = rnorm(mean = 5, n = 200, sd = (5/2)),
b = rnorm(mean = 200, n = 200, sd = 15)))
list_of_df%>%
glimpse()
## List of 3
## $ :'data.frame': 200 obs. of 3 variables:
## ..$ sites: chr [1:200] "north" "north" "north" "north" ...
## ..$ a : num [1:200] 3.738 0.197 4.388 4.59 0.487 ...
## ..$ b : num [1:200] 220 188 190 188 196 ...
## $ :'data.frame': 200 obs. of 3 variables:
## ..$ sites: chr [1:200] "east" "east" "east" "east" ...
## ..$ a : num [1:200] 6.761 7.253 5.653 0.423 7.145 ...
## ..$ b : num [1:200] 200 160 180 187 211 ...
## $ :'data.frame': 200 obs. of 3 variables:
## ..$ sites: chr [1:200] "west" "west" "west" "west" ...
## ..$ a : num [1:200] -0.46 7.87 8.39 2.27 3.36 ...
## ..$ b : num [1:200] 207 212 173 219 213 ...
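Because rnorm() draws random numbers, the values change on every run. A hedged sketch of one way to make such a simulation repeatable is to call set.seed() first; the seed value and the smaller sample size below are arbitrary choices for illustration.

```r
library(purrr)

set.seed(123)  # arbitrary seed, chosen only so the draws are repeatable

sites <- list("north", "east", "west")

# One data frame per site, with the same distributions as above but fewer rows
sim_list <- map(sites,
                ~ data.frame(sites = .x,
                             a = rnorm(n = 50, mean = 5, sd = 5 / 2),
                             b = rnorm(n = 50, mean = 200, sd = 15)))

# Inspect the shape (rows, columns) of each simulated data frame
map(sim_list, dim)
```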
map() is not limited to taking square roots or simulating data: you can also iterate over a list of inputs to fit several models, each using the data in one list element. You can then map over the fitted models to produce model summaries and examine the results.
# Map over the models to look at the relationship of a vs b
list_of_df %>%
map(~ lm(a ~ b, data = .)) %>%
map(summary)
## [[1]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2012 -1.6020 0.1433 1.9298 5.3882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.00134 2.31132 3.029 0.00278 **
## b -0.01086 0.01160 -0.936 0.35065
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.415 on 198 degrees of freedom
## Multiple R-squared: 0.004401, Adjusted R-squared: -0.0006274
## F-statistic: 0.8752 on 1 and 198 DF, p-value: 0.3507
##
##
## [[2]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6099 -1.6167 0.0127 1.5202 7.5954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.921669 2.373113 2.495 0.0134 *
## b -0.004025 0.011804 -0.341 0.7334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.55 on 198 degrees of freedom
## Multiple R-squared: 0.000587, Adjusted R-squared: -0.004461
## F-statistic: 0.1163 on 1 and 198 DF, p-value: 0.7334
##
##
## [[3]]
##
## Call:
## lm(formula = a ~ b, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5631 -1.6053 0.1558 1.6774 5.7957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.34471 2.44608 4.638 6.39e-06 ***
## b -0.03120 0.01226 -2.545 0.0117 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.527 on 198 degrees of freedom
## Multiple R-squared: 0.03167, Adjusted R-squared: 0.02678
## F-statistic: 6.477 on 1 and 198 DF, p-value: 0.01169
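Printing full summaries gets long quickly. One possible follow-up, sketched here on a small made-up list of data frames (toy_list is a hypothetical stand-in for list_of_df), is to map over the fitted models again and keep only the pieces of interest, such as the coefficients or the R-squared value.

```r
library(purrr)

set.seed(1)  # arbitrary seed for repeatable toy data

# Hypothetical stand-in for list_of_df: two small simulated data frames
toy_list <- map(list("north", "east"),
                ~ data.frame(sites = .x,
                             a = rnorm(n = 30, mean = 5, sd = 2.5),
                             b = rnorm(n = 30, mean = 200, sd = 15)))

# Fit one model per data frame
models <- map(toy_list, ~ lm(a ~ b, data = .x))

# Pull selected pieces out of each fit
coefs <- map(models, coef)                         # intercept and slope
r_squared <- map(models, ~ summary(.x)$r.squared)  # one R-squared per model

coefs
```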
The map() function is very effective when you need to loop over one list, but you will often need to loop over two lists simultaneously. This is where map2() comes in. While map() takes a single list as the .x argument, map2() takes two lists as two arguments: .x and .y.
To test out map2(), you are going to build a basic dataset from one list of numbers and one list of strings. You will combine these two lists to produce some simulated data.
# List of 1, 2 and 3
means <- list(1,2,3)
# Create sites list
sites <- list("north","west","east")
# Map over two arguments: sites and means
list_of_files_map2 <- map2(sites, means, ~data.frame(sites = .x,
a = rnorm(mean = .y, n = 200, sd = (5/2))))
list_of_files_map2%>%
glimpse()
## List of 3
## $ :'data.frame': 200 obs. of 2 variables:
## ..$ sites: chr [1:200] "north" "north" "north" "north" ...
## ..$ a : num [1:200] -0.4271 5.6903 0.0847 2.5748 1.0994 ...
## $ :'data.frame': 200 obs. of 2 variables:
## ..$ sites: chr [1:200] "west" "west" "west" "west" ...
## ..$ a : num [1:200] -1.88 0.72 4.92 -1.83 7.25 ...
## $ :'data.frame': 200 obs. of 2 variables:
## ..$ sites: chr [1:200] "east" "east" "east" "east" ...
## ..$ a : num [1:200] 1.51 6.69 4.39 6.31 3.44 ...
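Since the simulated values are random, the pairing of .x and .y can be easier to see on a deterministic toy example. The lists below are invented for this sketch.

```r
library(purrr)

x <- list(1, 2, 3)
y <- list(10, 20, 30)

# .x takes elements from the first list, .y from the second, pair by pair
sums <- map2(x, y, ~ .x + .y)

# Same idea with strings: build one label per (site, mean) pair
labels <- map2(list("north", "west", "east"), x,
               ~ paste0(.x, "_mean_", .y))

sums
labels
```

The first call pairs 1 with 10, 2 with 20, and 3 with 30; the second pairs each site name with the number in the same position.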
What if you need to loop over three lists? Is there a map3()? To loop over more than two lists, whether it’s three, four, or even 20, you need the pmap() function. However, pmap() expects its list arguments in a slightly different form.
To use pmap(), you first need to build a master list containing all the lists you want to loop over. The master list is the input to pmap(). Instead of .x and .y, use the names of the lists as the argument names.
You are going to simulate data one last time, using five lists as inputs instead of two. Using pmap() gives you complete control over the simulated dataset and lets you use two unique means and two unique standard deviations along with the unique sites.
means2 <- list(0.5, 1, 1.5)
sigma <- list(1, 2, 3)
sigma2 <- list(0.5, 1, 1.5)
# Create a master list, a list of lists
pmapinputs <- list(sites = sites, means = means, sigma = sigma,
means2 = means2, sigma2 = sigma2)
# Map over the master list
list_of_files_pmap <- pmap(pmapinputs,
function(sites, means, sigma, means2, sigma2)
data.frame(sites = sites,
a = rnorm(mean = means, n = 200, sd = sigma),
b = rnorm(mean = means2, n = 200, sd = sigma2)))
list_of_files_pmap%>%
glimpse()
## List of 3
## $ :'data.frame': 200 obs. of 3 variables:
## ..$ sites: chr [1:200] "north" "north" "north" "north" ...
## ..$ a : num [1:200] 2.797 0.684 0.585 0.889 0.317 ...
## ..$ b : num [1:200] -0.453 0.326 -0.373 0.373 0.459 ...
## $ :'data.frame': 200 obs. of 3 variables:
## ..$ sites: chr [1:200] "west" "west" "west" "west" ...
## ..$ a : num [1:200] 4.633 2.844 0.398 6.892 1.665 ...
## ..$ b : num [1:200] 1.66 1.28 1.19 1.35 -1.37 ...
## $ :'data.frame': 200 obs. of 3 variables:
## ..$ sites: chr [1:200] "east" "east" "east" "east" ...
## ..$ a : num [1:200] 2.775 2.329 0.768 0.245 5.188 ...
## ..$ b : num [1:200] 3.063 2.579 4.964 0.438 0.336 ...
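The name matching between the master list and the function arguments is the key detail of pmap(). The sketch below illustrates it with a smaller, deterministic example; the list names (site, n, value) are hypothetical and chosen only for this illustration.

```r
library(purrr)

# Master list: the element names must match the function's argument names
inputs <- list(site  = list("north", "west", "east"),
               n     = list(2, 3, 4),
               value = list(0.5, 1, 1.5))

# pmap() calls the function once per position, matching arguments by name
result <- pmap(inputs,
               function(site, n, value)
                 data.frame(site = site, x = rep(value, times = n)))

# Number of rows in each resulting data frame
map(result, nrow)
```

Because arguments are matched by name, the order of the elements in the master list does not have to match the order of the function's parameters.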