Base R notes

Many base R functions have faded from my daily use of R because of the tidyverse and its paradigm of doing as many operations as possible inside a data.frame.

Get the variable name

deparse(substitute(variable))
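
This only does something useful inside a function, where substitute() captures the unevaluated argument. A minimal sketch, assuming a hypothetical helper called var_name (the name is mine, not from any package):

```r
# Hypothetical helper: returns the name of whatever was passed as `x`.
var_name <- function(x) {
  # substitute() captures the expression supplied by the caller;
  # deparse() turns that expression into a character string.
  deparse(substitute(x))
}

sepal_length <- c(5.1, 4.9, 4.7)
var_name(sepal_length)  # "sepal_length"
```

This is handy for building axis labels or messages from the argument name rather than its value.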

Indexing and subsetting

which() returns the integer positions of the TRUE values in a logical vector; those positions can be used inside [] for subsetting.
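
A small sketch of the difference between subsetting with the logical vector itself and with the positions from which(). The vectors x and y are made up for illustration:

```r
x <- c(4.9, 5.4, 5.0, 6.1)

is_large <- x > 5          # logical vector: FALSE TRUE FALSE TRUE
idx <- which(is_large)     # integer positions of the TRUE values: 2 4

x[is_large]                # 5.4 6.1
x[idx]                     # 5.4 6.1

# The two differ when the comparison produces NA:
y <- c(4.9, NA, 6.1)
y[y > 5]                   # NA 6.1  (the NA is carried through)
y[which(y > 5)]            # 6.1     (which() drops NA)
```

So which() is mostly useful when the positions themselves are needed, or when NA should be silently dropped.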

Tidyverse alternative (notes for myself)

Imagine that I have a list of data.frames (group_split() splits a data.frame into a list of data.frames, one per value of the specified column).

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
list_df <- iris %>% group_split(Species)

list_df[[1]]
## # A tibble: 50 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ... with 40 more rows

To get the data.frames whose sum of large sepal lengths (Sepal.Length > 5) exceeds 200, I can pull out the values > 5 from each data.frame, sum them, compare the sums against 200, and subset the list with the result, like this:

subset_vector <- list_df %>%  map_dbl(
    ~ filter(., Sepal.Length > 5)  %>% 
    pull(Sepal.Length)  %>% 
    as.double()  %>% 
    sum(na.rm = TRUE)
)

list_df[which(subset_vector>200)]
## <list_of<
##   tbl_df<
##     Sepal.Length: double
##     Sepal.Width : double
##     Petal.Length: double
##     Petal.Width : double
##     Species     : factor<fb977>
##   >
## >[2]>
## [[1]]
## # A tibble: 50 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
##  1          7           3.2          4.7         1.4 versicolor
##  2          6.4         3.2          4.5         1.5 versicolor
##  3          6.9         3.1          4.9         1.5 versicolor
##  4          5.5         2.3          4           1.3 versicolor
##  5          6.5         2.8          4.6         1.5 versicolor
##  6          5.7         2.8          4.5         1.3 versicolor
##  7          6.3         3.3          4.7         1.6 versicolor
##  8          4.9         2.4          3.3         1   versicolor
##  9          6.6         2.9          4.6         1.3 versicolor
## 10          5.2         2.7          3.9         1.4 versicolor
## # ... with 40 more rows
## 
## [[2]]
## # A tibble: 50 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>    
##  1          6.3         3.3          6           2.5 virginica
##  2          5.8         2.7          5.1         1.9 virginica
##  3          7.1         3            5.9         2.1 virginica
##  4          6.3         2.9          5.6         1.8 virginica
##  5          6.5         3            5.8         2.2 virginica
##  6          7.6         3            6.6         2.1 virginica
##  7          4.9         2.5          4.5         1.7 virginica
##  8          7.3         2.9          6.3         1.8 virginica
##  9          6.7         2.5          5.8         1.8 virginica
## 10          7.2         3.6          6.1         2.5 virginica
## # ... with 40 more rows
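
Since these are base R notes, here is a rough sketch of the same selection without the tidyverse. It assumes split() as a stand-in for group_split() and vapply() as a stand-in for map_dbl(); unlike group_split(), split() keeps the species names on the list.

```r
# One data.frame per species (base R counterpart of group_split)
list_df <- split(iris, iris$Species)

# Sum of the Sepal.Length values above 5, one number per data.frame
subset_vector <- vapply(
  list_df,
  function(df) sum(df$Sepal.Length[df$Sepal.Length > 5], na.rm = TRUE),
  numeric(1)
)

# A logical vector can be used directly in []; which() is optional here
kept <- list_df[subset_vector > 200]
names(kept)
```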

Or, I can do everything within the data.frame, in this way:

list_df  %>%
map(
     ~ filter(., Sepal.Length > 5)  %>%  
    mutate(length_sum = sum(as.double(Sepal.Length), na.rm = TRUE))  %>% 
    filter(length_sum > 200)
)
## [[1]]
## # A tibble: 0 x 6
## # ... with 6 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
## #   Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>, length_sum <dbl>
## 
## [[2]]
## # A tibble: 47 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species    length_sum
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>           <dbl>
##  1          7           3.2          4.7         1.4 versicolor       282.
##  2          6.4         3.2          4.5         1.5 versicolor       282.
##  3          6.9         3.1          4.9         1.5 versicolor       282.
##  4          5.5         2.3          4           1.3 versicolor       282.
##  5          6.5         2.8          4.6         1.5 versicolor       282.
##  6          5.7         2.8          4.5         1.3 versicolor       282.
##  7          6.3         3.3          4.7         1.6 versicolor       282.
##  8          6.6         2.9          4.6         1.3 versicolor       282.
##  9          5.2         2.7          3.9         1.4 versicolor       282.
## 10          5.9         3            4.2         1.5 versicolor       282.
## # ... with 37 more rows
## 
## [[3]]
## # A tibble: 49 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   length_sum
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>          <dbl>
##  1          6.3         3.3          6           2.5 virginica       324.
##  2          5.8         2.7          5.1         1.9 virginica       324.
##  3          7.1         3            5.9         2.1 virginica       324.
##  4          6.3         2.9          5.6         1.8 virginica       324.
##  5          6.5         3            5.8         2.2 virginica       324.
##  6          7.6         3            6.6         2.1 virginica       324.
##  7          7.3         2.9          6.3         1.8 virginica       324.
##  8          6.7         2.5          5.8         1.8 virginica       324.
##  9          7.2         3.6          6.1         2.5 virginica       324.
## 10          6.5         3.2          5.1         2   virginica       324.
## # ... with 39 more rows

The advantage of the data.frame approach is that I can keep using the rich vocabulary that the tidyverse provides. For example, I can condition the sum on Sepal.Width by using group_by() within each species:

list_df_advanced <- list_df %>% 
  map(
    ~ mutate(., width_category = if_else(Sepal.Width > 3, "wide", "narrow"))
  )

list_df_advanced  %>%
map(
     ~ filter(., Sepal.Length > 5)  %>%  
    group_by(width_category) %>% 
    mutate(length_sum = sum(as.double(Sepal.Length), na.rm = TRUE))  %>% 
    ungroup() %>% 
    filter(length_sum > 200)
)
## [[1]]
## # A tibble: 0 x 7
## # ... with 7 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
## #   Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>, width_category <chr>,
## #   length_sum <dbl>
## 
## [[2]]
## # A tibble: 39 x 7
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species    width_category
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>      <chr>         
##  1          5.5         2.3          4           1.3 versicolor narrow        
##  2          6.5         2.8          4.6         1.5 versicolor narrow        
##  3          5.7         2.8          4.5         1.3 versicolor narrow        
##  4          6.6         2.9          4.6         1.3 versicolor narrow        
##  5          5.2         2.7          3.9         1.4 versicolor narrow        
##  6          5.9         3            4.2         1.5 versicolor narrow        
##  7          6           2.2          4           1   versicolor narrow        
##  8          6.1         2.9          4.7         1.4 versicolor narrow        
##  9          5.6         2.9          3.6         1.3 versicolor narrow        
## 10          5.6         3            4.5         1.5 versicolor narrow        
## # ... with 29 more rows, and 1 more variable: length_sum <dbl>
## 
## [[3]]
## # A tibble: 32 x 7
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   width_category
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>     <chr>         
##  1          5.8         2.7          5.1         1.9 virginica narrow        
##  2          7.1         3            5.9         2.1 virginica narrow        
##  3          6.3         2.9          5.6         1.8 virginica narrow        
##  4          6.5         3            5.8         2.2 virginica narrow        
##  5          7.6         3            6.6         2.1 virginica narrow        
##  6          7.3         2.9          6.3         1.8 virginica narrow        
##  7          6.7         2.5          5.8         1.8 virginica narrow        
##  8          6.4         2.7          5.3         1.9 virginica narrow        
##  9          6.8         3            5.5         2.1 virginica narrow        
## 10          5.7         2.5          5           2   virginica narrow        
## # ... with 22 more rows, and 1 more variable: length_sum <dbl>

I can’t think of a straightforward way to achieve this in base R without many loops…
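
For comparison, one possible base R sketch that I believe gives the same rows, assuming split() and lapply() in place of group_split() and map(), and ave() as the closest thing to group_by() plus mutate(). It avoids explicit loops, though it is arguably harder to read than the pipeline above:

```r
# One data.frame per species
list_df <- split(iris, iris$Species)

result <- lapply(list_df, function(df) {
  # Label each row by sepal width
  df$width_category <- ifelse(df$Sepal.Width > 3, "wide", "narrow")

  # Keep only the large sepals
  df <- df[df$Sepal.Length > 5, ]

  # ave() computes a per-group value and repeats it on every row of the group
  df$length_sum <- ave(df$Sepal.Length, df$width_category, FUN = sum)

  # Keep the rows whose group sum exceeds 200
  df[df$length_sum > 200, ]
})

# Row counts per species, to compare with the tibble sizes printed above
sapply(result, nrow)
```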

Tim

Personalizing medicine
