intRos: Tidy data manipulation

Tidy and transfrom your data using reproducible code in R
tidyverse
R
Author
Affiliation

Alice Trevail

University of Exeter

Published

August 11, 2024

Modified

July 19, 2024

Outline

The way we collect or download data can often be different to how we analyse them. For example, have you ever needed to work on a subset of your data, add new columns, or calculate summary values? There are good reasons to do this kind of data manipulation in a reproducible way: it’s less prone to errors, saves us time when repeating the process, and helps us to share our code with confidence. In this walkthrough, we’ll share some of our favourite tools for reproducible data manipulation in R and the tidyverse.

Prerequisites

  • Have R installed
  • Be using R for your research

Learning objectives

Learning objectives

This introduction should allow you to:

  • Change the structure of your data using reproducible code
  • Know where to learn more about tidyverse data manipulation

Tidy data manipulation

Why manipulate data in R?

  • Reproducible

  • Easy to update and re-use with new data

  • Less prone to human error

  • Can be quick, easy, and fun…

Here’s an example of how we can use the tidyverse packages (& janitor) to tidy, transform, and visualise the palmer penguins data:

Penguin morphometrics
Code
# load the tidyverse, and other useful packages
pacman::p_load(flextable, palmerpenguins, 
               janitor, magrittr, tidyverse) 

vars_measurements <- c("culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g")

# use pipe operative from magrittr
penguins_example <- penguins_raw %>%
  # column names to snake_case using janitor package
  janitor::clean_names() %>% 
  # rename columns using dplyr
  rename(carbon = delta_13_c_o_oo,
         nitrogen = delta_15_n_o_oo) %>%
  # add new columns using dplyr
  mutate(
    # extract year using lubridate
    year = lubridate::year(date_egg), 
    # change case in sex column using stringr
    sex = str_to_lower(sex), 
    # take first element of species column string using stringr and purrr
    species = str_split_i(species, pattern = " ", i = 1)) %>% 
  # remove nas in sex column using dplyr
  filter(!is.na(sex))

# plot data using ggplot
ggplot(penguins_example, aes(x = flipper_length_mm, y = body_mass_g, 
                             # reorder sex using forcats
                             col = fct_reorder2(sex, flipper_length_mm, body_mass_g))) +
  facet_grid(cols = vars(species), scales = "free")+
  geom_point()+
  geom_smooth(method = lm)+
  scale_colour_viridis_d(option = "magma", begin = 0.7, end = 0.3, name = "Sex")+
  theme_minimal()+
  theme(panel.border = element_rect(fill = NA))

Sexual dimporphism in Palmer penguins
Penguin isotope signatures
Code
penguins_summary_isotopes <- penguins_example %>% 
  # change data format to long using tidyr
  pivot_longer(cols = carbon:nitrogen, names_to = "isotope", values_to = "value") %>%
  # group data using dplyr
  group_by(species, isotope) %>% 
  # summarise data for each group using dplyr
  summarize(mean = mean(value, na.rm = T),
            sd =  sd(value, na.rm = T)) %>%
  # change data format to wide using tidyr
  pivot_wider(id_cols = species, names_from = isotope, values_from=c(mean, sd))

ggplot(penguins_example, aes(x = carbon, y = nitrogen, col = species)) +
  geom_point(alpha = 0.6)+
  geom_errorbar(data = penguins_summary_isotopes, 
                aes(x = mean_carbon, 
                    ymax = mean_nitrogen+sd_nitrogen, ymin = mean_nitrogen-sd_nitrogen, col = species),
                inherit.aes = F, width = 0.1)+
  geom_errorbar(data = penguins_summary_isotopes, 
                aes(y = mean_nitrogen, 
                    xmax = mean_carbon+sd_carbon, xmin = mean_carbon-sd_carbon, col = species), 
                inherit.aes = F, width = 0.1)+
  scale_colour_viridis_d(option = "mako", begin = 0.75, end = 0.1, name = "Species")+
  theme_minimal()+
  theme(panel.border = element_rect(fill = NA))

Isotope signatures among Palmer penguins

Some useful functions

Here, we’ll dig into the functions used in the above example to see how we processed the data:

janitor::clean_names()

Let’s remind ourselves of the original data:

glimpse(penguins_raw)
Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
$ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
$ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
$ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…

And now tidy up our column names into snake_case. (This isn’t part of the tidyverse, but it is too useful not to mention)

penguins_example <- penguins_raw %>%
  janitor::clean_names() 

glimpse(penguins_example)
Rows: 344
Columns: 17
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ delta_15_n_o_oo   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ delta_13_c_o_oo   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…

dplyr::rename()

To rename a column

penguins_example %<>%
  dplyr::rename(carbon = delta_13_c_o_oo,
         nitrogen = delta_15_n_o_oo) 

glimpse(penguins_example)
Rows: 344
Columns: 17
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ nitrogen          <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ carbon            <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…

dplyr::mutate()

To add a column

penguins_example %<>%
  mutate(year = 2007)

glimpse(penguins_example)
Rows: 344
Columns: 18
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ nitrogen          <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ carbon            <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

dplyr::mutate()

To add a column

penguins_example %<>%
  mutate(year = "unknown")

glimpse(penguins_example)
Rows: 344
Columns: 18
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ nitrogen          <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ carbon            <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
$ year              <chr> "unknown", "unknown", "unknown", "unknown", "unknown…

lubridate::year()

Lubridate is great for working with dates and times. Here, we want to extract year from a column of date

penguins_example %<>%
  mutate(year = lubridate::year(date_egg))

glimpse(penguins_example)
Rows: 344
Columns: 18
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ nitrogen          <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ carbon            <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

stringr::str_to_lower()

Change the contents of a column to lower case

penguins_example %<>%
  mutate(sex = str_to_lower(sex))

glimpse(penguins_example)
Rows: 344
Columns: 18
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ nitrogen          <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ carbon            <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

dplyr::select()

Keep certain columns

penguins_tiny <- penguins_example %>%
  select(individual_id, sex, body_mass_g)

glimpse(penguins_tiny)
Rows: 344
Columns: 3
$ individual_id <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A1", …
$ sex           <chr> "male", "female", "female", NA, "female", "male", "femal…
$ body_mass_g   <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250…

Remove certain columns

penguins_tiny2 <- penguins_example %>%
  select(!c(study_name, sample_number, region, comments))

glimpse(penguins_tiny2)
Rows: 344
Columns: 14
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ nitrogen          <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ carbon            <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

dplyr::filter()

Filter the rows of a dataframe by different criteria

individual_id sex body_mass_g
N1A1 male 3750
N1A2 female 3800
N2A1 female 3250
N2A2 NA NA
N3A1 female 3450
N3A2 male 3650
N4A1 female 3625
N4A2 male 4675
N5A1 NA 3475
N5A2 NA 4250
penguins_tiny %<>%
  filter(!is.na(sex))
individual_id sex body_mass_g
N1A1 male 3750
N1A2 female 3800
N2A1 female 3250
N3A1 female 3450
N3A2 male 3650
N4A1 female 3625
N4A2 male 4675

tidyr::pivot_longer()

Transform data from wide to long (each observation in a single row)

penguins_isotopes_tiny <- penguins_example %>% 
  select(species, individual_id, carbon, nitrogen)
species individual_id carbon nitrogen
Adelie N1A2 -24.69454 8.94956
Adelie N2A1 -25.33302 8.36821
Chinstrap N61A1 -24.30229 9.03935
Chinstrap N61A2 -24.23592 8.92069
Gentoo N31A1 -25.51390 7.99300
Gentoo N31A2 -25.39369 8.14756
penguins_isotopes_tiny %<>% 
  pivot_longer(cols = carbon:nitrogen, 
               names_to = "isotope", 
               values_to = "value") 

dplyr::group_by() & dplyr::summarise()

Group by organises the data into groups, based on column values. This is useful for performing functions on each group, or summarising the data to calculate summary statistics

species individual_id isotope value
Adelie N1A2 carbon -24.69454
Adelie N1A2 nitrogen 8.94956
Adelie N2A1 carbon -25.33302
Adelie N2A1 nitrogen 8.36821
Chinstrap N61A1 carbon -24.30229
Chinstrap N61A1 nitrogen 9.03935
Chinstrap N61A2 carbon -24.23592
Chinstrap N61A2 nitrogen 8.92069
Gentoo N31A1 carbon -25.51390
Gentoo N31A1 nitrogen 7.99300
Gentoo N31A2 carbon -25.39369
Gentoo N31A2 nitrogen 8.14756
penguins_isotopes_tiny %<>%
  group_by(species, isotope) %>% 
  summarize(mean = mean(value),
            sd =  sd(value)) 
species isotope mean sd
Adelie carbon -25.013780 0.4514735
Adelie nitrogen 8.658885 0.4110765
Chinstrap carbon -24.269105 0.0469307
Chinstrap nitrogen 8.980020 0.0839053
Gentoo carbon -25.453795 0.0850013
Gentoo nitrogen 8.070280 0.1092904

tidyr::pivot_wider()

Transform data from long to wide (observations across multiple columns)

penguins_isotopes_tiny %<>%
  pivot_wider(id_cols = species, 
              names_from = isotope, 
              values_from=c(mean, sd)) 
species mean_carbon mean_nitrogen sd_carbon sd_nitrogen
Adelie -25.01378 8.658885 0.4514735 0.4110765
Chinstrap -24.26910 8.980020 0.0469307 0.0839053
Gentoo -25.45379 8.070280 0.0850013 0.1092904

dplyr joins

There are lots of different ways to join multiple dataframes, using shared columns (e.g., an ID column). Here’s an example using left_join(), which keeps all values in the left hand dataframe - in this case penguins_example.

penguins_example %<>%
  left_join(penguins_summary_isotopes,
            by = join_by(species))

glimpse(penguins_example)
Rows: 333
Columns: 22
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 5, 6, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20,…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N3A1", "N3A2", "N4A1", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g       <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex               <chr> "male", "female", "female", "female", "male", "femal…
$ nitrogen          <dbl> NA, 8.94956, 8.36821, 8.76651, 8.66496, 9.18718, 9.4…
$ carbon            <dbl> NA, -24.69454, -25.33302, -25.32426, -25.29805, -25.…
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, NA, NA, "N…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ mean_carbon       <dbl> -25.81356, -25.81356, -25.81356, -25.81356, -25.8135…
$ mean_nitrogen     <dbl> 8.859398, 8.859398, 8.859398, 8.859398, 8.859398, 8.…
$ sd_carbon         <dbl> 0.5871106, 0.5871106, 0.5871106, 0.5871106, 0.587110…
$ sd_nitrogen       <dbl> 0.4282233, 0.4282233, 0.4282233, 0.4282233, 0.428223…

Summary

We have shown a range of functions in R for changing the structure of your data. Hopefully, this has given an insight of how powerful the tidyverse is for efficient data manipulation. Importantly all of these processes are reproducible and limit the chances of us making errors in our workflow.

If we’ve missed a function that you find useful, we’d love to hear what it is. Leave us a comment below!

Additional Resources

  • Each tidyverse package has great documentation and useful cheatsheets, available via the tidyverse website
  • The book R for Data Science, which is available online for free, explains how to get your data into R, get it into the most useful structure, transform and visualize it.
  • For helpful visualisations of how tidyverse functions transform data structures, check out Garrick Aden-Buie’s tidyexplain
  • Liza Wood, now part of the Exeter Data Analytics Hub, has a great tutorial on the tidyverse on her website

Acknowledgements

We did not create this content alone! Inspiration, tips, and resources have been borrowed from multiple sources.