Tidying Multi-Header Excel Data with R

Problem

If you’re lucky enough to work with a lot of data that arrives in the form of excel tables, you will most likely be familiar with data in the format below.

This format makes it easier for a human eye to read the data, but seriously hinders a computers ability to analyse it. So, if you prefer analysing and visualising your data with a programming language rather than staring at it in Excel we’re going to need to knock it into a tidier form before we can let the powers of R loose on it.

That means one column for each variable and one row for each observation of said variables. For this sample data set I would define that as columns for:

  • ID
  • DATE
  • WEEKDAY
  • TIME
  • GROUP
  • METRIC1
  • METRIC2
  • METRIC3

Currently we have the metrics for each group spread out over 3 chunks with the group indicated by a top header (only in the first column of that group) and the metric headers below. Adding fuel to the fire, in the first chunk of columns you’ll notice the ID and WEEKDAY columns are floating up in row one with the others down in row 2 - nice.


Solution

A data frame in R can only take one set of column names that must be unique from each other so we need a method of reducing these headers into a single row of unique names by merging the group and metric information together with a separator. So group 1’s metric columns will be named GROUP1_METRIC1 GROUP1_METRIC2 GROUP1_METRIC3 and so forth for each group.

So let’s give it a go. These are the packages required:

library(tidyverse) 
library(readxl)
library(hms) # to convert datetimes to hms class
library(zoo) # for the super useful na.locf0 function

If we read all of the data with read_excel what do we get?

path <- "../../static/my_data.xlsx"
raw_data <- read_excel(path)
raw_data
## # A tibble: 11 x 13
##       ID X__1  WEEKDAY  X__2  GROUP1 X__3  X__4  GROUP2 X__5  X__6  GROUP3
##    <dbl> <chr> <chr>    <chr> <chr>  <chr> <chr> <chr>  <chr> <chr> <chr> 
##  1 NA    DATE  <NA>     TIME  METRI… METR… METR… METRI… METR… METR… METRI…
##  2  1.00 43466 Tuesday  0.5   0.467… 0.80… 0.30… 0.475… 5.98… 0.78… 0.843…
##  3  2.00 43467 Wednesd… 0.54… 0.822… 0.51… 0.27… 1.791… 0.80… 0.21… 0.469…
##  4  3.00 43468 Thursday 0.58… 0.891… 0.63… 0.51… 0.794… 0.74… 0.35… 0.281…
##  5  4.00 43469 Friday   0.625 0.247… 0.13… 0.93… 0.951… 0.35… 0.86… 0.480…
##  6  5.00 43470 Saturday 0.66… 0.286… 0.35… 0.47… 0.364… 6.03… 7.35… 6.769…
##  7  6.00 43471 Sunday   0.70… 0.416… 0.98… 0.64… 0.667… 0.48… 0.76… 0.927…
##  8  7.00 43472 Monday   0.75  0.401… 0.58… 5.40… 0.608… 0.89… 0.39… 9.688…
##  9  8.00 43473 Tuesday  0.79… 0.643… 0.87… 0.67… 0.602… 0.77… 0.52… 0.659…
## 10  9.00 43474 Wednesd… 0.83… 0.657… 0.39… 0.37… 0.870… 0.30… 0.51… 0.898…
## 11 10.0  43475 Thursday 0.875 0.749… 0.47… 0.88… 0.820… 0.10… 2.90… 0.423…
## # ... with 2 more variables: X__7 <chr>, X__8 <chr>

a mess.

The first row has been set as the column names with blank names assigned as X__* by readxl. Also every column has a character class because the second row headers have been considered as data.


Building our own column names with code

So before attempting to actually read in the data lets try to form the single set of column names we need. To do this, I’m going to first get the headers in row 1 as a character vector and replace the missing column name pattern X__* with an NA.

Like so:

head1 <- read_excel(path, col_names = TRUE) %>% 
  names() %>% 
  str_replace("X__\\d*", NA_character_)

head1
##  [1] "ID"      NA        "WEEKDAY" NA        "GROUP1"  NA        NA       
##  [8] "GROUP2"  NA        NA        "GROUP3"  NA        NA

To do the same for the headers in row 2 we do the same thing but add a skip = 1 argument to read_excel which tells it to ignore the first row entirely.

head2 <- read_excel(path, skip = 1, col_names = TRUE) %>% 
  names() %>% 
  str_replace("X__\\d*", NA_character_)

head2
##  [1] NA           "DATE"       NA           "TIME"       "METRIC1"   
##  [6] "METRIC2"    "METRIC3"    "METRIC1__1" "METRIC2__1" "METRIC3__1"
## [11] "METRIC1__2" "METRIC2__2" "METRIC3__2"

Because we had non-unique column names in this case read_excel has assigned some __* patterns to the metric names which we also want to get rid of.

head2 <- head2 %>% str_remove("__\\d*")
head2
##  [1] NA        "DATE"    NA        "TIME"    "METRIC1" "METRIC2" "METRIC3"
##  [8] "METRIC1" "METRIC2" "METRIC3" "METRIC1" "METRIC2" "METRIC3"

To deal with the rogue ID and WEEKDAY columns that were in row 1 with nothing below in row 2, we want to move them from head1 into head2 then replace with NA in head1

ncols <- length(head1)

for (n in 1:ncols) {
  if (is.na(head2[n])) {
    head2[n] <- head1[n]
    head1[n] <- NA_character_
  }
}

head1
##  [1] NA       NA       NA       NA       "GROUP1" NA       NA      
##  [8] "GROUP2" NA       NA       "GROUP3" NA       NA
head2
##  [1] "ID"      "DATE"    "WEEKDAY" "TIME"    "METRIC1" "METRIC2" "METRIC3"
##  [8] "METRIC1" "METRIC2" "METRIC3" "METRIC1" "METRIC2" "METRIC3"

Getting there! Next we need to fill out each group name so that it occurs 3 times to match each metric, rather than just once. To do this we can convert the vector to a 1 column tibble or data frame then use the zoo::na.locf0() that replaces NAs with nearest non-NA above it, then pull it back into a vector.

head1 <- tibble(head1) %>% 
  mutate(head1 = zoo::na.locf0(head1)) %>% 
  pull()

head1
##  [1] NA       NA       NA       NA       "GROUP1" "GROUP1" "GROUP1"
##  [8] "GROUP2" "GROUP2" "GROUP2" "GROUP3" "GROUP3" "GROUP3"

We’re finally at the stage where we can combine the headers. Using map_chr we can ask R to merge headers with a _ separator only if there is a non-NA in head1 and head2, otherwise we only need the head2 name.

headers <- map_chr(1:ncols, ~ {
  case_when(
    !is.na(head1[.x]) & !is.na(head2[.x]) ~ paste(head1[.x], head2[.x], sep = "_"),
    TRUE ~ head2[.x]
  )
})

headers
##  [1] "ID"             "DATE"           "WEEKDAY"        "TIME"          
##  [5] "GROUP1_METRIC1" "GROUP1_METRIC2" "GROUP1_METRIC3" "GROUP2_METRIC1"
##  [9] "GROUP2_METRIC2" "GROUP2_METRIC3" "GROUP3_METRIC1" "GROUP3_METRIC2"
## [13] "GROUP3_METRIC3"

🙌


Import the data with new headers

Now we can read in the data, ignore both header rows and supply our own column names to read_excel

raw_data <- read_excel(path, skip = 2, col_names = headers)

raw_data
## # A tibble: 10 x 13
##       ID DATE                WEEKDAY   TIME                GROUP1_METRIC1
##    <dbl> <dttm>              <chr>     <dttm>                       <dbl>
##  1  1.00 2019-01-01 00:00:00 Tuesday   1899-12-31 12:00:00          0.468
##  2  2.00 2019-01-02 00:00:00 Wednesday 1899-12-31 13:00:00          0.822
##  3  3.00 2019-01-03 00:00:00 Thursday  1899-12-31 14:00:00          0.891
##  4  4.00 2019-01-04 00:00:00 Friday    1899-12-31 15:00:00          0.248
##  5  5.00 2019-01-05 00:00:00 Saturday  1899-12-31 16:00:00          0.286
##  6  6.00 2019-01-06 00:00:00 Sunday    1899-12-31 17:00:00          0.416
##  7  7.00 2019-01-07 00:00:00 Monday    1899-12-31 18:00:00          0.402
##  8  8.00 2019-01-08 00:00:00 Tuesday   1899-12-31 19:00:00          0.643
##  9  9.00 2019-01-09 00:00:00 Wednesday 1899-12-31 20:00:00          0.657
## 10 10.0  2019-01-10 00:00:00 Thursday  1899-12-31 21:00:00          0.750
## # ... with 8 more variables: GROUP1_METRIC2 <dbl>, GROUP1_METRIC3 <dbl>,
## #   GROUP2_METRIC1 <dbl>, GROUP2_METRIC2 <dbl>, GROUP2_METRIC3 <dbl>,
## #   GROUP3_METRIC1 <dbl>, GROUP3_METRIC2 <dbl>, GROUP3_METRIC3 <dbl>

Tidy the data

Now to tidy the data into our desired columns listed above, we use the gather, separate, spread method.

# get index of first column that contains metric data
mcol1 <- which(str_detect(headers, "METRIC")) %>% first()

tidy_data <- raw_data %>% 
  gather(KEY, VALUE, mcol1:ncols) %>% 
  separate(KEY, into = c("GROUP", "METRIC"), sep = "_") %>% 
  spread(METRIC, VALUE) %>% 
  mutate(DATE = as.Date(DATE), TIME = as.hms(TIME)) # reclass for nicer printing

tidy_data
## # A tibble: 30 x 8
##       ID DATE       WEEKDAY   TIME   GROUP  METRIC1 METRIC2 METRIC3
##    <dbl> <date>     <chr>     <time> <chr>    <dbl>   <dbl>   <dbl>
##  1  1.00 2019-01-01 Tuesday   12:00  GROUP1  0.468   0.810    0.310
##  2  1.00 2019-01-01 Tuesday   12:00  GROUP2  0.476   0.0598   0.781
##  3  1.00 2019-01-01 Tuesday   12:00  GROUP3  0.844   0.423    0.409
##  4  2.00 2019-01-02 Wednesday 13:00  GROUP1  0.822   0.511    0.277
##  5  2.00 2019-01-02 Wednesday 13:00  GROUP2  0.0179  0.802    0.212
##  6  2.00 2019-01-02 Wednesday 13:00  GROUP3  0.470   0.124    0.278
##  7  3.00 2019-01-03 Thursday  14:00  GROUP1  0.891   0.638    0.519
##  8  3.00 2019-01-03 Thursday  14:00  GROUP2  0.795   0.749    0.360
##  9  3.00 2019-01-03 Thursday  14:00  GROUP3  0.281   0.121    0.354
## 10  4.00 2019-01-04 Friday    15:00  GROUP1  0.248   0.131    0.935
## # ... with 20 more rows

Go berserk with your tidy data

Now we can do cool stuff like this…

# average of each metric by group
tidy_data %>% 
  group_by(GROUP) %>% 
  summarise_at(vars(contains("METRIC")), mean) 
## # A tibble: 3 x 4
##   GROUP  METRIC1 METRIC2 METRIC3
##   <chr>    <dbl>   <dbl>   <dbl>
## 1 GROUP1   0.558   0.576   0.514
## 2 GROUP2   0.617   0.459   0.452
## 3 GROUP3   0.515   0.313   0.494

and this…

ggplot(tidy_data, aes(x = METRIC1, y = METRIC2, colour = GROUP)) +
  geom_point(aes(size = METRIC3))


That’s all for now. Give me a shout if you have any questions or suggested improvements.

hasta luego!


Paul Campbell
R | Data | Visualisation
comments powered by Disqus