Working with google sheets and R

Standard staff about google sheets

Google sheets are similar to MS Excel, but with much less functionality. Similar, but the same.
They can be easily used for typical staff like data entry, verification, etc.
They can be shared with other users immediately, that’s their biggest advantage.
Content in google sheets can be imported/exported in various typical formats like MS Excel (.xlsx), LibreOffice (.ods), comma separated values (.csv, .tsv) etc.
Every google sheet is uniquely identified by its key: a 44 character unique key identifies every google document

Packages

Here we will need these packages:

library(googledrive)
library(googlesheets4)
library(tidyverse)
library(lubridate)
library(sqldf)
library(writexl)

If they are not present in your system, just install them as follows:

install.packages('tidyverse', dep = TRUE, repos = 'http://cran.rstudio.com')
install.packages('lubridate', dep = TRUE, repos = 'http://cran.rstudio.com')
install.packages('writexl', dep = TRUE, repos = 'http://cran.rstudio.com')
install.packages('sqldf', dep = TRUE, repos = 'http://cran.rstudio.com')
devtools::install_github('tidyverse/googledrive')
devtools::install_github('tidyverse/googlesheets4')

Google drive

Needless to say, in order to work with google drive you basically need a google account. Probably something like “as@gmail.com”. However you can register to google with other email providers as well, I use my university account email: astavrak@uoi.gr.

Here is a picture of (initially) empty folder: eduR. A folder as appers in Google Drive

If you look closer on the URL of your browser, you will the ID of this folder: Every folder has a unique key ID

This is : 1_u-MfVg-pf0Im5nalvrBlSuTpikrvzUw.

If you know this ID, or you can identify it otherwise, then you can start bussiness.

Google drive folder contents.

Every time you need to work with google you have to get authorised to do so. Execute:

drive_auth()

And follow the instructions in your screen. You will be directed to google servises in order to login with your username and password.

If you save this action results, then you can call drive authorization like this:

drive_auth(email = "astavrak@uoi.gr")

and your program will connect to google drive without asking for password.

So, what are the contents of the eduR folder we can do it like this:

drive_ls(path = "eduR")

Create a new file in google drive

Let’s create (with R) our first googlesheet:

drive_create(name = "my-first-google-sheet", 
             path = "eduR", 
             type = "spreadsheet", 
             overwrite = TRUE)

We open the google sheet and seeits unique ID. Otherwise we can find it as follows:

drive_find(pattern = "my-first-google-sheet")

# A tibble: 1 x 3
  name                  id                                           drive_resource   
* <chr>                 <chr>                                        <list>           
1 my-first-google-sheet 1VOVoQ_F8SE6XA4t6ctuzV6-VZ6iPK3qxLo2PSDH0LWE <named list [33]>

A lot of information is contained in this result. For the moment we are interested only in the ID. We can store it for further use:

sheet <- drive_find(pattern = "my-first-google-sheet")
ss <- sheet$id

Variable ss contains now the value of 1VOVoQ_F8SE6XA4t6ctuzV6-VZ6iPK3qxLo2PSDH0LWE which is the unique ID of this google sheet.

Alternatively we can identify it’s value manually:

ss <- "1VOVoQ_F8SE6XA4t6ctuzV6-VZ6iPK3qxLo2PSDH0LWE"

Typically we copy/paste this value from browser’s URL.

Modify the content

And now let’s add something to ths spreasheet, put some random numbers in it:

df <- tibble(n = 1:100, x = runif(100), y = rnorm(100))
sheets_edit(ss, data = df)

## Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## https://gargle.r-lib.org/articles/non-interactive-auth.html
## The googlesheets4 package is using a cached token for astavrak@uoi.gr.

Yes, if you open the spreadsheet in your browser you will see that it has been modified: Modified googlesheet with R

Share the file

Google drive documents can be easily shared with collaborators (just press the Share button). If you need to do with code in R, it’s also very simple:

drive_share(file = ss, 
            type = "user", 
            role = "writer", 
            emailAddress = "as@gmail.com", 
            emailMessage = "Hey, look at my file!")

Adding a message is optional, but it helps the reciepient of the email.

Read data

The most important work now is to read the data. Usually, we create once the file and we share it once (or a couple of times). But we need to read the updated data quite frequently. In this case we do not need anymore the googledrive package. We turmn to googlesheets4.

Here is how (easily) can read our data:

df <- sheets_read(ss)

Now the variable df contains the tabular data stored in the google sheet. Very easily we can proccedd to some kind of summary statistics or analysis. For example:

summary(df)

##        n                x                 y           
##  Min.   :  1.00   Min.   :0.01507   Min.   :-2.89709  
##  1st Qu.: 25.75   1st Qu.:0.29999   1st Qu.:-0.62686  
##  Median : 50.50   Median :0.47240   Median : 0.02935  
##  Mean   : 50.50   Mean   :0.49117   Mean   :-0.09814  
##  3rd Qu.: 75.25   3rd Qu.:0.71916   3rd Qu.: 0.60411  
##  Max.   :100.00   Max.   :0.99967   Max.   : 1.56875

One way or another, data have been transfered from google to R. We can do anything we like with them.

The previous command always reads the firs sheet in the document. If we have more than one sheet we may need to identify the sheet we want. For example:

sheets_read(ss, sheet = "Sheet1")

## # A tibble: 100 x 3
##        n      x       y
##    <dbl>  <dbl>   <dbl>
##  1     1 0.312   0.425 
##  2     2 0.561   0.646 
##  3     3 0.558   1.39  
##  4     4 0.366   0.332 
##  5     5 0.110  -1.13  
##  6     6 0.130  -0.0383
##  7     7 0.359  -1.13  
##  8     8 0.839  -0.0175
##  9     9 0.0907 -0.142 
## 10    10 0.220  -1.10  
## # … with 90 more rows

Fursthermore, the program will try to guess the data types of the colums. We may specify the data types if we want to:

sheets_read(ss, 
            sheet = "Sheet1",
            col_types = "nnn")

## # A tibble: 100 x 3
##        n      x       y
##    <dbl>  <dbl>   <dbl>
##  1     1 0.312   0.425 
##  2     2 0.561   0.646 
##  3     3 0.558   1.39  
##  4     4 0.366   0.332 
##  5     5 0.110  -1.13  
##  6     6 0.130  -0.0383
##  7     7 0.359  -1.13  
##  8     8 0.839  -0.0175
##  9     9 0.0907 -0.142 
## 10    10 0.220  -1.10  
## # … with 90 more rows

Here, n stands for numeric column. So nnn means that we get three columns of data with numeric values. We can use c for character, d for dates and i for integers.

It is a good practise to identify explicitely the sheet and the data types.

Database company in google sheets

Download the data

If not done before, open your googlesheets connection:

library(googlesheets4)
sheets_auth(email = "astavrak@uoi.gr")

And download the data from the sheet: 1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg

departments <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg", 
            sheet = "departments",
            col_types = "ici")

employees <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg", 
            sheet = "employees", 
            col_types = "iccinD") 

projects <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg", 
            sheet = "projects",
            col_types = "icnDDn")

workson <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg", 
            sheet = "workson", 
            col_types = "ii")

Sometimes the automatic conversion to datatype date is not easy or even possible. In such cases we can easily obtain the desired result if we simply read it as text and convert manually the column to date. Function ymd from lubridate package is very handy here:

employees <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg", 
            sheet = "employees", 
            col_types = "iccinc") %>% 
  mutate(hiredate = ymd(hiredate))

Execute simple SQL queries

Load sqldf library:

library(sqldf)

List all details of employees of department 2:

sqldf("SELECT * FROM employees WHERE depid = 2")

##   empid firstname     lastname depid  salary   hiredate
## 1   153     Μαρία   Αλεβιζάτου     2 1321.92 2001-05-15
## 2   243  Δέσποινα Παπαδοπούλου     2 1609.52 1999-03-05
## 3   419    Πέτρος  Αρβανιτάκης     2 1323.80 2000-07-17
## 4   503  Μαριλένα       Κρέσπα     2 1105.04 2001-03-07

Or, in dplyr:

employees %>% 
  filter(depid == 2)

## # A tibble: 4 x 6
##   empid firstname lastname     depid salary hiredate  
##   <int> <chr>     <chr>        <int>  <dbl> <date>    
## 1   153 Μαρία     Αλεβιζάτου       2  1322. 2001-05-15
## 2   243 Δέσποινα  Παπαδοπούλου     2  1610. 1999-03-05
## 3   419 Πέτρος    Αρβανιτάκης      2  1324. 2000-07-17
## 4   503 Μαριλένα  Κρέσπα           2  1105. 2001-03-07

List employee name and hiredate of department 2:

sqldf("SELECT firstname, lastname, hiredate FROM employees WHERE depid = 2")

##   firstname     lastname   hiredate
## 1     Μαρία   Αλεβιζάτου 2001-05-15
## 2  Δέσποινα Παπαδοπούλου 1999-03-05
## 3    Πέτρος  Αρβανιτάκης 2000-07-17
## 4  Μαριλένα       Κρέσπα 2001-03-07