Standard staff about google sheets
Google sheets are similar to MS Excel, but with much less functionality. Similar, but the same.
They can be easily used for typical staff like data entry, verification, etc.
They can be shared with other users immediately, that’s their biggest advantage.
Content in google sheets can be imported/exported in various typical formats like MS Excel (.xlsx), LibreOffice (.ods), comma separated values (.csv, .tsv) etc.
Every google sheet is uniquely identified by its key: a 44 character unique key identifies every google document
Packages
Here we will need these packages:
library(googledrive)
library(googlesheets4)
library(tidyverse)
library(lubridate)
library(sqldf)
library(writexl)
If they are not present in your system, just install them as follows:
install.packages('tidyverse', dep = TRUE, repos = 'http://cran.rstudio.com')
install.packages('lubridate', dep = TRUE, repos = 'http://cran.rstudio.com')
install.packages('writexl', dep = TRUE, repos = 'http://cran.rstudio.com')
install.packages('sqldf', dep = TRUE, repos = 'http://cran.rstudio.com')
devtools::install_github('tidyverse/googledrive')
devtools::install_github('tidyverse/googlesheets4')
Google drive
Needless to say, in order to work with google drive you basically need a google account. Probably something like “as@gmail.com”. However you can register to google with other email providers as well, I use my university account email: astavrak@uoi.gr.
Here is a picture of (initially) empty folder: eduR.
If you look closer on the URL of your browser, you will the ID of this folder:
This is : 1_u-MfVg-pf0Im5nalvrBlSuTpikrvzUw.
If you know this ID, or you can identify it otherwise, then you can start bussiness.
Google drive folder contents.
Every time you need to work with google you have to get authorised to do so. Execute:
And follow the instructions in your screen. You will be directed to google servises in order to login with your username and password.
If you save this action results, then you can call drive authorization like this:
and your program will connect to google drive without asking for password.
So, what are the contents of the eduR folder we can do it like this:
Create a new file in google drive
Let’s create (with R) our first googlesheet:
We open the google sheet and seeits unique ID. Otherwise we can find it as follows:
drive_find(pattern = "my-first-google-sheet")
# A tibble: 1 x 3
name id drive_resource
* <chr> <chr> <list>
1 my-first-google-sheet 1VOVoQ_F8SE6XA4t6ctuzV6-VZ6iPK3qxLo2PSDH0LWE <named list [33]>
A lot of information is contained in this result. For the moment we are interested only in the ID. We can store it for further use:
Variable ss contains now the value of 1VOVoQ_F8SE6XA4t6ctuzV6-VZ6iPK3qxLo2PSDH0LWE which is the unique ID of this google sheet.
Alternatively we can identify it’s value manually:
Typically we copy/paste this value from browser’s URL.
Modify the content
And now let’s add something to ths spreasheet, put some random numbers in it:
## Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## https://gargle.r-lib.org/articles/non-interactive-auth.html
## The googlesheets4 package is using a cached token for astavrak@uoi.gr.
Yes, if you open the spreadsheet in your browser you will see that it has been modified:
Read data
The most important work now is to read the data. Usually, we create once the file and we share it once (or a couple of times). But we need to read the updated data quite frequently. In this case we do not need anymore the googledrive package. We turmn to googlesheets4.
Here is how (easily) can read our data:
Now the variable df contains the tabular data stored in the google sheet. Very easily we can proccedd to some kind of summary statistics or analysis. For example:
## n x y
## Min. : 1.00 Min. :0.01507 Min. :-2.89709
## 1st Qu.: 25.75 1st Qu.:0.29999 1st Qu.:-0.62686
## Median : 50.50 Median :0.47240 Median : 0.02935
## Mean : 50.50 Mean :0.49117 Mean :-0.09814
## 3rd Qu.: 75.25 3rd Qu.:0.71916 3rd Qu.: 0.60411
## Max. :100.00 Max. :0.99967 Max. : 1.56875
One way or another, data have been transfered from google to R. We can do anything we like with them.
The previous command always reads the firs sheet in the document. If we have more than one sheet we may need to identify the sheet we want. For example:
## # A tibble: 100 x 3
## n x y
## <dbl> <dbl> <dbl>
## 1 1 0.312 0.425
## 2 2 0.561 0.646
## 3 3 0.558 1.39
## 4 4 0.366 0.332
## 5 5 0.110 -1.13
## 6 6 0.130 -0.0383
## 7 7 0.359 -1.13
## 8 8 0.839 -0.0175
## 9 9 0.0907 -0.142
## 10 10 0.220 -1.10
## # … with 90 more rows
Fursthermore, the program will try to guess the data types of the colums. We may specify the data types if we want to:
## # A tibble: 100 x 3
## n x y
## <dbl> <dbl> <dbl>
## 1 1 0.312 0.425
## 2 2 0.561 0.646
## 3 3 0.558 1.39
## 4 4 0.366 0.332
## 5 5 0.110 -1.13
## 6 6 0.130 -0.0383
## 7 7 0.359 -1.13
## 8 8 0.839 -0.0175
## 9 9 0.0907 -0.142
## 10 10 0.220 -1.10
## # … with 90 more rows
Here, n stands for numeric column. So nnn means that we get three columns of data with numeric values. We can use c for character, d for dates and i for integers.
It is a good practise to identify explicitely the sheet and the data types.
Database company in google sheets
Download the data
If not done before, open your googlesheets connection:
And download the data from the sheet: 1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg
departments <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg",
sheet = "departments",
col_types = "ici")
employees <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg",
sheet = "employees",
col_types = "iccinD")
projects <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg",
sheet = "projects",
col_types = "icnDDn")
workson <- sheets_read(ss = "1l8prMa5WQpv-Y5ujfml3bzbTNafShTkA8v1Hrmy3Jyg",
sheet = "workson",
col_types = "ii")
Sometimes the automatic conversion to datatype date is not easy or even possible. In such cases we can easily obtain the desired result if we simply read it as text and convert manually the column to date. Function ymd from lubridate package is very handy here:
Execute simple SQL queries
Load sqldf library:
List all details of employees of department 2:
## empid firstname lastname depid salary hiredate
## 1 153 Μαρία Αλεβιζάτου 2 1321.92 2001-05-15
## 2 243 Δέσποινα Παπαδοπούλου 2 1609.52 1999-03-05
## 3 419 Πέτρος Αρβανιτάκης 2 1323.80 2000-07-17
## 4 503 Μαριλένα Κρέσπα 2 1105.04 2001-03-07
Or, in dplyr:
## # A tibble: 4 x 6
## empid firstname lastname depid salary hiredate
## <int> <chr> <chr> <int> <dbl> <date>
## 1 153 Μαρία Αλεβιζάτου 2 1322. 2001-05-15
## 2 243 Δέσποινα Παπαδοπούλου 2 1610. 1999-03-05
## 3 419 Πέτρος Αρβανιτάκης 2 1324. 2000-07-17
## 4 503 Μαριλένα Κρέσπα 2 1105. 2001-03-07
List employee name and hiredate of department 2:
## firstname lastname hiredate
## 1 Μαρία Αλεβιζάτου 2001-05-15
## 2 Δέσποινα Παπαδοπούλου 1999-03-05
## 3 Πέτρος Αρβανιτάκης 2000-07-17
## 4 Μαριλένα Κρέσπα 2001-03-07
Or, in dplyr:
## # A tibble: 4 x 3
## firstname lastname hiredate
## <chr> <chr> <date>
## 1 Μαρία Αλεβιζάτου 2001-05-15
## 2 Δέσποινα Παπαδοπούλου 1999-03-05
## 3 Πέτρος Αρβανιτάκης 2000-07-17
## 4 Μαριλένα Κρέσπα 2001-03-07
List employee name and hiredate of department 2:
## firstname lastname hiredate
## 1 Μαρία Αλεβιζάτου 2001-05-15
## 2 Δέσποινα Παπαδοπούλου 1999-03-05
## 3 Πέτρος Αρβανιτάκης 2000-07-17
## 4 Μαριλένα Κρέσπα 2001-03-07
List names and project codes of employees of department 2:
sqldf("SELECT e.firstname, e.lastname, w.proid
FROM employees e INNER JOIN workson w ON e.empid = w.empid
WHERE e.depid = 2")
## firstname lastname proid
## 1 Μαρία Αλεβιζάτου 14
## 2 Μαρία Αλεβιζάτου 38
## 3 Δέσποινα Παπαδοπούλου 21
## 4 Δέσποινα Παπαδοπούλου 38
## 5 Πέτρος Αρβανιτάκης 12
## 6 Πέτρος Αρβανιτάκης 21
## 7 Πέτρος Αρβανιτάκης 43
## 8 Μαριλένα Κρέσπα 21
## 9 Μαριλένα Κρέσπα 38