class: center, middle, inverse, title-slide # Welcome! ## Data Science for Cognitive (Neuro)Science ### Ina Bornkessel-Schlesewsky ### Jan 2021 (updated: 2021-01-17) --- class: inverse, mline, center, middle # What is Data Science? --- # What is Data Science? .pull-left[ <br> > Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. <br> .font70[ Wickham, H. & Grolemund, G. (2017) R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media. Henceforth: *R4DS* ] freely available at https://r4ds.had.co.nz ] .pull-right[ <img src="images/r4ds.png" width="85%" style="display: block; margin: auto 0 auto auto;" /> ] --- # What is Data Science? <img src="images/Data_scientist_Venn_diagram.png" width="60%" style="display: block; margin: auto auto auto 0;" /> .pull-right[ .font60[ StackExchange Data Science user Stephan Kolassa [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0) via [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Data_scientist_Venn_diagram.png) ] ] --- # What is Data Science? .pull-left[ ### The "sexy new job" .font80[ > 'Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”' <br> > 'If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.' ] ] <img src="images/data_science_hbr.png" width="50%" style="display: block; margin: auto 0 auto auto;" /> <img src="images/hbr_cover.png" width="40%" style="display: block; margin: auto 0 auto auto;" /> --- # Data Science ... ## ... why do we need it for cognitive neuroscience and psychology? <br> .font150[ Quite simply, because we're also interested in turning "raw data into understanding, insight, and knowledge", as well as in communicating our results! ] --- # What you will learn in this course <br> * how to gain insights from data using contemporary computational tools * basic programming skills in an open source programming language (i.e. R) * how to produce reproducible reports (good for science and good for you!) * how to use online repositories such as GitHub or the Open Science Framework to share data and code You will also develop an understanding of how these tools help to foster open science, reproducible research and thus the ethical treatment of data. While critical for research in cognitive (neuro-)science, the data science skills acquired in this course generalise readily to other domains (think sexy jobs!) --- # So is this just another stats course ... .pull-left[ .font150[ Apart from the fact that we're using R? ] ] -- .pull-right[  .font60[ "Oh no you didn't" gif by *happydog* from https://giphy.com ] ] --- # Not just another stats course ### Our focus will be on * understanding data rather than statistical tests per se (though they will come up) * philosophy / workflow rather than "results" * (moral of the story: it's not just about statistical significance!) ### You will be introduced to a set of tools and workflow that * foster good practices in dealing with data (i.e. we try to draw the best insights we can from a dataset) * foster open science (i.e. we share our data and "show our work", which is good for science and for sharing knowledge) * are economical and reproducible (i.e. we avoid doing stuff by hand) --- # Speaking of workflow <br> <img src="images/r4ds_workflow.png" width="100%" /> .font80[ from *R4DS* ] --- # Speaking of workflow .pull-left[ <img src="images/r4ds_workflow.png" width="100%" style="display: block; margin: auto auto auto 0;" /> * **import** data (into R) * **tidy** data - bring it into a consistent format that can be used for multiple purposes (each column = variable; each row = observation) - lets you focus on understanding the data rather than which format you need ] .pull-right[ <br> * **transform** data * e.g. focus on observations of interest (such as those from a particular location), create new variables (such as speed from distance and time), compute summary statistics * **visualise** data * essential for understanding * **model** data * use (statistical) models to answer your questions about the data * **communicate** insights ] --- class: inverse, mline, center, middle # Let's give it a go! --- # A highly topical issue ... .pull-left[ <img src="images/covid_confirmed_linear.png" width="150%" style="display: block; margin: auto auto auto 0;" /> ] .pull-right[ <img src="images/covid_confirmed_log.png" width="150%" style="display: block; margin: auto auto auto 0;" /> ] .font80[ from https://www.abc.net.au/news/2020-05-13/coronavirus-numbers-worldwide-data-tracking-charts/12107500 retrieved on 2021-01-08 ] --- class: middle ## These figures must be really tricky to generate ... right? --- # Enter R and RStudio <br> * R is a programming language for statistical computing (but it can also be used for other things) * RStudio is an integrated development environment (IDE) for R, which is a fancy way of saying that it provides a convenient platform within which we can use R <img src="images/RStudio.png" width="80%" /> --- # RStudio Cloud * For (the first part of) this course, we will be using RStudio Cloud, which is a web-based version of RStudio * This means that you won't have to install anything on your computer and that you will have direct access to all of the materials that I have prepared * Go to https://rstudio.cloud and create a login * Once you have done this, use the link XXX to access the CSN-RH workspace on RStudio Cloud and select the data_science_course_2021 project - When you access the project, you will receive your own permanent copy of it to work on --- # Exercise: Covid-related deaths .pull-left[ * click on the document *01a_covid.Rmd* * this is an *RMarkdown* document which mixes text and code and is an excellent format for reproducible research reports, as we will see later * it is adapted from [Mine Çetinkaya-Rundel's *Datascience in a Box*](https://github.com/rstudio-education/datascience-box/blob/master/course-materials/application-exercises/ae-01b-covid/covid.Rmd) * don't worry too much about the code for now; this is what the document looks like when rendered ] .pull-right[ <img src="images/covid_rendered.png" width="80%" /> ] --- # Exercise: Covid-related deaths .pull-left[ * click on the *Knit* button at the top of the document to produce the rendered version * have a read-through and look at the figure at the bottom * there is also an inetractive table to show you which countries are in the data set ] .pull-right[ <img src="images/covid_rendered2.png" width="80%" /> ] --- # Your turn! For each of the following challenges, go back to the raw document (i.e. the one that doesn't look pretty 😄), try to figure out how to make the relevant change and then render the document using *Knit* to see whether you were correct! * Change *Turkey* to *Australia* in the list of countries under examination; check the figure to see if it worked * Use the table at the bottom of the document to find another country of interest to you and change *Australia* to that country * See if you can change the figure to represent cumulative deaths from the 100th rather than the 10th death * Examine the format of the data being looked at (hint: go to the table close to the top of the document); an alternative entry to "death" under *type* is "confirmed". Can you change the document so that the figure reflects cumulative confirmed cases rather than deaths? (This one is a bit trickier!)