Data Science with R

I am not a data scientist, but ever since I began learning statistics with R, I have wondered what the term "data science" actually means. According to a YouTube video by Professor Dr. Robert Curtis:

Data science = Programming + Statistics + Mathematics

I would add Communication to that list. I have also thought about which resources you can use to get to the point where you can call yourself a data scientist.

Programming

You do not need a bachelor's degree in Computer Science to become a data scientist. You may have received formal training in programming, or you may be self-taught. There are several programming languages to choose from: R, Python, Julia, MATLAB, etc. If you are interested in using R, you can start with The Book of R by Tilman M. Davies. You need Parts I and II of the book to get started.

Before reading this book, I thought I understood the R language, but reading Parts I and II revealed gaps in my understanding, which I began to fill.

Part I: The Language

Part I covers the basic syntax and object types that are used in R programming. Chapters 2 through 5 introduce simple arithmetic, assignment, and important object types such as vectors, matrices, lists, and data frames. Chapter 6 discusses how R represents missing data values and distinguishes between object types. There is an introductory lesson on plotting in Chapter 7, which makes use of both built-in functionality and an external package, ggplot2. Chapter 8 covers how to import data from external files, which is important if you want to analyze your own data.
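To make these object types concrete, here is a minimal sketch in base R. It is my own illustration, not an example from the book, and the values and variable names (`heights`, `person`, and so on) are made up.

```r
# A vector holds values of a single type; NA marks a missing value
heights <- c(1.62, 1.75, NA, 1.80)

# A matrix is a vector arranged in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)

# A list can mix elements of different types and lengths
person <- list(name = "Ada", scores = c(90, 85))

# A data frame is a table whose columns are equal-length vectors
df <- data.frame(id = 1:4, height = heights)

# Missing values must be handled explicitly
n_missing <- sum(is.na(df$height))         # count the NA entries
mean_height <- mean(heights, na.rm = TRUE) # drop NA before averaging

# Inspect what kind of object each one is
class(heights)
class(person)
class(df)
```

Without `na.rm = TRUE`, `mean(heights)` would return `NA`, which is exactly the kind of detail Chapter 6 is about.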

Part II: Programming

Part II familiarizes you with the common R programming mechanisms. First, Davies discusses functions and how they operate in R in Chapter 9. Then, in Chapter 10, he covers loops and conditional statements, which control the flow, repetition, and ultimately the execution of your code. He then goes on to teach you how to write your own R functions in Chapter 11. He also covers some additional topics, such as error handling and measuring function execution time, in Chapter 12.
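The mechanisms named above fit into a few lines of base R. The following is a small sketch of my own, not taken from the book; `count_even` is a hypothetical function invented for illustration.

```r
# A user-defined function that combines a conditional and a loop
count_even <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")   # raise an error for invalid input
  }
  n <- 0
  for (value in x) {
    if (value %% 2 == 0) {      # %% is the remainder operator
      n <- n + 1
    }
  }
  n                             # the last expression is the return value
}

count_even(1:10)                # 5

# Error handling: catch the error and fall back to NA
result <- tryCatch(
  count_even("a"),
  error = function(e) NA
)

# Measure how long a call takes
timing <- system.time(count_even(1:100000))
timing[["elapsed"]]
```

In practice you would write `sum(x %% 2 == 0)` instead of the loop, since R is vectorized; the loop is here only to show the control-flow syntax.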

Part III: Data Wrangling

For Part III, I recommend you switch to the book R for Data Science by Hadley Wickham and Garrett Grolemund. You’ll learn about data wrangling: how to get your data into R in a form that is useful for visualization and modelling. Data wrangling is so important that you can’t work with your data without it. There are three main parts to data wrangling: import, tidy, and transform.

Begin with Chapter 3 of the book, where you will be introduced to data transformation. You’ll learn how to select important variables, filter out observations, create new variables, and summarize your data. Then move to Chapter 7, where you’ll learn about the variant of the data frame used in the book: the tibble. In Chapter 8, you’ll learn how to import your data into R. In Chapter 9, you’ll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modeling easier.
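The four transformation steps (select, filter, create, summarize) can be sketched as follows. The book teaches them with the dplyr package; I use base R equivalents here purely so that the sketch runs without installing anything, and I use the built-in `mtcars` data set as a stand-in for your own data.

```r
# Select the variables of interest
cars <- mtcars[, c("mpg", "cyl", "wt")]

# Filter: keep only cars heavier than 3000 lbs (wt is in units of 1000 lbs)
heavy <- cars[cars$wt > 3, ]

# Create a new variable: weight in kilograms
heavy$wt_kg <- heavy$wt * 1000 * 0.45359237

# Summarize: mean miles per gallon for each number of cylinders
mpg_by_cyl <- aggregate(mpg ~ cyl, data = heavy, FUN = mean)
mpg_by_cyl
```

The dplyr versions of these steps are `select()`, `filter()`, `mutate()`, and `summarise()` combined with `group_by()`, which read more naturally once you chain several steps together.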

You then go deeper into data wrangling in the subsequent chapters. Chapter 10 will equip you with skills for working with multiple interrelated datasets. The next three chapters focus on specific types of data you will frequently encounter in practice: Chapter 11 introduces regular expressions, a useful tool for manipulating strings; Chapter 12 explains how categorical data is stored in R; and Chapter 13 provides key tools for working with dates and date-times.
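Here is a rough sketch of what these four skills look like in practice. The book uses tidyverse packages for them; I stick to base R functions so that the sketch is self-contained. The two small tables and all their values are made up for illustration.

```r
# Two hypothetical, interrelated tables that share the key column "id"
patients <- data.frame(id = c(1, 2, 3),
                       name = c("Ann Lee", "Bo Chen", "Cy Diaz"))
visits <- data.frame(id = c(1, 1, 3),
                     visit_date = c("2021-01-15", "2021-03-02", "2021-02-20"))

# Relational data: an inner join keeps only ids present in both tables
joined <- merge(patients, visits, by = "id")

# Strings: a regular expression strips everything up to the last space
surnames <- sub("^.* ", "", patients$name)
starts_with_l <- grepl("^L", surnames)

# Categorical data: a factor with explicitly ordered levels
severity <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"))
table(severity)

# Dates: convert text to Date, then do arithmetic on it
visit_dates <- as.Date(visits$visit_date)
gap_days <- as.numeric(diff(range(visit_dates)))  # days between first and last visit
```

Patient 2 has no visits, so the inner join drops that row; deciding which kind of join you need is one of the main lessons of working with interrelated datasets.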

Part IV: Data Exploration

To explore data is to look at your data, rapidly generate hypotheses, quickly test them, then repeat the process when necessary. Read Chapter 1 of R for Data Science, which succinctly explains visualization, the ggplot2 package, and how to skilfully turn data into plots. Then move to Chapter 5, where you will combine visualization and transformation to ask and answer interesting questions about data. You can also use R Graphics Cookbook by Winston Chang if you would like to learn more about data exploration.

Part V: Modelling

Now that you have programming skills, you can turn to modelling. You need materials on statistical analysis with applications in R. The materials should cover both descriptive and inferential statistics. My favorite books are:

Part IV of *The Book of R* also covers several topics of Statistical Testing and Modelling. I find it impressive how Colonescu writes customized R functions in his books.

Statistics

But in order not to end up with a “plug and chug” knowledge of statistics, you should also work through a textbook-based course on statistics. In such a course, you are required to do the calculations by hand, which can become very tedious. The data sets are, however, very small. Knowing how these calculations are carried out ensures that you actually learn the concepts and topics of statistics. Learn the following statistics topics in this sequence:

Descriptive statistics
Probability distributions
- The Normal Distribution
- Sampling Distributions
- Central Limit Theorem
Statistical inference
- Confidence Intervals
- Hypothesis Testing with One/Two Samples
- Correlation and Linear Regression
- Dichotomous Dependent Variable (Logistic Regression)
- Analysis of Categorical Data (Chi-Square Tests)
- The F-Distribution and Hypothesis Testing with Two Variances, ANOVA
- Time-Series Data Analysis
- Cross-Sectional Time-Series Data Analysis (Panel Data)

You could perhaps kill two birds with one stone. For example, when I began learning statistics with R at the University of Cologne, the class was divided into two parts. In the first part of the class, we learnt statistical concepts and did short calculations by hand. In the second part of the class, we used R for statistical analyses.
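To illustrate how the two parts fit together, here is a small sketch that computes a 95% confidence interval for a mean step by step, the way you would by hand, and then checks it against R's built-in `t.test()`. The five data points are made up.

```r
# A small, made-up sample
x <- c(4.1, 5.3, 4.8, 5.0, 4.6)

# "By hand": mean, standard error, and critical t value
n <- length(x)
x_bar <- mean(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value
ci_manual <- c(x_bar - t_crit * se, x_bar + t_crit * se)

# The same interval from the built-in one-sample t-test
ci_builtin <- t.test(x)$conf.int

ci_manual
ci_builtin
all.equal(ci_manual, as.numeric(ci_builtin))
```

Once you have done the manual version a few times, the output of `t.test()` stops being a black box.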

My favorite introductory statistics materials:

Hanomics has awesome YouTube lectures on several topics in time-series and panel data analysis. I found them useful (though I often ignore the economic theory and just focus on the statistical procedures).

Mathematics

I found this picture on a Facebook group. I just find it really funny.

You will use a lot of concepts from (elementary) set theory and logic in data science. In his YouTube video, Professor Robert Curtis mentioned that you probably will not use Calculus I and II directly, but you need them to understand multivariable calculus, and you will apply several concepts of multivariable calculus in data science. Linear algebra is the most relevant part of the mathematics sequence. In my view, you can successfully learn (elementary) linear algebra without any knowledge of calculus.

Set theory and logic
Calculus I and II
Multivariable calculus
Linear algebra
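One way to see why linear algebra is so relevant: the coefficients of a linear regression are the solution of a small matrix equation, the normal equations $X^\top X \beta = X^\top y$. The sketch below is my own illustration using the built-in `mtcars` data; it solves that equation directly and compares the result with `lm()`.

```r
# Design matrix: a column of ones (intercept) and the predictor wt
X <- cbind(1, mtcars$wt)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y for beta
beta <- solve(t(X) %*% X, t(X) %*% y)

# The same coefficients from R's regression function
fit <- lm(mpg ~ wt, data = mtcars)

as.numeric(beta)
as.numeric(coef(fit))
```

`lm()` uses a more numerically stable method internally, but for a well-behaved data set like this one the two sets of coefficients agree.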

My favorite materials for learning math:

The author’s style of writing is accessible. As a reader, I could tell that he puts in a lot of effort to explain math to those who don’t like/know math.

Jason Gibson’s math lessons: Jason Gibson is really good at explaining mathematical concepts. In fact, he is a godsend to those of us who struggle with math.

These statistics and math sequences mark the beginning of your training as a data scientist. The topics you learn afterwards will be determined by your area of specialization in data science.

Communication

As you’ve noticed, all data science work requires the technical skills to acquire your data, clean it, and perform your analysis. But as you do this, it’s also important to keep communication in mind. Statistics is not just about the numbers. There is a story behind those numbers that you want to tell. Ideally, you want to disseminate your results to an audience of (usually) non-data scientists. One good book of presentation tips for improving your data science communication skills is: