Welcome to bootcamp!

This bootcamp was originally designed for my MD/PhD cohort at UW-Madison. However, anyone can follow along! This curriculum is designed for individuals new to both R and coding in general. NO PRIOR CODING EXPERIENCE REQUIRED! I did not create any of the materials presented in this bootcamp. Rather, I have put together a roadmap of resources that I think will help students build the coding skills and statistical knowledge needed to analyze a large, “high-dimensional” dataset.

Phase 1 consists of Codecademy Pro’s R course, plus practice sets designed by Jenny Bryan for UBC’s STAT545 course. By the end of Phase 1 of this bootcamp, you should be able to:

  • manipulate a dataset into “tidy” format
  • generate a variety of different graphs from a “tidy” dataset
  • analyze trends over different groupings of data
  • calculate mean/median/mode/IQR and perform basic statistical tests (t-tests, generalized linear models)
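
To give a feel for what these Phase 1 skills look like in practice, here is a minimal sketch. The data frame, its column names, and its values are made up for illustration, and the sketch uses base R only so that it runs without extra packages; the course lessons themselves may teach different tools for the same tasks.

```r
# Hypothetical toy data in "wide" format: one row per patient,
# one column per time point. All values are invented.
wide <- data.frame(
  patient = c("p1", "p2", "p3", "p4"),
  group   = c("control", "control", "treated", "treated"),
  week0   = c(5.1, 4.8, 5.0, 5.3),
  week4   = c(5.0, 4.9, 3.9, 4.1)
)

# Reshape to "tidy" (long) format: one row per patient per time point.
tidy <- reshape(
  wide,
  direction = "long",
  varying   = c("week0", "week4"),
  v.names   = "score",
  timevar   = "week",
  times     = c(0, 4),
  idvar     = "patient"
)

# Analyze trends over groupings: mean score for each group at each week.
group_means <- aggregate(score ~ group + week, data = tidy, FUN = mean)
print(group_means)

# Basic summary statistics and a two-group t-test at week 4.
week4 <- subset(tidy, week == 4)
week4_median <- median(week4$score)
week4_iqr    <- IQR(week4$score)
week4_test   <- t.test(score ~ group, data = week4)
print(week4_test$p.value)
```

With only two patients per group, the t-test here is purely a syntax demonstration, not a meaningful analysis.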

After getting your feet wet with data cleaning/data viz, we will move on to more complex concepts featured in the R Software Carpentry workshops that will make your coding life easier! By the end of Phase 2, you should be able to:

  • understand how to construct and interpret an if/else statement
  • push changes to and pull changes from GitHub
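
As a small preview of the if/else goal, the following sketch shows the general shape of the construct in R. The function name `label_p_value` and the 0.05 threshold are hypothetical choices made for this example, not something taken from the Software Carpentry lesson.

```r
# Hypothetical helper: label a single p-value relative to a threshold.
label_p_value <- function(p, alpha = 0.05) {
  if (is.na(p)) {
    "missing"            # handle missing values first
  } else if (p < alpha) {
    "significant"
  } else {
    "not significant"
  }
}

label_p_value(0.01)
label_p_value(0.20)

# if/else checks one condition at a time; for a whole vector,
# the vectorized ifelse() function is the usual choice.
labels <- ifelse(c(0.01, 0.20) < 0.05, "significant", "not significant")
print(labels)
```

Reading the function top to bottom, only the first branch whose condition is TRUE runs, which is why the missing-value check comes before the comparison.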

Phase 3 of this bootcamp will focus on providing an overview of the statistics concepts needed to understand most computational approaches, particularly those in the genomics field. We will focus particularly on two-group comparisons, linear modeling, multiple testing and supervised/unsupervised learning. Phase 3 will refer to slide decks prepared for UBC’s STAT540 course. By the end of Phase 3, you should be able to:

  • understand what quality checks should be performed on a dataset before conducting any analyses
  • understand the statistical underpinnings of two-group comparisons and linear models
  • know why multiple testing is important in “big data” studies
  • perform principal component analysis (PCA)
  • understand supervised and unsupervised learning methods
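
Two of these Phase 3 goals can be previewed in a few lines of base R. This is only a sketch under stated assumptions: the simulated data, the seed, and the choice of the Benjamini-Hochberg adjustment are illustrative picks for this example, and the STAT540 slides may present the topics differently.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Multiple testing: run 1000 t-tests on pure noise. Both samples come
# from the same distribution, so every "significant" result is a
# false positive.
p_values <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)
n_raw <- sum(p_values < 0.05)   # roughly 5% are expected by chance alone

# Adjust the p-values with the Benjamini-Hochberg method.
p_adj <- p.adjust(p_values, method = "BH")
n_adj <- sum(p_adj < 0.05)      # never larger than n_raw
print(c(raw = n_raw, adjusted = n_adj))

# PCA on the four numeric columns of the built-in iris dataset.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

The first half shows why testing many hypotheses at once produces false positives unless the p-values are adjusted; the second half shows the basic `prcomp` call that a PCA workflow is built around.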

Syllabus

Phase 1

To complete this phase, you will need to sign up for Codecademy’s Pro R course. After completing each lesson, do the practice sets in your Google Colab document. If a lesson does not have any associated practice sets, continue on to the next lesson. If you have any questions, just leave a comment in the Google Colab document and tag me.

| Lesson Title | More practice | Practice answers | Due date |
| --- | --- | --- | --- |
| Introduction to R Syntax | | | June 1st |
| Learn R: Data Frames | | | June 1st |
| Data Cleaning | | | June 1st |
| Data Visualization with ggplot2 | | | June 1st |
| Logical operators from DataCamp | HW1, HW2 | HW1 answers, HW2 answers | June 1st |
| Aggregates | HW3 | HW3 answers | June 1st |
| Joining Tables | HW4 | Link coming soon! | June 1st |
| Mean, Median and Mode | | | June 1st |
| Variance and Standard Deviation | | | June 1st |
| Quartiles, Quantiles and IQR | | | June 1st |
| Hypothesis testing | HW5 | Link coming soon! | June 1st |

Phase 2

Make sure you have RStudio up and running on your own machine. Instructions for how to do this should have been provided in Lesson 1 of Codecademy, but if you didn’t get a chance to set that up, there are instructions here.

| Lesson Title | Source | Due date |
| --- | --- | --- |
| If/else statements | Software Carpentry | June 15th |
| Natural Language Processing | TBD | June 15th |
| Tenets of writing good code in R | Software Carpentry | June 15th |

Phase 3

This phase will focus on understanding conceptual topics foundational to genomics-based/statistical research. The slide decks below are referenced from UBC’s STAT540 course. They are used with permission from Sara Mostafavi (course director).

| Lesson Title | Source | Date |
| --- | --- | --- |
| Exploratory data analysis and quality control | STAT540 | TBD |
| Stats/math background for big data | STAT540 | TBD |
| Two group comparisons | STAT540 | TBD |
| ANOVA | STAT540 | TBD |
| Linear models | STAT540 | TBD |
| Linear modeling with continuous variables | STAT540 | TBD |
| Multiple testing | STAT540 | TBD |
| Principal component analysis | STAT540 | TBD |
| Cluster analysis | STAT540 | TBD |
| Supervised learning | STAT540 | TBD |
| Supervised learning part 2 | STAT540 | TBD |