Welcome to bootcamp!

This bootcamp was originally designed for my MD/PhD cohort at UW-Madison. However, anyone can follow along! This curriculum is designed for individuals new to both R and coding in general. NO PRIOR CODING EXPERIENCE REQUIRED! I did not create any of the materials presented in this bootcamp. Rather, I have put together a roadmap of resources that I think will help students build the coding skills and statistical knowledge needed to analyze a large, “high-dimensional” dataset.

Phase 1 consists of Codecademy PRO’s R course, along with practice sets designed by Jenny Bryan for UBC’s STAT545 course. By the end of Phase 1 of this bootcamp, you should be able to:

manipulate a dataset into “tidy” format
generate a variety of different graphs from a “tidy” dataset
analyze trends over different groupings of data
calculate mean/median/mode/IQR and perform basic statistical tests (t-tests, generalized linear models)
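
To give a rough feel for the last two goals, here is a minimal sketch in base R. It uses the built-in `sleep` dataset, which is my choice for illustration and is not part of the course materials. The data are already in “tidy” format: one row per observation and one column per variable.

```r
# Built-in dataset: extra hours of sleep for patients under two drugs.
# Columns: extra (numeric), group (factor: 1 or 2), ID (patient).
data(sleep)

# Summary statistics across all observations
mean_extra   <- mean(sleep$extra)
median_extra <- median(sleep$extra)
iqr_extra    <- IQR(sleep$extra)

# Trend over a grouping: mean extra sleep within each drug group
group_means <- aggregate(extra ~ group, data = sleep, FUN = mean)
print(group_means)

# Basic statistical test: Welch two-sample t-test comparing the two groups
result <- t.test(extra ~ group, data = sleep)
print(result$p.value)
```

The Codecademy lessons cover the same ideas with tidyverse functions, so the exact function names you learn there will differ from this base R version.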

After getting your feet wet with data cleaning/data viz, we will move on to more complex concepts featured in the R Software Carpentry workshops that will make your coding life easier! By the end of Phase 2, you should be able to:

understand how to construct and interpret an if/else statement
push/pull something from GitHub
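
For the first goal, an if/else statement in R looks roughly like the sketch below. The function name `label_p_value` and the default threshold of 0.05 are hypothetical and made up for this example; they do not come from the Software Carpentry lessons.

```r
# Hypothetical helper: label a p-value against a significance threshold.
label_p_value <- function(p, alpha = 0.05) {
  if (p < alpha) {
    # This branch runs only when the condition is TRUE
    "significant"
  } else {
    # This branch runs in every other case
    "not significant"
  }
}

print(label_p_value(0.01))
print(label_p_value(0.20))
```

Reading it aloud helps: "if p is less than alpha, return one label; otherwise, return the other."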

Phase 3 of this bootcamp will focus on providing an overview of the statistics concepts needed to understand most computational approaches, particularly those in the genomics field. We will focus particularly on two-group comparisons, linear modeling, multiple testing and supervised/unsupervised learning. Phase 3 will refer to slide decks prepared for UBC’s STAT540 course. By the end of Phase 3, you should be able to:

understand what quality checks should be performed on a dataset before conducting any analyses
understand the statistical underpinnings of two-group comparisons and linear models
know why multiple testing is important in “big data” studies
perform principal component analysis (PCA)
understand supervised learning and unsupervised clustering methods
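
As a small preview of two of these goals, here is a sketch in base R of why multiple testing matters and what running a PCA looks like. The simulated p-values and the built-in `iris` dataset are my own choices for illustration and are not taken from the STAT540 slides.

```r
set.seed(1)

# Multiple testing: simulate 1000 p-values from tests where nothing is
# really going on (under the null, p-values are uniform on [0, 1]).
p_values <- runif(1000)

# Without correction, some tests fall below 0.05 by chance alone.
n_raw <- sum(p_values < 0.05)

# Benjamini-Hochberg adjustment controls the false discovery rate.
p_adj <- p.adjust(p_values, method = "BH")
n_adj <- sum(p_adj < 0.05)

print(c(raw = n_raw, adjusted = n_adj))

# PCA: centre and scale the four numeric columns of iris, then rotate
# them onto principal components.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

The number of "hits" after adjustment can never be larger than the number before it, which is the point: the correction removes findings that are likely due to chance.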

Syllabus

Phase 1

To complete this phase, you will need to sign up for Codecademy’s PRO R course. After completing each lesson, do the practice sets in your Google Colab document. If a lesson does not have any associated practice sets, continue on to the next lesson. If you have any questions, just leave a comment in the Google Colab document and tag me.

Lesson Title	More practice	Practice answers	Due date
Introduction to R Syntax			June 1st
Learn R: Data Frames			June 1st
Data Cleaning			June 1st
Data Visualization with ggplot2			June 1st
Logical operators from DataCamp	HW1, HW2	HW1 answers, HW2 answers	June 1st
Aggregates	HW3	HW3 answers	June 1st
Joining Tables	HW4	Link coming soon!	June 1st
Mean, Median and Mode			June 1st
Variance and Standard Deviation			June 1st
Quartiles, Quantiles and IQR			June 1st
Hypothesis testing	HW5	Link coming soon!	June 1st

Phase 2

Make sure you have RStudio up and running on your own machine. Instructions for how to do this should have been provided in Lesson 1 of Codecademy, but if you didn’t get a chance to set that up, there are instructions here.

Lesson Title	Source	Due date
If/else statements	Software Carpentry	June 15th
Natural Language Processing	TBD	June 15th
Tenets of writing good code in R	Software Carpentry	June 15th

Phase 3

This phase will focus on understanding conceptual topics foundational to genomics-based/statistical research. The slide decks below are referenced from UBC’s STAT540 course. They are used with permission from Sara Mostafavi (course director).

Lesson Title	Source	Date
Exploratory data analysis and quality control	STAT540	TBD
Stats/math background for big data	STAT540	TBD
Two group comparisons	STAT540	TBD
ANOVA	STAT540	TBD
Linear models	STAT540	TBD
Linear modeling with continuous variables	STAT540	TBD
Multiple testing	STAT540	TBD
Principal component analysis	STAT540	TBD
Cluster analysis	STAT540	TBD
Supervised learning	STAT540	TBD
Supervised learning part 2	STAT540	TBD