Welcome to bootcamp!
This bootcamp was originally desgined for my MD/PhD cohort at UW-Madison. However, anyone can follow along! This curriculum is designed for individuals new to both R and coding in general. NO PRIOR CODING EXPERIENCE REQUIRED! I have not created any materials presented in this bootcamp. Rather, I have put together a roadmap of resources that I think will help students build the coding skills and statistical knowledge needed to analyze a large, “high-dimensional” dataset.
Phase 1 consists entirely of CodeAcademy PRO’s R course, in addition to practice sets designed by Jenny Bryan for UBC’s STAT545 lecture. By the end of Phase 1 of this boot camp, you should be able to:
- manipulate a dataset into “tidy” format
- generate a variety of different graphs from a “tidy” dataset
- analyze trends over different groupings of data
- calculate mean/median/mode/IQR and perform basic statistical tests (t.tests, generalized linear models)
After getting your feet wet with data cleaning/data viz, we will move on to more complex concepts featured in the R Software Carpentry workshops that will make your coding life easier! By the end of phase 2, you should be able to:
- understand how to construct and interprete an if/else statement
- push/pull something from github
Phase 3 of this bootcamp will focus on providing an overview of statistics concepts needed to understand most computational approaches, particularly those in the genomics field. We will focus particularly on two-group comparisons, linear modeling, multiple testing and supervised/unsupervised learning. Phase 3 will refer to slidedecks prepared for UBC’s STAT540 course. By the end of phase 3, you should be able to:
- understand what quality checks should be performed on a dataset before conducting any analyses
- understand the statistical underpinnings of two-group comparisons and linear models
- know why multiple testing is important in “big data” studies
- perform PCA analysis
- understand supervised and unsupervised clustering methods
Syllabus
Phase 1
To complete this phase, you will need to sign up for CodeAcademy’s PRO R course. After completing each lesson, do the practice sets in your Google collab document. If a lesson does not have any associated practice sets, continue on to the next lesson. If you have any questions, just make a comment in the Google Collab document and tag me.
Lesson Title | More practice | Practice answers | Due date |
---|---|---|---|
Introduction to R Syntax | June 1st | ||
Learn R: Data Frames | June 1st | ||
Data Cleaning | June 1st | ||
Data Visualization with ggplot2 | June 1st | ||
Logical operators from DataCamp | HW1, HW2 | HW1 answers, HW2 answers | June 1st |
Aggregates | HW3 | HW3 answers | June 1st |
Joining Tables | HW4 | Link coming soon! | June 1st |
Mean, Median and Mode | June 1st | ||
Variance and Standard Deviation | June 1st | ||
Quartiles, Quantiles and IQR | June 1st | ||
Hypothesis testing | HW5 | Link coming soon! | June 1st |
Phase 2
Make sure you have RStudio up and running on your own machine. Instructions for how to do this should have been provided in Lesson 1 of CodeAcademy, but if you didn’t get a chance to set that up, there are instructions here.
Lesson Title | Source | Due date |
---|---|---|
If/else statements | Software Carpentry | June 15th |
Natural Language Processing | TBD | June 15th |
Tenents of writing good code in R | Software Carpentry | June 15th |
Phase 3
This phase will focus on understanding conceptual topics foundational to genomics-based/statistical research. The slide decks below are referenced from UBC’s STAT540 course. They are used with permission from Sara Mostafavi (course director).
Lesson Title | Source | Date |
---|---|---|
Exploratory data analysis and quality control | STAT540 | TBD |
Stats/math background for big data | STAT540 | TBD |
Two group comparisons | STAT540 | TBD |
ANOVA | STAT540 | TBD |
Linear models | STAT540 | TBD |
Linear modeling with continuous variables | STAT540 | TBD |
Multiple testing | STAT540 | TBD |
Principal component analysis | STAT540 | TBD |
Cluster analysis | STAT540 | TBD |
Supervised learning | STAT540 | TBD |
Supervised learning part 2 | STAT540 | TBD |