Welcome to bootcamp!
This bootcamp was originally designed for my MD/PhD cohort at UW-Madison, but anyone can follow along! The curriculum is aimed at people who are new to both R and coding in general. NO PRIOR CODING EXPERIENCE REQUIRED! I did not create any of the materials presented in this bootcamp. Rather, I have put together a roadmap of resources that I think will help students build the coding skills and statistical knowledge needed to analyze a large, “high-dimensional” dataset.
Phase 1 consists of Codecademy Pro’s R course, plus practice sets designed by Jenny Bryan for UBC’s STAT545 course. By the end of Phase 1 of this bootcamp, you should be able to:
- manipulate a dataset into “tidy” format
- generate a variety of different graphs from a “tidy” dataset
- analyze trends over different groupings of data
- calculate the mean, median, mode and IQR, and perform basic statistical tests (t-tests, generalized linear models)
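As a rough illustration of what these Phase 1 skills look like in practice, here is a minimal sketch that reshapes a small table into “tidy” format, summarises it by group, and runs a t-test. The dataset, its column names (`patient`, `group`, `visit1`, `visit2`) and its values are made up for this example, and the sketch assumes the `tidyr` and `dplyr` packages are installed; it is not taken from the Codecademy or STAT545 materials.

```r
library(tidyr)
library(dplyr)

# Hypothetical "wide" dataset: one row per patient, one column per visit.
wide <- data.frame(
  patient = c("P1", "P2", "P3", "P4", "P5", "P6"),
  group   = c("control", "control", "control", "treated", "treated", "treated"),
  visit1  = c(5.1, 4.8, 5.0, 6.0, 6.4, 5.9),
  visit2  = c(5.3, 5.0, 5.4, 7.1, 7.5, 6.8)
)

# "Tidy" format: one row per observation (patient x visit),
# with the visit label and the measured score in their own columns.
tidy <- pivot_longer(
  wide,
  cols      = c(visit1, visit2),
  names_to  = "visit",
  values_to = "score"
)

# Summary statistics for each group.
summary_by_group <- tidy %>%
  group_by(group) %>%
  summarise(
    mean_score   = mean(score),
    median_score = median(score),
    iqr_score    = IQR(score)
  )

# Two-group comparison on a single visit, so each patient contributes
# one observation (R's default here is Welch's t-test).
visit2_only <- filter(tidy, visit == "visit2")
t_result <- t.test(score ~ group, data = visit2_only)

print(summary_by_group)
print(t_result)
```

The t-test is restricted to one visit because repeated measurements from the same patient are not independent; a real analysis with so few patients would also call for more caution than this sketch suggests.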
After getting your feet wet with data cleaning/data viz, we will move on to more complex concepts featured in the R Software Carpentry workshops that will make your coding life easier! By the end of Phase 2, you should be able to:
- understand how to construct and interpret an if/else statement
- push to and pull from a GitHub repository
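For a sense of what an if/else statement looks like in R, here is a small sketch in base R. The function name `classify_p_value` and the 0.05 cutoff are illustrative choices for this example, not something taken from the Software Carpentry lessons.

```r
# Hypothetical helper: label a single p-value against a cutoff.
classify_p_value <- function(p, alpha = 0.05) {
  if (p < alpha) {
    "significant"
  } else {
    "not significant"
  }
}

classify_p_value(0.01)
classify_p_value(0.20)

# if/else checks one condition at a time; for a whole vector,
# the vectorised ifelse() applies the same rule element by element.
p_values <- c(0.01, 0.20, 0.04)
labels <- ifelse(p_values < 0.05, "significant", "not significant")
print(labels)
```

The distinction matters in practice: passing a vector of length greater than one to `if` is an error in recent versions of R, which is why `ifelse()` is used for the vector case.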
Phase 3 of this bootcamp provides an overview of the statistics concepts needed to understand most computational approaches, particularly those in the genomics field. We will focus on two-group comparisons, linear modeling, multiple testing, and supervised/unsupervised learning. Phase 3 refers to slide decks prepared for UBC’s STAT540 course. By the end of Phase 3, you should be able to:
- understand what quality checks should be performed on a dataset before conducting any analyses
- understand the statistical underpinnings of two-group comparisons and linear models
- know why multiple testing is important in “big data” studies
- perform principal component analysis (PCA)
- understand supervised learning and unsupervised clustering methods
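To make the multiple testing and PCA goals more concrete, here is a minimal base R sketch on simulated data. Everything in it is an assumption made for illustration: 1000 made-up “features” measured in 10 samples, two groups of 5, and no real difference between the groups. It is not drawn from the STAT540 slides.

```r
set.seed(1)

# Simulated data: 1000 features (rows) x 10 samples (columns),
# pure noise, so any "hit" below is a false positive.
n_features <- 1000
expr  <- matrix(rnorm(n_features * 10), nrow = n_features)
group <- rep(c("a", "b"), each = 5)

# One two-group t-test per feature.
p_values <- apply(expr, 1, function(x) {
  t.test(x[group == "a"], x[group == "b"])$p.value
})

# Without correction, some features fall below 0.05 by chance alone.
n_raw_hits <- sum(p_values < 0.05)

# Benjamini-Hochberg adjustment controls the false discovery rate.
p_bh <- p.adjust(p_values, method = "BH")
n_bh_hits <- sum(p_bh < 0.05)

cat("Raw p < 0.05:", n_raw_hits, "\n")
cat("BH-adjusted p < 0.05:", n_bh_hits, "\n")

# PCA on samples: prcomp() expects samples in rows, so transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:3], 3))
```

Because the data are pure noise, the uncorrected hits illustrate why testing many features at once requires an adjustment, and the PCA is not expected to separate the two groups.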
Syllabus
Phase 1
To complete this phase, you will need to sign up for Codecademy’s Pro R course. After completing each lesson, do the practice sets in your Google Colab document. If a lesson does not have any associated practice sets, continue on to the next lesson. If you have any questions, just make a comment in the Google Colab document and tag me.
| Lesson Title | More practice | Practice answers | Due date |
|---|---|---|---|
| Introduction to R Syntax | | | June 1st |
| Learn R: Data Frames | | | June 1st |
| Data Cleaning | | | June 1st |
| Data Visualization with ggplot2 | | | June 1st |
| Logical operators from DataCamp | HW1, HW2 | HW1 answers, HW2 answers | June 1st |
| Aggregates | HW3 | HW3 answers | June 1st |
| Joining Tables | HW4 | Link coming soon! | June 1st |
| Mean, Median and Mode | | | June 1st |
| Variance and Standard Deviation | | | June 1st |
| Quartiles, Quantiles and IQR | | | June 1st |
| Hypothesis testing | HW5 | Link coming soon! | June 1st |
Phase 2
Make sure you have RStudio up and running on your own machine. Instructions for how to do this should have been provided in Lesson 1 of the Codecademy course, but if you didn’t get a chance to set that up, there are instructions here.
| Lesson Title | Source | Due date |
|---|---|---|
| If/else statements | Software Carpentry | June 15th |
| Natural Language Processing | TBD | June 15th |
| Tenets of writing good code in R | Software Carpentry | June 15th |
Phase 3
This phase will focus on understanding conceptual topics foundational to genomics-based/statistical research. The slide decks below are referenced from UBC’s STAT540 course. They are used with permission from Sara Mostafavi (course director).
| Lesson Title | Source | Date |
|---|---|---|
| Exploratory data analysis and quality control | STAT540 | TBD |
| Stats/math background for big data | STAT540 | TBD |
| Two group comparisons | STAT540 | TBD |
| ANOVA | STAT540 | TBD |
| Linear models | STAT540 | TBD |
| Linear modeling with continuous variables | STAT540 | TBD |
| Multiple testing | STAT540 | TBD |
| Principal component analysis | STAT540 | TBD |
| Cluster analysis | STAT540 | TBD |
| Supervised learning | STAT540 | TBD |
| Supervised learning part 2 | STAT540 | TBD |