Welcome to bootcamp!
This bootcamp was originally desgined for my MD/PhD cohort at UW-Madison. However, anyone can follow along! This curriculum is designed for individuals new to both R and coding in general. NO PRIOR CODING EXPERIENCE REQUIRED! I have not created any materials presented in this bootcamp. Rather, I have put together a roadmap of resources that I think will help students build the coding skills and statistical knowledge needed to analyze a large, “high-dimensional” dataset.
Phase 1 consists entirely of CodeAcademy PRO’s R course, in addition to practice sets designed by Jenny Bryan for UBC’s STAT545 lecture. By the end of Phase 1 of this boot camp, you should be able to:
- manipulate a dataset into “tidy” format
- generate a variety of different graphs from a “tidy” dataset
- analyze trends over different groupings of data
- calculate mean/median/mode/IQR and perform basic statistical tests (t.tests, generalized linear models)
After getting your feet wet with data cleaning/data viz, we will move on to more complex concepts featured in the R Software Carpentry workshops that will make your coding life easier! By the end of phase 2, you should be able to:
- understand how to construct and interprete an if/else statement
- push/pull something from github
Phase 3 of this bootcamp will focus on providing an overview of statistics concepts needed to understand most computational approaches, particularly those in the genomics field. We will focus particularly on two-group comparisons, linear modeling, multiple testing and supervised/unsupervised learning. Phase 3 will refer to slidedecks prepared for UBC’s STAT540 course. By the end of phase 3, you should be able to:
- understand what quality checks should be performed on a dataset before conducting any analyses
- understand the statistical underpinnings of two-group comparisons and linear models
- know why multiple testing is important in “big data” studies
- perform PCA analysis
- understand supervised and unsupervised clustering methods
To complete this phase, you will need to sign up for CodeAcademy’s PRO R course. After completing each lesson, do the practice sets in your Google collab document. If a lesson does not have any associated practice sets, continue on to the next lesson. If you have any questions, just make a comment in the Google Collab document and tag me.
|Lesson Title||More practice||Practice answers||Due date|
|Introduction to R Syntax||June 1st|
|Learn R: Data Frames||June 1st|
|Data Cleaning||June 1st|
|Data Visualization with ggplot2||June 1st|
|Logical operators from DataCamp||HW1, HW2||HW1 answers, HW2 answers||June 1st|
|Aggregates||HW3||HW3 answers||June 1st|
|Joining Tables||HW4||Link coming soon!||June 1st|
|Mean, Median and Mode||June 1st|
|Variance and Standard Deviation||June 1st|
|Quartiles, Quantiles and IQR||June 1st|
|Hypothesis testing||HW5||Link coming soon!||June 1st|
Make sure you have RStudio up and running on your own machine. Instructions for how to do this should have been provided in Lesson 1 of CodeAcademy, but if you didn’t get a chance to set that up, there are instructions here.
|Lesson Title||Source||Due date|
|If/else statements||Software Carpentry||June 15th|
|Natural Language Processing||TBD||June 15th|
|Tenents of writing good code in R||Software Carpentry||June 15th|
This phase will focus on understanding conceptual topics foundational to genomics-based/statistical research. The slide decks below are referenced from UBC’s STAT540 course. They are used with permission from Sara Mostafavi (course director).
|Exploratory data analysis and quality control||STAT540||TBD|
|Stats/math background for big data||STAT540||TBD|
|Two group comparisons||STAT540||TBD|
|Linear modeling with continuous variables||STAT540||TBD|
|Principal component analysis||STAT540||TBD|
|Supervised learning part 2||STAT540||TBD|