
Commit 54aeee9

Committed Mar 4, 2015
Merge branch 'master' of https://github.com/genomicsclass/labs
Conflicts:
	course3/eda_for_highthroughput.Rmd
	course3/inference_for_highthroughput.Rmd
	course3/intro_to_highthroughput_data.Rmd
2 parents: 4b15590 + 804deca

7 files changed: +23 −4 lines changed
 

Rscripts/cheung.R (+1 −1)
@@ -23,6 +23,6 @@ eth2[is.na(eth2)]<-"HAN" ##from LA, checked here
 ##http://ccr.coriell.org/Sections/Search/Advanced_Search.aspx?PgId=175
 
 
-pd=data.frame(ethnicity=eth2,date=dates,filename=basename(filenames))
+pd=data.frame(ethnicity=eth2,date=dates,filename=I(basename(filenames)))
 pData(e)<-pd
 save(e,file="GSE5859.rda")
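The one-line change above wraps `basename(filenames)` in `I()`. A minimal sketch of why this matters, using hypothetical file paths (under R versions where `data.frame()` defaulted to `stringsAsFactors = TRUE`, `I()` keeps the column as plain character strings instead of converting it to a factor):

```r
## Illustrative only: effect of I() inside data.frame().
filenames <- c("/tmp/GSM1.CEL", "/tmp/GSM2.CEL")   # hypothetical paths
pd1 <- data.frame(filename = basename(filenames), stringsAsFactors = TRUE)
pd2 <- data.frame(filename = I(basename(filenames)))
class(pd1$filename)  # "factor": file names coerced to factor levels
class(pd2$filename)  # "AsIs": kept as-is, i.e. character strings
```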

Rscripts/cheungSubset.R (+0 −1)
@@ -11,7 +11,6 @@ y<- colMeans(exprs(e)[which(annot$CHR=="chrY"),])
 sex <- ifelse(y<4.5,"F","M")
 
 sampleInfo <- pData(e)
-sampleInfo <- sampleInfo[,which(colnames(sampleInfo)!="filename")]
 sampleInfo$group <- ifelse(sex=="F",1,0)
 
 batch <- format(pData(e)$date,"%y%m")
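For context, the last line above builds batch labels from processing dates. A minimal sketch with made-up dates (the real ones live in `pData(e)$date`) showing how `format(date, "%y%m")` collapses dates into year-month batches:

```r
## Illustrative only: "%y%m" turns dates into year-month batch labels.
dates <- as.Date(c("2005-06-23", "2005-06-27", "2005-10-28"))  # made-up dates
batch <- format(dates, "%y%m")
batch          # "0506" "0506" "0510"
table(batch)   # two samples in batch 0506, one in batch 0510
```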

course3/eda_for_highthroughput.Rmd (+12 −0)
@@ -13,7 +13,11 @@ library(rafalib)
 ```
 
 # Introduction
 An under-appreciated advantage of working with high-throughput data is that problems with the data are sometimes more easily exposed. The fact that we have thousands of measurements permits us to see problems that are not apparent when only a few measurements are available. A powerful way to detect these problems is with exploratory data analysis (EDA). Here we review some of the plots that allow us to detect quality problems.
 
 We will use the results obtained in a previous section:

@@ -34,7 +38,11 @@ nullpvals <- rowttests(randomData,g)$p.value
 
 # Volcano plots
 
 As we described in the Introduction chapter, reporting only p-values is a mistake when we can also report effect sizes. With high-throughput data we can visualize the results by making a plot. The idea behind a _volcano plot_ is to show these for all features. On the y-axis we plot the -log (base 10) p-values and on the x-axis the effect size. By using -log (base 10), the "highly significant" features appear at the top of the plot. Using the log also permits us to better distinguish between small and very small p-values, for example 0.01 and $10^{-6}$. Here is the volcano plot for our results above:
 
 ```{r}
 plot(results$dm,-log10(results$p.value),
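The plot call above is cut off by the hunk boundary. As a self-contained illustration of the volcano plot described in that paragraph, here is a minimal sketch on simulated two-group data (the names and sizes are made up; the chapter's own plot uses the `results` object from the earlier `rowttests` call):

```r
## Illustrative sketch only: a volcano plot on simulated data.
library(genefilter)                         # for rowttests, as used in the chapter
set.seed(1)
mat <- matrix(rnorm(5000 * 16), 5000, 16)   # made-up expression matrix
g <- factor(rep(0:1, each = 8))             # made-up group labels
res <- rowttests(mat, g)                    # res$dm = effect size, res$p.value = p-values
plot(res$dm, -log10(res$p.value),
     xlab = "Effect size", ylab = "-log10 p-value")
```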
@@ -90,7 +98,11 @@ qs <- t(apply(ge,2,quantile,prob=c(0.05,0.25,0.5,0.75,0.95)))
 matplot(qs,type="l",lty=1)
 ```
 
 We can also plot all the histograms. Because we have so much data, we create histograms using small bins, smooth the heights of the bars, and then plot _smooth histograms_. We re-calibrate the height of these smooth curves so that if a bar has a base of size "unit" and height given by the curve at $x_0$, its area approximates the number of points in a region of size "unit" centered at $x_0$:
 
 ```{r}
 mypar2(1,1)
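The re-calibration idea can be sketched in a few lines of base R. This is illustrative only (made-up data; the course code uses rafalib helpers): scaling a density estimate by $n \times$ unit makes the curve height at $x_0$ approximate the number of points in a bin of width "unit":

```r
## Illustrative sketch only, not the chunk above.
set.seed(1)
x <- rnorm(1000)                      # made-up data
unit <- 0.25                          # bin width
hist(x, breaks = seq(min(x) - unit, max(x) + unit, by = unit),
     main = "smooth histogram sketch", xlab = "x")
d <- density(x)                       # smoothed version of the bar heights
lines(d$x, d$y * length(x) * unit, lwd = 2)  # re-calibrated to the count scale
```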

course3/inference_for_highthroughput.Rmd (+5 −1)
@@ -14,9 +14,13 @@ library(rafalib)
 
 # Introduction
 
 Suppose we were given high-throughput gene expression data measured for several individuals in two populations. We are asked to report which genes have different average expression levels in the two populations. Note that if, instead of thousands of genes, we were handed data from just one gene, we could simply apply the inference techniques we have learned before. We could, for example, use a t-test or some other test. Here we review what changes when we consider high-throughput data.
 
-# Thousands of test
+# Thousands of tests
 
 In this data we have two groups denoted with 0 and 1:
 ```{r}
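For a single gene the comparison described above is an ordinary two-sample t-test; the multiple-testing problem only appears once we repeat it thousands of times. A minimal sketch on simulated data (illustrative; the chapter itself uses `rowttests` on the real expression matrix):

```r
## Illustrative only: one gene = one t-test; thousands of genes = thousands of tests.
set.seed(1)
g <- rep(0:1, each = 12)              # made-up group labels
one_gene <- rnorm(24)                 # made-up expression values for one gene
t.test(one_gene[g == 0], one_gene[g == 1])$p.value
## Repeating over 10,000 null genes, about 5% reach p < 0.05 by chance alone:
pvals <- replicate(10000, t.test(rnorm(12), rnorm(12))$p.value)
mean(pvals < 0.05)
```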

course3/intro_to_highthroughput_data.Rmd (+5 −1)
@@ -35,7 +35,7 @@ So a high throughput experiment is usually defined by three tables: one with the
 Because a dataset is typically defined by a set of experimental units, and a product defines a fixed set of features, the high-throughput measurements can be stored in an $n \times m$ matrix, with $n$ the number of units and $m$ the number of features. In R the convention has been to store the transpose of these matrices. Here is an example from a gene expression dataset:
 
 ```{r}
-##can be isntalled with:
+##can be installed with:
 #library(devtools)
 #install_github("genomicsclass/GSE5859Subset")
 library(GSE5859Subset)
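To make the storage convention concrete, here is a minimal sketch (made-up dimensions and names): Bioconductor stores the matrix as features by samples, the transpose of the $n \times m$ units-by-features orientation described above:

```r
## Illustrative only: the R/Bioconductor convention is features x samples.
m <- 5; n <- 3                                    # made-up sizes
mat <- matrix(rnorm(n * m), nrow = m, ncol = n)   # m features x n samples
rownames(mat) <- paste0("feature", 1:m)
colnames(mat) <- paste0("sample", 1:n)
dim(mat)   # 5 3: rows are features, columns are experimental units
```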
@@ -70,7 +70,11 @@ Note that it includes an ID that permits us to connect the rows of this table wi
 ```{r}
 head(match(geneAnnotation$PROBEID,rownames(geneExpression)))
 ```
 The table also includes biological information about the features, namely the chromosome location and the gene "name" used by biologists.
 
 # Examples
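The `match()` call above returns, for each PROBEID, the row index of the corresponding row of `geneExpression`. A minimal sketch with made-up identifiers:

```r
## Illustrative only: match() gives positions of one ID vector in another.
probe_ids <- c("p3", "p1", "p2")   # made-up annotation order
expr_rows <- c("p1", "p2", "p3")   # made-up rownames(geneExpression)
match(probe_ids, expr_rows)        # 3 1 2: the matrix row for each probe
```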

File renamed without changes.
File renamed without changes.
