
Commit 54aeee9

Committed Mar 4, 2015
Merge branch 'master' of https://github.com/genomicsclass/labs
Conflicts:
	course3/eda_for_highthroughput.Rmd
	course3/inference_for_highthroughput.Rmd
	course3/intro_to_highthroughput_data.Rmd
2 parents: 4b15590 + 804deca

7 files changed: +23 −4 lines changed
 

Rscripts/cheung.R (+1 −1)
@@ -23,6 +23,6 @@ eth2[is.na(eth2)]<-"HAN" ##from LA, checked here
 ##http://ccr.coriell.org/Sections/Search/Advanced_Search.aspx?PgId=175
 
 
-pd=data.frame(ethnicity=eth2,date=dates,filename=basename(filenames))
+pd=data.frame(ethnicity=eth2,date=dates,filename=I(basename(filenames)))
 pData(e)<-pd
 save(e,file="GSE5859.rda")
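The one-line change above wraps `basename(filenames)` in `I()`. A minimal sketch of why this matters, using hypothetical file paths (under R versions where `data.frame()` defaulted to `stringsAsFactors = TRUE`, `I()` keeps the column as plain character strings instead of converting it to a factor):

```r
## Illustrative only: effect of I() inside data.frame().
filenames <- c("/tmp/GSM1.CEL", "/tmp/GSM2.CEL")   # hypothetical paths
pd1 <- data.frame(filename = basename(filenames), stringsAsFactors = TRUE)
pd2 <- data.frame(filename = I(basename(filenames)))
class(pd1$filename)  # "factor": file names coerced to factor levels
class(pd2$filename)  # "AsIs": kept as-is, i.e. character strings
```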

Rscripts/cheungSubset.R (+0 −1)
@@ -11,7 +11,6 @@ y<- colMeans(exprs(e)[which(annot$CHR=="chrY"),])
 sex <- ifelse(y<4.5,"F","M")
 
 sampleInfo <- pData(e)
-sampleInfo <- sampleInfo[,which(colnames(sampleInfo)!="filename")]
 sampleInfo$group <- ifelse(sex=="F",1,0)
 
 batch <- format(pData(e)$date,"%y%m")
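For context, the last line above builds batch labels from processing dates. A minimal sketch with made-up dates (the real ones live in `pData(e)$date`) showing how `format(date, "%y%m")` collapses dates into year-month batches:

```r
## Illustrative only: "%y%m" turns dates into year-month batch labels.
dates <- as.Date(c("2005-06-23", "2005-06-27", "2005-10-28"))  # made-up dates
batch <- format(dates, "%y%m")
batch          # "0506" "0506" "0510"
table(batch)   # two samples in batch 0506, one in batch 0510
```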

course3/eda_for_highthroughput.Rmd (+12 −0)
@@ -13,7 +13,11 @@ library(rafalib)
 ```
 
 # Introduction
 An under-appreciated advantage of working with high-throughput data is that problems with the data are sometimes more easily exposed. The fact that we have thousands of measurements permits us to see problems that are not apparent when only a few measurements are available. A powerful way to detect these problems is with exploratory data analysis (EDA). Here we review some of the plots that allow us to detect quality problems.
 
 We will use the results obtained in a previous section:

@@ -34,7 +38,11 @@ nullpvals <- rowttests(randomData,g)$p.value
 
 # Volcano plots
 
 As we described in the Introduction chapter, reporting only p-values is a mistake when we can also report effect sizes. With high-throughput data we can visualize the results by making a plot. The idea behind a _volcano plot_ is to show these for all features. On the y-axis we plot the -log (base 10) p-values and on the x-axis the effect size. By using -log (base 10), the "highly significant" features appear at the top of the plot. Using the log also permits us to better distinguish between small and very small p-values, for example 0.01 and $10^{-6}$. Here is the volcano plot for our results above:
 
 ```{r}
 plot(results$dm,-log10(results$p.value),
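The plot call above is cut off by the hunk boundary. As a self-contained illustration of the volcano plot described in that paragraph, here is a minimal sketch on simulated two-group data (the names and sizes are made up; the chapter's own plot uses the `results` object from the earlier `rowttests` call):

```r
## Illustrative sketch only: a volcano plot on simulated data.
library(genefilter)                         # for rowttests, as used in the chapter
set.seed(1)
mat <- matrix(rnorm(5000 * 16), 5000, 16)   # made-up expression matrix
g <- factor(rep(0:1, each = 8))             # made-up group labels
res <- rowttests(mat, g)                    # res$dm = effect size, res$p.value = p-values
plot(res$dm, -log10(res$p.value),
     xlab = "Effect size", ylab = "-log10 p-value")
```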
@@ -90,7 +98,11 @@ qs <- t(apply(ge,2,quantile,prob=c(0.05,0.25,0.5,0.75,0.95)))
 matplot(qs,type="l",lty=1)
 ```
 
 We can also plot all the histograms. Because we have so much data, we create histograms using small bins, smooth the heights of the bars, and then plot _smooth histograms_. We re-calibrate the height of these smooth curves so that if a bar has a base of size "unit" and height given by the curve at $x_0$, its area approximates the number of points in a region of size "unit" centered at $x_0$:
 
 ```{r}
 mypar2(1,1)
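The re-calibration idea can be sketched in a few lines of base R. This is illustrative only (made-up data; the course code uses rafalib helpers): scaling a density estimate by $n \times$ unit makes the curve height at $x_0$ approximate the number of points in a bin of width "unit":

```r
## Illustrative sketch only, not the chunk above.
set.seed(1)
x <- rnorm(1000)                      # made-up data
unit <- 0.25                          # bin width
hist(x, breaks = seq(min(x) - unit, max(x) + unit, by = unit),
     main = "smooth histogram sketch", xlab = "x")
d <- density(x)                       # smoothed version of the bar heights
lines(d$x, d$y * length(x) * unit, lwd = 2)  # re-calibrated to the count scale
```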

course3/inference_for_highthroughput.Rmd (+5 −1)
@@ -14,9 +14,13 @@ library(rafalib)
 
 # Introduction
 
 Suppose we were given high-throughput gene expression data measured for several individuals in two populations. We are asked to report which genes have different average expression levels in the two populations. Note that if, instead of thousands of genes, we were handed data from just one gene, we could simply apply the inference techniques we have learned before. We could, for example, use a t-test or some other test. Here we review what changes when we consider high-throughput data.
 
-# Thousands of test
+# Thousands of tests
 
 In this data we have two groups denoted with 0 and 1:
 ```{r}
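For a single gene the comparison described above is an ordinary two-sample t-test; the multiple-testing problem only appears once we repeat it thousands of times. A minimal sketch on simulated data (illustrative; the chapter itself uses `rowttests` on the real expression matrix):

```r
## Illustrative only: one gene = one t-test; thousands of genes = thousands of tests.
set.seed(1)
g <- rep(0:1, each = 12)              # made-up group labels
one_gene <- rnorm(24)                 # made-up expression values for one gene
t.test(one_gene[g == 0], one_gene[g == 1])$p.value
## Repeating over 10,000 null genes, about 5% reach p < 0.05 by chance alone:
pvals <- replicate(10000, t.test(rnorm(12), rnorm(12))$p.value)
mean(pvals < 0.05)
```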

course3/intro_to_highthroughput_data.Rmd (+5 −1)
@@ -35,7 +35,7 @@ So a high throughput experiment is usually defined by three tables: one with the
 Because a dataset is typically defined by a set of experimental units, and a product defines a fixed set of features, the high-throughput measurements can be stored in an $n \times m$ matrix, with $n$ the number of units and $m$ the number of features. In R the convention has been to store the transpose of these matrices. Here is an example from a gene expression dataset:
 
 ```{r}
-##can be isntalled with:
+##can be installed with:
 #library(devtools)
 #install_github("genomicsclass/GSE5859Subset")
 library(GSE5859Subset)
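To make the storage convention concrete, here is a minimal sketch (made-up dimensions and names): Bioconductor stores the matrix as features by samples, the transpose of the $n \times m$ units-by-features orientation described above:

```r
## Illustrative only: the R/Bioconductor convention is features x samples.
m <- 5; n <- 3                                    # made-up sizes
mat <- matrix(rnorm(n * m), nrow = m, ncol = n)   # m features x n samples
rownames(mat) <- paste0("feature", 1:m)
colnames(mat) <- paste0("sample", 1:n)
dim(mat)   # 5 3: rows are features, columns are experimental units
```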
@@ -70,7 +70,11 @@ Note that it includes an ID that permits us to connect the rows of this table wi
 ```{r}
 head(match(geneAnnotation$PROBEID,rownames(geneExpression)))
 ```
 The table also includes biological information about the features, namely the chromosome location and the gene "name" used by biologists.
 
 # Examples
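The `match()` call above returns, for each PROBEID, the row index of the corresponding row of `geneExpression`. A minimal sketch with made-up identifiers:

```r
## Illustrative only: match() gives positions of one ID vector in another.
probe_ids <- c("p3", "p1", "p2")   # made-up annotation order
expr_rows <- c("p1", "p2", "p3")   # made-up rownames(geneExpression)
match(probe_ids, expr_rows)        # 3 1 2: the matrix row for each probe
```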

File renamed without changes.
File renamed without changes.
