File: `course3/eda_for_highthroughput.Rmd`
# Introduction

An under-appreciated advantage of working with high-throughput data is that problems with the data are sometimes more easily exposed. The fact that we have thousands of measurements permits us to see problems that are not apparent when only a few measurements are available. A powerful way to detect these problems is with exploratory data analysis (EDA). Here we review some of the plots that allow us to detect quality problems.

We will use the results obtained in a previous section:
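
A minimal sketch of how those results could be produced, assuming (as elsewhere in this course) the GSE5859Subset data and per-gene t-tests via the Bioconductor genefilter package; adapt to whatever the previous section actually computed:

```{r}
## hypothetical reconstruction of the earlier analysis:
## per-gene t-tests comparing the two sample groups
# library(devtools)
# install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
library(genefilter)  ## provides rowttests
data(GSE5859Subset)
g <- factor(sampleInfo$group)
results <- rowttests(geneExpression, g)  ## columns: statistic, dm, p.value
head(results)
```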
As we described in the Introduction chapter, reporting only p-values is a mistake when we can also report effect sizes. With high-throughput data we can visualize the results by making a plot. The idea behind a _volcano plot_ is to show both quantities for all features. On the y-axis we plot the -log (base 10) p-values and on the x-axis the effect size. Using -log (base 10) makes the "highly significant" features appear at the top of the plot. Using the log also permits us to better distinguish between small and very small p-values, for example 0.01 and $10^{-6}$. Here is the volcano plot for our results above:
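
A sketch of such a plot, assuming `results` holds per-feature effect sizes (`dm`) and p-values, as returned by `genefilter::rowttests`:

```{r}
## volcano plot: effect size vs. -log10 p-value
plot(results$dm, -log10(results$p.value),
     xlab = "Effect size", ylab = "-log (base 10) p-values")
```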
We can also plot all the histograms. Because we have so much data, we create histograms using small bins, smooth the heights of the bars, and then plot _smooth histograms_. We re-calibrate the height of these smooth curves so that if a bar has a base of size "unit" and height given by the curve at $x_0$, its area approximates the number of points in a region of size "unit" centered at $x_0$:
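
One way to sketch this re-calibration in base R, assuming `geneExpression` is the expression matrix from GSE5859Subset (a kernel density estimate stands in for the smoothed bar heights):

```{r}
## scale a density estimate so that (height at x0) * unit approximates
## the number of points within a region of size `unit` centered at x0
x <- geneExpression[, 1]  ## expression values for one sample
unit <- 0.25
d <- density(x)
plot(d$x, d$y * length(x) * unit, type = "l",
     xlab = "expression", ylab = paste("points per", unit, "units"))
```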
File: `course3/inference_for_highthroughput.Rmd`
# Introduction

Suppose we were given high-throughput gene expression data that was measured for several individuals in two populations. We are asked to report which genes have different average expression levels in the two populations. Note that if, instead of thousands of genes, we were handed data from just one gene, we could simply apply the inference techniques that we have learned before. We could, for example, use a t-test or some other test. Here we review what changes when we consider high-throughput data.

# Thousands of tests

In this data we have two groups denoted with 0 and 1:
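
For example, assuming the `sampleInfo` table from the GSE5859Subset package used throughout this course, the group labels can be extracted as:

```{r}
library(GSE5859Subset)
data(GSE5859Subset)
g <- sampleInfo$group
g  ## a vector of 0s and 1s indicating the two populations
```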
File: `course3/intro_to_highthroughput_data.Rmd`
Because a dataset is typically defined by a set of experimental units, and a product defines a fixed set of features, the high-throughput measurements can be stored in an $n \times m$ matrix, with $n$ the number of units and $m$ the number of features. In R, the convention has been to store the transpose of these matrices. Here is an example from a gene expression dataset:
```{r}
##can be installed with:
#library(devtools)
#install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
```