MASH v FLASH workflows for GTEx

Last updated: 2018-08-15

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180609)

The command set.seed(20180609) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: e409183
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/
    Ignored:    docs/.DS_Store
    Ignored:    docs/images/.DS_Store
    Ignored:    docs/images/.Rapp.history
    Ignored:    output/.DS_Store
    Ignored:    output/.Rapp.history
    Ignored:    output/MASHvFLASHgtex/.DS_Store
    Ignored:    output/MASHvFLASHsims/.DS_Store
    Ignored:    output/MASHvFLASHsims/backfit/.DS_Store
    Ignored:    output/MASHvFLASHsims/backfit/.Rapp.history

Untracked files:
    Untracked:  analysis/MASHvFLASHgtex3.Rmd

Unstaged changes:
    Modified:   code/MASHvFLASHgtex3.R
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
Rmd	e409183	Jason Willwerscheid	2018-08-15	wflow_publish(“analysis/MASHvFLASHgtex2.Rmd”)
html	2fb332d	Jason Willwerscheid	2018-08-05	Build site.
Rmd	20c64ab	Jason Willwerscheid	2018-08-05	wflow_publish(“analysis/MASHvFLASHgtex2.Rmd”)
html	c99babc	Jason Willwerscheid	2018-08-02	Build site.
Rmd	942570e	Jason Willwerscheid	2018-08-02	wflow_publish(“analysis/MASHvFLASHgtex2.Rmd”)
html	44d612a	Jason Willwerscheid	2018-08-01	Build site.
Rmd	eaefa15	Jason Willwerscheid	2018-08-01	wflow_publish(“analysis/MASHvFLASHgtex2.Rmd”)
html	25e0a6f	Jason Willwerscheid	2018-08-01	Build site.
Rmd	36e6d62	Jason Willwerscheid	2018-08-01	wflow_publish(“analysis/MASHvFLASHgtex2.Rmd”)
html	76619ea	Jason Willwerscheid	2018-07-22	Build site.
Rmd	30e7d23	Jason Willwerscheid	2018-07-22	wflow_publish(c(“analysis/index.Rmd”,
html	860aa52	Jason Willwerscheid	2018-07-22	Build site.
Rmd	3abd505	Jason Willwerscheid	2018-07-22	wflow_publish(c(“analysis/MASHvFLASHsims.Rmd”,
html	8720e9c	Jason Willwerscheid	2018-07-01	Build site.
Rmd	1874a0d	Jason Willwerscheid	2018-07-01	wflow_publish(c(“analysis/MASHvFLASHgtex2.Rmd”,
html	ffbafca	Jason Willwerscheid	2018-06-27	Build site.
Rmd	eb90603	Jason Willwerscheid	2018-06-27	wflow_publish(“analysis/MASHvFLASHgtex2.Rmd”)
html	89b9e69	Jason Willwerscheid	2018-06-26	Build site.
Rmd	1ced908	Jason Willwerscheid	2018-06-26	wflow_publish(“analysis/MASHvFLASHgtex2.Rmd”)

Introduction

While I previously compared MASH and FLASH fits on strong tests only and on a random subset of tests, here I propose several workflows that are analogous to the one suggested for the GTEx data in this MASH vignette. For the code used in this analysis, see below.

Fitting methods

For MASH, I follow the workflow in the vignette linked above, except that I assume that the null tests are uncorrelated (that is, I set \(V = I\)). This is almost certainly not the case, but some more work needs to be done before we can handle the case \(V \ne I\) in FLASH.

The workflows for FLASH proceed along similar lines to the workflow for MASH:

I obtain “data-driven” loadings (analogous to MASH’s data-driven covariance matrices) by fitting a FLASH object to the “strong” tests using either the “OHF” method or the “Zero” method described in my simulation study.
I fix the loadings obtained in step 1 at their expectation \(EL\) and add them as fixed data-driven loadings to a FLASH object. Further, I add 45 fixed “canonical” loadings (a vector of all ones and a one-hot vector for each of 44 conditions), which can be viewed as analogous to MASH’s canonical covariance matrices. (See the discussion in my vignette.) Since the “Zero” method generates many more data-driven loadings than the “OHF” method, I experiment with only keeping a subset of the data-driven loadings from the “Zero” method.
With the loadings fixed, I backfit a FLASH object to the random subset of tests to obtain priors \(g_f\) on the factors. I use the random subset rather than the “strong” subset because I want the priors to hold generally, and not just for the strong tests.
Finally, using the same fixed loadings as in step 2 and fixing the priors \(g_f\) at the values obtained in step 3, I backfit a FLASH object to the strong tests to get posterior means and variances.

I fit five FLASH objects, using five variations of the above workflow:

methods <- data.frame(fit = as.character(1:5),
                      data.driven = c("OHF", "Zero (top 5)", 
                                      "Zero (top 10)", "Zero (top 20)",
                                      "Zero (full)"),
                      include.canonical = c(rep("Yes", 4), "No"))
knitr::kable(methods)

fit	data.driven	include.canonical
1	OHF	Yes
2	Zero (top 5)	Yes
3	Zero (top 10)	Yes
4	Zero (top 20)	Yes
5	Zero (full)	No

Comments on MASH fit

library(mashr)

Loading required package: ashr

fpath <- "./output/MASHvFLASHgtex2/"
m_final <- readRDS(paste0(fpath, "m.rds"))

It took 2.4 minutes to run Extreme Deconvolution to find data-driven covariance matrices. The MASH fit on the random subset of tests (to determine mixture weights) required 2.7 minutes, and the final MASH fit on the strong tests required only 16 seconds. The total fitting time was 5.3 minutes.

The estimated mixture weights were as follows. Note in particular that there are large weights on the data-driven matrices “ED_tPCA” and “ED_PCA_2,” as well as on the canonical “equal_effects” matrix. There are moderate weights on the null matrix, some unique effects (including testis, thyroid, and transformed fibroblasts), and two of the canonical “simple_het” matrices (where effect sizes are assumed to be of equal variance and equally correlated, with correlation coefficients of, respectively, 0.25 and 0.5).

barplot(get_estimated_pi(m_final), las = 2, cex.names = 0.4)

Expand here to see past versions of mixwts-1.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
ffbafca	Jason Willwerscheid	2018-06-27

Correlation plots for the data-driven matrices are as follows. The first (“ED_tPCA”) describes effects that are shared across a handful of tissues (which, notably, does not include brain tissues).

library(corrplot)

corrplot 0.84 loaded

corrplot(m_final$fitted_g$Ulist[["ED_tPCA"]], tl.cex=0.5)

Expand here to see past versions of corr-1.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
ffbafca	Jason Willwerscheid	2018-06-27

The second (“ED_PCA_2”) describes a pattern of sharing among brain tissues that is, interestingly, strongly anti-correlated with whole blood.

corrplot(m_final$fitted_g$Ulist[["ED_PCA_2"]], tl.cex=0.5)

Expand here to see past versions of corr2-1.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
ffbafca	Jason Willwerscheid	2018-06-27

Comments on FLASH fits

devtools::load_all("/Users/willwerscheid/GitHub/flashr/")

Loading flashr

fl_final <- list()
fl_final[[1]] <- readRDS(paste0(fpath, "OHF.rds"))
fl_final[[2]] <- readRDS(paste0(fpath, "top5.rds"))
fl_final[[3]] <- readRDS(paste0(fpath, "top10.rds"))
fl_final[[4]] <- readRDS(paste0(fpath, "top20.rds"))
fl_final[[5]] <- readRDS(paste0(fpath, "zero.rds"))

Data-driven loadings (step 1)

OHF

14.3 minutes were required to backfit the 45 canonical loadings and then greedily obtain 2 additional factor/loading pairs using the “OHF” method. The loadings thus obtained are plotted below. Clearly, they are not as “nice” (sparse, interpretable) as many of the factor/loading pairs obtained using the “Zero” method.

To understand what’s happening, I want to pause to re-consider how the OHF method works. First, OHF backfits the canonical loadings. At this stage, the effect estimates have a simple interpretation: for each test, the effect in each condition is shrunken towards the mean effect (or rather, a shrunken estimate of the mean). This shrinkage is performed as if the deviations from the mean were independent. Second, data-driven loadings are greedily fitted to the residuals obtained from the backfit. These loadings will capture covariance structures that are more complex than the simple dependency structure captured by the canonical loadings (that is, dependence via the mean).

I believe that the first data-driven loading (#46) is capturing two (or more) separate dependency structures. Clearly, it is capturing covariance among brain tissues. But notice also that the tissues that are negatively loaded are the same tissues that have large weights in the MASH “ED_tPCA” structure depicted above. Further, notice that the non-brain tissues that are positively loaded are anti-correlated with brain tissues according to the second data-driven loading (#47). Thus, a linear combination of #46 and #47 might yield a covariance structure that captures positive correlation among brain tissues alone (which is more or less what is described by the MASH “ED_PCA_2” structure illustrated above).

In brief, I think that the dependencies implied by the OHF method are not so different from those implied by MASH as might at first appear. I think that this similarity could be further explored by placing non-negativity constraints on the loadings, which I intend to do in a future analysis.

missing.tissues <- c(7, 8, 19, 20, 24, 25, 31, 34, 37)
gtex.colors <- read.table("https://github.com/stephenslab/gtexresults/blob/master/data/GTExColors.txt?raw=TRUE", sep = '\t', comment.char = '')[-missing.tissues, 2]

par(mar=c(1,1,1,1))
par(mfrow=c(2,1))
for (i in 46:47) {
  barplot(fl_final[[1]]$EL[, i], main=paste('Loading', i), las=2,
          cex.names = 0.4, col=as.character(gtex.colors), names="")
}

Expand here to see past versions of dd.OHF-1.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01

Zero

6.3 minutes were required to greedily add 37 factor/loading pairs using the “Zero” method. To view plots of the loadings, scroll down to the “FLASH loadings” section below. Note in particular that most of the loadings beyond the first 16 or so are strongly loaded on a single condition. This observation motivated my desire to discard the majority of data-driven loadings. I reasoned that retaining only the “top” data-driven loadings would cut down on the time needed to backfit but would not change the final fit very much.

Priors on factors (steps 2-3)

The times needed to fit priors to the random subset of tests were as follows.

t <- readRDS(paste0(fpath, "t.rds"))
t_random <- data.frame(fit = as.character(1:5),
                       fixed.loadings = c("Canonical + OHF", "Canonical + Top 5 Zero", "Canonical + Top 10 Zero", "Canonical + Top 20 Zero", "Zero only"), 
                       number.of.factors = sapply(fl_final, function(x) {flash_get_nfactors(x)}),
                       minutes.to.fit = sapply(t$random, function(x) {as.numeric(x, units="mins")})
)
knitr::kable(t_random, digits=1)

fit	fixed.loadings	number.of.factors	minutes.to.fit
1	Canonical + OHF	47	3.0
2	Canonical + Top 5 Zero	50	8.5
3	Canonical + Top 10 Zero	55	17.6
4	Canonical + Top 20 Zero	65	17.0
5	Zero only	37	0.9

To inspect the fitted priors \(g_f \sim (1 - w_f) \delta_0 + w_f N(0, \sigma^2_f)\), I plot the \(w_f\)s (that is, the proportion of genes that we expect to have nonnull loadings for each factor/loading pair). The “equal effects” factor is colored black, while the “unique effects” factors are colored with the GTEx colors used in the factor plots above. The data-driven factors are colored brown (OHF) or in grayscale (Zero).

par(mfrow = c(1, 2))
plot_gf.w <- function(fl, names, colors, main) {
  w <- 1 - sapply(fl$gf, function(g) {g$pi0})
  barplot(w, ylim=c(0, 1), ylab="w", xlab="factor/loading",
          cex.names=0.4, names=names, col=colors, main=main)
}
canonical.names <- c("EE", as.character(1:44))
canonical.colors <- c("black", as.character(gtex.colors))
OHF.colors <- c("tan4", "tan3")
zero.colors <- c("black", gray.colors(19, 0.2, 0.9), 
                 gray.colors(17, 0.95, 1))
plot_gf.w(fl_final[[1]], 
          c(canonical.names, paste0("D", 1:2)),
          c(canonical.colors, OHF.colors),
          main="Canonical + OHF")
plot_gf.w(fl_final[[2]], 
          c(canonical.names, paste0("D", 1:5)),
          c(canonical.colors, zero.colors[1:5]),
          main="Canonical + Top 5")

Expand here to see past versions of plot_w-1.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

plot_gf.w(fl_final[[3]], 
          c(canonical.names, paste0("D", 1:10)),
          c(canonical.colors, zero.colors[1:10]),
          main="Canonical + Top 10")
plot_gf.w(fl_final[[4]], 
          c(canonical.names, paste0("D", 1:20)),
          c(canonical.colors, zero.colors[1:20]),
          main="Canonical + Top 20")

Expand here to see past versions of plot_w-2.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

plot_gf.w(fl_final[[5]], 
          paste0("D", 1:37),
          zero.colors,
          main="Zero")

Expand here to see past versions of plot_w-3.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

It is also instructive to plot the \(\sigma_f\)s, normalized so that the \(\ell_\infty\)-norm for each loading is equal to one. These plots roughly describe how large we can expect non-null identical and unique effects to be. Note in particular that the “unique effects” loadings capture rare but large effects, whereas the data-driven loadings capture smaller but more common effects.

par(mfrow = c(1, 2))
plot_gf.s <- function(fl, names, colors, main) {
  nrm <- apply(abs(fl$EL), 2, max)
  s <- sqrt(1 / sapply(fl$gf, function(g) {g$a})) * nrm
  barplot(s, ylim=c(0, 6), ylab="sigma", xlab="factor/loading",
          cex.names=0.4, names=names, col=colors, main=main)
}
plot_gf.s(fl_final[[1]], 
          c(canonical.names, paste0("D", 1:2)),
          c(canonical.colors, OHF.colors),
          main="Canonical + OHF")
plot_gf.s(fl_final[[2]], 
          c(canonical.names, paste0("D", 1:5)),
          c(canonical.colors, zero.colors[1:5]),
          main="Canonical + Top 5")

Expand here to see past versions of plot_s-1.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

plot_gf.s(fl_final[[3]], 
          c(canonical.names, paste0("D", 1:10)),
          c(canonical.colors, zero.colors[1:10]),
          main="Canonical + Top 10")
plot_gf.s(fl_final[[4]], 
          c(canonical.names, paste0("D", 1:20)),
          c(canonical.colors, zero.colors[1:20]),
          main="Canonical + Top 20")

Expand here to see past versions of plot_s-2.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

plot_gf.s(fl_final[[5]], 
          paste0("D", 1:37),
          zero.colors,
          main="Zero")

Expand here to see past versions of plot_s-3.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02

Final backfit (step 4)

The times needed for the final backfits (with loadings and priors \(g_f\) fixed at values determined during steps 1-3) were as follows.

t_final <- data.frame(fit = as.character(1:5),
                      fixed.loadings = c("Canonical + OHF", "Canonical + Top 5 Zero", "Canonical + Top 10 Zero", "Canonical + Top 20 Zero", "Zero only"), 
                       number.of.factors = sapply(fl_final, function(x) {flash_get_nfactors(x)}),
                       minutes.to.fit = sapply(t$final, function(x) {as.numeric(x, units="mins")})
)
knitr::kable(t_final, digits=1)

fit	fixed.loadings	number.of.factors	minutes.to.fit
1	Canonical + OHF	47	4.9
2	Canonical + Top 5 Zero	50	14.4
3	Canonical + Top 10 Zero	55	17.8
4	Canonical + Top 20 Zero	65	34.8
5	Zero only	37	0.4

The proportion of variance explained per factor/loading pair is as follows. Since, in each case, the “equal effects” factor/loading pair constitutes by far the largest proportion of variance explained (around 0.7), I only plot the PVE for the other factor/loading pairs.

par(mfrow = c(1, 2))
plot_pve <- function(fl, names, colors, main) {
  pve <- flash_get_pve(fl)
  barplot(pve[2:length(pve)], ylim=c(0, .05), ylab="PVE", 
          xlab="factor/loading", cex.names=0.4,
          names=names, col=colors, main=main)
}
canonical.names <- as.character(1:44)
canonical.colors <- canonical.colors[2:45]
plot_pve(fl_final[[1]], 
         c(canonical.names, paste0("D", 1:2)),
         c(canonical.colors, OHF.colors),
         main="Canonical + OHF")
plot_pve(fl_final[[2]], 
         c(canonical.names, paste0("D", 1:5)),
         c(canonical.colors, zero.colors[1:5]),
         main="Canonical + Top 5")

Expand here to see past versions of pve-1.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

plot_pve(fl_final[[3]], 
         c(canonical.names, paste0("D", 1:10)),
         c(canonical.colors, zero.colors[1:10]),
         main="Canonical + Top 10")
plot_pve(fl_final[[4]], 
         c(canonical.names, paste0("D", 1:20)),
         c(canonical.colors, zero.colors[1:20]),
         main="Canonical + Top 20")

Expand here to see past versions of pve-2.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

plot_pve(fl_final[[5]], 
         paste0("D", 2:37),
         zero.colors[2:37],
         main="Zero")

Expand here to see past versions of pve-3.png:

Version	Author	Date
c99babc	Jason Willwerscheid	2018-08-02
25e0a6f	Jason Willwerscheid	2018-08-01

Total fitting time

The total time needed for each workflow is as follows.

fl_t <- (t_random$minutes.to.fit + t_final$minutes.to.fit
         + c(t$strong$OHF, t$strong$zero * c(5, 10, 20, 37) / 37))
total_time <- data.frame(method = c("OHF", "Top 5", "Top 10", 
                                    "Top 20", "Zero", "MASH"),
                         total.time.to.fit = c(fl_t, 5.3))
knitr::kable(total_time, digits=1)

method	total.time.to.fit
OHF	22.2 mins
Top 5	23.8 mins
Top 10	37.1 mins
Top 20	55.2 mins
Zero	7.5 mins
MASH	5.3 mins

MASH v FLASH initial observations

There is substantial disagreement among fits. The correlation matrix for posterior means is as follows.

all_fitted <- sapply(fl_final, flash_get_fitted_values)
all_fitted <- cbind(all_fitted, as.vector(t(get_pm(m_final))))
colnames(all_fitted) <- c("OHF", "Top 5", "Top 10", 
                          "Top 20", "Zero", "MASH")
round(cor(all_fitted), digits=3)

         OHF Top 5 Top 10 Top 20  Zero  MASH
OHF    1.000 0.984  0.983  0.983 0.971 0.966
Top 5  0.984 1.000  0.995  0.994 0.966 0.961
Top 10 0.983 0.995  1.000  0.999 0.968 0.963
Top 20 0.983 0.994  0.999  1.000 0.969 0.963
Zero   0.971 0.966  0.968  0.969 1.000 0.979
MASH   0.966 0.961  0.963  0.963 0.979 1.000

Entry \((i, j)\) in the following matrix gives, of the values that method \(j\) finds to be significant at 5%, the proportion of values that method \(i\) also finds to be significant at 5%. For example, of the values that MASH finds to be significant, OHF also finds 90% to be significant. But of the values that OHF finds to be significant, MASH only agrees in 81% of cases.

fl_lfsr <- readRDS(paste0(fpath, "fllfsr.rds"))

all_signif <- sapply(fl_lfsr, function(x) {as.numeric(x <= 0.05)})
all_signif <- cbind(all_signif, as.vector(t(get_lfsr(m_final) <= 0.05)))
m <- ncol(all_signif)
agree_mat <- matrix(0, nrow=m, ncol=m)
for (i in 1:m) {
  for (j in 1:m) {
    agree_mat[i, j] <- (sum(all_signif[, i] * all_signif[, j]) /
                          sum(all_signif[, j]))
  }
}
rownames(agree_mat) <- colnames(agree_mat) <- colnames(all_fitted)
round(agree_mat, digits=2)

        OHF Top 5 Top 10 Top 20 Zero MASH
OHF    1.00  0.93   0.92   0.92 0.90 0.90
Top 5  0.93  1.00   0.96   0.96 0.92 0.92
Top 10 0.92  0.96   1.00   0.98 0.93 0.92
Top 20 0.92  0.95   0.97   1.00 0.93 0.91
Zero   0.89  0.91   0.92   0.92 1.00 0.91
MASH   0.81  0.83   0.83   0.83 0.83 1.00

I will offer some explanations for the differences among fits in subsequent analyses.

FLASH loadings

The FLASH loadings (as fitted on the strong tests) are as follows.

par(mar=c(1,1,1,1))
par(mfrow=c(3,2))
for(i in 1:flash_get_nfactors(fl_final[[5]])){
  barplot(fl_final[[5]]$EL[, i], main=paste('Loading', i), las=2,
          cex.names = 0.4, col=as.character(gtex.colors), names="")
}

Expand here to see past versions of flash_factors-1.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
860aa52	Jason Willwerscheid	2018-07-22
89b9e69	Jason Willwerscheid	2018-06-26

Expand here to see past versions of flash_factors-2.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
860aa52	Jason Willwerscheid	2018-07-22
89b9e69	Jason Willwerscheid	2018-06-26

Expand here to see past versions of flash_factors-3.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
860aa52	Jason Willwerscheid	2018-07-22
89b9e69	Jason Willwerscheid	2018-06-26

Expand here to see past versions of flash_factors-4.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
860aa52	Jason Willwerscheid	2018-07-22
89b9e69	Jason Willwerscheid	2018-06-26

Expand here to see past versions of flash_factors-5.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01
860aa52	Jason Willwerscheid	2018-07-22
89b9e69	Jason Willwerscheid	2018-06-26

Expand here to see past versions of flash_factors-6.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01

Expand here to see past versions of flash_factors-7.png:

Version	Author	Date
25e0a6f	Jason Willwerscheid	2018-08-01

Code

Click “Code” to view the code used to obtain the above results.

# Make sure to use branch "trackObj" when loading flashr.

# devtools::install_github("stephenslab/flashr", ref="trackObj")
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
# devtools::install_github("stephenslab/ebnm")
devtools::load_all("/Users/willwerscheid/GitHub/ebnm/")

library(mashr)

source("./code/fits.R")
source("./code/utils.R")

fpath <- "./output/MASHvFLASHgtex2/"

gtex <- readRDS(gzcon(url("https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds?raw=TRUE")))

strong <- t(gtex$strong.z)
random <- t(gtex$random.z)


# FLASH fits ------------------------------------------------------------

strong_data <- flash_set_data(strong, S = 1)
random_data <- flash_set_data(random, S = 1)

# methods to get DD: OHF; Zero(top5, top10, top20, full)
# LLs to use: canonical + OHF; Zero; canonical + top5/top10/top15; full

# Step 1. Learn data-driven loadings using strong tests.
t0 <- Sys.time()
fl_strong.OHF <- fit_flash_OHF(strong_data, Kmax=50, backfit=FALSE)
t_strong.OHF <- Sys.time() - t0

t0 <- Sys.time()
fl_strong.zero <- fit_flash_zero(strong_data, Kmax=50, backfit=FALSE)
t_strong.zero <- Sys.time() - t0

# Step 2. Fit the model to random tests to learn priors on factors.
LL <- list()
LL[[1]] <- fl_strong.OHF$f$EL

n <- nrow(strong)
canonical <- cbind(rep(1, n), diag(rep(1, n)))
LL[[2]] <- cbind(canonical, fl_strong.zero$f$EL[, 1:5])
LL[[3]] <- cbind(canonical, fl_strong.zero$f$EL[, 1:10])
LL[[4]] <- cbind(canonical, fl_strong.zero$f$EL[, 1:20])
LL[[5]] <- fl_strong.zero$f$EL

fl_random <- list()
t_random <- list()
for (i in 1:length(LL)) {
  t0 <- Sys.time()
  fl <- flash_add_fixed_l(random_data, LL[[i]])
  fl <- flash_backfit(random_data,
                      fl,
                      var_type="zero",
                      ebnm_fn = "ebnm_pn",
                      nullcheck = FALSE,
                      warmstart = TRUE,
                      verbose = TRUE)
  t_random[[i]] <- Sys.time() - t0
  fl_random[[i]] <- fl$f
}

# Step 3. Compute posteriors on strong tests, using priors from step 2.

fl_final <- list()
t_final <- list()
for (i in 1:length(LL)) {
  t0 <- Sys.time()
  fl <- flash_add_fixed_l(strong_data, LL[[i]])
  ebnm_param_f = lapply(fl_random[[i]]$gf, function(g) {list(g=g, fixg=TRUE)})
  fl <- flash_backfit(strong_data,
                      fl,
                      var_type="zero",
                      ebnm_fn = "ebnm_pn",
                      ebnm_param = list(f = ebnm_param_f, l = list()),
                      nullcheck = FALSE,
                      verbose = TRUE)
  t_final[[i]] <- Sys.time() - t0
  fl_final[[i]] <- fl$f
}

saveRDS(fl_final[[1]], paste0(fpath, "OHF.rds"))
saveRDS(fl_final[[2]], paste0(fpath, "top5.rds"))
saveRDS(fl_final[[3]], paste0(fpath, "top10.rds"))
saveRDS(fl_final[[4]], paste0(fpath, "top20.rds"))
saveRDS(fl_final[[5]], paste0(fpath, "zero.rds"))

t <- list()
t$strong <- list()
t$strong$OHF <- t_strong.OHF
t$strong$zero <- t_strong.zero
t$random <- t_random
t$final <- t_final


# Sample from posterior to get LFSR for FLASH fits ----------------------

nsamp <- 200
fl_lfsr <- list()
for (i in 1:length(fl_final)) {
  sampler <- flash_sampler(strong_data, fl_final[[i]], fixed="loadings")
  samp <- sampler(200)
  fl_lfsr[[i]] <- flash_lfsr(samp)
}

saveRDS(fl_lfsr, paste0(fpath, "fllfsr.rds"))


# MASH fit --------------------------------------------------------------

strong_data <- mash_set_data(t(strong), Shat = 1)
random_data <- mash_set_data(t(random), Shat = 1)

t$mash <- list()

# Step 1. Learn data-driven loadings using strong tests.
t0 <- Sys.time()
U.pca <- cov_pca(strong_data, 5)
U.ed <- cov_ed(strong_data, U.pca)
t$mash$ed <- Sys.time() - t0

# 2. Fit the model to random tests to learn mixture weights
t0 <- Sys.time()
U.c <- cov_canonical(random_data)
m_random <- mash(random_data, Ulist = c(U.ed,U.c))
t$mash$random <- Sys.time() - t0

# 3. Compute posterior summaries on the strong tests
t0 <- Sys.time()
m_final <- mash(strong_data, g=get_fitted_g(m_random), fixg=TRUE)
t$mash$final <- Sys.time() - t0

saveRDS(m_final, paste0(fpath, "m.rds"))
saveRDS(t, paste0(fpath, "t.rds"))

Session information

sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] flashr_0.5-14 corrplot_0.84 mashr_0.2-7   ashr_2.2-10  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17        pillar_1.2.1        compiler_3.4.3     
 [4] git2r_0.21.0        highr_0.6           plyr_1.8.4         
 [7] workflowr_1.0.1     R.methodsS3_1.7.1   R.utils_2.6.0      
[10] iterators_1.0.9     tools_3.4.3         testthat_2.0.0     
[13] digest_0.6.15       tibble_1.4.2        gtable_0.2.0       
[16] evaluate_0.10.1     memoise_1.1.0       lattice_0.20-35    
[19] rlang_0.2.0         Matrix_1.2-12       foreach_1.4.4      
[22] commonmark_1.4      yaml_2.1.17         parallel_3.4.3     
[25] ebnm_0.1-12         mvtnorm_1.0-7       xml2_1.2.0         
[28] roxygen2_6.0.1.9000 withr_2.1.1.9000    stringr_1.3.0      
[31] knitr_1.20          devtools_1.13.4     rprojroot_1.3-2    
[34] grid_3.4.3          R6_2.2.2            rmarkdown_1.8      
[37] rmeta_3.0           ggplot2_2.2.1       magrittr_1.5       
[40] whisker_0.3-2       scales_0.5.0        backports_1.1.2    
[43] codetools_0.2-15    htmltools_0.3.6     MASS_7.3-48        
[46] assertthat_0.2.0    softImpute_1.4      colorspace_1.3-2   
[49] stringi_1.1.6       lazyeval_0.2.1      munsell_0.4.3      
[52] doParallel_1.0.11   pscl_1.5.2          truncnorm_1.0-8    
[55] SQUAREM_2017.10-1   R.oo_1.21.0

This reproducible R Markdown analysis was created with workflowr 1.0.1