Last updated: 2018-08-13
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180609)
The command set.seed(20180609)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 07c51b7
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/
Ignored: docs/.DS_Store
Ignored: docs/images/.DS_Store
Ignored: docs/images/.Rapp.history
Ignored: output/.DS_Store
Ignored: output/.Rapp.history
Ignored: output/MASHvFLASHgtex/.DS_Store
Ignored: output/MASHvFLASHsims/.DS_Store
Ignored: output/MASHvFLASHsims/backfit/.DS_Store
Ignored: output/MASHvFLASHsims/backfit/.Rapp.history
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 07c51b7 | Jason Willwerscheid | 2018-08-13 | wflow_publish(“analysis/MASHvFLASHnn.Rmd”) |
I argued in the previous analysis that better loadings might be obtained if nonnegative priors were used. Here I fit a flash object to the “strong” GTEx dataset using “+uniform” ash priors on the loadings.
I alternate between greedily adding factor/loading pairs (with “+uniform” ash priors on the loadings and normal-mixture ash priors on the factors) and backfitting the entire flash object. For the backfits, I set tol = 1
because convergence can be very slow. I repeat these two steps until flash_add_greedy
no longer adds any additional factor/loadings.
iteration <- 1:7
factors_added <- c(34, 1, 1, 1, 2, 1, 0)
factors_zeroed_out <- c(0, 0, 1, 0, 0, 0, 0)
objective <- -1250000 - c(7689, 7595, 7140, 6685, 6362, 6098, 6098)
minutes_taken <- c(61, 3, 13, 26, 8, 16, 0)
data <- data.frame("Iteration" = iteration,
"Factors added" = factors_added,
"Factors deleted" = factors_zeroed_out,
"Final objective" = objective,
"Minutes" = minutes_taken)
knitr::kable(data)
Iteration | Factors.added | Factors.deleted | Final.objective | Minutes |
---|---|---|---|---|
1 | 34 | 0 | -1257689 | 61 |
2 | 1 | 0 | -1257595 | 3 |
3 | 1 | 1 | -1257140 | 13 |
4 | 1 | 0 | -1256685 | 26 |
5 | 2 | 0 | -1256362 | 8 |
6 | 1 | 0 | -1256098 | 16 |
7 | 0 | 0 | -1256098 | 0 |
Finally, I tighten the tolerance to 0.1 and run a final backfit, which increases the objective to -1256080 (this takes 25 minutes).
I load the results from file.
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
Loading flashr
fl <- readRDS("./output/MASHvFLASHnn/fl.rds")
fl_g <- readRDS("./output/MASHvFLASHnn/fl_g.rds")
The first loading is well represented by the canonical “equal effects” loading.
missing.tissues <- c(7, 8, 19, 20, 24, 25, 31, 34, 37)
gtex.colors <- read.table("https://github.com/stephenslab/gtexresults/blob/master/data/GTExColors.txt?raw=TRUE", sep = '\t', comment.char = '')[-missing.tissues, 2]
pve <- flash_get_pve(fl, drop_zero_factors = FALSE)
par(mar=c(1,1,1,1))
barplot(fl$EL[, 1],
main=paste0('Loading 1: PVE = ', signif(pve[1], digits=3)),
las=2, cex.names=0.4, yaxt='n',
col=as.character(gtex.colors), names="")
Many loadings are well approximated by canonical “unique effects.” Loadings 3, 4, 7, 9-10, 14-16, 18-21, 26-29, and 32-40 place nearly all of their weight on a single tissue.
uniq <- c(3, 4, 7, 9, 10, 14:16, 18:21, 26:29, 32:40)
uniq_order <- order(pve[uniq], decreasing = TRUE)
tissue_names <- rownames(fl$EL)
par(mar=c(1,1,1,1))
par(mfrow=c(3,2))
for(i in uniq[uniq_order]){
barplot(fl$EL[, i],
main=paste0(tissue_names[which.max(fl$EL[, i])],
': PVE = ', signif(pve[i], digits=3)),
las=2, cex.names=0.4, yaxt='n',
col=as.character(gtex.colors), names="")
}
The remaining loadings pick up on effects that are strongly correlated across several tissues. These are the loadings I’m especially interested in, since they correspond to correlation structures that are not covered by canonical loadings.
multi <- c(2, 5, 6, 8, 11:13, 17, 22:25, 31)
multi_order <- order(pve[multi], decreasing = TRUE)
display_names <- strtrim(tissue_names, 10)
for(i in multi[multi_order]){
plot_names <- display_names
plot_names[fl$EL[, i] < 0.25 * max(fl$EL[, i])] <- ""
barplot(fl$EL[, i],
main=paste0('Loading ', i,
': PVE = ', signif(pve[i], digits=3)),
las=2, cex.names=0.8, yaxt='n',
col=as.character(gtex.colors), names=plot_names)
}
As indicated above, these loadings take about two and a half hours to obtain. However, loadings that are qualitatively very similar can be obtained via a single call to flash_add_greedy
(with no backfitting), which takes less than 15 minutes. A comparison between these easily obtained loadings and the loadings yielded by the more laborious procedure detailed above follows:
par(mfrow=c(2,2))
display_names <- strtrim(tissue_names, 20)
for(i in multi[multi_order]){
plot_names <- display_names
plot_names[fl_g$EL[, i] < 0.25 * max(fl_g$EL[, i])] <- ""
barplot(fl_g$EL[, i],
main=paste0('Greedy loading ', i),
las=2, cex.names=0.4, yaxt='n',
col=as.character(gtex.colors), names=plot_names)
barplot(fl$EL[, i],
main=paste0('Backfitted loading ', i),
las=2, cex.names=0.4, yaxt='n',
col=as.character(gtex.colors), names=plot_names)
}
Click “Code” to view the code used to obtain the above results.
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
gtex <- readRDS(gzcon(url("https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds?raw=TRUE")))
strong <- t(gtex$strong.z)
strong_data <- flash_set_data(strong, S = 1)
ebnm_param = list(f = list(), l = list(mixcompdist="+uniform"))
system.time(
fl <- flash_add_greedy(strong_data,
100,
var_type="zero",
ebnm_fn="ebnm_ash",
ebnm_param=ebnm_param)
)
saveRDS(fl, "/Users/willwerscheid/GitHub/MASHvFLASH/output/MASHvFLASHnn/fl_g.rds")
system.time(
fl <- flash_backfit(strong_data,
f_init=fl,
var_type="zero",
ebnm_fn="ebnm_ash",
ebnm_param=ebnm_param,
tol=1)
)
# Repeat the following two steps until flash_add_greedy no longer adds
# any factors:
system.time(
fl <- flash_add_greedy(strong_data,
100,
f_init=fl,
var_type="zero",
ebnm_fn="ebnm_ash",
ebnm_param=ebnm_param)
)
system.time(
fl <- flash_backfit(strong_data,
f_init=fl,
var_type="zero",
ebnm_fn="ebnm_ash",
ebnm_param=ebnm_param,
tol=1)
)
saveRDS(fl, "/Users/willwerscheid/GitHub/MASHvFLASH/output/MASHvFLASHnn/fl.rds")
# Tighten the tolerance and run a final backfit:
system.time(
fl <- flash_backfit(strong_data,
f_init=fl,
var_type="zero",
ebnm_fn="ebnm_ash",
ebnm_param=ebnm_param,
tol=0.1)
)
saveRDS(fl, "/Users/willwerscheid/GitHub/MASHvFLASH/output/MASHvFLASHnn/fl.rds")
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] flashr_0.5-14
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 pillar_1.2.1 plyr_1.8.4
[4] compiler_3.4.3 git2r_0.21.0 highr_0.6
[7] workflowr_1.0.1 R.methodsS3_1.7.1 R.utils_2.6.0
[10] iterators_1.0.9 tools_3.4.3 testthat_2.0.0
[13] digest_0.6.15 tibble_1.4.2 evaluate_0.10.1
[16] memoise_1.1.0 gtable_0.2.0 lattice_0.20-35
[19] rlang_0.2.0 Matrix_1.2-12 foreach_1.4.4
[22] commonmark_1.4 yaml_2.1.17 parallel_3.4.3
[25] ebnm_0.1-12 withr_2.1.1.9000 stringr_1.3.0
[28] roxygen2_6.0.1.9000 xml2_1.2.0 knitr_1.20
[31] devtools_1.13.4 rprojroot_1.3-2 grid_3.4.3
[34] R6_2.2.2 rmarkdown_1.8 ggplot2_2.2.1
[37] ashr_2.2-10 magrittr_1.5 whisker_0.3-2
[40] scales_0.5.0 backports_1.1.2 codetools_0.2-15
[43] htmltools_0.3.6 MASS_7.3-48 assertthat_0.2.0
[46] softImpute_1.4 colorspace_1.3-2 stringi_1.1.6
[49] lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11
[52] pscl_1.5.2 truncnorm_1.0-8 SQUAREM_2017.10-1
[55] R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1