getting-started.Rmd
presto makes it fast and easy to run Wilcoxon rank sum test and auROC analysis on large datasets. The tutorial below shows how to install presto, walks through the 3 major ways you can use presto with your data, and finally explores more advanced use cases.
To install the current stable release from CRAN:
install.packages('presto')
For the cutting edge version of presto:
library(devtools)
install_github('immunogenomics/presto')
The main function in this vignette is wilcoxauc
. presto currently supports 3 interfaces to wilcoxauc
, with a matrix, Seurat object, or SingleCellExperiment object. The output of wilcoxauc
is described in the next section.
The most general use of presto is with a matrix of features and observations (exprs) paired with a vector of group labels (y).
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## G1 0 1 1 2 1 1 1 2 0 1
## G2 1 0 2 2 0 1 1 1 0 0
## G3 1 1 1 1 0 0 0 2 1 0
## G4 2 2 0 2 4 0 1 1 0 1
## G5 0 0 0 3 1 1 1 2 0 2
## G6 2 1 0 1 0 1 0 0 1 0
## [1] A A A A A A
## Levels: A B C
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 A 1.15 0.082 2890 0.55 0.257 0.64 75 59
## 2 G2 A 0.76 -0.300 2190 0.42 0.081 0.41 49 65
## 3 G3 A 0.98 -0.187 2429 0.46 0.452 0.75 65 68
## 4 G4 A 1.22 0.397 3176 0.61 0.020 0.41 71 56
## 5 G5 A 1.42 0.366 3026 0.58 0.093 0.41 73 57
## 6 G6 A 0.93 -0.178 2424 0.46 0.439 0.75 56 60
We support interfacing with Seurat Version 3 objects. In the most basic use case, specify the Seurat object and the meta data variable that defines the group labels.
data(object_seurat)
# ensure object structure matches with the installed Seurat version
object_seurat <- Seurat::UpdateSeuratObject(object_seurat)
## Validating object structure
## Updating object slots
## Ensuring keys are in the proper structure
## Ensuring feature names don't have underscores or pipes
## Updating slots in RNA
## Updating slots in pca
## Setting assay used for NormalizeData.RNA to RNA
## Setting assay used for FindVariableFeatures.RNA to RNA
## Setting assay used for ScaleData.RNA to RNA
## Setting assay used for RunPCA.RNA to RNA
## Object representation is consistent with the most current Seurat version
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 6.0 0.0406 11343 0.50 0.90 1.00 100 100
## 2 G2 jurkat 5.9 -0.0278 11209 0.50 0.96 1.00 100 100
## 3 G3 jurkat 5.9 -0.0584 10306 0.46 0.21 0.99 100 100
## 4 G4 jurkat 5.9 -0.0597 10859 0.48 0.61 1.00 100 100
## 5 G5 jurkat 6.0 -0.0202 11244 0.50 1.00 1.00 100 100
## 6 G6 jurkat 5.9 -0.0075 11038 0.49 0.78 1.00 100 100
Seurat objects can store multiple assays. The assay used by wilcoxauc
can be specified with the seurat_assay
argument.
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 6.0 0.0406 11343 0.50 0.90 1.00 100 100
## 2 G2 jurkat 5.9 -0.0278 11209 0.50 0.96 1.00 100 100
## 3 G3 jurkat 5.9 -0.0584 10306 0.46 0.21 0.99 100 100
## 4 G4 jurkat 5.9 -0.0597 10859 0.48 0.61 1.00 100 100
## 5 G5 jurkat 6.0 -0.0202 11244 0.50 1.00 1.00 100 100
## 6 G6 jurkat 5.9 -0.0075 11038 0.49 0.78 1.00 100 100
Seurat supports multiple feature expression matrices, such as raw counts, library normalized data, and scaled data. These can be accessed with assay
.
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 0.51 0.0046 11385 0.51 0.85 0.97 100 100
## 2 G2 jurkat 0.52 -0.0032 11221 0.50 0.97 0.97 100 100
## 3 G3 jurkat 0.48 -0.0373 10398 0.46 0.26 0.97 100 100
## 4 G4 jurkat 0.51 -0.0172 10865 0.48 0.61 0.97 100 100
## 5 G5 jurkat 0.51 0.0015 11301 0.50 0.94 0.97 100 100
## 6 G6 jurkat 0.50 -0.0065 11137 0.50 0.89 0.97 100 100
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 6.0 0.0406 11343 0.50 0.90 1.00 100 100
## 2 G2 jurkat 5.9 -0.0278 11209 0.50 0.96 1.00 100 100
## 3 G3 jurkat 5.9 -0.0584 10306 0.46 0.21 0.99 100 100
## 4 G4 jurkat 5.9 -0.0597 10859 0.48 0.61 1.00 100 100
## 5 G5 jurkat 6.0 -0.0202 11244 0.50 1.00 1.00 100 100
## 6 G6 jurkat 5.9 -0.0075 11038 0.49 0.78 1.00 100 100
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 0.0206 0.0420 11343 0.50 0.90 1.00 100 100
## 2 G2 jurkat -0.0140 -0.0286 11209 0.50 0.96 1.00 100 100
## 3 G3 jurkat -0.0329 -0.0671 10306 0.46 0.21 0.99 100 100
## 4 G4 jurkat -0.0314 -0.0641 10859 0.48 0.61 1.00 100 100
## 5 G5 jurkat -0.0114 -0.0232 11244 0.50 1.00 1.00 100 100
## 6 G6 jurkat -0.0038 -0.0077 11038 0.49 0.78 1.00 100 100
presto supports the Bioconductor data structure SingleCellExperiment. Again, the most simple use case takes a SingleCellExperiment object and the metadata field with group labels.
## Loading required package: SingleCellExperiment
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:data.table':
##
## first, second
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:data.table':
##
## shift
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 6.0 0.0406 11343 0.50 0.90 1.00 100 100
## 2 G2 jurkat 5.9 -0.0278 11209 0.50 0.96 1.00 100 100
## 3 G3 jurkat 5.9 -0.0584 10306 0.46 0.21 0.99 100 100
## 4 G4 jurkat 5.9 -0.0597 10859 0.48 0.61 1.00 100 100
## 5 G5 jurkat 6.0 -0.0202 11244 0.50 1.00 1.00 100 100
## 6 G6 jurkat 5.9 -0.0075 11038 0.49 0.78 1.00 100 100
SingleCellExperiment can have several data slots, such as counts and logcounts. These can be accessed with assay
.
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 0.51 0.0046 11385 0.51 0.85 0.97 100 100
## 2 G2 jurkat 0.52 -0.0032 11221 0.50 0.97 0.97 100 100
## 3 G3 jurkat 0.48 -0.0373 10398 0.46 0.26 0.97 100 100
## 4 G4 jurkat 0.51 -0.0172 10865 0.48 0.61 0.97 100 100
## 5 G5 jurkat 0.51 0.0015 11301 0.50 0.94 0.97 100 100
## 6 G6 jurkat 0.50 -0.0065 11137 0.50 0.89 0.97 100 100
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 jurkat 6.0 0.0406 11343 0.50 0.90 1.00 100 100
## 2 G2 jurkat 5.9 -0.0278 11209 0.50 0.96 1.00 100 100
## 3 G3 jurkat 5.9 -0.0584 10306 0.46 0.21 0.99 100 100
## 4 G4 jurkat 5.9 -0.0597 10859 0.48 0.61 1.00 100 100
## 5 G5 jurkat 6.0 -0.0202 11244 0.50 1.00 1.00 100 100
## 6 G6 jurkat 5.9 -0.0075 11038 0.49 0.78 1.00 100 100
All inputs for wilcoxauc
give the same table of results.
parameter | description |
---|---|
feature | name of feature. |
group | name of group label. |
avgExpr | mean value of feature in group. |
logFC | log fold change between observations in group vs out. |
statistic | Wilcoxon rank sum U statistic. |
auc | area under the receiver operator curve. |
pval | nominal p value, from two-tailed Gaussian approximation of U statistic. |
padj | Benjamini-Hochberg adjusted p value. |
pct_in | Percent of observations in the group with non-zero feature value. |
pct_out | Percent of observations out of the group with non-zero feature value. |
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 A 1.15 0.082 2890 0.55 0.257 0.64 75 59
## 2 G2 A 0.76 -0.300 2190 0.42 0.081 0.41 49 65
## 3 G3 A 0.98 -0.187 2429 0.46 0.452 0.75 65 68
## 4 G4 A 1.22 0.397 3176 0.61 0.020 0.41 71 56
## 5 G5 A 1.42 0.366 3026 0.58 0.093 0.41 73 57
## 6 G6 A 0.93 -0.178 2424 0.46 0.439 0.75 56 60
We often find it helpful to summarize what the most distinguishing features are in each group.
res <- wilcoxauc(exprs, y)
top_markers(res, n = 10)
## # A tibble: 10 × 4
## rank A B C
## <int> <chr> <chr> <chr>
## 1 1 G4 G20 G1
## 2 2 G5 G5 G21
## 3 3 G9 G15 G6
## 4 4 G10 G19 G16
## 5 5 G14 G25 G11
## 6 6 G1 G10 G7
## 7 7 G15 G13 G2
## 8 8 G25 G8 G17
## 9 9 G8 G24 G12
## 10 10 G24 G23 G3
We can also filter for some criteria. For instance, the top features must be in at least 70% of all observations within the group. Note that not all groups have 10 markers that meet these criteria.
res <- wilcoxauc(exprs, y)
top_markers(res, n = 10, auc_min = .5, pct_in_min = 70)
## # A tibble: 9 × 4
## rank A B C
## <int> <chr> <chr> <chr>
## 1 1 G4 G20 G1
## 2 2 G5 G5 G21
## 3 3 G14 G15 G6
## 4 4 G1 G19 G16
## 5 5 NA G25 G11
## 6 6 NA G23 G7
## 7 7 NA G9 G2
## 8 8 NA G14 G12
## 9 9 NA NA G3
presto is optimized for dense and sparse matrix inputs. When possible, use sparse inputs. In our toy dataset, almost 39% of elements are zeros. Thus, it makes sense to cast it as a sparse dgCMatrix and run wilcoxauc on that.
## [1] 0.39
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 A 1.15 0.082 2890 0.55 0.257 0.64 75 59
## 2 G2 A 0.76 -0.300 2190 0.42 0.081 0.41 49 65
## 3 G3 A 0.98 -0.187 2429 0.46 0.452 0.75 65 68
## 4 G4 A 1.22 0.397 3176 0.61 0.020 0.41 71 56
## 5 G5 A 1.42 0.366 3026 0.58 0.093 0.41 73 57
## 6 G6 A 0.93 -0.178 2424 0.46 0.439 0.75 56 60
Sometimes, you don’t want to test all groups in the dataset against all other groups. For instance, I want to compare only observations in group ‘A’ to those in group ‘B’. This is achieved with the groups_use argument.
## feature group avgExpr logFC statistic auc pval padj pct_in pct_out
## 1 G1 A 1.15 0.790 1868 0.75 2.4e-06 0.00006 75 29
## 2 G2 A 0.76 -0.081 1184 0.48 6.9e-01 0.79372 49 53
## 3 G3 A 0.98 -0.018 1246 0.50 9.5e-01 0.97606 65 62
## 4 G4 A 1.22 0.240 1413 0.57 2.0e-01 0.44123 71 60
## 5 G5 A 1.42 -0.248 1054 0.43 1.9e-01 0.44123 73 78
## 6 G6 A 0.93 0.438 1521 0.61 3.1e-02 0.14516 56 36
top_markers(res_AB)
## # A tibble: 10 × 3
## rank A B
## <int> <chr> <chr>
## 1 1 G1 G20
## 2 2 G21 G19
## 3 3 G16 G13
## 4 4 G6 G15
## 5 5 G11 G5
## 6 6 G4 G25
## 7 7 G12 G23
## 8 8 G17 G18
## 9 9 G22 G8
## 10 10 G9 G10