[Tool] SCEMENT: a scalable and memory-efficient integration tool for large-scale single-cell sequencing data
Introduction
Motivation: Integrative analysis of large-scale single-cell data collected from diverse cell populations promises an improved understanding of complex biological systems. While several algorithms have been developed for single-cell RNA-sequencing (scRNA-seq) data integration, many lack the scalability to handle large numbers of datasets and/or millions of cells because of their memory and run time requirements. The few tools that can handle large data do so by reducing the computational burden, for example by subsampling the data or by selecting a reference dataset. Such shortcuts, however, hamper the accuracy of downstream analyses, especially those that require quantitative gene expression information.
Results: To overcome these limitations, the authors present SCEMENT, a SCalablE and Memory-Efficient iNTegration method. Its new parallel algorithm builds upon and extends the linear regression model previously applied in ComBat to an unsupervised sparse matrix setting, enabling accurate integration of diverse and large collections of scRNA-seq data. Using tens to hundreds of real scRNA-seq datasets, they show that SCEMENT outperforms ComBat as well as FastIntegration and Scanorama in runtime (up to 214× faster) and memory usage (up to 17.5× less). It not only performs batch correction and integration of millions of cells in under 25 min, but also facilitates the discovery of new rare cell types and more robust reconstruction of gene regulatory networks by retaining full quantitative gene expression information.
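To make the "linear model on sparse matrices" idea more concrete, here is a minimal sketch of a location-scale batch adjustment whose per-gene statistics are computed directly from sparse matrices. It is not SCEMENT's algorithm and not ComBat's full model: it assumes no covariates and omits the empirical Bayes shrinkage that ComBat applies, and the function names `sparse_gene_moments` and `location_scale_adjust` are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import sparse


def sparse_gene_moments(X):
    """Per-gene mean and variance of a cells x genes matrix, without densifying it."""
    X = sparse.csr_matrix(X, dtype=np.float64)
    n_cells = X.shape[0]
    mean = np.asarray(X.sum(axis=0)).ravel() / n_cells
    # E[x^2] only needs the stored nonzeros, so the matrix stays sparse here.
    mean_sq = np.asarray(X.multiply(X).sum(axis=0)).ravel() / n_cells
    var = np.maximum(mean_sq - mean ** 2, 0.0)  # clip tiny negative round-off
    return mean, var


def location_scale_adjust(batches, eps=1e-8):
    """Align each batch's per-gene mean and variance to pooled values.

    Simplifying assumptions (illustration only): no covariates and no
    empirical Bayes shrinkage. The output is dense because centering
    turns zeros into nonzeros.
    """
    batches = [sparse.csr_matrix(b, dtype=np.float64) for b in batches]
    n_total = sum(b.shape[0] for b in batches)
    moments = [sparse_gene_moments(b) for b in batches]

    # Cell-weighted grand mean and pooled within-batch variance per gene.
    grand_mean = sum(b.shape[0] * m for b, (m, _) in zip(batches, moments)) / n_total
    pooled_var = sum(b.shape[0] * v for b, (_, v) in zip(batches, moments)) / n_total

    corrected = []
    for b, (m, v) in zip(batches, moments):
        scale = np.sqrt((pooled_var + eps) / (v + eps))
        corrected.append((b.toarray() - m) * scale + grand_mean)
    return corrected
```

Calling `location_scale_adjust([batch_a, batch_b])` on two cells-by-genes matrices that share the same gene order returns one dense corrected array per batch. Only the final centering step densifies the data; keeping the corrected matrix itself sparse is the kind of problem a tool like SCEMENT has to solve and is outside the scope of this sketch.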
Code
Link
- SCEMENT: scalable and memory efficient integration of large-scale single-cell RNA-sequencing data