[Tool] NPM: correcting latent batch effects in omics data by nearest-pair matching

Introduction

Motivation: Batch effects (BEs) are a predominant source of noise in omics data and often mask real biological signals. BEs remain common in existing datasets. Current methods for BE correction mostly rely on specific assumptions or complex models and may not detect and adjust BEs adequately, which impacts downstream analysis and discovery power. To address these challenges we developed NPM, a nearest-neighbor matching-based method that adjusts BEs and may outperform other methods in a wide range of datasets.

Results: We assessed distinct metrics and graphical readouts, and compared our method to commonly used BE correction methods. NPM corrects BEs while preserving biological differences, and it may outperform other methods on multiple metrics. Altogether, NPM proves to be a valuable BE correction approach for maximizing discovery in biomedical research, with applicability in clinical research where latent BEs are often dominant.

Code

Principle:

NPM (Nearest-Pair Matching) relies on distance-based matching to deterministically search, for each sample, for its nearest neighbors with opposite phenotype labels, the so-called “nearest pairs”. NPM requires knowledge of the phenotypes but not of the batch assignment.
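
To make the matching idea concrete, here is a minimal base-R sketch of a nearest-pair search. It is an illustration only and not the NPmatch implementation: the helper name find_nearest_pairs is hypothetical, correlation distance is assumed (mirroring the dist.method="cor" argument used further down), and only one partner per sample is returned. How the pairs are then used for the correction itself is not shown.

```r
## Illustrative sketch only: NOT the NPmatch implementation.
## find_nearest_pairs() is a hypothetical helper name.
find_nearest_pairs <- function(X, y) {
  ## X: numeric matrix, features in rows, samples in columns (with colnames).
  ## y: phenotype label per sample, in the same order as the columns of X.
  stopifnot(ncol(X) == length(y), length(unique(y)) >= 2, !is.null(colnames(X)))
  y <- as.character(y)
  ## Correlation distance between samples (columns of X).
  D <- 1 - stats::cor(X)
  partner <- vapply(seq_len(ncol(X)), function(i) {
    d <- D[i, ]
    d[y == y[i]] <- Inf   # exclude same-label samples (including the sample itself)
    which.min(d)          # closest sample carrying a different label
  }, integer(1))
  data.frame(sample = colnames(X),
             pheno = y,
             nearest = colnames(X)[partner],
             nearest.pheno = y[partner])
}

## Toy data: 50 features, 6 samples, 2 phenotypes.
set.seed(1)
toyX <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(NULL, paste0("S", 1:6)))
toyY <- c("A", "A", "A", "B", "B", "B")
np <- find_nearest_pairs(toyX, toyY)
print(np)
```

The search is deterministic because it only depends on the distance matrix and the labels; no batch vector is needed, which matches the requirement stated above.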


## Load NPmatch and limma
library("NPmatch")
library("limma")

## X: raw data matrix, with features in rows and samples in columns.
## Meta: matrix or dataframe with the metadata associated with X. 
## We need to ensure that the samples in X and Meta are aligned.
X <- read.table("./data/GSE10846.Expression.txt", sep="\t")
Meta <- read.table("./data/GSE10846.Metadata.txt", sep="\t")
dim(X); class(X)
dim(Meta); class(Meta)
table(rownames(Meta) == colnames(X))

## To correct BEs, NPmatch requires a vector of phenotype labels per sample.
## To assess BE correction, we will also need a vector of batch labels (see below).
## "pheno": phenotype labels.
## "batch": batch labels.
pheno <- Meta[,"dlbcl.type"]
batch <- Meta[,"Chemotherapy"]

## Intra-sample normalization of the raw data.
## normalize.log2CPM() comes from the provided normalize.log2CPM.R script;
## source that script first if the function is not already available.
nX <- normalize.log2CPM(X)

## Inter-sample normalization by quantile normalization
nX <- limma::normalizeQuantiles(nX)

## Batch correction with NPmatch
cX <- NPmatch(X=nX, y=pheno, dist.method="cor", sdtop=5000)
table(rownames(Meta) == colnames(cX))

## Check BEs in the raw and batch-corrected data by UMAP or t-SNE
LL <- list(X, cX)
names(LL) <- c("Uncorrected", "Batch-corrected")
Var <- c("Batch", "Pheno")

x11(width = 10, height = 10)
par(mfrow = c(2,2))
i=1
for(i in 1:length(LL)) {
     
      nb <- max(1, min(30, round(ncol(LL[[i]]) / 5)))
      # pos <- Rtsne::Rtsne(t(LL[[i]]), perplexity=nb)$Y
      pos <- uwot::tumap(t(LL[[i]]), n_neighbors = max(2, nb)) 
      
      pos <- data.frame(Dim1=pos[,1], Dim2=pos[,2], Pheno=pheno, Batch=batch)
      table(rownames(pos) == colnames(cX))
      pos[,1:2] <- apply(pos[,1:2], 2, function(x) as.numeric(x))
      pos$Col.Pheno <- as.numeric(factor(pos$Pheno))
      pos$Col.Batch <- as.numeric(factor(pos$Batch))
        
      v=1
      for(v in 1:length(Var)) {
            Col <- pos[,paste0("Col.",Var[v])]
            plot(pos$Dim1,
                 pos$Dim2,
                 col = Col,
                 xlab = "Dim1", 
                 ylab = "Dim2",
                 pch = 18, 
                 cex = 0.8, 
                 cex.lab = 1.3,
                 cex.axis = 1.3,
                 las = 1, 
                 tcl = -0.1,
                 mgp = c(1.5,0.5,0))
            
            mtext(names(LL)[i], 
                  font = 2,
                  adj = 0.5, 
                  cex = 1)
    
            legend("bottomleft",
                   unique(pos[,Var[v]]),
                   cex = 1,
                   bty = "n",
                   fill = unique(Col),
                   col = unique(Col))
    
            grid(lwd = 1.2)
       }
}
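
The plots above are a visual check. As a numeric complement, one option is the average silhouette width computed on the batch labels: values near 0 suggest that batches are well mixed, while larger values suggest that samples still cluster by batch. The sketch below is an assumption and not part of NPmatch; batch_silhouette is a hypothetical helper, Euclidean distance is an arbitrary choice, and the toy matrix only simulates an additive batch shift.

```r
## Illustrative sketch only; batch_silhouette() is a hypothetical helper,
## not an NPmatch function. Requires the 'cluster' package.
batch_silhouette <- function(M, labels) {
  ## M: numeric matrix, features in rows, samples in columns.
  ## labels: one label per sample (e.g. batch or phenotype).
  cl <- as.integer(factor(labels))
  d <- stats::dist(t(M))               # Euclidean distance between samples
  sil <- cluster::silhouette(cl, d)
  mean(sil[, "sil_width"])             # average silhouette width
}

## Toy data: 100 features, 20 samples, two simulated batches.
set.seed(1)
toy.batch <- rep(c("b1", "b2"), each = 10)
toy <- matrix(rnorm(100 * 20), nrow = 100)
shifted <- toy
shifted[, toy.batch == "b2"] <- shifted[, toy.batch == "b2"] + 2  # simulated batch shift

s.mixed <- batch_silhouette(toy, toy.batch)      # no batch structure
s.shift <- batch_silhouette(shifted, toy.batch)  # strong batch structure
c(mixed = s.mixed, shifted = s.shift)
```

With the objects from the walkthrough, the same helper could be applied as batch_silhouette(as.matrix(nX), batch) and batch_silhouette(as.matrix(cX), batch), and likewise with pheno, to see whether the batch score drops while the phenotype score is preserved.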

Reference

NPM: Latent Batch Effects Correction of Omics data by Nearest Pair Matching
