A Revolutionary "Single-Cell + PPI" Integration! scNET, New in Nature Methods: A Dual-View GNN That Cracks…

I. Pain points of single-cell analysis: noise, zero inflation, and missing functional context

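Single-cell RNA sequencing (scRNA-seq) reveals cellular heterogeneity, but it faces two major bottlenecks:

1. High noise and zero inflation: technical error produces false zeros that mask genuine gene co-expression signals;

2. Insufficient functional annotation: expression data alone rarely capture dynamic changes in pathways and protein complexes.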
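To make the zero-inflation point concrete, here is a minimal sketch on synthetic data. The Poisson/gamma model, the 60% dropout rate, and the helper names are illustrative assumptions, not values from the paper. It shows how randomly zeroing entries raises sparsity and weakens the correlation between two genes that are truly co-expressed.

```python
import numpy as np

def zero_fraction(counts):
    """Fraction of entries in a cells x genes count matrix that are exactly zero."""
    counts = np.asarray(counts)
    return float((counts == 0).mean())

def apply_dropout(counts, rate, rng):
    """Simulate technical dropout by zeroing each entry with probability `rate`."""
    keep = rng.random(counts.shape) >= rate
    return counts * keep

rng = np.random.default_rng(0)
# Two co-expressed genes: both are driven by the same per-cell activity level.
activity = rng.gamma(shape=2.0, scale=5.0, size=2000)
true_counts = np.column_stack([rng.poisson(activity), rng.poisson(activity)])
observed = apply_dropout(true_counts, rate=0.6, rng=rng)

corr_true = np.corrcoef(true_counts[:, 0], true_counts[:, 1])[0, 1]
corr_obs = np.corrcoef(observed[:, 0], observed[:, 1])[0, 1]
print(f"zero fraction: {zero_fraction(true_counts):.2f} -> {zero_fraction(observed):.2f}")
print(f"gene-gene correlation: {corr_true:.2f} -> {corr_obs:.2f}")
```

The drop in correlation is the "masked co-expression signal" described above: the underlying relationship is unchanged, only the observations are corrupted.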

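Limitations of conventional solutions:

1. Imputation methods (such as MAGIC and SAVER) only repair the values and ignore functional relationships between genes;

2. Existing models (such as scGPT) depend on pretraining and annotated data, which makes them hard to adapt to new datasets.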

II. scNET: when single-cell data meet the PPI network

On March 17, 2025, Asaf Madi, Roded Sharan, and Ron Sheinin published a research article in Nature Methods titled "scNET: learning context-specific gene and cell embeddings by integrating single-cell gene expression data with protein–protein interactions" (Figure 1).

Figure 1. scNET

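The authors propose scNET, a deep learning framework that integrates single-cell RNA sequencing data with a protein-protein interaction (PPI) network to learn gene and cell embeddings jointly. The approach improves the accuracy of gene function annotation and cell clustering, and it markedly improves pathway analysis (Figure 2).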

Figure 2. The scNET model architecture

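Model highlights:

1. Dual-encoder architecture:

Gene-gene relationships: functional associations between genes are modeled through the PPI network.

Cell-cell relationships: similarity between cells is modeled through a KNN graph.

This dual-view design lets scNET capture gene relationships that are specific to a biological context while reducing the influence of noise.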
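The two graphs behind this dual view can be sketched in a few lines of NumPy. This is only an illustration of the data structures, not the scNET implementation: the function names, the Euclidean distance, the value of k, and the toy edge list are all assumptions made for the example.

```python
import numpy as np

def knn_adjacency(cells, k):
    """Binary cells x cells adjacency linking each cell to its k nearest
    neighbours (Euclidean distance), symmetrised so edges are undirected."""
    cells = np.asarray(cells, dtype=float)
    sq = (cells ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * cells @ cells.T
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape, dtype=int)
    rows = np.repeat(np.arange(len(cells)), k)
    adj[rows, nearest.ravel()] = 1
    return np.maximum(adj, adj.T)

def ppi_adjacency(genes, edges):
    """Binary genes x genes adjacency built from (gene_a, gene_b) pairs.
    Pairs that mention a gene missing from `genes` are skipped."""
    index = {g: i for i, g in enumerate(genes)}
    adj = np.zeros((len(genes), len(genes)), dtype=int)
    for a, b in edges:
        if a in index and b in index:
            adj[index[a], index[b]] = adj[index[b], index[a]] = 1
    return adj

rng = np.random.default_rng(0)
expr = rng.poisson(2.0, size=(6, 4)).astype(float)   # toy matrix: 6 cells x 4 genes
cell_graph = knn_adjacency(expr, k=2)

genes = ["Zap70", "Lck", "Fyn", "Cd3g"]
toy_edges = [("Zap70", "Lck"), ("Lck", "Fyn"), ("Fyn", "Lat")]  # toy list, not real PPI data
gene_graph = ppi_adjacency(genes, toy_edges)
print(cell_graph.shape, gene_graph.sum() // 2)
```

One graph has cells as nodes and the other has genes as nodes, which is why the model needs two encoders rather than one.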

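2. An attention mechanism that refines the cell similarity graph

scNET introduces an attention mechanism to refine the cell-cell similarity graph (the KNN graph). A conventional KNN graph assumes that every cell is similar to a fixed number of other cells, an assumption that does not always hold biologically. By learning attention weights, scNET dynamically prunes low-quality edges and produces a cell similarity graph that better reflects real biological relationships.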
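The pruning idea can be sketched as follows. This is a simplified stand-in, not scNET's attention layer: in the real model the weights are learned, whereas here the score is assumed to be a scaled dot product between fixed cell embeddings, and the `min_weight` threshold is a hypothetical parameter.

```python
import numpy as np

def prune_edges_by_attention(embeddings, adjacency, min_weight):
    """Return (attention, pruned_adjacency).

    Attention for edge i -> j is a softmax, taken over the neighbours of i, of
    the scaled dot product between the two cell embeddings. Edges whose
    attention falls below `min_weight` are removed.
    """
    emb = np.asarray(embeddings, dtype=float)
    adj = np.asarray(adjacency, dtype=bool)
    scores = emb @ emb.T / np.sqrt(emb.shape[1])
    scores = np.where(adj, scores, -np.inf)          # only existing edges compete
    row_max = scores.max(axis=1, keepdims=True)
    row_max[~np.isfinite(row_max)] = 0.0             # rows without any neighbour
    weights = np.exp(scores - row_max)               # non-edges become exactly 0
    totals = weights.sum(axis=1, keepdims=True)
    attention = np.divide(weights, totals, out=np.zeros_like(weights), where=totals > 0)
    pruned = adj & (attention >= min_weight)
    return attention, pruned

# Cells 0 and 1 are alike; cell 2 points the opposite way.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
full_graph = np.ones((3, 3), dtype=bool) & ~np.eye(3, dtype=bool)
attention, pruned = prune_edges_by_attention(emb, full_graph, min_weight=0.3)
print(np.round(attention, 2))
print(pruned.astype(int))
```

Note that the pruned graph is directed: cell 0 drops its edge to cell 2, while cell 2, which has no better neighbour, keeps both of its edges. Each cell therefore ends up with a different number of neighbours, which is the point of relaxing the fixed-k assumption.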

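3. Autoencoder framework

scNET uses an autoencoder framework: an inner-product decoder reconstructs the PPI network, and a fully connected decoder reconstructs gene expression. This design preserves the dynamic character of the original data, and the PPI network makes the reconstructed gene expression easier to interpret biologically.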
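An inner-product decoder is the standard choice in graph autoencoders: the predicted probability of an edge between genes i and j is sigmoid(z_i · z_j). The sketch below shows that decoder and a binary cross-entropy reconstruction loss in NumPy. It is a generic illustration under that assumption, not code taken from scNET, and the function names are made up for the example.

```python
import numpy as np

def inner_product_decoder(z):
    """Edge probabilities sigmoid(z_i . z_j) for every pair of gene embeddings."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-(z @ z.T)))

def reconstruction_loss(z, adjacency, eps=1e-9):
    """Mean binary cross-entropy between predicted and observed edges,
    computed over off-diagonal gene pairs only."""
    probs = inner_product_decoder(z)
    adj = np.asarray(adjacency, dtype=float)
    off_diag = ~np.eye(len(adj), dtype=bool)
    bce = -(adj * np.log(probs + eps) + (1.0 - adj) * np.log(1.0 - probs + eps))
    return float(bce[off_diag].mean())

# Toy network with a single edge between genes 0 and 1.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 0]])
z_good = np.array([[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0]])   # genes 0 and 1 aligned
z_bad = np.array([[2.0, 0.0], [-2.0, 0.0], [2.0, 0.0]])    # genes 0 and 2 aligned
loss_good = reconstruction_loss(z_good, adjacency)
loss_bad = reconstruction_loss(z_bad, adjacency)
print(f"loss with matching embeddings: {loss_good:.3f}, mismatched: {loss_bad:.3f}")
```

Minimizing this loss pulls interacting genes toward each other in the embedding space, which is how the PPI structure ends up encoded in the gene embeddings.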

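III. Model validation: how scNET performs

1. Improved gene embeddings

In a Gene Ontology (GO) semantic similarity analysis, correlation in the scNET embedding space was significantly higher than with conventional methods such as scLINE and DeepImpute (Figure 3a).
With k-means clustering followed by GSEA, the gene clusters produced by scNET showed a significantly higher GO enrichment rate than other methods (for example, 85% at K=30 versus 40% for the raw data) (Figure 3b).
UMAP visualization shows that scNET forms tight gene clusters with clear functions and can identify cell-type-specific expression patterns (Figure 3c-e; c: counts, d: scLINE, e: scNET).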

Figure 3. Evaluating gene expression on a malaria-associated B cell dataset

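2. The scNET co-embedded network captures biological pathway features

The study builds a co-embedded network by combining the PPI network with co-expression information. Analysis on the malaria B cell dataset shows:

Markedly higher network modularity (Figure 4a): at the 99th percentile threshold, the modularity of the scNET network exceeds that of the raw data across the board.
Strong pathway reconstruction (Figure 4c): AUPR for predicting KEGG pathways is significantly higher than with conventional methods.
Disease gene association analysis: on leukemia/lymphoma-related gene lists, the scNET network reaches a mean z-score of 7 (versus 3 for the PPI network and only 0.5 for the co-expression network) (Figure 4d), and it performs best on 6 of the 9 tested lists (Figure 4e).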
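The percentile threshold mentioned for the modularity result can be illustrated with a small sketch. It assumes that a network is obtained by keeping only gene pairs whose embedding similarity lies above a chosen percentile; cosine similarity and the function name are assumptions for this example, and the construction inside scNET may differ.

```python
import numpy as np

def threshold_similarity_network(embeddings, percentile=99.0):
    """Connect gene pairs whose cosine similarity exceeds the given percentile
    of all pairwise similarities. Returns a symmetric boolean adjacency matrix."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    upper = np.triu_indices(len(emb), k=1)      # each unordered pair once
    cutoff = np.percentile(sim[upper], percentile)
    adj = np.zeros(sim.shape, dtype=bool)
    adj[upper] = sim[upper] > cutoff
    return adj | adj.T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(200, 16))     # 200 hypothetical genes
network = threshold_similarity_network(toy_embeddings, percentile=99.0)
n_pairs = 200 * 199 // 2
n_edges = int(network.sum()) // 2
print(f"kept {n_edges} of {n_pairs} gene pairs")
```

A high percentile keeps only the strongest 1% or so of pairs, so the resulting network is sparse enough for community structure, and therefore modularity, to be meaningful.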

Figure 4. Evaluating the co-embedded network on a malaria-associated B cell dataset

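3. Better cell clustering

Across several datasets (such as dorsal root ganglion cells and cancer cell lines), scNET cell embeddings significantly improved the accuracy of cell clustering (Figure 5a-l).
Measured by the adjusted Rand index (ARI), scNET outperformed the raw data and other methods such as MAGIC and DeepImpute (Figure 5m,n).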
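For readers unfamiliar with the ARI, here is a small self-contained implementation of the standard formula, using only the Python standard library. The toy labels are invented for illustration and have nothing to do with the paper's datasets.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:                        # fewer than two cells
        return 1.0
    # Pairs of cells placed together in both labelings, in each one separately.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:                   # degenerate, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

cell_types = ["T", "T", "B", "B"]
perfect = adjusted_rand_index(cell_types, [1, 1, 0, 0])   # same split, different names
crossed = adjusted_rand_index(cell_types, [0, 1, 0, 1])   # every cluster mixes both types
print(perfect, crossed)
```

The score depends only on which cells are grouped together, not on the cluster names, and it is corrected for chance, so a clustering that disagrees with the reference more than random grouping would gets a negative value.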

Figure 5. Benchmark of cell embeddings and clustering

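4. Strong performance in pathway enrichment analysis

In glioblastoma (GBM) data, scNET reveals pathway changes related to T cell activation after P-selectin inhibition treatment, changes that the raw data failed to detect (Figure 6b,c).
The gene expression reconstructed by scNET also does well at capturing pathways tied to specific cell types, such as the T cell receptor signaling pathway (Figure 6d).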

Figure 6. Reconstructed gene expression better captures pathway activity across cell types and conditions in the GBM tumor microenvironment

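5. Addressing zero inflation

scNET outperforms other methods, such as MAGIC and DeepImpute, at reducing zero inflation and improving the accuracy of marker gene expression. In the comparison of AUPR scores for marker gene expression, scNET scored highest in every cell type.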

Table 1. AUPR values for identifying cell types from marker gene expression

IV. How to use scNET

The scNET code is open source and available from:

GitHub：

PyPI：/


!pip install scnet  # install scNET with pip
# Download the example data
import gdown
download_url = f''
output_path = './example.h5ad'
gdown.download(download_url, output_path, quiet=False)

# Import scNET and train the model
import scNET
# For faster processing of medium to large datasets (e.g. 30K cells or more), the maximum cell batch size can be increased depending on the available GPU memory.
# Uncomment at most one of the following lines; otherwise keep the default value.

# For a GPU with 24GB of memory
#scNET.main.MAX_CELLS_BATCH_SIZE = 3000

# For a GPU with 40GB of memory
#scNET.main.MAX_CELLS_BATCH_SIZE = 4000

# For a GPU with 80GB of memory or more
#scNET.main.MAX_CELLS_BATCH_SIZE = 8000

# To control the gene expression cutoff, adjust the minimum percentage of cells expressing a gene. By default, all expressed genes are considered.
# For example, to consider only genes expressed in at least 5% of cells:
#scNET.main.EXPRESSION_CUTOFF = 0.05
# For larger datasets (10K cells or more) containing many subcommunities, the number of encoder layers can be increased to 4 or more. The default value is 3.
scNET.main.NUM_LAYERS = 3
# To control the number of differentially expressed genes (default: 2000).
# For example, to consider 3500 DE genes:
scNET.main.DE_GENES_NUM = 3500
import scanpy as sc
obj = sc.read_h5ad("./example.h5ad")
scNET.run_scNET(obj, pre_processing_flag=False, human_flag=False, number_of_batches=10, split_cells=True, max_epoch=300, model_name = "test")

# Use the model outputs
## Load all relevant embeddings
embedded_genes, embedded_cells, node_features , out_features =  scNET.load_embeddings("test")

## Create a Scanpy object from the reconstructed gene expression
cell_types = {"0":"Macrophages","1":"Macrophages","2":"CD8 Tcells","3":"Microglia","4":"Cancer","5":"CD4 Tcells","6":"B Cells","10":"Prolifrating Tcells","8":"Cancer","11":"NK"}
obj.obs["Cell Type"] = obj.obs.seurat_clusters.map(cell_types)
recon_obj = scNET.create_reconstructed_obj(node_features, out_features, obj)

## Plot marker genes
sc.pl.umap(recon_obj, color=["Cell Type","Cd4","Cd8a","Cd14","Icos","P2ry12","Mki67","Ncr1"], show=True, legend_loc='on data')


## Example: computing AUPR for marker genes
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc
from itertools import cycle

def calculate_marker_gene_aupr(adata, marker_genes, cell_types):
    colors = cycle(['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2'])
    plt.figure(figsize=(10, 8))

    for marker_gene, cell_type, color in zip(marker_genes, cell_types, colors):
        gene_expression = adata[:, marker_gene].X.toarray().flatten()
        binary_labels = (adata.obs["Cell Type"].isin(cell_type)).astype(int)

        precision, recall, _ = precision_recall_curve(binary_labels, gene_expression)
        aupr = auc(recall, precision)
        plt.plot(recall, precision, color=color, lw=2,
                 label=f'PRAUC={aupr:.2f} for {marker_gene} ({cell_type[0]})')

    plt.xlabel('Recall', fontsize=14)
    plt.ylabel('Precision', fontsize=14)
    plt.title('Precision-Recall Curve by Cell Type', fontsize=16)
    plt.legend(loc="best", fontsize=12)
    plt.grid(True)
    plt.tight_layout()
    plt.show()
calculate_marker_gene_aupr(recon_obj,['Cd8a','Cd4','Cd14',"P2ry12","Ncr1","Mki67","Tert"],[["CD8 Tcells"],['CD4 Tcells'], ['Macrophages'], ['Microglia'], ["NK"],["Prolifrating Tcells"],["Cancer"]])


## Propagation-based signature projection for activated T cells
scNET.run_signature(recon_obj, up_sig=["Zap70","Lck","Fyn","Cd3g","Cd28","Lat"],alpha = 0.9)
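The idea behind propagation-based scoring can be illustrated with a generic network propagation scheme (random walk with restart). The article does not describe what `run_signature` does internally or how it uses `alpha`, so treating `alpha` as the weight of the network-smoothed term is an assumption, and the function below is a hypothetical sketch rather than scNET's code.

```python
import numpy as np

def propagate_scores(adjacency, seed_scores, alpha=0.9, tol=1e-8, max_iter=1000):
    """Iterate p <- alpha * W p + (1 - alpha) * p0 until convergence, where W is
    the symmetrically normalised adjacency D^-1/2 A D^-1/2."""
    adj = np.asarray(adjacency, dtype=float)
    p0 = np.asarray(seed_scores, dtype=float)
    degree = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    inv_sqrt[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])
    w = adj * inv_sqrt[:, None] * inv_sqrt[None, :]
    p = p0.copy()
    for _ in range(max_iter):
        nxt = alpha * (w @ p) + (1.0 - alpha) * p0
        if np.abs(nxt - p).max() < tol:
            return nxt
        p = nxt
    return p

# Toy graph: a path 0 - 1 - 2 - 3 plus an isolated node 4; the signature is node 0.
path = np.zeros((5, 5))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    path[a, b] = path[b, a] = 1.0
seed = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
scores = propagate_scores(path, seed, alpha=0.9)
print(np.round(scores, 3))
```

Nodes close to the seed receive higher scores than distant ones, and disconnected nodes receive none. With `alpha` near 1 the signal spreads far along the network; with `alpha` near 0 it stays on the seed genes.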


## Application to tumor aggressiveness analysis
scNET.run_signature(recon_obj,up_sig=["Cdkn2a","Myc","Pten","Kras"])


## Build the co-embedded network
import networkx as nx
net, mod = scNET.build_co_embeded_network(embedded_genes, node_features)
print(f"The network modularity: {mod}")

# Find downstream transcription factors
## Re-embed the T cell subset
sub_obj = obj[obj.obs["Cell Type"] == "CD8 Tcells"]
scNET.run_scNET(sub_obj, pre_processing_flag=False, human_flag=False, number_of_batches=3, split_cells=False, max_epoch=300, model_name = "Tcells")
embedded_genes, embedded_cells, node_features , out_features =  scNET.load_embeddings("Tcells")
net, mod = scNET.build_co_embeded_network(embedded_genes, node_features, 99.5)
print(f"The network modularity: {mod}")

## Find downstream TFs for a specific gene signature
import seaborn as sns
import matplotlib.pyplot as plt

tf_scores = scNET.find_downstream_tfs(net, ["Zap70","Lck","Fyn","Cd3g","Cd28","Lat"]).sort_values(ascending=False).head(10)

ax = sns.barplot(x=tf_scores.index, y=tf_scores.values, color='skyblue')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')

plt.tight_layout()
ax.set_xlabel('TF')
ax.set_ylabel('Scores')
plt.show()

## Examine differences between cancer cells, microglia, and macrophages
recon_obj.obs["Cell Type"] = recon_obj.obs.seurat_clusters.map(cell_types)
de_genes_per_group, significant_pathways, filtered_kegg, enrichment_results = scNET.pathway_enricment(recon_obj.copy()[recon_obj.obs["Cell Type"].isin(["Microglia","Macrophages","Cancer"])],groupby="Cell Type")
scNET.plot_de_pathways(significant_pathways,enrichment_results,10)
