Still struggling with cell type annotation? Try mLLMCelltype!

Introduction

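Cell type annotation is a key step in single-cell data analysis. Current annotation practice relies largely on manual work: the highly expressed genes of each cell cluster are compared by hand against canonical cell type marker genes from the literature. This process is extremely time-consuming and requires specialist biological knowledge. As sequencing costs fall and datasets grow to millions of cells from different tissues, manual annotation has become hard to sustain.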

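Recently, researchers from the Department of Statistics at Texas A&M University and the Division of Computational Biology in the Department of Quantitative Health Sciences at Mayo Clinic reported a new result. The authors built a cell annotation tool on top of large language models, mLLMCelltype, which they report substantially improves annotation accuracy for single-cell RNA sequencing data. The work, "Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data", has been posted on the preprint server bioRxiv[1].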

Workflow

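First, differential expression analysis identifies the marker genes of each cell cluster; these are combined with tissue context to form the input for the next step (Fig b-c);
Next, multiple large language models (LLMs) independently receive this input and propose an initial cell type annotation for each cluster, together with biological reasoning based on the marker gene evidence (Fig d, e). For clusters that do not immediately reach high consensus, the framework starts an iterative deliberation process (Fig e);
In each round of deliberation, the LLMs share their structured arguments, discuss the importance of specific marker genes (for example, the pan-myeloid marker CD68 versus the tissue-specific marker MARCO) and the signaling pathways potentially involved, and assess how tissue context affects cell identity (Fig e);
Each LLM then re-weighs and refines its own classification based on the evidence and reasoning presented by the other models (Fig e);
After each round, a dedicated consensus checker model (Consensus Checker LLM) evaluates the level of agreement among the participating models and compares it with a preset consensus threshold (the key point of mLLMCelltype, Fig e). If consensus is reached, the process stops and outputs the final annotation with its confidence score; otherwise the cluster either enters another round of discussion (up to a limited number of rounds) or is flagged as ambiguous.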
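The deliberation loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the package's implementation: the names `Model`, `agreement`, and `annotate_cluster` are hypothetical, each "model" is a plain callable rather than a real LLM client, and the Consensus Checker LLM is replaced by a simple majority fraction.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for an LLM: a callable that maps
# (marker genes, tissue, other models' previous answers) to a cell type label.
Model = Callable[[List[str], str, Dict[str, str]], str]


def agreement(labels: List[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of models that chose it."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label, top_count / len(labels)


def annotate_cluster(
    markers: List[str],
    tissue: str,
    models: Dict[str, Model],
    threshold: float = 0.7,
    max_rounds: int = 3,
) -> Tuple[Optional[str], float, int]:
    """Run an independent vote, then up to max_rounds rounds of discussion.

    Returns (label, agreement, discussion_rounds_used). The label is None
    when the models never reach the threshold (ambiguous cluster).
    """
    previous: Dict[str, str] = {}
    share = 0.0
    for round_idx in range(max_rounds + 1):  # round 0 is the independent vote
        current = {name: m(markers, tissue, previous) for name, m in models.items()}
        label, share = agreement(list(current.values()))
        if share >= threshold:  # simplification: majority fraction, not an LLM checker
            return label, share, round_idx
        previous = current  # answers are shared with every model in the next round
    return None, share, max_rounds


# Toy models (assumptions for illustration only).
def stubborn(markers: List[str], tissue: str, previous: Dict[str, str]) -> str:
    return "Macrophage"


def follower(markers: List[str], tissue: str, previous: Dict[str, str]) -> str:
    # Starts with its own guess, then adopts the previous round's majority view.
    if not previous:
        return "Monocyte"
    return Counter(previous.values()).most_common(1)[0][0]


toy_models = {"model_a": stubborn, "model_b": stubborn, "model_c": follower}
result = annotate_cluster(["CD68", "MARCO"], "lung", toy_models)
```

In this toy run two of three models agree initially (about 0.67, below the 0.7 threshold), so one discussion round is needed before the third model switches and the loop stops.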

mLLMCelltype features

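Multi-LLM consensus architecture: pools the collective intelligence of multiple large language models to overcome the limitations and biases of any single model
Structured deliberation process: lets the LLMs share reasoning, evaluate evidence, and refine annotations over multiple rounds of collaborative discussion
Transparent uncertainty quantification: provides quantitative metrics (consensus proportion and Shannon entropy) to identify ambiguous cell populations that need expert review
Hallucination reduction: cross-model discussion actively suppresses inaccurate or unsupported predictions through critical evaluation
Robustness to input noise: collective error correction maintains high accuracy even when marker gene lists are imperfect
Hierarchical annotation support: an optional extension for multi-resolution analysis with parent-child consistency
No reference dataset required: accurate annotation without pre-training or reference data
Complete reasoning chain: records the full discussion for transparent decision-making
Seamless integration: works directly with standard Scanpy/Seurat workflows and marker gene outputs
Modular design: new LLMs can be integrated easily as they become available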
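The two uncertainty metrics named in the list above, consensus proportion and Shannon entropy, can be computed from the labels that the models assign to one cluster. The sketch below is a minimal illustration; the function names are hypothetical, and the choice of base-2 logarithm and exact string matching of labels are assumptions, since the article does not specify them.

```python
import math
from collections import Counter
from typing import List


def consensus_proportion(labels: List[str]) -> float:
    """Fraction of models that voted for the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def shannon_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the label distribution.

    0 means every model agrees; larger values mean more disagreement.
    """
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


# Four models annotate one cluster; three agree.
votes = ["T cell", "T cell", "T cell", "NK cell"]
proportion = consensus_proportion(votes)  # 3 of 4 models agree
entropy = shannon_entropy(votes)
```

A cluster with a low consensus proportion or a high entropy is a natural candidate for expert review.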

Results

Performance evaluation across multiple datasets

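Fig a: Comparison with GPTCelltype, showing exact matches and partial matches (see the definitions in the "Methods" section); accuracy is given as a percentage.

Fig b: Performance comparison on the developing human thymus cell atlas. UMAP visualizations show the cell type annotations from the reference (first), mLLMCelltype predictions (second), GPTCelltype predictions (third), and cluster-level popV predictions (fourth); points are colored by cell type.

Fig c: Performance comparison on the Lung Cell Atlas (LCA). UMAP visualizations show the cell type annotations from the reference (first), mLLMCelltype predictions (second), GPTCelltype predictions (third), and cluster-level popV predictions (fourth); points are colored by cell type.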

Comparison of annotation results on the HNOCA dataset

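Fig a: UMAP visualization of the main neural cell type populations and developmental states in the HNOCA reference annotation. The reference annotation was generated with the snapseed tool, which combines hierarchical cell type definitions, marker gene scoring, and reference mapping to a developing human brain atlas. The data were further integrated with scPoli, and labels were transferred from the reference brain atlas using scVI and scANVI, particularly for non-telencephalic neurons and progenitors.

Fig b: UMAP visualization of the annotations from our framework, which identifies a wide range of neural cell types and developmental states and reaches high annotation accuracy when evaluated against the reference annotation of this large integrated atlas.

Fig c: UMAP visualization of the GPTCelltype annotations, which are less accurate than those of our framework, particularly in distinguishing related neural progenitor states and specialized neuronal subtypes.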

Annotation performance on a human peripheral immune cell atlas across the lifespan

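Fig a: UMAP visualization of the distribution of immune cell types across the lifespan in the reference annotation. The reference annotation was generated from a comprehensive analysis of immune cells from 220 healthy donors spanning 13 age groups, from birth to over 90 years old, combining transcriptomic data with expert annotation.

Fig b: UMAP visualization of the annotations from our framework, which reach high annotation accuracy (76.6%) when evaluated against the reference annotation and capture complex developmental transitions and diverse immune cell populations.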

Comparison of annotation results on the HLCA dataset

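Fig a: UMAP visualization of the HLCA reference annotation, showing the main cell type populations and their distribution. The HLCA reference annotation was generated with a hierarchical framework of 5 levels of granularity, from broad labels (level 1: immune cells, epithelial cells, etc.) to fine-grained cell types (level 5: for example, naive CD4 T cells). Data integration and clustering were performed with scANVI, followed by Leiden clustering at different resolutions (level 1: 0.01; level 2: 0.2, k = 30; levels 3-5: 0.2, k = 15/10). The final annotations were manually curated by six lung biology experts based on the clustering results, marker gene evidence, and the HLCA core clustering results.

Fig b: UMAP visualization of the annotations from our framework, which show high annotation accuracy compared with the reference annotation. Following the same hierarchical clustering approach as the reference, our framework annotates cell types level by level. At each level, the LLMs receive global marker genes (genes specifically expressed in one cluster compared with all other clusters) and sister marker genes (genes differentially expressed between the target cluster and its sister clusters within the same parent cluster), together with the parent cluster annotation from the previous level, to guide the annotation.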
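The difference between global markers and sister markers can be illustrated with a toy example. The sketch below is a deliberate simplification: it ranks genes by a difference in mean expression instead of running a proper differential expression test, and the helper names, cluster IDs, and expression values are all made up for illustration.

```python
from typing import Dict, List


def sister_clusters(parent_of: Dict[str, str], cluster: str) -> List[str]:
    """Clusters that share the same parent as `cluster`, excluding itself."""
    parent = parent_of[cluster]
    return sorted(c for c, p in parent_of.items() if p == parent and c != cluster)


def rank_markers(
    mean_expr: Dict[str, Dict[str, float]],
    target: str,
    background: List[str],
    top_n: int = 2,
) -> List[str]:
    """Rank genes by target mean minus the average mean over background clusters.

    Simplification: a real pipeline would use a statistical test instead.
    """
    if not background:
        raise ValueError("background must contain at least one cluster")
    scores = {}
    for gene, value in mean_expr[target].items():
        bg_mean = sum(mean_expr[c][gene] for c in background) / len(background)
        scores[gene] = value - bg_mean
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy hierarchy: c0 and c1 are children of "T cell", c2 of "B cell".
parent_of = {"c0": "T cell", "c1": "T cell", "c2": "B cell"}
mean_expr = {
    "c0": {"CD3E": 3.0, "CD4": 2.5, "CD8A": 0.1, "MS4A1": 0.0},
    "c1": {"CD3E": 3.0, "CD4": 0.2, "CD8A": 2.8, "MS4A1": 0.0},
    "c2": {"CD3E": 0.1, "CD4": 0.1, "CD8A": 0.0, "MS4A1": 3.2},
}

# Global markers: c0 against every other cluster.
global_markers = rank_markers(mean_expr, "c0", ["c1", "c2"], top_n=2)
# Sister markers: c0 against its siblings under the same parent only.
sister_markers = rank_markers(mean_expr, "c0", sister_clusters(parent_of, "c0"), top_n=1)
```

In this toy data CD3E ranks among the global markers of c0 because it separates T cells from B cells, but it drops out of the sister comparison because both T cell clusters express it equally; only CD4 distinguishes c0 from its sister.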

Code examples

The mLLMCelltype GitHub[2] repository comes with official documentation in multiple languages, and the package is straightforward to use.

python

Installation


# Install from PyPI
pip install mllmcelltype

# Or install from GitHub
pip install git+.git

Example


import scanpy as sc
import pandas as pd
from mllmcelltype import annotate_clusters, setup_logging, interactive_consensus_annotation
import os

# Set up logging
setup_logging()

# Load the data
adata = sc.read_h5ad('your_data.h5ad')

# Check whether Leiden clustering has been computed; if not, compute it
if 'leiden' not in adata.obs.columns:
    print("Computing Leiden clustering...")
    # Make sure the data are preprocessed (normalization, log transform, etc.)
    if 'log1p' not in adata.uns:
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)

    # Compute PCA if it has not been computed yet
    if 'X_pca' not in adata.obsm:
        sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
        sc.pp.pca(adata, use_highly_variable=True)

    # Compute the neighbor graph and Leiden clustering
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
    sc.tl.leiden(adata, resolution=0.8)
    print(f"Leiden clustering finished with {len(adata.obs['leiden'].cat.categories)} clusters")

# Run differential expression analysis to get marker genes
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Extract marker genes for each cluster
marker_genes = {}
for cluster in adata.obs['leiden'].cat.categories:
    # Take the top 10 ranked genes for this cluster
    top_genes = adata.uns['rank_genes_groups']['names'][cluster][:10]
    marker_genes[str(cluster)] = [str(gene) for gene in top_genes]

# Important: use gene symbols (e.g. KCNJ8, PDGFRA), not Ensembl IDs (e.g. ENSG00000176771)
# If your AnnData object stores Ensembl IDs, convert them to gene symbols first:
# Example:
# if 'Gene' in adata.var.columns:  # check whether the var data frame has gene symbols
#     gene_name_dict = dict(zip(adata.var_names, adata.var['Gene']))
#     marker_genes = {cluster: [gene_name_dict.get(gene_id, gene_id) for gene_id in genes]
#                    for cluster, genes in marker_genes.items()}

# Set the API keys for the providers you want to use
# You need at least one API key matching the models you plan to use
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"        # required for GPT models
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"  # required for Claude models
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"        # required for Gemini models
os.environ["QWEN_API_KEY"] = "your-qwen-api-key"            # required for Qwen (Tongyi Qianwen) models
# Other optional models
# os.environ["DEEPSEEK_API_KEY"] = "your-deepseek-api-key"  # required for DeepSeek models
# os.environ["ZHIPU_API_KEY"] = "your-zhipu-api-key"        # required for Zhipu GLM models
# os.environ["STEPFUN_API_KEY"] = "your-stepfun-api-key"    # required for Step models
# os.environ["MINIMAX_API_KEY"] = "your-minimax-api-key"    # required for MiniMax models

# Run consensus annotation with multiple models
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="blood",
    models=["gpt-4o", "claude-3-7-sonnet-20250219", "gemini-1.5-pro", "qwen-max-2025-01-25"],
    consensus_threshold=0.7,  # agreement threshold for consensus
    max_discussion_rounds=3   # maximum number of discussion rounds between models
)

# Get the final consensus annotations from the result dictionary
final_annotations = consensus_results["consensus"]

# Add the consensus annotations to the AnnData object
adata.obs['consensus_cell_type'] = adata.obs['leiden'].astype(str).map(final_annotations)

# Add the uncertainty metrics to the AnnData object
adata.obs['consensus_proportion'] = adata.obs['leiden'].astype(str).map(consensus_results["consensus_proportion"])
adata.obs['entropy'] = adata.obs['leiden'].astype(str).map(consensus_results["entropy"])

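# Important: make sure UMAP coordinates are computed before visualization
# If your AnnData object has no UMAP coordinates, compute them:
if 'X_umap' not in adata.obsm:
    print("Computing UMAP coordinates...")
    # Make sure the neighbor graph has been computed
    if 'neighbors' not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
    sc.tl.umap(adata)
    print("UMAP coordinates computed")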

# Visualize the results
# Basic visualization
sc.pl.umap(adata, color='consensus_cell_type', legend_loc='right', frameon=True, title='mLLMCelltype consensus annotation')

# More customized visualization
import matplotlib.pyplot as plt

# Set figure size and style
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 12

# Create a UMAP plot better suited for publication
fig, ax = plt.subplots(1, 1, figsize=(12, 10))
sc.pl.umap(adata, color='consensus_cell_type', legend_loc='on data',
         frameon=True, title='mLLMCelltype consensus annotation',
         palette='tab20', size=50, legend_fontsize=12,
         legend_fontoutline=2, ax=ax)

# Visualize the uncertainty metrics
# show=False keeps both panels on the same figure until plt.show() is called
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))
sc.pl.umap(adata, color='consensus_proportion', ax=ax1, title='Consensus proportion',
         cmap='viridis', vmin=0, vmax=1, size=30, show=False)
sc.pl.umap(adata, color='entropy', ax=ax2, title='Annotation uncertainty (Shannon entropy)',
         cmap='magma', vmin=0, size=30, show=False)
plt.tight_layout()
plt.show()
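Once the per-cluster metrics are available, one possible follow-up is to list the clusters that deserve a manual look. The helper below is a hypothetical sketch, not part of the mLLMCelltype API: it assumes two dictionaries keyed by cluster ID, shaped like `consensus_results["consensus_proportion"]` and `consensus_results["entropy"]` in the example above, and its default cutoffs simply reuse the 0.7 and 1.0 values that appear in the examples in this article.

```python
from typing import Dict, List


def clusters_needing_review(
    consensus_proportion: Dict[str, float],
    entropy: Dict[str, float],
    min_proportion: float = 0.7,
    max_entropy: float = 1.0,
) -> List[str]:
    """Return cluster IDs with weak agreement, least-agreed cluster first.

    A cluster is flagged when its consensus proportion is below
    min_proportion or its entropy is above max_entropy.
    """
    flagged = [
        cluster
        for cluster, proportion in consensus_proportion.items()
        if proportion < min_proportion or entropy[cluster] > max_entropy
    ]
    return sorted(flagged, key=lambda cluster: consensus_proportion[cluster])


# Toy metrics for three clusters (made-up values).
toy_proportion = {"0": 1.0, "1": 0.5, "2": 0.75}
toy_entropy = {"0": 0.0, "1": 1.5, "2": 0.81}

flagged = clusters_needing_review(toy_proportion, toy_entropy)
stricter = clusters_needing_review(toy_proportion, toy_entropy, max_entropy=0.5)
```

With real results, the two dictionaries from `consensus_results` could be passed in place of the toy values, provided they have the shape assumed here.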

R

Installation


# Install from GitHub
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R")

Example


# Load required packages
library(mLLMCelltype)
library(Seurat)
library(dplyr)
library(ggplot2)
library(cowplot) # Added for plot_grid

# Load your preprocessed Seurat object
pbmc <- readRDS("your_seurat_object.rds")

# If starting with raw data, perform preprocessing steps
# pbmc <- NormalizeData(pbmc)
# pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# pbmc <- ScaleData(pbmc)
# pbmc <- RunPCA(pbmc)
# pbmc <- FindNeighbors(pbmc, dims = 1:10)
# pbmc <- FindClusters(pbmc, resolution = 0.5)
# pbmc <- RunUMAP(pbmc, dims = 1:10)

# Find marker genes for each cluster
pbmc_markers <- FindAllMarkers(pbmc,
                            only.pos = TRUE,
                            min.pct = 0.25,
                            logfc.threshold = 0.25)

# Set up cache directory to speed up processing
cache_dir <- "./mllmcelltype_cache"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Run LLMCelltype annotation with multiple LLM models
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "human PBMC",  # provide tissue context
  models = c(
    "claude-3-7-sonnet-20250219",  # Anthropic
    "gpt-4o",                   # OpenAI
    "gemini-1.5-pro",           # Google
    "qwen-max-2025-01-25"       # Alibaba
  ),
  api_keys = list(
    anthropic = "your-anthropic-key",
    openai = "your-openai-key",
    gemini = "your-google-key",
    qwen = "your-qwen-key"
  ),
  top_gene_count = 10,
  controversy_threshold = 0.7,
  entropy_threshold = 1.0,
  cache_dir = cache_dir
)

# Print structure of results to understand the data
print("Available fields in consensus_results:")
print(names(consensus_results))

# Add annotations to Seurat object
# Get cell type annotations from consensus_results$final_annotations
cluster_to_celltype_map <- consensus_results$final_annotations

# Create new cell type identifier column
cell_types <- as.character(Idents(pbmc))
for (cluster_id in names(cluster_to_celltype_map)) {
  cell_types[cell_types == cluster_id] <- cluster_to_celltype_map[[cluster_id]]
}

# Add cell type annotations to Seurat object
pbmc$cell_type <- cell_types

# Add uncertainty metrics
# Extract detailed consensus results containing metrics
consensus_details <- consensus_results$initial_results$consensus_results

# Create a data frame with metrics for each cluster
uncertainty_metrics <- data.frame(
  cluster_id = names(consensus_details),
  consensus_proportion = sapply(consensus_details, function(res) res$consensus_proportion),
  entropy = sapply(consensus_details, function(res) res$entropy)
)

# Add uncertainty metrics for each cell
# Look up each cell's cluster ID, then map the per-cluster metrics onto the cells
current_clusters <- as.character(Idents(pbmc))
pbmc$consensus_proportion <- uncertainty_metrics$consensus_proportion[match(current_clusters, uncertainty_metrics$cluster_id)]
pbmc$entropy <- uncertainty_metrics$entropy[match(current_clusters, uncertainty_metrics$cluster_id)]

# Save results for future use
saveRDS(consensus_results, "pbmc_mLLMCelltype_results.rds")
saveRDS(pbmc, "pbmc_annotated.rds")

# Visualize results with SCpubr for publication-ready plots
if (!requireNamespace("SCpubr", quietly = TRUE)) {
  remotes::install_github("enblacar/SCpubr")
}
library(SCpubr)
library(viridis)  # For color palettes

# Basic UMAP visualization with default settings
pdf("pbmc_basic_annotations.pdf", width=8, height=6)
SCpubr::do_DimPlot(sample = pbmc,
                  group.by = "cell_type",
                  label = TRUE,
                  legend.position = "right") +
  ggtitle("mLLMCelltype Consensus Annotations")
dev.off()

# More customized visualization with enhanced styling
pdf("pbmc_custom_annotations.pdf", width=8, height=6)
SCpubr::do_DimPlot(sample = pbmc,
                  group.by = "cell_type",
                  label = TRUE,
                  label.box = TRUE,
                  legend.position = "right",
                  pt.size = 1.0,
                  border.size = 1,
                  font.size = 12) +
  ggtitle("mLLMCelltype Consensus Annotations") +
  theme(plot.title = element_text(hjust = 0.5))
dev.off()

# Visualize uncertainty metrics with enhanced SCpubr plots
# Get cell types and create a named color palette
cell_types <- unique(pbmc$cell_type)
color_palette <- viridis::viridis(length(cell_types))
names(color_palette) <- cell_types

# Cell type annotations with SCpubr
p1 <- SCpubr::do_DimPlot(sample = pbmc,
                  group.by = "cell_type",
                  label = TRUE,
                  legend.position = "bottom",  # Place legend at the bottom
                  pt.size = 1.0,
                  label.size = 4,  # Smaller label font size
                  label.box = TRUE,  # Add background box to labels for better readability
                  repel = TRUE,  # Make labels repel each other to avoid overlap
                  colors.use = color_palette,
                  plot.title = "Cell Type") +
      theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
            legend.text = element_text(size = 8),
            legend.key.size = unit(0.3, "cm"),
            plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Consensus proportion feature plot with SCpubr
p2 <- SCpubr::do_FeaturePlot(sample = pbmc,
                       features = "consensus_proportion",
                       order = TRUE,
                       pt.size = 1.0,
                       enforce_symmetry = FALSE,
                       legend.title = "Consensus",
                       plot.title = "Consensus Proportion",
                       sequential.palette = "YlGnBu",  # Yellow-Green-Blue gradient, following Nature Methods standards
                       sequential.direction = 1,  # Light to dark direction
                       min.cutoff = min(pbmc$consensus_proportion),  # Set minimum value
                       max.cutoff = max(pbmc$consensus_proportion),  # Set maximum value
                       na.value = "lightgrey") +  # Color for missing values
      theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
            plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Shannon entropy feature plot with SCpubr
p3 <- SCpubr::do_FeaturePlot(sample = pbmc,
                       features = "entropy",
                       order = TRUE,
                       pt.size = 1.0,
                       enforce_symmetry = FALSE,
                       legend.title = "Entropy",
                       plot.title = "Shannon Entropy",
                       sequential.palette = "OrRd",  # Orange-Red gradient, following Nature Methods standards
                       sequential.direction = -1,  # Dark to light direction (reversed)
                       min.cutoff = min(pbmc$entropy),  # Set minimum value
                       max.cutoff = max(pbmc$entropy),  # Set maximum value
                       na.value = "lightgrey") +  # Color for missing values
      theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
            plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Combine plots with equal widths
pdf("pbmc_uncertainty_metrics.pdf", width=18, height=7)
combined_plot <- cowplot::plot_grid(p1, p2, p3, ncol = 3, rel_widths = c(1.2, 1.2, 1.2))
print(combined_plot)
dev.off()

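❝The left panel shows the cell type annotations on the UMAP projection. The middle panel shows the consensus proportion on a yellow-green-blue gradient (deeper blue means stronger agreement). The right panel shows the Shannon entropy on an orange-red gradient (deeper red means lower uncertainty, lighter orange means higher uncertainty).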

Citation


Yang, C., Zhang, X., & Chen, J. (2025). Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data. bioRxiv. .1101/2025.04.10.647852

Summary

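From a practical standpoint, mLLMCelltype offers clear advantages for large-scale single-cell analysis. Its modular design allows new LLMs to be integrated seamlessly, which keeps it adaptable. Its core strength is a transparent, systematic consensus process: by tracking the multi-model deliberation, it provides a detailed reasoning chain. Experts can use the consensus metrics to find and review contested cases efficiently, with the full context at hand, which greatly reduces manual annotation time while improving the overall reliability and quality of cell type annotation in complex tissues.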

Reference

[1]

bioRxiv: .1101/2025.04.10.647852v1.full

[2]

Github:

This article is part of the Tencent Cloud self-media syndication program and was shared from a WeChat official account. Originally published: 2025-04-26.
