单细胞表达量矩阵读取后居然是一个长度为3的list对象
单细胞数据分析也做了挺多公共数据集了,最近收到了曾老板发来的一个邮件,给了我一个数据集。说这个数据集读进去后是一个list对象,我起初还不理解单细胞表达矩阵,非常标准的三个文件,怎么就是list对象了!下面来看看~
发来的数据集为 GSE243665:.cgi?acc=GSE243665
数据背景
这个数据来自文献《A single cell RNAseq benchmark experiment embedding "controlled" cancer heterogeneity》,主要为7个带有不同driver genes突变的细胞系,使用 10X Genomics 进行单细胞转录组测序:
- A549 (KRAS p.G12S, growth and proliferation, PMID: 20358631;
- CCL.185.IG (EML4-ALK Fusion-A549 Isogenic Cell, );
- NCI-H1395 (CRL5868, BRAF p.G469A, gain of function, resistant to all tested MEK +/− BRAF inhibitors, PMID: 32540409;
- DV90 (ERBB2 p.V842I, increases kinase activity, PMID: 23220880;
- HCC78 (SLC34A2-ROS1 Fusion, ROS1 inhibitors have antiproliferative effect PMID: 22919003;
- NCI-H596 (HTB178, MET Del14 , enhanced protection from apoptosis and cellular migration PMID: 35636967;
- PC9 (EGFR Del19, activating mutation, PMID: 21167064。
实验之后cell ranger定量的数据统计结果如下:
数据读取
在GEO下载 A549 细胞系的数据看下:
# 整理成下面这样
A549/
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
读取:
代码语言:javascript代码运行次数:0运行复制###
### Create: Jianming Zeng
### Date: 2023-12-31
### Email: jmzeng1314@163
### Blog: /
### Forum: .html
### CAFS/SUSTC/Eli Lilly/University of Macau
### Update Log: 2023-12-31 First version
### Update Log: 2024-12-09 by juan zhang (492482942@qq)
###
rm(list=ls())
options(stringsAsFactors = F)
library(ggsci)
library(dplyr)
library(future)
library(Seurat)
library(clustree)
library(cowplot)
library(data.table)
library(ggplot2)
library(patchwork)
library(stringr)
library(qs)
library(Matrix)
getwd()
# 创建目录
getwd()
gse <- "GSE243665"
dir.create(gse)
###### step1: 导入数据 ######
counts <- Read10X(data.dir = "GSE243665/A549/", gene.column = 2)
str(counts)
真的是一个list对象,然后有三个元素,到这里我才明白老板是啥意思!但是看文章中的实验部分描述,也没有说这不是一个常规的单细胞转录组测序啊:
如何创建对象呢?
上面那个counts是个list对象,里面不仅有转录组的表达矩阵,还有捕获的 140个蛋白表达。那如何创建seurat对象呢?
这个时候我看了一下 Read10X 函数,他给了我一点提示:
## 分开读取
# If there are multiple data types, they might be loaded separately
# For example, if you have both RNA and protein data
seurat_object = CreateSeuratObject(counts = counts$`Gene Expression`, names.field = 2)
seurat_object
seurat_object[['Protein']] = CreateAssayObject(counts = counts$`Antibody Capture`, names.field = 2)
seurat_object
这样就是一个多模态的数据了:
代码语言:javascript代码运行次数:0运行复制An object of class Seurat
36741 features across 6898 samples within 2 assays
Active assay: RNA (36601 features, 0 variable features)
1 layer present: counts
1 other assay present: Protein
简单探索一下看看:
代码语言:javascript代码运行次数:0运行复制# 查看特征
head(seurat_object@meta.data, 10)
table(seurat_object$orig.ident)
# Access the RNA data
RNA_data <- GetAssayData(seurat_object, assay = "RNA", slot = "counts")
as.data.frame(RNA_data[1:5,1:6])
# Access the Protein data
Protein_data <- GetAssayData(seurat_object, assay = "Protein", slot = "counts")
as.data.frame(Protein_data[1:5,1:6])
rownames(Protein_data)
后续分析如果只想分析 转录组RNA信息,走常规流程就可以啦。
如果想做多模态的数据分析,就看seurat官网: