Preprocess quality-controlled scRNA-seq data for downstream analyses such as comparative analysis.
library(Seurat)
library(ggplot2)
Load the Seurat object created in the 02_quality_control
step.
merged4norm_seurat_obj <- readRDS("./output/02_quality_control/obj/merged4norm_seurat_obj.RDS")
Seurat V5 allows simultaneous processing of multiple datasets stored
in layer
of a single Seurat object.
Since total read counts vary across cells, normalization is required to enable comparison. The default method scales the total read count per cell to 10,000, followed by log transformation.
norm_seurat_obj <- NormalizeData(merged4norm_seurat_obj)
Highly variable genes are crucial for defining cellular characteristics, while genes with low variability (e.g., housekeeping genes) are expressed similarly across all cells. By default, 2,000 genes are selected.
norm_seurat_obj <- FindVariableFeatures(norm_seurat_obj)
Visualizing the distribution of HVGs:
VariableFeaturePlot(norm_seurat_obj)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 2035 rows containing missing values (`geom_point()`).
Preparation for PCA. Scaling enables gene-to-gene comparison. By default, scaling is performed only on HVGs to reduce computational load. Each gene is transformed to have a mean of 0 and a standard deviation of 1.
norm_seurat_obj <- ScaleData(norm_seurat_obj)
Dimensionality reduction is performed to convert high-dimensional scRNA-seq data into a lower-dimensional representation while preserving its essential features.
norm_seurat_obj <- RunPCA(norm_seurat_obj)
Integration of datasets across experimental batches, donors, or
conditions is a crucial step in scRNA-seq workflows. Seurat V5
introduces the IntegrateLayers
function, which supports
multiple integration methods.
integ_norm_seurat_obj <- IntegrateLayers(
object = norm_seurat_obj,
method = CCAIntegration,
orig.reduction = "pca",
new.reduction = "integrated.cca",
verbose = FALSE
)
Preparation for clustering. The reduction
argument
specifies the dimensionality reduction result to be used.
integ_norm_seurat_obj <- FindNeighbors(integ_norm_seurat_obj,
reduction = "integrated.cca",
dims = 1:30)
Cells are classified based on their gene expression profiles. The
resolution
parameter controls cluster granularity: higher
values lead to more clusters, lower values lead to fewer clusters.
integ_norm_seurat_obj <- FindClusters(integ_norm_seurat_obj, cluster.name = "cca_clusters")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 13207
## Number of edges: 536785
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9061
## Number of communities: 22
## Elapsed time: 2 seconds
A non-linear dimensionality reduction method that visualizes data structure and patterns in two or three dimensions.
integ_norm_seurat_obj <- RunUMAP(integ_norm_seurat_obj,
reduction = "integrated.cca",
dims = 1:30,
reduction.name = "umap.cca")
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
Visualizing UMAP results:
DimPlot(
integ_norm_seurat_obj,
reduction = "umap.cca",
group.by = c("condition", "cca_clusters")
)
integ_norm_seurat_obj <- JoinLayers(integ_norm_seurat_obj)
saveRDS(integ_norm_seurat_obj, "./output/03_data_preprocessing_and_integration/obj/integ_norm_seurat_obj.RDS")