As I continue learning, I encountered several issues while using the previous dataset, RNA-seq of human multiple myeloma patients myeloid-derived suppressor cells (M-MDSC). Unfortunately, I was unable to prepare it for DESeq2 analysis due to unexpected errors, which resulted in wasted time and frustration. Despite my strong desire to utilize the dataset, I couldn’t achieve the desired results.
During my search for solutions, I came across an excellent Differential expression with DEseq2 Tutorial, which provided valuable insights into the DESeq2 analysis. However, since I still didn’t know how to properly prepare the (M-MDSC) dataset, I decided to work with the dataset used in the tutorial. This way, I could apply the knowledge I had gained from the tutorial to a dataset with known steps and expectations.
We will cover data preprocessing, differential expression analysis, data visualization with histograms and heatmaps, and the use of Principal Component Analysis (PCA) for sample relationship visualization. For those interested in the entire workflow, I have documented it on my GitHub feel free to explore.
Starting with
Sets up the necessary data and prepares it in the required format to perform differential expression analysis using the DESeq2 package. We transformed the data into a suitable matrix format for further analysis with DESeq2.
“Basically, here we upload our data and prepare rawcounts”
library(DESeq2)
library(ggplot2)
library(pheatmap
# Read in the raw read counts
rawCounts <- read.delim("<http://genomedata.org/gen-viz-workshop/intro_to_deseq2/tutorial/E-GEOD-50760-raw-counts.tsv>")
head(rawCounts)
# Read in the sample mappings
sampleData <- read.delim("<http://genomedata.org/gen-viz-workshop/intro_to_deseq2/tutorial/E-GEOD-50760-experiment-design.tsv>")
head(sampleData)
#For rawCounts
# Convert count data to a matrix of appropriate form that DEseq2 can read
geneID <- rawCounts$Gene.ID
sampleIndex <- grepl("SRR\\\\d+", colnames(rawCounts))
rawCounts <- as.matrix(rawCounts[, sampleIndex])
rownames(rawCounts) <- geneID
head(rawCounts)
Console Snip
> head(rawCounts)
SRR975551 SRR975552 SRR975553 SRR975554 SRR975555 SRR975556
ENSG00000000003 6617 1352 1492 3390 1464 1251
ENSG00000000005 69 1 20 23 12 4
ENSG00000000419 2798 714 510 1140 1667 322
ENSG00000000457 486 629 398 239 383 290
ENSG00000000460 466 342 73 227 193 35
ENSG00000000938 75 95 158 107 135 75
SRR975557 SRR975558 SRR975559 SRR975560 SRR975561 SRR975562
ENSG00000000003 207 1333 2126 1799 1362 3435
ENSG00000000005 20 2 3 6 10 15
ENSG00000000419 273 621 1031 677 480 1194
ENSG00000000457 164 452 172 229 264 297
ENSG00000000460 38 184 174 68 46 173
ENSG00000000938 236 254 121 107 94 90
Continue Data Preprocessing and Preparation
- Now we select specific columns, renames them, and converts the “individualID” column to a factor, ensuring that the data is in a compatible format for DESeq2’s requirements.
- “Basically, here we prepare sampleData”
# Convert sample variable mappings to an appropriate form that DESeq2 can read
head(sampleData)
rownames(sampleData) <- sampleData$Run
keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
sampleData <- sampleData[, keep]
colnames(sampleData) <- c("tissueType", "individualID")
sampleData$individualID <- factor(sampleData$individualID)
head(sampleData)
Watch Difference in Console Snip
> # Convert sample variable mappings to an appropriate form that DESeq2 can read
> head(sampleData)
Run Sample.Characteristic.biopsy.site.
1 SRR975551 primary tumor
2 SRR975552 primary tumor
Sample.Characteristic.Ontology.Term.biopsy.site. Sample.Characteristic.disease.
1 <http://www.ebi.ac.uk/efo/EFO_0000616> colorectal cancer
2 <http://www.ebi.ac.uk/efo/EFO_0000616> colorectal cancer
Sample.Characteristic.Ontology.Term.disease.
1 <http://www.ebi.ac.uk/efo/EFO_0005842>
2 <http://www.ebi.ac.uk/efo/EFO_0005842>
Sample.Characteristic.disease.staging.
1 Stage IV Colorectal Cancer
2 Stage IV Colorectal Cancer
Sample.Characteristic.Ontology.Term.disease.staging.
1 NA
2 NA
Sample.Characteristic.individual. Sample.Characteristic.Ontology.Term.individual.
1 AMC_2 NA
2 AMC_3 NA
Sample.Characteristic.organism. Sample.Characteristic.Ontology.Term.organism.
1 Homo sapiens <http://purl.obolibrary.org/obo/NCBITaxon_9606>
2 Homo sapiens <http://purl.obolibrary.org/obo/NCBITaxon_9606>
Sample.Characteristic.organism.part.
1 colon
2 colon
Sample.Characteristic.Ontology.Term.organism.part. Factor.Value.biopsy.site.
1 <http://purl.obolibrary.org/obo/UBERON_0001155> primary tumor
2 <http://purl.obolibrary.org/obo/UBERON_0001155> primary tumor
Factor.Value.Ontology.Term.biopsy.site. Analysed
1 <http://www.ebi.ac.uk/efo/EFO_0000616> Yes
2 <http://www.ebi.ac.uk/efo/EFO_0000616> Yes
> rownames(sampleData) <- sampleData$Run
> keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
> sampleData <- sampleData[, keep]
> colnames(sampleData) <- c("tissueType", "individualID")
> sampleData$individualID <- factor(sampleData$individualID)
> head(sampleData)
tissueType individualID
SRR975551 primary tumor AMC_2
SRR975552 primary tumor AMC_3
SRR975553 primary tumor AMC_5
Here is one of the things I had learned
- Prepares the data for unsupervised clustering analysis by reordering columns, renaming tissue types, converting variables to factors, and creating a DESeq2DataSet object. Then performing variance stabilizing transformation and calculates a correlation matrix for hierarchical clustering with correlation heatmaps.
# Put the columns of the count data in the same order as row names of the sample mapping, then make sure it worked
rawCounts <- rawCounts[, unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))
# Rename the tissue types
rename_tissues <- function(x) {
x <- switch(as.character(x), "normal" = "normal-looking surrounding colonic epithelium", "primary tumor" = "primary colorectal cancer", "colorectal cancer metastatic in the liver" = "metastatic colorectal cancer to the liver")
return(x)
}
sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))
# Order the tissue types so that it is sensible and make sure the control sample is first: normal sample -> primary tumor -> metastatic tumor
sampleData$tissueType <- factor(sampleData$tissueType, levels = c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))
# Modify factor levels to comply with safe naming conventions
levels(sampleData$individualID) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$individualID))
levels(sampleData$tissueType) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$tissueType))
# Create the DESeq2DataSet object
deseq2Data <- DESeqDataSetFromMatrix(countData = rawCounts, colData = sampleData, design = ~ individualID + tissueType)
# Estimate size factors
dds_wt <- estimateSizeFactors(deseq2Data)
# Unsupervised clustering analysis: log transformation using vst
vsd_wt <- vst(dds_wt, blind = TRUE)
# Hierarchical clustering with correlation heatmaps
vsd_mat_wt <- assay(vsd_wt)
vsd_cor_wt <- cor(vsd_mat_wt)
# Save vsd_cor_wt as a TSV file (Optional only to see an overview)
write.table(vsd_cor_wt, file = "vsd_cor_wt.tsv", sep = "\\t", row.names = TRUE, col.names = TRUE)
Let’s Break Our Code a Little Bit:
By reordering the columns of the count data rawCounts to match the row names (gene IDs) of the sample metadata sampleData, it ensures that the samples are in the correct order for downstream analysis. This step is crucial because DESeq2 relies on correctly matched count data and sample metadata to perform valid statistical analysis.
rawCounts <- rawCounts[, unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))
Renaming tissue types to more informative names makes the data easier to interpret and understand.
Providing clearer labels for the different sample groups will particularly be useful for visualizations and result interpretation.
rename_tissues <- function(x) {
x <- switch(as.character(x), "normal" = "normal-looking surrounding colonic epithelium", "primary tumor" = "primary colorectal cancer", "colorectal cancer metastatic in the liver" = "metastatic colorectal cancer to the liver")
return(x)
}
sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))
Ordering the tissue types and modifying factor levels ensures that the analysis treats the samples with the intended biological order. This step is crucial for performing meaningful differential expression analysis between specific tissue types
Note: The gsub() function is used to modify the factor levels by replacing any non-alphanumeric characters with underscores. This ensures compliance with safe naming conventions for factor levels and avoids potential issues in subsequent analyses.
sampleData$tissueType <- factor(sampleData$tissueType, levels = c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))
levels(sampleData$individualID) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$individualID))
levels(sampleData$tissueType) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$tissueType))
- Creating a DESeq2DataSet object (deseq2Data) is necessary to organize the count data and sample metadata together. DESeq2 requires data to be in this specific object format for differential expression analysis.
deseq2Data <- DESeqDataSetFromMatrix(countData = rawCounts, colData = sampleData, design = ~ individualID + tissueType)
- Estimating size factors is a critical step in the normalization process for RNA-seq data. It helps to ensure that the count data is appropriately scaled, making it suitable for meaningful comparisons between samples. Size factors are used to normalize the count data, adjusting for differences in sequencing depth and read coverage, so that genes can be compared more accurately across samples.
dds_wt <- estimateSizeFactors(deseq2Data)
- The Variance Stabilizing Transformation (VST) is a powerful statistical method used to stabilize the variance across the range of count values in RNA-seq data. It is based on the negative binomial distribution, which is often used to model count data.
vsd_wt <- vst(dds_wt, blind = TRUE)
- The Correlation Matrix for Hierarchical Clustering helps to visualize the relationships and similarities between the samples. Hierarchical clustering with correlation heatmaps allows researchers to identify potential clusters or patterns in the gene expression data, which can be valuable for understanding the underlying biological relationships and differences between the samples.
vsd_mat_wt <- assay(vsd_wt)
vsd_cor_wt <- cor(vsd_mat_wt)
- These steps collectively prepare the data for robust differential expression analysis. They ensure that the data is correctly organized, normalized, and transformed to enable reliable detection of differentially expressed genes and meaningful insights into the biological differences between the sample groups under study.
Console Snip
> # Put the columns of the count data in the same order as row names of the sample mapping, then make sure it worked
> rawCounts <- rawCounts[, unique(rownames(sampleData))]
> all(colnames(rawCounts) == rownames(sampleData))
[1] TRUE
Some Visualization
- The x-axis tick values will be displayed in a non-scientific notation format, making the plot more readable and user-friendly. The y-axis represents the number of genes falling into each bin of the histogram.
# Add the ggplot code snippet with modified x-axis formatting
ggplot(data.frame(wt_normal1 = rawCounts[, 1])) +
geom_histogram(aes(x = wt_normal1), stat = "bin", bins = 200) +
xlab("Raw expression counts") +
ylab("Number of genes") +
scale_x_continuous(labels = function(x) format(x, scientific = FALSE))
Plot the heatmap using pheatmap
# Prepare data for pheatmap
data_for_heatmap <- as.matrix(vsd_cor_wt)
# Convert tissueType to a character vector
annotation_row <- as.character(sampleData$tissueType)
# Add spaces between words in the x-axis labels
annotation_row_with_spaces <- paste(" ", annotation_row, " ")
# Plot the heatmap using pheatmap with manual row annotations
pheatmap(data_for_heatmap,
cluster_rows = TRUE,
cluster_cols = TRUE,
color = colorRampPalette(c("blue", "white", "red"))(50),
show_rownames = FALSE,
show_colnames = TRUE,
row_names_side = "left",
annotation_colors = "black",
annotation_names_row = FALSE,
labels_row = annotation_row_with_spaces,
fontsize_row = 8, # Adjust the font size of row labels
fontsize_col = 12, # Adjust the font size of column labels
angle_col = 45) # Set the angle of column labels to 45 degrees
- Finally, Perform PCA and provide PCA scores, which can be used for data exploration, visualization, and understanding the underlying structure and relationships within the dataset.
# Calculate PCA scores
sample_scores <- as.data.frame(assay(vsd_wt))
sample_scores$Sample <- rownames(sample_scores)
column_names <- colnames(vsd_wt)
colnames(sample_scores)[2:5] <- c("normal", "fibrosis", "tumor", "metastasis")
sample_scores$PC1 <- sample_scores$normal * -2 + sample_scores$fibrosis * -10 + sample_scores$tumor * 8 + sample_scores$metastasis * 1
sample_scores$PC2 <- sample_scores$normal * 0.5 + sample_scores$fibrosis * 1 + sample_scores$tumor * -5 + sample_scores$metastasis * 6
# Print the PCA scores
print(sample_scores)
Console Snip
> # Print the PCA scores
> print(sample_scores)
SRR975551 normal fibrosis tumor metastasis SRR975556
ENSG00000000003 11.923929 9.968500 10.588833 11.791039 10.058138 10.737563
ENSG00000000005 5.625681 2.799376 4.866074 5.031436 4.065530 3.717037
ENSG00000000419 10.686895 9.059366 9.056915 10.226302 10.243933 8.802204
ENSG00000000457 8.199791 8.879902 8.706264 8.013961 8.158745 8.654508
ENSG00000000460 8.141111 8.024470 6.392902 7.942364 7.214920 5.820591
ENSG00000000938 5.725709 6.301239 7.421013 6.916156 6.736584 6.794418
ENSG00000000971 8.906142 8.952251 10.967832 8.930424 8.158745 10.365427
ENSG00000001036 10.656848 10.452489 10.439085 11.081124 10.571350 10.396267
ENSG00000001084 10.073191 11.118101 9.535102 10.560985 9.788243 9.247774
ENSG00000001167 9.748116 10.326186 9.100712 10.235065 9.469240 9.013992
ENSG00000001460 6.899011 7.258717 8.048669 6.852399 7.791251 7.558024
ENSG00000001461 9.527680 9.811171 9.584137 9.060077 10.353852 9.868912
ENSG00000001497 10.308639 9.804080 9.171732 10.270811 9.505511 9.161493
ENSG00000001561 8.509119 9.179311 9.338776 8.707479 9.155344 8.998760
ENSG00000001617 7.001645 9.451539 8.135472 8.244756 9.540893 8.610090
ENSG00000001626 10.998738 10.733206 11.781040 10.338590 12.181192 12.134195
ENSG00000001629 10.228487 10.303738 9.847778 9.725462 10.016518 9.868912
Sample PC1 PC2
ENSG00000000003 ENSG00000000003 -21.438886 16.966720
ENSG00000000005 ENSG00000000005 -9.942478 5.501761
ENSG00000000419 ENSG00000000419 -16.633528 23.918686
ENSG00000000457 ENSG00000000457 -32.552004 22.028879
ENSG00000000460 ENSG00000000460 -9.224133 13.982838
ENSG00000000938 ENSG00000000938 -24.746778 16.410357
ENSG00000000971 ENSG00000000971 -47.980679 19.744307
ENSG00000001036 ENSG00000001036 -26.075487 23.687810
ENSG00000001084 ENSG00000001084 -23.311101 21.018685
ENSG00000001167 ENSG00000001167 -20.309732 19.903922
ENSG00000001460 ENSG00000001460 -32.393675 24.163534
ENSG00000001461 ENSG00000001461 -32.629246 31.312451
ENSG00000001497 ENSG00000001497 -19.653479 19.752784
ENSG00000001561 ENSG00000001561 -32.931205 25.323103
ENSG00000001617 ENSG00000001617 -24.758855 28.882818
ENSG00000001626 ENSG00000001626 -44.386905 38.541846
ENSG00000001629 ENSG00000001629 -31.265045 26.471447
[ reached 'max' / getOption("max.print") -- omitted 65200 rows ]
In conclusion, this article demonstrates the use of DESeq2 and R packages like ggplot2 and pheatmap to analyze gene expression data. It covers data preprocessing, unsupervised clustering, heatmap visualization, and principal component analysis (PCA). By providing detailed code snippets and explanations, it enables readers to perform similar analyses and gain valuable insights into biological processes.
Top comments (0)