<![CDATA[Mikhail Raevskiy Blog]]>raevskymichail.github.io//raevskymichail.github.io//favicon.pngMikhail Raevskiy Blograevskymichail.github.io//Ghost 3.37Fri, 18 Dec 2020 08:43:08 GMT60<![CDATA[Supervised Multi-Omics Integration]]>In the previous post I introduced the topic and explained why we need to do feature selection before proceeding with the integration. Here I will explain how to perform supervised OMICs integration using Partial Least Square (PLS) regression and Discriminant Analysis.

Chronic Lymphocytic Leukemia (CLL) Data Set

Integration

]]>
raevskymichail.github.io//supervised-omics-integration/5fbd353c51732c0059dfee93Wed, 25 Nov 2020 09:57:15 GMT

In the previous post I introduced the topic and explained why we need to do feature selection before proceeding with the integration. Here I will explain how to perform supervised OMICs integration using Partial Least Square (PLS) regression and Discriminant Analysis.

Chronic Lymphocytic Leukemia (CLL) Data Set

Integration of multiple molecular OMICs represents a contemporary challenge in computational Biology and Biomedicine as Big Data becomes a reality here. One example of biomedical data with multiple OMICs which we will use here is the study of Chronic Lymphocytic Leukemia (CLL). This study combined drug response with somatic mutation information, transcriptome profiling and DNA methylation assays.

Supervised Multi-Omics Integration
From Dietrich et al., Journal of Clinical Investigation, 2017, image source

Traditional biological approach to perform integration has been for decades the pair-wise correlations of OMICs layers. Despite its interpretability and simplicity this approach was recognized to provide inconsistent information for the case of multiple OMICs sets. To overcome the problem of technological dissimilarities between different data sets, a more promising idea of OMICs integration is to represent each data set by latent variables which contain no memory of the technology they were produced by.

PLS-DA Integration with DIABLO

The most straightforward implementation of the idea of latent variables is the multivariate integration of multiple OMICs based on the Partial Least Squares (PLS) regression and discriminant analysis when most informative features from different OMICs are being selected with the constraint of correlation between their first PLS components. This method has become very popular due to its simplicity and efficient implementation in the R package mixOmics.

Supervised Multi-Omics Integration
Rohart et al., Plos Computational Biology, 2017, image source

MixOmics includes several elegant statistical algorithms, among others MINT represent across samples integration, similar to batch effect correction, and DIABLO represents across OMICs integration, which is the true multi-OMICs data integration which we are going to use here for the CLL data set.

CLL OMICs Integration with DIABLO

We will start with reading the CLL data and imputing the missing values with the simple median imputation. We will use Gender as the trait of interest, thus PLS-DA algorithm will perform extraction of features simultaneously across multiple OMICs that maximize the separation between Males and Females in the low-dimensional latent PLS space. Since we direct the feature extraction with the Gender variable, this is a supervised OMICs integration method.

expr <- as.data.frame(t(read.delim("CLL_mRNA.txt", header = TRUE, sep = "\t")))
for (i in 1:ncol(expr)){
    expr[, i][is.na(expr[, i])] <- median(expr[, i], na.rm = TRUE)
}

mut <- as.data.frame(t(read.delim("CLL_Mutations.txt", header = TRUE, sep = "\t")))
for (i in 1:ncol(mut)){
    mut[, i][is.na(mut[, i])] <- median(mut[, i], na.rm = TRUE)
}

meth <- as.data.frame(t(read.delim("CLL_Methylation.txt", header = TRUE, sep = "\t")))
for (i in 1:ncol(meth) 
    meth[, i][is.na(meth[, i])] <- median(meth[, i], na.rm = TRUE)
}

drug <- as.data.frame(t(read.delim("CLL_Drugs.txt", header = TRUE, sep = "\t")))
for (i in 1:ncol(drug)
    drug[, i][is.na(drug[, i])] <- median(drug[, i], na.rm = TRUE)
}

phen <- read.delim("CLL_Covariates.txt", header = TRUE, sep = "\t")
Y <- factor(phen$Gender)

Next we will split the n=200 CLL samples into train (n=140) and test (n=60) data sets for later testing the model for prediction accuracy. Since mutations represent a binary data, there is always a lack of variation due to coding with 0 and 1. Therefore, we will pre-filter the mutation matrix by excluding the sites with variance across individuals close to zero. Next, since gene expression and methylation data sets are high-dimensional, we use LASSO to perform feature pre-selection in the way I described in the previous post. Next, we perform cross-validation for selecting the optimal number of predictive components:

library("mixOmics")

data <- list(expr = expr, mut = mut, meth = meth, drug = drug)
design <- matrix(1, ncol = length(data), nrow = length(data), dimnames = list(names(data), names(data)))
diag(design) <- 0

splsda.res <- block.splsda(X = data, Y = Y, ncomp = 8, design = design)
perf.diablo <- perf(splsda.res, validation = 'Mfold'
                    folds = 2, nrepeat = 5,
                    progressBar = FALSE, cpus = 4)
plot(perf.diablo, overlay = 'dist', sd = TRUE)
Supervised Multi-Omics Integration

The classification error rate seems to reach its plateau at ncomp=2, so we will use this number as an optimal number of predictive components to keep in the further downstream analysis. The design matrix above defines the expected covariance between the OMICs. Here due to the lack of prior knowledge we select a strong correlation 1 between the OMICs. Further, we again will use cross-validation in order to determine the optimal number of features for each OMIC to be used for extracting most predictive variables with LASSO when running the sparse PLS-DA. Finally, after we have found the optimal numbers of predictive components and variables in each OMIC, we run the sparse PLS-DA for OMICs integration.

test.keepX <- list("expr" = c(1:5), "mut" = c(1:5), "meth" = c(1:5), "drug" = c(1:5))
tune.omics <- tune.block.splsda(X = data, Y = Y, ncomp = 2, test.keepX = test.keepX,
                                design = design, cpus = 4, validation = "Mfold",
                                folds = 2, nrepeat = 5, dist = "mahalanobis.dist")
                             
list.keepX <- list("expr" = tune.omics$choice.keepX$expr, "mut" = c(dim(mut)[2], dim(mut)[2]),
                   "meth" = tune.omics$choice.keepX$meth, "drug" = tune.omics$choice.keepX$drug)

res <- block.splsda(X = data, Y = Y, ncomp = 2, keepX = list.keepX,
                    design = design, near.zero.var = FALSE)
                    
plotIndiv(res, legend = TRUE, title = "CLL Omics", ellipse = FALSE, ind.names = FALSE, cex = 2)
plotArrow(res, ind.names = FALSE, legend = TRUE, title = "CLL Omics Integration")
Supervised Multi-Omics Integration
Low-dimensional latent PLS space representation of each individual OMIC
Supervised Multi-Omics Integration
Consensus plot across OMICs as a result of OMICs integration. Males and Females are linearly separable

The quartet of plots above shows the low-dimensional PLS representation of individual OMICs after integrative feature extraction has been performed. In contrast, the so-called Arrow Plot below can be thought as a consensus plot across the OMICs. It demonstrates quite clear separation between Males and Females due to simultaneous across OMICs features extraction with PLS-DA.

Biological Intepretation of Integration Results

MixOmics provides lots of impressive visualization for biological interpretation of the data. Here we present the Correlation Circle Plot, where the variables from the top loadings from each of the OMICs are superimposed. Clustering of the variables around the poles of the circle implies strong correlation between the variables from the OMICs data sets. Variables on the opposite poles of the correlation circle plot imply strong anti-correlation.

plotVar(res, var.names = TRUE, style = 'graphics', legend = TRUE, pch = c(16, 17, 18, 19),
        cex = c(0.8, 0.8, 0.8, 0.8), col = c('blue', 'red2', "darkgreen", "darkorange"))

circosPlot(res, cutoff = 0.7, line = FALSE, size.variables = 0.5)
network(res, blocks = c(1, 2), cex.node.name = 0.6, color.node = c('blue', 'red2'))
network(res, blocks = c(1, 3), cex.node.name = 0.6, color.node = c('blue', 'darkgreen'))
network(res, blocks = c(1, 4), cex.node.name = 0.6, color.node = c('blue', 'darkorange'))
network(res, blocks = c(2, 3), cex.node.name = 0.6, color.node = c('red2', 'darkgreen'))
network(res, blocks = c(2, 4), cex.node.name = 0.6, color.node = c('red2', 'darkorange'))
network(res, blocks = c(3, 4), cex.node.name = 0.6, color.node = c('darkgreen', 'darkorange'))
Supervised Multi-Omics Integration
Correlation Circle Plot (left) and Circus Plot (right) demonstrate feature connection across OMICs

Another way to present correlations between most informative features across the OMICs is the so-called Circos Plot. Again, the variables for this plot were selected simultaneously from all the OMICs, i.e. they are different from those obtained from each individual OMIC separately. Correlation Network is yet another way to demonstrate correlations between most informative features across the OMICs data sets in a pairwise fashion.

Supervised Multi-Omics Integration
Correlation Network demonstrate par-wise correlations between features across OMICs

Prediction with DIABLO Integrative Model

Now it is time for prediction. Once we have trained the PLS-DA model, we can use it and utilize the 60 test samples for making prediction of their gender and accessing the accuracy of the prediction:

data.test <- list(expr = expr_test, mut = mut_test, meth = meth_test, drug = drug_test)
predict.diablo <- predict(res, newdata = data.test, dist = 'mahalanobis.dist')
auroc.diablo <- auroc(res, newdata = data.test, outcome.test = Y.test, plot = TRUE,
                      roc.comp = c(1), roc.block = c(1, 2, 3, 4))

data.frame(predict.diablo$class, Truth = Y.test)
table(predict.diablo$MajorityVote$mahalanobis.dist[, 1], Y.test)
round((sum(diag(table(predict.diablo$MajorityVote$mahalanobis.dist[, 1], Y.test)))
       / sum(table(predict.diablo$MajorityVote$mahalanobis.dist[, 1], Y.test))) * 100)

Typically the success rate (accuracy) of predicting the gender from the DIABLO integrative model on the CLL data set is 60–70% which is not fantastic since it is a linear integration method. We can do it better with non-linear integration model via artificial neural networks which I will explain in the next posts.

Summary

In this post, we have learnt that PLS-DA is an elegant multivariate method for OMICs integration which projects individual data sets on a common latent low-dimensional space where the data sets loose the memory of their initial technological differences. The OMICs can be merged in this common latent space and the trait classes (aka sick-healthy) become linearly separable.

]]>
<![CDATA[Unsupervised Multi-Omics Integration]]>Previously, I gave an introduction to OMICs integration and covered the supervised PLS-DA model. Today we will consider unsupervised OMICs integration which is especially handy when no class labels (no clear phenotypes) are available, but we still would like to understand heterogeneity between the samples in our data by combining

]]>
raevskymichail.github.io//unsupervised-omics-integration/5fbd390551732c0059dfeeefTue, 24 Nov 2020 17:00:19 GMT

Previously, I gave an introduction to OMICs integration and covered the supervised PLS-DA model. Today we will consider unsupervised OMICs integration which is especially handy when no class labels (no clear phenotypes) are available, but we still would like to understand heterogeneity between the samples in our data by combining different sources of biological information.

Supervised vs Unsupervised Data Integration

To compare supervised and unsupervised data analyses is like comparing The Beatles and The Velvet Underground music. Both are excellent but in different ways. While supervised modeling dominates in the modern research since it provides slow but guaranteed progress, unsupervised models have always been considered as a “high risk - high gain” approach.

Unsupervised Multi-Omics Integration

The question of Lev Landau (Nobel Prize winner in Physics) which he asked his students struggling to solve mathematical problems: “How can you solve a scientific problem without knowing the answer?” reflects the mentality of supervised hypothesis-driven mainstream research. For instance, if one dares to reply “I do not have any” to the editor’s comment “I do not quite understand your biological hypothesis”, this will most likely lead to a rejection of the paper from a peer-review scientific journal. This is because in the research we tend to constrain ourselves with an initial idea of how the things work and the study to be done is just to prove our hypothesis is true. In contrast, the unsupervised approach is a free your mind approach when you have no idea about what is in your data and therefore open to discovering unexpected phenomena.

Multi-OMICs Factor Analysis (MOFA)

MOFA is a typical hypothesis-free data exploration framework. It allows data integration by extracting common axes of variation from the multiple OMICs layers. Given data matrices from multiple OMICs on the same or overlapping samples, MOFA infers a low-dimensional representation of the data in terms of latent / hidden factors which can be then visualized and interpreted.

Unsupervised Multi-Omics Integration
Idea of unsupervised OMICs integration with Multi-OMICs Factor Analysis (MOFA)

Similar to other integrative OMICs methods such as DISCO, JIVE and O2PLS, MOFA infers if the underlying axes of heterogeneity are unique to individual OMIC or manifested in multiple OMICs. In addition, MOFA provides several useful features including:

  • Visualization of samples in low-dimensional latent space
  • Annotation of factors using (gene set) enrichment analysis
  • Imputation and support of missing and sparse data
  • Support of OMICs with non-Gaussian distributions

The latter two advantages originate from the fact that MOFA is implemented within the Bayesian framework which does not depend on the assumptions of Frequentist statistics such as large amount of data or Gaussian distribution. Importantly, MOFA is a linear integrative OMICs algorithm, a more general non-linear OMICs integration can be performed via e.g. autoencoders. Final remark to mention about MOFA is that it is a Factor Analysis (FA) algorithm, and despite FA visually might look identical to PCA, those two linear dimension reduction techniques should not be confused. While PCA is a pure matrix factorization technique (unless it is a Probabilistic PCA) which splits the total variance into orthogonal Principal Components, FA seeks to construct hidden latent variables that generate the observed data, therefore Factor Analysis is a generative model similar to the Hidden Markov Model (HMM), Gaussian Mixture Models (GMM), Variational Autoencoder (VE), Generative Adversarial Networks (GANs) etc.

Integration of scNMT Multi-OMICs Data Set

In this section, we will apply MOFA to the single cell multi-OMICs scNMT data set which profiles chromatine accessebility (scATACseq), DNA methylation (scBSseq) and gene expression (scRNAseq) from the same biological cells belonging to two types: differentiating mouse embryonic stem cells (ESCs) and embrionic bodies (EBs). We will start with reading and filtering and log-transforming the scRNAseq subset of scNMT:

scRNAseq <- read.delim("scRNAseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep = "\t")
ens2genesymbol <- read.delim("ENSEMBLE_TO_GENE_SYMBOL_MOUSE.txt")
ens2genesymbol <- ens2genesymbol[match(colnames(scRNAseq), as.character(ens2genesymbol$ensembl_gene_id)),]
colnames(scRNAseq) <- ens2genesymbol$external_gene_name
scRNAseq <- as.data.frame(t(scRNAseq))
scRNAseq <- scRNAseq[rowMeans(scRNAseq) >= 1,];
scRNAseq <- log10(scRNAseq + 1)

Next, we are going to read the scBSseq and scATACseq subsets of scNMT and display their distributions. Because the CpG sites can be either methylated or unmethylated, and similarly the sites labelled using GpC methyltransferase can come from open or closed chromatin regions, it is not surprising that the distributions of scBSseq and scATACseq look biomodal indicating the binary nature of the data.

scBSseq <- read.delim("scBSseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep = "\t")
scBSseq <- as.data.frame(t(scBSseq))
hist(rowMeans(scBSseq), breaks = 100, main = "scBSseq")

scATACseq <- read.delim("scATACseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep = "\t")
scATACseq <- as.data.frame(t(scATACseq))
hist(rowMeans(scATACseq), breaks = 100, main = "scATACseq")
Unsupervised Multi-Omics Integration
Histograms of scBSseq and scATACseq subsets of scNMT

Therefore, we are going to model the scBSseq and scATACseq data as coming from Benoulli distribution. We will further make the scBSseq and scATACseq data sets purely binary by encoding values below 50 as 0 and above 50 as 1 and removing low-variance features to avoid redundancy:

library("mixOmics")

scBSseq <- ifelse(scBSseq < 50, 0, 1)
my_nearZeroVar <- nearZeroVar(as.data.frame(t(scBSseq)))
scBSseq <- scBSseq[-which(rownames(scBSseq) %in% rownames(my_nearZeroVar$Metrics)),]

scATACseq <- ifelse(scATACseq < 50, 0, 1)
my_nearZeroVar <- nearZeroVar(as.data.frame(t(scATACseq)))
scATACseq <- scATACseq[-which(rownames(scATACseq) %in% rownames(my_nearZeroVar$Metrics)),]

Further, we will create the MOFA object and set the model parameters such as Bernoulli distributions for scBSseq and scATACseq, and Gaussian distribution (due to the log-transform) for the scRNAseq OMICs:

library("MOFA")

omics <- list(scRNAseq = scRNAseq, scBSseq = scBSseq, scATACseq = scATACseq)
MOFAobject <- createMOFAobject(omics)
plotDataOverview(MOFAobject)

DataOptions <- getDefaultDataOptions()
ModelOptions <- getDefaultModelOptions(MOFAobject)
mydistr <- c("gaussian", "bernoulli", "bernoulli")
names(mydistr) <- c("scRNAseq", "scBSseq", "scATACseq")
ModelOptions$likelihood <- mydistr
ModelOptions$numFactors <- 20

TrainOptions <- getDefaultTrainOptions()
TrainOptions$seed <- 1234

# Automatically drop factors that explain less than 3% of variance in all omics
TrainOptions$DropFactorThreshold <- 0.03
TrainOptions$tolerance <- 0.1;
TrainOptions$maxiter <- 1000
Unsupervised Multi-Omics Integration
MOFA provides a useful visualization of the dimensions of the data

Finally, we are going to prepare the MOFA object and run MOFA. It turns out that MOFA selects only 3 hidden Latent Factors (LFs) that explain more than 3% of variation for the scNMT data set. The main delivery of the MOFa is the great visualization of covariance between the individual OMICs below:

MOFAobject <- prepareMOFA(MOFAobject, DataOptions = DataOptions,
                          ModelOptions = ModelOptions, TrainOptions = TrainOptions)
MOFAobject <- runMOFA(MOFAobject)

r2 <- calculateVarianceExplained(MOFAobject)
plotVarianceExplained(MOFAobject)
Unsupervised Multi-Omics Integration
Visualization of MOFA loadings and heatmap demonstrating separation of ESC and EB cells

It looks like the loading genes provide a nice separation between ESC and EB cells on the heatmap to the right. Finally, we will display the cells in the low-dimensional latent space, this embedding is the result of integration of the 3 OMICs from the scNMT multi-OMICs data set.

controls <- c("EB_P1D12", "EB_P1E12", "EB_P1F12", "EB_P1G12", "EB_P2B12", "EB_P2D12", "EB_P2E12",
              "EB_P2F12", "EB_P2G12", "ESC_B12", "ESC_C12", "ESC_D12")
colnames(scRNAseq) <- ifelse(colnames(scRNAseq) %in% controls, "CONTROL_CELL", colnames(scRNAseq))
cell_types <- matrix(unlist(strsplit(colnames(scRNAseq), "_")), ncol = 2, byrow = TRUE)[, 1]

plotFactorScatter(MOFAobject, factors = 1:2, color_by = as.factor(cell_types))
Unsupervised Multi-Omics Integration
Low-dimensional MOFA representation of scNMT cells

We conclude that MOFA projected the 3 OMICs from scNMT data set into a common latent space where technological differences are omitted and only common variation is taken into account. The integrative approach was able to successfully distinguish between the 3 cell types: ESCs, EBs and Controls.

Summary

In this post, we have learnt that Multi-OMICs Factor Analysis (MOFA) is an elegant Bayesian based algorithm for OMICs integration via extracting the axes of common variation across the modalities. It is particularly useful for data exploration and understanding the interplay between the OMICs. MOFA is insensitive to missing data and allows combining OMICs from different non-Gaussian statistical distributions.

]]>
<![CDATA[Feature Selection for Multi-Omics Integration]]>The next few posts will be dedicated to Integrative OMICs analysis which is a modern challenge for computational biology. I will explain basic preprocessing steps to be performed on the data before merging together various types of biological information.

What is Integrative OMICs?

Next Generation Sequencing (NGS) technologies gave rise

]]>
raevskymichail.github.io//feature-selection-for-multi-omics-integration/5fbd2d5a51732c0059dfedb4Tue, 24 Nov 2020 16:27:47 GMT

The next few posts will be dedicated to Integrative OMICs analysis which is a modern challenge for computational biology. I will explain basic preprocessing steps to be performed on the data before merging together various types of biological information.

What is Integrative OMICs?

Next Generation Sequencing (NGS) technologies gave rise to manifolds of Biological and Biomedical Big Data such as genomics, proteomics, phenomics, transcriptomics, metabolomics, epigenomics, metagenomics, lipidomics etc., which are often referred to as OMICs due to their common suffix. The rapidly growing amount and diversity of data provides new unique opportunities as well as poses a number of analytical challenges. One challenge is that we assume that the OMICs data should have synergistic effects which allows to more accurately model the behavior of biological cells. In this way, the OMICs integration can identify novel biological pathways that are not necessarily distinguishable in the individual OMICs layers. However, currently there is a lack of mathematical methodologies for performing the OMICs integration although promising workflows such as DIABLO mixOmics are developed.

Feature Selection for Multi-Omics Integration

Why to Select Informative Features?

Another challenge is that combining different types of biological information increases the number of analyzed features while keeping the number of statistical observations (samples) constant. Taking into account that even individual OMICs can be very high-dimensional, their integration is difficult without severe dimensionality reduction or regularization which boils down to selection of most informative features while dropping the noisy ones.

Suppose we are interested in explaining variation in our phenotype of interest, Y, e.g. disease status, and we have a matrix X of explanatory variables (genes) which we assume are responsible for the variation in Y. The question is: do all the explanatory variables in X contribute equally to the variation in Y? Due to lots of noise in biological data it is reasonable to assume that some genes are informative and some are noisy. Since we usually have a limited number of statistical observations we can not afford including noisy genes and have to concentrate on the informative genes, i.e. select features before proceeding with modeling.

Feature Selection for Multi-Omics Integration
Feature selection is an important step in biological analysis workflow, image source

For selection we have to test the genes in X against the phenotype Y. However, how should we do it: test all the genes in X together (multivariate feature selection) or one-by-one (univariate feature selection)? Here we are going to compare both ways.

Univariate Feature Selection

To demonstrate univariate vs. multivariate feature selection I will use RNAseq gene expression data from human skeletal muscles from GTEx Human Tussue Gene Expression Consortium V7. The data set includes n=157 samples and for the sake of simplicity and speed I randomly drew p=1000 genes.

Feature Selection for Multi-Omics Integration
The Genotype-Tissue Expression (GTEx) project. Nature Genetics. 29 May 2013. 45(6):580–5.

Here we load the gene expression matrix $X$ and remove lowly expressed genes. The phenotype of interest will be Gender, for simplicity. In other words, we are going to find expression of which genes in human skeletal muscles manifests phenotypic differences between 99 Males and 58 Females in our case. Let us visualize the 157 samples of Males and Females via displaying the PCA plot.

library("mixOmics")
X <- read.table("GTEX_SkeletalMuscles_157Samples_1000Genes.txt",
                header = TRUE, row.names = 1, check.names = FALSE, sep = "\t")
X <- X[, colMeans(X) >= 1]
Y <- read.table("GTEX_SkeletalMuscles_157Samples_Gender.txt",
                header = TRUE, sep = "\t")$GENDER
                
pca.gtex <- pca(X, ncomp = 10)
plot(pca.gtex)
plotIndiv(pca.gtex, group = Y, ind.names = FALSE, legend = TRUE,
          title = 'PCA on GTEX Skeletal Muscles')
Feature Selection for Multi-Omics Integration

The PCA plot demonstrates a lot of variation between the samples within both $PC_{1}$ and $PC_{2}$, but there is no clear segregation of Males and Females based on their skeletal muscle gene expression data. A way to understand which genes are behind the variation between the samples would be to test the correlation of each individual gene against the Gender, in our case this is equivalent to the Differential Gene Expression (DGE) analysis. Here we will use a simple non-parametric Spearman correlation for inferring relation between $X$ and $Y$, we will test all genes in $X$ one-by-one against $Y$, adjust the p-values of correlations for multiple testing using False Discovery Rate (FDR) and sort genes by FDR.

rho <- vector()
p <- vector()
a <- seq(from = 0, to = dim(X)[2], by = 100)

for (i in 1:dim(X)[2]) {
  corr_output <- cor.test(X[, i], as.numeric(Y), method = "spearman")
  rho <- append(rho, as.numeric(corr_output$estimate))
  p <- append(p, as.numeric(corr_output$p.value))
  if (isTRUE(i %in% a) == TRUE) { print(paste("FINISHED ", i, " FEATURES", sep = ""))
}

output <- data.frame(GENE = colnames(X), SPEARMAN_RHO = rho, PVALUE = p)
output$FDR <- p.adjust(output$PVALUE, method = "fdr")
output <- output[order(output$FDR, output$PVALUE, - output$SPEARMAN_RHO), ]
head(output, 10)
Feature Selection for Multi-Omics Integration

We get the genes ranked by their individual association with Gender and can apply the traditional cutoff FDR=0.05 for selecting informative / significant genes. Next, we can combine those selected genes into a predictive score, this seems like an attractive idea. However, in practice this type of prediction based on univariate feature selection works very poorly because it has two problems:

  • Univariate feature selection does not defeat the Curse of Dimensionality issue since FDR correction is not enough for this purpose, i.e. it is prone to overfitting and has a poor generalization.
  • Univariate feature selection does not account for multicollinearity between features, i.e. when features are strongly correlated with each other.

The shortcomings mentioned above can be addressed with the Sparse Linear Models, i.e. models with regularization penalties like LASSO / Ridge / Elastic Net, or Partial Least Squares (PLS) regression or discriminant analysis, basic techniques for Multivariate feature selection.

The simplest way to simultaneously account for all explanatory variables in the X matrix would be to put them together into the multiple or multivariate linear model and perform the Ordinary Least Squares (OLS) minimization:

\[ Y = \beta_{1} X_{1} + \beta_{2} X_{2}  + \epsilon \]

\[ OLS = (y -  \beta_{1} X_{1} + \beta_{2} X_{2})^2 \]

Here for simplicity we used only two predictors $X_{1}$ and $X_{2}$, but there can be thousands and millions of them. It implies that in order to minimize the $OLS$ cost function we have to work in high-dimensional space which is inherently difficult because of the Curse of Dimensionality. This leads to a very unstable solution of multiple linear regression. To overcome this obstacle we can add a penalty term to the above $OLS$ cost function:

\[ OLS_{penalized} = (y -  \beta_{1} X_{1} + \beta_{2} X_{2})^2  + \lambda [\alpha (\lvert \beta_{1} \rvert + \lvert \beta_{2} \rvert ) + (1 - \alpha) (\beta_{1}^2 + \beta_{2}^2  ) ] \]

Here, $\lambda$ is called Lagrange multiplier and is a measure of how much penalty we would like to put on our linear model, its optimal value is found through cross-validation. The parameter $\alpha$ is usually fixed, but in principle can also be found through cross-validation, the type of regularization is called:

  1. LASSO if \( \alpha=1 \),
  2. Ridge if \( \alpha=0 \)
  3. Elastic Net if \( \alpha=0.5 \).

These penalty methods have a few differences which are good to remember when you select a method for your analysis. LASSO is the most strict penalty and works best on the data with lots of noise. The problem of LASSO is that it can not fully handle multi-collinearity among predictors. If two variables are strongly correlated, LASSO will select only one of them (by chance) and set the coefficient in front of the other one to zero. Sometimes this type of feature selection can be problematic if it happens that the feature that was ignored / omitted has more physical / biological meaning than the one which was selected by LASSO.

This problem can be avoided with Ridge penalty, in addition Ridge is much more stable for numerical minimization as it provides a fully convex manifold in the high-dimensional space. However, in ultra high-dimensional spaces Ridge can be too permissive and select many noisy features which might not be desirable. Elastic Net penalty provides a compromise between LASSO and Ridge and is generally preferred and recommended by Machine Learning practitioners. Let us use LASSO, perform cross-validation an check which genes were selected:

library("glmnet")

lasso_fit <- cv.glmnet(as.matrix(X), Y, family = "binomial", alpha = 1)
plot(lasso_fit)
coef <- predict(lasso_fit, s = "lambda.min", type = "nonzero")
colnames(X)[unlist(coef)]

result <- data.frame(GENE = names(as.matrix(coef(lasso_fit, s = "lambda.min"))
                                 [as.matrix(coef(lasso_fit, s = "lambda.min"))[, 1] != 0, 1])[-1],
                     SCORE = as.numeric(as.matrix(coef(lasso_fit, s = "lambda.min"))
                                       [as.matrix(coef(lasso_fit, s = "lambda.min"))[, 1] != 0, 1])[-1])
result <- result[order(-abs(result$SCORE)), ]
print(head(result, 10))
Feature Selection for Multi-Omics Integration
Feature Selection for Multi-Omics Integration

As we can see, LASSO selected approximately 30 features as informative and ranked them by its internal score. The ranking looks quite different compared to the univariate feature selection even though the gene MAP7D2 appears on the top of both feature selection methods.

Another elegant multivariate feature selection technique is the Partial Least Squares (PLS) regression and discriminant analysis which is also called (by its author) Projection on Latent Structures (PLS). The idea behind PLS is that it performs feature selection via maximizing the covariance between $X$ and $Y$:

\[ \max_{\beta} cov(X, Y) => \hat{\beta} \]

Let us perform the PLS discriminant analysis (PLS-DA) on the gene expression matrix X using the gender Y for supervision. It can be thought as multivariate selection of genes that provide largest separation between Males and Females.

library("mixOmics")

gtex.plsda <- plsda(X, Y, ncomp = 2)
background <- background.predict(gtex.plsda, comp.predicted = 2, dist = "max.dist")

plotIndiv(gtex.plsda, comp = 1:2, group = Y, ind.names = FALSE, ellipse = TRUE,
          legend = TRUE, title = 'PLSDA on GTEX Skeletal Muscles', background = background)
          
plotLoadings(gtex.plsda, comp = 1, title = 'Loadings on comp 1', contrib = 'max',
             method = 'median', ndisplay = 10, size.name = 0.6)

plotLoadings(gtex.plsda, comp = 2, title = 'Loadings on comp 2', contrib = 'max',
             method = 'median', ndisplay = 10, size.name = 0.6)
Feature Selection for Multi-Omics Integration

Here we can clearly see the separation between Males and Females, compare this figure with the PCA plot above where we did not see any segregation. To see which genes provide the gender segregation we display the PLS loadings:

Feature Selection for Multi-Omics Integration
Feature Selection for Multi-Omics Integration

Again, we conclude that the multivariate feature selection via PLS provided a set of genes which looks quite different from the univarite feature selection. Further we are going to use the sets of genes selected in multivariate fashion for integrating different OMICs layers.

Summary

In this article, we have learnt that OMICs integration provides a promising next step for accurate modeling of biological processes. However, a feature pre-selection often has to be done prior to the integration to overcome the Curse of Dimensionality. Univariate feature selection is simple, however it suffers from poor generalizability and does not account for multicollinearity. Multivariate feature selection methods such as LASSO, Ridge, Elastic Net and PLS are preferable for working with high-dimensional biological data.

]]>
<![CDATA[No True Effects in High Dimensions]]>In this post I will demonstrate how analytical modelling fails reconstructing true relations between variables when increasing dimensionality of the data.

Multivariate Analysis in High Dimensions

Consider a simple linear model $Y \sim X$, where $Y$ is the response and $X$ is the explanatory (predictor) variable. $Y$ can be a

]]>
raevskymichail.github.io//no-true-effects-in-high-dimensions/5fbd246a51732c0059dfece5Tue, 24 Nov 2020 15:50:50 GMT

In this post I will demonstrate how analytical modelling fails reconstructing true relations between variables when increasing dimensionality of the data.

Multivariate Analysis in High Dimensions

Consider a simple linear model $Y \sim X$, where $Y$ is the response and $X$ is the explanatory (predictor) variable. $Y$ can be a phenotype vector, e.g. disease status (sick / healthy) for each of n individuals / samples, while $X$ is a matrix with $n$ samples and $p$ features / dimensions. For simplicity, both $X$ and $Y$ are drawn from normal distribution. Assume low dimensional case $ n >> p $:

set.seed(123)
n <- 20             # number of samples 
p <- 2              # number of features / dimensions 
Y <- rnorm(n)
X <- matrix(rnorm(n*p), n, p)
summary(lm(Y~X))
No True Effects in High Dimensions

The result looks good, $Y$ and $X$ variables are not related as expected since they are drawn from the normal distribution. The math works without problems as long as $n > p$. Let us now increase the number of features / dimensions $p$ and see what happens.

set.seed(123456)
n <- 20
p <- 20
Y <- rnorm(n)
X <- matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
No True Effects in High Dimensions

Intriguing! R refuses calculating multivariate linear model when $p = n$ despite we have only 20 dimensions. The effect we observe here is called overfitting which can be thought as one manifestation of the Curse of Dimensionality. Overfitting implies that we can not use as many fitting parameters as we want but should select their optimal number based on some additional criterion, e.g. Akaike Information Criterion (AIC), or perform the cross-validation as the number of predictors is essentially a hyperparameter of the multivariate linear model.

So we had R throwing a “singularity” error for just 20 dimensions, in fact we typically have tens of thousands (RNAseq) and even millions (genomics, methylation) dimensions in computational biology with only n=100–1000 samples. This is the main obstacle for multivariate analysis in Life Sciences and one reason why the univariate feature selection (association studies, differential gene expression, differential methylation) is often preferred over the multivariate feature selection despite the higher predictive power of the latter. Another consequence of the typical biological setup, $n << p$, is the necessity to pre-select informative features with Lasso/Ridge/Elastic Net and perform dimensionality reduction.

Adjusting for Covariates Hides True Effects

Now imagine that $X$ is not a matrix but vector, for example a genotype vector, so we are performing an association study phenotype $Y$ vs. genotype $X$. Often I observe attempts to account for additional factors (covariates) which might confound this association. Indeed, if individuals come from different human populations such as African, European, Asian, this heterogeneity may explain a significant percent of variation in the phenotype $Y$ while we are interested in the phenotypic variation explained by the genotypes, in other words the population heterogeneity is a confounding factor which should be adjusted for in the model $Y \sim X$. Typically, the Principal Component Analysis (PCA) is performed on all available genetic variants and a handful of $PCs$ is selected as covariates for the association:

\[ Y \sim X + PC_{1} + PC_{2} + …+ PC_{n} \]

In this case the number of covariates is selected somehow arbitrary based on common sense, while here I would like to demonstrate that this number might be crucial for recovering true effects of phenotype vs. genotype associations.

Let us simulate a linear relation between $X$ and $Y$ as:

\[ Y=2*X + N(\mu, \sigma) \],

where $N(\mu, \sigma)$ is a noise distribution. Suppose we have a linear model $Y \sim X$, let us adjust this model for gradually increasing number of PCs slightly correlated with $Y$, i.e. \(PC_{1}=0.9*Y\); \(PC_{2}=0.8*Y\) etc.:

set.seed(12345)

# Simulate data y~x with true effect 2 of x
N <- 100
x <- rnorm(N)
y <- 2 * x + rnorm(N)
df <- data.frame(x, y)

# Add PCs to the linear model
for (i in 1:10) {
  df[, paste0("PC", i)] <- 1 * (1 - i / 10) * y + rnorm(N)
}

# Perform linear regression gradually increasing dimensionality
effect <- vector()
effect <- append(effect, 2)
adj_r_squared <- vector()
adj_r_squared <- append(adj_r_squared, summary(lm(y ~ x, data = df))$adj.r.squared)

for (i in 1:10) {
  formula <- as.formula(paste0("y~x+", paste0("PC", seq(1:i), collapse = "+")))
  effect <- append(effect, summary(lm(formula, data = df))$coefficients[2, 1])
  adj_r_squared <- append(adj_r_squared, summary(lm(formula, data = df))$adj.r.squared)
}

plot(effect ~ seq(from = 0, to = 10, by = 1), type = 'o',
     xlab = "PRINCIPAL COMPONENTS", ylab = "EFFECT")
plot(adj_r_squared ~ seq(from = 0, to = 10, by = 1), type = 'o',
     xlab = "PRINCIPAL COMPONENTS", ylab = "VARIANCE EXPLAINED BY MODEL")
No True Effects in High Dimensions
No True Effects in High Dimensions

What we observe here is that the true effect 2 of association between $Y$ and $X$ becomes more and more difficult to recover when adding $PCs$ correlated with $Y$. Despite we still have $n > p$, when 10 $PCs$ are added, the recovered effect is approximately 0.5 which is far from the true effect 2. In addition, we observe that the adjusted R squared (variance explained) reaches almost 1 implying we explain essentially all the variance in $Y$ by including only 10 covariates.

In the code above please note the coefficient 1 in front of (1-$i$/10). Let us call this a covariation coefficient between $Y$ and $PCs$. We have just seen that for the case when $Covariation=1$, the true effect 2 can be falsely calculated as 0.5. What happens now if we increase the covariation between $Y$ and $PCs$, i.e. for $Covariation=2$ and $3$?

effect_master <- list()

for (k in 1:3) {
  for (i in 1:10) {
    df[, paste0("PC", i)] <- k * (1 - i / 10) * y + rnorm(N)
  }

  effect <- vector()
  effect <- append(effect, 2)
  for (i in 1:10) {
    formula <- as.formula(paste0("y~x+", paste0("PC", seq(1:i), collapse = "+")))
    effect <- append(effect, summary(lm(formula, data = df))$coefficients[2, 1])
  }
  effect_master[[k]] <- effect
}

plot(effect_master[[1]] ~ seq(from = 0, to = 10, by = 1), type = 'o',
     xlab = "PRINCIPAL COMPONENTS", ylab = "EFFECT", xlim = c(0, 10), ylim = c(0, 2), col = "blue",
     main = "Effect of x vs. y at different covariation between y and PCs")

points(effect_master[[2]] ~ seq(from = 0, to = 10, by = 1), col = "green")
lines(effect_master[[2]] ~ seq(from = 0, to = 10, by = 1), col = "green")
points(effect_master[[3]] ~ seq(from = 0, to = 10, by = 1), col = "red")
lines(effect_master[[3]] ~ seq(from = 0, to = 10, by = 1), col = "red")
legend("topright", c("1", "2", "3"), title = "Covariation", fill = c("blue", "green", "red"),
       inset = 0.02)
No True Effects in High Dimensions

We observe that the more $Y$ is correlated with the covariates $PCs$, the harder it is to reconstruct the true effect. When $Covariation=3$, the reconstructed effect is essentially zero which implies that there is no association between $Y$ and $X$ despite we have simulated this effect to be 2. Therefore, even though it can make a lot of biological sense to adjust your linear model for covariates, this should be performed with caution since the math might not stand it due to the Curse of Dimensionality.

Summary

In this article, we have learnt that multivariate analysis blows up when the number of features / dimensions approaches the number of samples due to the Curse of Dimensionality. In addition, performing association studies you should pay particular attention to the number of covariates your linear model is adjusted for. Too many covariates might blur the true effects and make their reconstruction impossible.

]]>
<![CDATA[Clustering in High Dimensions]]>The next step after the dimensional reduction has been applied to your single- cell data is to do cluster analysis for detecting boundaries between cell populations.

However, clustering suffers from both the limitations of the algorithms and the, so called, Curse of Dimensionality, which often results in a contradiction between

]]>
raevskymichail.github.io//clustering-in-high-dimensions/5fbd08068cea5b024e5b950bTue, 24 Nov 2020 13:51:03 GMT

The next step after the dimensional reduction has been applied to your single- cell data is to do cluster analysis for detecting boundaries between cell populations.

However, clustering suffers from both the limitations of the algorithms and the, so called, Curse of Dimensionality, which often results in a contradiction between the visual interpretation of the tSNE plot and the cluster analysis output. In this article, I propose an automated way of reaching an agreement between dimensionality reduction and clustering for scRNAseq data analysis.

Artifacts of Cluster Analysis

One problem with clustering scRNAseq is the discrepancy between what we see in a tSNE plot and what cluster analysis reports in terms of number of clusters and cell assignment to clusters. For example, we clearly observe three clusters in the figure below and can even manually draw boundaries between the clusters. However, running a clustering algorithm might result in artifacts such as wrong number of clusters or wrong cell assignment to clusters, i.e. when a uniform cluster happens to be split into pieces by the algorithm.

Clustering in High Dimensions
Results of clustering contradict our visual interpretation, image source

Moreover, trying to reproduce results of scientific papers might be confusing due to the lack of robustness of clustering algorithms. For example, using the data from Kolodziejczyk et al., Cell Stem Cell 2015, eight clusters are visible in the tSNE plot, however the clustering algorithm used in the paper seems to detect only three clusters.

Clustering in High Dimensions
Contradiction between tSNE (left) and clustering (right) from Kolodziejczyk et al., Cell Stem Cell 2015

The contradiction between the dimensionality reduction and clustering has a dual nature. On the one hand, it is notoriously difficult to define a distance between data points in high-dimensional scRNAseq space due to the Curse of Dimensionality; one the other hand, clustering algorithms often use idealistic assumptions which do not hold for the real world data.

Euclidean distance breaks in High Dimensions

Mathematics of High Dimensions is an active area of research. There are many weird phenomena arising in high-dimensional space. One of them is that the distance between the data points and the origin of the coordinate system grows as a square root of the number of dimensions D. This can be seen as the data points deplete the center and concentrate in the shell of the n-dimensional ball at large D.

Clustering in High Dimensions
Data points occupy the surface and deplete the center of the n-ball in high dimensions, image source

Consequently, the mean distance between data points diverges and looses its meaning which in turn leads to the divergence of the Euclidean distance, the most common distance used for clustering. Manhattan distance is a better choice for scRNAseq, however it does not fully help in high dimensions either.

Assumptions and limitations of clustering methods

Despite the Curse of Dimensionality is the major obstacle for cluster analysis of scRNAseq, many clustering algorithms might perform poorly even in low dimensions due to their internal assumptions and limitations. All clustering methods can roughly be divided into four groups:

  1. Hierarchical clustering
  2. Centroid-based clustering
  3. Graph-based clustering
  4. Density-based clustering

Python library Scikit-learn provides a collection of clustering methods with an excellent overview which emphasizes their advantages and disadvantages:

Clustering in High Dimensions
Clustering methods overview at scikit-learn Python library

Hierarchical (agglomerative) clustering is too sensitive to noise in the data. Centroid-based clustering (K-means, Gaussian Mixture Models) can handle only clusters with spherical or ellipsoidal symmetry.

Graph-based clustering (Spectral, SNN-cliq, Seurat) is perhaps most robust for high-dimensional data as it uses the distance on a graph, e.g. the number of shared neighbors, which is more meaningful in high dimensions compared to the Euclidean distance.

Clustering in High Dimensions
Graph-based clustering uses distance on a graph: A and F have 3 shared neighbors, image source

owever, to build the graph this method still uses the Euclidean distance. In addition, the number of clusters has to be implicitly specified a-priori via the “resolution” hyperparameters. Changing the hyperparameters can easily result in fewer or more clusters which is somehow arbitrary and hence quite unsatisfactory as there is no obvious way to define an objective function for automated tuning of the hyperparameters.

Out of all clustering algorithms, only Density-based (Mean-Shift, DBSCAN, OPTICS, HDBSCAN) allows clustering without specifying the number of clusters. The algorithms work via sliding windows moving toward the high density of points, i.e. they find however many dense regions are present.

Clustering in High Dimensions
DBSCAN clustering finds dense regions in the data, image source

In addition, these algorithms are cluster shape independent and capture any topology of scRNAseq data. Below I will show how we can easily tune hyperparameters of the algorithm via minimization of an objective function.

How to tune hyperparameters of HDBSCAN

Clustering is an unsupervised learning problem, meaning we do not know the ground truth (number of clusters), and can not use cross-validation for optimizing hyperparameters of an algorithm. Nevertheless, there is a way to automatically optimize hyperparameters of HDBSCAN.

HDBSCAN, i.e. Hierarchical DBSCAN, is a powerful density-based clustering algorithm which is: 1) indifferent to the shape of clusters, 2) does not require the number of clusters to be specified, 3) robust with respect to clusters with different density. Further, HBDSCAN is very attractive because it has only one hyperparameter minPts which is the minimal number of points in a cluster. It is relatively fast for large data sets, detects outlying cells, and for each cell it reports a probability of assignment to a cluster. The fraction of cells with a low probability of assignment to a cluster can be used as an objective function for optimizing  minPts which in turn gives the optimal number of clusters.

Below, I will continue using the Cancer Associated Fibroblasts (CAFs) from the previous post where we found optimal perplexity (optPerp) and number of Principal Components (optPC). Now we will run HDBSCAN on the tSNE dimensionality reduction for different minimal sizes of clusters, i.e. minPts ranging from 3 to N_pt=50. For each minPts, we calculate the Score function as a fraction of cells with low confidence (probability < 5%) assignment to a cluster, by minimization of the Score function we want to reduce the number of unassigned data points. Due to internal stochasticity, tSNE changes from run to run, therefore for robustness we repeat the clustering N_iter=50 times and average the results. Plotting the Score function vs. minPts reveals a clear minimum for a certain minimal cluster size, minPts.

library("Rtsne"); library("dbscan"); library("foreach")
library("doParallel"); library("parallel")
N_iter <- 50; N_pt <- 50
score <- vector(length = N_pt-2)
for(i in 3:N_pt)
{
  score_iter <- foreach (1:N_iter, .combine = c) %dopar% 
  {
    tsne_iter <- Rtsne(t(log10(expr+1)), initial_dims=optPC,
                       perplexity=optPerp, max_iter=10000)
    res <- hdbscan(tsne_iter$Y, minPts = i)
    score_iter_temp <- sum(res$membership_prob < 0.05) / N_cells
    return(score_iter_temp)
  }
  score[i-2] <- mean(score_iter, na.rm = TRUE)
}
plot(log(score + 1) ~ seq(from=3, to=N_pt, by=1), type='o', 
     xlab="MIN SIZE OF CLUSTERS", ylab="LOG ( SCORE )")
names(score) <- seq(from=3, to=N_pt, by=1)
opt_score <- min(as.numeric(names(score)[score==min(score)]))
mtext(paste0("OPTIMAL MIN SIZE OF CLUSTERS = ",opt_score))
Clustering in High Dimensions

Now after we have found the optimal minimal size of clusters for our data set, minPts, we will use this value for the final HDBSCAN run on tSNE, this gives us the number of clusters, assignments of cells to the clusters, as well as cells which can not be confidently assigned to any cluster, i.e. outlying cells. The latter is an advantage and a peculiarity of density-based clustering compared to other algorithms, however here for simplicity I classified the outlying cells to their nearest neighbor clusters using the KNN classifier. Note, the tSNE was run N_tsne=50 times, and the minimal Kullback-Leibler divergence plot was selected as the final plot. Below, I show this for a few scRNAseq data sets.

library("class"); library("pals")
N_tsne <- 50
tsne_out <- list(length = N_tsne)
KL <- vector(length = N_tsne)
for(k in 1:N_tsne)
{
  tsne_out[[k]] <- Rtsne(t(log10(expr+1)),initial_dims=optPC,perplexity=optPerp,
                         max_iter=10000)
  KL[k] <- tail(tsne_out[[k]]$itercosts, 1)
}
names(KL) <- c(1:N_tsne)
opt_tsne <- tsne_out[[as.numeric(names(KL)[KL==min(KL)])]]$Y
res_opt <- hdbscan(opt_tsne, minPts = opt_score)
if(length(res_opt$cluster[res_opt$cluster==0]) > 0 & 
   length(res_opt$cluster[res_opt$cluster==1]) > 0)
{
  res_opt$cluster[res_opt$cluster==0]<-class::knn(
    opt_tsne[which(res_opt$cluster!=0),],opt_tsne[which(res_opt$cluster==0),],
    res_opt$cluster[res_opt$cluster!=0], k=3)
}
print(res_opt)
N_clust<-length(sort(unique(res_opt$cluster)))
if(N_clust <= 25){colors <- cols25(N_clust)}else{colors <- rainbow(N_clust)}
names(colors) <- sort(unique(res_opt$cluster))
plot(opt_tsne,main=paste0("TSNE PLOT: ",file_name),
     col=colors[as.character(res_opt$cluster)], xlab="tSNE1", ylab="tSNE2")
mtext(paste0("OPTIMAL NUMBER OF CLUSTERS = ", N_clust))
Clustering in High Dimensions
Results of HDBSCAN clustering on tSNE

The results of applying HDBSCAN to various scRNAseq data sets do not look too bad given that these plots were obtained by providing the raw expression matrix only that is without manual tweaking of the clustering parameters.

Clustering on tSNE, PCs and raw expression matrix

Running clustering on a raw expression matrix can be very challenging due to the Curse of Dimensionality. Therefore it is almost always recommended to perform any sort of dimensionality reduction before proceeding with cluster analysis. Here I compare performance of 9 popular clustering algorithms on the CAFs data set: HDBSCAN (described above), Kmeans, Gaussian Mixture Models (GMM), Hierarchical clustering, Spectral clustering, SC3, Bootstrap Consensus Clustering, SNN-cliq and Seurat. Indeed, running the clustering algorithms on the raw CAFs expression matrix brings unsatisfactory results:

Clustering in High Dimensions
Clustering on raw expression matrix

While all the algorithms fail clustering on the raw expression matrix, only SC3, which performs consensus clustering across different algorithms and distance metrics, provided surprisingly good cell assignment. Now we will apply the clustering algorithms to the significant PCs found by permuting expression matrix in the previous post.

Clustering in High Dimensions

Now, SC3 did not perform so well, HDBSCAN was not perfect either. Instead, Seurat graph-based community detection and hierarchical clustering seem to perform the best. Finally, let us apply the 9 clustering algorithms to the tSNE reduced dimension representation of the scRNAseq expression data:

Clustering in High Dimensions

Despite tSNE plot is a 2D dimensionality reduction, many algorithms such as K-means, Gaussian Mixture Models (GMM), Hierarchical clustering, Spectral clustering, Bootsrap Consensus clustering and SC3 fail to correctly assign the cells to their clusters. HDBSCAN and graph-based clustering methods (Seurat and SNN-cliq) seem to perform the best. However, the latter needed a manual hyperparameters tweaking which was not necessary for HDBSCAN due to the automated optimization procedure described above.

Summary

In this article, we have learnt that clustering of high-dimensional scRNAseq data is challenging due to the Curse of Dimensionality and limitations of the clustering methods. Frustratingly, most of clustering algorithms require the number of clusters to be specified a-priori which is hard to optimize due to inapplicability of cross-validation for clustering. However, the HDBSCAN stands out here as an algorithm with only one hyperparameter which is easy enough to optimize via minimization of the amount of unassigned cells.gi

]]>
<![CDATA[Dimensional Reduction for Single Cell Data]]>Here we are going to talk about dimension reduction techniques applied to single cell genomics data. Among others, we are going to compare linear and non-linear dimension reduction techniques in order to understand why tSNE and UMAP became golden standard for Single Cell biology.

Non-Linear Structure of Single Cell Data

]]>
raevskymichail.github.io//dimensional-reduction-for-single-cell-data/5fbd03d68cea5b024e5b94bdTue, 24 Nov 2020 13:14:25 GMT

Here we are going to talk about dimension reduction techniques applied to single cell genomics data. Among others, we are going to compare linear and non-linear dimension reduction techniques in order to understand why tSNE and UMAP became golden standard for Single Cell biology.

Non-Linear Structure of Single Cell Data

Single cell genomics is a high dimensional data with approximately 20 000 dimensions corresponding to the protein coding genes. Usually not all of the genes are equally important for the cellular function, i.e. there are redundant genes which can be eliminated from the analysis in order to simplify the data complexity and overcome the Curse of Dimensionality. To exclude redundant genes and project the data onto a latent space where the hidden stricture of the data becomes transparent is the goal of dimension reduction.

Dimensional Reduction for Single Cell Data
Path from raw expression matrix to latent features: from Wang et al., Nature Methods 14, p. 414–416 (2017)

Principal Component Analysis (PCA) is a basic linear dimension reduction technique which has been widely recognized to be inappropriate for the single cell data due to their highly non-linear structure. Intuitively the non-linearity comes from the large fraction of stochastic zeros in the expression matrix due to the dropout effect. Typically, single cell data have 60–80% zero elements in the expression matrix. In this way, single cell data are similar to image data where e.g. for the images of hand-written digits MNIST data set we have 86% of pixels having zero intensity.

Dimensional Reduction for Single Cell Data
tSNE non-linear dimensionality reduction for MNIST hand-written digits data set, image source

This large percentage of stochastic zeros in the expression matrix makes the data behave similar to the non-linear Heaviside step function, i.e. the gene expression is either zero or non-zero regardless of how much “non-zero”. Let us demonstrate basic linear and non-linear dimension reduction techniques applied to the Cancer Associated Fibroblasts (CAFs) single cell data set.

Linear Dimension Reduction Techniques

We are going to start with the linear dimension reduction techniques. Here we read and log-transform the single cell CAFs expression data, the latter can be viewed as a way to normalize the data. As usually, the columns are genes and the rows are cells, the last column corresponds to the IDs of the 4 populations (clusters) found previously in the CAFs data set.

import numpy as np
import pandas as pd
expr = pd.read_csv('CAFs.txt',sep='\t')
print("\n" + "Dimensions of input file: " + str(expr.shape) + "\n")
print("\n" + "Last column corresponds to cluster assignments: " + "\n")
print(expr.iloc[0:4, (expr.shape[1]-4):expr.shape[1]])
X = expr.values[:,0:(expr.shape[1]-1)]
Y = expr.values[:,expr.shape[1]-1]
X = np.log(X + 1)

Now we are going to apply the most popular linear dimensionality reduction techniques using the scikit-learn manifold learning library. Check here for the complete script, here for simplicity I show only the codes for the Linear Discriminant Analysis (LDA):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

model = LinearDiscriminantAnalysis(n_components = 2, priors = None,
                                   shrinkage = 'auto', 
                                   solver = 'eigen',
                                   store_covariance = False,
                                   tol = 0.0001)

lda = model.fit_transform(X, Y)

plt.scatter(lda[:, 0], lda[:, 1], c = Y, cmap = 'viridis', s = 1)
plt.xlabel("LDA1", fontsize = 20); plt.ylabel("LDA2", fontsize = 20)
feature_importances = pd.DataFrame({'Gene':np.array(expr.columns)[:-1], 
                                    'Score':abs(model.coef_[0])})
print(feature_importances.sort_values('Score', ascending = False).head(20))
Dimensional Reduction for Single Cell Data
Linear dimensionality reduction techniques applied to the CAFs single cell data set

What we can immediately conclude is that the linear dimensionality reduction techniques can not fully resolve the heterogeneity in the single cell data. For instance, the small yellow cluster does not seem to be distinct from the other 3 cell clusters. The only exception is the LDA plot, however this is not a truly unsupervised learning method as in contrast to the rest it uses the cell labels in order to construct the latent features where the classes are most separable from each other. Thus the linear dimension reduction techniques are good at preserving the global structure of the data (connections between all the data points) while it seems that for single cell data it is more important to keep the local structure of the data (connections between neighboring points).

Dimensional Reduction for Single Cell Data
MDS tries to preserve global stricture while LLE preserves local structure of the data, image source

It is really fascinating to observe how the data is converted into the latent low-dimensional representation, and each of the dimension reduction techniques deserves its own article. It is interesting to think about why basically each of the techniques is applicable in one particular research area and not common in other areas. For example, Independent Component Analysis (ICA) is used in signal processing, Non-Negative Matrix Factorization (NMF) is popular for text mining, Non-Metric Multi-Dimensional Scaling (NMDS) is very common in Metagenomics analysis etc., but it is rare to see e.g. NMF to be used for RNA sequencing data analysis.

Non-Linear Dimension Reduction Techniques

We turn to the non-linear dimension reduction techniques and check whether they are capable of resolving all hidden structure in the single cell data. Again, complete codes can be found here, for simplicity I show below only the code for the tSNE algorithm which is known to have difficulty handling really high dimensional data and hence is common to be run on 20–50 pre-reduced dimensions with e.g. PCA.

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 30).fit_transform(X)
model = TSNE(learning_rate = 10, n_components = 2, random_state = 123, perplexity = 30)
tsne = model.fit_transform(X_reduced)
plt.scatter(tsne[:, 0], tsne[:, 1], c = Y, cmap = 'viridis', s = 1)
plt.title('tSNE on PCA', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20)
plt.ylabel("tSNE2", fontsize = 20)
Dimensional Reduction for Single Cell Data
Non-linear dimensionality reduction techniques applied to the CAFs single cell data set

Here we see that most of the non-linear dimensionality reduction techniques have successfully resolved the small yellow cluster via preserving the local structure (connections between neighboring points) of the data. For instance, the Locally Linear Embedding (LLE) approximates the non-linear manifold by a collection of locally linear planar surfaces which happens to be crucial for reconstructing the rare cell populations such as the small yellow cluster. The best visualization though comes from tSNE and UMAP. The former captures the local structure as well as LLE, and seems to work on raw expression data in this particular case, however a pre-dimension reduction with PCA provides typically more condensed and distinct clusters. In contrast, UMAP preserves both the local and the global data structure and is based on the topological data analysis where a manifold is represented as a collection of elementary Simplex units, this ensures that the distances between the data points do not look all equal in the high-dimensional manifold but there is a distribution of distances similar to the 2D space. In other words, UMAP is an elegant attempt to beat the Curse of Dimensionality.

Summary

In this post, we have learnt that single cell genomics data have a non-linear structure which comes from the large proportion of stochastic zeros in the expression matrix due to the drop out effect. The linear manifold learning techniques preserve the global structure of the data and are not capable of fully resolving all cell populations present. In contrast, preserving the local connectivity between the data points (LLE, tSNE, UMAP) is the key factor for a successful dimensionality reduction of single cell genomics data.

]]>
<![CDATA[How to Batch Correct Single Cell Data]]>In my previous post, I started covering methods of removing technical variation in scRNAseq data by comparing normalization strategies. Here I am going to expand on the topic and talk about correcting systematic differences between same type of cells sequenced by different technologies, labs, protocols etc, i.e. batch-effects detection

]]>
raevskymichail.github.io//how-to-batch-correct-single-cell-data/5fbcfa648cea5b024e5b9446Tue, 24 Nov 2020 12:41:33 GMT

In my previous post, I started covering methods of removing technical variation in scRNAseq data by comparing normalization strategies. Here I am going to expand on the topic and talk about correcting systematic differences between same type of cells sequenced by different technologies, labs, protocols etc, i.e. batch-effects detection and elimination.

How to Detect Batch-Effects?

Batch-effects have a danger to confound the true biological signal, which is the discovery of the cell types present in a sample for the case of scRNAseq. Batches can originate from: different dates of sequencing, people doing the sequencing, flow-cells / plates, chemistry / protocol, lanes, read length, labs produced the data or even different organisms as the extreme case scenario.

How to Batch Correct Single Cell Data
Human vs. mouse (left ) as well as stimulated vs. control (right) same type cells form distinct clusters, from Butler et al., Nat. Biot. 36, 411–420 (2018)

Batch-effects can be genome-wide, i.e. present for majority of genes, or gene-specific, i.e. certain genes happen to be influenced by the batch. The former is very easy to see from the dimension reduction plots like PCA or tSNE. Here for demonstration purposes I will continue using the innate lymphoid cells (ILC) scRNAseq data set from Å. Björklund et al., Nature Immunology 17, p. 451–460 (2016). Below I compute the PCA and tSNE dimensionality reduction plots for the ILC cells from three donors.

How to Batch Correct Single Cell Data

The donor-specific batch is clearly visible from the plots, i.e. the cells from the same donor tend to cluster together and distinctly from the cells coming from other donors. To further quantify, let us display how much of variation in each Principal Component (PC) is explained by the Donor, Plate and Cell Type.

How to Batch Correct Single Cell Data

From the heatmap above it follows that 44% of PC1 is explained by the Donor batch while the Cell Type contribution is negligible, therefore we have a severe genome-wide Donor related batch effect. However, what if we did not observe any clustering by Donor at the dimension reduction plots above, does it mean that there are no batch-effects present in our data set? No it does not. Certain genes might still have their variation to a large extent due to the batch and not the cell type. We can rank all genes influenced by the batch by correlating their expression with the Donor variable. From the left figure below we observe that there aregenes with > 20% of their variation due to the Donor batch. To assess whether this number is a lot or a little we shuffle the batch variable (or alternatively the gene expression) a number of times and calculate thenoise zone, i.e. variation in gene expression due to a random variable. We can also do a formal statistical test and calculate a p-value of significance of deviation from the noise zone (right figure below). In this way, we end up with a list of approximately 200 genes which are significantly influenced by the batch, those genes are good to keep an eye on for the further downstream analysis.

How to Batch Correct Single Cell Data
Ranking the genes by their variance explained by the batch variable and computing the noise zone

How to Correct Batch-Effects?

Once we have detected the batch-effects in our data, we can apply multiple batch correction algorithms available. I do not recommend unsupervised correction similar to SVA for scRNAseq data, because scRNAseq analysis is also unsupervised with not known ground truth (cell types), thus removing the surrogate variables approximating the batch may damage the biological signal. For supervised batch correction, i.e. when the batch is known a-priori, a simple linear correction with the Combat method would be a good start.

How to Batch Correct Single Cell Data

Combat performs batch correction through the Bayesian linear regression and therefore offers a considerable improvement compared to the simple aka Frequentist linear batch removal implemented e.g. in Limma. Besides better performance for small sample sizes, Combat uses the Bayesian shrinkage towards the mean, this implies that correcting each gene we use a “mean” information from other genes while Limma does the correction gene-by-gene independently. Nevertheless, Combat applies a linear batch-effect correction.

Mutual Nearest Neighbors (MNN) algorithm uses a non-linear batch-effect correction inspired by the idea from K-Nearest Neighbors (KNN). The method tries to find most similar cells (mutual neighbors) between the batches, those cells are assumed to belong to the same type and the systematic differences between the MNN cells quantify the strength of the batch, this information is used to scale the rest of the cells in the batches.

How to Batch Correct Single Cell Data
Idea behind Mutual Nearest Neighbor (MNN) batch correction, from Haghverdi et al., Nat. Biot. 36, 2018

My general problem with this method is that I always failed to understand the assumption that the MNN cells should belong to the same type. My intuition is that a batch can completely mess up the concept of distance between the cells in the batches, so two cells that are closest neighbors between the batches can actually belong to different cell types but seem to be very similar because of the batch itself. Nevertheless, the MNN batch-effect correction seems to work well in practice and we will see later that it does a good job on our benchmark.

Another popular scRNAseq specific batch correction method which sometimes is seen as an across samples integration is the Seurat Canonical Correlation Analysis (CCA) technique. The concept of CCA is very similar to the PLS data integration which I described in one of my previous posts. The data from the batches are projected into a low-dimensional space by maximizing correlation (or covariance) between the data sets from different batches. This implies that the projections of the data sets are correlated but not necessarily aligned, i.e. do not overlap quite well in the low-dimensional space. The latter is solved by applying the Dynamic Time Warping (DTW) technique that locally stretches and squeezes the CCA data projections in order to further align them.

How to Batch Correct Single Cell Data
Idea behind Seurat Canonical Correlation Analysis (CCA) batch correction, from Butler et al., Nat. Biot. 36, 2018

While Combat, MNN and Seurat CCA seek to transform the data in order to merge the sub-sets from different batches, SCMAP algorithm tries to project the query cells onto a reference data set, that might be e.g. the Human Cell Atlas, without actually producing a single corrected data set.

How to Batch Correct Single Cell Data
Idea behind SCMAP single cell projection onto a reference data set, from Kiselev et al., Nat. Met. 15, 2018

SCMAP uses a KNN classifier for finding correspondence between tested cells and a presumably much bigger reference data set which was used for training the classifier.

SCMAP uses a KNN classifier for finding correspondence between tested cells and a presumably much bigger reference data set which was used for training the classifier.

Comparing Batch Correction Methods for ILC

How do these methods work in practice? We will use the ILC scRNAseq data set in order to run them and try to remove the Donor related batch effect and thus harmonize the cells between the batches. Skipping the technicalities, here are the final comparison of tSNE plots:

How to Batch Correct Single Cell Data

Looking at the tSNE plots above, I would say that all the methods have done a good job by well mixing the cells from different donors. Following the Occam’s Razor principle, I think the performance of the simplest Combat is satisfactory enough while being super easy to understand and program.

Interestingly, when training the SCMAP classifier on two donors as a reference and projecting the cells from the third donor on the the reference resulted in a poor cell assignment / classification accuracy of 84% taking into account that about the half of the cells could not be reliably assigned to any cluster at all. This might be due to the training set was not big enough or the data did not satisfy the spherical symmetry that is crucial for the KNN classifier.

How to Batch Correct Single Cell Data

Summary

In this post, we have learnt that batch-effects represent a technical variation which might confound the biological signal in scRNaseq analyses. These effects can be detected in genome-wide or gene-specific way and corrected using linear (Combat) or non-linear (MNN, Seurat CCA, SCMAP) algorithms. Overall, the batch correction methods work well on benchmarks suggesting that the simplest (Combat) model should be prioritized.

]]>
<![CDATA[How to Normalize Single Cell Data]]>Here we are going to talk about the challenges of removing technical variation in the exciting single cell genomics area by comparing most popular normalization algorithms.

From Bulk to Single Cell Normalization

It is widely recognized that the massive cell-to-cell gene expression variation in the single cell RNA sequencing (scRNAseq

]]>
raevskymichail.github.io//how-to-normalize-single-cell/5fbcf17b8cea5b024e5b93f1Tue, 24 Nov 2020 11:53:12 GMT

Here we are going to talk about the challenges of removing technical variation in the exciting single cell genomics area by comparing most popular normalization algorithms.

From Bulk to Single Cell Normalization

It is widely recognized that the massive cell-to-cell gene expression variation in the single cell RNA sequencing (scRNAseq) experiments can be to a large extent technical and thus confound the biological variation related to cell types. The sources of technical variation vary from PCR amplification bias to reverse transcription (RT) efficiency and stochasticity of molecular sampling during the preparation and sequencing steps. Therefore normalization across cells has to be applied in order to eliminate the technical variation and keep the biological one. However, common bulk RNAseq methods such as TMM and DESeq are based on constructing per cell size factors which represent ratios where each cell is normalized by a reference cell built by some sort of averaging across all other cells.

How to Normalize Single Cell Data
Bulk normalization (TMM, DESeq) builds a reference sample, from Mar Gonzalez-Porta teaching material

A peculiarity of scRNAseq data is that they contain in contrast to bulk RNAseq large amounts of stochastic zeros owing to low amounts of RNA per cell and imperfect RNA capture efficiency (dropout). Therefore the TMM and DESeq size factors may become unreliably inflated or equal to zero. Therefore new single cell specific normalization methods were developed accounting for the large amounts of zeros in the scRNAseq data.

How to Normalize Single Cell Data
Genes expressed in one cell are missing in another cell due to dropout (left), zero inflated scRNAseq data (right)

A very elegant attempt of accounting for stochastic zeros in the normalization procedure was made in Marioni’s lab with the deconvolution normalization method. The idea behind the algorithm is to represent all cells by a number of pools of cells, for each pool expression values from multiple cells are summed, which results in fewer zeros, and scaled by an average expression across all cells. Constructed in this way per pool size factors can be deconvolved in a linear system of equations yielding the per cell size factors.

How to Normalize Single Cell Data
Pooling across cell, from A. Lun et al., Genome Biology 2017

A disadvantage of such a “global size factor approach” is that different groups of genes should not be normalized with the same constant per cell size factor but it should be adjusted individually for each group of genes such as highly, moderately and lowly expressed. This was taken into account in the SCnorm and more recent Pearson residuals normalization algorithms. Both are based on regressing out the sequencing depth bias for different groups of genes.

How to Normalize Single Cell Data
SCnorm builds per cell per group of genes size factors, from Bacher et al., Nature Methods 2017

Below we will compare different popular normalization strategies using the Innate lymphoid cells (ILC) scRNAseq data from Å. Björklund et al., Nature Immunology 17, p. 451–460 (2016)

Comparing scRNAseq Normalization Methods

The scRNAseq analysis workflow is quite complicated, here I will concentrate on the results but encourage you to check the full code examples here.

First of all, the ILC scRNAseq data set was produced with the SmartSeq2 full-length transcript sequencing technology, and includes very handy spike-ins (control transcripts with constant concentration across cells) for assessing the technological background noise. These spike-ins can be used for detecting a set of genes with variation above the noise level via plotting the Coefficient of Variation (CV=sd/mean) as a function of the mean gene expression for each normalization method:

How to Normalize Single Cell Data
Detecting genes with variation above technological noise using spike-ins

On these plots: one point is one gene, the red curve represents the variation of the spike-ins, the genes above the spike-in curve demonstrate the variation above the technical variation (Highly Variable Genes). One can observe, the number of variable genes varies depending on the normalization method. The RPKMnormalization demonstrates the lowest number of variable genes which is perhaps due to spike-ins short length which is used for normalization in the RPKM. Next, we will examine how the Principal Component Analysis (PCA) plot varies depending on the scRNAseq normalization strategy:

How to Normalize Single Cell Data
PCA plots for different scRNAseq normalization strategies

It is hard to see any difference between the normalization methods, the PCA plots look nearly identical demonstrating the cells from the 4 ILC clusters heavily overlapping. Further, as expected, we can get a much more distinct clustering of the ILC cells using the tSNE plots:

How to Normalize Single Cell Data
tSNE plots for different scRNAseq normalization strategies

However, the way of normalization does not seem to change much the 4 ILC clusters, they are quite clearly visible regardless of the normalization method. Finally, we can compute the size factors for different normalization strategies and check their correlations:

How to Normalize Single Cell Data
Comparing size factors across different normalization strategies

Here for simplicity I averaged the SCnorm size factors across different groups of genes. Surprisingly, the size factors are almost indistinguishable across different normalization methods, even including the bulk RNAseq algorithms TMM and DEseq. However, for other scRNAseq data sets this might not look as good as for ILC.

Looking at the plots above, it feels like the importance of normalization for scRNAseq is slightly overestimated as at least for the SmartSeq2 data set the difference in performance was negligible. If cell populations are distinct enough, i.e. if there is a signal in the data, it does not play a major role which scRNAseq normalization method to choose. Moreover, with the widespread use of UMI-based sequencing protocols, the need for complex normalization algorithm becomes even less pronounced as those protocols are in principle capable of removing the amplification and sequencing depth bias as multiple reads are collapsed into a single UMI count.

How to Normalize Single Cell Data
Major sources of technical variation in scRNAseq that can be improved by UMI-based protocols, images source

Summary

In this post, we have learnt that due to massive amounts of stochastic zeros in scRNAseq data sets, the ratio based traditional bulk RNAseq normalization algorithms are not applicable. Deconvolution and SCnorm represent elegant scRNAseq specific normalization methods. However, comparison of scRNAseq normalization methodologies does not demonstrate pronounced difference in their performance, this might become even less notable with the advances in UMI-based scRNAseq protocols.

]]>
<![CDATA[Python Dictionary and Methods - Explained]]>raevskymichail.github.io//python-dictionaries-methods-3/5fafd68ca19eef00904d0ad6Sat, 14 Nov 2020 13:12:42 GMT❓What is a dictionary❓Python Dictionary and Methods - Explained

A dictionary is an unordered data structure that allows you to store key-value pairs. Here's an example of a Python Dictionary:

Python Dictionary and Methods - Explained

This dictionary uses strings as keys, but the key can be, in principle, any immutable data type. The value of a particular key can be anything. Here's another example of a dictionary where keys are numbers and values are strings:

Python Dictionary and Methods - Explained

An important clarification: if you try to use a mutable data type as a key, you will get an error:

Python Dictionary and Methods - Explained

In fact, the problem is not with mutable data types, but with non-hashable data types, but usually, they are the same thing.

Retrieving data from a dictionary 📖

Square brackets are used to get the value of a particular key []. Suppose we have a pair in our dictionary 'marathon': 26.

Python Dictionary and Methods - Explained

Again, you will get an error if you try to retrieve a value for a non-existent key. There are methods to avoid such mistakes, which we will talk about now.

Adding and updating keys 🔑

Adding new pairs to the dictionary is quite simple:

Python Dictionary and Methods - Explained

Updating existing values is exactly the same:

Python Dictionary and Methods - Explained

Deleting keys ❌

To remove a key and corresponding value from a dictionary one can use del

Python Dictionary and Methods - Explained

🍴 Methods 🍴

Python dictionaries have many different useful methods to help you in your work with them. Here are just a few of them.

👉 Update

The method update() is useful if you need to update several pairs at once. The method takes another dictionary as an argument.

Python Dictionary and Methods - Explained

If you are wondering why the data in the dictionary is not in the order in which it was entered, it is because the dictionaries are not ordered.

👉 Get

Let's suppose we have a dictionary:

Python Dictionary and Methods - Explained

The method get() returns the value for the specified key. If the specified key does not exist, the method will return None.

Python Dictionary and Methods - Explained

The method can be used to check for the presence of keys in the dictionary:

Python Dictionary and Methods - Explained
Get Method for Python Dictionaries

You can also specify a default value that will be returned instead of None if the key is not in the dictionary:

Python Dictionary and Methods - Explained

👉 Pop

The method pop() deletes the key and returns the corresponding value.

Python Dictionary and Methods - Explained

👉 Keys

The method keys() returns a collection of keys in the dictionary.

Python Dictionary and Methods - Explained

👉 Values

The method values() returns the collection of values in the dictionary.

Python Dictionary and Methods - Explained

👉 Items

The method items() returns key-value pairs.

Python Dictionary and Methods - Explained

👉 Iterating through a dictionary

You can iterate over each key in the dictionary.

Python Dictionary and Methods - Explained

Obviously, instead of story_count you can usestory_count.keys().

In the example code below, the loopfor uses the methoditems() to get a key-value pair for each iteration.

Python Dictionary and Methods - Explained

Read More

If you found this article helpful, click the💚 or 👏 button below or share the article on Facebook so your friends can benefit from it too.

]]>