In a new report in Science Advances, Hui Kwon Kim and an interdisciplinary team of researchers from departments of Pharmacology, Electrical and Computer Engineering, Medical Sciences, Nanomedicine and Bioinformatics in the Republic of Korea evaluated the activities of SpCas9, an RNA-guided Cas9 endonuclease (a bacterial enzyme that cuts DNA and is used for genome editing) from Streptococcus pyogenes. They used a high-throughput approach based on a human cell library to measure SpCas9 activity at 12,832 target sequences, then used the data to build a deep learning model that predicts the activity of SpCas9.
The data came from a pool of oligonucleotides (short strands of nucleotides, the building blocks of DNA), each containing a target sequence paired with the corresponding guide sequence that encodes a single-guide RNA (sgRNA), which directs the Cas9 protein to bind and cleave a specific DNA sequence for genome editing. The team trained a deep learning model on this large dataset of SpCas9-induced indel (insertion or deletion) frequencies to develop an SpCas9 activity-predicting model named DeepSpCas9, which is now available online. When the team tested the software against independently generated datasets, the results showed high generalization performance, i.e., the model adapted well to new, previously unseen data.
The CRISPR-Cas prokaryotic adaptive immune system functions as a genome editing tool with translational research potential in a variety of species and cell types, including human cells, where the capacity to accurately predict SpCas9 enzyme activity is important. Researchers had previously developed several computational models to predict SpCas9 activity, based either on datasets of phenotypic changes in gene-edited cells or on medium-sized datasets from plasmid-based library-on-library approaches (plasmids are DNA vehicles that can transfer genes between bacteria and other cells). However, the generalization performance of these models was limited, since the quality and size of the datasets were not ideal. For instance, datasets that infer activity from functional knockouts (the inactivation of a gene) can contain false negatives, because not every insertion or deletion (indel) inactivates the gene. The datasets of SpCas9-induced indel frequencies, meanwhile, were only medium-sized.
Kim et al. had previously reported on a deep learning-based computational model named DeepCpf1 to predict the activity of a different endonuclease (AsCpf1 from Acidaminococcus species) with high generalization performance. For this, they used lentiviral libraries of guide RNA-encoding and target sequence pairs to generate a large training dataset for DeepCpf1. While similar library-based methods had been used to develop computational models that predicted indel frequencies generated by the Cas9 enzyme, a large dataset of Cas9-induced indel frequencies had yet to be generated.
Cas9 activity-predicting computational models with high generalization performance were therefore still needed. In this work, Kim et al. measured SpCas9-induced indel frequencies at tens of thousands of target sequences with a high-throughput approach, modifying the method they had previously developed for DeepCpf1, and used the data to build DeepSpCas9. The DeepSpCas9 web tool is a deep learning-based model that can accurately predict the activities of SpCas9 with high generalization performance.
Kim et al. first prepared a lentiviral library of 15,656 guide RNA (gRNA)-encoding and target sequence pairs for high-throughput assessment of SpCas9 activities (lentiviruses are a group of retroviruses that can integrate foreign DNA into the genome of a host cell). The research team amplified the pool of oligonucleotides containing pairs of guide and target sequences using the polymerase chain reaction (PCR) and cloned them into a lentiviral plasmid (a transgene delivery system that transfers genetic material into cells) using the Gibson DNA assembly technique.
In a two-step approach, the researchers cut the plasmids and inserted the sgRNA scaffold sequence at the cut site to generate plasmid libraries. To subsequently form a cell library, the scientists treated human embryonic kidney cells (HEK293T) with lentivirus generated from the plasmid library. Each cell now contained a synthetic target sequence in its genome and expressed the corresponding sgRNA. The scientists then treated the cell library with the SpCas9-encoding lentivirus to cause sgRNA-directed cleavage and indel formation at the target sequences, with frequencies that depended on the sgRNA activity. To measure the indel frequencies, the scientists PCR-amplified the target sequences and subjected them to deep sequencing. Based on these high-throughput experiments, Kim et al. generated two datasets, one for training and one for testing the DeepSpCas9 model.
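As a rough illustration of the measurement step, the indel frequency at a target can be thought of as the percentage of sequencing reads for that target that carry an insertion or deletion. The sketch below assumes each read has already been assigned to a target and classified as edited or unedited; the function name, the input format and the example reads are hypothetical, and the actual analysis pipeline used by Kim et al. (for example read alignment or background correction) is not described in this article.

```python
from collections import Counter


def indel_frequencies(read_calls):
    """Return the indel frequency (in percent) for each target.

    read_calls is an iterable of (target_id, has_indel) tuples, one per
    sequencing read. This input format is an assumption for illustration.
    """
    totals = Counter()   # number of reads seen per target
    edited = Counter()   # number of reads with an indel per target
    for target_id, has_indel in read_calls:
        totals[target_id] += 1
        if has_indel:
            edited[target_id] += 1
    # Percentage of reads carrying an indel, per target.
    return {t: 100.0 * edited[t] / totals[t] for t in totals}


# Made-up reads for two hypothetical targets.
example_reads = [
    ("target_1", True), ("target_1", False),
    ("target_1", True), ("target_1", False),
    ("target_2", False), ("target_2", False),
]
frequencies = indel_frequencies(example_reads)
print(frequencies)
```

In this toy input, half of the reads for the first target carry an indel and none of the reads for the second target do, so a highly active sgRNA would show up as a high percentage for its paired target.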
The scientists evaluated SpCas9 activities at 124 endogenous target sites with different levels of chromatin accessibility (how open the local chromatin structure is, and therefore how easily proteins can reach the DNA) to test whether the indel frequencies at the integrated synthetic target sequences correlated with those at the corresponding endogenous sites. They observed a strong correlation between indel frequencies at the integrated target sites and at the endogenous locations within the HEK cells.
The research team next used an end-to-end deep learning framework to train a computational model, DeepSpCas9, on the large dataset to predict SpCas9 activity. For the base model architecture, they used a convolutional neural network (CNN, a type of neural network that slides small learned filters along its input to detect local patterns). The input was a 30-nucleotide sequence, which they converted into a four-row binary matrix using one-hot encoding (each position is represented by a column with a 1 in the row of its nucleotide and 0 elsewhere). To assess generalization performance during model selection and training, the team conducted 10-fold cross-validation, using Spearman correlation coefficients between experimental measurements and predicted Cas9 activity levels.
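To make the input representation concrete, here is a minimal sketch of one-hot encoding a 30-nucleotide sequence into a binary matrix with one row per nucleotide. The row order A, C, G, T, the function name and the example sequence are assumptions for illustration; the article does not specify how the authors ordered the rows or implemented the encoding.

```python
NUCLEOTIDES = "ACGT"  # assumed row order; not specified in the article


def one_hot_encode(sequence, length=30):
    """Encode a DNA sequence as a 4 x length binary matrix (list of rows)."""
    seq = sequence.upper()
    if len(seq) != length:
        raise ValueError(f"expected {length} nucleotides, got {len(seq)}")
    matrix = [[0] * length for _ in NUCLEOTIDES]
    for position, base in enumerate(seq):
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected character {base!r} at position {position}")
        # Set the row for this nucleotide to 1 in the current column.
        matrix[NUCLEOTIDES.index(base)][position] = 1
    return matrix


# Made-up 30-nucleotide sequence, not taken from the study.
example_sequence = "ACGT" * 7 + "AC"
encoded = one_hot_encode(example_sequence)
for base, row in zip(NUCLEOTIDES, encoded):
    print(base, "".join(str(bit) for bit in row))
```

Each column contains exactly one 1, so the matrix records which nucleotide sits at each position without implying any numerical ordering between A, C, G and T.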
When they increased the size of the training dataset for cross-validation, the average Spearman correlation coefficient between the experimental indel frequencies and the predicted scores from the DeepSpCas9 model rose steadily, reaching 0.77. The Spearman correlations of the DeepSpCas9 model were significantly higher than those of conventional machine learning algorithms previously used for SpCas9 activity prediction, such as support vector machine (SVM), AdaBoost (adaptive boosting), random forest and gradient-boosted regression trees. Overall, DeepSpCas9 exhibited the best performance among all the models compared.
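The Spearman correlation coefficient used for these comparisons measures how well the ranking of the predicted scores matches the ranking of the measured indel frequencies. The following is a minimal sketch that computes it as the Pearson correlation of the ranks, with tied values sharing their average rank; the function names and the example numbers are hypothetical and are not data from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up measured indel frequencies (%) and model scores.
measured = [5.0, 40.0, 12.5, 80.0, 33.0]
predicted = [0.10, 0.35, 0.30, 0.90, 0.50]
print(round(spearman(measured, predicted), 3))
```

Because only the ranks matter, a model is rewarded for ordering guide RNAs correctly from least to most active, even if its raw scores are on a different scale from the measured frequencies.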
In previous work, Kim et al. considered chromatin accessibility information to improve the prediction of AsCpf1 enzyme activities at endogenous target sites. They sought to determine whether such considerations would also improve SpCas9 activity predictions. The results indicated that fine-tuning with chromatin accessibility information barely improved the accuracy of DeepSpCas9 in predicting indel frequencies at endogenous sites, unlike in their previous efforts with AsCpf1. SpCas9 activity was therefore only slightly affected by chromatin accessibility, in strong contrast to what the team had observed for AsCpf1 when developing DeepCpf1.
To understand the generalization performance of DeepSpCas9, the research team tested the model using sufficiently large, published datasets derived from diverse research studies as test data. They compared the results with those of other SpCas9 activity-predicting programs such as DeepCRISPR. The results suggested that DeepSpCas9 had the highest generalization performance among nine published models used to predict SpCas9 activity. In this way, Hui Kwon Kim and the research team extensively validated the potential to accurately predict SpCas9 activity using the DeepSpCas9 web tool, now available online, alongside supplementary code provided for research scientists to incorporate DeepSpCas9 into existing models. Based on the high generalization performance of DeepSpCas9, the research team expects the tool to facilitate more accurate SpCas9-based genome editing.