Generalising better: applying deep learning to prioritise deleterious point mutations

Ilia Korvigo


The dramatic increase in our capacity to sequence human exomes affordably has already pushed the technology into the consumer market, while intensifying research into the integration of personal genomic data into medical practice. Correspondingly, over the past fifteen years researchers have developed a plethora of individual deleteriousness scoring systems, among them established tools such as PolyPhen and SIFT, as well as many new and promising options, including FATHMM. Lately, the focus has been shifting from creating novel standalone tools towards combining available scoring systems into ensembles or meta-scores [1]. Nevertheless, two important issues remain largely unaddressed: the relatively small number of high-quality labelled reference SNPs (used to train and test automatic classifiers), and the high probability that the information needed to categorise a given SNP is incomplete. Although the quality and size of reference mutation datasets have been growing steadily, they are biased towards intensively studied proteins and in vitro model systems, which makes it hard to generalise well [1]. At the same time, combining many different individual features for SNP evaluation produces a highly complicated, nonlinear and irregular feature space, which renders many popular missing-data imputation techniques ineffective and further highlights the need for better generalisation. We believe both issues can be addressed by a combination of deep unsupervised and supervised learning.

According to the universal approximation theorem, an ANN (artificial neural network) with enough hidden units can approximate any continuous function on a compact domain to arbitrary accuracy [3]. We have focused on two distinct designs that have been successful in different fields of science and industry: dense MLPs (multilayer perceptrons) with ReLU (rectified linear unit) activation and dropout training, and dense denoising autoencoders with greedy layer-wise pre-training [2]. We used a genetic algorithm to search for the best-performing architectures for these networks, and our early results showed a dramatic increase in generalisation capability when processing SNPs with both complete and incomplete records. Our model outperformed the best available classifier on the industry-standard VariBench dataset in both ROC-curve AUC (area under the curve) and precision-recall by statistically significant margins, even though our algorithm had to work with incomplete records full of missing data, while the competitors only worked with complete records.
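As a rough illustration of the first design, the sketch below runs a forward pass through a dense ReLU network with inverted dropout, using only the Python standard library. It is not the model described above: the layer sizes, weights, dropout rate and input values are made-up assumptions, and the function names (`relu`, `dense`, `dropout`, `mlp_forward`) are hypothetical.

```python
import random


def relu(values):
    """Rectified linear unit, applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and rescale
    the survivors so the expected activation stays the same."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def mlp_forward(features, layers, rate=0.5, training=False, rng=None):
    """Forward pass through hidden ReLU layers and a linear output layer.

    `layers` is a list of (weights, biases) pairs. Dropout is applied to
    hidden activations only, and only when training=True.
    """
    rng = rng or random.Random(0)
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
        if training:
            hidden = dropout(hidden, rate, rng)
    out_weights, out_biases = layers[-1]
    return dense(hidden, out_weights, out_biases)


# Hypothetical toy network: 3 input scores -> 2 hidden units -> 1 output.
toy_layers = [
    ([[0.5, -1.0, 0.25], [1.0, 1.0, -0.5]], [0.0, 0.1]),
    ([[1.0, -2.0]], [0.05]),
]
toy_features = [0.2, 0.4, 0.8]  # made-up, already-scaled predictor scores
score = mlp_forward(toy_features, toy_layers)  # approximately [-0.55]
```

At prediction time (`training=False`) the pass is deterministic; with `training=True` a random subset of hidden units is silenced on every call, which is the regularising effect that dropout training relies on.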

Inspired by these results, we then focused on ways to produce physiologically meaningful interpretations of individual mutations and their combinations (that is, to account for epistasis). To that end, we started collecting and investigating physiological and biochemical data on protein-protein interactions, protein domain structure, in vitro mutagenesis, tissue-specific splicing, metabolic pathways and phylogenetic gene-family expansion. We believe these data, integrated in a single framework, can produce descriptions of greater sense and value for medical geneticists. The main challenge we face, like many researchers before us, apart from mining and stacking together many different and conflicting sources of information, is feature design and extraction. The good news is that recent developments in CNNs (convolutional neural networks), whose convolutional layers are trained as feature filters, have significantly simplified the task, reducing it to collecting enough data and fine-tuning the design, though we have yet to evaluate the approach.
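To make the idea of a convolutional layer acting as a feature filter concrete, the sketch below slides a single filter over a one-hot encoded protein sequence. It is only an illustration under stated assumptions: the abstract does not specify an encoding or architecture, the sequence and the "GK" motif are made up, and the filter is set by hand here, whereas in a real CNN its weights would be learned. The names `one_hot`, `conv1d` and `global_max_pool` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    return [[1.0 if residue == ref else 0.0 for ref in AMINO_ACIDS]
            for residue in sequence]


def conv1d(encoded, kernel, bias=0.0):
    """Slide `kernel` (width x 20 weights) along the sequence without
    padding and apply ReLU to each window's weighted sum."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        total = bias + sum(k * x
                           for k_row, x_row in zip(kernel, window)
                           for k, x in zip(k_row, x_row))
        outputs.append(max(0.0, total))
    return outputs


def global_max_pool(activations):
    """Keep the strongest filter response, wherever it occurs."""
    return max(activations) if activations else 0.0


# Hand-set filter that responds to the made-up two-residue motif "GK".
motif_filter = one_hot("GK")
responses = conv1d(one_hot("MAGKST"), motif_filter)  # [0.0, 0.0, 2.0, 0.0, 0.0]
strongest = global_max_pool(responses)               # 2.0
```

The filter fires only at the window where the motif occurs, and pooling makes the resulting feature independent of the motif's position, which is what removes much of the manual feature design.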



  1. Dong et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human Molecular Genetics (2014).
  2. Vincent et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research (2010).
  3. G. Cybenko. Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems (1989).


Ilya Korvigo


In 2014 I completed a bachelor's degree in molecular biology at the Department of Microbiology, Faculty of Biology, Saint Petersburg State University. My thesis analysed time series of metagenomes and metatranscriptomes from soil microbial communities. After enrolling in the master's programme at the same department, I worked on modelling the state of microbiomes in industrially exploited land. The results of my master's thesis have already been published as a paper, and the thesis defence is scheduled for June 2016. In 2015 I graduated from the Bioinformatics Institute at St. Petersburg Academic University. During my studies there I completed two research projects: one on the evolution of transposable-element-derived repeats in mammalian genomes, and one on developing an algorithm for fast searching of homologous gene fragments in unassembled metagenomic libraries, using the family of rhizobial symbiotic genes as an example. After graduating from the Bioinformatics Institute I did a summer internship at iBinom, where I studied existing systems for automatic assessment of the pathogenicity of point mutations and improved iBinom's proprietary algorithm for that task. In parallel, I work on statistical data analysis, mathematical modelling and software development in the Laboratory of Microbiological Monitoring and Bioremediation of Soils at the All-Russia Research Institute for Agricultural Microbiology (FGBNU VNIISKhM). Research interests: metagenomics, population genetics, mathematical modelling, machine learning.