DeepVariant: Highly Accurate Genomes with Deep Neural Networks (research.googleblog.com)
The approach is the right one for small genetic variants. But it will be hard to handle more complex kinds of variation without adapting how training examples are synthesized from the alignments.
I think the field should cool it on calling the results of something like DeepVariant "genomes". These are genotypes, not fully sequenced and reconstructed genomes. The evaluations are typically on easy regions, and we have no reason to believe that those are the only ones that matter. One important tool for digging into this is Syndip, a synthetic diploid in which the full haplotypes are known (https://www.biorxiv.org/content/early/2017/11/22/223297). It is a mixture of two haploid human genomes that were de novo sequenced with PacBio technology. For the curious, such haploid human genomes only exist in molar pregnancies, so even this isn't ideal, but it is maybe the best resource we have at present.
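To make the synthetic diploid idea concrete, here is a minimal sketch of why known haplotypes give you a truth set: if you know both haploid sequences, the diploid genotype at every site follows directly. It assumes the two haplotypes are already aligned to the reference and differ only by substitutions (no indels), which real assemblies do not satisfy. The function name and the toy sequences are made up for illustration and are not from the Syndip paper.

```python
def truth_genotypes(reference, hap_a, hap_b):
    """Derive diploid genotypes wherever either haplotype differs from the reference.

    Assumes all three strings have the same length and are already aligned
    (substitutions only). This is a simplification for illustration.
    """
    if not (len(reference) == len(hap_a) == len(hap_b)):
        raise ValueError("sequences must be the same length")

    calls = {}
    for pos, (ref, a, b) in enumerate(zip(reference, hap_a, hap_b)):
        if a == ref and b == ref:
            continue  # homozygous reference: not a variant site
        zygosity = "hom" if a == b else "het"
        calls[pos] = (a, b, zygosity)  # 0-based position
    return calls


# Toy data: hap_b carries a variant at position 2, both carry one at position 4.
reference = "ACGTACGT"
hap_a = "ACGTTCGT"
hap_b = "ACCTTCGT"
print(truth_genotypes(reference, hap_a, hap_b))
```

A caller's output on reads simulated or mixed from the two haplotypes can then be compared against this table, including in regions that the usual benchmarks exclude.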
GATK is still the standard, not because better variant callers don't exist, but because it's more important that everyone uses the same tool for comparisons between studies.
It's actually possible that DeepVariant is implicitly learning some of these correlations (1). This would make it really, really bad at picking out the rare individuals who don't fit a trend (and who tend to be very important for identifying disease loci). GATK definitely does not know about correlated SNPs.
(1) The paper implies this is not the case, saying that DeepVariant works for other genomes without retraining, but they don't show the relevant results.
Obligatory reference: https://xkcd.com/1831
"“Why Should I Trust You?” Explaining the Predictions of Any Classifier": https://arxiv.org/pdf/1602.04938.pdf
https://homes.cs.washington.edu/~marcotcr/blog/lime/
https://github.com/marcotcr/lime
Anytime anyone makes snide HN comments like "oh you can't understand why neural networks make predictions" the correct response should always be "why doesn't LIME work in your specific case".
LIME is being used within the EU to explain credit decisions and fraud detection flags from neural-network-based models, which is quite a high regulatory bar to clear.
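For anyone who hasn't looked at how this kind of explanation works, here is a rough sketch of the underlying idea: perturb the input around one instance, query the black box, and fit a distance-weighted linear model whose coefficients serve as the local explanation. This is a toy written in plain Python in the spirit of LIME. It is not the `lime` package's API, and `black_box`, `explain_locally`, the kernel, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def black_box(x):
    """Hypothetical opaque model: a nonlinear score over two features."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 0.5 * x[1] ** 2)))


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def explain_locally(predict, instance, n_samples=500, scale=0.3,
                    kernel_width=0.5, seed=0):
    """Fit a weighted linear surrogate to `predict` around `instance`."""
    rng = random.Random(seed)
    d = len(instance)
    # Accumulate the normal equations (X^T W X) coef = X^T W y.
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, scale) for _ in range(d)]
        sample = [xi + di for xi, di in zip(instance, delta)]
        dist2 = sum(di * di for di in delta)
        weight = math.exp(-dist2 / kernel_width ** 2)  # nearer samples count more
        row = [1.0] + delta  # intercept plus offsets from the instance
        y = predict(sample)
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    coef = solve_linear_system(xtwx, xtwy)
    return {"intercept": coef[0], "weights": coef[1:]}


explanation = explain_locally(black_box, [0.0, 2.0])
print(explanation["weights"])  # sign and size show each feature's local effect
```

Around the point `[0.0, 2.0]` the first weight comes out positive and the second negative, matching how the toy model actually behaves there. The explanation is only valid near that one instance, which is both the appeal and the limitation of the approach.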
In this case, I understood the question to be "will deep learning do a good job predicting the function of/phenotype emerging from individual SNPs" and I don't think model interpretation would help (for starters, the model is trained to predict linkage and doesn't deal with data related to phenotypes).
Of course the NN won't interpret the results, it will just provide better results.