Mark DePristo

Los Altos, California, United States
11K followers 500+ connections

About

Mark DePristo (@MarkDePristo on Twitter) is co-founder and CEO of BigHat Biosciences, an…

Experience & Education

  • BigHat Biosciences

Publications

  • Likelihood Ratios for Out-of-Distribution Detection

    NeurIPS

    Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. Finally, we demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.
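
The likelihood ratio idea in the abstract above can be illustrated with a toy sketch. This is not the paper's method as implemented: the paper uses deep generative models, whereas the per-nucleotide frequency model, the function names, and the training sequences below are all assumptions made for illustration. The point is only that subtracting a background model's log-likelihood from the full model's log-likelihood cancels the statistics both models share.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def fit_unigram(seqs, pseudocount=1.0):
    """Fit a toy per-nucleotide frequency model (a stand-in for a deep generative model)."""
    counts = Counter(ch for s in seqs for ch in s)
    total = sum(counts[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

def log_likelihood(model, seq):
    """Log-likelihood of a sequence under an independent-sites model."""
    return sum(math.log(model[ch]) for ch in seq)

def likelihood_ratio_score(full_model, background_model, seq):
    """log p_full(seq) - log p_background(seq); higher suggests in-distribution."""
    return log_likelihood(full_model, seq) - log_likelihood(background_model, seq)

# Hypothetical training data: an A-rich "in-distribution" set and a balanced background set.
full_model = fit_unigram(["AAAAAAAC", "AAAAAAGT"])
background_model = fit_unigram(["ACGT"])

in_score = likelihood_ratio_score(full_model, background_model, "AAAA")
ood_score = likelihood_ratio_score(full_model, background_model, "GGGG")
```

In this toy setup the A-rich query scores above zero and the G-rich query scores below zero, so a threshold on the ratio separates them.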

  • Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome

    Nature Biotechnology

    The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the ‘genome in a bottle’ (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.
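
For readers unfamiliar with the contig N50 statistic quoted in the abstract above, a minimal sketch of the usual definition follows: the largest length L such that contigs of length at least L together cover at least half of the assembly. The function name and the example lengths are hypothetical and are not taken from the paper.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (usual half-of-total definition)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths in arbitrary units.
example_n50 = contig_n50([2, 3, 4, 5, 6])
```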

  • GenomeWarp: an alignment-based variant coordinate transformation

    Bioinformatics

    Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome.

  • A guide to deep learning in healthcare

    Nature Medicine

    Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep-learning methods for genomics are reviewed.

  • A universal SNP and small-indel variant caller using deep neural networks

    Nature Biotechnology

    Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.

  • Deep learning of genomic variation and regulatory network data

    Human Molecular Genetics

    The human genome is now investigated through high-throughput functional assays, and through the generation of population genomic data. These advances support the identification of functional genetic variants and the prediction of traits (e.g. deleterious variants and disease). This review summarizes lessons learned from the large-scale analyses of genome and exome data sets, modeling of population data and machine-learning strategies to solve complex genomic sequence regions. The review also portrays the rapid adoption of artificial intelligence/deep neural networks in genomics; in particular, deep learning approaches are well suited to model the complex dependencies in the regulatory landscape of the genome, and to provide predictors for genetic variant calling and interpretation.

  • Evaluating the contribution of rare variants to type 2 diabetes and related traits using pedigrees

    PNAS

    A major challenge in evaluating the contribution of rare variants to complex disease is identifying enough copies of the rare alleles to permit informative statistical analysis. To investigate the contribution of rare variants to the risk of type 2 diabetes (T2D) and related traits, we performed deep whole-genome analysis of 1,034 members of 20 large Mexican-American families with high prevalence of T2D. If rare variants of large effect accounted for much of the diabetes risk in these families, our experiment was powered to detect association. Using gene expression data on 21,677 transcripts for 643 pedigree members, we identified evidence for large-effect rare-variant cis-expression quantitative trait loci that could not be detected in population studies, validating our approach. However, we did not identify any rare variants of large effect associated with T2D, or the related traits of fasting glucose and insulin, suggesting that large-effect rare variants account for only a modest fraction of the genetic risk of these traits in this sample of families. Reliable identification of large-effect rare variants will require larger samples of extended pedigrees or different study designs that further enrich for such variants.

  • A framework for the interpretation of de novo mutation in human disease

    Nature Genetics

    Spontaneously arising (de novo) mutations have an important role in medical genetics. For diseases with extensive locus heterogeneity, such as autism spectrum disorders (ASDs), the signal from de novo mutations is distributed across many genes, making it difficult to distinguish disease-relevant mutations from background variation. Here we provide a statistical framework for the analysis of excesses in de novo mutation per gene and gene set by calibrating a model of de novo mutation. We applied this framework to de novo mutations collected from 1,078 ASD family trios, and, whereas we affirmed a significant role for loss-of-function mutations, we found no excess of de novo loss-of-function mutations in cases with IQ above 100, suggesting that the role of de novo mutations in ASDs might reside in fundamental neurodevelopmental processes. We also used our model to identify ∼1,000 genes that are significantly lacking in functional coding variation in non-ASD samples and are enriched for de novo loss-of-function mutations identified in ASD cases.

  • A polygenic burden of rare disruptive mutations in schizophrenia

    Nature

    Schizophrenia is a common disease with a complex aetiology, probably involving multiple and heterogeneous genetic factors. Here, by analysing the exome sequences of 2,536 schizophrenia cases and 2,543 controls, we demonstrate a polygenic burden primarily arising from rare (less than 1 in 10,000), disruptive mutations distributed across many genes. Particularly enriched gene sets include the voltage-gated calcium ion channel and the signalling complex formed by the activity-regulated cytoskeleton-associated scaffold protein (ARC) of the postsynaptic density, sets previously implicated by genome-wide association and copy-number variation studies. Similar to reports in autism, targets of the fragile X mental retardation protein (FMRP, product of FMR1) are enriched for case mutations. No individual gene-based test achieves significance after correction for multiple testing and we do not detect any alleles of moderately low frequency (approximately 0.5 to 1 per cent) and moderately large effect. Taken together, these data suggest that population-based exome sequencing can discover risk alleles and complements established gene-mapping paradigms in neuropsychiatric disease.

  • Loss-of-function mutations in APOC3, triglycerides, and coronary disease

    New England Journal of Medicine

    BACKGROUND:
    Plasma triglyceride levels are heritable and are correlated with the risk of coronary heart disease. Sequencing of the protein-coding regions of the human genome (the exome) has the potential to identify rare mutations that have a large effect on phenotype.
    METHODS:
    We sequenced the protein-coding regions of 18,666 genes in each of 3734 participants of European or African ancestry in the Exome Sequencing Project. We conducted tests to determine whether rare mutations in coding sequence, individually or in aggregate within a gene, were associated with plasma triglyceride levels. For mutations associated with triglyceride levels, we subsequently evaluated their association with the risk of coronary heart disease in 110,970 persons.
    RESULTS:
    An aggregate of rare mutations in the gene encoding apolipoprotein C3 (APOC3) was associated with lower plasma triglyceride levels. Among the four mutations that drove this result, three were loss-of-function mutations: a nonsense mutation (R19X) and two splice-site mutations (IVS2+1G→A and IVS3+1G→T). The fourth was a missense mutation (A43T). Approximately 1 in 150 persons in the study was a heterozygous carrier of at least one of these four mutations. Triglyceride levels in the carriers were 39% lower than levels in noncarriers (P<1×10⁻²⁰), and circulating levels of APOC3 in carriers were 46% lower than levels in noncarriers (P=8×10⁻¹⁰). The risk of coronary heart disease among 498 carriers of any rare APOC3 mutation was 40% lower than the risk among 110,472 noncarriers (odds ratio, 0.60; 95% confidence interval, 0.47 to 0.75; P=4×10⁻⁶).
    CONCLUSIONS:
    Rare mutations that disrupt APOC3 function were associated with lower levels of plasma triglycerides and APOC3. Carriers of these mutations were found to have a reduced risk of coronary heart disease. (Funded by the National Heart, Lung, and Blood Institute and others.)

  • A systematic survey of loss-of-function variants in human protein-coding genes

    Science

    Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

  • An integrated map of genetic variation from 1,092 human genomes

    Nature

    By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

  • Pacific Biosciences sequencing technology for genotyping and variation discovery in human data

    BMC Genomics

    BACKGROUND:
    Pacific Biosciences technology provides a fundamentally new data type that provides the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects.

    RESULTS:
    We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs. We assessed data quality: most errors were indels (~14%) with few apparent miscalls (~1%). In this work, we define a custom data processing pipeline for Pacific Biosciences data for human data analysis.

    CONCLUSION:
    Critically, the error properties were largely free of the context-specific effects that affect other sequencing technologies. These data show excellent utility for follow-up validation and extension studies in human data and medical genetics projects, but can be extended to other organisms with a reference genome.

  • A framework for variation discovery and genotyping using next-generation DNA sequencing data

    Nature Genetics

    Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.

  • Variation in genome-wide mutation rates within and between families

    Nature Genetics

    J.B.S. Haldane proposed in 1947 that the male germline may be more mutagenic than the female germline. Diverse studies have supported Haldane's contention of a higher average mutation rate in the male germline in a variety of mammals, including humans. Here we present, to our knowledge, the first direct comparative analysis of male and female germline mutation rates from the complete genome sequences of two parent-offspring trios. Through extensive validation, we identified 49 and 35 germline de novo mutations (DNMs) in two trio offspring, as well as 1,586 non-germline DNMs arising either somatically or in the cell lines from which the DNA was derived. Most strikingly, in one family, we observed that 92% of germline DNMs were from the paternal germline, whereas, in contrast, in the other family, 64% of DNMs were from the maternal germline. These observations suggest considerable variation in mutation rates within and between families.

  • The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data

    Genome Research

    Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
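
The MapReduce philosophy described in the abstract above, applied to a coverage calculator, might look roughly like the following sketch. It is plain Python and does not use or imitate the actual GATK API: the function names, the representation of reads as half-open `(start, end)` intervals, and the example reads are all assumptions made for illustration.

```python
from functools import reduce

def map_depth(reads, position):
    """Map step: depth at one locus, i.e. the number of reads overlapping it."""
    return sum(1 for start, end in reads if start <= position < end)

def reduce_sum(accumulated, depth):
    """Reduce step: fold per-locus depths into a running total."""
    return accumulated + depth

def mean_coverage(reads, region_start, region_end):
    """Mean depth over the half-open region [region_start, region_end)."""
    depths = (map_depth(reads, pos) for pos in range(region_start, region_end))
    total = reduce(reduce_sum, depths, 0)
    return total / (region_end - region_start)

# Hypothetical reads: two overlapping alignments on a 6-base region.
example_reads = [(0, 4), (2, 6)]
example_mean = mean_coverage(example_reads, 0, 6)
```

Keeping the per-locus calculation separate from the traversal is the design point the abstract emphasizes: the traversal can then be parallelized or optimized without touching the analysis code.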

  • Darwinian evolution can follow only very few mutational paths to fitter proteins

    Science

    Five point mutations in a particular beta-lactamase allele jointly increase bacterial resistance to a clinically important antibiotic by a factor of approximately 100,000. In principle, evolution to this high-resistance beta-lactamase might follow any of the 120 mutational trajectories linking these alleles. However, we demonstrate that 102 trajectories are inaccessible to Darwinian selection and that many of the remaining trajectories have negligible probabilities of realization, because four of these five mutations fail to increase drug resistance in some combinations. Pervasive biophysical pleiotropy within the beta-lactamase seems to be responsible, and because such pleiotropy appears to be a general property of missense mutations, we conclude that much protein evolution will be similarly constrained. This implies that the protein tape of life may be largely reproducible and even predictable.
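
The figure of 120 trajectories in the abstract above is the number of orderings of five mutations (5! = 120). The sketch below enumerates those orderings and counts the ones in which every step strictly increases fitness. The fitness function is a hypothetical toy with a single sign-epistatic interaction, not the measured resistance data from the paper, so its accessible count is not expected to match the 18 trajectories the abstract implies remain.

```python
from itertools import permutations

def toy_fitness(genotype):
    """Hypothetical fitness: each mutation adds 1, but mutation 0 is harmful unless mutation 1 is present."""
    penalty = 2 if (0 in genotype and 1 not in genotype) else 0
    return len(genotype) - penalty

def is_accessible(order, fitness):
    """A trajectory is accessible if each added mutation strictly increases fitness."""
    genotype = frozenset()
    for mutation in order:
        next_genotype = genotype | {mutation}
        if fitness(next_genotype) <= fitness(genotype):
            return False
        genotype = next_genotype
    return True

def count_paths(fitness, n_mutations=5):
    """Return (total trajectories, accessible trajectories)."""
    orders = list(permutations(range(n_mutations)))
    accessible = sum(is_accessible(order, fitness) for order in orders)
    return len(orders), accessible

total_paths, accessible_paths = count_paths(toy_fitness)
```

With this toy function, only orderings in which mutation 1 precedes mutation 0 are accessible, which is half of the 120.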

  • Crystallographic refinement by knowledge-based exploration of complex energy landscapes

    http://www.ncbi.nlm.nih.gov/pubmed/16154088

    Although X-ray crystallography remains the most versatile method to determine the three-dimensional atomic structure of proteins and much progress has been made in model building and refinement techniques, it remains a challenge to elucidate accurately the structure of proteins in medium-resolution crystals. This is largely due to the difficulty of exploring an immense conformational space to identify the set of conformers that collectively best fits the experimental diffraction pattern. We show here that combining knowledge-based conformational sampling in RAPPER with molecular dynamics/simulated annealing (MD/SA) vastly improves the quality and power of refinement compared to MD/SA alone. The utility of this approach is highlighted by the automated determination of a lysozyme mutant from a molecular replacement solution that is in congruence with a model prepared independently by crystallographers. Finally, we discuss the implications of this work on structure determination in particular and conformational sampling and energy minimization in general.

Patents

  • Processing of biological sequences with neural networks

    Filed US 20190295688

    Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing a biological sequence using a neural network. One of the methods includes obtaining data identifying a biological sequence; generating, from the obtained data, an encoding of the biological sequence; processing the encoding using a deep neural network, wherein the deep neural network is configured through training to process the encoding to generate a score distribution over a set of biological labels for the biological sequence; and classifying the biological sequence using the score distribution.
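
    The steps named in the abstract (encode the sequence, produce a score distribution over labels, classify) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the one-hot encoding, the `ALPHABET`, the labels, and the weights are all hypothetical, and a single hand-set linear layer stands in for the trained deep neural network.

```python
import math

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def one_hot(sequence):
    """Encode each base as a 4-element indicator vector."""
    return [[1.0 if base == symbol else 0.0 for symbol in ALPHABET]
            for base in sequence]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sequence, weights, labels):
    """Return (best label, score distribution) for a sequence.

    `weights` holds one row of per-base weights per label; it is a
    stand-in for a trained network, not a real model.
    """
    encoding = one_hot(sequence)
    n = len(encoding)
    # Pool the encoding into base frequencies (a deliberately simple feature).
    features = [sum(row[i] for row in encoding) / n
                for i in range(len(ALPHABET))]
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    distribution = softmax(scores)
    best = max(range(len(labels)), key=distribution.__getitem__)
    return labels[best], distribution


# Hypothetical labels and weights chosen by hand for the example.
LABELS = ["gc_rich", "at_rich"]
WEIGHTS = [[-1.0, 1.0, 1.0, -1.0],
           [1.0, -1.0, -1.0, 1.0]]

label, distribution = classify("GGCCGC", WEIGHTS, LABELS)
print(label, distribution)
```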

  • Deep learning analysis pipeline for next generation sequencing

    Issued US 10354747

    A method for variant calling in a next generation sequencing analysis pipeline involves obtaining a plurality of sequence reads that each include a nucleotide aligned at a nucleotide position within a sample genome. The method also involves obtaining a plurality of alleles associated with the nucleotide position. The method further involves determining that a particular allele of the plurality of alleles matches one or more sequence reads of the plurality of sequence reads, wherein the particular allele is located at the nucleotide position. Additionally, the method involves generating an image based on information associated with the plurality of sequence reads. Further, the method involves determining, by providing the generated image to a trained neural network, a likelihood that the sample genome contains the particular allele. The method may also involve providing an output signal indicative of the determined likelihood.

  • Methods and systems for determining autism spectrum disorder risk

    Issued US 9,176,113

    In certain embodiments, the invention stems from the discovery that analysis of population distribution curves of metabolite levels in blood can be used to facilitate predicting risk of autism spectrum disorder (ASD) and/or to differentiate between ASD and non-ASD developmental delay (DD) in a subject. In certain aspects, information from assessment of the presence, absence, and/or direction (upper or lower) of a tail effect in a metabolite distribution curve is utilized to predict risk of ASD and/or to differentiate between ASD and DD.

  • Reference Sample Based Pooled Hybrid Selection Sequencing Method And Analytics

    Filed US PCT/US2013/031429

    Systems and methods are provided for reducing (e.g. compressing) representations of DNA sequencing data. These can include providing a plurality of DNA sequence reads, selecting a region of the reads and comparing the reads in the region to determine an ambiguity, and creating a consensus region and collapsing the selected region into a synthetic read if the ambiguity does not exceed a threshold.
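
    The collapse step described above can be sketched as follows. This is an illustrative reading, not the filed method: the function name `collapse_region`, the definition of ambiguity as the fraction of bases that disagree with the per-column majority, and the default threshold are all assumptions, and the reads are assumed to be pre-aligned strings of equal length.

```python
from collections import Counter


def collapse_region(reads, start, end, max_ambiguity=0.1):
    """Collapse reads[start:end] into one synthetic read, or return None.

    Ambiguity is taken here to be the fraction of bases in the region
    that differ from the majority base in their column (an assumption).
    """
    segments = [read[start:end] for read in reads]
    consensus = []
    mismatches = 0
    for column in zip(*segments):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        mismatches += len(column) - count
    total = len(segments) * (end - start)
    ambiguity = mismatches / total if total else 0.0
    if ambiguity > max_ambiguity:
        return None  # too ambiguous: keep the original reads
    return "".join(consensus)


similar_reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
divergent_reads = ["AAAA", "CCCC", "GGGG"]

print(collapse_region(similar_reads, 0, 6))
print(collapse_region(divergent_reads, 0, 4))
```

    A caller would replace the original reads in the region with the synthetic read when one is returned, and leave them untouched when the result is `None`.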

  • Systems and Methods for Reducing Representations of Genome Sequencing Data

    Filed US 61/893,874

    The present disclosure relates generally to the field of genome sequencing. More particularly, the disclosure relates to methods and systems for reducing the amount of data of sequence reads without substantially losing accuracy and/or statistical power.

Projects

  • DeepVariant

    - Present

    DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

    See https://research.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html for a general introduction.

  • 1000 Genomes Project

    - Present

    The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly.

  • Genome Analysis Toolkit (GATK)

    - Present

    The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

  • Public talks

    2017
    * Santa Clara IEEE invited lecture
    * National Academy Precision Medicine and AI invited speaker
    * TensorFlow and Biology at TensorFlow Workshop
    * American Association for Cancer Research invited speaker
    * GCP Genomics workshop speaker
    * Stanford Biostats invited lecture
    * Stanford Biodesign invited speaker
    * JASON briefing invited speaker
    * Singularity University invited speaker
    * Festival of Genomics invited speaker
    * AGBT scientific talk selected from among submitted abstracts

    2016
    * Illumina's Scientific Advisory Board external speaker
    * UC Berkeley Genomics Program Biotechnology Companies Day invited speaker
    * Pacific Biosciences invited speaker
    * Deep Cheminformatics conference at Stanford invited speaker
    * The Broad Institute data sciences lecture

Languages

  • English

    Native or bilingual proficiency
