Chief Artificial Intelligence Scientist @ UHN; Assistant Professor @ University of Toronto; CIFAR AI Chair @ Vector Institute; Former PhD, CS @ Stanford University; Twitter: @BoWang87
🎉 Exciting News in Genomic Foundation Models! 🎉

Genetic variants (GVs) are key in diagnosing and treating genetic diseases. With the plummeting costs of next-generation sequencing, we now have a treasure trove of GV data. But this surge presents a challenge for clinicians: how to prioritize patient-specific GVs and integrate them with existing databases for better patient care.

🔍 Deep learning models, and more recently foundation models, have shown promise in variant effect prediction (VEP), but they often reduce the problem to a binary classification: pathogenic or benign. These models also lack standardized performance assessments and fail to consider the complexities of genetic expression, such as penetrance and expressivity, across different biological contexts.

💡 Enter: Representation Learning! We believe it's the key to effectively classifying unknown GVs and aligning them with clinically verified ones.

Introducing our large-scale dataset: 🌟 GV-Rep 🌟

Designed for foundation models, it features:
--Comprehensive Dataset: 7 million records, including data from 3,166,541 MAVEs, 17,548 gene knockout tests across 1,107 cell lines, QTLs across 14 tissue types, 1,808 oligogenic variant combinations, and 156 clinically verified GVs.
--Detailed Analysis: Exploring the structure and properties of the dataset.
--Experimentation with Genomic Foundation Models (GFMs): Revealing the gap between current GFM capabilities and accurate GV representation, especially for cell- and tissue-level tasks.

We hope GV-Rep will advance genomic foundation models and bridge this gap.

📚 Preprint: https://lnkd.in/gy722qZi
💻 Code: https://lnkd.in/ghKgrzMt

Huge shoutout to Vallijah Subasri, an ML scientist on our team, and Zehui Li, an intern at Vector Institute, for their leadership in this project! Stay tuned for our full paper with detailed experiments and more clinically meaningful use cases! 🚀