-
GreenFaaS: Maximizing Energy Efficiency of HPC Workloads with FaaS
Authors:
Alok Kamatar,
Valerie Hayot-Sasson,
Yadu Babuji,
Andre Bauer,
Gourav Rattihalli,
Ninad Hogade,
Dejan Milojicic,
Kyle Chard,
Ian Foster
Abstract:
Application energy efficiency can be improved by executing each application component on the compute element that consumes the least energy while also satisfying time constraints. In principle, the function as a service (FaaS) paradigm should simplify such optimizations by abstracting away compute location, but existing FaaS systems do not provide for user transparency over application energy consumption or task placement. Here we present GreenFaaS, a novel open source framework that bridges this gap between energy-efficient applications and FaaS platforms. GreenFaaS can be deployed by end users or providers across systems to monitor energy use, provide task-specific feedback, and schedule tasks in an energy-aware manner. We demonstrate that intelligent placement of tasks can both reduce energy consumption and improve performance. For a synthetic workload, GreenFaaS reduces the energy-delay product by 45% compared to alternatives. Furthermore, running a molecular design application through GreenFaaS can reduce energy consumption by 21% and runtime by 63% by better matching tasks with machines.
Submitted 25 June, 2024;
originally announced June 2024.
-
UniFaaS: Programming across Distributed Cyberinfrastructure with Federated Function Serving
Authors:
Yifei Li,
Ryan Chard,
Yadu Babuji,
Kyle Chard,
Ian Foster,
Zhuozhao Li
Abstract:
Modern scientific applications are increasingly decomposable into individual functions that may be deployed across distributed and diverse cyberinfrastructure such as supercomputers, clouds, and accelerators. Such applications call for new approaches to programming, distributed execution, and function-level management. We present UniFaaS, a parallel programming framework that relies on a federated function-as-a-service (FaaS) model to enable composition of distributed, scalable, and high-performance scientific workflows, and to support fine-grained function-level management. UniFaaS provides a unified programming interface to compose dynamic task graphs with transparent wide-area data management. UniFaaS exploits an observe-predict-decide approach to efficiently map workflow tasks to target heterogeneous and dynamic resources. We propose a dynamic heterogeneity-aware scheduling algorithm that employs a delay mechanism and a re-scheduling mechanism to accommodate dynamic resource capacity. Our experiments show that UniFaaS can efficiently execute workflows across computing resources with minimal scheduling overhead. We show that UniFaaS can improve the performance of a real-world drug screening workflow by as much as 22.99% when employing an additional 19.48% of resources and a montage workflow by 54.41% when employing an additional 47.83% of resources across multiple distributed clusters, in contrast to using a single cluster.
Submitted 28 March, 2024;
originally announced March 2024.
-
The Changing Role of RSEs over the Lifetime of Parsl
Authors:
Daniel S. Katz,
Ben Clifford,
Yadu Babuji,
Kevin Hunter Kesling,
Anna Woodard,
Kyle Chard
Abstract:
This position paper describes the Parsl open source research software project and its various phases over seven years. It defines four types of research software engineers (RSEs) who have been important to the project in those phases; we believe this is also applicable to other research software projects.
Submitted 20 July, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
Developing Distributed High-performance Computing Capabilities of an Open Science Platform for Robust Epidemic Analysis
Authors:
Nicholson Collier,
Justin M. Wozniak,
Abby Stevens,
Yadu Babuji,
Mickaël Binois,
Arindam Fadikar,
Alexandra Würth,
Kyle Chard,
Jonathan Ozik
Abstract:
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community have forged new relationships among domain experts, mathematical modelers, and scientific computing specialists. Computationally, however, it also revealed critical gaps in the ability of researchers to exploit advanced computing systems. These challenging areas include gaining access to scalable computing systems, porting models and workflows to new systems, sharing data of varying sizes, and producing results that can be reproduced and validated by others. Informed by our team's work in supporting public health decision makers during the COVID-19 pandemic and by the identified capability gaps in applying high-performance computing (HPC) to the modeling of complex social systems, we present the goals, requirements, and initial implementation of OSPREY, an open science platform for robust epidemic analysis. The prototype implementation demonstrates an integrated, algorithm-driven HPC workflow architecture, coordinating tasks across federated HPC resources, with robust, secure and automated access to each of the resources. We demonstrate scalable and fault-tolerant task execution, an asynchronous API to support fast time-to-solution algorithms, an inclusive, multi-language approach, and efficient wide-area data management. The example OSPREY code is made available on a public repository.
Submitted 10 May, 2023; v1 submitted 27 April, 2023;
originally announced April 2023.
-
Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources
Authors:
Logan Ward,
J. Gregory Pauloski,
Valerie Hayot-Sasson,
Ryan Chard,
Yadu Babuji,
Ganesh Sivaraman,
Sutanay Choudhury,
Kyle Chard,
Rajeev Thakur,
Ian Foster
Abstract:
Applications that fuse machine learning and simulation can benefit from the use of multiple computing resources, with, for example, simulation codes running on highly parallel supercomputers and AI training and inference tasks on specialized accelerators. Here, we present our experiences deploying two AI-guided simulation workflows across such heterogeneous systems. A unique aspect of our approach is our use of cloud-hosted management services to manage challenging aspects of cross-resource authentication and authorization, function-as-a-service (FaaS) function invocation, and data transfer.
We show that these methods can achieve performance parity with systems that rely on direct connection between resources. We achieve parity by integrating the FaaS system and data transfer capabilities with a system that passes data by reference among managers and workers, and a user-configurable steering algorithm to hide data transfer latencies. We anticipate that this ease of use can enable routine use of heterogeneous resources in computational science.
Submitted 15 March, 2023;
originally announced March 2023.
-
funcX: Federated Function as a Service for Science
Authors:
Zhuozhao Li,
Ryan Chard,
Yadu Babuji,
Ben Galewsky,
Tyler Skluzacek,
Kirill Nagaitsev,
Anna Woodard,
Ben Blaiszik,
Josh Bryan,
Daniel S. Katz,
Ian Foster,
Kyle Chard
Abstract:
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or administrators, on arbitrary laptops, clouds, clusters, and supercomputers, in effect turning them into function serving systems. funcX's cloud-hosted service provides a single location for registering, sharing, and managing both functions and endpoints. It allows for transparent, secure, and reliable function execution across the federated ecosystem of endpoints--enabling users to route functions to endpoints based on specific needs. funcX uses containers (e.g., Docker, Singularity, and Shifter) to provide common execution environments across endpoints. funcX implements various container management strategies to execute functions with high performance and efficiency on diverse funcX endpoints. funcX also integrates with an in-memory data store and Globus for managing data that may span endpoints. We motivate the need for funcX, present our prototype design and implementation, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than 130 000 concurrent workers. We show that funcX's container warming-aware routing algorithm can reduce the completion time for 3000 functions by up to 61% compared to a randomized algorithm and the in-memory data store can speed up data transfers by up to 3x compared to a shared file system.
Submitted 23 September, 2022;
originally announced September 2022.
-
Extended Abstract: Productive Parallel Programming with Parsl
Authors:
Kyle Chard,
Yadu Babuji,
Anna Woodard,
Ben Clifford,
Zhuozhao Li,
Mihael Hategan,
Ian Foster,
Mike Wilde,
Daniel S. Katz
Abstract:
Parsl is a parallel programming library for Python that aims to make it easy to specify parallelism in programs and to realize that parallelism on arbitrary parallel and distributed computing systems. Parsl relies on developers annotating Python functions (wrapping either Python or external applications) to indicate that these functions may be executed concurrently. Developers can then link together functions via the exchange of data. Parsl establishes a dynamic dependency graph and sends tasks for execution on connected resources when dependencies are resolved. Parsl's runtime system enables different compute resources to be used, from laptops to supercomputers, without modification to the Parsl program.
Submitted 4 May, 2022; v1 submitted 3 May, 2022;
originally announced May 2022.
-
Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing
Authors:
Logan Ward,
Ganesh Sivaraman,
J. Gregory Pauloski,
Yadu Babuji,
Ryan Chard,
Naveen Dandu,
Paul C. Redfern,
Rajeev S. Assary,
Kyle Chard,
Larry A. Curtiss,
Rajeev Thakur,
Ian Foster
Abstract:
Scientific applications that involve simulation ensembles can be accelerated greatly by using experiment design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. Colmena handles task dispatch, results collation, ML model invocation, and ML model (re)training, using Parsl to execute tasks on HPC systems. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.
Submitted 6 October, 2021;
originally announced October 2021.
-
Extreme Scale Survey Simulation with Python Workflows
Authors:
A. S. Villarreal,
Yadu Babuji,
Tom Uram,
Daniel S. Katz,
Kyle Chard,
Katrin Heitmann
Abstract:
The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will soon carry out an unprecedented wide, fast, and deep survey of the sky in multiple optical bands. The data from LSST will open up a new discovery space in astronomy and cosmology, simultaneously providing clues toward addressing burning issues of the day, such as the origin of dark energy and the nature of dark matter, while at the same time yielding data that will, in turn, pose fresh new questions. To prepare for the imminent arrival of this remarkable data set, it is crucial that the associated scientific communities be able to develop the software needed to analyze it. Computational power now available allows us to generate synthetic data sets that can be used as a realistic training ground for such an effort. This effort raises its own challenges -- the need to generate very large simulations of the night sky, scaling up simulation campaigns to large numbers of compute nodes across multiple computing centers with different architectures, and optimizing the complex workload around memory requirements and widely varying wall clock times. We describe here a large-scale workflow that melds together Python code to steer the workflow, Parsl to manage the large-scale distributed execution of workflow components, and containers to carry out the image simulation campaign across multiple sites. Taking advantage of these tools, we developed an extreme-scale computational framework and used it to simulate five years of observations for 300 square degrees of sky area. We describe our experiences and lessons learned in developing this workflow capability, and highlight how the scalability and portability of our approach enabled us to efficiently execute it on up to 4000 compute nodes on two supercomputers.
Submitted 24 September, 2021;
originally announced September 2021.
-
ExaWorks: Workflows for Exascale
Authors:
Aymen Al-Saadi,
Dong H. Ahn,
Yadu Babuji,
Kyle Chard,
James Corbett,
Mihael Hategan,
Stephen Herbein,
Shantenu Jha,
Daniel Laney,
Andre Merzky,
Todd Munson,
Michael Salim,
Mikhail Titov,
Matteo Turilli,
Justin M. Wozniak
Abstract:
Exascale computers will offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. These software combinations and integrations, however, are difficult to achieve due to challenges of coordination and deployment of heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which can address many of these challenges: ExaWorks is leading a co-design process to create a workflow software development Toolkit (SDK) consisting of a wide range of workflow management tools that can be composed and interoperate through common interfaces. We describe the initial set of tools and interfaces supported by the SDK, efforts to make them easier to apply to complex science challenges, and examples of their application to exemplar cases. Furthermore, we discuss how our project is working with the workflows community, large computing facilities as well as HPC platform vendors to sustainably address the requirements of workflows at the exascale.
Submitted 30 August, 2021;
originally announced August 2021.
-
Protein-Ligand Docking Surrogate Models: A SARS-CoV-2 Benchmark for Deep Learning Accelerated Virtual Screening
Authors:
Austin Clyde,
Thomas Brettin,
Alexander Partin,
Hyunseung Yoo,
Yadu Babuji,
Ben Blaiszik,
Andre Merzky,
Matteo Turilli,
Shantenu Jha,
Arvind Ramanathan,
Rick Stevens
Abstract:
We propose a benchmark to study surrogate model accuracy for protein-ligand docking. We share a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. Our work shows surrogate docking models have six orders of magnitude more throughput than standard docking protocols on the same supercomputer node types. We demonstrate the power of high-speed surrogate models by running each target against 1 billion molecules in under a day (50k predictions per GPU second). We showcase a workflow for docking utilizing surrogate ML models as a pre-filter. Our workflow is ten times faster at screening a library of compounds than the standard technique, with an error rate less than 0.01% of detecting the underlying best scoring 0.1% of compounds. Our analysis of the speedup explains that to screen more molecules under a docking paradigm, another order of magnitude speedup must come from model accuracy rather than computing speed (which, if increased, will no longer alter our throughput to screen molecules). We believe this is strong evidence for the community to begin focusing on improving the accuracy of surrogate models to improve the ability to screen massive compound libraries 100x or even 1000x faster than current techniques.
Submitted 30 June, 2021; v1 submitted 13 June, 2021;
originally announced June 2021.
-
Workflows Community Summit: Advancing the State-of-the-art of Scientific Workflows Management Systems Research and Development
Authors:
Rafael Ferreira da Silva,
Henri Casanova,
Kyle Chard,
Tainã Coleman,
Dan Laney,
Dong Ahn,
Shantenu Jha,
Dorran Howell,
Stian Soiland-Reys,
Ilkay Altintas,
Douglas Thain,
Rosa Filgueira,
Yadu Babuji,
Rosa M. Badia,
Bartosz Balis,
Silvina Caino-Lores,
Scott Callaghan,
Frederik Coppens,
Michael R. Crusoe,
Kaushik De,
Frank Di Natale,
Tu M. A. Do,
Bjoern Enders,
Thomas Fahringer,
Anne Fouilloux
, et al. (33 additional authors not shown)
Abstract:
Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale HPC platforms. Workflows will play a crucial role in the data-oriented and post-Moore's computing landscape as they democratize the application of cutting-edge research techniques, computationally intensive methods, and use of new computing platforms. As workflows continue to be adopted by scientific projects and user communities, they are becoming more complex. Workflows are increasingly composed of tasks that perform computations such as short machine learning inference, multi-node simulations, long-running machine learning model training, amongst others, and thus increasingly rely on heterogeneous architectures that include CPUs but also GPUs and accelerators. The workflow management system (WMS) technology landscape is currently segmented and presents significant barriers to entry due to the hundreds of seemingly comparable, yet incompatible, systems that exist. Another fundamental problem is that there are conflicting theoretical bases and abstractions for a WMS. Systems that use the same underlying abstractions can likely be translated between, which is not the case for systems that use different abstractions. More information: https://workflowsri.org/summits/technical
Submitted 9 June, 2021;
originally announced June 2021.
-
Workflows Community Summit: Bringing the Scientific Workflows Community Together
Authors:
Rafael Ferreira da Silva,
Henri Casanova,
Kyle Chard,
Dan Laney,
Dong Ahn,
Shantenu Jha,
Carole Goble,
Lavanya Ramakrishnan,
Luc Peterson,
Bjoern Enders,
Douglas Thain,
Ilkay Altintas,
Yadu Babuji,
Rosa M. Badia,
Vivien Bonazzi,
Taina Coleman,
Michael Crusoe,
Ewa Deelman,
Frank Di Natale,
Paolo Di Tommaso,
Thomas Fahringer,
Rosa Filgueira,
Grigori Fursin,
Alex Ganose,
Bjorn Gruning
, et al. (20 additional authors not shown)
Abstract:
Scientific workflows have been used almost universally across scientific domains, and have underpinned some of the most significant discoveries of the past several decades. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale high-performance computing (HPC) platforms. These executions must be managed using some software infrastructure. Due to the popularity of workflows, workflow management systems (WMSs) have been developed to provide abstractions for creating and executing workflows conveniently, efficiently, and portably. While these efforts are all worthwhile, there are now hundreds of independent WMSs, many of which are moribund. As a result, the WMS landscape is segmented and presents significant barriers to entry due to the hundreds of seemingly comparable, yet incompatible, systems that exist. As a result, many teams, small and large, still elect to build their own custom workflow solution rather than adopt, or build upon, existing WMSs. This current state of the WMS landscape negatively impacts workflow users, developers, and researchers. The "Workflows Community Summit" was held online on January 13, 2021. The overarching goal of the summit was to develop a view of the state of the art and identify crucial research challenges in the workflow community. Prior to the summit, a survey sent to stakeholders in the workflow community (including both developers of WMSs and users of workflows) helped to identify key challenges in this community that were translated into 6 broad themes for the summit, each of them being the object of a focused discussion led by a volunteer member of the community. This report documents and organizes the wealth of information provided by the participants before, during, and after the summit.
Submitted 16 March, 2021;
originally announced March 2021.
-
DESC DC2 Data Release Note
Authors:
LSST Dark Energy Science Collaboration,
Bela Abolfathi,
Robert Armstrong,
Humna Awan,
Yadu N. Babuji,
Franz Erik Bauer,
George Beckett,
Rahul Biswas,
Joanne R. Bogart,
Dominique Boutigny,
Kyle Chard,
James Chiang,
Johann Cohen-Tanugi,
Andrew J. Connolly,
Scott F. Daniel,
Seth W. Digel,
Alex Drlica-Wagner,
Richard Dubois,
Eric Gawiser,
Thomas Glanzman,
Salman Habib,
Andrew P. Hearin,
Katrin Heitmann,
Fabio Hernandez,
Renée Hložek
, et al. (32 additional authors not shown)
Abstract:
In preparation for cosmological analyses of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST), the LSST Dark Energy Science Collaboration (LSST DESC) has created a 300 deg$^2$ simulated survey as part of an effort called Data Challenge 2 (DC2). The DC2 simulated sky survey, in six optical bands with observations following a reference LSST observing cadence, was processed with the LSST Science Pipelines (19.0.0). In this Note, we describe the public data release of the resulting object catalogs for the coadded images of five years of simulated observations along with associated truth catalogs. We include a brief description of the major features of the available data sets. To enable convenient access to the data products, we have developed a web portal connected to Globus data services. We describe how to access the data and provide example Jupyter Notebooks in Python to aid first interactions with the data. We welcome feedback and questions about the data release via a GitHub repository.
Submitted 13 June, 2022; v1 submitted 12 January, 2021;
originally announced January 2021.
-
IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads
Authors:
Aymen Al Saadi,
Dario Alfe,
Yadu Babuji,
Agastya Bhati,
Ben Blaiszik,
Thomas Brettin,
Kyle Chard,
Ryan Chard,
Peter Coveney,
Anda Trifan,
Alex Brace,
Austin Clyde,
Ian Foster,
Tom Gibbs,
Shantenu Jha,
Kristopher Keipert,
Thorsten Kurth,
Dieter Kranzlmüller,
Hyungro Lee,
Zhuozhao Li,
Heng Ma,
Andre Merzky,
Gerald Mathias,
Alexander Partin,
Junqi Yin
, et al. (11 additional authors not shown)
Abstract:
The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol, accelerating the entire process. No single methodological approach can achieve the necessary accuracy with required efficiency. Here we describe multiple algorithmic innovations to overcome this fundamental limitation, and the development and deployment of computational infrastructure at scale that integrates multiple artificial intelligence and simulation-based approaches. Three measures of performance are: (i) throughput, the number of ligands per unit time; (ii) scientific performance, the number of effective ligands sampled per unit time; and (iii) peak performance, in flop/s. The capabilities outlined here have been used in production for several months as the workhorse of the computational infrastructure to support the capabilities of the US-DOE National Virtual Biotechnology Laboratory in combination with resources from the EU Centre of Excellence in Computational Biomedicine.
Submitted 13 October, 2020;
originally announced October 2020.
-
The LSST DESC DC2 Simulated Sky Survey
Authors:
LSST Dark Energy Science Collaboration,
Bela Abolfathi,
David Alonso,
Robert Armstrong,
Éric Aubourg,
Humna Awan,
Yadu N. Babuji,
Franz Erik Bauer,
Rachel Bean,
George Beckett,
Rahul Biswas,
Joanne R. Bogart,
Dominique Boutigny,
Kyle Chard,
James Chiang,
Chuck F. Claver,
Johann Cohen-Tanugi,
Céline Combet,
Andrew J. Connolly,
Scott F. Daniel,
Seth W. Digel,
Alex Drlica-Wagner,
Richard Dubois,
Emmanuel Gangler,
Eric Gawiser
, et al. (55 additional authors not shown)
Abstract:
We describe the simulated sky survey underlying the second data challenge (DC2) carried out in preparation for analysis of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) by the LSST Dark Energy Science Collaboration (LSST DESC). Significant connections across multiple science domains will be a hallmark of LSST; the DC2 program represents a unique modeling effort that stresses this interconnectivity in a way that has not been attempted before. This effort encompasses a full end-to-end approach: starting from a large N-body simulation, through setting up LSST-like observations including realistic cadences, through image simulations, and finally processing with Rubin's LSST Science Pipelines. This last step ensures that we generate data products resembling those to be delivered by the Rubin Observatory as closely as is currently possible. The simulated DC2 sky survey covers six optical bands in a wide-fast-deep (WFD) area of approximately 300 deg^2 as well as a deep drilling field (DDF) of approximately 1 deg^2. We simulate 5 years of the planned 10-year survey. The DC2 sky survey has multiple purposes. First, the LSST DESC working groups can use the dataset to develop a range of DESC analysis pipelines to prepare for the advent of actual data. Second, it serves as a realistic testbed for the image processing software under development for LSST by the Rubin Observatory. In particular, simulated data provide a controlled way to investigate certain image-level systematic effects. Finally, the DC2 sky survey enables the exploration of new scientific ideas in both static and time-domain cosmology.
Submitted 26 January, 2021; v1 submitted 12 October, 2020;
originally announced October 2020.
-
Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release
Authors:
Yadu Babuji,
Ben Blaiszik,
Tom Brettin,
Kyle Chard,
Ryan Chard,
Austin Clyde,
Ian Foster,
Zhi Hong,
Shantenu Jha,
Zhuozhao Li,
Xuefeng Liu,
Arvind Ramanathan,
Yi Ren,
Nicholaus Saint,
Marcus Schwarting,
Rick Stevens,
Hubertus van Dam,
Rick Wagner
Abstract:
Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.
Submitted 27 May, 2020;
originally announced June 2020.
-
funcX: A Federated Function Serving Fabric for Science
Authors:
Ryan Chard,
Yadu Babuji,
Zhuozhao Li,
Tyler Skluzacek,
Anna Woodard,
Ben Blaiszik,
Ian Foster,
Kyle Chard
Abstract:
Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components, that may in turn be executed separately and on the most suitable resources. To address these needs we present funcX---a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. funcX's endpoint software can transform existing clouds, clusters, and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints. We motivate the need for funcX with several scientific case studies, present our prototype design and implementation, show optimizations that deliver throughput in excess of 1 million functions per second, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than 130000 concurrent workers.
Submitted 7 May, 2020;
originally announced May 2020.
-
Serverless Supercomputing: High Performance Function as a Service for Science
Authors:
Ryan Chard,
Tyler J. Skluzacek,
Zhuozhao Li,
Yadu Babuji,
Anna Woodard,
Ben Blaiszik,
Steven Tuecke,
Ian Foster,
Kyle Chard
Abstract:
Growing data volumes and velocities are driving exciting new methods across the sciences in which data analytics and machine learning are increasingly intertwined with research. These new methods require new approaches for scientific computing in which computation is mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), or be offloaded to specialized accelerators. They also require new design approaches in which monolithic applications can be decomposed into smaller components, that may in turn be executed separately and on the most efficient resources. To address these needs we propose funcX---a high-performance function-as-a-service (FaaS) platform that enables intuitive, flexible, efficient, scalable, and performant remote function execution on existing infrastructure including clouds, clusters, and supercomputers. It allows users to register and then execute Python functions without regard for the physical resource location, scheduler architecture, or virtualization technology on which the function is executed---an approach we refer to as "serverless supercomputing." We motivate the need for funcX in science, describe our prototype implementation, and demonstrate, via experiments on two supercomputers, that funcX can process millions of functions across more than 65000 concurrent workers. We also outline five scientific scenarios in which funcX has been deployed and highlight the benefits of funcX in these scenarios.
Submitted 13 August, 2019;
originally announced August 2019.
-
Parsl: Pervasive Parallel Programming in Python
Authors:
Yadu Babuji,
Anna Woodard,
Zhuozhao Li,
Daniel S. Katz,
Ben Clifford,
Rohan Kumar,
Lukasz Lacinski,
Ryan Chard,
Justin M. Wozniak,
Ian Foster,
Michael Wilde,
Kyle Chard
Abstract:
High-level programming languages such as Python are increasingly used to provide intuitive interfaces to libraries written in lower-level languages and for assembling applications from various components. This migration towards orchestration rather than implementation, coupled with the growing need for parallel computing (e.g., due to big data and the end of Moore's law), necessitates rethinking how parallelism is expressed in programs. Here, we present Parsl, a parallel scripting library that augments Python with simple, scalable, and flexible constructs for encoding parallelism. These constructs allow Parsl to construct a dynamic dependency graph of components that it can then execute efficiently on one or many processors. Parsl is designed for scalability, with an extensible set of executors tailored to different use cases, such as low-latency, high-throughput, or extreme-scale execution. We show, via experiments on the Blue Waters supercomputer, that Parsl executors can allow Python scripts to execute components with as little as 5 ms of overhead, scale to more than 250 000 workers across more than 8000 nodes, and process upward of 1200 tasks per second. Other Parsl features simplify the construction and execution of composite programs by supporting elastic provisioning and scaling of infrastructure, fault-tolerant execution, and integrated wide-area data management. We show that these capabilities satisfy the needs of many-task, interactive, online, and machine learning applications in fields such as biology, cosmology, and materials science.
Submitted 17 May, 2019; v1 submitted 6 May, 2019;
originally announced May 2019.
-
DLHub: Model and Data Serving for Science
Authors:
Ryan Chard,
Zhuozhao Li,
Kyle Chard,
Logan Ward,
Yadu Babuji,
Anna Woodard,
Steve Tuecke,
Ben Blaiszik,
Michael J. Franklin,
Ian Foster
Abstract:
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
Submitted 27 November, 2018;
originally announced November 2018.
-
Enabling Interactive Analytics of Secure Data using Cloud Kotta
Authors:
Yadu N. Babuji,
Kyle Chard,
Eamon Duede
Abstract:
Research, especially in the social sciences and humanities, is increasingly reliant on the application of data science methods to analyze large amounts of (often private) data. Secure data enclaves provide a solution for managing and analyzing private data. However, such enclaves do not readily support discovery science---a form of exploratory or interactive analysis by which researchers execute a range of (sometimes large) analyses in an iterative and collaborative manner. The batch computing model offered by many data enclaves is well suited to executing large compute tasks; however it is far from ideal for day-to-day discovery science. As researchers must submit jobs to queues and wait for results, the high latencies inherent in queue-based, batch computing systems hinder interactive analysis. In this paper we describe how we have augmented the Cloud Kotta secure data enclave to support collaborative and interactive analysis of sensitive data. Our model uses Jupyter notebooks as a flexible analysis environment and Python language constructs to support the execution of arbitrary functions on private data within this secure framework.
Submitted 28 April, 2017;
originally announced May 2017.
-
Cloud Kotta: Enabling Secure and Scalable Data Analytics in the Cloud
Authors:
Yadu N. Babuji,
Kyle Chard,
Aaron Gerow,
Eamon Duede
Abstract:
Distributed communities of researchers rely increasingly on valuable, proprietary, or sensitive datasets. Given the growth of such data, especially in fields new to data-driven, computationally intensive research like the social sciences and humanities, coupled with what are often strict and complex data-use agreements, many research communities now require methods that allow secure, scalable and cost-effective storage and analysis. Here we present CLOUD KOTTA: a cloud-based data management and analytics framework. CLOUD KOTTA delivers an end-to-end solution for coordinating secure access to large datasets, and an execution model that provides both automated infrastructure scaling and support for executing analytics near to the data. CLOUD KOTTA implements a fine-grained security model ensuring that only authorized users may access, analyze, and download protected data. It also implements automated methods for acquiring and configuring low-cost storage and compute resources as they are needed. We present the architecture and implementation of CLOUD KOTTA and demonstrate the advantages it provides in terms of increased performance and flexibility. We show that CLOUD KOTTA's elastic provisioning model can reduce costs by up to 16x when compared with statically provisioned models.
Submitted 18 October, 2016; v1 submitted 10 October, 2016;
originally announced October 2016.
-
A Secure Data Enclave and Analytics Platform for Social Scientists
Authors:
Yadu N. Babuji,
Kyle Chard,
Aaron Gerow,
Eamon Duede
Abstract:
Data-driven research is increasingly ubiquitous and data itself is a defining asset for researchers, particularly in the computational social sciences and humanities. Entire careers and research communities are built around valuable, proprietary or sensitive datasets. However, many existing computation resources fail to support secure and cost-effective storage of data while also enabling secure and flexible analysis of the data. To address these needs we present CLOUD KOTTA, a cloud-based architecture for the secure management and analysis of social science data. CLOUD KOTTA leverages reliable, secure, and scalable cloud resources to deliver capabilities to users, and removes the need for users to manage complicated infrastructure. CLOUD KOTTA implements automated, cost-aware models for efficiently provisioning tiered storage and automatically scaled compute resources. CLOUD KOTTA has been used in production for several months and currently manages approximately 10TB of data and has been used to process more than 5TB of data with over 75,000 CPU hours. It has been used for a broad variety of text analysis workflows, matrix factorization, and various machine learning algorithms, and more broadly, it supports fast, secure and cost-effective research.
Submitted 10 October, 2016;
originally announced October 2016.
-
Evaluating Distributed Execution of Workloads
Authors:
Matteo Turilli,
Yadu Nand Babuji,
Andre Merzky,
Ming Tai Ha,
Michael Wilde,
Daniel S. Katz,
Shantenu Jha
Abstract:
Resource selection and task placement for distributed execution pose conceptual and implementation difficulties. Although resource selection and task placement are at the core of many tools and workflow systems, the methods are ad hoc rather than being based on models. Consequently, partial and non-interoperable implementations proliferate. We address both the conceptual and implementation difficulties by experimentally characterizing diverse modalities of resource selection and task placement. We compare the architectures and capabilities of two systems: the AIMES middleware and Swift workflow scripting language and runtime. We integrate these systems to enable the distributed execution of Swift workflows on Pilot-Jobs managed by the AIMES middleware. Our experiments characterize and compare alternative execution strategies by measuring the time to completion of heterogeneous uncoupled workloads executed at diverse scale and on multiple resources. We measure the adverse effects of pilot fragmentation and early binding of tasks to resources and the benefits of backfill scheduling across pilots on multiple resources. We then use this insight to execute a multi-stage workflow across five production-grade resources. We discuss the importance and implications for other tools and workflow systems.
Submitted 2 November, 2021; v1 submitted 31 May, 2016;
originally announced May 2016.