subscribe to arXiv mailings

DESC DC2 Data Release Note

Authors: LSST Dark Energy Science Collaboration, Bela Abolfathi, Robert Armstrong, Humna Awan, Yadu N. Babuji, Franz Erik Bauer, George Beckett, Rahul Biswas, Joanne R. Bogart, Dominique Boutigny, Kyle Chard, James Chiang, Johann Cohen-Tanugi, Andrew J. Connolly, Scott F. Daniel, Seth W. Digel, Alex Drlica-Wagner, Richard Dubois, Eric Gawiser, Thomas Glanzman, Salman Habib, Andrew P. Hearin, Katrin Heitmann, Fabio Hernandez, Renée Hložek , et al. (32 additional authors not shown)

Abstract: In preparation for cosmological analyses of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST), the LSST Dark Energy Science Collaboration (LSST DESC) has created a 300 deg$^2$ simulated survey as part of an effort called Data Challenge 2 (DC2). The DC2 simulated sky survey, in six optical bands with observations following a reference LSST observing cadence, was processed with th… ▽ More In preparation for cosmological analyses of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST), the LSST Dark Energy Science Collaboration (LSST DESC) has created a 300 deg$^2$ simulated survey as part of an effort called Data Challenge 2 (DC2). The DC2 simulated sky survey, in six optical bands with observations following a reference LSST observing cadence, was processed with the LSST Science Pipelines (19.0.0). In this Note, we describe the public data release of the resulting object catalogs for the coadded images of five years of simulated observations along with associated truth catalogs. We include a brief description of the major features of the available data sets. To enable convenient access to the data products, we have developed a web portal connected to Globus data services. We describe how to access the data and provide example Jupyter Notebooks in Python to aid first interactions with the data. We welcome feedback and questions about the data release via a GitHub repository. △ Less

Submitted 13 June, 2022; v1 submitted 12 January, 2021; originally announced January 2021.

Comments: 25 pages, 3 figures; 9 tables. A detailed changelog can be found in Appendix A. To obtain data, visit the DESC Data Portal at https://data.lsstdesc.org/

arXiv:2010.05926 [pdf, other]

doi 10.3847/1538-4365/abd62c

The LSST DESC DC2 Simulated Sky Survey

Authors: LSST Dark Energy Science Collaboration, Bela Abolfathi, David Alonso, Robert Armstrong, Éric Aubourg, Humna Awan, Yadu N. Babuji, Franz Erik Bauer, Rachel Bean, George Beckett, Rahul Biswas, Joanne R. Bogart, Dominique Boutigny, Kyle Chard, James Chiang, Chuck F. Claver, Johann Cohen-Tanugi, Céline Combet, Andrew J. Connolly, Scott F. Daniel, Seth W. Digel, Alex Drlica-Wagner, Richard Dubois, Emmanuel Gangler, Eric Gawiser , et al. (55 additional authors not shown)

Abstract: We describe the simulated sky survey underlying the second data challenge (DC2) carried out in preparation for analysis of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) by the LSST Dark Energy Science Collaboration (LSST DESC). Significant connections across multiple science domains will be a hallmark of LSST; the DC2 program represents a unique modeling effort that stresses… ▽ More We describe the simulated sky survey underlying the second data challenge (DC2) carried out in preparation for analysis of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) by the LSST Dark Energy Science Collaboration (LSST DESC). Significant connections across multiple science domains will be a hallmark of LSST; the DC2 program represents a unique modeling effort that stresses this interconnectivity in a way that has not been attempted before. This effort encompasses a full end-to-end approach: starting from a large N-body simulation, through setting up LSST-like observations including realistic cadences, through image simulations, and finally processing with Rubin's LSST Science Pipelines. This last step ensures that we generate data products resembling those to be delivered by the Rubin Observatory as closely as is currently possible. The simulated DC2 sky survey covers six optical bands in a wide-fast-deep (WFD) area of approximately 300 deg^2 as well as a deep drilling field (DDF) of approximately 1 deg^2. We simulate 5 years of the planned 10-year survey. The DC2 sky survey has multiple purposes. First, the LSST DESC working groups can use the dataset to develop a range of DESC analysis pipelines to prepare for the advent of actual data. Second, it serves as a realistic testbed for the image processing software under development for LSST by the Rubin Observatory. In particular, simulated data provide a controlled way to investigate certain image-level systematic effects. Finally, the DC2 sky survey enables the exploration of new scientific ideas in both static and time-domain cosmology. △ Less

Submitted 26 January, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: 39 pages, 19 figures, version accepted for publication in ApJS

arXiv:1705.00070 [pdf, other]

Enabling Interactive Analytics of Secure Data using Cloud Kotta

Authors: Yadu N. Babuji, Kyle Chard, Eamon Duede

Abstract: Research, especially in the social sciences and humanities, is increasingly reliant on the application of data science methods to analyze large amounts of (often private) data. Secure data enclaves provide a solution for managing and analyzing private data. However, such enclaves do not readily support discovery science---a form of exploratory or interactive analysis by which researchers execute a… ▽ More Research, especially in the social sciences and humanities, is increasingly reliant on the application of data science methods to analyze large amounts of (often private) data. Secure data enclaves provide a solution for managing and analyzing private data. However, such enclaves do not readily support discovery science---a form of exploratory or interactive analysis by which researchers execute a range of (sometimes large) analyses in an iterative and collaborative manner. The batch computing model offered by many data enclaves is well suited to executing large compute tasks; however it is far from ideal for day-to-day discovery science. As researchers must submit jobs to queues and wait for results, the high latencies inherent in queue-based, batch computing systems hinder interactive analysis. In this paper we describe how we have augmented the Cloud Kotta secure data enclave to support collaborative and interactive analysis of sensitive data. Our model uses Jupyter notebooks as a flexible analysis environment and Python language constructs to support the execution of arbitrary functions on private data within this secure framework. △ Less

Submitted 28 April, 2017; originally announced May 2017.

Comments: To appear in Proceedings of Workshop on Scientific Cloud Computing, Washington, DC USA, June 2017 (ScienceCloud 2017), 7 pages

arXiv:1610.03108 [pdf, other]

Cloud Kotta: Enabling Secure and Scalable Data Analytics in the Cloud

Authors: Yadu N. Babuji, Kyle Chard, Aaron Gerow, Eamon Duede

Abstract: Distributed communities of researchers rely increasingly on valuable, proprietary, or sensitive datasets. Given the growth of such data, especially in fields new to data-driven, computationally intensive research like the social sciences and humanities, coupled with what are often strict and complex data-use agreements, many research communities now require methods that allow secure, scalable and… ▽ More Distributed communities of researchers rely increasingly on valuable, proprietary, or sensitive datasets. Given the growth of such data, especially in fields new to data-driven, computationally intensive research like the social sciences and humanities, coupled with what are often strict and complex data-use agreements, many research communities now require methods that allow secure, scalable and cost-effective storage and analysis. Here we present CLOUD KOTTA: a cloud-based data management and analytics framework. CLOUD KOTTA delivers an end-to-end solution for coordinating secure access to large datasets, and an execution model that provides both automated infrastructure scaling and support for executing analytics near to the data. CLOUD KOTTA implements a fine-grained security model ensuring that only authorized users may access, analyze, and download protected data. It also implements automated methods for acquiring and configuring low-cost storage and compute resources as they are needed. We present the architecture and implementation of CLOUD KOTTA and demonstrate the advantages it provides in terms of increased performance and flexibility. We show that CLOUD KOTTA's elastic provisioning model can reduce costs by up to 16x when compared with statically provisioned models. △ Less

Submitted 18 October, 2016; v1 submitted 10 October, 2016; originally announced October 2016.

Comments: A version of this paper is forthcoming at BigData 2016

arXiv:1610.03105 [pdf, other]

A Secure Data Enclave and Analytics Platform for Social Scientists

Authors: Yadu N. Babuji, Kyle Chard, Aaron Gerow, Eamon Duede

Abstract: Data-driven research is increasingly ubiquitous and data itself is a defining asset for researchers, particularly in the computational social sciences and humanities. Entire careers and research communities are built around valuable, proprietary or sensitive datasets. However, many existing computation resources fail to support secure and cost-effective storage of data while also enabling secure a… ▽ More Data-driven research is increasingly ubiquitous and data itself is a defining asset for researchers, particularly in the computational social sciences and humanities. Entire careers and research communities are built around valuable, proprietary or sensitive datasets. However, many existing computation resources fail to support secure and cost-effective storage of data while also enabling secure and flexible analysis of the data. To address these needs we present CLOUD KOTTA, a cloud-based architecture for the secure management and analysis of social science data. CLOUD KOTTA leverages reliable, secure, and scalable cloud resources to deliver capabilities to users, and removes the need for users to manage complicated infrastructure. CLOUD KOTTA implements automated, cost-aware models for efficiently provisioning tiered storage and automatically scaled compute resources. CLOUD KOTTA has been used in production for several months and currently manages approximately 10TB of data and has been used to process more than 5TB of data with over 75,000 CPU hours. It has been used for a broad variety of text analysis workflows, matrix factorization, and various machine learning algorithms, and more broadly, it supports fast, secure and cost-effective research. △ Less

Submitted 10 October, 2016; originally announced October 2016.

Comments: Forthcoming eScience 2016

arXiv:1605.09513 [pdf, other]

doi 10.1109/eScience.2017.41

Evaluating Distributed Execution of Workloads

Authors: Matteo Turilli, Yadu Nand Babuji, Andre Merzky, Ming Tai Ha, Michael Wilde, Daniel S. Katz, Shantenu Jha

Abstract: Resource selection and task placement for distributed execution poses conceptual and implementation difficulties. Although resource selection and task placement are at the core of many tools and workflow systems, the methods are ad hoc rather than being based on models. Consequently, partial and non-interoperable implementations proliferate. We address both the conceptual and implementation diffic… ▽ More Resource selection and task placement for distributed execution poses conceptual and implementation difficulties. Although resource selection and task placement are at the core of many tools and workflow systems, the methods are ad hoc rather than being based on models. Consequently, partial and non-interoperable implementations proliferate. We address both the conceptual and implementation difficulties by experimentally characterizing diverse modalities of resource selection and task placement. We compare the architectures and capabilities of two systems: the AIMES middleware and Swift workflow scripting language and runtime. We integrate these systems to enable the distributed execution of Swift workflows on Pilot-Jobs managed by the AIMES middleware. Our experiments characterize and compare alternative execution strategies by measuring the time to completion of heterogeneous uncoupled workloads executed at diverse scale and on multiple resources. We measure the adverse effects of pilot fragmentation and early binding of tasks to resources and the benefits of backfill scheduling across pilots on multiple resources. We then use this insight to execute a multi-stage workflow across five production-grade resources. We discuss the importance and implications for other tools and workflow systems. △ Less

Submitted 2 November, 2021; v1 submitted 31 May, 2016; originally announced May 2016.

Showing 1–6 of 6 results for author: Babuji, Y N