A High-Quality Workflow for Multi-Resolution Scientific Data Reduction and Visualization

Authors: Daoce Wang, Pascal Grosset, Jesus Pulido, Tushar M. Athawale, Jiannan Tian, Kai Zhao, Zarija Lukić, Axel Huebl, Zhe Wang, James Ahrens, Dingwen Tao

Abstract: Multi-resolution methods such as Adaptive Mesh Refinement (AMR) can enhance storage efficiency for HPC applications generating vast volumes of data. However, their applicability is limited and cannot be universally deployed across all applications. Furthermore, integrating lossy compression with multi-resolution techniques to further boost storage efficiency encounters significant barriers. To this end, we introduce an innovative workflow that facilitates high-quality multi-resolution data compression for both uniform and AMR simulations. Initially, to extend the usability of multi-resolution techniques, our workflow employs a compression-oriented Region of Interest (ROI) extraction method, transforming uniform data into a multi-resolution format. Subsequently, to bridge the gap between multi-resolution techniques and lossy compressors, we optimize three distinct compressors, ensuring their optimal performance on multi-resolution data. Lastly, we incorporate an advanced uncertainty visualization method into our workflow to understand the potential impacts of lossy compression. Experimental evaluation demonstrates that our workflow achieves significant compression quality improvements.

Submitted 11 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

Comments: accepted by SC '24

arXiv:2307.09609 [pdf, other]

AMRIC: A Novel In Situ Lossy Compression Framework for Efficient I/O in Adaptive Mesh Refinement Applications

Authors: Daoce Wang, Jesus Pulido, Pascal Grosset, Jiannan Tian, Sian Jin, Houjun Tang, Jean Sexton, Sheng Di, Zarija Lukić, Kai Zhao, Bo Fang, Franck Cappello, James Ahrens, Dingwen Tao

Abstract: As supercomputers advance towards exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission experiences exponential growth. Adaptive Mesh Refinement (AMR) has emerged as an effective solution to address these two challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to tackle the latter issue. Despite their respective advantages, few attempts have been made to investigate how AMR and error-bounded lossy compression can function together. To this end, this study presents a novel in-situ lossy compression framework that employs the HDF5 filter to improve both I/O costs and boost compression quality for AMR applications. We implement our solution into the AMReX framework and evaluate on two real-world AMR applications, Nyx and WarpX, on the Summit supercomputer. Experiments with 4096 CPU cores demonstrate that AMRIC improves the compression ratio by up to 81X and the I/O performance by up to 39X over AMReX's original compression solution.

Submitted 13 July, 2023; originally announced July 2023.

Comments: 12 pages, 18 figures, 3 tables, accepted by ACM/IEEE SC '23

arXiv:2306.14070 [pdf, other]

SuperBench: A Super-Resolution Benchmark Dataset for Scientific Machine Learning

Authors: Pu Ren, N. Benjamin Erichson, Shashank Subramanian, Omer San, Zarija Lukic, Michael W. Mahoney

Abstract: Super-Resolution (SR) techniques aim to enhance data resolution, enabling the retrieval of finer details, and improving the overall quality and fidelity of the data representation. There is growing interest in applying SR methods to complex spatiotemporal systems within the Scientific Machine Learning (SciML) community, with the hope of accelerating numerical simulations and/or improving forecasts in weather, climate, and related areas. However, the lack of standardized benchmark datasets for comparing and validating SR methods hinders progress and adoption in SciML. To address this, we introduce SuperBench, the first benchmark dataset featuring high-resolution datasets (up to $2048\times2048$ dimensions), including data from fluid flows, cosmology, and weather. Here, we focus on validating spatial SR performance from data-centric and physics-preserved perspectives, as well as assessing robustness to data degradation tasks. While deep learning-based SR methods (developed in the computer vision community) excel on certain tasks, despite relatively limited prior physics information, we identify limitations of these methods in accurately capturing intricate fine-scale features and preserving fundamental physical properties and constraints in scientific data. These shortcomings highlight the importance and subtlety of incorporating domain knowledge into ML models. We anticipate that SuperBench will significantly advance SR methods for scientific tasks.

Submitted 24 June, 2023; originally announced June 2023.

arXiv:2206.14761 [pdf, other]

Accelerating Parallel Write via Deeply Integrating Predictive Lossy Compression with HDF5

Authors: Sian Jin, Dingwen Tao, Houjun Tang, Sheng Di, Suren Byna, Zarija Lukic, Franck Cappello

Abstract: Lossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel write due to the lack of deep understanding on compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to significantly improve the parallel-write performance. Specifically, we propose analytical models to predict the time of compression and parallel write before the actual compression to enable compression-write overlapping. We also introduce an extra space in the process to handle possible data overflows resulting from prediction uncertainty in compression ratios. Moreover, we propose an optimization to reorder the compression tasks to increase the overlapping efficiency. Experiments with up to 4,096 cores from Summit show that our solution improves the write performance by up to 4.5X and 2.9X over the non-compression and lossy compression solutions, respectively, with only 1.5% storage overhead (compared to original data) on two real-world HPC applications.

Submitted 29 June, 2022; originally announced June 2022.

Comments: 13 pages, 18 figures, accepted by ACM/IEEE SC'22

arXiv:2110.13151 [pdf, other]

Self-supervised similarity search for large scientific datasets

Authors: George Stein, Peter Harrington, Jacqueline Blaum, Tomislav Medan, Zarija Lukic

Abstract: We present the use of self-supervised learning to explore and exploit large unlabeled datasets. Focusing on 42 million galaxy images from the latest data release of the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys, we first train a self-supervised model to distill low-dimensional representations that are robust to symmetries, uncertainties, and noise in each image. We then use the representations to construct and publicly release an interactive semantic similarity search tool. We demonstrate how our tool can be used to rapidly discover rare objects given only a single example, increase the speed of crowd-sourcing campaigns, and construct and improve training sets for supervised applications. While we focus on images from sky surveys, the technique is straightforward to apply to any scientific dataset of any dimensionality. The similarity search web app can be found at https://github.com/georgestein/galaxy_search

Submitted 30 November, 2021; v1 submitted 25 October, 2021; originally announced October 2021.

Comments: 5 pages, 2 figures. The similarity search web app can be found at https://github.com/georgestein/galaxy_search. Accepted to the Fourth Workshop on Machine Learning and the Physical Sciences (NeurIPS 2021). ArXiv admin note: text overlap with arXiv:2110.00023

arXiv:2110.00023 [pdf, other]

doi 10.3847/1538-4357/ac6d63

Mining for Strong Gravitational Lenses with Self-supervised Learning

Authors: George Stein, Jacqueline Blaum, Peter Harrington, Tomislav Medan, Zarija Lukic

Abstract: We employ self-supervised representation learning to distill information from 76 million galaxy images from the Dark Energy Spectroscopic Instrument Legacy Imaging Surveys' Data Release 9. Targeting the identification of new strong gravitational lens candidates, we first create a rapid similarity search tool to discover new strong lenses given only a single labelled example. We then show how training a simple linear classifier on the self-supervised representations, requiring only a few minutes on a CPU, can automatically classify strong lenses with great efficiency. We present 1192 new strong lens candidates that we identified through a brief visual identification campaign, and release an interactive web-based similarity search tool and the top network predictions to facilitate crowd-sourcing rapid discovery of additional strong gravitational lenses and other rare objects: https://github.com/georgestein/ssl-legacysurvey.

Submitted 21 June, 2022; v1 submitted 30 September, 2021; originally announced October 2021.

Comments: 24 Pages, 15 figures, published in ApJ, data at github.com/georgestein/ssl-legacysurvey

Journal ref: The Astrophysical Journal, Volume 932, Number 2, 2022

arXiv:2106.12662 [pdf, other]

doi 10.3847/1538-4357/ac5faa

Fast, high-fidelity Lyman $α$ forests with convolutional neural networks

Authors: Peter Harrington, Mustafa Mustafa, Max Dornfest, Benjamin Horowitz, Zarija Lukić

Abstract: Full-physics cosmological simulations are powerful tools for studying the formation and evolution of structure in the universe but require extreme computational resources. Here, we train a convolutional neural network to use a cheaper N-body-only simulation to reconstruct the baryon hydrodynamic variables (density, temperature, and velocity) on scales relevant to the Lyman-$α$ (Ly$α$) forest, using data from Nyx simulations. We show that our method enables rapid estimation of these fields at a resolution of $\sim$20kpc, and captures the statistics of the Ly$α$ forest with much greater accuracy than existing approximations. Because our model is fully-convolutional, we can train on smaller simulation boxes and deploy on much larger ones, enabling substantial computational savings. Furthermore, as our method produces an approximation for the hydrodynamic fields instead of Ly$α$ flux directly, it is not limited to a particular choice of ionizing background or mean transmitted flux.

Submitted 23 June, 2021; originally announced June 2021.

Comments: 10 pages, 6 figures

arXiv:2101.04293 [pdf, other]

Estimating Galactic Distances From Images Using Self-supervised Representation Learning

Authors: Md Abul Hayat, Peter Harrington, George Stein, Zarija Lukić, Mustafa Mustafa

Abstract: We use a contrastive self-supervised learning framework to estimate distances to galaxies from their photometric images. We incorporate data augmentations from computer vision as well as an application-specific augmentation accounting for galactic dust. We find that the resulting visual representations of galaxy images are semantically useful and allow for fast similarity searches, and can be successfully fine-tuned for the task of redshift estimation. We show that (1) pretraining on a large corpus of unlabeled data followed by fine-tuning on some labels can attain the accuracy of a fully-supervised model which requires 2-4x more labeled data, and (2) that by fine-tuning our self-supervised representations using all available data labels in the Main Galaxy Sample of the Sloan Digital Sky Survey (SDSS), we outperform the state-of-the-art supervised learning method.

Submitted 11 January, 2021; originally announced January 2021.

arXiv:2012.13083 [pdf, other]

doi 10.3847/2041-8213/abf2c7

Self-Supervised Representation Learning for Astronomical Images

Authors: Md Abul Hayat, George Stein, Peter Harrington, Zarija Lukić, Mustafa Mustafa

Abstract: Sky surveys are the largest data generators in astronomy, making automated tools for extracting meaningful scientific information an absolute necessity. We show that, without the need for labels, self-supervised learning recovers representations of sky survey images that are semantically useful for a variety of scientific tasks. These representations can be directly used as features, or fine-tuned, to outperform supervised methods trained only on labeled data. We apply a contrastive learning framework on multi-band galaxy photometry from the Sloan Digital Sky Survey (SDSS) to learn image representations. We then use them for galaxy morphology classification, and fine-tune them for photometric redshift estimation, using labels from the Galaxy Zoo 2 dataset and SDSS spectroscopy. In both downstream tasks, using the same learned representations, we outperform the supervised state-of-the-art results, and we show that our approach can achieve the accuracy of supervised models while using 2-4 times fewer labels for training.

Submitted 8 April, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

Comments: The codes, trained models, and data can be found at https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss

Journal ref: The Astrophysical Journal Letters, Volume 911 (2021), Number 2, Letter 33

arXiv:1706.02390 [pdf, other]

doi 10.1186/s40668-019-0029-9

CosmoGAN: creating high-fidelity weak lensing convergence maps using Generative Adversarial Networks

Authors: Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Zarija Lukić, Rami Al-Rfou, Jan M. Kratochvil

Abstract: Inferring model parameters from experimental data is a grand challenge in many sciences, including cosmology. This often relies critically on high fidelity numerical simulations, which are prohibitively computationally expensive. The application of deep learning techniques to generative modeling is renewing interest in using high dimensional density estimators as computationally inexpensive emulators of fully-fledged simulations. These generative models have the potential to make a dramatic shift in the field of scientific simulations, but for that shift to happen we need to study the performance of such generators in the precision regime needed for science applications. To this end, in this work we apply Generative Adversarial Networks to the problem of generating weak lensing convergence maps. We show that our generator network produces maps that are described by, with high statistical confidence, the same summary statistics as the fully simulated maps.

Submitted 22 May, 2019; v1 submitted 7 June, 2017; originally announced June 2017.

Comments: 11 pages, 8 figures

Journal ref: Computational Astrophysics and Cosmology: Simulations, Data Analysis and Algorithms 2019 6:1

arXiv:1211.4864 [pdf, other]

The Universe at Extreme Scale: Multi-Petaflop Sky Simulation on the BG/Q

Authors: Salman Habib, Vitali Morozov, Hal Finkel, Adrian Pope, Katrin Heitmann, Kalyan Kumaran, Tom Peterka, Joe Insley, David Daniel, Patricia Fasel, Nicholas Frontiere, Zarija Lukic

Abstract: Remarkable observational advances have established a compelling cross-validated model of the Universe. Yet, two key pillars of this model -- dark matter and dark energy -- remain mysterious. Sky surveys that map billions of galaxies to explore the `Dark Universe', demand a corresponding extreme-scale simulation capability; the HACC (Hybrid/Hardware Accelerated Cosmology Code) framework has been designed to deliver this level of performance now, and into the future. With its novel algorithmic structure, HACC allows flexible tuning across diverse architectures, including accelerated and multi-core systems. On the IBM BG/Q, HACC attains unprecedented scalable performance -- currently 13.94 PFlops at 69.2% of peak and 90% parallel efficiency on 1,572,864 cores with an equal number of MPI ranks, and a concurrency of 6.3 million. This level of performance was achieved at extreme problem sizes, including a benchmark run with more than 3.6 trillion particles, significantly larger than any cosmological simulation yet performed.

Submitted 19 November, 2012; originally announced November 2012.

Comments: 11 pages, 11 figures, final version of paper for talk presented at SC12

Showing 1–11 of 11 results for author: Lukic, Z