Showing 1–46 of 46 results for author: Cappello, F

  1. Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

    Authors: Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

    Abstract: Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes in the order of weeks or months. Thanks to 3D model parallelism (data, pipeline, and tensor-level parallelism), the training can scale to a large number of GPUs, which redu…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: Accepted at FlexScience'24 Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures (co-located with HPDC'24)

  2. DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

    Authors: Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

    Abstract: LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Th…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: Published at HPDC '24: The 33rd International Symposium on High-Performance Parallel and Distributed Computing. Source code at https://github.com/DataStates/datastates-llm

  3. arXiv:2405.02520  [pdf, other]

    cs.DC

    TurboFFT: A High-Performance Fast Fourier Transform with Fault Tolerance on GPU

    Authors: Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Huangliang Dai, Sheng Di, Zizhong Chen, Franck Cappello

    Abstract: The Fast Fourier Transform (FFT), as a core computation in a wide range of scientific applications, is increasingly threatened by reliability issues. In this paper, we introduce TurboFFT, a high-performance FFT implementation equipped with a two-sided checksum scheme that detects and corrects silent data corruptions at computing units efficiently. The proposed two-sided checksum addresses the erro…

    Submitted 3 May, 2024; originally announced May 2024.

  4. arXiv:2404.02840  [pdf, ps, other]

    cs.DC

    A Survey on Error-Bounded Lossy Compression for Scientific Datasets

    Authors: Sheng Di, Jinyang Liu, Kai Zhao, Xin Liang, Robert Underwood, Zhaorui Zhang, Milan Shah, Yafan Huang, Jiajun Huang, Xiaodong Yu, Congrong Ren, Hanqi Guo, Grant Wilkins, Dingwen Tao, Jiannan Tian, Sian Jin, Zizhe Jian, Daoce Wang, MD Hasanur Rahman, Boyuan Zhang, Jon C. Calhoun, Guanpeng Li, Kazutomo Yoshii, Khalid Ayed Alharthi, Franck Cappello

    Abstract: Error-bounded lossy compression has been effective in significantly reducing the data storage/transfer burden while preserving the reconstructed data fidelity very well. Many error-bounded lossy compressors have been developed for a wide range of parallel and distributed use cases for years. These lossy compressors are designed with distinct compression models and design principles, such that each…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: submitted to ACM Computing journal, required to be 35 pages including references

  5. arXiv:2403.15953  [pdf, other]

    cs.LG cs.AI

    Understanding The Effectiveness of Lossy Compression in Machine Learning Training Sets

    Authors: Robert Underwood, Jon C. Calhoun, Sheng Di, Franck Cappello

    Abstract: Machine Learning and Artificial Intelligence (ML/AI) techniques have become increasingly prevalent in high performance computing (HPC). However, these methods depend on vast volumes of floating point data for training and validation which need methods to share the data on a wide area network (WAN) or to transfer it from edge devices to data centers. Data compression can be a solution to these problems, bu…

    Submitted 23 March, 2024; originally announced March 2024.

    Comments: 12 pages, 4 figures

    ACM Class: I.2.6; E.2; C.4

  6. arXiv:2312.13461  [pdf, other]

    cs.DC

    FedSZ: Leveraging Error-Bounded Lossy Compression for Federated Learning Communications

    Authors: Grant Wilkins, Sheng Di, Jon C. Calhoun, Zilinghan Li, Kibaek Kim, Robert Underwood, Richard Mortier, Franck Cappello

    Abstract: With the promise of federated learning (FL) to allow for geographically-distributed and highly personalized services, the efficient exchange of model updates between clients and servers becomes crucial. FL, though decentralized, often faces communication bottlenecks, especially in resource-constrained scenarios. Existing data compression techniques like gradient sparsification, quantization, and p…

    Submitted 24 April, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: Appearing at 44th IEEE International Conference on Distributed Computing Systems (ICDCS)

  7. arXiv:2312.05492  [pdf, other]

    cs.DC

    cuSZ-$i$: High-Ratio Scientific Lossy Compression on GPUs with Optimized Multi-Level Interpolation

    Authors: Jinyang Liu, Jiannan Tian, Shixun Wu, Sheng Di, Boyuan Zhang, Robert Underwood, Yafan Huang, Jiajun Huang, Kai Zhao, Guanpeng Li, Dingwen Tao, Zizhong Chen, Franck Cappello

    Abstract: Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. Compared to CPU-based compressors, GPU-based compressors exhibit substantially higher throughputs, fitting better for today's HPC applications. However, the critical limitations of existing GPU-based compressors are their low compression ratios and qualities, severely restricting their appli…

    Submitted 11 July, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

    Comments: accepted by SC '24

  8. arXiv:2311.12133  [pdf, other]

    cs.DC cs.CE cs.DB

    High-performance Effective Scientific Error-bounded Lossy Compression with Auto-tuned Multi-component Interpolation

    Authors: Jinyang Liu, Sheng Di, Kai Zhao, Xin Liang, Sian Jin, Zizhe Jian, Jiajun Huang, Shixun Wu, Zizhong Chen, Franck Cappello

    Abstract: Error-bounded lossy compression has been identified as a promising solution for significantly reducing scientific data volumes upon users' requirements on data distortion. For the existing scientific error-bounded lossy compressors, some of them (such as SPERR and FAZ) can reach fairly high compression ratios and some others (such as SZx, SZ, and ZFP) feature high compression speeds, but they rare…

    Submitted 13 December, 2023; v1 submitted 20 November, 2023; originally announced November 2023.

  9. arXiv:2310.14133  [pdf, other]

    cs.DC

    Dynamic Quality Metric Oriented Error-bounded Lossy Compression for Scientific Datasets

    Authors: Jinyang Liu, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, Franck Cappello

    Abstract: With the ever-increasing execution scale of high performance computing (HPC) applications, vast amounts of data are being produced by scientific research every day. Error-bounded lossy compression has been considered a very promising solution to address the big-data issue for scientific applications because it can significantly reduce the data volume with low time cost meanwhile allowing users to…

    Submitted 21 October, 2023; originally announced October 2023.

  10. arXiv:2309.04037  [pdf, other]

    cs.LG cs.DC cs.IT

    SRN-SZ: Deep Leaning-Based Scientific Error-bounded Lossy Compression with Super-resolution Neural Networks

    Authors: Jinyang Liu, Sheng Di, Sian Jin, Kai Zhao, Xin Liang, Zizhong Chen, Franck Cappello

    Abstract: The fast growth of computational power and scales of modern super-computing systems have raised great challenges for the management of exascale scientific data. To maintain the usability of scientific data, error-bound lossy compression is proposed and developed as an essential technique for the size reduction of scientific data with constrained data distortion. Among the diverse datasets generate…

    Submitted 6 November, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

  11. arXiv:2308.05199  [pdf, other]

    cs.DC

    gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

    Authors: Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

    Abstract: GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. In order to address these issues, in this paper, we propose…

    Submitted 6 May, 2024; v1 submitted 9 August, 2023; originally announced August 2023.

    Comments: 12 pages, 13 figures, and 2 tables. ICS '24

  12. arXiv:2307.09609  [pdf, other]

    cs.DC

    AMRIC: A Novel In Situ Lossy Compression Framework for Efficient I/O in Adaptive Mesh Refinement Applications

    Authors: Daoce Wang, Jesus Pulido, Pascal Grosset, Jiannan Tian, Sian Jin, Houjun Tang, Jean Sexton, Sheng Di, Zarija Lukić, Kai Zhao, Bo Fang, Franck Cappello, James Ahrens, Dingwen Tao

    Abstract: As supercomputers advance towards exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission experiences exponential growth. Adaptive Mesh Refinement (AMR) has emerged as an effective solution to address these two challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: 12 pages, 18 figures, 3 tables, accepted by ACM/IEEE SC '23

  13. arXiv:2307.05416  [pdf, other]

    cs.DC cs.DB

    Optimizing Scientific Data Transfer on Globus with Error-bounded Lossy Compression

    Authors: Yuanjian Liu, Sheng Di, Kyle Chard, Ian Foster, Franck Cappello

    Abstract: The increasing volume and velocity of science data necessitate the frequent movement of enormous data volumes as part of routine research activities. As a result, limited wide-area bandwidth often leads to bottlenecks in research progress. However, in many cases, consuming applications (e.g., for analysis, visualization, and machine learning) can achieve acceptable performance on reduced-precision…

    Submitted 11 July, 2023; originally announced July 2023.

  14. arXiv:2305.08801  [pdf, other]

    cs.DC cs.IT

    Black-Box Statistical Prediction of Lossy Compression Ratios for Scientific Data

    Authors: Robert Underwood, Julie Bessac, David Krasowska, Jon C. Calhoun, Sheng Di, Franck Cappello

    Abstract: Lossy compressors are increasingly adopted in scientific research, tackling volumes of data from experiments or parallel numerical simulations and facilitating data storage and movement. In contrast with the notion of entropy in lossless compression, no theoretical or data-based quantification of lossy compressibility exists for scientific data. Users rely on trial and error to assess lossy compre…

    Submitted 15 May, 2023; originally announced May 2023.

    Comments: 16 pages, 10 figures

  15. FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs

    Authors: Boyuan Zhang, Jiannan Tian, Sheng Di, Xiaodong Yu, Yunhe Feng, Xin Liang, Dingwen Tao, Franck Cappello

    Abstract: Today's large-scale scientific applications running on high-performance computing (HPC) systems generate vast data volumes. Thus, data compression is becoming a critical technique to mitigate the storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and throughput simultaneously, hindering their adoption in many appli…

    Submitted 2 May, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: 14 pages, 12 figures, accepted by ACM HPDC '23

  16. GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

    Authors: Boyuan Zhang, Jiannan Tian, Sheng Di, Xiaodong Yu, Martin Swany, Dingwen Tao, Franck Cappello

    Abstract: Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput d…

    Submitted 2 May, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

    Comments: 12 pages, 9 figures, 3 tables, accepted by ACM ICS '23

  17. arXiv:2304.03890  [pdf, other]

    cs.DC

    An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression

    Authors: Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

    Abstract: With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communications turns out to be a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade the overall parallel performance. To address th…

    Submitted 17 January, 2024; v1 submitted 7 April, 2023; originally announced April 2023.

    Comments: 13 pages, 18 figures, 6 tables, IPDPS '24

  18. arXiv:2206.14761  [pdf, other]

    cs.DC cs.PF

    Accelerating Parallel Write via Deeply Integrating Predictive Lossy Compression with HDF5

    Authors: Sian Jin, Dingwen Tao, Houjun Tang, Sheng Di, Suren Byna, Zarija Lukic, Franck Cappello

    Abstract: Lossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel write due to the lack of deep understanding on compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: 13 pages, 18 figures, accepted by ACM/IEEE SC'22

  19. arXiv:2206.11297  [pdf, other]

    cs.DC

    ROIBIN-SZ: Fast and Science-Preserving Compression for Serial Crystallography

    Authors: Robert Underwood, Chun Yoon, Ali Gok, Sheng Di, Franck Cappello

    Abstract: Crystallography is the leading technique to study atomic structures of proteins and produces enormous volumes of information that can place strains on the storage and data transfer capabilities of synchrotron and free-electron laser light sources. Lossy compression has been identified as a possible means to cope with the growing data volumes; however, prior approaches have not produced sufficient…

    Submitted 22 June, 2022; originally announced June 2022.

    Comments: 12 pages, 8 figures

    ACM Class: J.2; D.1.3; E.4

  20. arXiv:2201.13020  [pdf, ps, other]

    cs.DC

    SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

    Authors: Xiaodong Yu, Sheng Di, Kai Zhao, Jiannan Tian, Dingwen Tao, Xin Liang, Franck Cappello

    Abstract: Today's scientific high performance computing (HPC) applications or advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in scientific community, because not only can it significantly reduce the data volumes but it can also strictly c…

    Submitted 31 January, 2022; originally announced January 2022.

  21. arXiv:2201.09118  [pdf, other]

    cs.DC

    Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

    Authors: Cody Rivera, Sheng Di, Jiannan Tian, Xiaodong Yu, Dingwen Tao, Franck Cappello

    Abstract: More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ de…

    Submitted 9 March, 2022; v1 submitted 22 January, 2022; originally announced January 2022.

    Comments: 11 pages, 5 figures, 5 tables, accepted by IEEE IPDPS'22

  22. arXiv:2201.04614  [pdf, other]

    cs.DC

    SIMD Lossy Compression for Scientific Data

    Authors: Griffin Dube, Jiannan Tian, Sheng Di, Dingwen Tao, Jon Calhoun, Franck Cappello

    Abstract: Modern HPC applications produce increasingly large amounts of data, which limits the performance of current extreme-scale systems. Data reduction techniques, such as lossy compression, help to mitigate this issue by decreasing the size of data generated by these applications. SZ, a current state-of-the-art lossy compressor, is able to achieve high compression ratios, but the prediction/quantizatio…

    Submitted 12 January, 2022; originally announced January 2022.

  23. arXiv:2112.02289  [pdf, other]

    cs.DC

    Towards Aggregated Asynchronous Checkpointing

    Authors: Mikaila J. Gossman, Bogdan Nicolae, Jon C. Calhoun, Franck Cappello, Melissa C. Smith

    Abstract: High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application scientists for their ability to leverage fast node-local storage and flush independently to stable, external storage (e.g., parallel file systems) in the background…

    Submitted 4 December, 2021; originally announced December 2021.

    Comments: Accepted submission to the SuperCheck Workshop at the SuperComputing Conference held in St. Louis, MO, November 14-19, 2021 (SC'21)

  24. arXiv:2111.09815  [pdf, other]

    cs.DB cs.DC

    Improving Prediction-Based Lossy Compression Dramatically via Ratio-Quality Modeling

    Authors: Sian Jin, Sheng Di, Jiannan Tian, Suren Byna, Dingwen Tao, Franck Cappello

    Abstract: Error-bounded lossy compression is one of the most effective techniques for scientific data reduction. However, the traditional trial-and-error approach used to configure lossy compressors for finding the optimal trade-off between reconstructed data quality and compression ratio is prohibitively expensive. To resolve this issue, we develop a general-purpose analytical ratio-quality model based on…

    Submitted 5 May, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

    Comments: 14 pages, 14 figures, published by IEEE ICDE 2022

  25. arXiv:2111.02925  [pdf, other]

    cs.DC

    SZ3: A Modular Framework for Composing Prediction-Based Error-Bounded Lossy Compressors

    Authors: Xin Liang, Kai Zhao, Sheng Di, Sihuan Li, Robert Underwood, Ali M. Gok, Jiannan Tian, Junjing Deng, Jon C. Calhoun, Dingwen Tao, Zizhong Chen, Franck Cappello

    Abstract: Today's scientific simulations require a significant reduction of data volume because of extremely large amounts of data they produce and the limited I/O bandwidth and storage space. Error-bounded lossy compressor has been considered one of the most effective solutions to the above problem. In practice, however, the best-fit compression method often needs to be customized/optimized in particular b…

    Submitted 11 November, 2021; v1 submitted 4 November, 2021; originally announced November 2021.

    Comments: 13 pages

  26. arXiv:2105.12912  [pdf, other]

    cs.DC

    Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs

    Authors: Jiannan Tian, Sheng Di, Xiaodong Yu, Cody Rivera, Kai Zhao, Sian Jin, Yunhe Feng, Xin Liang, Dingwen Tao, Franck Cappello

    Abstract: Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. With ever-emerging heterogeneous high-performance computing (HPC) architecture, GPU-accelerated error-bounded compressors (such as cuSZ and cuZFP) have been developed. However, they suffer from either low performance or low compression ratios. To this end, we propose cuSZ+ to target both hi…

    Submitted 3 September, 2021; v1 submitted 26 May, 2021; originally announced May 2021.

    Comments: 12 pages, 3 figures, 7 tables, accepted by IEEE Cluster'21

  27. arXiv:2105.11730  [pdf, other]

    cs.LG cs.AI cs.DC

    Exploring Autoencoder-based Error-bounded Compression for Scientific Data

    Authors: Jinyang Liu, Sheng Di, Kai Zhao, Sian Jin, Dingwen Tao, Xin Liang, Zizhong Chen, Franck Cappello

    Abstract: Error-bounded lossy compression is becoming an indispensable technique for the success of today's scientific projects with vast volumes of data produced during simulations or instrument data acquisitions. Not only can it significantly reduce data size, but it also can control the compression errors based on user-specified error bounds. Autoencoder (AE) models have been widely used in image compres…

    Submitted 21 October, 2023; v1 submitted 25 May, 2021; originally announced May 2021.

  28. arXiv:2103.02131  [pdf, ps, other]

    cs.DC

    VELOC: VEry Low Overhead Checkpointing in the Age of Exascale

    Authors: Bogdan Nicolae, Adam Moody, Gregory Kosinovsky, Kathryn Mohror, Franck Cappello

    Abstract: Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications. However, such a pattern frequently leads to I/O bottlenecks that lead to poor scalability and performance. As modern HPC infrastructures continue to evolve, there is a growing gap between compute capacity vs. I/O capabilities. Furthermore, the storage hierarchy is becoming i…

    Submitted 2 March, 2021; originally announced March 2021.

    Journal ref: SuperCheck'21: First International Symposium on Checkpointing for Supercomputing, 2021

  29. SDRBench: Scientific Data Reduction Benchmark for Lossy Compressors

    Authors: Kai Zhao, Sheng Di, Xin Liang, Sihuan Li, Dingwen Tao, Julie Bessac, Zizhong Chen, Franck Cappello

    Abstract: Efficient error-controlled lossy compressors are becoming critical to the success of today's large-scale scientific applications because of the ever-increasing volume of data produced by the applications. In the past decade, many lossless and lossy compressors have been developed with distinct design principles for different scientific datasets in largely diverse scientific domains. In order to su…

    Submitted 8 January, 2021; originally announced January 2021.

    Comments: Published in Proceedings of the 1st International Workshop on Big Data Reduction @BigData'20

    Journal ref: 2020 IEEE International Conference on Big Data (Big Data)

  30. arXiv:2010.10039  [pdf, other]

    cs.DC

    Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

    Authors: Jiannan Tian, Cody Rivera, Sheng Di, Jieyang Chen, Xin Liang, Dingwen Tao, Franck Cappello

    Abstract: Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient Entropy coding algorithm in information theory, such that it could be f…

    Submitted 1 March, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: 11 pages, 3 figures, 6 tables, published by IEEE IPDPS'21

  31. arXiv:2010.03144  [pdf, other]

    cs.DC

    SDC Resilient Error-bounded Lossy Compressor

    Authors: Sihuan Li, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, Franck Cappello

    Abstract: Lossy compression is one of the most important strategies to resolve the big science data issue, however, little work was done to make it resilient against silent data corruptions (SDC). In fact, SDC is becoming non-negligible because of exa-scale computing demand on complex scientific simulations with vast volume of data being produced or in some particular instruments/devices (such as interplane…

    Submitted 6 October, 2020; originally announced October 2020.

  32. cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data

    Authors: Jiannan Tian, Sheng Di, Kai Zhao, Cody Rivera, Megan Hickman Fulp, Robert Underwood, Sian Jin, Xin Liang, Jon Calhoun, Dingwen Tao, Franck Cappello

    Abstract: Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versio…

    Submitted 21 September, 2020; v1 submitted 19 July, 2020; originally announced July 2020.

    Comments: 13 pages, 8 figures, 9 tables, published in PACT '20

  33. FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

    Authors: Kai Zhao, Sheng Di, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, Zizhong Chen

    Abstract: Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process agains…

    Submitted 7 September, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

    Comments: 13 pages

    Journal ref: IEEE Transactions on Parallel and Distributed Systems, 2020

  34. arXiv:2001.06139  [pdf, other]

    cs.DC

    FRaZ: A Generic High-Fidelity Fixed-Ratio Lossy Compression Framework for Scientific Floating-point Data

    Authors: Robert Underwood, Sheng Di, Jon C. Calhoun, Franck Cappello

    Abstract: With ever-increasing volumes of scientific floating-point data being produced by high-performance computing applications, significantly reducing scientific floating-point data size is critical, and error-controlled lossy compressors have been developed for years. None of the existing scientific floating-point lossy data compressors, however, support effective fixed-ratio lossy compression. Yet fix…

    Submitted 16 January, 2020; originally announced January 2020.

    Comments: 12 pages

  35. DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression

    Authors: Sian Jin, Sheng Di, Xin Liang, Jiannan Tian, Dingwen Tao, Franck Cappello

    Abstract: DNNs have been quickly and broadly exploited to improve the data analysis quality in many complex science and engineering applications. Today's DNNs are becoming deeper and wider because of increasing demand on the analysis quality and more and more complex applications to resolve. The wide and deep DNNs, however, require large amounts of resources, significantly restricting their utilization on r…

    Submitted 22 April, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: 12 pages, 6 figures, accepted by HPDC'19

  36. arXiv:1811.05630  [pdf, other]

    quant-ph cs.CC cs.ET

    Memory-Efficient Quantum Circuit Simulation by Using Lossy Data Compression

    Authors: Xin-Chuan Wu, Sheng Di, Franck Cappello, Hal Finkel, Yuri Alexeev, Frederic T. Chong

    Abstract: In order to evaluate, validate, and refine the design of new quantum algorithms or quantum computers, researchers and developers need methods to assess their correctness and fidelity. This requires the capabilities of quantum circuit simulations. However, the number of quantum state amplitudes increases exponentially with the number of qubits, leading to the exponential growth of the memory requir…

    Submitted 14 November, 2018; v1 submitted 13 November, 2018; originally announced November 2018.

    Comments: 2 pages, 2 figures. The 3rd International Workshop on Post-Moore Era Supercomputing (PMES)

  37. arXiv:1811.05140  [pdf, other]

    quant-ph cs.ET

    Amplitude-Aware Lossy Compression for Quantum Circuit Simulation

    Authors: Xin-Chuan Wu, Sheng Di, Franck Cappello, Hal Finkel, Yuri Alexeev, Frederic T. Chong

    Abstract: Classical simulation of quantum circuits is crucial for evaluating and validating the design of new quantum algorithms. However, the number of quantum state amplitudes increases exponentially with the number of qubits, leading to the exponential growth of the memory requirement for the simulations. In this paper, we present a new data reduction technique to reduce the memory requirement of quantum…

    Submitted 14 November, 2018; v1 submitted 13 November, 2018; originally announced November 2018.

    Comments: 6 pages, 6 figures. The 4th International Workshop on Data Reduction for Big Scientific Data (DRBSD-4)

  38. arXiv:1806.08901  [pdf, other]

    cs.DC

    Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP

    Authors: Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

    Abstract: With ever-increasing volumes of scientific data produced by HPC applications, significantly reducing data size is critical because of limited capacity of storage space and potential bottlenecks on I/O or networks in writing/reading or transferring data. SZ and ZFP are the two leading lossy compressors available to compress scientific data sets. However, their performance is not consistent across d…

    Submitted 5 January, 2019; v1 submitted 22 June, 2018; originally announced June 2018.

    Comments: 14 pages, 9 figures, first revision

  39. arXiv:1805.07384  [pdf, other]

    cs.IT cs.DC

    Fixed-PSNR Lossy Compression for Scientific Data

    Authors: Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

    Abstract: Error-controlled lossy compression has been studied for years because of extremely large volumes of data being produced by today's scientific simulations. None of existing lossy compressors, however, allow users to fix the peak signal-to-noise ratio (PSNR) during compression, although PSNR has been considered as one of the most significant indicators to assess compression quality. In this paper, w…

    Submitted 13 July, 2018; v1 submitted 17 May, 2018; originally announced May 2018.

    Comments: 5 pages, 2 figures, 2 tables, accepted by IEEE Cluster'18. arXiv admin note: text overlap with arXiv:1806.08901

  40. Improving Performance of Iterative Methods by Lossy Checkponting

    Authors: Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

    Abstract: Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage spa…

    Submitted 28 May, 2018; v1 submitted 30 April, 2018; originally announced April 2018.

    Comments: 14 pages, 10 figures, HPDC'18

  41. arXiv:1711.03888  [pdf, other

    cs.DC

    In-Depth Exploration of Single-Snapshot Lossy Compression Techniques for N-Body Simulations

    Authors: Dingwen Tao, Sheng Di, Zizhong Chen, Franck Cappello

    Abstract: In situ lossy compression allowing user-controlled data loss can significantly reduce the I/O burden. For large-scale N-body simulations where only one snapshot can be compressed at a time, the lossy compression ratio is very limited because of the fairly low spatial coherence of the particle data. In this work, we assess the state-of-the-art single-snapshot lossy compression techniques of two com…

    Submitted 10 November, 2017; originally announced November 2017.

    Comments: Accepted by IEEE BigData 2017

  42. arXiv:1707.09320  [pdf

    cs.OH astro-ph.IM cs.CE

    Z-checker: A Framework for Assessing Lossy Compression of Scientific Data

    Authors: Dingwen Tao, Sheng Di, Hanqi Guo, Zizhong Chen, Franck Cappello

    Abstract: Because of the vast volume of data being produced by today's scientific simulations and experiments, lossy data compressors allowing user-controlled loss of accuracy during compression are a relevant solution for significantly reducing the data size. However, lossy compressor developers and users are missing a tool to explore the features of scientific datasets and understand the data alteration aft…

    Submitted 10 November, 2017; v1 submitted 12 June, 2017; originally announced July 2017.

    Comments: Accepted by The International Journal of High Performance Computing Applications

  43. arXiv:1707.08205  [pdf, other

    cs.IT astro-ph.IM

    Exploration of Pattern-Matching Techniques for Lossy Compression on Cosmology Simulation Data Sets

    Authors: Dingwen Tao, Sheng Di, Zizhong Chen, Franck Cappello

    Abstract: Because of the vast volume of data being produced by today's scientific simulations, lossy compression allowing user-controlled information loss can significantly reduce the data size and the I/O burden. However, for large-scale cosmology simulation, such as the Hardware/Hybrid Accelerated Cosmology Code (HACC), where memory overhead constraints restrict compression to only one snapshot at a time,…

    Submitted 6 August, 2017; v1 submitted 17 June, 2017; originally announced July 2017.

    Comments: 12 pages, 4 figures, accepted for DRBSD-1 in conjunction with ISC'17

  44. Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization

    Authors: Dingwen Tao, Sheng Di, Zizhong Chen, Franck Cappello

    Abstract: Today's HPC applications are producing extremely large amounts of data, such that data storage and analysis are becoming more challenging for scientific research. In this work, we design a new error-controlled lossy compression algorithm for large-scale scientific data. Our key contribution is significantly improving the prediction hitting rate (or prediction accuracy) for each data point based on…

    Submitted 12 June, 2017; originally announced June 2017.

    Comments: Accepted by IPDPS'17, 11 pages, 10 figures, double column

  45. arXiv:0911.5593  [pdf, ps, other

    cs.DC

    Checkpointing vs. Migration for Post-Petascale Machines

    Authors: Franck Cappello, Henri Casanova, Yves Robert

    Abstract: We craft a few scenarios for the execution of sequential and parallel jobs on future generation machines. Checkpointing or migration, which technique to choose?

    Submitted 30 November, 2009; originally announced November 2009.

  46. arXiv:cs/0307066  [pdf, ps, other

    cs.DC

    Augernome & XtremWeb: Monte Carlos computation on a global computing platform

    Authors: Oleg Lodygensky, Gilles Fedak, Vincent Neri, Alain Cordier, Franck Cappello

    Abstract: In this paper, we present XtremWeb, a Global Computing platform used to generate monte carlos showers in Auger, an HEP experiment to study the highest energy cosmic rays at Mallargue-Mendoza, Argentina. XtremWeb main goal, as a Global Computing platform, is to compute distributed applications using idle time of widely interconnected machines. It is especially dedicated to -but not limited to-…

    Submitted 29 July, 2003; originally announced July 2003.

    ACM Class: D.0

    Journal ref: ECONF C0303241:THAT001,2003