Activity
- A KV cache that grows too large and demands too much memory bandwidth is a key obstacle to increasing context lengths in LLM inference. 4-bit quantization…
  Shared by Jongsoo Park
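The post above touches on shrinking the KV cache with 4-bit quantization. As a rough illustration only (the post is truncated here, so this is not its specific method), a per-group asymmetric 4-bit quantize/dequantize round trip can be sketched in NumPy; the group size and packing layout are illustrative choices:

```python
import numpy as np

def quantize_kv_4bit(kv, group_size=32):
    """Per-group asymmetric 4-bit quantization of a flat KV-cache slice.

    kv: 1-D float array whose length is a multiple of group_size.
    Returns packed 4-bit codes plus a per-group scale and minimum,
    packing two values per stored byte (4x smaller than fp16 values).
    """
    groups = kv.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0           # 4 bits -> levels 0..15
    codes = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]    # two nibbles per byte
    return packed, scale, lo

def dequantize_kv_4bit(packed, scale, lo):
    """Unpack the nibbles and map codes back to approximate floats."""
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = packed >> 4
    codes[:, 1::2] = packed & 0x0F
    return (codes * scale + lo).reshape(-1)

rng = np.random.default_rng(0)
kv = rng.standard_normal(128).astype(np.float32)
packed, scale, lo = quantize_kv_4bit(kv)
restored = dequantize_kv_4bit(packed, scale, lo)
max_err = float(np.abs(kv - restored).max())
```

Because scales and minima are stored per group, outlier values only degrade precision within their own group, which is why grouped schemes are common for KV caches.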
- We are happy to share that our work on the efficient upscaling of recommendation models, including both sparse and dense scaling, has been accepted…
  Liked by Jongsoo Park
- Zhaodong Chen is going to present his CUTLASS paper, EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree, at ASPLOS'24 on May 1. EVT…
  Liked by Jongsoo Park
Publications
- Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory (IPDPS)
- Distributed SociaLite: A Datalog-based Language for Large-Scale Graph Analysis (VLDB)
  Large-scale graph analysis is becoming important with the rise of worldwide social network services. In earlier work on SociaLite, we proposed extensions to Datalog to implement graph analysis programs efficiently and succinctly on sequential machines. This paper describes novel extensions and optimizations of SociaLite for parallel and distributed execution to support large-scale graph analysis.
  With distributed SociaLite, programmers simply annotate how data are to be distributed; the necessary communication is then inferred automatically to generate parallel code for clusters of multi-core machines. SociaLite optimizes the evaluation of recursive monotone aggregate functions using a delta-stepping technique. In addition, approximate computation is supported, allowing programmers to trade accuracy for reduced time and space.
  We evaluated SociaLite with six core graph algorithms used in many social network analyses. Our experiment with 64 Amazon EC2 8-core instances shows that SociaLite programs performed within a factor of two of ideal weak scaling. Compared to optimized Giraph, an open-source alternative to Pregel, SociaLite programs are 4 to 12 times faster across benchmark algorithms and 22 times more succinct on average.
  As a declarative query language, SociaLite, with the help of a compiler that generates efficient parallel and approximate code, makes it easy to build social applications that operate on large-scale distributed graphs.
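The delta-based evaluation of recursive rules that the abstract mentions can be illustrated with a minimal single-machine sketch. The Datalog program in the docstring is standard transitive closure; the Python names are illustrative and are not SociaLite syntax:

```python
def transitive_closure(edges):
    """Semi-naive (delta-based) evaluation of the Datalog program:

        path(X, Y) :- edge(X, Y).
        path(X, Y) :- path(X, Z), edge(Z, Y).

    Each round joins only the facts derived in the previous round
    (the delta) against edge/2, instead of re-joining all of path,
    mirroring the incremental style used for recursive rules.
    """
    edge = set(edges)
    succ = {}
    for x, y in edge:                      # index edge/2 by source node
        succ.setdefault(x, set()).add(y)

    path = set(edges)                      # base rule seeds path/2
    delta = set(edges)
    while delta:
        new = set()
        for x, z in delta:                 # join the delta with edge/2
            for y in succ.get(z, ()):
                if (x, y) not in path:
                    new.add((x, y))
        path |= new                        # accumulate, then iterate on new facts
        delta = new
    return path

paths = transitive_closure([(1, 2), (2, 3), (3, 4)])
```

In the distributed setting the paper describes, the delta facts would additionally be shipped to the machines owning the matching `edge` partitions before each join.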
- Billion-Particle SIMD-Friendly Two-Point Correlation on Large-Scale HPC Cluster Systems (Supercomputing Conference)
- Efficient Backprojection-Based Synthetic Aperture Radar Computation with Many-Core Processors (Supercomputing Conference)
- Buffer-space Efficient and Deadlock-free Scheduling of Stream Applications on Multi-core Architectures (SPAA)
- Fine-grain Dynamic Instruction Placement for L0 Scratch-pad Memory (CASES)
- An Energy-Efficient Processor Architecture for Embedded Systems (Computer Architecture Letters)
- Efficient Embedded Computing (IEEE Computer)
- Hierarchical Instruction Register Organization (Computer Architecture Letters)
- Register Pointer Architecture for Efficient Embedded Processors (DATE)
Patents
- Fourier transform computation for distributed processing environments
  Issued US20140101219 A1
  Fourier transform computation for distributed processing environments is disclosed. Example methods disclosed herein to compute a Fourier transform of an input data sequence include performing first processing on the input data sequence using a plurality of processors, the first processing resulting in an output data sequence having more data elements than the input data sequence. Such example methods also include performing second processing on the output data sequence using the plurality of processors, the output data sequence being permuted among the plurality of processors, each of the processors performing the second processing on a respective permuted portion of the output data sequence to determine a respective, ordered segment of the Fourier transform of the input data sequence.
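The two-phase, permute-in-the-middle structure the abstract describes matches the classic four-step FFT decomposition. Below is a single-process NumPy sketch of that general structure, for illustration only and not the patented method; in a real distributed run the transpose would be an all-to-all exchange between processors:

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Four-step (Cooley-Tukey) FFT of a signal of length n1 * n2.

    Phase 1 runs n2 small FFTs of size n1, twiddle factors are applied,
    the array is transposed (the "permutation" between processors in a
    distributed setting), and phase 2 runs n1 small FFTs of size n2.
    """
    assert len(x) == n1 * n2
    a = np.asarray(x, dtype=complex).reshape(n1, n2)  # a[m1, m2] = x[n2*m1 + m2]
    b = np.fft.fft(a, axis=0)                         # phase 1: size-n1 FFTs
    k1 = np.arange(n1).reshape(n1, 1)
    m2 = np.arange(n2).reshape(1, n2)
    b *= np.exp(-2j * np.pi * k1 * m2 / (n1 * n2))    # twiddle factors
    c = np.fft.fft(b, axis=1)                         # phase 2: size-n2 FFTs
    return c.T.ravel()                                # transpose orders the output

rng = np.random.default_rng(1)
x = rng.standard_normal(24)
ours = four_step_fft(x, 4, 6)
ref = np.fft.fft(x)
```

Each row of `c` ends up holding a contiguous, ordered segment of the final transform after the transpose, which is the property the claim language refers to.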
Honors & Awards
- Best paper finalist, Supercomputing Conference, for automatic wave-front parallelization
- Best student paper finalist, Supercomputing Conference, for sparse tensor completion
- ACM Gordon Bell award finalist for billion-particle two-point correlation computation
- Best paper finalist, Supercomputing Conference, for efficient synthetic aperture radar algorithm
- Best paper, Supercomputing Conference, for low-communication FFT
- World's fastest High-Performance Conjugate Gradient benchmark (HPCG) result, 2014-2016
More activity by Jongsoo
- It was great to share some of my experiences within Meta. If you are interested to come work on what (at least I) see as the most fun and exciting…
  Liked by Jongsoo Park
- Llama3 8B and 70B are out, with pretty exciting results! * The ~400B is still training but results already look promising. * Meta's own Chat…
  Liked by Jongsoo Park
- It has been one of the best collaborations I've had, and one I learned a lot from. More than 400 TF/s/GPU at 16K GPU scale for production training…
  Shared by Jongsoo Park
- I am so excited to work on multi-dimensional parallelism and pretraining efficiency for Llama3 scaling. We did many optimizations on memory usage and…
  Liked by Jongsoo Park
- I'm thrilled to share the Llama 3 models with everyone. This has been an INCREDIBLE team effort. The 8b and 70b models are available now. These are…
  Liked by Jongsoo Park
- When I joined Meta as a thermal engineer over a decade ago, I couldn’t have imagined the road ahead. Here’s how my journey has followed AI innovation…
  Liked by Jongsoo Park
- Details on Meta's 24K H100 GPU clusters for LLaMa3 training. https://lnkd.in/guEjG46v
  Shared by Jongsoo Park