Activity
- A KV cache that grows too large and demands too much memory bandwidth is a key obstacle to increasing context lengths in LLM inference. 4-bit quantization…
  Shared by Jongsoo Park
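The post above touches on shrinking the KV cache with 4-bit quantization. As a rough illustration only (the post is truncated here, so this is not its specific method), a per-group asymmetric 4-bit quantize/dequantize round trip can be sketched in NumPy; the group size and packing layout are illustrative choices:

```python
import numpy as np

def quantize_kv_4bit(kv, group_size=32):
    """Per-group asymmetric 4-bit quantization of a flat KV-cache slice.

    kv: 1-D float array whose length is a multiple of group_size.
    Returns packed 4-bit codes plus a per-group scale and minimum,
    packing two values per stored byte (4x smaller than fp16 values).
    """
    groups = kv.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0           # 4 bits -> levels 0..15
    codes = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]    # two nibbles per byte
    return packed, scale, lo

def dequantize_kv_4bit(packed, scale, lo):
    """Unpack the nibbles and map codes back to approximate floats."""
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = packed >> 4
    codes[:, 1::2] = packed & 0x0F
    return (codes * scale + lo).reshape(-1)

rng = np.random.default_rng(0)
kv = rng.standard_normal(128).astype(np.float32)
packed, scale, lo = quantize_kv_4bit(kv)
restored = dequantize_kv_4bit(packed, scale, lo)
max_err = float(np.abs(kv - restored).max())
```

Because scales and minima are stored per group, outlier values only degrade precision within their own group, which is why grouped schemes are common for KV caches.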
- We are happy to share that our work on the efficient upscaling of recommendation models, including both sparse and dense scaling, has been accepted…
  Liked by Jongsoo Park
- Zhaodong Chen is going to present his CUTLASS paper, EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree, at ASPLOS'24 on May 1. EVT…
  Liked by Jongsoo Park
Publications
- Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory (IPDPS)
- Distributed SociaLite: A Datalog-based Language for Large-Scale Graph Analysis (VLDB)
  Large-scale graph analysis is becoming important with the rise of worldwide social network services. In earlier work on SociaLite, we proposed extensions to Datalog to implement graph analysis programs efficiently and succinctly on sequential machines. This paper describes novel extensions and optimizations of SociaLite for parallel and distributed execution to support large-scale graph analysis.
  With distributed SociaLite, programmers simply annotate how data are to be distributed; the necessary communication is then inferred automatically to generate parallel code for clusters of multi-core machines. SociaLite optimizes the evaluation of recursive monotone aggregate functions using a delta-stepping technique. In addition, approximate computation is supported, allowing programmers to trade accuracy for reduced time and space.
  We evaluated SociaLite with six core graph algorithms used in many social network analyses. Our experiment with 64 Amazon EC2 8-core instances shows that SociaLite programs performed within a factor of two of ideal weak scaling. Compared to optimized Giraph, an open-source alternative to Pregel, SociaLite programs are 4 to 12 times faster across benchmark algorithms and 22 times more succinct on average.
  As a declarative query language, SociaLite, with the help of a compiler that generates efficient parallel and approximate code, makes it easy to build social applications that operate on large-scale distributed graphs.
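The delta-based evaluation of recursive rules that the abstract mentions can be illustrated with a minimal single-machine sketch. The Datalog program in the docstring is standard transitive closure; the Python names are illustrative and are not SociaLite syntax:

```python
def transitive_closure(edges):
    """Semi-naive (delta-based) evaluation of the Datalog program:

        path(X, Y) :- edge(X, Y).
        path(X, Y) :- path(X, Z), edge(Z, Y).

    Each round joins only the facts derived in the previous round
    (the delta) against edge/2, instead of re-joining all of path,
    mirroring the incremental style used for recursive rules.
    """
    edge = set(edges)
    succ = {}
    for x, y in edge:                      # index edge/2 by source node
        succ.setdefault(x, set()).add(y)

    path = set(edges)                      # base rule seeds path/2
    delta = set(edges)
    while delta:
        new = set()
        for x, z in delta:                 # join the delta with edge/2
            for y in succ.get(z, ()):
                if (x, y) not in path:
                    new.add((x, y))
        path |= new                        # accumulate, then iterate on new facts
        delta = new
    return path

paths = transitive_closure([(1, 2), (2, 3), (3, 4)])
```

In the distributed setting the paper describes, the delta facts would additionally be shipped to the machines owning the matching `edge` partitions before each join.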
- Billion-Particle SIMD-Friendly Two-Point Correlation on Large-Scale HPC Cluster Systems (Supercomputing Conference)
- Efficient Backprojection-Based Synthetic Aperture Radar Computation with Many-Core Processors (Supercomputing Conference)
- Buffer-space Efficient and Deadlock-free Scheduling of Stream Applications on Multi-core Architectures (SPAA)
- Fine-grain Dynamic Instruction Placement for L0 Scratch-pad Memory (CASES)
- An Energy-Efficient Processor Architecture for Embedded Systems (Computer Architecture Letters)
- Efficient Embedded Computing (IEEE Computer)
- Hierarchical Instruction Register Organization (Computer Architecture Letters)
- Register Pointer Architecture for Efficient Embedded Processors (DATE)
Patents
- Fourier transform computation for distributed processing environments
  Issued US20140101219 A1
  Fourier transform computation for distributed processing environments is disclosed. Example methods disclosed herein to compute a Fourier transform of an input data sequence include performing first processing on the input data sequence using a plurality of processors, the first processing resulting in an output data sequence having more data elements than the input data sequence. Such example methods also include performing second processing on the output data sequence using the plurality of processors, the output data sequence being permuted among the plurality of processors, each of the processors performing the second processing on a respective permuted portion of the output data sequence to determine a respective, ordered segment of the Fourier transform of the input data sequence.
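The two-phase, permute-in-the-middle structure the abstract describes matches the classic four-step FFT decomposition. Below is a single-process NumPy sketch of that general structure, for illustration only and not the patented method; in a real distributed run the transpose would be an all-to-all exchange between processors:

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Four-step (Cooley-Tukey) FFT of a signal of length n1 * n2.

    Phase 1 runs n2 small FFTs of size n1, twiddle factors are applied,
    the array is transposed (the "permutation" between processors in a
    distributed setting), and phase 2 runs n1 small FFTs of size n2.
    """
    assert len(x) == n1 * n2
    a = np.asarray(x, dtype=complex).reshape(n1, n2)  # a[m1, m2] = x[n2*m1 + m2]
    b = np.fft.fft(a, axis=0)                         # phase 1: size-n1 FFTs
    k1 = np.arange(n1).reshape(n1, 1)
    m2 = np.arange(n2).reshape(1, n2)
    b *= np.exp(-2j * np.pi * k1 * m2 / (n1 * n2))    # twiddle factors
    c = np.fft.fft(b, axis=1)                         # phase 2: size-n2 FFTs
    return c.T.ravel()                                # transpose orders the output

rng = np.random.default_rng(1)
x = rng.standard_normal(24)
ours = four_step_fft(x, 4, 6)
ref = np.fft.fft(x)
```

Each row of `c` ends up holding a contiguous, ordered segment of the final transform after the transpose, which is the property the claim language refers to.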
Honors & Awards
- Best paper finalist, Supercomputing Conference, for automatic wave-front parallelization
- Best student paper finalist, Supercomputing Conference, for sparse tensor completion
- ACM Gordon Bell award finalist for billion-particle two-point correlation computation
- Best paper finalist, Supercomputing Conference, for efficient synthetic aperture radar algorithm
- Best paper, Supercomputing Conference, for low-communication FFT
- World's fastest High-Performance Conjugate Gradient benchmark (HPCG) result, 2014-2016
More activity by Jongsoo
- It was great to share some of my experiences within Meta. If you are interested to come work on what (at least I) see as the most fun and exciting…
  Liked by Jongsoo Park
- Llama3 8B and 70B are out, with pretty exciting results! * The ~400B is still training but results already look promising. * Meta's own Chat…
  Liked by Jongsoo Park
- It has been one of the best collaborations I've had, and one I learned a lot from. More than 400 TF/s/GPU at 16K GPU scale for production training…
  Shared by Jongsoo Park
- I am so excited to work on multi-dimensional parallelism and pretraining efficiency for Llama3 scaling. We did many optimizations on memory usage and…
  Liked by Jongsoo Park
- I'm thrilled to share the Llama 3 models with everyone. This has been an INCREDIBLE team effort. The 8b and 70b models are available now. These are…
  Liked by Jongsoo Park
- When I joined Meta as a thermal engineer over a decade ago, I couldn’t have imagined the road ahead. Here’s how my journey has followed AI innovation…
  Liked by Jongsoo Park
- Details on Meta's 24K H100 GPU clusters for LLaMa3 training. https://lnkd.in/guEjG46v
  Shared by Jongsoo Park