
Use PyArrow for zero-copy interaction with the Ray Object Store #36

Merged — 8 commits merged into datafusion-contrib:main on Apr 5, 2023

Conversation

franklsf95 (Contributor) commented Mar 31, 2023

Ray Shuffle is currently 2x slower than disk-based shuffle. My theory is that there is too much serde and memcpy overhead; there really shouldn't be any, because the Arrow in-memory format is natively supported by the Ray object store. This PR addresses that.

In this PR:

  • Removed PyResultSet and PyRecordBatch. Instead, use the native pyarrow.RecordBatch and pyarrow.ResultSet. These are picklable, so we no longer have to convert them to and from bytes (see the sketch after this list).
  • I realized schedule_execution really doesn't have to be a remote task, since with Ray shuffle we schedule all tasks at the beginning of execution. It is now a recursive function call, which also saves the serde cost of serializing execution plans.
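
The first bullet relies on the fact that pyarrow objects pickle through Arrow's own buffers, which Ray can place in its shared-memory object store. Below is a minimal, hypothetical sketch (not code from this PR) of a pyarrow.RecordBatch round-tripping through the Ray object store; the schema, data, and the num_rows task are made up for illustration:

```python
import pyarrow as pa
import ray

ray.init()

# Build a small Arrow record batch; the column names and data are illustrative only.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "value"],
)

# ray.put pickles the batch (pickle protocol 5 with out-of-band Arrow buffers)
# into the object store; no manual to/from-bytes conversion is needed.
ref = ray.put(batch)

@ray.remote
def num_rows(b: pa.RecordBatch) -> int:
    # The task receives a record batch backed by the stored buffers.
    return b.num_rows

assert ray.get(num_rows.remote(ref)) == 3
```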

Using PyArrow, Ray Shuffle is now slightly faster than disk-based shuffle on a single node. (See before/after comparison. Plot titles are wrong; this is on a single node).

[before and after benchmark plots]

I also tested against SparkSQL on a 4-node cluster and RaySQL is 2.5x faster.

[4-node benchmark plot]

andygrove merged commit f985808 into datafusion-contrib:main on Apr 5, 2023