Commit
update README (#21)
andygrove committed Feb 1, 2023
1 parent ae07096 commit b952203
Showing 8 changed files with 40 additions and 10 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

8 changes: 6 additions & 2 deletions Cargo.toml
@@ -4,7 +4,7 @@
description = "RaySQL: DataFusion on Ray"
homepage = "https://github.com/andygrove/ray-sql"
repository = "https://github.com/andygrove/ray-sql"
authors = ["Andy Grove <andygrove73@gmail.com>"]
-version = "0.2.0"
+version = "0.3.0"
edition = "2021"
readme = "README.md"
license = "Apache-2.0"
@@ -33,4 +33,8 @@
name = "raysql"
crate-type = ["cdylib", "rlib"]

[package.metadata.maturin]
name = "raysql._raysql_internal"

[profile.release]
codegen-units = 1
lto = true
30 changes: 25 additions & 5 deletions README.md
@@ -1,11 +1,11 @@
# RaySQL: DataFusion on Ray

-This is an experimental research project to evaluate the concept of performing distributed SQL queries from Python, using
+This is a personal research project to evaluate performing distributed SQL queries from Python, using
[Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).

## Example

-See [examples/tips.py](examples/tips.py).
+Run the following example live in your browser using a Google Colab [notebook](https://colab.research.google.com/drive/1tmSX0Lu6UFh58_-DBUVoyYx6BoXHOszP?usp=sharing).

```python
import ray
# ... (remaining lines collapsed in the diff view)
print(result_set)
```

@@ -40,11 +40,31 @@

## Performance

-This chart shows the relative performance of RaySQL compared to other open-source distributed SQL frameworks.
+This chart shows the performance of RaySQL compared to Apache Spark for
+[SQLBench-H](https://sqlbenchmarks.io/sqlbench-h/) at a very small data set (10GB), running on my desktop (Threadripper
+with 24 physical cores). Both RaySQL and Spark are configured with 24 executors.

-Performance is looking pretty respectable!
+Note that query 15 is excluded from both results since RaySQL does not support DDL yet.

-![SQLBench-H Performance Chart](./docs/sqlbench-h-workstation-10-distributed-perquery.png)
+### Overall Time

RaySQL is ~30% faster overall for this scale factor and environment.

![SQLBench-H Total](./docs/sqlbench-h-total.png)

### Per Query Time

Spark is much faster on some queries, likely due to broadcast exchanges, which RaySQL hasn't implemented yet.

![SQLBench-H Per Query](./docs/sqlbench-h-per-query.png)
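
The broadcast exchange that likely gives Spark the edge replicates the small side of a join to every worker, so only the large side stays partitioned. The idea can be sketched in plain Python (illustrative only; this is not RaySQL or Spark code):

```python
def broadcast_hash_join(small_rows, large_partitions, key_small, key_large):
    # Build the hash table from the small side once; in a real engine a
    # copy of it would be broadcast to every worker.
    table = {}
    for row in small_rows:
        table.setdefault(row[key_small], []).append(row)
    # Each large-side partition probes the replicated table independently,
    # so the large side never has to be re-shuffled by the join key.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in table.get(row[key_large], []):
                joined.append({**match, **row})
    return joined

small = [{"k": 1, "name": "a"}, {"k": 2, "name": "b"}]
large = [[{"k": 1, "v": 10}], [{"k": 2, "v": 20}, {"k": 3, "v": 30}]]
result = broadcast_hash_join(small, large, "k", "k")
```

This avoids shuffling the large side entirely, which is why it wins when one join input is small.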

### Performance Plan

I'm planning on experimenting with the following changes to improve performance:

- Make better use of Ray futures to run more tasks in parallel
- Use Ray object store for shuffle data transfer to reduce disk I/O cost
- Keep upgrading to newer versions of DataFusion to pick up the latest optimizations
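
The first item above, launching every task before waiting on any result, can be sketched with the stdlib `concurrent.futures` as a stand-in for Ray tasks (names and the per-partition workload are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_partition(stage_id, partition):
    # Placeholder for executing one partition of a query stage.
    return (stage_id, partition, partition * 2)

def run_stage_parallel(stage_id, num_partitions, max_workers=4):
    # Submit every partition before collecting any result, so the tasks
    # overlap instead of running one at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_partition, stage_id, p)
                   for p in range(num_partitions)]
        return [f.result() for f in futures]

results = run_stage_parallel(stage_id=0, num_partitions=6)
```

The Ray equivalent is the same shape: collect all the remote task handles first, then call `ray.get` once on the whole list rather than inside the loop.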

## Building

Binary file added docs/sqlbench-h-per-query.png
Binary file added docs/sqlbench-h-total.png
Binary file not shown.
2 changes: 1 addition & 1 deletion raysql/context.py
@@ -39,7 +39,7 @@ def execute_query_stage(self, graph, stage):
print("Scheduling query stage #{} with {} input partitions and {} output partitions".format(stage.id(), partition_count, stage.get_output_partition_count()))

# serialize the plan
-plan_bytes = self.ctx.serialize_execution_plan(stage.get_execution_plan())
+plan_bytes = ray.put(self.ctx.serialize_execution_plan(stage.get_execution_plan()))

# round-robin allocation across workers
futures = []
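
The `ray.put` change stores the serialized plan in Ray's object store once, so each worker receives a small object reference rather than its own copy of the plan bytes. The round-robin allocation that follows it can be sketched as (function and worker names are illustrative, not RaySQL's actual helpers):

```python
def round_robin_assign(partition_ids, workers):
    # Hand each partition to the next worker in rotation, wrapping
    # around when the worker list is exhausted.
    return [(p, workers[i % len(workers)])
            for i, p in enumerate(partition_ids)]

assignments = round_robin_assign(range(5), ["worker-0", "worker-1", "worker-2"])
```

Each worker then deserializes the shared plan once and executes only its assigned partitions.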
8 changes: 7 additions & 1 deletion src/context.rs
@@ -29,7 +29,13 @@ pub struct PyContext {
impl PyContext {
#[new]
pub fn new(target_partitions: usize) -> Self {
-let config = SessionConfig::default().with_target_partitions(target_partitions);
+let config = SessionConfig::default()
+    .with_target_partitions(target_partitions)
+    .with_batch_size(16*1024)
+    .with_repartition_aggregations(true)
+    .with_repartition_windows(true)
+    .with_repartition_joins(true)
+    .with_parquet_pruning(true);
Self {
ctx: SessionContext::with_config(config),
}
