
Investigate using S3 Select #48

Open
matthewmturner opened this issue Feb 28, 2022 · 6 comments
@matthewmturner
Collaborator

It seems support for S3 Select was added to the SDK, per https://github.com/awslabs/aws-sdk-rust/releases/tag/v0.0.17-alpha

Look into integrating this into S3FileSystem or using it to create a TableProvider.

@seddonm1
Collaborator

What are you trying to achieve? It looks like SELECT only queries JSON structures?

@matthewmturner
Collaborator Author

It was raised on Slack (https://the-asf.slack.com/archives/C01QUFS30TD/p1645989728729579?thread_ts=1645245240.528129&cid=C01QUFS30TD); I don't have any particular insight at this stage. I just created this to log the request and will look into it later when I have more time.

@matthewmturner
Collaborator Author

If I recall correctly, S3 Select works on CSV, JSON, and Parquet, but I read about it a while ago, so don't hold me to that. Doing zero research, I thought maybe we could add something like a select method to S3FileSystem.

Honestly, though, I haven't used it before or had time to look into this, so I'll come back to it or see if someone else (maybe the person who raised it) looks into it.
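To make the idea concrete, here is a minimal sketch of what a select method on S3FileSystem could look like. Everything here is invented for illustration: the method name, its arguments, and the stubbed body (a real implementation would call SelectObjectContent via aws-sdk-rust).

```rust
struct S3FileSystem;

impl S3FileSystem {
    /// Hypothetical: run an S3 Select SQL expression against one object
    /// and return the raw result bytes (CSV or JSON, depending on the
    /// requested output serialization). Stubbed so the sketch compiles:
    /// it just echoes the expression instead of calling S3.
    fn select(&self, _bucket: &str, _key: &str, expression: &str) -> Vec<u8> {
        expression.as_bytes().to_vec()
    }
}

fn main() {
    let fs = S3FileSystem;
    let out = fs.select("my-bucket", "data.csv", "SELECT * FROM S3Object s");
    println!("{} bytes returned", out.len());
}
```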

@jychen7
Member

jychen7 commented Mar 1, 2022

It was raised on slack

Hi, I raised this as an idea only.

it looks like SELECT only queries JSON structures?

As of 2022-02, from the source:

  1. For input, Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
    • There are also other limitations on Parquet, e.g. only columnar compression using GZIP or Snappy; Amazon S3 Select doesn't support whole-object compression.
  2. For output, Amazon S3 Select only supports CSV or JSON.

What are you trying to achieve?

S3 Select supports aggregation pushdown and predicate pushdown, which can improve performance depending on the use case, e.g. Using S3 Select Pushdown with Presto to improve performance.
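As a rough sketch of what pushdown means here: the client ships a small SQL expression to S3 and only matching (or aggregated) rows come back over the network. The helper below just builds such an expression string; its name and argument shapes are made up for illustration, not part of any existing API.

```rust
/// Build an S3 Select SQL expression from a column projection and an
/// optional pushed-down filter. `s` is the alias for the scanned object.
fn s3_select_expression(columns: &[&str], filter: Option<&str>) -> String {
    let projection = if columns.is_empty() {
        "*".to_string()
    } else {
        columns
            .iter()
            .map(|c| format!("s.\"{}\"", c))
            .collect::<Vec<_>>()
            .join(", ")
    };
    let mut sql = format!("SELECT {} FROM S3Object s", projection);
    if let Some(f) = filter {
        sql.push_str(" WHERE ");
        sql.push_str(f);
    }
    sql
}

fn main() {
    // Predicate pushdown: the filter is evaluated inside S3.
    println!(
        "{}",
        s3_select_expression(&["price"], Some("CAST(s.\"price\" AS FLOAT) > 100"))
    );
    // Aggregation pushdown: S3 Select accepts simple aggregates such as
    // COUNT(*) (it has no GROUP BY), so only one row crosses the network.
    println!("SELECT COUNT(*) FROM S3Object s");
}
```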

@Licht-T

Licht-T commented Oct 16, 2022

I am now looking into this. Let me share my investigation and opinion.

S3 Select itself

  • Presto and Ceph only support S3 Select for CSV. There are several reasons:

    • Parquet has column metadata, and we are already doing predicate pushdown with it.
    • As for JSON, the odd type MISSING exists and breaks predicate pushdown consistency.
      Assume the following data. On S3 Select, missing fields are treated as MISSING; in this case, the second row's c is MISSING. The result set of SELECT * FROM s WHERE c IS NULL is empty because, unlike UNKNOWN, MISSING is not the same as NULL.
      {"a": "foo", "b": 1, "c": "aaa"}
      {"a": "bar", "b": 3}
  • We can do a parallel scan of a single text file by using ScanRange.

    This lets us accelerate reading large files. Please note that ScanRange does not support compressed text data.
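A sketch of how the ScanRange windows for such a parallel scan could be computed (the function name is hypothetical). Since S3 Select processes any record that starts inside a scan range, even if it extends past the end, contiguous non-overlapping byte ranges are enough for uncompressed CSV/JSON:

```rust
/// Split an object of `len` bytes into contiguous (start, end) byte
/// windows of at most `chunk` bytes, one per parallel
/// SelectObjectContent request carrying a ScanRange.
fn scan_ranges(len: u64, chunk: u64) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + chunk).min(len);
        out.push((start, end));
        start = end;
    }
    out
}

fn main() {
    // A 1 GiB CSV object scanned with four 256 MiB ranges.
    let ranges = scan_ranges(1 << 30, 256 << 20);
    println!("{} ranges: {:?}", ranges.len(), ranges);
}
```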

How to achieve the S3 Select acceleration

Following those two precedents, we should integrate S3 Select into the CSV scan. Since we need to pass predicates down and build the SQL query from them, I believe this is not an ObjectStore matter.
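The "build the SQL query from predicates" step could look roughly like the sketch below. The Pred enum is a made-up stand-in for the real predicate representation (a proper integration would walk DataFusion's Expr tree); the CAST reflects that CSV columns arrive as strings.

```rust
/// Hypothetical stand-in for a pushed-down comparison predicate.
enum Pred<'a> {
    Gt(&'a str, f64),
    Eq(&'a str, &'a str),
}

/// Render predicates as an S3 Select WHERE clause, quoting string
/// literals and casting numeric CSV columns explicitly.
fn where_clause(preds: &[Pred]) -> String {
    preds
        .iter()
        .map(|p| match p {
            Pred::Gt(col, v) => format!("CAST(s.\"{}\" AS FLOAT) > {}", col, v),
            Pred::Eq(col, v) => format!("s.\"{}\" = '{}'", col, v.replace('\'', "''")),
        })
        .collect::<Vec<_>>()
        .join(" AND ")
}

fn main() {
    let preds = [Pred::Gt("price", 100.0), Pred::Eq("region", "eu")];
    println!("SELECT * FROM S3Object s WHERE {}", where_clause(&preds));
}
```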

Actually, I implemented this as a physical_plan, switched on by the URL scheme. While I have already written the integration tests, I am not fully sure this is the best approach.

@matthewmturner
Collaborator Author

@Licht-T Hi, thanks for raising this. This repo will be archived soon; object_store is now preferred. I recommend raising this request there.
