Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updated async reader to read local files directly, no async #17

Merged
merged 1 commit into from
May 2, 2024

Conversation

maximedion2
Copy link
Collaborator

No description provided.

@maximedion2
Copy link
Collaborator Author

@tshauck is that what you had in mind? looking at your comment on the issue, you mentioned the enum that ObjectStore::get returns, so I looked into using that, the changes here basically make it so that the async reader behaves like a sync reader when the file is local, using the std lib (blocking) methods to read files.

I have this large-ish toy dataset on my computer, I don't commit it to the test data repo, but I use it for very rough benchmarking occasionally. reading it takes about 7-8 secs, there's definitely some variability, but with the changes I made in this PR, the async reader was consistently faster, I'd say by about 10% or so.

Copy link
Collaborator

@tshauck tshauck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it'd work to me. I was thinking more at the FileOpener level w/i datafusion, then delegate to either the async or non-async reader zarr reader. To your point this this mades the async reader work in non-async contexts... which may obviate the need for non-async reader?

@maximedion2
Copy link
Collaborator Author

Right, so initially I thought I was gonna create a non-async reader in the FileOpener like you said, but I looked into the ObjectStore::get method in more details, and I thought I could do something really simple and clean within the async reader. In the case of a local file, we now have an async method that has a relatively heavy, blocking operation in it, which I think in general is something to avoid, but in this specific case, I think it's acceptable, since the operation, reading a local file, doesn't seem to be very efficient when it runs async anyway.

Regarding the non-async reader, currently the async reader essentially creates a non-async reader internally and uses that (through a little trick with the ZarrInMemoryChunkContainer), so that the "core" logic is all in the non-async reader. We could definitely combine all that into a single async reader, it would just be a little bit of refactoring. The main reason I could think of not doing that would be that ultimately, even for calls with local files, if we only have an async reader you always need to set up the async runtime (e.g. with tokio) and all that. If someone wanted to read zarr chunks without the async runtime, the sync reader allows them to do that. Overall that's probably a pretty minor concern though.

All that to say that I'll leave things as is for now and merge, but I'm open to combining everything into a single reader at some point if we think it makes the code cleaner.

@maximedion2 maximedion2 merged commit 96bce2a into main May 2, 2024
3 checks passed
@maximedion2 maximedion2 linked an issue May 2, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants