
TimestampNanosecond may not be able to represent whole ORC timestamp range #75

Closed
Jefffrey opened this issue Mar 31, 2024 · 6 comments · Fixed by #93
Labels: enhancement (New feature or request), low (Low priority)

Comments

@Jefffrey
Collaborator

ORC encodes timestamps as seconds since the ORC epoch plus nanoseconds since the last second.

We translate this into Arrow's TimestampNanosecond type, i.e. a single i64 of nanoseconds.

However, this may limit the range compared to what an ORC timestamp can actually represent.

We need to consider how to handle this.
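For illustration, a minimal sketch of the conversion being described, assuming values arrive as ORC's (seconds, nanoseconds) pair; the function and constant names are illustrative, not this crate's actual code:

```rust
/// Seconds between the Unix epoch (1970-01-01) and the ORC epoch (2015-01-01).
const ORC_EPOCH_UTC_SECONDS: i64 = 1_420_070_400;

/// Collapse ORC's (seconds since ORC epoch, nanoseconds since last second)
/// pair into a single i64 of nanoseconds since the Unix epoch. An i64 of
/// nanoseconds only covers roughly 1677-09-21 to 2262-04-11, so checked
/// arithmetic returns None for anything outside that window.
fn to_nanos_since_unix_epoch(orc_seconds: i64, nanos: u64) -> Option<i64> {
    let unix_seconds = orc_seconds.checked_add(ORC_EPOCH_UTC_SECONDS)?;
    unix_seconds
        .checked_mul(1_000_000_000)?
        .checked_add(nanos as i64)
}
```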

Jefffrey added the low (Low priority) and enhancement (New feature or request) labels on Apr 1, 2024
@waynexia
Collaborator

#91 adds validation when converting timestamp data. It would be more generic if we could support all four Arrow timestamp precisions (second, millisecond, microsecond, nanosecond).

@progval
Contributor

progval commented May 24, 2024

The options I see:

  1. parse them as null
  2. try to parse whole batches as nanoseconds; if a value overflows, retry the batch as milliseconds, and if it overflows again, as seconds. At each coarser unit, return an error if any value would lose precision (a sketch of this follows the comment)
  3. use statistics (if available) to find the min and max values, pick a precision based on those, then decode. If, while decoding, we see we are losing precision for any value, return an error
  4. allow the user to configure what precision to use somehow, and stick to that

A variant: allow users to opt in to losing precision instead of getting an error, or perhaps to getting null or a saturated value.

What do you think?
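A minimal sketch of option 2's batch-level fallback, assuming the batch is already available as Unix seconds plus sub-second nanoseconds; all names here are hypothetical, not an existing API:

```rust
enum DecodeError {
    Overflow,
    PrecisionLoss,
}

/// Candidate units as (name, divisor relative to nanoseconds).
const UNITS: [(&str, i64); 3] = [("ns", 1), ("ms", 1_000_000), ("s", 1_000_000_000)];

fn decode_batch(seconds: &[i64], nanos: &[u64]) -> Result<(&'static str, Vec<i64>), DecodeError> {
    'units: for (unit, divisor) in UNITS {
        let mut out = Vec::with_capacity(seconds.len());
        for (&s, &n) in seconds.iter().zip(nanos) {
            // At a coarser unit, any sub-unit precision is an error.
            if n as i64 % divisor != 0 {
                return Err(DecodeError::PrecisionLoss);
            }
            match s
                .checked_mul(1_000_000_000 / divisor)
                .and_then(|v| v.checked_add(n as i64 / divisor))
            {
                Some(v) => out.push(v),
                // Overflow: retry the whole batch at the next coarser unit.
                None => continue 'units,
            }
        }
        return Ok((unit, out));
    }
    Err(DecodeError::Overflow)
}
```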

@waynexia
Collaborator

> allow the user to configure what precision to use somehow, and stick to that

Option 4 looks more concrete to me. ORC's spec says all timestamps have nanosecond precision. To avoid losing precision while reading an ORC file, we would have to check whether every timestamp in the file ends with 000 or 000000, which would suggest the writer was actually using micro- or milliseconds. But that check could be expensive (see the sketch below). So providing a way for the caller to specify the expected precision looks easier to implement and to use.
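The check being described might look like the sketch below; it has to scan every nanosecond component in the column, which is why it could be expensive:

```rust
/// Infer the coarsest unit that represents every value exactly by checking
/// trailing zeros of the nanosecond components (illustrative sketch).
fn inferred_precision(nanos: &[u64]) -> &'static str {
    if nanos.iter().all(|&n| n % 1_000_000 == 0) {
        "millisecond" // every value ends with at least 000000
    } else if nanos.iter().all(|&n| n % 1_000 == 0) {
        "microsecond" // every value ends with at least 000
    } else {
        "nanosecond"
    }
}
```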

@Jefffrey
Collaborator Author

Option 4 makes sense to me too. The reader builder options could let the user explicitly configure second/milli/micro precision, and perhaps whether it is strict or not. Strict = fail if a value has higher precision than configured (e.g. it has nanoseconds when TimestampSecond is expected); non-strict = truncate the extra precision (just drop the nanoseconds). A sketch of both behaviours follows.

By default we would use TimestampNanosecond (strict vs. non-strict doesn't matter there, since we can't get any more precise anyway).
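For a single value, the two behaviours might look like this sketch (the divisor would be 1_000_000_000 for TimestampSecond, 1_000_000 for milliseconds, and so on; the function is hypothetical, not an existing API):

```rust
/// Scale one sub-second nanosecond component down to the configured unit.
/// Strict mode fails when the value carries more precision than the unit;
/// non-strict mode silently truncates it. (Illustrative sketch only.)
fn scale_to_unit(nanos: u64, divisor: u64, strict: bool) -> Result<u64, String> {
    if strict && nanos % divisor != 0 {
        return Err(format!("timestamp has sub-unit precision: {nanos} ns"));
    }
    Ok(nanos / divisor)
}
```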

@progval
Contributor

progval commented May 24, 2024

How would you make it configurable per-column? A map from indices to policies?

@Jefffrey
Collaborator Author

> How would you make it configurable per-column? A map from indices to policies?

I hadn't considered this. It could be grouped into some wider configuration where the user provides an expected Arrow schema to decode the whole file into (which would also let them take advantage of dictionary encoding, for example), including the specific timestamp precision per column. Something like the sketch below.
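For example, per-column precision could be expressed through an ordinary Arrow schema that a hypothetical reader-builder option accepts:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

fn expected_schema() -> Arc<Schema> {
    // Per-column timestamp precision expressed via an Arrow schema; a
    // hypothetical `with_schema(...)` reader-builder option could then
    // decode the whole file into it.
    Arc::new(Schema::new(vec![
        Field::new("created_at", DataType::Timestamp(TimeUnit::Millisecond, None), true),
        Field::new("updated_at", DataType::Timestamp(TimeUnit::Nanosecond, None), true),
    ]))
}
```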
