-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TimestampNanosecond may not be able to represent whole ORC timestamp range #75
Comments
#91 adds a validation on converting timestamp data. It would be more generic if we could support all four types of timestamp precision. |
The options I see:
variant: allow users to opt in to losing precision instead of returning an error. or maybe of returning null or saturating. What do you think? |
4 looks more concrete to me. ORC's spec says all timestamps are with nanosecond precision. To not lose precision while reading an ORC file, we may have to check if all the timestamp in the file are end with |
Option 4 also makes sense to me too. In reader builder options can allow user to configure second/milli/micro precision explicitly, and maybe allow it to be strict or not. Strict = fail if there is a value with higher precision than configured (e.g. has nanoseconds when expecting TimestampSeconds) and non-strict = truncate extra precision (just lose the nanoseconds). By default we would have TimestampNanoseconds (doesn't really matter if strict or not as we can't get more precise anyway) |
How would you make it configurable per-column? A map from indices to policies? |
I didn't consider this. I guess it could be grouped as part of some wider configuration where the user can provide an expected Arrow schema to decode into for the whole file (which could allow them to take advantage of dictionary encoding for example), which includes specifying the specific timestamp precision. |
ORC encodes timestamps as seconds since ORC epoch, and nanoseconds since last second.
We translate this into the TimestampNanoseconds type.
However this may limit the range that ORC timestamp in actuality can represent.
Need to consider this?
The text was updated successfully, but these errors were encountered: