
TimestampNanosecond may not be able to represent whole ORC timestamp range #75

Closed
Jefffrey opened this issue Mar 31, 2024 · 6 comments · Fixed by #93
Labels: enhancement (New feature or request), low (Low priority)

Comments

@Jefffrey
Collaborator

ORC encodes timestamps as seconds since the ORC epoch plus nanoseconds since the last second.

We translate this into Arrow's TimestampNanosecond type, i.e. a single i64 of nanoseconds.

However, this may limit the range compared to what an ORC timestamp can actually represent.

We need to consider how to handle this.
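For illustration, a minimal sketch of the conversion being described, assuming values arrive as ORC's (seconds, nanoseconds) pair; the function and constant names are illustrative, not this crate's actual code:

```rust
/// Seconds between the Unix epoch (1970-01-01) and the ORC epoch (2015-01-01).
const ORC_EPOCH_UTC_SECONDS: i64 = 1_420_070_400;

/// Collapse ORC's (seconds since ORC epoch, nanoseconds since last second)
/// pair into a single i64 of nanoseconds since the Unix epoch. An i64 of
/// nanoseconds only covers roughly 1677-09-21 to 2262-04-11, so checked
/// arithmetic returns None for anything outside that window.
fn to_nanos_since_unix_epoch(orc_seconds: i64, nanos: u64) -> Option<i64> {
    let unix_seconds = orc_seconds.checked_add(ORC_EPOCH_UTC_SECONDS)?;
    unix_seconds
        .checked_mul(1_000_000_000)?
        .checked_add(nanos as i64)
}
```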

Jefffrey added the low (Low priority) and enhancement (New feature or request) labels on Apr 1, 2024
@waynexia
Collaborator

#91 adds validation when converting timestamp data. It would be more generic if we could support all four Arrow timestamp precisions (second, millisecond, microsecond, nanosecond).

@progval
Contributor

progval commented May 24, 2024

The options I see:

  1. parse them as null
  2. try to parse whole batches as nanoseconds; if a value overflows, retry the batch as milliseconds, and if it overflows again, as seconds. At each coarser unit, return an error if any value would lose precision (a sketch of this follows the comment)
  3. use statistics (if available) to find the min and max values, pick a precision based on those, then decode. If, while decoding, we see we are losing precision for any value, return an error
  4. allow the user to configure what precision to use somehow, and stick to that

A variant: allow users to opt in to losing precision instead of getting an error, or perhaps to getting null or a saturated value.

What do you think?
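A minimal sketch of option 2's batch-level fallback, assuming the batch is already available as Unix seconds plus sub-second nanoseconds; all names here are hypothetical, not an existing API:

```rust
enum DecodeError {
    Overflow,
    PrecisionLoss,
}

/// Candidate units as (name, divisor relative to nanoseconds).
const UNITS: [(&str, i64); 3] = [("ns", 1), ("ms", 1_000_000), ("s", 1_000_000_000)];

fn decode_batch(seconds: &[i64], nanos: &[u64]) -> Result<(&'static str, Vec<i64>), DecodeError> {
    'units: for (unit, divisor) in UNITS {
        let mut out = Vec::with_capacity(seconds.len());
        for (&s, &n) in seconds.iter().zip(nanos) {
            // At a coarser unit, any sub-unit precision is an error.
            if n as i64 % divisor != 0 {
                return Err(DecodeError::PrecisionLoss);
            }
            match s
                .checked_mul(1_000_000_000 / divisor)
                .and_then(|v| v.checked_add(n as i64 / divisor))
            {
                Some(v) => out.push(v),
                // Overflow: retry the whole batch at the next coarser unit.
                None => continue 'units,
            }
        }
        return Ok((unit, out));
    }
    Err(DecodeError::Overflow)
}
```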

@waynexia
Collaborator

> allow the user to configure what precision to use somehow, and stick to that

Option 4 looks more concrete to me. ORC's spec says all timestamps have nanosecond precision. To avoid losing precision while reading an ORC file, we would have to check whether every timestamp in the file ends with 000 or 000000, which would suggest the writer was actually using micro- or milliseconds. But that check could be expensive (see the sketch below). So providing a way for the caller to specify the expected precision looks easier to implement and to use.
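The check being described might look like the sketch below; it has to scan every nanosecond component in the column, which is why it could be expensive:

```rust
/// Infer the coarsest unit that represents every value exactly by checking
/// trailing zeros of the nanosecond components (illustrative sketch).
fn inferred_precision(nanos: &[u64]) -> &'static str {
    if nanos.iter().all(|&n| n % 1_000_000 == 0) {
        "millisecond" // every value ends with at least 000000
    } else if nanos.iter().all(|&n| n % 1_000 == 0) {
        "microsecond" // every value ends with at least 000
    } else {
        "nanosecond"
    }
}
```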

@Jefffrey
Collaborator Author

Option 4 makes sense to me too. The reader builder options could let the user explicitly configure second/milli/micro precision, and perhaps whether it is strict or not. Strict = fail if a value has higher precision than configured (e.g. it has nanoseconds when TimestampSecond is expected); non-strict = truncate the extra precision (just drop the nanoseconds). A sketch of both behaviours follows.

By default we would use TimestampNanosecond (strict vs. non-strict doesn't matter there, since we can't get any more precise anyway).
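For a single value, the two behaviours might look like this sketch (the divisor would be 1_000_000_000 for TimestampSecond, 1_000_000 for milliseconds, and so on; the function is hypothetical, not an existing API):

```rust
/// Scale one sub-second nanosecond component down to the configured unit.
/// Strict mode fails when the value carries more precision than the unit;
/// non-strict mode silently truncates it. (Illustrative sketch only.)
fn scale_to_unit(nanos: u64, divisor: u64, strict: bool) -> Result<u64, String> {
    if strict && nanos % divisor != 0 {
        return Err(format!("timestamp has sub-unit precision: {nanos} ns"));
    }
    Ok(nanos / divisor)
}
```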

@progval
Contributor

progval commented May 24, 2024

How would you make it configurable per-column? A map from indices to policies?

@Jefffrey
Collaborator Author

> How would you make it configurable per-column? A map from indices to policies?

I hadn't considered this. It could be grouped into some wider configuration where the user provides an expected Arrow schema to decode the whole file into (which would also let them take advantage of dictionary encoding, for example), including the specific timestamp precision per column. Something like the sketch below.
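For example, per-column precision could be expressed through an ordinary Arrow schema that a hypothetical reader-builder option accepts:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

fn expected_schema() -> Arc<Schema> {
    // Per-column timestamp precision expressed via an Arrow schema; a
    // hypothetical `with_schema(...)` reader-builder option could then
    // decode the whole file into it.
    Arc::new(Schema::new(vec![
        Field::new("created_at", DataType::Timestamp(TimeUnit::Millisecond, None), true),
        Field::new("updated_at", DataType::Timestamp(TimeUnit::Nanosecond, None), true),
    ]))
}
```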
