Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] date and timestamp parsers #2353

Merged
merged 14 commits into from
Jun 18, 2024

Conversation

murex971
Copy link
Contributor

@murex971 murex971 commented Jun 7, 2024

Resolves #2309
Resolves #2302

@murex971 murex971 marked this pull request as draft June 7, 2024 20:15
@github-actions github-actions bot added documentation Improvements or additions to documentation enhancement New feature or request labels Jun 7, 2024
@murex971 murex971 marked this pull request as ready for review June 8, 2024 11:12
@murex971 murex971 changed the title [FEAT] string to date function Jun 11, 2024
@murex971 murex971 changed the title [FEAT] string to date and datetime functions Jun 11, 2024
@samster25 samster25 requested review from kevinzwang and colin-ho and removed request for kevinzwang June 11, 2024 18:15
Copy link
Contributor

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work so far. Overall, to_date looks good, and to_datetime can be enhanced to support timeunit inference and timezones

@@ -1436,6 +1436,30 @@ def substr(self, start: int | Expression, length: int | Expression | None = None
length_expr = Expression._to_expression(length)
return Expression._from_pyexpr(self._expr.utf8_substr(start_expr._expr, length_expr._expr))

def to_date(self, format: str | Expression) -> Expression:
"""Converts a string to a date using the specified format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add link to the chrono docs here that specify valid format strings? I believe its here: https://docs.rs/chrono/latest/chrono/format/strftime/index.html

And also could you do the same for to_datetime as well

@@ -1436,6 +1436,30 @@ def substr(self, start: int | Expression, length: int | Expression | None = None
length_expr = Expression._to_expression(length)
return Expression._from_pyexpr(self._expr.utf8_substr(start_expr._expr, length_expr._expr))

def to_date(self, format: str | Expression) -> Expression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets change this to just accept str. See the comment below for changing to_datetime to accept a single str as well.

format_expr = Expression._to_expression(format)
return Expression._from_pyexpr(self._expr.utf8_to_date(format_expr._expr))

def to_datetime(self, format: str | Expression) -> Expression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to support different timeunits, as well as datetimes with timezones. To do this, lets change the function signature to to_datetime(self, format : str, timezone: str | None = None)

By constraining format to be a single str, we can infer the correct timeunit when we call to_field in src/daft-dsl/src/functions/utf8/to_datetime.rs by checking if the format string contains nanosecond, i.e. %9f or %.9f, or millisecond, i.e. %.3f or %3f, precision, otherwise default to microsecond.

By allowing the timezone: str | None = None argument, we can parse the string using chrono::NaiveDateTime::parse_from_str or chrono::DateTime::parse_from_str at the array level.

with pytest.raises(ValueError):
s.str.to_date(formats)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add tests for datetimes with timezones as well. A particular edge case to consider would be columns with different timezones, where each entry should be converted to the user provided timezone.

Copy link

codecov bot commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 79.06977% with 36 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@7d613a6). Learn more about missing BASE report.
Report is 26 commits behind head on main.

Current head 9179eb5 differs from pull request most recent head 1e69c73

Please upload reports for the commit 1e69c73 to get more accurate results.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2353   +/-   ##
=======================================
  Coverage        ?   63.51%           
=======================================
  Files           ?      908           
  Lines           ?   102078           
  Branches        ?        0           
=======================================
  Hits            ?    64831           
  Misses          ?    37247           
  Partials        ?        0           
Files Coverage Δ
daft/expressions/expressions.py 93.07% <100.00%> (ø)
src/daft-core/src/python/series.rs 94.52% <100.00%> (ø)
src/daft-core/src/series/ops/utf8.rs 93.40% <100.00%> (ø)
src/daft-dsl/src/functions/utf8/mod.rs 99.53% <100.00%> (ø)
src/daft-dsl/src/python.rs 91.82% <100.00%> (ø)
daft/series.py 91.85% <87.50%> (ø)
src/daft-core/src/array/ops/utf8.rs 93.68% <94.02%> (ø)
src/daft-dsl/src/functions/utf8/to_datetime.rs 42.30% <42.30%> (ø)
src/daft-dsl/src/functions/utf8/to_date.rs 44.82% <44.82%> (ø)
@murex971 murex971 requested a review from colin-ho June 17, 2024 18:12
@samster25
Copy link
Member

@murex971 Is this ready for rereview?

@colin-ho colin-ho merged commit 042f483 into Eventual-Inc:main Jun 18, 2024
41 checks passed
@colin-ho
Copy link
Contributor

Thanks for this @murex971 ! I just made some additional commits to get this to the finish line, didn't think another round of back and forth was needed :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
3 participants