dplyr_reconstruct can create data.table with corrupted secondary index #7048

Open
ZIBOWANGKANGYU opened this issue Jul 3, 2024 · 1 comment

ZIBOWANGKANGYU commented Jul 3, 2024

Problem

Thanks @AMDraghici for your suggestions!

When the first input to bind_rows is a data.table, the output table can carry a corrupted secondary index because of how the underlying dplyr_reconstruct function handles the attributes of the inputs.

Reprex

The example below shows that the index attribute can be incorrect for the output.

> a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc= runif(4), cold = "c") # Create a data.table
> attributes(a)$.internal.selfref <- new("externalptr") # Set pointer to nil. This is necessary for the subset error below to happen in data.table, but it is not necessary to reproduce the corrupted index.
> a[cola == 4] # Give data.table a secondary index ("cola" column) by auto-indexing
    cola      colb      colc   cold
   <num>     <num>     <num> <char>
1:     4 0.1495679 0.6097216      c
> 
> attributes(a)$index # The secondary index is set correctly
integer(0)
attr(,"__cola")
[1] 2 3 4 1
> 
> b <- data.table::data.table(cola = -1, colb = 2, colc=3, cold = "d")
> 
> combined <- dplyr::bind_rows(list(a,b))
> 
> combined # combined is a data.table, with 5 rows
Index: <cola>
    cola      colb      colc   cold
   <num>     <num>     <num> <char>
1:     5 0.5535855 0.6024416      c
2:     2 0.3407051 0.9291365      c
3:     3 0.5007208 0.6823528      c
4:     4 0.1495679 0.6097216      c
5:    -1 2.0000000 3.0000000      d
> 
> attributes(combined)$index # Wrong! length of secondary index is only 4
integer(0)
attr(,"__cola")
[1] 2 3 4 1
> combined[cola==-1]
Empty data.table (0 rows and 4 cols): cola,colb,colc,cold # Wrong! The last row of combined should be returned
> combined
Index: <cola>
    cola       colb      colc   cold
   <num>      <num>     <num> <char>
1:     1 0.83105427 0.4214379      c
2:     2 0.05702599 0.1354883      c
3:     3 0.63866251 0.1644736      c
4:     4 0.21441544 0.2198251      c
5:    -1 2.00000000 3.0000000      d

Cause

In the bind_rows function, dplyr_reconstruct is used to set attributes for the output dataframe.

out <- dplyr_reconstruct(out, first)

Looking at the dplyr_reconstruct function, it essentially copies every attribute of template_ other than names and row.names onto data.

SEXP ffi_dplyr_reconstruct(SEXP data, SEXP template_) {

In the case above, all attributes of first (which has four rows), including index, are copied to out, which has five rows. The index therefore still describes the ordering of only four rows, and this causes the problem.
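The effect can be illustrated without dplyr or data.table. The sketch below is not dplyr's implementation: copy_attributes is a hypothetical base R helper that mimics the behaviour described above (copy everything except names and row.names), and the index attribute is built by hand on a plain data.frame to stand in for the one data.table creates.

```r
# Hypothetical helper, for illustration only: copy every attribute of
# `template` except names and row.names onto `data`.
copy_attributes <- function(data, template) {
  attrs <- attributes(template)
  for (nm in setdiff(names(attrs), c("names", "row.names"))) {
    attr(data, nm) <- attrs[[nm]]
  }
  data
}

# Four-row "first" input with a hand-built stand-in for a secondary index:
# the row order that sorts cola.
first <- data.frame(cola = c(5, 2, 3, 4))
attr(first, "index") <- structure(integer(0), "__cola" = order(first$cola))

# Five-row result, as after binding one extra row.
out <- data.frame(cola = c(5, 2, 3, 4, -1))
out <- copy_attributes(out, first)

nrow(out)                                   # 5 rows
length(attr(attr(out, "index"), "__cola"))  # index still covers only 4 rows
```

The copied index still has four entries, so it cannot describe the fifth row.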

Impact

Because the data.table produced by bind_rows has a corrupted secondary index, data.table skips some rows when filtering on the indexed column.
Also, I found that this problem is not limited to bind_rows. Other dplyr functions that call dplyr_reconstruct can produce data.tables with a corrupted secondary index. For example, the full_join function can also return unexpected results because of a corrupted secondary index.

> a <- data.table::data.table(cola = c(1:4), colb = runif(4), colc= runif(4), cold = "d")
> 
> attributes(a)$.internal.selfref <- new("externalptr") # Set pointer to nil
> a[cola == 3]
    cola      colb      colc   cold
   <int>     <num>     <num> <char>
1:     3 0.9968646 0.8137836      d
> 
> b <- data.table::data.table(cola = -1, cole = "e")
> 
> combined <- dplyr::full_join(a, b, by = "cola")
> 
> combined[cola==-1]
Empty data.table (0 rows and 5 cols): cola,colb,colc,cold,cole
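
One possible way to detect such a stale index is sketched below. It uses base R only and hand-built attributes on plain data.frames. index_is_consistent is a hypothetical helper, not part of dplyr or data.table, and it rests on an assumption about data.table internals: that an empty "__col" entry means the rows were already sorted by that column when the index was built.

```r
# Hypothetical check, for illustration only: does the "__<col>" entry of the
# index attribute still describe a valid ordering of column `col`?
index_is_consistent <- function(df, col) {
  idx <- attr(attr(df, "index", exact = TRUE), paste0("__", col), exact = TRUE)
  if (is.null(idx)) return(TRUE)  # no index on this column
  # Assumption: an empty entry means "already sorted by this column".
  if (length(idx) == 0L) idx <- seq_len(nrow(df))
  length(idx) == nrow(df) && !is.unsorted(df[[col]][idx])
}

# Shape of the bind_rows case: a 4-entry index on a 5-row table.
bound <- data.frame(cola = c(5, 2, 3, 4, -1))
attr(bound, "index") <- structure(integer(0), "__cola" = c(2L, 3L, 4L, 1L))

# Shape of the full_join case: "already sorted" marker, but the new row breaks the order.
joined <- data.frame(cola = c(1, 2, 3, 4, -1))
attr(joined, "index") <- structure(integer(0), "__cola" = integer(0))

index_is_consistent(bound, "cola")   # FALSE
index_is_consistent(joined, "cola")  # FALSE
```

Under that assumption, a length comparison alone would miss the full_join case, which is why the sketch also checks that the column is actually sorted in index order.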

@etiennebacher commented

Just putting the same examples with clearer (IMO) formatting:

Example 1

# Create a data.table
a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc= runif(4), cold = "c") 
# Set pointer to nil. This is necessary for the subset error below to happen
# in data.table, but it is not necessary to reproduce the corrupted index.
attributes(a)$.internal.selfref <- new("externalptr") 
# Give data.table a secondary index ("cola" column) by auto-indexing
a[cola == 4] 
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     4 0.8401062 0.09284545      c
# The secondary index is set correctly
attributes(a)$index 
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
b <- data.table::data.table(cola = -1, colb = 2, colc=3, cold = "d")
combined <- dplyr::bind_rows(list(a,b))
# combined is a data.table, with 5 rows
combined 
#> Index: <cola>
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     5 0.4526811 0.38061661      c
#> 2:     2 0.6131192 0.28859921      c
#> 3:     3 0.7053851 0.85011065      c
#> 4:     4 0.8401062 0.09284545      c
#> 5:    -1 2.0000000 3.00000000      d
# Wrong! length of secondary index is only 4
attributes(combined)$index 
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
combined[cola==-1]
#> Error: Internal error: index 'cola' exists but is invalid
combined
#> Index: <cola>
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     5 0.4526811 0.38061661      c
#> 2:     2 0.6131192 0.28859921      c
#> 3:     3 0.7053851 0.85011065      c
#> 4:     4 0.8401062 0.09284545      c
#> 5:    -1 2.0000000 3.00000000      d

Example 2

a <- data.table::data.table(cola = c(1:4), colb = runif(4), colc= runif(4), cold = "d")
# Set pointer to nil
attributes(a)$.internal.selfref <- new("externalptr") 
a[cola == 3]
#>     cola      colb      colc   cold
#>    <int>     <num>     <num> <char>
#> 1:     3 0.1962404 0.8902132      d
b <- data.table::data.table(cola = -1, cole = "e")
combined <- dplyr::full_join(a, b, by = "cola")
combined
#> Index: <cola>
#>     cola      colb      colc   cold   cole
#>    <num>     <num>     <num> <char> <char>
#> 1:     1 0.1566911 0.6529508      d   <NA>
#> 2:     2 0.7213704 0.9832597      d   <NA>
#> 3:     3 0.1962404 0.8902132      d   <NA>
#> 4:     4 0.5184152 0.3268725      d   <NA>
#> 5:    -1        NA        NA   <NA>      e
combined[cola==-1]
#> Empty data.table (0 rows and 5 cols): cola,colb,colc,cold,cole