filter() after read_parquet_duckdb() silently fails with factor columns #849

@CoryMcCartan

Description

If the parquet file has factor/dictionary columns, read_parquet_duckdb() runs without errors. However, operations such as filtering on the factor column do not work: filter() silently returns zero rows instead of raising an error. Interestingly, other verbs like count() do work.

Clearly related to #88, but I did not see the factor-variable limitation documented anywhere on the duckplyr website, and the issue seems to be an interaction between filter() and read_parquet_duckdb(), since count() works.

Reprex (apologies for the file size, didn't have time to trim further):
county.parquet.zip

library(duckplyr)

d = read_parquet_duckdb("county.parquet")

count(d, contest)
#> # A duckplyr data frame: 2 variables
#>   contest        n
#>   <chr>      <int>
#> 1 house     194040
#> 2 president 181106
#> 3 senate    101726
nrow(filter(d, contest == "president"))
#> [1] 0

d2 = as_duckdb_tibble(collect(d))
count(d2, contest)
#> # A duckplyr data frame: 2 variables
#>   contest        n
#>   <chr>      <int>
#> 1 house     194040
#> 2 president 181106
#> 3 senate    101726
nrow(filter(d2, contest == "president"))
#> [1] 181106
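
A smaller, self-contained variant of the reprex above might look like the sketch below. It assumes the arrow package is installed and that a factor column written by arrow::write_parquet() behaves like the factor/dictionary columns in county.parquet; the toy data frame and its `votes` column are hypothetical stand-ins, and this has not been confirmed to trigger the same behaviour.

```r
library(duckplyr)

# Hypothetical stand-in for county.parquet: a tiny table with one factor column.
df = data.frame(
  contest = factor(c("house", "president", "president", "senate")),
  votes = c(10L, 20L, 30L, 40L)
)

# Assumption: arrow stores the R factor as a dictionary column in the file.
path = tempfile(fileext = ".parquet")
arrow::write_parquet(df, path)

d = read_parquet_duckdb(path)

# Row count when filtering the parquet-backed frame directly.
n_direct = nrow(filter(d, contest == "president"))

# Row count for the same filter after collecting and converting back.
d2 = as_duckdb_tibble(collect(d))
n_collected = nrow(filter(d2, contest == "president"))

# Number of matching rows in the original in-memory data.
n_expected = sum(df$contest == "president")

# If the three values differ, the small file reproduces the problem.
print(c(direct = n_direct, collected = n_collected, expected = n_expected))
```

If `direct` disagrees with `collected` and `expected`, the problem does not depend on the size or contents of the attached file.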
