Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit handling of missing values #15

Open
adkabo opened this issue Nov 25, 2021 · 1 comment
Open

Explicit handling of missing values #15

adkabo opened this issue Nov 25, 2021 · 1 comment

Comments

@adkabo
Copy link

adkabo commented Nov 25, 2021

Hello DataKnots team,

I'm looking into the DataKnots project and I'm excited about what I see. It looks like a very powerful tool.

I do have an issue that I'd like to discuss. In my use cases, values are "missing not at random", and I need to treat them with caution. For example, it might be that the lowest true values are always unobserved. Naive behavior when filtering, joining, or aggregating on missing values will lead me to incorrect conclusions.

In base Julia, filter lets me be confident I'm not accidentally dropping significant missing values.

# Note the missing salary.
julia> employee_csv = """
                  name,department,position,salary
                  "ANTHONY A","POLICE","POLICE OFFICER",72510
                  "DANIEL A","FIRE","FIRE FIGHTER-EMT",95484
                  "JAMES A","FIRE","FIRE ENGINEER-EMT",103350
                  "JEFFERY A","POLICE","SERGEANT",101442
                  "NANCY A","POLICE","POLICE OFFICER",80016
                  "ROBERT K","FIRE","FIRE FIGHTER-EMT",
                  """ |> IOBuffer |> CSV.File
6-element CSV.File{false}:
 CSV.Row: (name = "ANTHONY A", department = "POLICE", position = "POLICE OFFICER", salary = 72510)
 CSV.Row: (name = "DANIEL A", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = 95484)
 CSV.Row: (name = "JAMES A", department = "FIRE", position = "FIRE ENGINEER-EMT", salary = 103350)
 CSV.Row: (name = "JEFFERY A", department = "POLICE", position = "SERGEANT", salary = 101442)
 CSV.Row: (name = "NANCY A", department = "POLICE", position = "POLICE OFFICER", salary = 80016)
 CSV.Row: (name = "ROBERT K", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = missing)

julia> filter(x->x.salary < 100_000, employee_csv)
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
 [1] filter(f::var"#11#12", a::CSV.File{false})
   @ Base ./array.jl:2522
 [2] top-level scope
   @ REPL[29]:1

julia> filter(x-> coalesce(x.salary < 100_000, false), employee_csv)
3-element Vector{CSV.Row}:
 CSV.Row: (name = "ANTHONY A", department = "POLICE", position = "POLICE OFFICER", salary = 72510)
 CSV.Row: (name = "DANIEL A", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = 95484)
 CSV.Row: (name = "NANCY A", department = "POLICE", position = "POLICE OFFICER", salary = 80016)

On the other hand, currently DataKnots.jl silently drops missing values.

julia> chicago = DataKnot(:employee => employee_csv);

julia> @query chicago begin
        employee
        filter(salary < 100000)
        end
  │ employee                                        │
  │ name       department  position          salary │
──┼─────────────────────────────────────────────────┼
1 │ ANTHONY A  POLICE      POLICE OFFICER     725102 │ DANIEL A   FIRE        FIRE FIGHTER-EMT   954843 │ NANCY A    POLICE      POLICE OFFICER     80016

Using tools that require me to mentally track missingness and ensure rows aren't silently dropped takes effort I'd rather spend on other parts of my analysis. Tools like Missings.jl's passmissing(f)(x) and f(skipmissing(xs)) make it easier to do this explicitly.


For more discussion, see JuliaData/DataFrames.jl#2499 about joining tables on missing values.

@xitology
Copy link
Member

@adkabo, thank you for your suggestion. DataKnots borrowed the missing semantics from SQL, where NULL being equivalent to FALSE is the default, and, in fact, the only option. But I see your point and perhaps it's worth revisiting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants