implement similar functions for polars #1352
Comments
the current |
I assumed (wrongly) that polars' join maintains order (it only does so for left join). need to rethink the computation logic for |
eagerly awaiting the 0.28.0 release! |
@3SMMZRjWgS version 0.28.0 is released. would love feedback on the functions - would also love PRs if you are interested. |
|
example below about the performance hit for a single column extraction:
import polars as pl
import janitor.polars
In [58]: df = pl.DataFrame(
...: {
...: "Sepal.Length": [5.1, 5.9],
...: "Sepal.Width": [3.5, 3.0],
...: "Petal.Length": [1.4, 5.1],
...: "Petal.Width": [0.2, 1.8],
...: "Species": ["setosa", "virginica"],
...: }
...: )
...: df
Out[58]:
shape: (2, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
DF = pl.concat([df]*5_000_000,rechunk=True)
orig=(DF
.select('Species',
pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'),
pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal'))
.unpivot(index='Species', variable_name='part').unnest('value')
)
other=DF.pivot_longer(index='Species', names_sep='.', names_to = ('part', '.value'))
In [72]: orig.sort(pl.all()).equals(other.sort(pl.all()))
Out[72]: True
In [73]: %timeit orig=DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal')).unpivot(index='Species', variable_name='part').unnest('value')
95.9 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [75]: %timeit other=DF.pivot_longer(index='Species', names_sep='.', names_to = ('part', '.value'))
188 ms ± 6.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
a 2x performance slowdown of
In [76]: another=DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to = ('part1', 'part2', '.value'))
In [77]: another
Out[77]:
shape: (20_000_000, 5)
┌───────────┬───────┬───────┬────────┬───────┐
│ Species   ┆ part1 ┆ part2 ┆ Length ┆ Width │
│ ---       ┆ ---   ┆ ---   ┆ ---    ┆ ---   │
│ str       ┆ str   ┆ str   ┆ f64    ┆ f64   │
╞═══════════╪═══════╪═══════╪════════╪═══════╡
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ virginica ┆ Pe    ┆ tal   ┆ 5.1    ┆ 1.8   │
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ virginica ┆ Pe    ┆ tal   ┆ 5.1    ┆ 1.8   │
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ …         ┆ …     ┆ …     ┆ …      ┆ …     │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
│ setosa    ┆ Se    ┆ pal   ┆ 5.1    ┆ 3.5   │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
│ setosa    ┆ Se    ┆ pal   ┆ 5.1    ┆ 3.5   │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
└───────────┴───────┴───────┴────────┴───────┘
In [78]: %timeit another=DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to = ('part1', 'part2', '.value'))
204 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [86]: DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0,length=2), part2=pl.col.variable.str.slice(offset=2)).drop('variable')
Out[86]:
shape: (20_000_000, 5)
┌───────────┬────────┬───────┬───────┬───────┐
│ Species   ┆ Length ┆ Width ┆ part1 ┆ part2 │
│ ---       ┆ ---    ┆ ---   ┆ ---   ┆ ---   │
│ str       ┆ f64    ┆ f64   ┆ str   ┆ str   │
╞═══════════╪════════╪═══════╪═══════╪═══════╡
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ virginica ┆ 5.9    ┆ 3.0   ┆ Se    ┆ pal   │
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ virginica ┆ 5.9    ┆ 3.0   ┆ Se    ┆ pal   │
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ …         ┆ …      ┆ …     ┆ …     ┆ …     │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
│ setosa    ┆ 1.4    ┆ 0.2   ┆ Pe    ┆ tal   │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
│ setosa    ┆ 1.4    ┆ 0.2   ┆ Pe    ┆ tal   │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
└───────────┴────────┴───────┴───────┴───────┘
In [85]: %timeit DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0,length=2), part2=pl.col.variable.str.slice(offset=2)).drop('variable')
301 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's just a crude example of where |
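For readers unfamiliar with the `names_sep` / `'.value'` combination benchmarked above, the reshaping it performs can be illustrated in plain Python on the two-row frame. This is only a sketch of the semantics as they appear in the outputs shown (each `Part.Measure` column name is split, the first piece goes to a `part` column and the second piece becomes an output column); the function name `pivot_longer_sketch` is hypothetical and this is not how pyjanitor or polars implement it.

```python
rows = [
    {"Sepal.Length": 5.1, "Sepal.Width": 3.5,
     "Petal.Length": 1.4, "Petal.Width": 0.2, "Species": "setosa"},
    {"Sepal.Length": 5.9, "Sepal.Width": 3.0,
     "Petal.Length": 5.1, "Petal.Width": 1.8, "Species": "virginica"},
]


def pivot_longer_sketch(rows, index, sep):
    """Reshape wide dict rows to long form, splitting column names on `sep`."""
    long_rows = []
    for row in rows:
        grouped = {}
        for name, value in row.items():
            if name == index:
                continue
            # "Sepal.Length" -> part "Sepal", output column "Length"
            part, value_name = name.split(sep, 1)
            grouped.setdefault(part, {})[value_name] = value
        # One output row per (input row, part) pair.
        for part, values in grouped.items():
            long_rows.append({index: row[index], "part": part, **values})
    return long_rows


long_rows = pivot_longer_sketch(rows, index="Species", sep=".")
```

Row order here is row-major, which differs from the column-by-column stacking of `unpivot`; that is why the comparison in the session sorts both frames before calling `equals`.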
the long rant above does leave a question though - can we possibly speed up |
@samukweku I just wanted to let you know that the Other join-types for Either way, I am looking forward to using pyjanitor alongside polars in the future 🚀 |
@Phil-Garmann thanks for the feedback; it is much appreciated. I'll keep an eye on the progress for |
in relation to #1343 - this is a list of functions missing in the polars library that could be implemented:
clean_names
pivot_longer
pivot_wider
xlsx_tables
xlsx_cells
read_commandline
conditional_join: polars has a pl.join_where to cover this
complete
expand_grid: pl.join with how='cross' covers this
convert_excel_date
convert_matlab_date
convert_unix_date: pl.from_epoch covers this
bin_numeric: pl.Expr.cut covers this
concatenate_columns: can be replicated with pl.concat_str
deconcatenate_columns: pl.Expr.str.split covers this
factorize_columns: pl.rank(dense) or pl.Expr.to_physical covers this
get_dupes: Expr.is_duplicated() covers this
jitter
limit_column_characters
min_max_scale
move: can be replicated with polars' selectors
row_to_names
shuffle: pl.Expr.shuffle covers this
sort_naturally
take_first: group_by.first() covers this
also
Care should be taken not to create a function if polars already offers a solution for it, whether under a different name or as a combination of existing polars functions that covers all of its use cases.