polars xlsx_cells #1358

Merged: 61 commits, Jun 3, 2024
Commits
862c7dd  add make_clean_names function that can be applied to polars  (Apr 19, 2024)
01531cc  add examples for make_clean_names  (Apr 20, 2024)
0fb440e  changelog  (Apr 20, 2024)
5e944b2  limit import location for polars  (Apr 20, 2024)
501d9c6  limit import location for polars  (Apr 20, 2024)
9506832  fix polars in environment-dev.yml  (Apr 20, 2024)
1ae8edd  install polars in doctest  (Apr 20, 2024)
3b1829b  limit polars imports - user should have polars already installed  (Apr 20, 2024)
52fd80c  use subprocess.run  (Apr 20, 2024)
2dce78b  add subprocess.devnull to docstrings  (Apr 20, 2024)
37b3feb  add subprocess.devnull to docstrings  (Apr 20, 2024)
0953f2d  add subprocess.devnull to docstrings  (Apr 20, 2024)
d7c71b6  add subprocess.devnull to docstrings  (Apr 20, 2024)
40b8502  add os.devnull  (Apr 20, 2024)
4f11d09  add polars as requirement for docs  (Apr 20, 2024)
54b179c  add polars to tests requirements  (Apr 20, 2024)
25b39b9  delete irrelevant folder  (Apr 20, 2024)
a09f34b  changelog  (Apr 20, 2024)
1b375f8  create submodule for polars  (Apr 21, 2024)
799532f  fix doctests  (Apr 21, 2024)
dbce4b9  fix tests; add polars to documentation  (Apr 21, 2024)
1c642e6  fix tests; add polars to documentation  (Apr 21, 2024)
407d21b  import janitor.polars  (Apr 21, 2024)
aedfc65  control docs output for polars submodule  (Apr 21, 2024)
db9b486  exclude functions in docs rendering  (Apr 21, 2024)
6a91e67  exclude functions in docs rendering  (Apr 21, 2024)
7a88078  show_submodules=true  (Apr 21, 2024)
6d7885e  fix docstring rendering for polars  (Apr 21, 2024)
944fa02  Expression -> expression  (Apr 21, 2024)
b9aefaa  Merge dev into samukweku/polars_clean_names  (ericmjl, Apr 23, 2024)
e9c370a  rename functions.py  (Apr 23, 2024)
ee66d2a  pivot_longer implemented for polars  (Apr 29, 2024)
959b082  changelog  (Apr 30, 2024)
3177503  keep changes related only to pivot_longer  (Apr 30, 2024)
ee899b2  pd -> pl  (Apr 30, 2024)
8ea9b71  pd -> pl  (Apr 30, 2024)
d12ae1a  df.pivot_longer -> df.janitor.pivot_longer  (Apr 30, 2024)
652f3e3  df.pivot_longer -> df.janitor.pivot_longer  (Apr 30, 2024)
9b9c1a9  pd -> pl  (Apr 30, 2024)
69c273f  pd -> pl  (Apr 30, 2024)
b3391e8  add >>> df  (Apr 30, 2024)
4ffaac5  add >>> df  (Apr 30, 2024)
1de57bb  keep changes related only to polars pivot_longer  (Apr 30, 2024)
e495790  add polars support to read_commandline  (May 1, 2024)
a5c331a  remove irrelevant files  (May 1, 2024)
4d9c35f  minor edit to docs  (May 1, 2024)
3b781c1  xlsx_table now supports polars  (May 1, 2024)
5364f8d  xlsx_cells now supports polars  (May 1, 2024)
bceefe8  changelog  (May 1, 2024)
f6795f8  docs fix  (May 1, 2024)
0264109  docs fix  (May 1, 2024)
d580316  docs fix  (May 1, 2024)
de59dfa  docs fix  (May 1, 2024)
9de6065  docs fix  (May 1, 2024)
9c1b725  Merge dev into samukweku/polars_xlsx_cells  (ericmjl, May 6, 2024)
9f50e3b  Merge dev into samukweku/polars_xlsx_cells  (ericmjl, May 10, 2024)
0644b78  Merge dev into samukweku/polars_xlsx_cells  (ericmjl, May 19, 2024)
4e00d71  Merge dev into samukweku/polars_xlsx_cells  (ericmjl, May 23, 2024)
02ede24  Merge dev into samukweku/polars_xlsx_cells  (ericmjl, May 27, 2024)
2f0302e  Merge dev into samukweku/polars_xlsx_cells  (ericmjl, Jun 2, 2024)
5ad131f  Merge branch 'dev' into samukweku/polars_xlsx_cells  (ericmjl, Jun 3, 2024)
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,8 +1,9 @@
# Changelog

## [Unreleased]
- [ENH] `xlsx_table` function now supports polars - Issue #1352

- [ENH] `xlsx_cells` function now supports polars - Issue #1352
- [ENH] `xlsx_table` function now supports polars - Issue #1352
- [ENH] Improved performance for non-equi joins when using numba - @samukweku PR #1341
- [ENH] Added a `clean_names` method for polars - it can be used to clean the column names, or clean column values. Issue #1343

77 changes: 70 additions & 7 deletions janitor/io.py
@@ -337,21 +337,24 @@ def xlsx_cells(
border: bool = False,
protection: bool = False,
comment: bool = False,
engine: str = "pandas",
**kwargs: Any,
) -> Union[dict, pd.DataFrame]:
) -> Mapping:
"""Imports data from spreadsheet without coercing it into a rectangle.

Each cell is represented by a row in a dataframe, and includes the
cell's coordinates, the value, row and column position.
The cell formatting (fill, font, border, etc) can also be accessed;
usually this is returned as a dictionary in the cell, and the specific
cell format attribute can be accessed using `pd.Series.str.get`.
cell format attribute can be accessed using `pd.Series.str.get`
or `pl.Series.struct.field` if it is a polars DataFrame.

Inspiration for this comes from R's [tidyxl][link] package.
[link]: https://nacnudus.github.io/tidyxl/reference/tidyxl.html

Examples:
>>> import pandas as pd
>>> import polars as pl
>>> from janitor import xlsx_cells
>>> pd.set_option("display.max_columns", None)
>>> pd.set_option("display.expand_frame_repr", False)
@@ -398,6 +401,40 @@ def xlsx_cells(
7 00000000
Name: fill, dtype: object

Access cell formatting in a polars DataFrame:

>>> out = xlsx_cells(filename, sheetnames="highlights", engine='polars', fill=True).get_column('fill')
>>> out
shape: (8,)
Series: 'fill' [struct[3]]
[
{null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
{null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
{null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
{null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
{"solid",{"FFFFFF00","rgb",0.0},{"FFFFFF00","rgb",0.0}}
{"solid",{"FFFFFF00","rgb",0.0},{"FFFFFF00","rgb",0.0}}
{null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
{null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
]

Specific cell attributes can be accessed via Polars' struct:

>>> out.struct.field('fgColor').struct.field('rgb')
shape: (8,)
Series: 'rgb' [str]
[
"00000000"
"00000000"
"00000000"
"00000000"
"FFFFFF00"
"FFFFFF00"
"00000000"
"00000000"
]


Args:
path: Path to the Excel File. It can also be an openpyxl Workbook.
sheetnames: Names of the sheets from which the cells are to be extracted.
@@ -426,6 +463,7 @@ def xlsx_cells(
It is usually returned as a dictionary.
comment: If `True`, return comment properties of the cell.
It is usually returned as a dictionary.
engine: DataFrame engine. Should be either pandas or polars.
**kwargs: Any other attributes of the cell, that can be accessed from openpyxl.

Raises:
@@ -434,7 +472,7 @@ def xlsx_cells(
is not an openpyxl cell attribute.

Returns:
A pandas DataFrame, or a dictionary of DataFrames.
A DataFrame, or a dictionary of DataFrames.
""" # noqa : E501

try:
@@ -462,6 +500,21 @@ def xlsx_cells(
path = load_workbook(
filename=path, read_only=read_only, keep_links=False
)
if engine not in {"pandas", "polars"}:
raise ValueError("engine should be one of pandas or polars.")
base_engine = pd
Review comment (Member):
@samukweku what are your thoughts on assigning base_engine whenever we dispatch differently between polars and pandas? Does it make sense for us to use one pattern instead?
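
One way to read the question above is as a request for a single helper that both validates `engine` and returns the module used to build frames, so that each reader function does not repeat the `base_engine = pd` / `if engine == "polars"` branch. A minimal sketch, assuming the hypothetical names `resolve_engine` and `SUPPORTED_ENGINES` (neither exists in pyjanitor) and raising a plain `ImportError` where the PR calls `import_message`:

```python
import importlib
from types import ModuleType
from typing import Tuple

# Hypothetical constant: the engines accepted by the reader functions in this PR.
SUPPORTED_ENGINES: Tuple[str, ...] = ("pandas", "polars")


def resolve_engine(
    engine: str, supported: Tuple[str, ...] = SUPPORTED_ENGINES
) -> ModuleType:
    """Validate `engine` and lazily import the matching module."""
    if engine not in supported:
        raise ValueError(f"engine should be one of {' or '.join(supported)}.")
    try:
        # Import only when requested, so polars stays an optional dependency.
        return importlib.import_module(engine)
    except ImportError as exc:
        raise ImportError(
            f"{engine} is required for engine={engine!r}; install it first."
        ) from exc
```

With a helper like this, a caller would write `base_engine = resolve_engine(engine)` once and finish with `base_engine.DataFrame(frame)`, which is how `_xlsx_cells` already builds its result.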

if engine == "polars":
try:
import polars as pl

base_engine = pl
except ImportError:
import_message(
submodule="polars",
package="polars",
conda_channel="conda-forge",
pip_install=True,
)
# start_point and end_point apply if the user is interested in
# only a subset of the Excel File and knows the coordinates
if start_point or end_point:
@@ -533,6 +586,7 @@ def xlsx_cells(
start_point,
end_point,
include_blank_cells,
base_engine=base_engine,
)
for sheetname in sheetnames
}
@@ -552,6 +606,7 @@ def _xlsx_cells(
start_point: Union[str, int],
end_point: Union[str, int],
include_blank_cells: bool,
base_engine,
):
"""
Function to process a single sheet. Returns a DataFrame.
@@ -567,7 +622,7 @@ def _xlsx_cells(
path_is_workbook: True/False.

Returns:
A pandas DataFrame.
A DataFrame.
"""

if start_point:
@@ -579,15 +634,23 @@ def _xlsx_cells(
if (cell.value is None) and (not include_blank_cells):
continue
for value in defaults:
frame[value].append(getattr(cell, value, None))
outcome = getattr(cell, value, None)
if value.startswith("is_"):
pass
elif outcome is not None:
outcome = str(outcome)
frame[value].append(outcome)
for parent, boolean_value in parameters.items():
check(f"The value for {parent}", boolean_value, [bool])
if not boolean_value:
continue
boolean_value = _object_to_dict(getattr(cell, parent, None))
if isinstance(boolean_value, dict) or (boolean_value is None):
pass
else:
boolean_value = str(boolean_value)
frame[parent].append(boolean_value)

return pd.DataFrame(frame, copy=False)
return base_engine.DataFrame(frame)


def _object_to_dict(obj):
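
The per-cell loop in `_xlsx_cells` now normalizes values before appending them. Default attributes are cast to `str` unless the attribute name starts with `is_` or the value is `None`; formatting attributes are kept only when they are a `dict` or `None`, and are cast to `str` otherwise. The sketch below restates that branch logic as two standalone helpers. `normalize_default` and `normalize_format` are hypothetical names, and the sample cells are made up for illustration. The likely motivation (giving each column one consistent type, which polars checks more strictly than pandas) is an inference, not something stated in the PR.

```python
from collections import defaultdict
from typing import Any


def normalize_default(name: str, value: Any) -> Any:
    """Keep `is_*` flags and None untouched; stringify everything else."""
    if name.startswith("is_") or value is None:
        return value
    return str(value)


def normalize_format(value: Any) -> Any:
    """Keep dicts and None untouched; stringify any other object."""
    if isinstance(value, dict) or value is None:
        return value
    return str(value)


# Made-up cells standing in for openpyxl cell attributes.
cells = [
    {"value": 1, "is_date": False},
    {"value": "text", "is_date": False},
    {"value": None, "is_date": True},
]

frame = defaultdict(list)
for cell in cells:
    for name, value in cell.items():
        frame[name].append(normalize_default(name, value))

print(dict(frame))
# {'value': ['1', 'text', None], 'is_date': [False, False, True]}
```

After this step the `value` column holds only strings and `None`, while the `is_date` flags stay boolean, so either engine's `DataFrame` constructor receives columns of a single type.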