ENH: Add read archive function #1440
base: dev
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1440      +/-   ##
==========================================
- Coverage   89.07%   83.16%   -5.91%
==========================================
  Files          87       87
  Lines        5374     6440    +1066
==========================================
+ Hits         4787     5356     +569
- Misses        587     1084     +497
@Sabrina-Hassaim kudos on getting docs and tests working ... kindly work on adding tests for the missing parts, as indicated by the CI. from there we can work on the code itself ... i've got a couple of suggestions but we could do it in steps, after coverage is ok for the existing setup
also @Sabrina-Hassaim kindly close the other PR
4c84cd6 to 36caf34
    return dfs if len(dfs) > 1 else dfs[0]


def _select_files_interactively(compatible_files: list[str]) -> list[str]:
i'm not sure we should support this - is there any benefit to this? @ericmjl @pyjanitor-devs/core-devs thoughts?
I think it's worth keeping around just to see what it might do for the library. If it turns out not to be used very widely we can just deprecate it at a later date. On the other hand, if it's very popular, then we have the benefit of having it around.
    extract_to_df: bool = True,
    file_type: str | None = None,
    selected_files: list[str] | None = None,
) -> pd.DataFrame | list[str]:
i think we should allow for more flexibility, via kwargs, where you can pass extra info to read_csv, read_excel, read_parquet, etc
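A minimal sketch of what the kwargs suggestion could look like, assuming pandas readers and a zip archive. The helper name `read_member` and the `_READERS` mapping are hypothetical and are not taken from the PR:

```python
import io
import os
import zipfile

import pandas as pd

# Hypothetical extension-to-reader mapping (an assumption, not the PR's code).
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
}


def read_member(archive: zipfile.ZipFile, name: str, **kwargs) -> pd.DataFrame:
    """Read one archive member, forwarding **kwargs to the matching pandas reader."""
    suffix = os.path.splitext(name)[1].lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {name!r}") from None
    # Load the member into memory so every reader receives a seekable buffer.
    buffer = io.BytesIO(archive.read(name))
    return reader(buffer, **kwargs)
```

With this shape, a caller could pass reader-specific options such as `sep=";"` or `usecols=[...]` for CSV members without `read_archive` having to name each option in its own signature.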
    extract_to_df: bool = True,
    file_type: str | None = None,
    selected_files: list[str] | None = None,
) -> pd.DataFrame | list[str]:
we should add an engine argument to support other dataframe libraries, e.g. polars. Have a look at some of the IO functions that support polars
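One way the engine argument might be wired up is sketched below. This is only an illustration: the function name `read_csv_bytes` is hypothetical, the existing pyjanitor IO functions may handle engines differently, and polars is imported lazily so that it stays an optional dependency:

```python
import io

import pandas as pd


def read_csv_bytes(data: bytes, engine: str = "pandas", **kwargs):
    """Parse CSV bytes with the requested dataframe library (hypothetical helper)."""
    if engine == "pandas":
        return pd.read_csv(io.BytesIO(data), **kwargs)
    if engine == "polars":
        try:
            import polars as pl  # optional dependency, imported only on demand
        except ImportError as exc:
            raise ImportError(
                "engine='polars' requires polars to be installed"
            ) from exc
        return pl.read_csv(io.BytesIO(data), **kwargs)
    raise ValueError(f"engine must be 'pandas' or 'polars', got {engine!r}")
```

Because the keyword arguments are forwarded to whichever reader is selected, callers would need to pass options that the chosen library understands (for example `sep` for pandas versus `separator` for polars).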
kindly add a line to changelog.md
PR Description
Please describe the changes proposed in the pull request:
1. Implementation of the read_archive Function:
Added a new method to read archive files (.zip, .tar, .tar.gz) and extract their contents as a DataFrame or a list of compatible files.
Supports CSV and Excel file formats within the archives.
2. Unit Tests
**This PR resolves #1171**
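For readers unfamiliar with the approach, the "list of compatible files" step described above could be sketched with the standard library alone. This is not the PR's implementation: the name `list_compatible_files` and the suffix tuple are assumptions made for illustration.

```python
import tarfile
import zipfile

# Assumed set of supported extensions (CSV and Excel, per the description).
COMPATIBLE_SUFFIXES = (".csv", ".xls", ".xlsx")


def list_compatible_files(archive_path: str) -> list[str]:
    """Return the names of CSV/Excel members in a .zip, .tar, or .tar.gz archive."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            names = zf.namelist()
    elif tarfile.is_tarfile(archive_path):
        # The default read mode detects gzip compression transparently.
        with tarfile.open(archive_path) as tf:
            names = [member.name for member in tf.getmembers() if member.isfile()]
    else:
        raise ValueError(f"Not a supported archive: {archive_path!r}")
    return [name for name in names if name.lower().endswith(COMPATIBLE_SUFFIXES)]
```

Detecting the format from the file contents rather than the extension means a misnamed archive is still handled, at the cost of opening the file once more.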
PR Checklist
Please ensure that you have done the following:
… <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
… AUTHORS.md.
… CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
Automatic checks
There will be automatic checks run on the PR. These include:
Relevant Reviewers
Please tag maintainers to review.