Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add read archive function #1440

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

Sabrina-Hassaim
Copy link

@Sabrina-Hassaim Sabrina-Hassaim commented Jan 25, 2025

PR Description

Please describe the changes proposed in the pull request:
1. Implementation of the read_archive Function:

  • Added a new method to read archive files (.zip, .tar, .tar.gz) and extract their contents as a DataFrame or a list of compatible files.

  • Supports CSV and Excel file formats within the archives.

2. Unit Tests

  • Added tests to validate the behavior of the read_archive method:
  • Ensures correct reading of files from .zip and .tar.gz formats.
  • Handles cases where the file is not a valid archive or does not contain compatible files.
  • Tests include interactive behavior for file selection.

**This PR resolves #1171 **

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.md.
  1. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

Copy link

codecov bot commented Jan 25, 2025

Codecov Report

Attention: Patch coverage is 75.75758% with 16 lines in your changes missing coverage. Please review.

Project coverage is 83.16%. Comparing base (6e77fbc) to head (e0b0bb6).
Report is 45 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1440      +/-   ##
==========================================
- Coverage   89.07%   83.16%   -5.91%     
==========================================
  Files          87       87              
  Lines        5374     6440    +1066     
==========================================
+ Hits         4787     5356     +569     
- Misses        587     1084     +497     

@samukweku
Copy link
Collaborator

@Sabrina-Hassaim kudos on getting docs and tests working ... kindly work on adding tests for the missing parts, as indicated by the CI. from there we can work on the code itself ... i've got a couple of suggestions but we could do it in steps, after coverage is ok for the existing setup

@samukweku
Copy link
Collaborator

also @Sabrina-Hassaim kindly close the other PR

@Sabrina-Hassaim Sabrina-Hassaim force-pushed the Sabrina_Hassaim/read_archive branch from 4c84cd6 to 36caf34 Compare January 27, 2025 10:16
return dfs if len(dfs) > 1 else dfs[0]


def _select_files_interactively(compatible_files: list[str]) -> list[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure we should support this - is there any benefit to this? @ericmjl @pyjanitor-devs/core-devs thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth keeping around just to see what it might do for the library. If it turns out not to be used very widely we can just deprecate it at a later date. On the other hand, if it's very popular, then we have the benefit of having it around.

extract_to_df: bool = True,
file_type: str | None = None,
selected_files: list[str] | None = None,
) -> pd.DataFrame | list[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should allow for more flexibility, via kwargs, where you can pass extra info to read_csv, read_excel, read_parquet, etc

extract_to_df: bool = True,
file_type: str | None = None,
selected_files: list[str] | None = None,
) -> pd.DataFrame | list[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add an engine argument to support other dataframe libraries, e.g polars. Have a look at some of the IO functions that support polars

@samukweku
Copy link
Collaborator

kindly add a line to changelog.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read files from archive - zip, tar, etc
3 participants