ENH: Add read archive function #1440

Sabrina-Hassaim · 2025-01-25T21:14:03Z

PR Description

Please describe the changes proposed in the pull request:
1. Implementation of the read_archive Function:

Added a new method to read archive files (.zip, .tar, .tar.gz) and extract their contents as a DataFrame or a list of compatible files.
Supports CSV and Excel file formats within the archives.

2. Unit Tests

Added tests to validate the behavior of the read_archive method:
Ensures correct reading of files from .zip and .tar.gz formats.
Handles cases where the file is not a valid archive or does not contain compatible files.
Tests include interactive behavior for file selection.
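The file-discovery step described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the name `list_compatible_files` and the exact suffix list are assumptions, and reading the members into DataFrames is left out.

```python
import tarfile
import zipfile

# Assumption: "CSV and Excel" is taken to mean these suffixes.
COMPATIBLE_SUFFIXES = (".csv", ".xlsx", ".xls")


def list_compatible_files(archive_path: str) -> list[str]:
    """Return the names of CSV/Excel members in a .zip, .tar or .tar.gz file."""
    path = str(archive_path)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            names = archive.namelist()
    elif tarfile.is_tarfile(path):
        # The default mode "r" detects gzip compression transparently.
        with tarfile.open(path) as archive:
            names = [m.name for m in archive.getmembers() if m.isfile()]
    else:
        raise ValueError(f"{path!r} is not a valid .zip or .tar archive")

    compatible = [n for n in names if n.lower().endswith(COMPATIBLE_SUFFIXES)]
    if not compatible:
        raise ValueError(f"{path!r} contains no CSV or Excel files")
    return compatible
```

The two `ValueError` branches correspond to the two failure cases the unit tests are described as covering: an invalid archive and an archive without compatible files.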

**This PR resolves #1171**

PR Checklist

Please ensure that you have done the following:

Open the PR from a feature branch on your fork. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

If you're not on the contributors list, add yourself to AUTHORS.md.

Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

codecov · 2025-01-25T21:26:44Z

Codecov Report

Attention: Patch coverage is 75.75758% with 16 lines in your changes missing coverage. Please review.

Project coverage is 83.16%. Comparing base (6e77fbc) to head (e0b0bb6).
Report is 45 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #1440      +/-   ##
==========================================
- Coverage   89.07%   83.16%   -5.91%     
==========================================
  Files          87       87              
  Lines        5374     6440    +1066     
==========================================
+ Hits         4787     5356     +569     
- Misses        587     1084     +497
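The percentages in the diff can be reproduced from the hit and line counts. A small sketch, assuming the report truncates to two decimals rather than rounding (89.077% is shown as 89.07%), and assuming the patch totals 66 changed lines, which is what 16 missed lines at 75.75758% implies:

```python
def truncated_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    return (hits * 10_000 // lines) / 100


base = truncated_percent(4787, 5374)  # dev
head = truncated_percent(5356, 6440)  # this PR
delta = round(head - base, 2)

patch_lines = 66  # inferred from the report, not stated in it
patch_missed = 16
patch = 100 * (patch_lines - patch_missed) / patch_lines

print(f"project: {base:.2f}% -> {head:.2f}% ({delta:+.2f}%)")
print(f"patch coverage: {patch:.5f}%")
```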

samukweku · 2025-01-26T02:08:39Z

@Sabrina-Hassaim kudos on getting docs and tests working ... kindly work on adding tests for the missing parts, as indicated by the CI. from there we can work on the code itself ... i've got a couple of suggestions but we could do it in steps, after coverage is ok for the existing setup

samukweku · 2025-01-26T02:08:56Z

also @Sabrina-Hassaim kindly close the other PR

samukweku · 2025-01-28T07:26:58Z

janitor/io.py

+    return dfs if len(dfs) > 1 else dfs[0]
+
+
+def _select_files_interactively(compatible_files: list[str]) -> list[str]:


i'm not sure we should support this - is there any benefit to this? @ericmjl @pyjanitor-devs/core-devs thoughts?

ericmjl · 2025-01-28

I think it's worth keeping around just to see what it might do for the library. If it turns out not to be used very widely we can just deprecate it at a later date. On the other hand, if it's very popular, then we have the benefit of having it around.
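For context, one way an interactive selector can stay easy to test is to take the prompt function as a parameter, so tests do not need to patch `input`. The body of `_select_files_interactively` is not shown in this thread, so the following is only a hypothetical sketch; the name `select_files` and the comma-separated number format are assumptions.

```python
from typing import Callable


def select_files(
    compatible_files: list[str],
    prompt: Callable[[str], str] = input,
) -> list[str]:
    """Ask which files to read; a blank answer selects all of them."""
    for position, name in enumerate(compatible_files, start=1):
        print(f"{position}. {name}")

    answer = prompt("File numbers, comma-separated (blank for all): ")
    if not answer.strip():
        return list(compatible_files)

    selected = []
    for token in answer.split(","):
        position = int(token.strip())  # raises ValueError on non-numeric input
        if not 1 <= position <= len(compatible_files):
            raise ValueError(f"{position} is not a listed file number")
        selected.append(compatible_files[position - 1])
    return selected
```

A test can then call `select_files(files, prompt=lambda _: "2")` and assert on the result without any terminal interaction.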

samukweku · 2025-01-28T07:30:35Z

janitor/io.py

+    extract_to_df: bool = True,
+    file_type: str | None = None,
+    selected_files: list[str] | None = None,
+) -> pd.DataFrame | list[str]:


i think we should allow for more flexibility, via kwargs, where you can pass extra info to read_csv, read_excel, read_parquet, etc
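A rough sketch of what forwarding keyword arguments could look like, under stated assumptions: `read_zip_members` and its `readers` mapping are hypothetical names, only .zip input is handled, and the reader callables are passed in by the caller rather than hard-coded, so the block itself depends only on the standard library.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Any, Callable


def read_zip_members(
    archive,
    readers: dict[str, Callable[..., Any]],
    **reader_kwargs: Any,
) -> dict[str, Any]:
    """Read every member whose suffix has a reader, forwarding reader_kwargs."""
    results: dict[str, Any] = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            reader = readers.get(PurePosixPath(name).suffix.lower())
            if reader is None:
                continue  # no reader registered for this suffix
            # Buffer the member so the reader receives a seekable file object.
            buffer = io.BytesIO(zf.read(name))
            results[name] = reader(buffer, **reader_kwargs)
    return results


# Hypothetical usage with pandas:
#   read_zip_members("data.zip", {".csv": pd.read_csv}, sep=";")
```

One design caveat: the same keyword arguments reach every reader, so an archive mixing CSV and Excel files would need per-format kwargs (for example a dict keyed by suffix) to avoid passing `sep` to `read_excel`.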

samukweku · 2025-01-28T07:31:35Z

janitor/io.py

+    extract_to_df: bool = True,
+    file_type: str | None = None,
+    selected_files: list[str] | None = None,
+) -> pd.DataFrame | list[str]:


we should add an engine argument to support other dataframe libraries, e.g. polars. Have a look at some of the IO functions that support polars
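As an illustration of the suggestion, an `engine` argument can be resolved to the matching library's reader with a lazy import, so polars stays an optional dependency. This does not reproduce how the existing pyjanitor IO functions do it; `resolve_csv_reader` and `SUPPORTED_ENGINES` are hypothetical names.

```python
import importlib
from typing import Any, Callable

SUPPORTED_ENGINES = ("pandas", "polars")


def resolve_csv_reader(engine: str = "pandas") -> Callable[..., Any]:
    """Return read_csv from the requested dataframe library."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"engine must be one of {SUPPORTED_ENGINES}, got {engine!r}"
        )
    try:
        # Import lazily so the library is only required when it is requested.
        module = importlib.import_module(engine)
    except ImportError as error:
        raise ImportError(
            f"engine={engine!r} requires the {engine} package to be installed"
        ) from error
    # Both pandas and polars expose a top-level read_csv function.
    return module.read_csv
```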

samukweku · 2025-01-28T07:33:38Z

kindly add a line to changelog.md

ENH: Add read archive function

7f971c4

Fix : coverage

36caf34

Sabrina-Hassaim force-pushed the Sabrina_Hassaim/read_archive branch from 4c84cd6 to 36caf34 on January 27, 2025 10:16

Sabrina-Hassaim added 2 commits January 27, 2025 11:33

Fix : unit tests

15134a1

Update contributors list

3c8570d

samukweku reviewed Jan 28, 2025

