Run the curl command in a for loop for multiple pdfs in a single folder #75

Closed
clewis96 opened this issue Sep 9, 2024 · 6 comments

@clewis96

clewis96 commented Sep 9, 2024

Hello,

I am currently using this repo (which is amazing by the way!!) to convert legal documents from PDF to JSON and plain text, so that I have clean text for future sentiment and textual analysis work. However, I want to be able to run this on multiple PDFs in a single folder automatically. I wrote a bash script that incorporates your curl command to do that, but I am not very familiar with Docker, so I have not been able to get it to run properly.

Is there any way you could add a script that runs a for loop and converts all PDFs in a single folder to JSON files? Or help me get this script running within Docker? I think this feature could be useful beyond my use case, for anyone converting a big corpus of PDFs to text. Ideally you could just point to the input folder where the PDFs are stored and the output folder where the JSON files should go. I have not been able to test my script because, as I mentioned, I have not been able to get it to run properly in Docker using the chmod +x command, but here is what I was thinking:

#!/bin/bash

# Directory containing the PDFs
PDF_DIR="/input_pdfs_test"

# Server URL
SERVER_URL="http://localhost:5060"

# Directory to store the output files
OUTPUT_DIR="/output"

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Loop through all PDF files in the directory
for pdf_file in "$PDF_DIR"/*.pdf; do
    # Skip the unexpanded pattern when the directory contains no PDFs
    [ -e "$pdf_file" ] || continue
    # Extract the base name of the PDF file (e.g., "document.pdf" from "/pdfs/document.pdf")
    base_name=$(basename "$pdf_file")
    
    # Define the output file name (e.g., "document_output.json")
    output_file="$OUTPUT_DIR/${base_name%.pdf}_output.json"
    
    # Run the curl command for each PDF and store the output in the corresponding file
    curl -X POST -F "file=@${pdf_file}" "$SERVER_URL" > "$output_file"
done

Thank you so much for your help!

@ali6parmak
Collaborator

Hi, you can use this Python script to do this:

import json
import subprocess
from pathlib import Path


def analyze_documents(pdfs_path: str, jsons_path: str):
    # Create the output folder if it does not exist yet
    Path(jsons_path).mkdir(parents=True, exist_ok=True)

    # Only process PDF files; anything else in the folder is skipped
    for pdf_file in sorted(Path(pdfs_path).glob("*.pdf")):
        # Same request as the curl command from the issue: POST the PDF to the service
        command = [
            "curl",
            "-X",
            "POST",
            "-F",
            f"file=@{pdf_file}",
            "localhost:5060",
        ]

        result = subprocess.run(command, capture_output=True, text=True)
        json_data = json.loads(result.stdout)

        # "document.pdf" becomes "document.json" in the output folder
        output_path = Path(jsons_path) / f"{pdf_file.stem}.json"
        output_path.write_text(json.dumps(json_data, indent=4))


if __name__ == '__main__':
    pdfs_path = "/path/to/pdfs/folder"
    jsons_path = "/path/to/output/jsons/folder"
    analyze_documents(pdfs_path, jsons_path)

Don't forget to set pdfs_path and jsons_path to your own input and output folders. Hope this helps!
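
For a large corpus, one failed request should not stop the whole run. The following is a minimal sketch of the same loop with per-file error handling, not the project's own implementation. The names curl_convert and convert_folder are hypothetical, and the service is assumed to listen on http://localhost:5060 and return JSON, as in the script above. The convert parameter exists only so the loop can be tried without a running service.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Dict


def curl_convert(pdf_path: Path, server_url: str = "http://localhost:5060") -> str:
    """POST one PDF with curl and return the raw response body."""
    result = subprocess.run(
        ["curl", "--silent", "--show-error", "--fail",
         "-X", "POST", "-F", f"file=@{pdf_path}", server_url],
        capture_output=True,
        text=True,
    )
    # --fail makes curl exit with a non-zero code on HTTP error responses
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or f"curl exited with {result.returncode}")
    return result.stdout


def convert_folder(
    pdfs_path,
    jsons_path,
    convert: Callable[[Path], str] = curl_convert,
) -> Dict[str, str]:
    """Convert every *.pdf in pdfs_path and return {file name: error} for failures."""
    out_dir = Path(jsons_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    failures: Dict[str, str] = {}
    for pdf in sorted(Path(pdfs_path).glob("*.pdf")):
        try:
            data = json.loads(convert(pdf))
        except (RuntimeError, json.JSONDecodeError) as error:
            # Record the problem and move on to the next file
            failures[pdf.name] = str(error)
            continue
        (out_dir / f"{pdf.stem}.json").write_text(json.dumps(data, indent=4))
    return failures
```

Calling convert_folder("/path/to/pdfs/folder", "/path/to/output/jsons/folder") returns a dictionary of the files that could not be converted, so they can be retried later. Note that the "*.pdf" pattern is case-sensitive on most systems, so files ending in ".PDF" would be skipped.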

@gabriel-piles
Member

@clewis96

Thank you for your input.

The script you shared works well. I had to create the output folder beforehand and use my own paths, but it successfully processed all the PDFs. If you can share the error text, we can try to help you.

We will incorporate this functionality into the service.

Have a great day!

@clewis96
Author

clewis96 commented Sep 9, 2024

Hi @gabriel-piles - thanks for trying this out! I added the bash script I gave you (named pdf_txt.sh) to the pdf-document-layout-analysis-main folder and then tried, through the Docker terminal, to build the script using chmod +x pdf_txt.sh. However, when I then try to run the script in the Docker terminal using .\pdf_txt.sh, it says: zsh: command not found: .pdf_txt.sh - which made me think I was not building this properly in Docker because I didn't adjust or change the Dockerfile/Makefile.

@gabriel-piles
Member

Hi @clewis96,

Instead of running the script inside Docker, you can start the service using "make start" and then execute the script in a regular terminal (outside of Docker) with ./path/to/script.sh. You do not need to alter the Docker container build process for this to function.
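
Before starting a batch from a regular terminal, it can help to confirm that the service started by "make start" is actually answering. The sketch below is one way to check this with the Python standard library. The function name service_is_reachable is hypothetical, the address http://localhost:5060 is taken from the scripts in this thread, and it is an assumption that any HTTP response at all (even an error status) means the service is up.

```python
import urllib.error
import urllib.request


def service_is_reachable(url: str = "http://localhost:5060", timeout: float = 3.0) -> bool:
    """Return True if anything answers HTTP at `url`, whatever the status code."""
    # The service is local, so ignore any proxy settings from the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is running
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: nothing is listening
        return False
```

If service_is_reachable() returns False, the likely cause is that the service has not been started yet or is listening on a different port.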

The terminal might display "command not found" if you execute ".pdf_txt.sh" instead of "./pdf_txt.sh". If the script doesn't exist, the error message should be "no such file or directory".
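
The reported error is consistent with a backslash being typed where a forward slash was intended. In POSIX-style shells an unquoted backslash escapes the next character and is then dropped, so ".\pdf_txt.sh" is read as the command name ".pdf_txt.sh". The short sketch below illustrates this with Python's shlex module, which follows POSIX-like splitting rules. It is an assumption that zsh treats this simple case the same way, though that matches the error message quoted earlier in the thread. The helper name command_word is hypothetical.

```python
import shlex


def command_word(typed: str) -> str:
    """Return the first word a POSIX-style shell would try to run."""
    return shlex.split(typed)[0]


# Backslash: the escape character is consumed, leaving a name with no path
print(command_word(r".\pdf_txt.sh"))  # .pdf_txt.sh

# Forward slash: the word stays a relative path to a file in the current folder
print(command_word("./pdf_txt.sh"))   # ./pdf_txt.sh
```

Because ".pdf_txt.sh" contains no slash, the shell searches for it as a command on the PATH and reports "command not found" rather than looking in the current folder.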

I hope you find the fix.

@clewis96
Author

@gabriel-piles - I got it working! Thank you so much for your help. I am an engineer turned law student, so it has been a while since I have worked with code. Two comments: 1. I really do think this would be a great feature to add for anyone wanting to automate this on multiple PDFs in the future, and 2. I am wondering what motivating problem you all were trying to solve when you created this LLM and repo?

Thank you for your time and help!

@gabriel-piles
Member

Thank you for your support and interest. Please find an article about this project here:

https://huridocs.org/2024/08/new-open-source-ai-tool-unlocks-content-and-structure-of-pdfs-effortlessly/
