Run the curl command in a for loop for multiple pdfs in a single folder #75

clewis96 · 2024-09-09T03:34:26Z

Hello,

I am currently using this repo (which is amazing, by the way!!) to convert legal documents from PDF to JSON and plain text, so that I have clean text for future sentiment and textual analysis work. However, I want to be able to run this on multiple PDFs in a single folder automatically. I wrote a bash script that incorporates your curl command to do that, but I am not very familiar with Docker, so I have not been able to get it to run properly.

Is there any way you could add a script that runs a for loop and converts all PDFs in a single folder to JSON files? Or could you help me get this script running within Docker? I think this feature could be useful beyond my use case, for anyone converting a large corpus of PDFs to text. Ideally, you would just point to the input folder where the PDFs are stored and the output folder where the JSON files should go. I have not been able to test my script because, as I mentioned, I have not been able to get it to run in Docker using chmod +x, but here is what I was thinking:

#!/bin/bash

# Directory containing the PDFs
PDF_DIR="/input_pdfs_test"

# Server URL
SERVER_URL="http://localhost:5060"

# Directory to store the output files
OUTPUT_DIR="/output"

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Loop through all PDF files in the directory
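for pdf_file in "$PDF_DIR"/*.pdf; do
    # Skip the unexpanded "*.pdf" pattern when the directory contains no PDFs
    [ -e "$pdf_file" ] || continue
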
    # Extract the base name of the PDF file (e.g., "document.pdf" from "/pdfs/document.pdf")
    base_name=$(basename "$pdf_file")
    
    # Define the output file name (e.g., "document_output.json")
    output_file="$OUTPUT_DIR/${base_name%.pdf}_output.json"
    
    # Run the curl command for each PDF and store the output in the corresponding file
    curl -X POST -F "file=@${pdf_file}" "$SERVER_URL" > "$output_file"
done

Thank you so much for your help!


ali6parmak · 2024-09-09T08:49:23Z

Hi, you can use this Python script to do this:

import json
import subprocess
from pathlib import Path
from os import listdir
from os.path import join


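def analyze_documents(pdfs_path: str, jsons_path: str):
    for file in listdir(pdfs_path):
        # Skip anything in the folder that is not a PDF
        if not file.lower().endswith(".pdf"):
            continue
        file_path = join(pdfs_path, file)
        # POST the PDF to the running service with curl
        command = [
            "curl",
            "-X",
            "POST",
            "-F",
            f"file=@{file_path}",
            "localhost:5060",
        ]

        result = subprocess.run(command, capture_output=True, text=True)
        json_data = json.loads(result.stdout)
        # Save the result for document.pdf as document.json in the output folder
        Path(join(jsons_path, Path(file).stem + ".json")).write_text(json.dumps(json_data, indent=4))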


if __name__ == '__main__':
    pdfs_path = "/path/to/pdfs/folder"
    jsons_path = "/path/to/output/jsons/folder"
    analyze_documents(pdfs_path, jsons_path)

Don't forget to set pdfs_path to your PDF folder and jsons_path to your output folder. Hope this helps!
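A slightly more defensive variant of the same idea is sketched below. It is not part of the service. The helper names `pdf_files` and `output_path` are made up for illustration, and the `localhost:5060` address is taken from the script above. The sketch assumes curl is installed and the service is already running. It creates the output folder if it is missing (otherwise `write_text` fails), ignores non-PDF files, and collects the files that could not be converted instead of stopping at the first error.

```python
import json
import subprocess
from pathlib import Path


def pdf_files(pdfs_path: str) -> list:
    """Return the PDF files in a folder, sorted by name, skipping everything else."""
    return sorted(
        p for p in Path(pdfs_path).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_path(pdf_file: Path, jsons_path: str) -> Path:
    """Map <folder>/document.pdf to <jsons_path>/document.json."""
    return Path(jsons_path) / (pdf_file.stem + ".json")


def analyze_documents(pdfs_path: str, jsons_path: str, server_url: str = "localhost:5060") -> list:
    """Send every PDF to the service and return the list of files that failed."""
    # Create the output folder up front so write_text() does not fail
    Path(jsons_path).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_file in pdf_files(pdfs_path):
        # -sS: hide the progress meter but still report errors on stderr
        command = ["curl", "-sS", "-X", "POST", "-F", f"file=@{pdf_file}", server_url]
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # curl itself failed, e.g. the service is not reachable
            failed.append(pdf_file)
            continue
        try:
            json_data = json.loads(result.stdout)
        except json.JSONDecodeError:
            # The response was not valid JSON
            failed.append(pdf_file)
            continue
        output_path(pdf_file, jsons_path).write_text(json.dumps(json_data, indent=4))
    return failed
```

With the service started, calling `analyze_documents("/path/to/pdfs/folder", "/path/to/output/jsons/folder")` should write one JSON file per PDF and return the PDFs that need another look.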

gabriel-piles · 2024-09-09T09:32:26Z

@clewis96

Thank you for your input.

The script you shared works well. I had to create the output folder beforehand and use my own paths, but it successfully processed all the PDFs. If you can share the error text, we can try to help you.

We will incorporate this functionality into the service.

Have a great day!

clewis96 · 2024-09-09T15:40:26Z

Hi @gabriel-piles - thanks for trying this out! I added the bash script I gave you (named pdf_txt.sh) to the pdf-document-layout-analysis-main folder and then, in the Docker terminal, ran chmod +x pdf_txt.sh on it. However, when I then try to run the script in the Docker terminal using .\pdf_txt.sh, it says: zsh: command not found: .pdf_txt.sh. That made me think I was not building this properly in Docker, because I didn't adjust the Dockerfile or Makefile.

gabriel-piles · 2024-09-10T13:15:59Z

Hi @clewis96,

Instead of running the script inside Docker, you can start the service using "make start" and then execute the script in a regular terminal (outside of Docker) with ./path/to/script.sh. You do not need to alter the Docker container build process for this to function.

The terminal might display "command not found" if you execute ".pdf_txt.sh" instead of "./pdf_txt.sh". If the script did not exist, the error message would instead be "no such file or directory".
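To illustrate why the Windows-style `.\pdf_txt.sh` ends up as `.pdf_txt.sh`: in a POSIX-style shell such as zsh, an unquoted backslash escapes the next character and is then dropped. The short sketch below uses Python's `shlex` module, which follows similar POSIX quoting rules. It approximates the shell's parsing and is not zsh itself.

```python
import shlex

# Windows-style invocation: the backslash escapes "p" and is then removed,
# so the shell would look for a command literally named ".pdf_txt.sh"
windows_style = shlex.split(r".\pdf_txt.sh")

# POSIX-style invocation: "./" means "in the current directory"
posix_style = shlex.split("./pdf_txt.sh")

print(windows_style)  # ['.pdf_txt.sh']
print(posix_style)    # ['./pdf_txt.sh']
```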

I hope you find the fix.

clewis96 · 2024-09-10T16:36:18Z

@gabriel-piles - I got it working! Thank you so much for your help. I am an engineer turned law student, so it has been a while since I have worked with code. Two comments: 1. I really do think this would be a great feature to add for anyone wanting to automate this on multiple PDFs in the future, and 2. I am wondering what motivating problem you were all trying to solve when you created this LLM and repo?

Thank you for your time and help!

gabriel-piles · 2024-09-24T16:38:41Z

Thank you for your support and interest. Please find an article about this project here:

https://huridocs.org/2024/08/new-open-source-ai-tool-unlocks-content-and-structure-of-pdfs-effortlessly/

gabriel-piles closed this as completed Sep 24, 2024
