Run the curl command in a for loop for multiple pdfs in a single folder #75
Hi, you can use this Python script to do this:

```python
import json
import subprocess
from pathlib import Path
from os import listdir
from os.path import join


def analyze_documents(pdfs_path: str, jsons_path: str):
    for file in listdir(pdfs_path):
        file_path = join(pdfs_path, file)
        # POST each PDF to the service running on localhost:5060
        command = [
            "curl",
            "-X",
            "POST",
            "-F",
            f"file=@{file_path}",
            "localhost:5060",
        ]
        result = subprocess.run(command, capture_output=True, text=True)
        json_data = json.loads(result.stdout)
        # Save the response under the same name with a .json extension
        Path(join(jsons_path, file.replace(".pdf", ".json"))).write_text(json.dumps(json_data, indent=4))


if __name__ == '__main__':
    pdfs_path = "/path/to/pdfs/folder"
    jsons_path = "/path/to/output/jsons/folder"
    analyze_documents(pdfs_path, jsons_path)
```

Don't forget to set your own pdfs_path and jsons_path for the output. Hope this helps!
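For reference, each iteration of the loop above issues the same request as running curl by hand against the service on localhost:5060, so you can sanity-check a single document first (the paths below are placeholders):

```bash
# POST one PDF to the service and save the JSON response (placeholder paths)
curl -X POST -F "file=@/path/to/document.pdf" localhost:5060 > /path/to/output/document.json
```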
Thank you for your input. The script you shared works well. I had to create the output folder beforehand and use my own paths, but it successfully processed all the PDFs.

If you can share the error text, we can try to help you. We will incorporate this functionality into the service. Have a great day!
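A note for later readers: the script writes each result with Path.write_text, which does not create missing directories, so as mentioned above the output folder needs to exist before the run, for example:

```bash
# Create the output folder up front (placeholder path)
mkdir -p /path/to/output/jsons/folder
```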
Hi @gabriel-piles - thanks for trying this out! So I tried adding the bash script I mentioned (named pdf_txt.sh) to the pdf-document-layout-analysis-main folder and then tried, through the Docker terminal, to build the script using
Hi @clewis96, instead of running the script inside Docker, you can start the service with `make start` and then execute the script in a regular terminal (outside of Docker) with ./path/to/script.sh. You do not need to alter the Docker container build process for this to work. The terminal might display "command not found" if you execute ".pdf_txt.sh" instead of "./pdf_txt.sh"; if the script doesn't exist, the error message should be "no such file or directory". I hope this helps you find the fix.
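For anyone following along later, here is a minimal sketch of the kind of pdf_txt.sh loop being discussed, assuming the service is already running on localhost:5060 (e.g. after `make start`) and that the script is run outside Docker as described above:

```bash
#!/bin/bash
# pdf_txt.sh -- sketch: run the curl command in a loop over every PDF in a folder.
# Usage: ./pdf_txt.sh /path/to/pdfs/folder /path/to/output/jsons/folder
# The output folder must already exist (see the mkdir note above).

pdfs_path="$1"
jsons_path="$2"

for pdf in "$pdfs_path"/*.pdf; do
    name="$(basename "$pdf" .pdf)"
    # Send the PDF to the service and save the JSON response
    curl -X POST -F "file=@$pdf" localhost:5060 > "$jsons_path/$name.json"
done
```

Remember to make the script executable with `chmod +x pdf_txt.sh` and to invoke it as ./pdf_txt.sh, not .pdf_txt.sh.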
@gabriel-piles - I got it working! Thank you so much for your help. I am an engineer turned law student, so it has been a while since I have worked with code. Two comments: 1. I really do think this would be a great feature to add for anyone wanting to automate this on multiple PDFs in the future, and 2. I am wondering what motivating problem you all were trying to solve when you created this LLM and repo? Thank you for your time and help!
Thank you for your support and interest. Please find an article about this project here: |
Hello,
I am currently using this repo (which is amazing by the way!!) to convert legal documents in PDF form to JSON and plain text, with the goal of having clean text for future sentiment and textual analysis work. However, I want to be able to run this on multiple PDFs in a single folder automatically. I wrote a bash script that incorporates your curl command to do that, but I am not extremely familiar with Docker, so I have not been able to get it to run properly.
Is there any way you could add a script that runs a for loop and converts all PDFs in a single folder to JSONs? Or help me get this script running within Docker? I think this feature could be useful beyond just my use case, for anyone converting a big corpus of PDFs to text. Ideally you could just point to the input folder where the PDFs are stored and an output folder to store the JSON files. I have not been able to test my script since, as I mentioned, I have not been able to get it to run properly in Docker using the chmod command, but here is what I was thinking:
Thank you so much for your help!