bug/Execution speed is very slow in AWS LAMBDA environment #2916

cds-code · 2024-04-22T04:37:57Z

Describe the bug
Execution is very slow in the AWS Lambda environment when extracting text from txt, pdf, docx, etc., but very fast in a local Windows environment.

scanny · 2024-04-22T18:05:34Z

@cds-code can you describe how you are running unstructured in AWS Lambda?

cds-code · 2024-04-23T02:03:09Z

I'm running a Docker image in AWS Lambda:

FROM public.ecr.aws/docker/library/python:3.11.6-slim-bookworm
RUN apt-get update && apt-get install -y \
    # unstructured package requirements for file type detection
    libmagic-mgc libmagic1
RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]
RUN [ "python3", "-c", "import nltk; nltk.download('averaged_perceptron_tagger', download_dir='/usr/local/nltk_data')" ]

pip install 
unstructured
unstructured[pdf]
unstructured[docx]
unstructured[xlsx]
unstructured[pptx]
unstructured[md]

from unstructured.partition.auto import partition
partition(filename="XXXXX.pdf")

scanny · 2024-04-23T05:08:56Z

Have you accounted for spin-up (cold-start) time of the Lambda instance? Like only start timing after receiving the first response?

Also, can you provide some specific timings?

scanny · 2024-04-23T05:35:40Z

And how much memory is allocated to the Lambda instance?

danielfornarini · 2024-04-23T10:00:05Z

I have the same problem in AWS Batch running on Fargate. I allocated 2 vCPUs and 4 GB of memory.

cds-code · 2024-04-24T02:52:38Z

> Have you accounted for spin-up (cold-start) time of the Lambda instance? Like only start timing after receiving the first response?
>
> Also, can you provide some specific timings?

The timing does not include the Lambda cold-start time; it covers only the partition(filename="XXXXX.pdf") call.
I tried the Lambda function with 1024 MB, 2048 MB, and 4096 MB of memory, and the execution time was the same in each case.

When the program is executed three times in a loop, only the first iteration is very slow.
It seems that initializing unstructured takes a long time in the AWS Lambda environment.
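
A measurement like the one described here can be reproduced with a small timing loop. This is only a sketch: `time_calls` and `measure_partition` are hypothetical helper names, and the import of unstructured is timed separately on the assumption that part of the first-call cost may come from importing and initializing the library rather than from parsing the file.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def measure_partition(path, repeats=3):
    """Print the import time and the per-iteration partition time for one file."""
    start = time.perf_counter()
    # Imported lazily so the import cost shows up as its own number.
    from unstructured.partition.auto import partition
    print(f"import: {time.perf_counter() - start:.2f}s")

    for i, seconds in enumerate(time_calls(lambda: partition(filename=path), repeats), 1):
        print(f"loop {i}: {seconds:.2f}s")
```

Calling `measure_partition("XXXXX.pdf")` inside the Lambda handler separates the import cost from the first and later partition calls, which helps tell library initialization apart from per-file work.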

```
| 2024-04-24T05:10:30.306Z | --1--- 2024-04-24 05:09:52.577591 ⇒loop 1
| 2024-04-24T05:10:30.306Z | --2--- 2024-04-24 05:10:07.032373 ⇒loop 1
| 2024-04-24T05:10:30.306Z | --1--- 2024-04-24 05:10:07.032521 ⇒loop 2
| 2024-04-24T05:10:30.306Z | --2--- 2024-04-24 05:10:08.106393 ⇒loop 2
| 2024-04-24T05:10:30.306Z | --1--- 2024-04-24 05:10:08.106490 ⇒loop 3
| 2024-04-24T05:10:30.306Z | --2--- 2024-04-24 05:10:08.840540 ⇒loop 3
```

scanny · 2024-04-24T19:16:44Z

~~@cds-code what "extras" are you installing with unstructured? Like where you do pip install unstructured[x], what x do you use?~~ NM, I see you posted that above :)

scanny · 2024-04-24T19:17:26Z

Also, just out of curiosity, can you give me a sense of the cold-start times you've seen?

adieuadieu · 2024-04-29T10:31:51Z

We are able to run Unstructured in a container on AWS Lambda without issue (or, well, there are issues, but we can work around them.)

Things to consider (sorry that these points are a bit............unstructured):

- If your container image is large (e.g. 1 GB or more), the first invocation after each deployment of a new image can take quite a while (e.g. 30 seconds), because Lambda has to download the image from AWS ECR. This only happens on the first invocation after deploying a new image; after that, Lambda will have cached it. It looks like you install NLTK models and the libmagic libraries into the container, so your image is likely around 1 GB.
  - Since you mention that it's only the first invocation that's slow, this initial downloading/caching from ECR is probably what you're seeing.
  - Consider using multi-stage builds to reduce the number of new layers that Lambda has to download from ECR.
- I noticed you're not pre-downloading/installing any of the models that Unstructured uses to partition/read/parse PDF files. Unless you're using the "fast" strategy, it's likely that Unstructured will try to use a model to parse PDF files. Unstructured lazy-loads these models, so they're not downloaded until Unstructured needs them, which will take some time the first time. That's assuming you've configured your environment to download these models into /tmp (the only writeable location in Lambda) with enough disk space.
  - Also, note that Unstructured's current default model, detectron2onnx, won't work in AWS Lambda because of an underlying issue in onnxruntime->pytorch->cpuinfo. Use the model-override parameter to specify chipper instead, though this will further balloon your container image.
  - Consider using only the fast strategy instead, e.g. partition(file, strategy="fast"), and make sure to add tesseract to your container image.
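
One way to point lazily downloaded files at /tmp is to set cache-related environment variables before the heavy imports happen. The sketch below is an assumption, not something Unstructured documents: it presumes the models and caches are written by libraries that honor `HF_HOME` (huggingface_hub), `MPLCONFIGDIR` (matplotlib), and `XDG_CACHE_HOME`. The helper name `redirect_caches` and the subdirectory names are hypothetical.

```python
import os

# Assumption: these variables cover the caches that matter in your image.
# HF_HOME is read by huggingface_hub, MPLCONFIGDIR by matplotlib, and
# XDG_CACHE_HOME by various other libraries.
CACHE_ENV_VARS = {
    "HF_HOME": "huggingface",
    "MPLCONFIGDIR": "matplotlib",
    "XDG_CACHE_HOME": "xdg-cache",
}


def redirect_caches(base_dir="/tmp"):
    """Create cache directories under base_dir and export them via os.environ."""
    resolved = {}
    for var, subdir in CACHE_ENV_VARS.items():
        path = os.path.join(base_dir, subdir)
        os.makedirs(path, exist_ok=True)
        os.environ[var] = path
        resolved[var] = path
    return resolved
```

`redirect_caches()` has to run at the top of the handler module, before unstructured or any model library is imported, because most libraries read these variables once at import time.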

If you're already accounting for the ECR download/caching time, one other thing you can try is to run a "fake" partition script during the build of your container image. This will help "warm up" any libraries/dependencies which may want to run some initial first-time setup tasks (like building/caching fonts, or downloading models). For example, in the same way you "warm up" the NLTK libraries, you could add a RUN step:

COPY XXXXX.pdf /tmp/XXXXX.pdf
RUN [ "python3", "-c", "from unstructured.partition.auto import partition; partition(filename='/tmp/XXXXX.pdf')" ]

But, this will potentially exacerbate the first point about the container image size.
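
If the one-liner grows, the same warm-up can live in a small script that is run during the image build. This is a sketch only: the file name `warmup.py` and the `warm_up` helper are hypothetical, and it assumes that partitioning one sample file per document type is enough to trigger the first-time setup.

```python
import sys
import time


def warm_up(sample_paths, partition_fn):
    """Partition each sample once; return {path: seconds, or None if it failed}."""
    results = {}
    for path in sample_paths:
        start = time.perf_counter()
        try:
            partition_fn(path)
        except Exception as exc:  # report the failure and keep warming the rest
            print(f"warm-up failed for {path}: {exc}", file=sys.stderr)
            results[path] = None
            continue
        results[path] = time.perf_counter() - start
    return results


def main(argv):
    """Warm up with the real partition function; return a process exit code."""
    # Imported lazily so warm_up() can be tested without unstructured installed.
    from unstructured.partition.auto import partition

    results = warm_up(argv, lambda p: partition(filename=p))
    return 1 if any(v is None for v in results.values()) else 0
```

Saved as `warmup.py` with `sys.exit(main(sys.argv[1:]))` as its last line, it could replace the inline command with `RUN python3 warmup.py /tmp/XXXXX.pdf`, and a non-zero exit code would fail the build if a sample cannot be parsed.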

cds-code · 2024-04-30T05:02:22Z

This works for me thanks.

COPY XXXXX.pdf /tmp/XXXXX.pdf
RUN [ "python3", "-c", "from unstructured.partition.auto import partition; partition(filename='/tmp/XXXXX.pdf')" ]

Does unstructured itself provide an initialization (preload) method that does the same thing as the step above?

sanketsanjaypote29 · 2024-06-19T10:36:53Z

@adieuadieu sir, could you please explain how you are able to run the unstructured package in a Lambda function? I am having trouble doing this; please help me solve the issue.

adieuadieu · 2024-06-19T11:20:14Z

> @adieuadieu sir, could you please explain how you are able to run the unstructured package in a Lambda function? I am having trouble doing this; please help me solve the issue.

Hi @sanketsanjaypote29. I suspect this thread is not the correct avenue for that sort of request, nor am I available to offer general support. But, briefly, here's some high-level guidance which should start you in the right direction:

You'll want to deploy your Lambda function using a container image. Add unstructured to your list of requirements.txt and import the unstructured module as you would in any other context. Follow the "Using an AWS base image for Python" guide here: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html#python-image-instructions

Alternatively, unless you're just noodling around with stuff for funsies, consider using Unstructured's hosted API service to save yourself the time of trying to run it on Lambda: https://docs.unstructured.io/api-reference/api-services/free-api
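
For orientation, a handler module for such a container image might look roughly like the sketch below. The event shape (a `path` key pointing at a file already present in the container or in /tmp) and the `make_handler` factory are assumptions made for illustration; a real function would more likely fetch the document from S3 first.

```python
import json


def make_handler(partition_fn):
    """Build a Lambda handler around any callable that maps a path to elements."""
    def lambda_handler(event, context):
        path = event["path"]  # hypothetical event shape
        elements = partition_fn(path)
        text = "\n\n".join(str(el) for el in elements)
        return {
            "statusCode": 200,
            "body": json.dumps({"text": text, "elements": len(elements)}),
        }
    return lambda_handler


def _partition_fast(path):
    # Imported lazily so this module loads even where unstructured is absent.
    from unstructured.partition.auto import partition
    return partition(filename=path, strategy="fast")


# Entry point for the image, e.g. CMD [ "main.lambda_handler" ] if saved as main.py.
lambda_handler = make_handler(_partition_fast)
```

Passing the partition function in keeps the handler testable without the library, and `strategy="fast"` follows the suggestion earlier in this thread to avoid model downloads.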

rmukhop3 · 2024-07-26T19:11:14Z

@cds-code, @adieuadieu
This is my Dockerfile:

FROM public.ecr.aws/lambda/python:3.11
ADD . ${LAMBDA_TASK_ROOT}

COPY requirements.txt .

RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

RUN python3 -m spacy download en_core_web_sm --target "${LAMBDA_TASK_ROOT}"
RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='tmp/local/nltk_data')" ]

CMD [ "main.lambda_handler" ]

Requirements.txt
unstructured[all-docs]

I am getting the following error while downloading punkt for pptx, pdf file-types using strategy="fast". Environment = AWS Lambda

[nltk_data] Downloading package punkt to /home/sbx_user1051/nltk_data...
[Errno 30] Read-only file system: '/home/sbx_user1051'

adieuadieu · 2024-07-26T19:20:59Z

@rmukhop3

It looks like NLTK isn't able to find the pre-downloaded models at runtime.

You'll want to set the base of download_dir to the lambda task root:

download_dir='${LAMBDA_TASK_ROOT}/nltk_data')

Then you also need to make sure to set the NLTK_DATA environment variable to the same location (in the Dockerfile):

ENV NLTK_DATA=${LAMBDA_TASK_ROOT}/nltk_data

You cannot put it into /tmp as this will not be persisted.
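
One caveat when applying this: the exec form of RUN (the JSON-array form used above) does not go through a shell, so `${LAMBDA_TASK_ROOT}` inside the Python string is not expanded there. A hedged alternative is to resolve the directory in Python during the build. In this sketch the helper names are hypothetical, and the `/var/task` fallback assumes the AWS base image default.

```python
import os
import posixpath


def nltk_data_dir(env=None):
    """Return <LAMBDA_TASK_ROOT>/nltk_data, falling back to /var/task."""
    env = os.environ if env is None else env
    return posixpath.join(env.get("LAMBDA_TASK_ROOT", "/var/task"), "nltk_data")


def download_nltk_packages(packages=("punkt", "averaged_perceptron_tagger")):
    """Download the given NLTK packages into the task root at build time."""
    # Imported lazily so nltk_data_dir() works without nltk installed.
    import nltk

    target = nltk_data_dir()
    for name in packages:
        nltk.download(name, download_dir=target)
    return target
```

Running `download_nltk_packages()` from a RUN step places the data inside the image, and the `ENV NLTK_DATA=${LAMBDA_TASK_ROOT}/nltk_data` line is still needed so NLTK looks there at runtime.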

pastram-i · 2024-10-29T19:23:08Z

While I think @adieuadieu is right that this may not be the best place for this, it's where I ended up when investigating running on Lambda, so I'm going to expand on some details here about how to get it running on Lambda at all. Some credit also goes to @adieuadieu's other comment.

For me, the main problems with running unstructured on Lambda came down to two issues: 1) onnxruntime and 2) image size.

onnxruntime
- Just doesn't like to run on lambda
  - After some digging in threads, I was able to find this project, which housed a workaround. In practice, the workaround is just adding this file to your project and these lines to your Dockerfile:

COPY patch.txt /sys/devices/system/cpu/possible
COPY patch.txt /sys/devices/system/cpu/present

Image size
- Lambda's maximum container image size is 10 GB, but unstructured (and its requirements) ends up being >10 GB.
  - The workaround is to use the CPU-only packages of torch, as Lambda is a non-GPU environment.

Hope this helps.

EDIT - I'm still having issues running in the Lambda environment, but I've moved that to its own issue: #3759

cds-code added the bug Something isn't working label Apr 22, 2024

scanny mentioned this issue Apr 22, 2024

bug/Execution speed is very slow in AWS Lambda environment #2915

Closed

scanny added investigating Issues that require more information before they are actionable and removed bug Something isn't working labels Apr 23, 2024
