Bug: I can't extract the images from PDF #995

aropb · 2025-02-19T19:04:16Z

Here is an example document.

39. Trade Finance on the Blockchain.pdf

`Page.GetImages()` does not return the image inside this PDF. Is there currently any way to extract it?
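
For context, here is a minimal sketch of the kind of extraction loop being described, assuming the UglyToad.PdfPig NuGet package is referenced. The input path and the output file names are placeholders, not taken from the thread. It only uses the default parsing options, so it does not configure any additional filters and may still skip images that PdfPig cannot decode on its own.

```csharp
using System;
using System.IO;
using UglyToad.PdfPig;

class ExtractImages
{
    static void Main(string[] args)
    {
        // Placeholder input path (assumption); pass a real file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        using (PdfDocument document = PdfDocument.Open(path))
        {
            foreach (var page in document.GetPages())
            {
                int index = 0;
                foreach (var image in page.GetImages())
                {
                    index++;

                    // TryGetPng returns false when the image data cannot be
                    // decoded and converted with the filters currently available.
                    if (image.TryGetPng(out var png))
                    {
                        File.WriteAllBytes($"page{page.Number}_img{index}.png", png);
                    }
                    else
                    {
                        Console.WriteLine(
                            $"Page {page.Number}, image {index}: could not convert to PNG.");
                    }
                }
            }
        }
    }
}
```

If the loop reports no images at all for a page that visibly contains one, that matches the behaviour reported here rather than a mistake in the loop itself.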

BobLd (Maintainer) · 2025-02-20T19:11:43Z

@aropb I've just added documentation for this, see https://github.com/UglyToad/PdfPig/wiki/Images#additional-filters

aropb (Author) · 2025-02-20T20:49:56Z

@BobLd

Thanks.

With this in mind, wouldn't it be more reliable to render such PDF files to images? That is what I'm doing now, but which approach is more robust and universal?

"Since PDF content may define many different ColorSpaces for rendering not all of these are yet supported by PdfPig. Where the ColorSpace is common, e.g. DeviceGray, DeviceRGB, DeviceCMYK decoding of the image to a PNG is supported. Other ColorSpaces are either not supported or only have partial support. IPdfImage defines the ColorSpace and ColorSpaceDetails properties for more information of the active ColorSpace when this image was rendered to the page."

BobLd (Maintainer) · 2025-02-20T20:56:06Z

@aropb It depends on what you are trying to do. Are you trying to render the document pages as images? If so, you can use https://github.com/BobLd/PdfPig.Rendering.Skia

"Since PDF content may define many different ColorSpaces for rendering not all of these are yet supported by PdfPig. Where the ColorSpace is common, e.g. DeviceGray, DeviceRGB, DeviceCMYK decoding of the image to a PNG is supported. Other ColorSpaces are either not supported or only have partial support. IPdfImage defines the ColorSpace and ColorSpaceDetails properties for more information of the active ColorSpace when this image was rendered to the page."

I have removed this part of the documentation, as it is no longer accurate.

aropb (Author) · 2025-02-20T21:01:53Z

I need to extract text with maximum quality and accuracy. The PDFs can vary widely, and I don't know in advance what kind I will get: they may be text documents or even scans of books.

aropb (Author) · 2025-02-20T21:08:39Z

Am I now guaranteed to get all the images when using the filters? After extraction, I use Tesseract OCR to convert the images to text. Previously there were cases where the image format was not suitable for it and it raised an error. I also see that, where a PDF already contains text, I should extract that text directly, since it is more reliable and faster.
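
The text-first, OCR-as-fallback workflow described above could be sketched roughly as follows. This is only an illustration of the decision logic, not PdfPig or Tesseract code: `TextFirstExtractor`, `ExtractPageText`, the `MinTextChars` threshold, and the `ocr` delegate are all hypothetical names, and the caller is assumed to supply the page's text layer, its extracted image bytes, and an OCR function.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextFirstExtractor
{
    // Hypothetical threshold (assumption): pages with fewer visible
    // characters than this are treated as scans and sent to OCR.
    public const int MinTextChars = 20;

    public static string ExtractPageText(
        string textLayer,
        IEnumerable<byte[]> pageImages,
        Func<byte[], string> ocr)
    {
        // Count non-whitespace characters in the embedded text layer.
        int visible = (textLayer ?? string.Empty).Count(c => !char.IsWhiteSpace(c));
        if (visible >= MinTextChars)
        {
            // The page already carries usable text: skip OCR entirely.
            return textLayer;
        }

        var parts = new List<string>();
        foreach (var image in pageImages)
        {
            try
            {
                parts.Add(ocr(image));
            }
            catch (Exception ex)
            {
                // An OCR engine may reject an unsuitable image format;
                // log the failure and continue with the remaining images.
                Console.Error.WriteLine($"OCR failed for one image: {ex.Message}");
            }
        }

        return string.Join(Environment.NewLine, parts);
    }
}
```

Catching the OCR exception per image keeps one unsupported image from aborting the whole document, which is the failure mode mentioned above. The threshold is a judgment call and would need tuning on real documents.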

aropb (Author) · 2025-02-21T08:09:10Z

@BobLd

Thanks a lot!

https://github.com/UglyToad/PdfPig/wiki/Images#additional-filters

I did as you described above, but it doesn't work for this PDF:

#995 (comment)

192. The EU's 15th package adopted today and latest developments in the EU, UK and US Russia sanctions programmes.pdf

BobLd (Maintainer) · 2025-02-23T10:58:08Z

@aropb I believe all the issues are now fixed (see BobLd/UglyToad.PdfPig.Filters.Jbig2.PdfboxJbig2#2). I will mark the discussion as closed. Feel free to reopen it if need be.
