reducing size of output jpg #23

eroux · 2019-03-04T08:27:37Z

Context: on S3, the tif corresponding to this image is below 30KB, but the output jpg on the iiif server is 461KB.

On the current website (using JAI), the corresponding image (here) is a png of about 30KB.

The png version is only 53KB, much more reasonable but still significantly more than the current website.

It seems hymir just uses the basic javax.imageio functions (see here) as provided by twelvemonkeys. The parameters that we can use in JPEGImageWriteParam look very limited. There doesn't seem to be a much better option in Java though.

This is an important issue for various reasons:

bigger files take longer to load
they cost us more to transfer to the user (we're paying the bandwidth)
a size increase of more than a factor of 10 is just completely unreasonable and probably indicates some deeper problem

Here are a few ideas to start dealing with the issue:

first, let's bring @TBRC-JimK in: Jim, you'll develop some expertise in image treatment in Java for the asset manager, maybe we should share our doc, techniques, libraries, code, etc.?
a first easy action would be to tweak the Java jpg encoding quality value with the method suggested here; 90% is probably sufficient
we should log the decoders/encoders used by javax.imageio to make sure the correct ones are used (and probably also make sure we understand what the correct ones are)
we can also run some experiments to understand why a png produced with JAI is half the size of a png produced with imageio, and report bugs or tweak configuration if needed
(I'm not sure my diagnosis is right here) then we should understand why the output jpg is full color while the original tif is black and white. It will require some diving into the Java APIs and internal image representation in Java
then we should understand if in these cases we can indicate to the iiif viewer to prefer png to jpg. This will require also some diving, this time in the iiif APIs (that's probably a job for me)
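The quality tweak and the encoder logging suggested above can be sketched with plain javax.imageio calls. This is a minimal sketch, not hymir's code: the class and method names (JpegQualitySketch, writerClassNames, encodeJpeg) are made up for illustration, and it takes whichever JPEG writer ImageIO returns first, which may or may not be the one the server should be using.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper class, for illustration only.
final class JpegQualitySketch {

    /** Class names of every registered writer for a format, in ImageIO's lookup order. */
    static List<String> writerClassNames(String formatName) {
        List<String> names = new ArrayList<>();
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName(formatName);
        while (writers.hasNext()) {
            names.add(writers.next().getClass().getName());
        }
        return names;
    }

    /** Encodes an image as JPEG with an explicit quality between 0.0 and 1.0. */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpeg");
        if (!writers.hasNext()) {
            throw new IllegalStateException("no JPEG writer registered");
        }
        ImageWriter writer = writers.next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        // The quality setting is only honored in explicit compression mode.
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(buffer)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return buffer.toByteArray();
    }
}
```

Logging writerClassNames("jpeg") and writerClassNames("png") once at startup would show which encoders are actually picked up, and calling encodeJpeg with 0.9f against a lower or higher value on a few real pages would show how much the quality setting alone changes the output size.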

eroux · 2019-03-04T08:44:05Z

(edited)

MarcAgate · 2019-03-04T10:15:17Z

I just want to add one remark here: unlike jpg rendering, png rendering of tiffs produces pngs roughly twice the size of the initial tiff. It seems to me we might have an issue with jpg only.

eroux · 2019-03-04T10:49:27Z

I'm on the train now, but yes, we should run the same kind of tests on jpg files and compare the sizes on S3, tbrc.org and iiif

eroux · 2019-03-04T11:01:10Z

(Also note that 53KB is about 77% larger than 30KB; this is not something we can just ignore as a rounding error)

MarcAgate · 2019-03-04T11:04:21Z

As far as color vs. BW is concerned, see https://github.com/dbmdz/iiif-server-hymir/blob/master/src/main/java/de/digitalcollections/iiif/hymir/image/frontend/IIIFImageApiController.java#L125 where COLOR is hardcoded. We can fix that on our side since we have our own Controller implementation.

Posted an issue and suggestion on hymir repo: dbmdz/iiif-server-hymir#59 (implemented it on our server)
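To illustrate the color vs. BW point with plain Java APIs, here is a minimal sketch that redraws an image into a single-band gray buffer before JPEG encoding and then checks how many bands the encoded file has. It is not the change made in the controller or proposed in the hymir issue; GrayJpegSketch and its methods are hypothetical names, and it relies on the default ImageIO JPEG writer and reader.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;
import javax.imageio.stream.MemoryCacheImageInputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper class, for illustration only.
final class GrayJpegSketch {

    /** Redraws the source into an 8-bit, single-band gray image. */
    static BufferedImage toGray(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Encodes with the default JPEG settings of whichever writer ImageIO selects. */
    static byte[] encodeJpeg(BufferedImage image) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(buffer)) {
            if (!ImageIO.write(image, "jpg", out)) {
                throw new IllegalStateException("no JPEG writer accepts this image type");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decodes an encoded image and reports how many bands (channels) it has. */
    static int bandCount(byte[] encoded) {
        try {
            BufferedImage decoded = ImageIO.read(
                    new MemoryCacheImageInputStream(new ByteArrayInputStream(encoded)));
            if (decoded == null) {
                throw new IllegalStateException("no reader could decode the data");
            }
            return decoded.getRaster().getNumBands();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Comparing encodeJpeg(image).length with encodeJpeg(toGray(image)).length on a few black and white pages would show how much of the jpg size is actually due to the color encoding.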

eroux · 2019-03-04T11:09:45Z

Excellent yes, this is something we should implement and contribute to Hymir

MarcAgate · 2019-03-04T13:44:11Z

Here is a perf report (after I got rid of a double call to S3 by modifying the hymir Image Service implementation). Most of the time is taken by image processing, and png processing takes approximately twice as long as jpg processing.

eroux · 2019-03-04T13:46:58Z

Can you record the methodology and numbers in a Google doc? Also, maybe the cache mechanism should cache the result (png or jpg) instead of the source? It would probably make more sense in most cases
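As a rough illustration of caching the result rather than the source, the sketch below keeps encoded bytes in a small least-recently-used map keyed by the request path. Everything in it is an assumption made for the example (the ResultCacheSketch name, the entry-count limit, keying on the full IIIF request path); a real cache would more likely bound total bytes and live on disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical in-memory cache of encoded results, for illustration only.
final class ResultCacheSketch {

    private final Map<String, byte[]> entries;

    ResultCacheSketch(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached bytes for a request path, encoding them only on a miss. */
    synchronized byte[] get(String requestPath, Supplier<byte[]> encoder) {
        byte[] cached = entries.get(requestPath);
        if (cached == null) {
            cached = encoder.get();
            entries.put(requestPath, cached);
        }
        return cached;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

The encoder supplier stands in for the whole read, transform and encode pipeline, so a repeated request for the same path skips both the S3 fetch and the image processing.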

xristy · 2019-03-04T15:54:12Z

@MarcAgate thanks for the initial data. Some questions:

what is the "Building identifier from ldspdi" task? At ~12 ms per identifier that seems rather high just to construct an identifier, so there must be more to the task. If it's a round trip from hymir to ldspdi to fuseki to ldspdi to hymir then perhaps some short-circuiting should be considered
For the 6 samples the average transfer rate is 284 KBps w/ an average latency of 90 ms. It would be excellent to see data for fetching 1, 2, 4, 8 images at a time (either concurrently or via some sort of a bulk transfer, if possible, via the s3 api) with images around 25KB and several series of larger images, e.g., 500KB, 1MB, 2MB, and 5MB. The objective is to get a sense of how sensitive the ec2-s3 access is to how much is fetched at once and the overall size of the transfer
caching of the result (jpg or png) is not unreasonable, but it is worth considering caching the source as well, since if a pdf is requested, generating it directly from the source tiff (which pdf is quite happy with) will avoid per-image png/jpg processing.
further, regarding pdf generation it will be reasonable to use concurrent requests to s3, which is what s3 is designed for (rather than low-latency access). See Request Rate and Performance Guidelines and 10 Things About Using S3
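The measurement asked for above (fetching 1, 2, 4 or 8 images at a time and comparing throughput) could be sketched along these lines. The fetcher function is a placeholder for whatever actually reads an object from S3; no AWS SDK call is shown or assumed, and the ConcurrentFetchSketch and BatchResult names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical timing harness, for illustration only.
final class ConcurrentFetchSketch {

    /** Total bytes fetched in one batch and the wall-clock time it took. */
    static final class BatchResult {
        final long totalBytes;
        final long elapsedNanos;

        BatchResult(long totalBytes, long elapsedNanos) {
            this.totalBytes = totalBytes;
            this.elapsedNanos = elapsedNanos;
        }

        double kiloBytesPerSecond() {
            return elapsedNanos == 0 ? 0.0 : (totalBytes / 1000.0) / (elapsedNanos / 1e9);
        }
    }

    /** Fetches all keys with at most `parallelism` requests in flight at once. */
    static BatchResult fetchBatch(List<String> keys, int parallelism,
                                  final Function<String, byte[]> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            long start = System.nanoTime();
            List<Future<byte[]>> pending = new ArrayList<>();
            for (final String key : keys) {
                Callable<byte[]> task = () -> fetcher.apply(key);
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<byte[]> future : pending) {
                total += future.get().length; // blocks until that fetch is done
            }
            return new BatchResult(total, System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Running fetchBatch over the same key list with parallelism 1, 2, 4 and 8, and again with series of larger objects, would give the sensitivity data described above.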

MarcAgate · 2019-03-04T16:05:03Z

Yes, building identifier is a round trip hymir-ldspdi-fuseki
Given the hymir implementation, we need to cache both the raw tiff and the processed images (in any standard format). I am currently working on this.
and 4) we already use multithreaded and concurrent s3 requests. I'll try to get more data on this if we decide that I should spend all that time on the overall iiif performance matter.

eroux · 2019-03-04T16:18:05Z

Thanks for the Google doc, I missed it the first time, sorry!

The round trip to Fuseki should happen only once per volume and not once per image request (is that the case?) so it's not the most important part I think (although of course we cannot ignore it).
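A minimal sketch of "once per volume" using only the JDK: memoize the lookup in a concurrent map keyed by volume id. The lookup function stands in for the hymir to ldspdi to Fuseki round trip, and the names used here (VolumeInfoCacheSketch, identifierFor, a String result) are assumptions for the example rather than anything taken from the real code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical per-volume memoization, for illustration only.
final class VolumeInfoCacheSketch {

    private final ConcurrentMap<String, String> byVolumeId = new ConcurrentHashMap<>();
    private final Function<String, String> lookup;

    /** @param lookup the expensive call (assumed here to be the ldspdi round trip) */
    VolumeInfoCacheSketch(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Runs the lookup at most once per volume id; later calls reuse the stored value. */
    String identifierFor(String volumeId) {
        return byVolumeId.computeIfAbsent(volumeId, lookup);
    }
}
```

With this in place, every image request in a volume after the first one skips the round trip, which matters most for viewers that request many pages or tiles of the same volume in a row.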

IIIF performance is a crucial part of the new system, yes; there should be at least no regression from tbrc.org (and if possible an improvement). So it's absolutely worth spending some time on it. Maybe Jim can work on the image processing part (what do you think, Jim?). But the rest (S3 connection, cache, etc.) is important too.

jimk-bdrc · 2019-03-04T21:15:56Z

I have to finish the audit tool (could be 6 weeks). If I start building image processing, that would delay starting the Asset Manager data plane and other things, although I think it would be a good thing to have in place where both Asset Manager and BUDA could get to it.

eroux · 2019-03-07T21:53:55Z

There's another optimization to be done around here I think. We could have a function that detects if the image needs to be transformed, and if not the imgReader could be redirected directly to the output. In short: when the original is a jpg and the request is for the exact same jpg (/full/full/0/default.jpg), then the original can be served directly. This will optimize many cases.
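The detection function could look something like the sketch below, which works on the five IIIF Image API request parameters as strings. It is only an illustration: the PassThroughSketch name and the extension normalization are assumptions, and treating size "max" like "full" assumes the server does not cap the maximum size below the original dimensions.

```java
import java.util.Locale;

// Hypothetical pass-through check, for illustration only.
final class PassThroughSketch {

    /** Maps common extension variants to one spelling (jpeg to jpg, tiff to tif). */
    static String normalizeFormat(String extension) {
        String format = extension.toLowerCase(Locale.ROOT);
        if (format.equals("jpeg")) {
            return "jpg";
        }
        if (format.equals("tiff")) {
            return "tif";
        }
        return format;
    }

    /**
     * True when the request asks for the untransformed image in the source's own
     * format, e.g. /full/full/0/default.jpg for a jpg source.
     */
    static boolean canServeOriginal(String region, String size, String rotation,
                                    String quality, String requestedFormat,
                                    String sourceFormat) {
        boolean fullRegion = region.equals("full");
        boolean fullSize = size.equals("full") || size.equals("max");
        boolean noRotation = rotation.equals("0");
        boolean defaultQuality = quality.equals("default");
        boolean sameFormat =
                normalizeFormat(requestedFormat).equals(normalizeFormat(sourceFormat));
        return fullRegion && fullSize && noRotation && defaultQuality && sameFormat;
    }
}
```

When canServeOriginal returns true, the source bytes can be streamed to the response without decoding; any other request falls through to the normal processing path.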

eroux · 2019-03-09T10:49:34Z

bad news, none of the viewers can be asked to use png by default, see

IIIF/api#1786

so we need to make the jpg output better

eroux · 2019-03-09T17:18:40Z

@berger-n is it feasible to package and use our fork of openseadragon? if it requires some code change (which I suspect it will), it would be best to create a new branch (buda-package), that would include the fix-1343 branch. I think if we can do that we could potentially use the .png instead of .jpg

MarcAgate · 2019-03-13T01:52:04Z

I think we have reached a first optimized state at this point:

Some methods were rewritten to avoid multiple calls to the S3 "repository" (since hymir is using local disk storage, this part was left "un-optimized")
Ldspdi is called only one time per volume (cached using the volumeId as the key)
S3 source images are being cached (useful for "zoom calls" inside image viewers)
PNG file size has been reduced to the size we actually have on tbrc.org (by using JAI library)
JPG processing time has improved dramatically by using turbojpeg (which was buggy and failing)
Requests are redirected when the requested file format is that of the S3 image source
Overall, request processing time has improved by a factor of 6 to 12 (to give an idea, a 600ms processing time went down to between 100ms and 50ms)

At this point:

we still have to improve PNG processing time (it is at least twice the time of jpeg processing)
we are kind of "stuck" on jpeg output size (we can have a 50k image here and a 529k image here); it obviously depends on the source image, so we might have to configure and refine the output processing accordingly
another size issue concerns PNG in color (for instance W22073 images are 650k on average and this one is 6.7MB)

eroux · 2019-03-13T09:04:14Z

Great job, that's really a huge improvement, and now we have a production-ready server, thanks! The 3 final points are less important I think as they don't hinder the performance too much and there's not much we can do about it... closing

eroux closed this as completed Mar 13, 2019
