reducing size of output jpg #23

Closed
6 tasks
eroux opened this issue Mar 4, 2019 · 17 comments

Comments

@eroux
Contributor

eroux commented Mar 4, 2019

Context: on S3, the tif corresponding to this image is below 30KB, but the output jpg on the iiif server is 461KB.

On the current website (using JAI), the corresponding image (here) is a png of about 30KB.

The png version is only 53KB, much more reasonable but still significantly more than the current website.

It seems hymir just uses the basic javax.imageio functions (see here) as provided by twelvemonkeys. The parameters that we can use in JPEGImageWriteParam look very limited. There doesn't seem to be a much better option in Java though.

This is an important issue for various reasons:

  • bigger files take longer to load
  • they cost us more to transfer to the user (we're paying the bandwidth)
  • a factor 10 in size is just completely unreasonable and probably indicates some deep problems

Here are a few ideas to start dealing with the issue:

  • first, let's bring @TBRC-JimK in: Jim, you'll develop some expertise in image treatment in Java for the asset manager, maybe we should share our doc, techniques, libraries, code, etc.?
  • a first easy action would be to tweak the Java jpg encoding quality values with the method suggested here; 90% is probably sufficient
  • we should log the decoders/encoders used by javax.imageio to make sure the correct ones are used (and probably also make sure we understand what the correct ones are)
  • we can also run some experiments to understand why a png produced with JAI is half the size of a png produced with imageio, and report bugs or tweak configuration if needed
  • (I'm not sure my diagnosis is right here) then we should understand why the output jpg is full color while the original tif is black and white. It will require some diving into the Java APIs and internal image representation in Java
  • then we should understand if in these cases we can indicate to the iiif viewer to prefer png to jpg. This will require also some diving, this time in the iiif APIs (that's probably a job for me)
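
The quality tweak and the encoder logging suggested above can be sketched with plain javax.imageio calls. This is only an illustration, not hymir's code: the class and method names (`JpegQualitySketch`, `writerNames`, `encodeJpeg`) are made up for the example, and which writer comes first depends on what is on the classpath (for instance twelvemonkeys).

```java
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of hymir.
final class JpegQualitySketch {

    /** Class names of every registered ImageWriter for a format, e.g. for logging. */
    static List<String> writerNames(String format) {
        List<String> names = new ArrayList<>();
        Iterator<ImageWriter> it = ImageIO.getImageWritersByFormatName(format);
        while (it.hasNext()) {
            names.add(it.next().getClass().getName());
        }
        return names;
    }

    /** Encodes an image as JPEG with an explicit quality between 0 and 1. */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        Iterator<ImageWriter> it = ImageIO.getImageWritersByFormatName("jpeg");
        if (!it.hasNext()) {
            throw new IllegalStateException("no JPEG writer registered");
        }
        ImageWriter writer = it.next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        // Quality is only honored in explicit compression mode.
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

Calling `JpegQualitySketch.encodeJpeg(image, 0.9f)` would correspond to the 90% setting mentioned above; comparing the byte counts at a few quality values on real scans is probably the quickest way to pick a value.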
@MarcAgate
Collaborator

MarcAgate commented Mar 4, 2019

I just want to add one remark here: unlike jpg rendering, png rendering of a tiff produces a png roughly twice the size of the initial tiff. It seems to me we might have an issue with jpg only.

@eroux
Contributor Author

eroux commented Mar 4, 2019

I'm on the train now, but yes, we should run the same kind of tests on jpg files and compare the sizes on S3, tbrc.org and iiif

@eroux
Contributor Author

eroux commented Mar 4, 2019

(Also note that 53KB is 76% larger than 30KB, this is not something we can just ignore as a rounding error)

@MarcAgate
Collaborator

MarcAgate commented Mar 4, 2019

As far as color vs. BW is concerned, see https://github.com/dbmdz/iiif-server-hymir/blob/master/src/main/java/de/digitalcollections/iiif/hymir/image/frontend/IIIFImageApiController.java#L125 where COLOR is hardcoded. We can fix that on our side since we have our own Controller implementation.

Posted an issue and suggestion on hymir repo: dbmdz/iiif-server-hymir#59 (implemented it on our server)
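
To get a feel for how much the color model matters for jpg size, here is a rough sketch that re-encodes the same picture as RGB and as 8-bit grayscale using only javax.imageio. It is not the fix submitted to hymir; `GrayscaleJpegSketch`, `toGray` and `jpegSize` are names invented for this example.

```java
import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of hymir.
final class GrayscaleJpegSketch {

    /** Redraws the source into a single-channel 8-bit grayscale image. */
    static BufferedImage toGray(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Size in bytes of the image once encoded as JPEG with default settings. */
    static int jpegSize(BufferedImage image) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(image, "jpg", bytes)) {
                throw new IllegalStateException("no JPEG writer accepted this image type");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }
}
```

Comparing `jpegSize(image)` with `jpegSize(toGray(image))` on a few black and white scans should show whether forcing a non-color output is worth it for our files.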

@eroux
Contributor Author

eroux commented Mar 4, 2019

Excellent yes, this is something we should implement and contribute to Hymir

@MarcAgate
Collaborator

MarcAgate commented Mar 4, 2019

Here is a perf report (after I got rid of a double call to S3 by modifying the hymir Image Service implementation). Most of the time is taken by image processing, and png processing takes approximately twice the time of jpg processing.

@eroux
Contributor Author

eroux commented Mar 4, 2019

Can you record the methodology and numbers in a Google doc? Also, maybe the cache mechanism should cache the result (png or jpg) instead of the source? It would probably make more sense in most cases

@xristy

xristy commented Mar 4, 2019

@MarcAgate thanks for the initial data. Some questions:

  • what is the "Building identifier from ldspdi" task? At ~12 ms per identifier that seems rather high just to construct an identifier, so there must be more to the task. If it's a round trip from hymir to ldspdi to fuseki to ldspdi to hymir, then perhaps some short-circuiting should be considered
  • For the 6 samples the average transfer rate is 284 KBps w/ an average latency of 90 ms. It would be excellent to see data for fetching 1, 2, 4, 8 images at a time (either concurrently or via some sort of a bulk transfer, if possible, via the s3 api) with images around 25KB and several series of larger images, e.g., 500KB, 1MB, 2MB, and 5MB. The objective is to get a sense of how sensitive the ec2-s3 access is to how much is fetched at once and the overall size of the transfer
  • caching of the result (jpg or png) is not unreasonable, but it is worth considering caching the source as well, since if a pdf is requested, generating it directly from the source avoids the per-image png/jpg conversion from tiff; pdf is quite happy with tiff.
  • further, regarding pdf generation, it will be reasonable to use concurrent requests to s3, which is what s3 is designed for (rather than low-latency access). See Request Rate and Performance Guidelines and 10 Things About Using S3
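
For the 1, 2, 4, 8 images-at-a-time measurements, a small harness along these lines might be enough. It is a sketch under assumptions: `ConcurrentFetchSketch` and `fetchAll` are made-up names, and the `fetcher` function is a stand-in for whatever S3 client call the server actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical harness; the fetcher stands in for the real S3 call.
final class ConcurrentFetchSketch {

    /** Fetches all keys with at most `parallelism` requests in flight, keeping key order. */
    static List<byte[]> fetchAll(List<String> keys,
                                 Function<String, byte[]> fetcher,
                                 int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit everything first so the requests overlap.
            List<CompletableFuture<byte[]>> futures = new ArrayList<>();
            for (String key : keys) {
                futures.add(CompletableFuture.supplyAsync(() -> fetcher.apply(key), pool));
            }
            // Then wait for each result in the original order.
            List<byte[]> results = new ArrayList<>();
            for (CompletableFuture<byte[]> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Timing `fetchAll` with `parallelism` set to 1, 2, 4 and 8 over the same list of keys would give the comparison asked for above.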

@MarcAgate
Collaborator

  1. Yes, building identifier is a round trip hymir-ldspdi-fuseki
  2. Given hymir implementation, we need to cache both the raw tiff and processed images (in any standard format). I am currently working on this.
  3. and 4. We already use multithreaded and concurrent s3 requests. I'll try to get more data on this if we decide that I should spend all that time on the overall iiif performance matter.

@eroux
Contributor Author

eroux commented Mar 4, 2019

Thanks for the Google doc, I missed it the first time, sorry!

The round trip to Fuseki should happen only once per volume and not once per image request (is that the case?) so it's not the most important part I think (although of course we cannot ignore it).

The iiif performance is a crucial part of the new system, yes; there should be at least no regression from tbrc.org (and if possible an improvement). So it's absolutely worth spending some time on it. Maybe Jim can work on the image processing part (what do you think, Jim?). But the rest (S3 connection, cache, etc.) is important too.

@jimk-bdrc

jimk-bdrc commented Mar 4, 2019 via email

@eroux
Contributor Author

eroux commented Mar 7, 2019

There's another optimization to be done around here I think. We could have a function that detects if the image needs to be transformed, and if not the imgReader could be redirected directly to the output. In short: when the original is a jpg and the request is for the exact same jpg (/full/full/0/default.jpg), then the original can be served directly. This will optimize many cases.
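
A minimal sketch of such a detection function, assuming the IIIF request has already been split into its region, size, rotation, quality and format parts. The names `PassThroughSketch`, `isPassThrough` and `normalize` are hypothetical, and the check only covers the exact /full/full/0/default case described here.

```java
import java.util.Locale;

// Hypothetical helper, not part of hymir.
final class PassThroughSketch {

    /** True when the request asks for the untouched image in the source's own format. */
    static boolean isPassThrough(String region, String size, String rotation,
                                 String quality, String requestedFormat,
                                 String sourceFormat) {
        return "full".equals(region)
                && "full".equals(size)
                && "0".equals(rotation)
                && "default".equals(quality)
                && normalize(requestedFormat).equals(normalize(sourceFormat));
    }

    /** Maps common extension variants to one spelling so "jpeg" matches "jpg". */
    static String normalize(String format) {
        String f = format.toLowerCase(Locale.ROOT);
        if (f.equals("jpeg")) {
            return "jpg";
        }
        if (f.equals("tiff")) {
            return "tif";
        }
        return f;
    }
}
```

When the function returns true the source bytes can be streamed as they are; any other request goes through the normal image processing path.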

@eroux
Contributor Author

eroux commented Mar 9, 2019

bad news, none of the viewers can be asked to use png by default, see

IIIF/api#1786

so we need to make the jpg output better

@eroux
Contributor Author

eroux commented Mar 9, 2019

@berger-n is it feasible to package and use our fork of openseadragon? if it requires some code change (which I suspect it will), it would be best to create a new branch (buda-package), that would include the fix-1343 branch. I think if we can do that we could potentially use the .png instead of .jpg

@MarcAgate
Collaborator

I think we have reached a first optimized state at this point:

  • Some methods were rewritten to avoid multiple calls to the S3 "repository" (since hymir uses local disk storage, this part had been left "un-optimized")
  • Ldspdi is called only once per volume (the result is cached using the volumeId as the key)
  • S3 source images are cached (useful for "zoom calls" inside image viewers)
  • PNG file size has been reduced to the size we actually have on tbrc.org (by using the JAI library)
  • JPG processing time has dramatically improved by using turbojpeg (which was previously buggy and failing)
  • Requests are redirected when the requested file format is that of the S3 image source
  • Overall, request processing time has improved by a factor of 6 to 12 (to give an idea, a 600ms processing time went down to between 100 and 50ms)
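
The once-per-volume lookup can be illustrated with a plain in-memory map keyed by volumeId. This is a sketch only, not the cache actually used on the server: `VolumeCacheSketch` is a made-up name, the `loader` stands in for the real ldspdi call, and a real cache would also need eviction and expiry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical cache; the loader stands in for the ldspdi round trip.
final class VolumeCacheSketch<V> {

    private final ConcurrentMap<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> loader;

    VolumeCacheSketch(Function<String, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value for a volume, calling the loader only on the first request. */
    V get(String volumeId) {
        return cache.computeIfAbsent(volumeId, loader);
    }
}
```

With this shape, every image request of the same volume after the first one skips the round trip entirely.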

At this point:

  • we still have to improve PNG processing time (it is at least twice the time of jpeg processing)
  • we are kind of "stuck" on jpeg output size (we can have a 50k image here and a 529k image here); it obviously depends on the source image, so we might have to configure and refine the output processing accordingly
  • another size issue concerns PNG in color (for instance, W22073 images are 650k on average and this one is 6.7MB)

@eroux
Contributor Author

eroux commented Mar 13, 2019

Great job, that's really a huge improvement, and now we have a production-ready server, thanks! The 3 final points are less important I think as they don't hinder the performance too much and there's not much we can do about it... closing

@eroux eroux closed this as completed Mar 13, 2019