-
-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
memory optimisation for extract using roaringbitmap? #258
Comments
Digging a little deeper I see that the IDs are stored in an osmium::index::IdSetDense. The implementation is using a bitmap under the hood, seems I assumed wrong there. I'm not super familiar but roaring might provide an improvement, I believe there are three strategies they use depending on the total members in the set, but it might be negligible while also introducing a dependency. |
I'm not a CPP programmer but it seems that this line is allocating a lot more memory than it really needs. I might be reading this wrong but it seems that the size of
The issue with using a vector rather than using a map is that there are potentially large sections of the vector which are empty, is this consuming memory? |
I've generated a custom PBF file which demonstrates the issue: When I run the following command all the RAM on my system is consumed: osmium extract -s 'smart' -b '0,0,2,2' -o out.pbf --overwrite osmium-oom.pbf The input file contains 73 elements, all of which have IDs under the max IDs currently available in OSM |
Are you just playing around with this and had an idea for an improvement or do you actually have some problem using it with OSM data? |
It seems the Bitmask implementation in I wanted to open a ticket to discuss before implementing anything, to see if it was worth putting in effort to implement a different Bitmask algorithm, and whether you would consider merging it. It's hard to do a definitive comparison on implementation because they all perform differently in terms of memory consumption and CPU utilization depending on how sparse/dense the set is. The main thing I would consider changing is that the Bitmask size correlates to the highest ID in the set rather than the population of the set. Is there any reason why you're using a Vector instead of a Hashmap? |
I assume this means "just playing around". Feel free to do that and implement something better. The reason the code is the way it is, is because it is simple and because it works for the stuff that people generally do. I am fully aware that it is not perfect and that it will not work for all use cases, but it is hard to write something that works in all cases, uses a minimum amount of memory, is performant etc. If you come up with something better which is supported by tests, sure I'll consider it. If you want to introduce a new dependency, like RoaringBitmap, there is a good chance that I'll reject it, though. Adding dependencies comes with a lot of cost for users of the library, so it better be worth it. As a minimum it must be available for all platforms we support, which also means somewhat older Ubuntu versions etc. (If it can be optional that would be a possibility also.)
I assume you are talking about |
totally agree, this is why I would advocate using Roaring as it's well studied and battle-tested, rather than hacking something myself.
The CRoaring library offers an 'amalgamation' build, would you consider including this in libosmium if it was proven to have significant benefits over the existing bitmap implementation? |
My CPP is terrible but I managed to butcher together a branch for testing here: |
The comment underneath details is no longer relevant for the problem at hand…
For my Overpass API fork I implemented "IdSetHybrid" which combines ideas from IdSetSmall and IdSetDense by using whatever representation (sorted int vector+ binary search, or bitmap) uses less memory. Also, the vector bucket stores 64 bit values as 32 bit to further cut down memory usage.
Conversion from one representation to the other happens on the fly, once the size of a bucket hits a certain threshold value. If I understood the Roaring bitmap paper correctly, one part of their implementation follows a similar pattern. I believe libosmium already comes with an IdSet implementation, which is using a global switch to migrate from binary search to bitmaps, whereas in my implementation, this decision is being made on a bucket-by-bucket basis. The implementation is not perfect, and doesn't support all of the methods provided IdSetDense. Still it was good enough for my use case. Feel free to play around with it: https://github.com/mmd-osm/Overpass-API/blob/test7591/src/overpass_api/data/abstract_processing.h#L38-L199 This code doesn't introduce any new external dependencies, like CRoaring. |
I've some some basic benchmarking of
The reduced memory requirement was very useful for me when cutting down the planet file to smaller extracts, as per Geofabrik downloads. Previously the machine would go to swap with 128GB RAM, it now uses ~20GB. I think it would be nice to have a default bitmask implementation for |
The reason why your example file takes so much memory is the single node with a negative node id:
If you change that node from e.g. -1 to 2 in the file, the memory consumption is insignificant. IdSetDense works with unsigned integers, so -1 will end up as a fairly large positive value (0xFFFFFFFFFFFFFFFF). I'd say, this issue is probably a duplicate of #234, or at least a related issue. Also the question comes up if this is a realistic test case in the first place. Planet files don't contain any negative object ids. |
Hi I've read the interesting discussion here. I have one case where I am investigating the osmium-tool extract performance. To generate the files for Norway, it takes about 10200 seconds, and the osmium extract time is about 7800 of these seconds. My initial investigation showed that it looked promising to make one json input file to osmium-tool extract, and have it create My initial test based on 30 tiles, was that I could decrease the time used by 70%, by invoking osmium tool once instead of 30 times. So I looked at this discussion, and made some tests. The CRoaring also looks interesting since they are working on adding AVX2 and AVX512 support, but I am not sure if that would eventually help on the performance, RoaringBitmap/CRoaring#454. I observe the similar things as @missinglink , which is that using the CRoaring really reduces the memory usage sharply, but the performance also goes down. It would be really interesting to test how the memory usage / performance would be for my case using that hybrid approach. I also notice that the CPU usage is around 100% on a 16 core machine, i.e. only one core is really used. The memory usage using the CRoaring for Norway was around 6GB (shown by htop) (but virtual memory usage was 86GB). Here is the "/usr/bin/time -v" output on linux for my run for the "filtered_names" input file.
And here for the filtered input file, with example of how the size of the tiles are.
File sizes for Norway, although I appreciate that it is really the highest node id value to influences max memory requirement.
So total time is about 3500 seconds, which is even a lot better than the 7700 seconds (although on a different machine, but CPUs are quite similar.) I am also doing testing on Germany and France, which both are broken up into less than 100 tiles. But there are still memory issues there. I then tried with another cloud machine, with 176 GB RAM. It ran out of memory just after starting phase 2 for Norway. For Germany, I could complete one test with standard osmium and one with the croaring implementation.
Osmium modified to use CRoaring
So for this input file, the croaring impl was actually 6 times slower, but used 95% less memory. For the Germany filtered_names.o5m.pbf input, I saw this using standard osmium
And using Croaring
So this was a lot quicker than when using the filtered.o5m.pbf input file. For this case, the CRoaring actually was faster for the phase 2. And it was really phase3 that was a lot slow for the CRoaring. File size for Germany
Version info
Overall, I am surprised that the CPU usage is not higher, mainly 1-2 CPU cores being used, it seems like. |
GitHub seems to be hiding that info in this comment under an arrow |
@missinglink I saw the performance numbers in your comment, but I wonder where the source code for the IdSetHybrid, I would like to test that source code on my test cases. |
It's linked in that comment |
I tried today with the IdSetHybrid, taken from https://github.com/mmd-osm/Overpass-API/blob/test7591/src/overpass_api/data/abstract_processing.h#L38-L199. I actually used until line 226. But for the "relations", the IdSetHybrid seems to be missing a begin and end method, since I get compilation error
I then tried using the IdSetHybrid for all except the relations, where I used the BufferedBitmask. I see that CRoaring has the iterator methods that I assume are needed
but the IdSetHybrid seems to be missing this. @missinglink Did you have any issues with this ? |
Sorry it was quite a while ago, I checked my old laptop but I don't have the code any longer. I don't recall having any issues getting the IdSetHybrid code to compile, sorry I don't recall if I tested the smart strategy. |
Yes, that's intentional, I didn't have any use for it, that's why I didn't implement the Iterator in IdSetHybrid. |
Thanks for the feedback, I am looking at the CRoaring approach now and not on the IdSetHybrid anymore, since Croaring seems to have good potential, I seem to be able to speed up the phase3 by using some CRoaring functionality. I will also look into a couple of other ideas I have to speed up the performance. Also, since the extract command does not seem to be very multithreaded, I also see that I can run many extract commands in paralell, so it seems like the Canada case could be done in a couple of hours (with 160GB RAM and 20 vCPU), instead of a couple of days, if I use about 800 extracts per "config file", instead of invoking the extract command for each of the 4500 tiles. |
For me, CRoaring would have introduced too much of a dependency. All I needed was a mix of IdSetSmall and IdSetDense, which switches from array to bitmap container when hitting a break even point. CRoaring does similar things (see https://arxiv.org/pdf/1603.06549.pdf page 6), yet adds further performance improvements like Run length encoding, and processor specific optimizations. Also, I only needed adding values to IdSetHybrid, and checking for their existence later on. |
What version of osmium-tool are you using?
What operating system version are you using?
Tell us something about your system
64GB RAM
What did you do exactly?
Cutting many extracts uses a large amount of memory (as documented).
What did you expect to happen?
I expected the internal memory requirement to be lower, assuming that a bitmap structure was being used to track dependent OSM elements.
What did happen instead?
It seems that element IDs are being stored as integers?
What did you do to try analyzing the problem?
I would just like to enquire if something like https://roaringbitmap.org/ could be used to keep track of which elements are required in each extract, in order to greatly reduce the memory requirements?
The text was updated successfully, but these errors were encountered: