Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code returns inconsistent results with CGAL 5.5 #3

Open
d-krupke opened this issue Oct 23, 2022 · 40 comments
Open

Code returns inconsistent results with CGAL 5.5 #3

d-krupke opened this issue Oct 23, 2022 · 40 comments

Comments

@d-krupke
Copy link
Member

When compiling the code with GCC and CGAL 5.5, we get wrong results from CGAL::join and sometimes segmentation faults from deep within CGAL. We should extract the corresponding instances and write a minimal piece of code to reproduce this error. If we cannot do this, the bug is probably from our side. If we can do it, it may actually be a CGAL bug and we should fill out a bug report.

Additionally, make sure that if the system provides CGAL, still the conan version is used.

@gfonsecabr
Copy link

I had segmentation faults on join when using my testing code (that was a very similar modified copy of the official test code) compiling against the development version of CGAL with the attached solution file. This is the smallest one I could obtain. The crash is fairly rare on small invalid solutions and becomes very common on instances with more than 10,000 vertices.

The crash happens on the join function that fills a vector of polygons with holes, as in the official code. I then switched my code to use the Polygon_set_2 join method instead and the problem disappeared.

cheese528.BAD.sol.zip

@phillip-keldenich
Copy link
Contributor

phillip-keldenich commented Nov 14, 2022

Yes, this is about what we have seen with that version of CGAL.
Are we sure this is a bug in CGAL? Other than a potential bug in CGAL, do you see any reason why our code would exhibit this behavior, i.e., because it inputs incorrect things into CGAL?
Maybe we should file a bug/issue with CGAL...

@gfonsecabr
Copy link

I do not see any reason why it would be a bug in our code. I could not find any pattern on the cases when it happens to build a small example, though. However, it happens on every execution of the problematic files.

@d-krupke
Copy link
Member Author

d-krupke commented Nov 14, 2022

I just asked Efi and he suggested to try (UsePolylines=)Tag_false in cgal::join as in this file.

@phillip-keldenich
Copy link
Contributor

phillip-keldenich commented Nov 14, 2022

Ah - but does he think that this behavior is working as intended or would they be interested in a bug report?

@d-krupke
Copy link
Member Author

d-krupke commented Nov 14, 2022

With this I get a 'the union of the polygons leaves uncovered some area of volume 104.986317756746 at or near point (1289.8, 1191.5)'. With the original version I could recreate the segmentation fault on Mac (intel).
Efi thinks it is a bug: The new code is more efficient but not as well tested.

@d-krupke
Copy link
Member Author

The following C++-code results in a segmentation fault with CGAL 5.5 cgal_problem.cpp.zip. It essentially converts the solution directly to a vector of cgal polygons without any further dependencies.

@gfonsecabr
Copy link

I think that even on the version of CGAL used by pyutils23, the problem still exists, even if it is more rare. I'm attaching a solution that gives a segmentation fault, and I believe it is a valid solution. Using Polygon_set_2::join seems to solve the problem.
cheese4951.segfault.sol.zip

@d-krupke
Copy link
Member Author

d-krupke commented Dec 6, 2022

That would be surprising: If I remember correctly, Efi told me that the old join is thoroughly tested.
We will take a look into this, but it may take some days.

@gfonsecabr
Copy link

So, maybe the problem is somewhere else and deserves a separate issue. All, I know is that when I test this solutions with pyutils23, I consistently get a segmentation fault (could be anywhere).

@d-krupke
Copy link
Member Author

d-krupke commented Dec 6, 2022

I would prefer it to be our fault, because then it is easier to fix. 😅
But it would be odd if exchanging the join suddenly makes it work. Have to look at it in the debugger.

@gfonsecabr
Copy link

I have no idea how to debug this mixed python C++, or even how to compile everything with debug information. I think it'll be easier for you to test this file when you have the time. It's not urgent. I was just surprised that it gives a segmentation fault and it was the only file to do so out of several hundreds. When I tested it with a separate code by myself (that is only really meant to test coverage, not if the polygons stick outside for example), the file seems ok. So, I thought it was valuable information for the future.

@phillip-keldenich
Copy link
Contributor

I cannot reproduce the segmentation fault locally. The instance/solution combination, for me, is verified correctly (i.e., claimed to be valid).
How do you start/call verification?

@gfonsecabr
Copy link

That's surprising. I test it with:

#!/usr/bin/python3

from cgshop2023_pyutils import InstanceDatabase, read_solution, verify
import sys

idb = InstanceDatabase("/home/fonseca/Documents/src/socg23challenge/instances/")
solution = read_solution(sys.argv[1])
instance = idb[solution["instance"]]

err_msg = verify(instance, solution)
if err_msg:
    print(sys.argv[1],"SOLUTION INVALID:", err_msg)

@phillip-keldenich
Copy link
Contributor

Looks solid. Even under debug builds, it still works for me. There are, however, a few sanitizer warnings from the undefined behavior sanitizer that look like some sort of type confusion within CGAL/Surface_sweep_2/Arr_no_intersection_insertion_ss_visitor.h (or may be false positives, though I'm not sure what would cause such false positives). Are you absolutely sure that you built against the 5.3.2 version of CGAL, or is there even a slight chance that somehow, the newer headers ended up being the ones dragged in for some reason?

@gfonsecabr
Copy link

I have no idea which version of CGAL is used. How can I find out? I simply ran

pip install cgshop2023-pyutils

as a normal user on a Fedora machine without any CGAL installed system-wide.

@d-krupke
Copy link
Member Author

d-krupke commented Dec 7, 2022

Then it should be the exact same version we are using. I am worried: As far as I know, there are only two different algorithms used. One of them, we already know to be buggy. If there also is a bug in the other one, just calling it (indirectly) in another way won't be a reliable workaround.
Still, I have hope that the problem may be on our side as this older part should be tested well. Worst case, we would have to quickly build another verifier directly on arrangements.

@phillip-keldenich
Copy link
Contributor

@gfonsecabr You can kinda do a debug build as follows: Pull the latest version of the repository. There is some minor change to the way cmake looks up find modules, which should make a CMake build of the C++ side easier. Create a build directory in the top-level directory of the repository, e.g., called 'cmake-build-debug' or so, and enter it. Execute 'conan install .. -s build_type=Debug --build=missing' in it. Then execute 'cmake -DCMAKE_BUILD_TYPE=Debug ..' in it, followed by 'cmake --build .' This should produce a debug build in the build directory; the tricky part is (temporarily) replacing the installed version of the C++ module by the debug version to test it/be able to see where the problem occurs. For that, you'd have to find the installation location of the package you installed (probably in some conda environment).

@phillip-keldenich
Copy link
Contributor

Also, could you possibly tell me exactly what C++ compiler/version you are using? Maybe I can reproduce the issue on a linux system, then.

@gfonsecabr
Copy link

I'm using g++ (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)

@phillip-keldenich
Copy link
Contributor

I've tried to reproduce on linux with GCC 12 (unfortunately, i only have 12.1 not 12.2...); still no luck. The instance verifies just fine. I guess you already tried to reinstall the package?

@phillip-keldenich
Copy link
Contributor

phillip-keldenich commented Dec 7, 2022

Okay, now we're getting somewhere (needed to patch python with patchelf to add libstdc++ as fake dependency to python3 to get address sanitizer to work in the mixed python-and-c++ setting, and LD_PRELOAD to preload libasan into python3). While it doesn't segfault, it definitely does something wrong. Address Sanitizer recognizes at least one heap use-after-free and a lot of the already-known errors that look like type confusion between a halfedge and a face.
SUMMARY: AddressSanitizer: heap-use-after-free (/.../miniconda3/envs/.../lib/python3.8/site-packages/cgshop2023_pyutils/core/_cgshop2023_core.cpython-38-x86_64-linux-gnu.so+0x263001f) in CGAL::Gps_on_surface_base_2<CGAL::Gps_segment_traits_2<CGAL::Epeck, std::vector<CGAL::Point_2<CGAL::Epeck>, std::allocator<CGAL::Point_2<CGAL::Epeck> > >, CGAL::Arr_segment_traits_2<CGAL::Epeck> >, CGAL::Arr_bounded_planar_topology_traits_2<CGAL::Gps_segment_traits_2<CGAL::Epeck, std::vector<CGAL::Point_2<CGAL::Epeck>, std::allocator<CGAL::Point_2<CGAL::Epeck> > >, CGAL::Arr_segment_traits_2<CGAL::Epeck> >, CGAL::Gps_default_dcel<CGAL::Gps_segment_traits_2<CGAL::Epeck, std::vector<CGAL::Point_2<CGAL::Epeck>, std::allocator<CGAL::Point_2<CGAL::Epeck> > >, CGAL::Arr_segment_traits_2<CGAL::Epeck> > > >, CGAL::Boolean_set_operation_2_internal::PreconditionValidationPolicy>::~Gps_on_surface_base_2()
It doesn't segfault per se (for me, but this error could definitely cause a segfault with different memory size, compiler, ...), but this is definitely an error, and it looks like it's in CGAL.

@phillip-keldenich
Copy link
Contributor

Even with CGAL 5.5 and UsePolylines = CGAL::Tag_false, we sometimes have segfaults on our testing machine (rtx01); whether a segfault occurs or not depends, among other things, on debug vs. release mode.
Unless CGAL identifies the problem and publishes a fixed version, the only thing I see we could do is to reimplement CGAL::difference (using arrangements?) ourselves.

@efifogel
Copy link

The implementation of 2D Boolean ops of CGAL depends on the 2D Arrangement package of CGAL.
The Polygon_set_2 template in particular is based on the Arrangement_2<GeometryTraits, DCEL> template, where the GeometryTraits template parameter is substituted with an instance of the Arr_segment_traits_2<> template; it means that the arrangement is induced by segments. In the free operations case (CGAL::join(), CGAL::intersection(), etc), on the other hand, the geometry traits is replaced by an instance of the Arr_polyline_traits_2<> by default. This behavior can be overridden using the UsePolylines flag (as mentioned above) to fall back to using segments. The switch to using polylines by default was done a couple of years ago, and it hasn't gained a lot of millage I guess. Also, the code that handles polylines is much more complicated. Dominik has sent to me a test case that fails with the default (that is polylines) and I was able to reproduce it. I suspect that this particular failure lies in the plane-sweep code (The code that handles (partially) overlapping polylines is complicated.) but, everything is possible...
If you have something concrete that fails with segments, I'll be happy to try and reproduce myself.

@gfonsecabr
Copy link

Did you test if doing a random permutation of the input set of polygons changes anything?

@efifogel
Copy link

Not sure whether you referred the question to me; in case you did, I haven't tested anything with segments yes.
I've downloaded the cheese528 bad solution file posted in this thread, but I need to add some code to try it out.
BW, thank you for the contest; it was delightful!
In my program I only saved the coordinates as numerators and denominators and each as a string.
Please refer to me the code that parses solution json files that also reads a coordinates represented as a single integral number, or as a numerator and denumerator, each as integral number.
While at it, conceptually, an integral number can be larger than 64 bits, so using strings guarantees correctness, for example:

    using NT = CGAL::Gmpq;
    using Kernel = CGAL::Cartesian<NT>;
    ....
    auto ratx = json_data["x"];
    std::string xnum_str = ratx["num"];
    std::string xden_str = ratx["den"];
    std::string x_str = xnum_str + "/" + xden_str;
    NT x(x_str);

Last thing, I don't understand the output of the sanitizer (perhaps I need to read it again), but the last line contains '~Gps_on_surface_base_2'. So just for laughs, can you please comment out the body of this destructor and try again...

@gfonsecabr
Copy link

I mean that a possible work around for the contest could be doing a random permutation of the solution polygons. This would be very likely to work if the polygons are added one by one and their union is updated. With the current join function that computes the union of many polygons, it depends a lot on the algorithm used (the order shouldn't change anything if a sweep line approach is used internally), but I believe that the join function with iterators is only a shortcut to apply join many times.

@efifogel
Copy link

efifogel commented Jan 30, 2023 via email

@d-krupke
Copy link
Member Author

Using a random permutation, we were able to verify one of the submissions. The other four submissions still make problems after very many shuffles. Also tried working with Debug-compilation mode.

@afabri
Copy link

afabri commented Feb 1, 2023

Hello, I only today learned about your problem through a mail sent by Sándor.
Is your bug reproducible? When you dump the input of a faulty join() operation into a file and read the file in an independent program that calls join() do you then have the same problem?

@afabri
Copy link

afabri commented Feb 1, 2023

@efifogel just filed an issue.

@d-krupke
Copy link
Member Author

d-krupke commented Feb 1, 2023

Great! I was lucky to share an office with Efi for the last three months, so he consulted us directly regarding this problem.
Actually, there is also a problem (segmentation fault) with the old implementation, but it is much more difficult to reproduce. We try to provide a reasonably small example (which will probably still be huge). Maybe there is a relationship between both issues.

@sloriot
Copy link

sloriot commented Feb 1, 2023

what is "huge"?

@d-krupke
Copy link
Member Author

d-krupke commented Feb 1, 2023

Well, the solution files for which this happens range between 6-30mb.

@sloriot
Copy link

sloriot commented Feb 1, 2023

You can open an issue here and I'll try to reduce it. You can put the data at https://gist.github.com

@afabri
Copy link

afabri commented Feb 3, 2023

Hello, @sloriot has identified the problem and you can find his fix in the PR CGAL/cgal#7243
We would be glad if you gave it a try.
And the next time you encounter a problem with CGAL please file an issue. We consider bug reports as contributions, as only if we learn about a bug we can fix it.

@efifogel
Copy link

efifogel commented Feb 3, 2023 via email

@afabri
Copy link

afabri commented Feb 6, 2023

You mean the "fix" does not fix the problem described in this issue, but only a subproblem you, Efi, filed here?

@efifogel
Copy link

efifogel commented Feb 6, 2023

The "fix" addresses a problem found about 3 months ago, and worked around by falling back to the old code of the free Boolean op functions, which do not use polylines. It stopped being an issue once this workaround was in place, which happened 10 minutes after it was brought to my attention (again ~3 months ago). BW, this problem occurs only when using the free functions. When using the CGAL::Polygon_set_2 for example, segments (and not polylines) are used by default. About 2 weeks ago a new issue was encountered. As far as I understand it was (still is) hard to reproduce. Dominik told me that despite this issue, the contest verification process has completed by now. He also told me, that being a "good citizen" in our community, he or his colleges would nevertheless try to peruse it, so that it can be properly addressed. (sorry for the many words...)

@d-krupke
Copy link
Member Author

d-krupke commented Feb 7, 2023

Exactly, we have been falling back to the old code, which worked for most solutions. However, we still got some rare segmentation faults, apparently within CGAL. Also at least one participant reported similar problems within his own code. However, I already spent some time in trying to track down the origin of the problem and it doesn't make much sense to me so far. The big problem with this bug hunt is, that it is not only extremely rare and system dependent, but it only appeared in complete verification processes which are mixed Python/C++-code (PyBind11) and complex to debug properly (cannot really provide a reproducible example based on that). We may let a student assistant create a pure C++ version, which can then be analyzed easier, but I can imagine that then the bug suddenly does no longer appear. I won't rule out completely that this remaining problem may be a bug on our side, but the segmentation fault is happening within CGAL and we check before that the polygons are all nice and simple.
We will also try to check out the fix by simply letting it run over all submitted solutions, but I don't know how quickly we will find the time (or a competent student assistant) for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants