add support for run-bug-run runbugrun #39
Comments
https://github.com/giganticode/run_bug_run_data/releases/tag/v0.0.1 Seems like the first release is out |
happy to take on that |
sounds good, let me know if you have any question! |
I started looking into this yesterday. A few things about run bug run:
|
When you integrate the benchmark you define which commands to run when compiling/testing each bug/patch. The only part that is currently hard-coded for Java is the extraction of single functions, removal of comments, etc. See https://github.com/ASSERT-KTH/elle-elle-aime/tree/master/elleelleaime/core/utils/java To integrate a Python benchmark you'll need to implement similar functions for Python (or even better, using tree-sitter to support more languages). |
Sharing progress so far: #166 I'm a little unclear on Bug.failing_tests -- does it map test methods to the resulting error messages? In run bug run there are simply test inputs and expected outputs, and the buggy code is not always a self-contained function. Also, does the ground_truth diff only come into play when evaluating the LLM-generated fix? Why not simply check if the tests pass? Similarly, I'm not sure if I'm using the checkout logic correctly -- it seems like a drag to have to make a copy each time, so I instead simply read from the original buggy file. Any feedback/corrections welcome! |
Exactly, it maps fully qualified test method names to the error messages.
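For an input/output benchmark such as RunBugRun, a mapping of this shape could be built roughly as sketched below. This is only an illustration, not the framework's code: `collect_failing_tests`, the `test_<index>` naming and the message format are hypothetical, and the sketch assumes each bug is a standalone Python program that reads stdin and writes stdout.

```python
import subprocess
import sys
from typing import Dict, List, Tuple


def collect_failing_tests(
    program_path: str,
    tests: List[Tuple[str, str]],
    timeout: float = 5.0,
) -> Dict[str, str]:
    """Map hypothetical test names to an error message for each failing test.

    Each test is a (stdin_text, expected_stdout) pair. Passing tests are
    left out of the returned dictionary.
    """
    failing: Dict[str, str] = {}
    for index, (stdin_text, expected) in enumerate(tests):
        test_name = f"test_{index}"  # hypothetical naming scheme
        try:
            result = subprocess.run(
                [sys.executable, program_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failing[test_name] = "Timeout"
            continue
        if result.returncode != 0:
            # Runtime exception: keep the last stderr line as the cause.
            lines = result.stderr.strip().splitlines()
            failing[test_name] = lines[-1] if lines else f"exit code {result.returncode}"
        elif result.stdout.strip() != expected.strip():
            # Wrong output: describe the mismatch as the cause.
            failing[test_name] = (
                f"expected {expected.strip()!r}, got {result.stdout.strip()!r}"
            )
    return failing
```

Bugs that crash and bugs that print the wrong result both end up with a cause string here, so neither kind has to be skipped.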
I see the solution you came up with, and that seems reasonable. The only problem will be in extracting the test case (see https://github.com/ASSERT-KTH/elle-elle-aime/blob/master/elleelleaime/core/utils/java/java.py#L269). This means that we need to add a special case for RunBugRun here.
The ground_truth diff is used in two places right now:
Executing tests to check is great, but there is a known problem in program repair called patch overfitting: patches that pass the tests but differ from what the developer intended (see e.g., Is the cure worse than the disease? Overfitting in automated program repair). For this reason, we use the ground-truth patch as a reference in some evaluation metrics like exact-match or ast-match.
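To make the two reference-based metrics concrete, here is a minimal sketch for Python code using the standard `ast` module. The function names are hypothetical and this is not how the framework implements its metrics (the existing utilities target Java); it only shows the idea of comparing a candidate patch against the ground truth textually and structurally.

```python
import ast


def exact_match(candidate: str, ground_truth: str) -> bool:
    """Textual comparison, ignoring only leading/trailing whitespace."""
    return candidate.strip() == ground_truth.strip()


def ast_match(candidate: str, ground_truth: str) -> bool:
    """Structural comparison: formatting and comments are ignored."""
    try:
        # ast.dump omits line/column attributes by default, so two snippets
        # that parse to the same tree compare equal.
        return ast.dump(ast.parse(candidate)) == ast.dump(ast.parse(ground_truth))
    except SyntaxError:
        # A patch that does not parse cannot match the reference.
        return False
```

A patch can fail both checks and still be correct, which is why running the tests remains necessary when neither metric matches.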
It's important to have that logic (every checkout copies the files from an untouched source) due to the parallelism. We want to be able to evaluate hundreds/thousands of patches at the same time, and this requires them to be in different locations.
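The copy-per-checkout idea can be sketched as follows. The `checkout` signature and the directory layout here are assumptions made for illustration, not the framework's actual API.

```python
import shutil
import tempfile
from pathlib import Path


def checkout(source_dir: Path, work_root: Path) -> Path:
    """Copy an untouched bug directory into a fresh, unique working directory.

    Every call returns a different location, so patches evaluated in
    parallel never write to the same files or to the original source.
    """
    work_root.mkdir(parents=True, exist_ok=True)
    # mkdtemp guarantees a unique directory name even under concurrency.
    target = Path(tempfile.mkdtemp(prefix="checkout_", dir=work_root))
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    return target
```

Reading directly from the original buggy file works for a single run, but as soon as a patch is written to disk for testing, two workers sharing that file would overwrite each other.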
Could you rebase your PR? I changed the CI config to enable it on PRs. That way we can check if the tests are green. Thanks :) |
So, test cases right now are simple asserts about the returned value. febe8e4#diff-3f4ea3e207b6866ea3514390ef0148073207b05d1a8ca4da933d8f926e1be2d5 Got all the other points, will rebase PR. |
Looks like a good solution, thanks! Let me know if you have any problem with the CI |
PR updated. One tricky issue is that, during initialization, getting the cause of each failing test requires executing the test cases, which takes a very long time. For now I'm only checking if a bug has an associated runtime exception (which is stored as part of the dataset), without executing. This skips about 3/4ths of all bugs, namely those that do not throw an exception but output the wrong result. Here is what samples generated for run bug run look like:
|
One solution is to execute once, store the results, and then always load from them. WDYT? You can store them in a fork of RunBugRun |
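The execute-once, load-afterwards suggestion could look roughly like this. It is a sketch under assumptions: the function name, the cache location and the JSON layout (bug id mapped to failing test names and messages) are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Hypothetical layout: bug id -> {test name -> error message}
Results = Dict[str, Dict[str, str]]


def load_or_compute_results(cache_path: Path, compute: Callable[[], Results]) -> Results:
    """Return cached test results, running the expensive step only once."""
    if cache_path.exists():
        with cache_path.open() as f:
            return json.load(f)
    results = compute()  # the slow part: executing every test case
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w") as f:
        json.dump(results, f, indent=2, sort_keys=True)
    return results
```

Committing the resulting JSON file (or an archive of it) means later runs never pay the execution cost at all.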
They look great! Thanks for the work in integrating RunBugRun :)) |
@andre15silva I've cleaned up the test format in the prompt, it looks much better this way. Keep in mind run bug run bugs can contain as many as a hundred test cases! Also added caching logic for test results when the dataset is first loaded; it takes an hour without a timeout. I am going to go through the rest of the pipeline, e.g. generate_patches and evaluate_patches. For the CI, how do I make sure my changes pass? |
Thanks! Could you add a test for this prompting method? Something similar to
That's great, the more the merrier! If it becomes too much, we can always add an option to include only N failing tests later.
I see. I think this might not be the best approach since it requires executing every time we start a new instance of the script. Would you be able to run this once, save all the results in a JSON file, and then load them from that file? That way we do not need to be executing all the time. WDYT?
Sounds great, let me know if you have any doubt! I have now enabled the CI in your PR, so it should run every time you push a new commit. Current problems:
|
@andre15silva I've added the TestInstruct class for run bug run and checked in an archive with pre-computed test results (uncompressed during setup). Also removed comments from setup and removed the rbugr submodule, which I originally added but ended up not using. Trying to finish patch evaluation before I go on vacation. |
I quickly went through the rest of the pipeline: generate, evaluate, export -- all seemed to output plausible results. Couple of things I noticed:
|
Thanks for reporting the error, I updated the README!
The evaluation strategy always checks-out the buggy version, which is the default option for checkout. We could explicitly set As for the buggy and fixed programs having different paths, I suggest we only look at the buggy path. See what I have done for HumanEval-Java (https://github.com/ASSERT-KTH/repairbench-framework/blob/master/elleelleaime/core/benchmarks/humanevaljava/humanevaljavabug.py), where the setup is similar. repairbench-framework/elleelleaime/core/benchmarks/humanevaljava/humanevaljava.py Lines 54 to 71 in 30cf46f
Yes and no. There is a textual comparison between the However, if this returns false, we still need to evaluate the generated code by running the tests. For the CI, you should run I also added a comment to fix a typo in your PR. Thanks :)) |
@andre15silva Addressed all of the above. PR ready for review. Thanks |
RunBugRun -- An Executable Dataset for Automated Program Repair
https://github.com/giganticode/run_bug_run