This folder contains the scripts to run the tracer on C programs from Project CodeNet. This process produces the "raw" trace outputs, which can be postprocessed into model-input format and used to pre-train the TRACED model.
⚠️ WIP: We are making this tool available to foster further research. Please be aware that the currently-released version of the tracer tool may not reproduce our work, as we have not fully verified it end-to-end. Thank you!
Uses GDB (the GNU Debugger). Tested with GNU gdb (GDB) 8.2.
To run tracing on all problems, all solutions, and all inputs (expect it to take a while!) and generate a dataset similar to the one we used for pretraining, run this script:
bash full_run.sh
This guide walks through each step of the full run.
This script downloads and extracts the data needed for trace collection:
- The C programs and metadata from Project CodeNet
- The inputs and expected outputs we extracted from Project CodeNet
bash 01_preprocess/download_all_data.sh
# directories "Project_CodeNet/" and "all_input_output/" should have been created
Install the extra utility packages needed for tracing.
# Assuming you already have the conda environment named "traced"; only the extra packages below are needed
conda activate traced && pip install -r requirements.txt
sudo apt install -y gdb libxml2-utils
Run this command to compile and run the solution programs and generate traces.
It writes the XML program trace to trace/p*/C/s*/input_*.txt_log.xml
and the program's stdout to trace/p*/C/s*/input_*.txt_stdout.txt.
Expected console outputs are included below. Keep in mind that they were generated by running on a small sample of the data: the structure of your output should match, but execution times and counts will differ.
# compile all solutions for all problems
$ python 01_preprocess/compile_all.py --begin_problem 0 --end_problem 4052
problems: 100%|██████████████████████████████████████████████████████████| 4052/4052 [09:15<00:00, 555.64s/it]
rows: 100%|██████████████████████████████████████████████████████████████| 10080/10080 [09:15<00:00, 18.15it/s]
INFO:root:p00000 outcome
compile_error 6054
success 4026
...
Name: count, dtype: int64
# trace one solution for one problem
$ mkdir -p results
$ python 02_trace/analyze.py compile_output/ p00000 C s000552118 all_input_output/p00000/input_0.txt results/ --verbose 1
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:37 DEBUG args=Namespace(exe_dir='compile_output/', problem_id='p00000', language='C', submission_id='s000552118', input_file='all_input_output/p00000/input_0.txt', cwd_dir='results/', verbose=1, timeout=10) trace_py=/home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/02_trace/trace_asm.py
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:37 INFO begin
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:37 DEBUG subprocess args=gdb /home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/compile_output/p00000/C/s000552118 -batch -nh -ex "set logging file /dev/null" -ex "set logging redirect on" -ex "set logging on" -ex "set print elements unlimited" -ex "set print repeats unlimited" -ex "set max-value-size unlimited" -ex "source /home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/02_trace/trace_asm.py" -ex "start < /home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/all_input_output/p00000/input_0.txt > trace/p00000/C/s000552118/input_0.txt_stdout.txt" -ex "trace-asm trace/p00000/C/s000552118/input_0.txt_log.xml"
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:41 INFO end
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:41 INFO elapsed seconds: 3.617031
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:41 INFO exit code: 0
# trace all solutions for all problems
bash 02_trace/trace_problem.sh compile_output p00000 C all_input_output results
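Under the hood, the tracer drives GDB in batch mode, as shown in the DEBUG log above. A simplified sketch of how such a command line could be assembled (the helper name is hypothetical; the real logic lives in 02_trace/analyze.py):

```python
def build_gdb_cmd(exe, input_file, stdout_path, log_path, trace_script):
    # Assemble a GDB batch-mode command line: start the program with
    # redirected stdin/stdout, then run the trace-asm command that
    # trace_asm.py defines. Mirrors the logged invocation above.
    return [
        "gdb", exe, "-batch", "-nh",
        "-ex", "set logging file /dev/null",
        "-ex", "set logging redirect on",
        "-ex", "set logging on",
        "-ex", "set print elements unlimited",
        "-ex", f"source {trace_script}",
        "-ex", f"start < {input_file} > {stdout_path}",
        "-ex", f"trace-asm {log_path}",
    ]
```

The list form is suitable for `subprocess.run(cmd, timeout=...)`, which is also how a per-solution timeout (the `timeout=10` in the logged args) can be enforced.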
We use a script to read the trace in .xml format, combine it with the source code and inputs, and output pretraining data in .jsonl format.
# transform one XML to tree format
$ mkdir -p results/tree
$ python 03_postprocess/transform_xml.py results/p00000/C/s000552118/input_0.txt_log.xml --schema tree --output results/tree/p00000/C/s000552118/input_0.txt_log.xml
# convert all XMLs to JSONL format
mkdir -p results/sequences
$ python 03_postprocess/sequenceize_logs_from_metadata.py --lang C --base_dirs results/tree --src_dirs Project_CodeNet/data --input_dir all_input_output --metadata_dir Project_CodeNet/metadata --begin_problem 0 --end_problem 4052 --output results/sequences
args=Namespace(lang='C', base_dirs=['results/tree'], src_dirs=['../Project_CodeNet/data'], input_dir='all_input_output', metadata_dir='../Project_CodeNet/metadata', begin_problem=0, end_problem=0, limit_solutions=1, limit_sequences=None, nproc=1, output='results/tree_sequences')
p00000.csv, total=16099, filtered=4849, excluded=11250
100%|█████████████████████████████████████████████████████████████████████| 4849/4849 [00:01<00:00, 3733.83it/s, missing_log=4848, success=1]
# transform all XMLs to tree format and sequence-ize
bash 03_postprocess/postprocess_all_problems.sh results
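The resulting sequence files are standard JSON Lines: one JSON object per line. A minimal reader sketch (not part of the repository's scripts):

```python
import json

def read_jsonl(path):
    # Yield one decoded record per non-empty line of a .jsonl file.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```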
To preprocess the trace data for pretraining, we split the traced code into separate lines. These keys are added to the resulting JSONL file:
- `src_lines`: The source code in `src`, split into an array of lines.
- `src_linenos`: The line numbers of each element in `src_lines`.
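The splitting itself is straightforward; a sketch of how `src` might be turned into these two keys (illustrative, not the repository's exact code):

```python
def split_source(src):
    # Split a source string into src_lines and matching 1-based src_linenos.
    src_lines = src.split("\n")
    src_linenos = list(range(1, len(src_lines) + 1))
    return src_lines, src_linenos
```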
For the prediction ground truth, we extract masks over `src_linenos` and add them under the key `covered_in_trace`, in two versions:
- Line coverage prediction:
  - `true`: The line was covered in the trace.
  - `false`: The line was NOT covered in the trace.
- Branch coverage prediction:
  - `null`: The line does not contain a control-flow branch. Otherwise:
  - `true`: The branch was covered in the trace.
  - `false`: The branch was NOT covered in the trace.
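The two mask variants can be sketched as follows, assuming a set of executed line numbers and, for the branch variant, the set of lines that contain a branch (function names are hypothetical):

```python
def line_coverage_mask(src_linenos, executed):
    # true/false per source line: was the line covered in the trace?
    return [n in executed for n in src_linenos]

def branch_coverage_mask(src_linenos, executed, branch_lines):
    # null (None) for lines without a control-flow branch;
    # otherwise true/false for whether the branch was covered.
    return [(n in executed) if n in branch_lines else None
            for n in src_linenos]
```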
We used the `tree-climber` package (formerly named `treehouse`) to extract CFGs: https://github.com/bstee615/tree-climber. We vendored it into this repository for ease of use.
# branch-prediction
$ python -m 04_coverage_prediction.conversion --input_file results/sequences/sequences_*_full.jsonl --output_file results/sequences/sequences_BRANCH.jsonl --mode branch --lang c
convert sequences: 90it [00:00, 787.55it/s]
success: 84
total: 90
# line-prediction
$ python -m 04_coverage_prediction.conversion --input_file results/sequences/sequences_*_full.jsonl --output_file results/sequences/sequences_LINE.jsonl --mode separate_lines --lang c
convert sequences: 90it [00:00, 787.55it/s]
success: 84
total: 90