This folder contains the scripts to run the tracer on C programs from Project CodeNet. This process produces the "raw" trace outputs, which can be postprocessed into model-input format and used to pre-train the TRACED model.
⚠️ WIP: We are making this tool available to foster further research. Please be aware that the currently-released version of the tracer tool may not reproduce our work, as we have not fully verified it end-to-end. Thank you!
Uses GDB (the GNU Debugger). Tested with GNU gdb (GDB) 8.2.
To run tracing on all problems, all solutions, and all inputs (expect it to take a while!) and generate a dataset similar to the one we used for pretraining, run this script:
bash full_run.sh
This guide walks through each step of the full run.
This script downloads and extracts the data needed for trace collection:
- The C programs and metadata from Project CodeNet
- The inputs and expected outputs we extracted from Project CodeNet
bash 01_preprocess/download_all_data.sh
# directories "Project_CodeNet/" and "all_input_output/" should have been created
Install the extra utility packages needed for tracing.
# Assuming you already have the conda environment named "traced"; only the extra packages below are needed
conda activate traced && pip install -r requirements.txt
sudo apt install -y gdb libxml2-utils
Run this command to compile and run the solution programs and generate traces.
It writes the XML program trace to trace/p*/C/s*/input_*.txt_log.xml
and the program's stdout to trace/p*/C/s*/input_*.txt_stdout.txt.
Expected console outputs are included below. Keep in mind that they were generated by running on a small sample of the data: the structure of your output should match, but execution times and counts will differ.
# compile all solutions for all problems
$ python 01_preprocess/compile_all.py --begin_problem 0 --end_problem 4052
problems: 100%|██████████████████████████████████████████████████████████| 4052/4052 [09:15<00:00, 555.64s/it]
rows: 100%|██████████████████████████████████████████████████████████████| 10080/10080 [09:15<00:00, 18.15it/s]
INFO:root:p00000 outcome
compile_error 6054
success 4026
...
Name: count, dtype: int64
# trace one solution for one problem
$ mkdir -p results
$ python 02_trace/analyze.py compile_output/ p00000 C s000552118 all_input_output/p00000/input_0.txt results/ --verbose 1
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:37 DEBUG args=Namespace(exe_dir='compile_output/', problem_id='p00000', language='C', submission_id='s000552118', input_file='all_input_output/p00000/input_0.txt', cwd_dir='results/', verbose=1, timeout=10) trace_py=/home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/02_trace/trace_asm.py
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:37 INFO begin
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:37 DEBUG subprocess args=gdb /home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/compile_output/p00000/C/s000552118 -batch -nh -ex "set logging file /dev/null" -ex "set logging redirect on" -ex "set logging on" -ex "set print elements unlimited" -ex "set print repeats unlimited" -ex "set max-value-size unlimited" -ex "source /home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/02_trace/trace_asm.py" -ex "start < /home/benjis/Code/trace-modeling_icse2024_recovery/trace_collection_c_cpp/all_input_output/p00000/input_0.txt > trace/p00000/C/s000552118/input_0.txt_stdout.txt" -ex "trace-asm trace/p00000/C/s000552118/input_0.txt_log.xml"
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:41 INFO end
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:41 INFO elapsed seconds: 3.617031
p00000/C/s000552118 input_0.txt 2024-02-07T09:16:41 INFO exit code: 0
# trace all solutions for all problems
bash 02_trace/trace_problem.sh compile_output p00000 C all_input_output results
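Under the hood, the tracer drives GDB in batch mode, as shown in the DEBUG log above. A simplified sketch of how such a command line could be assembled (the helper name is hypothetical; the real logic lives in 02_trace/analyze.py):

```python
def build_gdb_cmd(exe, input_file, stdout_path, log_path, trace_script):
    # Assemble a GDB batch-mode command line: start the program with
    # redirected stdin/stdout, then run the trace-asm command that
    # trace_asm.py defines. Mirrors the logged invocation above.
    return [
        "gdb", exe, "-batch", "-nh",
        "-ex", "set logging file /dev/null",
        "-ex", "set logging redirect on",
        "-ex", "set logging on",
        "-ex", "set print elements unlimited",
        "-ex", f"source {trace_script}",
        "-ex", f"start < {input_file} > {stdout_path}",
        "-ex", f"trace-asm {log_path}",
    ]
```

The list form is suitable for `subprocess.run(cmd, timeout=...)`, which is also how a per-solution timeout (the `timeout=10` in the logged args) can be enforced.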
We use a script to read the trace in .xml format, combine it with the source code and inputs, and output pretraining data in .jsonl format.
# transform one XML to tree format
$ mkdir -p results/tree
$ python 03_postprocess/transform_xml.py results/p00000/C/s000552118/input_0.txt_log.xml --schema tree --output results/tree/p00000/C/s000552118/input_0.txt_log.xml
# convert all XMLs to JSONL format
mkdir -p results/sequences
$ python 03_postprocess/sequenceize_logs_from_metadata.py --lang C --base_dirs results/tree --src_dirs Project_CodeNet/data --input_dir all_input_output --metadata_dir Project_CodeNet/metadata --begin_problem 0 --end_problem 4052 --output results/sequences
args=Namespace(lang='C', base_dirs=['results/tree'], src_dirs=['../Project_CodeNet/data'], input_dir='all_input_output', metadata_dir='../Project_CodeNet/metadata', begin_problem=0, end_problem=0, limit_solutions=1, limit_sequences=None, nproc=1, output='results/tree_sequences')
p00000.csv, total=16099, filtered=4849, excluded=11250
100%|█████████████████████████████████████████████████████████████████████| 4849/4849 [00:01<00:00, 3733.83it/s, missing_log=4848, success=1]
# transform all XMLs to tree format and sequence-ize
bash 03_postprocess/postprocess_all_problems.sh results
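The resulting sequence files are standard JSON Lines: one JSON object per line. A minimal reader sketch (not part of the repository's scripts):

```python
import json

def read_jsonl(path):
    # Yield one decoded record per non-empty line of a .jsonl file.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```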
To preprocess the trace data for pretraining, we split the traced code into separate lines. These keys are added to the resulting JSONL file:
- `src_lines`: The source code in `src`, split into an array of lines.
- `src_linenos`: The line numbers of each element in `src_lines`.
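The splitting itself is straightforward; a sketch of how `src` might be turned into these two keys (illustrative, not the repository's exact code):

```python
def split_source(src):
    # Split a source string into src_lines and matching 1-based src_linenos.
    src_lines = src.split("\n")
    src_linenos = list(range(1, len(src_lines) + 1))
    return src_lines, src_linenos
```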
For the prediction ground truth, we extract masks over `src_linenos` and add them under the key `covered_in_trace`, in two versions:
- Line coverage prediction:
  - `true`: The line was covered in the trace.
  - `false`: The line was NOT covered in the trace.
- Branch coverage prediction:
  - `null`: The line does not contain a control-flow branch. Otherwise:
  - `true`: The branch was covered in the trace.
  - `false`: The branch was NOT covered in the trace.
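The two mask variants can be sketched as follows, assuming a set of executed line numbers and, for the branch variant, the set of lines that contain a branch (function names are hypothetical):

```python
def line_coverage_mask(src_linenos, executed):
    # true/false per source line: was the line covered in the trace?
    return [n in executed for n in src_linenos]

def branch_coverage_mask(src_linenos, executed, branch_lines):
    # null (None) for lines without a control-flow branch;
    # otherwise true/false for whether the branch was covered.
    return [(n in executed) if n in branch_lines else None
            for n in src_linenos]
```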
We used the `tree-climber` package (formerly named `treehouse`) to extract CFGs: https://github.com/bstee615/tree-climber. We vendored it into this repository for ease of use.
# branch-prediction
$ python -m 04_coverage_prediction.conversion --input_file results/sequences/sequences_*_full.jsonl --output_file results/sequences/sequences_BRANCH.jsonl --mode branch --lang c
convert sequences: 90it [00:00, 787.55it/s]
success: 84
total: 90
# line-prediction
$ python -m 04_coverage_prediction.conversion --input_file results/sequences/sequences_*_full.jsonl --output_file results/sequences/sequences_LINE.jsonl --mode separate_lines --lang c
convert sequences: 90it [00:00, 787.55it/s]
success: 84
total: 90