VulDetectArtifact

Artifact for TOSEM paper: Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors.

1.Datasets

For SARD dataset we have uploaded to zenodo, for Fan dataset, the related information is at MSR_20_Code_vulnerability_CSV_Dataset, the dataset csv can be downloaded from google driver. We extract func_before and func_after from it.

2.Preprocess Pipeline

For preprocess code into graph, please refer to preprocess/ReadMe.md

3.Pretrain embedding model

Run python pretrain.py detector_name path2train_datas embedding_model_path

detector_name: The name of detectors, choice is reveal, devign, ivdetect, deepwukong, we will soon add remaining 3 sequence-based detectors into this pipeline.
path2train_datas: The dir which stores train_vul.json, train_normal.json, eval_vul.json, eval_normal.json, test_vul.json, test_normal.json, the script will read training data from train jsons.
embedding_model_path: The path to the saved embedding model.

4.Detection Pipeline

Run python detection.py <args> to train detectors. <args> includes:

--detector <detector_name>, <detector_name> could be one of ["deepwukong", "reveal", "ivdetect", "devign", "tokenlstm", "vuldeepecker", "sysevr"]
--w2v_model_path <model_path>, <model_path> could be relative or absolute path of pretrained word2vec model.
--dataset_dir <dataset_dir>, <dataset_dir> is path to the dir storing json datas. It should include train_vul.json, train_normal.json, eval_vul.json, eval_normal.json, test_vul.json, test_normal.json.
--model_dir <model_dir>, <model_dir> is where the model pth file placed, it's corresponding directory. The scripts will automatically load the best model in the dir.
--train, means will train model. If there exist a model in <model_dir>, the script will first load that model and then train.
--test, means will test the model. There must be a model in <model_dir> first.

5.Explanation Pipeline

Run python explain.py <args>. <args> includes:

--detector <detector_name>, <detector_name> could be one of ["deepwukong", "reveal", "ivdetect", "devign", "tokenlstm", "vuldeepecker", "sysevr"]
--w2v_model_path <model_path>, <model_path> could be relative or absolute path of pretrained word2vec model.
--dataset_dir <dataset_dir>, <dataset_dir> is path to the dir storing json datas. It should include test_vul.json.
--model_dir <model_dir>, <model_dir> is where the model pth file placed, it's corresponding directory. The scripts will automatically load the best model in the dir.
--explainer <explainer_name>, <explainer_name> could be one of ["gnnexplainer", "pgexplainer", "gnnlrp", "gradcam", "deeplift"] for now. We are organizing the code in sequence-based explainers into this pipeline.

6.Citation

@misc{cheng2024fidelity,
      title={Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors}, 
      author={Baijun Cheng and Shengming Zhao and Kailong Wang and Meizhen Wang and Guangdong Bai and Ruitao Feng and Yao Guo and Lei Ma and Haoyu Wang},
      year={2024},
      eprint={2401.02686},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.idea		.idea
graph		graph
preprocess		preprocess
sequence		sequence
testcases/test1		testcases/test1
utils		utils
.gitattributes		.gitattributes
README.md		README.md
detection.py		detection.py
explain.py		explain.py
pretrain.py		pretrain.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VulDetectArtifact

1.Datasets

2.Preprocess Pipeline

3.Pretrain embedding model

4.Detection Pipeline

5.Explanation Pipeline

6.Citation

About

Releases

Packages

Languages

for-just-we/VulDetectArtifact

Folders and files

Latest commit

History

Repository files navigation

VulDetectArtifact

1.Datasets

2.Preprocess Pipeline

3.Pretrain embedding model

4.Detection Pipeline

5.Explanation Pipeline

6.Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages