
llm4decompile-ref dataset #36

Open
kleinercubs opened this issue Dec 2, 2024 · 3 comments
@kleinercubs

Hi,

I am working with the llm4decompile-ref family of models (pseudo code -> source code) and have two questions about the dataset used for training.

  1. Are these models trained solely on the LLM4Binary/decompile-ghidra-100k dataset?
  2. On examining this dataset, there appears to be a significant amount of duplicated data. Could you confirm whether this is expected, or whether it might be an error in data handling?

Any clarification on this would be greatly appreciated. Thanks!
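
For reference, a quick way to eyeball the duplication (a minimal sketch; the split name and column name here are assumptions — inspect `ds.column_names` first):

```python
# Minimal sketch: count how many rows share an identical source entry.
# The split name ("train") and column name ("source") are assumptions;
# check ds.column_names before running.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("LLM4Binary/decompile-ghidra-100k", split="train")
counts = Counter(ds["source"])
dup_rows = sum(c for c in counts.values() if c > 1)
print(f"{dup_rows} of {len(ds)} rows share a source entry with another row")
```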

@albertan017
Owner

albertan017 commented Dec 3, 2024

The LLM4Binary/decompile-ghidra-100k dataset is a sample of the data used for the v2 series models. For training the v2 series, we use a larger dataset of about 1 billion tokens (approximately 1.6 million samples) and train for 2 epochs.

Regarding the duplicated data: it comes from the different optimization levels (O0 to O3) applied during compilation. Each optimization level can produce slightly different pseudo code for the same source function, so the same source code appears several times in the dataset alongside different pseudo code.
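
To make this concrete, here is a minimal sketch of how one source function fans out into several samples (compiler invocation only; the Ghidra decompilation step is elided and the file names are illustrative):

```python
# Sketch: one C function compiled at four optimization levels produces
# four training samples whose source column is identical, which is why
# the dataset looks duplicated.
import os
import subprocess
import tempfile

C_SRC = "int add(int a, int b) { return a + b; }\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "func.c")
    with open(src, "w") as f:
        f.write(C_SRC)
    for opt in ("O0", "O1", "O2", "O3"):
        obj = os.path.join(tmp, f"func_{opt}.o")
        subprocess.run(["gcc", f"-{opt}", "-c", src, "-o", obj], check=True)
        # Each object file is then decompiled (e.g. by Ghidra), giving one
        # (pseudo code, source) pair per level: four rows, one source.
        print(opt, os.path.getsize(obj), "bytes")
```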

@kleinercubs
Author

For the larger dataset, do you mean compiling AnghaBench first and then decompiling with Ghidra?

@albertan017
Owner

We're using ExeBench with the first 400K functions, which contains AnghaBench. Yes, we compile the benchmark and then decompile with Ghidra.
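
In rough outline, the decompilation step per binary looks like the sketch below (the Ghidra install path, project directory, and export post-script are assumptions here, not our actual pipeline scripts):

```python
# Minimal sketch of the decompile step for one compiled binary.
# GHIDRA path, project directory, and the post-script name are
# assumptions, not the repository's actual scripts.
import subprocess

GHIDRA = "/opt/ghidra/support/analyzeHeadless"  # assumed install location
BINARY = "func_O0.o"  # an object file compiled from an ExeBench function

# Ghidra's headless analyzer imports the binary, analyzes it, and runs a
# post-script that would dump the decompiled pseudo code to disk.
subprocess.run([
    GHIDRA, "/tmp/ghidra_proj", "proj",
    "-import", BINARY,
    "-postScript", "export_pseudo.py",  # hypothetical export script
    "-deleteProject",
], check=True)
```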
