
llm4decompile-ref dataset #36

Open
kleinercubs opened this issue Dec 2, 2024 · 3 comments
@kleinercubs

Hi,

I am working with the llm4decompile-ref family of models (pseudo code -> source code) and have two questions about the dataset used for training.

  1. Are these models trained solely on the LLM4Binary/decompile-ghidra-100k dataset?
  2. On examining this dataset, there appears to be a significant amount of duplicated data. Could you confirm whether this is expected, or whether it might be an error in data handling?

Any clarification on this would be greatly appreciated. Thanks!
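
For reference, a quick way to eyeball the duplication (a minimal sketch; the split name and column name here are assumptions — inspect `ds.column_names` first):

```python
# Minimal sketch: count how many rows share an identical source entry.
# The split name ("train") and column name ("source") are assumptions;
# check ds.column_names before running.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("LLM4Binary/decompile-ghidra-100k", split="train")
counts = Counter(ds["source"])
dup_rows = sum(c for c in counts.values() if c > 1)
print(f"{dup_rows} of {len(ds)} rows share a source entry with another row")
```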

@albertan017
Owner

albertan017 commented Dec 3, 2024

The LLM4Binary/decompile-ghidra-100k dataset is a sample of the data used for the v2 series models. For training the v2 series, we use a larger dataset of about 1 billion tokens (approximately 1.6 million samples) and train for 2 epochs.

Regarding the duplicated data: it comes from the different optimization levels (O0 to O3) applied during compilation. Each optimization level can produce slightly different pseudo code for the same source function, so the same source code appears several times in the dataset alongside different pseudo code.
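
To make this concrete, here is a minimal sketch of how one source function fans out into several samples (compiler invocation only; the Ghidra decompilation step is elided and the file names are illustrative):

```python
# Sketch: one C function compiled at four optimization levels produces
# four training samples whose source column is identical, which is why
# the dataset looks duplicated.
import os
import subprocess
import tempfile

C_SRC = "int add(int a, int b) { return a + b; }\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "func.c")
    with open(src, "w") as f:
        f.write(C_SRC)
    for opt in ("O0", "O1", "O2", "O3"):
        obj = os.path.join(tmp, f"func_{opt}.o")
        subprocess.run(["gcc", f"-{opt}", "-c", src, "-o", obj], check=True)
        # Each object file is then decompiled (e.g. by Ghidra), giving one
        # (pseudo code, source) pair per level: four rows, one source.
        print(opt, os.path.getsize(obj), "bytes")
```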

@kleinercubs
Author

For the larger dataset, do you mean compiling AnghaBench first and then decompiling with Ghidra?

@albertan017
Owner

We're using ExeBench with the first 400K functions, which contains AnghaBench. Yes, we compile the benchmark and then decompile with Ghidra.
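
In rough outline, the decompilation step per binary looks like the sketch below (the Ghidra install path, project directory, and export post-script are assumptions here, not our actual pipeline scripts):

```python
# Minimal sketch of the decompile step for one compiled binary.
# GHIDRA path, project directory, and the post-script name are
# assumptions, not the repository's actual scripts.
import subprocess

GHIDRA = "/opt/ghidra/support/analyzeHeadless"  # assumed install location
BINARY = "func_O0.o"  # an object file compiled from an ExeBench function

# Ghidra's headless analyzer imports the binary, analyzes it, and runs a
# post-script that would dump the decompiled pseudo code to disk.
subprocess.run([
    GHIDRA, "/tmp/ghidra_proj", "proj",
    "-import", BINARY,
    "-postScript", "export_pseudo.py",  # hypothetical export script
    "-deleteProject",
], check=True)
```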
