Hi,

I am working with the llm4decompile-ref family of models (pseudo -> source code) and have two questions about the dataset used for training.
Are these models trained solely using the LLM4Binary/decompile-ghidra-100k dataset?
Upon examining this dataset, it appears to contain a significant amount of duplicated data. Could you confirm whether this is expected, or whether it might indicate an error in the data handling? A rough duplicate check is sketched below.
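For reference, here is a minimal sketch of one way to look for exact duplicates with the `datasets` library. The split name `"train"` and the assumption that all columns are string-valued are guesses about the dataset's schema, not something confirmed by its card:

```python
from collections import Counter
from datasets import load_dataset

# Dataset ID is from the question above; "train" split is an assumption.
ds = load_dataset("LLM4Binary/decompile-ghidra-100k", split="train")

# Count exact duplicates per column (assumes string-valued columns).
for col in ds.column_names:
    counts = Counter(ds[col])
    dup_rows = sum(c for c in counts.values() if c > 1)
    print(f"{col}: {len(counts)} unique values, {dup_rows} rows share a value")
```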
Any clarification on this would be greatly appreciated. Thanks!
The LLM4Binary/decompile-ghidra-100k dataset is a sample dataset used for the v2 series models. For training the v2 series, we use a larger dataset consisting of 1 billion tokens (approximately 1.6 million samples) and train for 2 epochs.
Regarding the duplicated data: it is caused by the different optimization levels (O0 to O3) applied during compilation. The same source function is compiled at each level, and each level yields a slightly different pseudo-code representation, so the source-code side of those samples repeats, which appears as duplication in the dataset.
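To illustrate this, a small sketch that groups rows by their source-code target and counts how many distinct pseudo-code variants map to each one; the column names `"output"` (source code) and `"input"` (pseudo code) are hypothetical and should be adjusted to the dataset's real schema:

```python
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("LLM4Binary/decompile-ghidra-100k", split="train")

# Hypothetical column names: "output" = target source code, "input" = Ghidra pseudo code.
variants = defaultdict(set)
for row in ds:
    variants[row["output"]].add(row["input"])

multi = [len(v) for v in variants.values() if len(v) > 1]
print(f"{len(multi)} source functions appear with more than one pseudo-code variant")
print(f"at most {max(multi, default=0)} variants per function (one per optimization level)")
```

If the duplication comes from the O0-O3 pipeline, most repeated source functions should show up to four pseudo-code variants each rather than identical repeated pairs.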