I have about 10k vulnerabilities found across roughly 100 C++ projects. As a learning exercise, I would like to try training an LLM to highlight the vulnerabilities in a given file. Each vulnerability report contains:
- a title and a description
- a link to either a file or a particular line of the file (sometimes more than one)
I’m only at the thinking stage, but I wonder how I would build the dataset. Ideally I would pair each report with the file it concerns. But as far as I understand, the context window won’t let me fit a 300-ish-line file together with a 1k-character vulnerability report. Even if the context window weren’t an issue, there would still be the problem that multiple vulnerability reports can point at the same file.
So maybe pairing one file with a list of vulnerability summaries and their line numbers would do the trick.
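To make that idea concrete, here is a minimal sketch of grouping reports by file so that each file becomes one training record. The field names (`path`, `lines`, `title`, `prompt`, `completion`) and the sample data are assumptions made up for illustration, not a required format.

```python
import json
from collections import defaultdict

def build_records(reports, sources):
    """Group reports by file and emit one prompt/completion record per file."""
    by_file = defaultdict(list)
    for report in reports:
        by_file[report["path"]].append(
            {"lines": report["lines"], "summary": report["title"]}
        )
    records = []
    for path, findings in sorted(by_file.items()):
        # Order findings by their first line so the target is deterministic.
        findings.sort(key=lambda f: f["lines"][0])
        records.append({
            "prompt": sources[path],              # the full file content
            "completion": json.dumps(findings),   # every finding for that file
        })
    return records

# Hypothetical sample data: two reports on one file, one on another.
reports = [
    {"path": "src/parser.cpp", "lines": [42], "title": "Unchecked memcpy length"},
    {"path": "src/parser.cpp", "lines": [10, 11], "title": "Use after free of node"},
    {"path": "src/net.cpp", "lines": [7], "title": "Integer overflow in size computation"},
]
sources = {
    "src/parser.cpp": "// parser source here\n",
    "src/net.cpp": "// net source here\n",
}
records = build_records(reports, sources)
```

Using only the title as the summary keeps the target short, which helps with the context budget; the full description could be swapped in if it fits.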
Just thinking out loud here. How would you do it? Am I missing something obvious?
Depending on the size of the model you’re fine-tuning, you’re going to want to limit the amount of context that doesn’t pertain to a code vulnerability. The major issue I see is that code vulnerabilities will probably involve multiple functions spread across different files.
So you could pass in just snippets of the different functions relating to the vulnerability report, but that isn’t very helpful for identifying vulnerabilities given a whole code file. For this format to work, you would have to pass in a specific function plus all the functions it references (and so on), and the model would then write a vulnerability report on that. You’d probably also want to include some examples that contain no vulnerability, or else be prepared for the tuned model to think every function you pass in contains one.
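A rough sketch of that format, assuming you already have a call graph and the function bodies extracted by some other tool (the `call_graph`, `bodies` and `reports` dictionaries below are hypothetical inputs, and the C++ bodies are placeholders). It gathers a function plus everything reachable from it, and labels functions without a report as clean:

```python
from collections import deque

NO_FINDING = "No vulnerability found."

def collect_context(entry, call_graph, bodies):
    """Return the source of `entry` and of every function reachable from it."""
    seen = {entry}
    order = [entry]
    queue = deque([entry])
    while queue:
        name = queue.popleft()
        for callee in call_graph.get(name, []):
            # Skip callees we have no source for (e.g. library functions).
            if callee in bodies and callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append(callee)
    return "\n\n".join(bodies[name] for name in order)

def make_example(entry, call_graph, bodies, reports):
    """Build one example; functions without a report become negatives."""
    return {
        "prompt": collect_context(entry, call_graph, bodies),
        "completion": reports.get(entry, NO_FINDING),
    }

# Hypothetical inputs for illustration only.
call_graph = {"parse": ["read_len", "copy_buf"], "copy_buf": ["read_len"], "log_msg": []}
bodies = {
    "parse": "void parse() { copy_buf(read_len()); }",
    "read_len": "int read_len() { return 0; }",
    "copy_buf": "void copy_buf(int n) { }",
    "log_msg": "void log_msg() { }",
}
reports = {"parse": "Unchecked length passed to copy_buf."}
examples = [make_example(n, call_graph, bodies, reports) for n in ("parse", "log_msg")]
```

In practice the transitive closure can get large, so you would likely cap the depth or the total size to stay within the context window.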
I strongly believe just referencing the line number will not build a strong enough attention link between the actual code and the vulnerability report.
My 2 cents
I would probably include the line number, the range within the line, the CWE ID, and the description from the CWE too, to help the model understand the CWE and link it to the code.
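Something like the sketch below could render such a target. Quoting the flagged text next to the line number is my own addition, on the assumption that it gives the model a stronger link to the code than a bare number. The `Finding` fields and the sample snippet are made up, and the description string is a paraphrase, not the official CWE text.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    line: int             # 1-based line number in the file
    col_start: int        # 0-based column, inclusive
    col_end: int          # 0-based column, exclusive
    cwe_id: str
    cwe_description: str  # in a real dataset, taken from the CWE entry

def render_finding(source, finding):
    """Render one finding as target text, quoting the flagged code span."""
    line_text = source.splitlines()[finding.line - 1]
    flagged = line_text[finding.col_start:finding.col_end]
    return (
        f"line {finding.line}, columns {finding.col_start}-{finding.col_end}: "
        f"`{flagged}`\n"
        f"{finding.cwe_id}: {finding.cwe_description}"
    )

# Hypothetical example: a write through a pointer after it was freed.
source = "void f(char *p) {\n  free(p);\n  p[0] = 'x';\n}\n"
finding = Finding(3, 2, 13, "CWE-416", "Memory is referenced after it has been freed.")
target = render_finding(source, finding)
```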