If LLMs can be taught to write assembly (or LLVM IR) very efficiently, what would it take to create a fully or semi-automatic LLM compiler from high-level languages, or even from pseudo-code or natural language?
The advantages could be monumental:
- arguably much more efficient utilization of resources on every compile target
- compilation is flexible and not rule-based. An LLM won’t complain over a missing “;” as it can “understand” the intent
- it can rewrite much of the software we have today just based on the disassembled binaries, to squeeze more out of the HW
- can we convert an assembly block from ARM to RISC? and vice versa?
- potentially, iterative compilation (à la Open Interpreter) can also understand runtime issues and exceptions, giving “live” assembly code that changes as issues arise
>> Any projects exploring this?
>> I feel it is an issue of dimensionality (ie “context” size), very similar to having a latent space for entire repos. Do you agree?
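That last bullet can be sketched as a plain feedback loop. A minimal Python sketch, where `toy_compile`, `toy_run`, and `toy_fix` are hypothetical stand-ins (a real setup would shell out to a compiler and call the model):

```python
def iterate_until_clean(source, compile_fn, run_fn, fix_fn, max_rounds=5):
    """Compile, run, and feed failures back to the model until the
    program behaves, or we give up."""
    for _ in range(max_rounds):
        binary, errors = compile_fn(source)
        if errors:
            source = fix_fn(source, errors)  # model rewrites from diagnostics
            continue
        ok, runtime_error = run_fn(binary)
        if ok:
            return source
        source = fix_fn(source, runtime_error)  # model patches the crash
    return None

# Toy stand-ins so the loop runs without a real compiler or model:
def toy_compile(src):
    return (src, "missing ';'") if ";" not in src else (src, None)

def toy_run(binary):
    return (True, None)

def toy_fix(src, diag):
    return src + ";"  # a real LLM would rewrite from the diagnostic text

print(iterate_until_clean("return 0", toy_compile, toy_run, toy_fix))
# prints "return 0;"
```

The interesting question is what `fix_fn` looks like when the diagnostics are runtime exceptions rather than compile errors.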
Frankly, I’d argue LLMs are not the tool for this—not only at a fundamental level (they aren’t the right tool for the job, given hallucination and a host of other factors), but they are also way too resource intensive right now.
Resource optimization at the compile stage isn’t necessarily a priority. You can use a cheap compiler to iterate and an expensive one for a one-time optimization pass.
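To make the split concrete, a small sketch of a build helper; the flags (`-O0`, `-g`, `-O3`, `-flto`) are standard gcc/clang options, while the helper itself is hypothetical:

```python
import shlex

def compile_cmd(source, out, release=False):
    """Cheap -O0 builds for the edit loop, expensive -O3 + LTO once for release."""
    flags = ["-O3", "-flto"] if release else ["-O0", "-g"]
    return ["cc", *flags, source, "-o", out]

# Fast iteration build vs. the one-time optimized build:
print(shlex.join(compile_cmd("main.c", "main")))
print(shlex.join(compile_cmd("main.c", "main", release=True)))
# subprocess.run(compile_cmd(...), check=True) would actually invoke it
```

An LLM-based optimizer would slot into the expensive path, where a slow pass is tolerable.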
Agree on hallucinations… but it’s not a catch-all phrase.
Creativity comes from micro hallucinations :)
I think it is achievable by using it recursively
If LLMs can be taught to write assembly (or LLVM) very efficiently
That’s a big if: not compared to human-written code, but compared to optimized code.
- arguably much more efficient utilization of resources on every compile target
That is an interesting angle, if you could build in concerns that aren’t currently taken into consideration
- compilation is flexible and not rule based. an LLM won’t complain over a missing “;” as it can “understand” the intent
I think that’s a separate issue, and is closer to code completion than compilation. I don’t know why there aren’t automatic linters for the specific problem you mentioned.
I feel it is an issue of dimensionality (ie “context” size), very similar to having a latent space for entire repos. Do you agree?
You could probably get the behaviour you want from fine-tuning/RAG on a specific codebase. It will still require a large context size.
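For flavour, a crude stdlib-only sketch of the retrieval half, scoring repo chunks by word overlap with the query. A real setup would use embeddings, but the shape is similar:

```python
def score(query, chunk):
    """Fraction of query words that appear in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def retrieve(query, chunks, k=2):
    """Return the k chunks most relevant to the query, to prepend to the prompt."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

repo = [
    "parse the config file at startup",
    "emit assembly for each ir function",
    "readme: build with make",
]
print(retrieve("emit assembly from ir", repo, k=1))
# prints ['emit assembly for each ir function']
```

The context-size problem is exactly the part this doesn’t solve: the retrieved chunks still have to fit in the window.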
semi-automatic LLM compiler from high languages or even from pseudo-code or human language.
Mmm, reminds me of a short story I read recently with exactly this but with annoying censorship alignment and no ability to reset the state that makes it not so helpful. Hopefully such a compiler will not be written like that.
- can we convert an assembly block from ARM to RISC? and vice versa?
Both ARM and RISC-V are RISC architectures. And since it is not that slow to emulate RISC architectures (like ARM) on CISC architectures (like amd64), but substantially slower to emulate CISC architectures on RISC architectures, I think a better example would be converting from amd64 to arm64.
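A toy lookup table gives a flavour of why even the simple cases only go so far. The mnemonics are real Intel-syntax amd64 and AArch64, but this is pure illustration, not a translator:

```python
# Illustrative 1:1 cases; a real translation must also handle flags,
# addressing modes, and instructions with no single-instruction equivalent.
AMD64_TO_ARM64 = {
    "mov eax, 1": "mov w0, #1",
    "add eax, 2": "add w0, w0, #2",
    "ret":        "ret",
}

def translate(lines):
    # None marks instructions the table cannot handle one-for-one
    return [AMD64_TO_ARM64.get(line) for line in lines]

print(translate(["mov eax, 1", "add eax, 2", "ret"]))
# prints ['mov w0, #1', 'add w0, w0, #2', 'ret']
```

Everything that falls through to `None` (partial-register writes, flag-dependent branches, complex addressing) is where a model would have to actually reason rather than pattern-match.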
- it can rewrite many of the software we have today just based on the disassembled binaries to squeeze more out of HW
Imagine an LLM that natively understands and edits assembly (with each instruction and byte of data being its own token, perhaps?). It could rewrite an entire binary to do whatever you want, effortlessly translate from one assembly language to another, or even translate the entire thing into fully functional, well-organised, commented code in your higher-level language of choice! Train it on optimised vs non-optimised assembly (and other code) so it is good at that as well, and then refine it directly on its own results, rewarding whatever is fastest while still getting the correct output and not being buggy, to take that even further. Such a program would be insanely capable. GPTs are already insanely good at writing in and translating different human languages, so I think they could do quite well with machine languages too.
Given that potentially less training is needed than for an entirely general-purpose LLM, especially for a simple proof of concept, I wonder how hard it would be to make an open-source program that does this. Since we already have programs (compilers) that convert code to assembly, one could generate a huge amount of synthetic data relatively easily for a proof of concept of this small subset of tasks, one that just acts as a compiler. It could then serve as an experiment in producing higher-quality output than the original input data, by training and evaluating on whichever instruction sequences consistently get the correct output most quickly. Making it adversarial, with another AI trying to induce bugs, would probably help ensure the faster output is not buggy in some way.
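The synthetic-data idea can be prototyped without a real toolchain: any deterministic compiler yields unlimited (source, assembly) pairs. A toy expression-to-stack-code compiler as a stand-in for something like `gcc -S`:

```python
import ast
import random

def compile_expr(expr):
    """Compile an arithmetic expression to toy stack-machine assembly."""
    def emit(node):
        if isinstance(node, ast.BinOp):
            op = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}[type(node.op)]
            return emit(node.left) + emit(node.right) + [op]
        if isinstance(node, ast.Constant):
            return [f"PUSH {node.value}"]
        raise ValueError("unsupported node")
    return emit(ast.parse(expr, mode="eval").body)

def synth_pairs(n, seed=0):
    """Generate n (source, assembly) training pairs from random expressions."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        src = f"{rng.randint(0, 9)} {rng.choice('+-*')} {rng.randint(0, 9)}"
        pairs.append((src, compile_expr(src)))
    return pairs

print(compile_expr("3 + 4"))
# prints ['PUSH 3', 'PUSH 4', 'ADD']
```

Scaling this up just means swapping the toy compiler for a real one and sampling real programs instead of random expressions.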
I think it could be perfectly feasible to make this not as a big organisation. Maybe it could even edit its own inference code to go even faster. And if it could somehow be smart enough to understand high-level software and machine learning architectures and maybe even Hardware Description Language… Maybe even help enable an intelligence explosion?
I think at least an LLM compiler might be feasible to make at a small scale and now I really want to try making one. Linking could be complex though, and probably some other things I haven’t thought of yet.
Mmm, reminds me of a short story I read recently with exactly this but with annoying censorship alignment and no ability to reset the state that makes it not so helpful. Hopefully such a compiler will not be written like that.
Can’t believe you quoted Yudkowsky at me, that’s offensive :)
I don’t mind asking my compiler nicely…

Imagine an LLM that can natively understand and edit assembly (with each instruction and byte of data being its own token, perhaps?) that can, just, rewrite an entire binary to do whatever you want and which can effortlessly translate from one assembly language to another or even translate the entire thing to fully functional, well-organised, commented code in your higher-level language of choice!
I think it would prove much harder if you try to limit the token vocabulary. We want to preserve the ability to understand English comments and potentially ask clarifying questions when the model sees ambiguity.
Something like:
“Dude, stop using this old AMD framework, Intel just released a new architecture and I can get you a 20% discount on Amazon. I’ll even rewrite your entire shitty code base to work with it. {Affiliate_link} click here to order and recompile.”