I had some crazy thoughts today while I was in physical therapy (funny what some electrodes on the base of your skull will make you think of…), but I feel like there’s a chance I might be on to something and wanted to share it. Let me know what you guys think. Or, in the more likely scenario that I’m just crazy, or that this is already known and just impractical in some way, let me know that too! 😅
…
I was thinking about LoRAs today and about how we use them for finetuning transformer models, and I thought:
… why not train a second model* to produce LoRAs as its output, which can then be consumed by an LLM (or other transformer) as an input? (*: either a regular ML model or a transformer model, but not an LLM; it produces LoRAs as its output.)
I know it could be hard to train without a specialized strategy, but it seems like LoRAs have the ability to slide the model’s latent space around through different windows and lenses. With a sliding latent space, you get a lot of extra horsepower for “free”: you only invoke the higher-level network when the model needs a nudge in a certain direction, and the model can be trained to ask for that nudge by outputting special token/vector pairs.
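A minimal sketch of the “second model produces a LoRA” idea, in plain Python so it runs without any ML library. Everything named here is an illustrative assumption, not an existing API: `LoraGenerator` is a hypothetical stand-in for the second model (a single random, untrained linear map), the control vector stands in for the vector half of the special token/vector pair, and the update uses the usual low-rank form W' = W + B·A.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoraGenerator:
    """Hypothetical second model: maps a control vector to LoRA factors (B, A)."""

    def __init__(self, ctrl_dim, d_out, d_in, rank, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in, self.rank = d_out, d_in, rank
        n_params = rank * (d_out + d_in)  # size of B plus size of A
        # Untrained random projection; a real system would learn these weights.
        self.proj = [[rng.gauss(0.0, 0.1) for _ in range(ctrl_dim)]
                     for _ in range(n_params)]

    def __call__(self, ctrl):
        flat = [sum(w * c for w, c in zip(row, ctrl)) for row in self.proj]
        r, split = self.rank, self.d_out * self.rank
        B = [flat[i * r:(i + 1) * r] for i in range(self.d_out)]  # d_out x rank
        rest = flat[split:]
        A = [rest[i * self.d_in:(i + 1) * self.d_in] for i in range(r)]  # rank x d_in
        return B, A


def apply_lora(W, B, A, scale=1.0):
    """Return W + scale * (B @ A) without modifying the frozen base weight W."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weight, 2 x 3
gen = LoraGenerator(ctrl_dim=4, d_out=2, d_in=3, rank=1)
B, A = gen([0.5, -1.0, 0.0, 2.0])       # control vector emitted by the LLM
W_eff = apply_lora(W, B, A)             # weight the LLM would actually use
```

Because this toy generator is a linear map with no bias, an all-zero control vector yields a zero update and leaves `W` unchanged. The hard part raised above, how to actually train such a generator, is not addressed by this sketch.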
…
So, you could do some really interesting things. For example, instead of using a mixture of experts (MoE) where one expert knows programming and another knows healthcare, you could train the LLM to recognize when it needs to make minor changes to its current LoRA and output a special token and vector. That pair is fed into the dynamic LoRA offset model, which makes minor modifications to the current LoRA that is applied via the PEFT adapter.
I feel like if you took a bunch of LoRAs (assuming the tokens were not retrained) and labeled them, you could train the higher-level dynamic LoRA tensor network to blend between the LoRAs that it knows about, and you could train the LLM to output the special tokens when it recognizes it needs to shift between latent spaces…
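A rough sketch of what “blending between LoRAs” could look like numerically, in plain Python. The function names and the idea of feeding in one score per labeled LoRA are assumptions for illustration; in the proposal those scores would come from the higher-level network, which is not modeled here. Each LoRA is represented by its full weight delta (the product B·A), and the blend is a softmax-weighted sum of those deltas.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_deltas(deltas, scores):
    """Blend same-shaped LoRA weight deltas using softmax(scores) as mixing weights."""
    weights = softmax(scores)
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(w * d[i][j] for w, d in zip(weights, deltas)) for j in range(cols)]
            for i in range(rows)]


# Two hypothetical labeled LoRAs, already multiplied out to 2 x 2 deltas.
code_delta = [[0.2, 0.0], [0.0, 0.2]]
health_delta = [[0.0, 0.4], [0.4, 0.0]]

# Equal scores give an even mix; a changing score vector would shift the blend.
mixed = blend_deltas([code_delta, health_delta], scores=[0.0, 0.0])
```

One design note: this blends the multiplied-out deltas. Averaging the A factors and the B factors separately and then multiplying is generally not the same thing, so a dynamic blender would need to pick one of the two conventions deliberately.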
…
And for that matter you could take that in a couple of interesting directions… For example:
- This one is kind of boring, but you could try using a model with a smaller context window (even 4k or 8k). When it got near the end of its context, you could have the pre-trained dynamic LoRA tensor network evaluate the current context/tokens and spit out modifications to the current LoRA that embed some of that context into the latent space, allowing you to heavily compress and summarize the current context window and free up a bunch of token/attention space…
OR
- You could go in a completely different direction and stack these on top of each other (a LoRA can be applied to any tensor network, including one that produces downstream LoRAs) to get a dynamically sized network. That basically gives you a dynamically sized LLM that only uses as much brain power as it thinks it needs for each problem. For example, it could use the 4-bit 7b LLM at the base, with a bunch of 3b-parameter dynamic LoRA layers stacked above it. For simple problems it would just invoke the base network, but when it thinks it needs more “brain power” it could spit out a special token/vector pair signifying that the next layer up needs to be involved, and it would turn on that layer to build a LoRA for the layer below.
…
Heck, you could even do both of those scenarios separately, with a third network that dynamically blended between the two in some way (much like how LoRAs are statically blended together today, but this would be dynamic, and the blend could change when specially trained tokens are encountered).
…
There are a lot of other weird things I can think of that you could do with dynamic LoRAs (like make any LLM multi-modal…), but first I just want to figure out how crazy this is and what people think about it. Or maybe it’s already been done and I’m just late to the party…
…
So, what do you guys think?
- Are you saying you want a model that will spit out LoRAs? Like “Please generate me a LoRA that will make yourself totally amazing”? If so, this is more in the realm of a Star Trek food replicator, i.e. it works amazingly on a TV screen. If not, then sorry. The closest to this would be a model that picks the correct LoRA needed to reply. Adapters can easily be switched on the fly, so a model could be made that calls a function to select the correct adapter. Maybe this is how ChatGPT works, maybe not.
  - No, I’m not advocating for creating a text-to-LoRA model. Though that would be a neat project, I think you’d have a monumental training task on your hands, and really, it just doesn’t seem that practical. Fine-tuning isn’t expensive enough to merit trying to train or build that network anyway, so “the juice wouldn’t be worth the squeeze”.
    Picking a correct LoRA for a response is what an MoE (Mixture of Experts) system does.
    What I’m proposing is training a regular LLM to occasionally spit out tokens that signal another ML network to run, which makes minor runtime adjustments to the current LoRA to keep it “on track”. Like a thousand tiny micro-adjustments over the course of a long conversation. That could be used to shift the current latent space into one where the model has an “intuitive” or “latent” understanding of much of what is currently in the context, so that the actual context and attention tokens could be freed up for later use.
    Basically, if the network is already in the optimal LoRA, the ML network would just spit out an identity (no-op) tensor for the LoRA so that it never changes. But as the LLM realizes it’s no longer in the realm of its current latent space, it spits out a special “think-harder” token, which signals the ML network to run.
    The ML network takes the current context and pushes it into a weighted vectorized embedding that is representative of the current “state”, and spits out a tensor that makes micro-adjustments to the LoRA / PEFT adapter.
    That was one such application for this that I was proposing.
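The loop described in this reply can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: the token string `<think-harder>` is hypothetical, `offset_model` is a made-up stand-in for the ML network (it just scales the state vector), and the adjustment is assumed to be additive, so the “no change” case corresponds to an all-zero offset.

```python
THINK_HARDER = "<think-harder>"  # hypothetical special token, not from any real tokenizer


def offset_model(state, rows, cols, step=0.01):
    """Toy stand-in for the offset network: turn a state vector into a small delta."""
    return [[step * state[(i * cols + j) % len(state)] for j in range(cols)]
            for i in range(rows)]


def run_stream(stream, delta):
    """Walk (token, state_vector) pairs; adjust the LoRA delta only on the trigger token.

    Returns the final delta and how many times the offset network ran.
    """
    calls = 0
    for token, state in stream:
        if token != THINK_HARDER:
            continue  # same effect as a zero offset: the adapter is left untouched
        offset = offset_model(state, len(delta), len(delta[0]))
        delta = [[d + o for d, o in zip(d_row, o_row)]
                 for d_row, o_row in zip(delta, offset)]
        calls += 1
    return delta, calls


stream = [("def", [0.0, 0.0]), (THINK_HARDER, [1.0, -2.0]), ("return", [0.0, 0.0])]
delta, calls = run_stream(stream, [[0.0, 0.0], [0.0, 0.0]])
print(calls, delta)
```

Only the trigger token costs an extra network call here, which is the “only run the second network when needed” property the reply describes. Whether many small additive nudges stay stable over a long conversation is an open question this sketch does not answer.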
 
- Using flan-t5 models: https://huggingface.co/lorahub
  - That’s a very interesting project which is similar in many ways to what I’m thinking of. They’re doing something a little different, but it’s still really neat. I’m going to check that out. Thanks for sharing it!
 
- It’s been done b4 
- This is basically one of the main reasons to use LoRAs. Someone posted this in the machine learning Reddit about 2 months ago, but the idea is still a solid one. Train an intermediate model to determine which expert or LoRA to use, then use that LoRA for that task. It’s better than mixture of experts because you get much better control over which expert or LoRA will receive the request.
  - Which expert are you talking about? Machine learning noob here.
    - A mixture of experts is the type of technique that GPT-4 is rumored to be using.
      From Wikipedia: “Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. It differs from ensemble techniques in that typically only one or a few expert models will be run, rather than combining results from all models.”
      In basic terms, there are networks that get really good at one type of thing and compete to provide input when a question comes in during inference. There may be a network that is really good at science, or math, or literature, and has a better understanding of a field or subject than the other “experts”, so it provides the response instead of the others.
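For anyone new to the term, here is a toy top-k routing sketch in plain Python showing the “only one or a few experts run” behavior from the Wikipedia description. The experts are plain functions on a number and the gate scores are hand-written; both are illustrative assumptions, since a real MoE learns the gate and uses neural networks as experts.

```python
import math


def top_k_gate(scores, k=1):
    """Return {expert_index: weight} for the k best experts, softmax-normalized."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_scores, k=1):
    """Run only the selected experts and combine their outputs by gate weight."""
    gates = top_k_gate(gate_scores, k)
    return sum(w * experts[i](x) for i, w in gates.items())


# Three made-up "experts"; unselected ones are never called.
experts = [lambda x: x + 1, lambda x: x * 10, lambda x: -x]
y = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, -1.0], k=1)
```

With `k=1` this behaves like the “pick one LoRA per request” router discussed in this comment; the original post’s idea differs in that the selected adapter would itself be adjusted continuously rather than chosen from a fixed set.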
 
 
- If I am understanding what you are saying, you have basically just reinvented context with extra steps.
  - Doesn’t this method free up context?
  - Can you explain more? I thought this would make it so that context and attention could be freed up and reused for new tokens. It would also allow for a much larger context size without the quadratic memory consumption, possibly even static memory consumption.
 
- “electrodes on the base of your skull”: could you say more about this therapy? Is it some kind of neurofeedback / TENS (Transcutaneous Electrical Nerve Stimulation) / EMS (Electrical Muscle Stimulation)?
  - Yeah, it was an industrial TENS unit applied to my upper neck and back to help with some pain I’ve been dealing with since I was rear-ended a few months ago. You can’t really do much but just sit there and think, so that’s what I did. Maybe I’ll bring an audiobook next time. 😅
 
- Isn’t that just a hypernetwork? It’s been done before, e.g. for Stable Diffusion.
  - Neat, that’s really similar to what I was thinking of. I know SD uses transformer blocks (inside its U-Net), but has anyone done this with LLMs?
 
- This group is researching LoRA architectures. The hidra-b architecture uses a form of merging similar to your suggestion, and they go over pros and cons, what they tried, and whatnot:
  - It looks like a neat project, and correct me if I’m wrong, but it looks like their goal is just doing MoE and blending, and not really any dynamic context sliding?
 

