Multiple Local LLMs Pipeline on M-Series Mac

LocoMod@alien.top · 2 years ago

BAAI has the best embedding model I’ve tried so I’m excited to see what comes of this.

LocoMod@alien.top · 2 years ago

This is what my hobby project essentially does. I’m running a single chat from 3 different servers in my network all serving different LLMs that are given a role in the chat pipeline. I can send the same prompt to multiple models so they can work on it concurrently, or have them handoff each other’s responses to continue elaborating, validating, or whatever that LLMs job is. Since each server is serving an API and websocket route, all I need to do is put it behind a proxy and port forward them to the public internet. Anyone here could visit the public URL and run inference workflows in my homelab(theoretically speaking). They could also spin up an instance on their side and we can have our servers talk to each other.

Of course that’s highly insecure and just bait for bad actors. So I will scale it using overlay network that requires a key exchange and runs over VPN.

Any startup thinking they are going to profit from this idea will only burn investor money and waste their own time. This will all be free and it’s only a matter of time before the open source community cuts into their hopes and dreams.

LocoMod@alien.top · 2 years ago

I am not a Python hater, but Go is what Python should have been if it actually stuck to the Zen of Python.

You do know that what is arguably the most successful open-source project of the past decade that powers most of the modern internet is written in Go right?

https://github.com/kubernetes/kubernetes

LocoMod@alien.top · 2 years ago

We need some hero to develop an app that downloads more GPU memory like those apps back in the 90’s. /s

LocoMod@alien.top · 2 years ago

I’m getting the same output. Those are line breaks. How odd…

LocoMod@alien.top · 2 years ago

Ideally we would be better in a timeline where LLMs could do this better than classical methods but we’re not there yet. You can code a handler that cleans up html retrieval quite trivial since you’re just looking for the text in specific tags like articles, headers, paragraphs, etc. There are a ton of frameworks and examples out there on how to do this and a proper handler would execute the cleanup in a fraction of the time even the most powerful LLM ever hoped to.

LocoMod@alien.top · 2 years ago

Quantz are up:

https://huggingface.co/TheBloke/Starling-LM-7B-alpha-GGUF/tree/main

LocoMod@alien.top · 2 years ago

This is something I’ve noticed with large context as well. This is why the platform built around LLMs is what will be the major differentiator for the foreseeable future. I’m cooking up a workflow to insert remote LLMs as part of a chat workflow and successfully tested running inference on a fast Mistral-7B model and a large Dolphin-Yi-70B on different servers from a single chat view successfully about an hour ago. This will unlock the capability to have multiple LLMs working together to manage context by providing summaries, offloading realtime embedding/retrieval to a remote LLM, and a ton of other possibilities. I got it working on a 64GB M2 and a 128GB M3. Tonight I will insert the 4090RTX into the mix. The plan is to have the 4090 run small LLMs. Think 13B and smaller. These run and light speed on my 4090. Its job can be to provide summaries of the context by using LLMs finetuned for that purpose. The new Orca13B is promising little agent that so far follows instructions really well for these types of workflows. Then we can have all 3 servers working together on a solution. Ultimately, all of the responses would be merged into the “ideal response” and output as the “final answer”. I am not concerned with speed for my use case as I use LLMs for highly technical work. I need correctness above all even if this means waiting a while for the next step.

I’m also going to implement a mesh VPN so we can do this over WAN and scale it even more with a trusted group of peers.

The magic behind ChatGPT is the tooling and how much compute they can burn. My belief is the model is less relevant than folks think. It’s the best model no doubt, but if we were allowed to run it on the CLI as a pure prompt/response workflow between use and model with no tooling in between, my belief is it would be a lot like the best open source models…

LocoMod@alien.top · 2 years ago

You’ve basically described the entire purpose behind Retrieval Augmented Generation.

LocoMod@alien.top · 2 years ago

What’s stopping us from building a mesh of web crawlers and creating a distributed database that anyone can host and add to the total pool of indexers/servers? How long would it take to create a quality dataset by deploying bots that crawl their way “out” of the most popular and trusted sites for particular knowledge domains and just compress and dump that into a format for training into said global p2p mesh? If we got a couple of thousand nerds on Reddit to contribute compute and storage capacity to this network we might be able to build it relatively fast. Just sayin…

LocoMod@alien.top · 2 years ago

This is going to be a solid Netflix miniseries on two years. I am cheering for the team at OpenAI. There is no victory in stagnation and no honor fearing the unknown. Full steam ahead folks. And bring on GPT5 already!

LocoMod@alien.top · 2 years ago

I have. You simply parse the prompt for a url and then write a handler to retrieve the page content using whatever language or framework you use. Then you clean it up and send the content along with the prompt to the LLM and do QA over it.

LocoMod@alien.top · 2 years ago

Had the same problem last night and I promptly deleted it.

LocoMod@alien.top · 2 years ago

Multiple Local LLMs Pipeline on M-Series Mac