ShitGobbler69@alien.top to LocalLLaMA • Running full Falcon-180B under budget constraint • 1 year ago
FYI, if all you're using it for is benchmarking (not chat mode), you can probably do it in far less VRAM: load one layer into VRAM, process the entire set of input tokens through it, keep that output, load the next layer into VRAM, and repeat.
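A minimal sketch of that layer-streaming idea, using NumPy matrices as stand-ins for transformer blocks; `to_vram` here is a hypothetical host-to-device copy (in a real setup it would be a CUDA transfer, e.g. `tensor.to("cuda")` in PyTorch), and the layer shapes/count are illustrative, not Falcon-180B's:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64

# Stand-ins for transformer block weights, all resident in host RAM.
cpu_layers = [rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)
              for _ in range(4)]

def to_vram(w):
    # Hypothetical host->device copy; modeled here as a plain array copy.
    return w.copy()

# The entire set of input tokens, processed all at once per layer.
x = rng.standard_normal((8, 128, hidden))  # (batch, seq_len, hidden)

acts = x
for w in cpu_layers:
    w_dev = to_vram(w)            # load ONE layer's weights into "VRAM"
    acts = np.tanh(acts @ w_dev)  # run every token through this layer
    del w_dev                     # evict it before loading the next layer

print(acts.shape)  # final activations for all tokens: (8, 128, 64)
```

Peak "device" memory is one layer's weights plus the activations, instead of the whole model, at the cost of one full host-to-device transfer per layer per forward pass; that trade-off is why this works for one-shot benchmarking but is painfully slow for interactive chat.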