GPU recommendations for rXg LLM
November 03, 2024
rXg includes LLM and RAG capabilities that depend on GPUs.
The recommended production deployment system architecture is to put the GPU(s) in the Fleet Manager. rXg can perform local RAG while leveraging a remote LLM. It is recommended to use WireGuard to create an SD-WAN between the rXg edges and the rXg Fleet Manager.
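To illustrate the SD-WAN recommendation, below is a minimal WireGuard peer configuration sketch for an rXg edge tunneling to the Fleet Manager. All keys, addresses, and the endpoint hostname are placeholders, not values from an actual rXg deployment:

```ini
# Hypothetical WireGuard config for an rXg edge (placeholder values).
[Interface]
PrivateKey = <edge-private-key>
Address = 10.99.0.2/24

[Peer]
# rXg Fleet Manager hosting the GPU(s)
PublicKey = <fleet-manager-public-key>
Endpoint = fleet-manager.example.net:51820
AllowedIPs = 10.99.0.1/32
# Keep the tunnel open through NAT so the edge can reach the remote LLM
PersistentKeepalive = 25
```

Each edge gets its own `[Peer]` entry on the Fleet Manager side, forming the hub-and-spoke SD-WAN described above.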
For very basic testing purposes, a GPU with 8 GB of VRAM can run quantized Mistral 7b and Llama 3 8b. Examples of GPUs for this purpose include the Nvidia 3070 and Nvidia 4060.
The minimum PoC configuration is 24 GB of VRAM, which can run fp16 Llama 3 8b, 2-bit Llama 3 70b, and 3-bit Mixtral 8x7b. Example GPUs: the Nvidia 3090 and Nvidia 4090.
The Nvidia 3090 Founders Edition is readily available at a reasonable price and occupies three slots. Most Nvidia 4090 cards require four slots and are extremely heavy.
The best "bang of the buck" is to install multiple 24GB GPUs such as the Nvidia 3090. The rXg will automatically utilize multiple cards.
48 GB (2 x 24 GB) of VRAM allows for the utilization of more precise quantizations of Llama 3 70b and Mixtral 8x7b.
72 GB (3 x 24 GB) of VRAM allows for the utilization of Mixtral 8x22b, a model that generates excellent inferences.
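A rough way to sanity-check these VRAM tiers is to estimate weight memory as parameter count times bits per weight. This is a sketch only: it ignores KV cache, activations, and runtime overhead, which can add several GiB, and it assumes Mixtral 8x22b has roughly 141B total parameters:

```python
def weight_vram_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM needed for model weights alone, in GiB.

    Ignores KV cache, activations, and runtime overhead, which can
    add several GiB on top of this figure.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Tiers from the recommendations above:
print(round(weight_vram_gib(8, 16), 1))    # fp16 Llama 3 8b    -> ~14.9 GiB (fits in 24 GB)
print(round(weight_vram_gib(70, 2), 1))    # 2-bit Llama 3 70b  -> ~16.3 GiB (fits in 24 GB)
print(round(weight_vram_gib(46.7, 3), 1))  # 3-bit Mixtral 8x7b -> ~16.3 GiB (fits in 24 GB)
print(round(weight_vram_gib(141, 4), 1))   # 4-bit Mixtral 8x22b -> ~65.7 GiB (needs the 72 GB tier)
```

The headroom between the weight estimate and the card's VRAM is what absorbs the KV cache and batch activations, which is why a model whose weights come to ~16 GiB is a comfortable fit on a 24 GB card.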
For production environments, Nvidia's professional-line GPUs are highly recommended. The Nvidia A6000 Ada has 48 GB of VRAM, active cooling, and occupies two slots. It is the highest-density Nvidia card with active cooling that will work in any chassis.
The Nvidia L40 (48 GB) and H100 PCIe (80 GB) are the most powerful GPUs worth considering. These GPUs require special chassis: their heatsinks are passive, and they rely on blower fans integrated into the chassis.