exo software – A distributed LLM solution running on a cluster of computers, smartphones, or SBCs

You’d typically need hardware with a large amount of memory and bandwidth, plus multiple GPUs, to run the latest large language models (LLMs) such as DeepSeek R1 with its 671 billion parameters. Such hardware is unaffordable or simply unavailable to most people, and the exo software works around that limitation: it is a distributed LLM solution that runs on a cluster of computers (with or without NVIDIA GPUs), smartphones, and/or single board computers such as Raspberry Pi boards.

In some ways, exo works like distcc, which distributes C compilation across a build farm, but it targets AI workloads such as LLM inference instead.

Exo software distributed LLM solution
Exo cluster with Linux boxes with NVIDIA GPUs and MacBook Pro 16GB

Key features of Exo software:

  • Support for LLaMA (MLX and tinygrad), Mistral, LLaVA, Qwen, and DeepSeek models.
  • Dynamic Model Partitioning – exo splits models across devices based on the current network topology and the resources available on each device, in order to run larger models than any single device could handle on its own. That’s what makes it a “distributed LLM solution”. Several partitioning strategies are available, and the default is “ring memory weighted partitioning”, where each device runs a number of model layers proportional to its memory (see the sketch after this list).
  • Automatic Device Discovery / Zero manual configuration – exo will automatically discover other devices using the best method available.
  • ChatGPT-compatible API – a one-line change in your application is enough to run models on your own hardware through exo.
  • Device Equality – exo does not use a master-worker architecture; devices connect in a peer-to-peer (P2P) fashion instead. As long as a device is connected somewhere in the network, it can be used to run models.
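
To make the default strategy concrete, here is a minimal sketch of memory-weighted layer partitioning. The function and the rounding behavior are my own assumptions for illustration, not exo’s actual implementation:

```python
# Minimal sketch: assign each device a number of model layers
# proportional to its RAM ("ring memory weighted partitioning").
def partition_layers(total_layers: int, device_memory_gb: dict) -> dict:
    total_mem = sum(device_memory_gb.values())
    shares, assigned = {}, 0
    devices = list(device_memory_gb.items())
    for i, (name, mem) in enumerate(devices):
        if i == len(devices) - 1:
            # Give the remainder to the last device so all layers are covered
            shares[name] = total_layers - assigned
        else:
            shares[name] = round(total_layers * mem / total_mem)
            assigned += shares[name]
    return shares

# Llama 3.1 8B has 32 transformer layers
print(partition_layers(32, {"macbook_16gb": 16, "rpi5_a": 8, "rpi5_b": 8}))
# -> {'macbook_16gb': 16, 'rpi5_a': 8, 'rpi5_b': 8}
```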

OS support is unclear, and I don’t see any Windows support; instructions are only provided for Linux, macOS, Android, and iOS. The only other software requirement is Python >= 3.12.0. Linux systems with an NVIDIA GPU also need the NVIDIA driver, the CUDA toolkit, and the cuDNN library.

On the hardware front, the important thing is to have enough memory across all devices to fit the model. For example, running Llama 3.1 8B (FP16) requires 16GB of RAM, so any of the following configurations would work:

  • 2x 8GB M3 MacBook Airs, or
  • 1x 16GB NVIDIA RTX 4070 Ti laptop, or
  • 2x Raspberry Pi 400 with 4GB of RAM each (running on CPU) plus 1x 8GB Mac Mini; heterogeneous architectures can work together.

If I understand correctly, it would be possible to run DeepSeek R1 (full 671B, FP16) on a cluster of Raspberry Pi boards with 8GB or 16GB of RAM each, as long as there’s a combined ~1.3TB of RAM, or about 170 Raspberry Pi 5 boards with 8GB RAM. Performance would likely be horrendous, better expressed in tokens per hour, but it should work in theory… The developers explain that “adding less capable devices will slow down individual inference latency but will increase the overall throughput of the cluster.”
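
As a quick back-of-the-envelope check (FP16 stores about 2 bytes per parameter; the figures below ignore KV cache and per-device runtime overhead, which is why the ~170 estimate above is slightly higher):

```python
# FP16 weights need ~2 bytes per parameter
params = 671e9                     # DeepSeek R1, 671 billion parameters
bytes_needed = params * 2          # ~1.342e12 bytes, i.e. ~1.34 TB
boards = bytes_needed / (8 * 1e9)  # number of 8GB Raspberry Pi 5 boards
print(f"~{bytes_needed / 1e12:.2f} TB -> ~{boards:.0f} boards")
# -> ~1.34 TB -> ~168 boards
```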

Exo needs to be installed from source on all machines as follows:
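
The steps below follow the source install from the exo README on GitHub (a git clone followed by an editable pip install):

```
git clone https://github.com/exo-explore/exo.git
cd exo
pip install -e .
```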


Once done, you only need to run the following command on all machines:
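
According to the project documentation, starting a node takes no arguments; each machine discovers the others automatically:

```
exo
```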


exo terminal windows

That’s it! A ChatGPT-like WebUI will be started at http://localhost:52415. This is what it looks like on my machine (I only installed it on a single Ubuntu 24.04 machine).

tinychat dashboard
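
The same port also serves the ChatGPT-compatible API mentioned above, so a standard OpenAI-style request runs against your own hardware. A minimal sketch, where the model id “llama-3.2-1b” is an assumption and should be replaced with whatever model you have downloaded:

```python
# Query exo's ChatGPT-compatible endpoint (default port 52415)
import requests

resp = requests.post(
    "http://localhost:52415/v1/chat/completions",
    json={
        "model": "llama-3.2-1b",  # assumed model id, substitute your own
        "messages": [{"role": "user", "content": "What is exo?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```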

You can download various Llama models directly from the interface. It first failed when I tried due to an “llvmlite” error, but I fixed it after installing the missing library:
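
Since llvmlite is a Python package, the fix was presumably a simple pip install (my assumption; the exact command may differ):

```
pip install llvmlite
```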


I installed two Llama 1B models from the web interface, but they still use a lot of memory (all of my 16GB), and my laptop will typically hang when all memory is used… Even after rebooting my laptop and only running exo and the web interface, it still hangs. Note that exo is also experimental software, so that might explain it.

exo llama 3.2 1B demo

You’ll find the source code, extra instructions for macOS, and developer documentation (API) on GitHub.

Thanks to Onebir for the tip.




8 Replies to “exo software – A distributed LLM solution running on a cluster of computers, smartphones, or SBCs”

  1. OK, I have 4 Odroid-MC1s plus one Odroid HC1 lying around somewhere. That would be 40GB of RAM total. Looks like I’ll have a very busy weekend.

    1. Please post back with your results. I have a gaming laptop that runs 4x8b and 13b models quite well with its 40 GB of RAM. I can fully accelerate 8b models on the Nvidia 3070 GPU, and I generally get 27 to 50 tokens per second depending on the model. This cluster solution interests me because my main server is a Ryzen 5950X with an AMD 7900GRE GPU, and I have several other mini PCs in the cluster that span Ryzen 3400U up to Ryzen 8945U. That should make for decent performance over a 5 GbE dedicated network. Most of the servers sit at 10% utilization so this may be a way to run a 34B model faster than just running it in RAM (with some GPU acceleration) on a single system.

  2. TV boxes supporting Armbian are the cheapest Arm boards ever to run something like that. The Raspberry Pi became a luxury item some time ago.

    1. I can imagine someone with hundreds of decommissioned STBs trying to wind this thing up. I wonder if it’s worth it.

      1. One dollar more or less in one device is meaningless.
        One dollar more or less in one thousand devices becomes meaningful.

  3. “The developers explain that “adding less capable devices will slow down individual inference latency but will increase the overall throughput of the cluster.””

    I think this mainly applies under “ring memory weighted partitioning” if the compute of the less capable devices is low relative to their RAM. If there’s spare RAM across the system, making it possible to scale down the RAM usage on those devices, that ought to mitigate the problem (e.g. by bringing the effective compute/RAM ratio of all devices more closely into line).


    This guy tries out exo with 5 high-end Macs:
    https://www.youtube.com/watch?v=Ju0ndy2kwlw

