exo software – A distributed LLM solution running on a cluster of computers, smartphones, or SBCs

You’d typically need hardware with a large amount of memory and bandwidth, plus multiple GPUs, to run the latest large language models (LLMs) such as DeepSeek R1 with its 671 billion parameters. Such hardware is unaffordable or simply unavailable to most people, and the exo software works around that limitation: it is a distributed LLM solution that runs on a cluster of computers (with or without NVIDIA GPUs), smartphones, and/or single board computers such as Raspberry Pi boards.

In some ways, exo works like distcc, which distributes C compilation across a build farm, but it targets AI workloads such as LLM inference instead.

Exo software distributed LLM solution
Exo cluster with Linux boxes with NVIDIA GPUs and MacBook Pro 16GB

Key features of Exo software:

  • Support for LLaMA (MLX and tinygrad), Mistral, LLaVA, Qwen, and DeepSeek models.
  • Dynamic Model Partitioning – exo splits models across devices based on the current network topology and the resources available on each device, in order to run larger models than any single device could handle on its own. That’s what makes it a “distributed LLM solution”. Several partitioning strategies are available, and the default is “ring memory weighted partitioning”, where each device runs a number of model layers proportional to its memory (see the sketch after this list).
  • Automatic Device Discovery / Zero manual configuration – exo will automatically discover other devices using the best method available.
  • ChatGPT-compatible API – a one-line change in your application is enough to run models on your own hardware through exo.
  • Device Equality – exo does not use a master-worker architecture; devices connect in a peer-to-peer (P2P) fashion instead. As long as a device is connected somewhere in the network, it can be used to run models.
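
To make the default strategy concrete, here is a minimal sketch of memory-weighted layer partitioning. The function and the rounding behavior are my own assumptions for illustration, not exo’s actual implementation:

```python
# Minimal sketch: assign each device a number of model layers
# proportional to its RAM ("ring memory weighted partitioning").
def partition_layers(total_layers: int, device_memory_gb: dict) -> dict:
    total_mem = sum(device_memory_gb.values())
    shares, assigned = {}, 0
    devices = list(device_memory_gb.items())
    for i, (name, mem) in enumerate(devices):
        if i == len(devices) - 1:
            # Give the remainder to the last device so all layers are covered
            shares[name] = total_layers - assigned
        else:
            shares[name] = round(total_layers * mem / total_mem)
            assigned += shares[name]
    return shares

# Llama 3.1 8B has 32 transformer layers
print(partition_layers(32, {"macbook_16gb": 16, "rpi5_a": 8, "rpi5_b": 8}))
# -> {'macbook_16gb': 16, 'rpi5_a': 8, 'rpi5_b': 8}
```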

OS support is unclear, and I don’t see any Windows support; instructions are only provided for Linux, macOS, Android, and iOS. The only other software requirement is Python >= 3.12.0. Linux systems with an NVIDIA GPU also need the NVIDIA driver, the CUDA toolkit, and the cuDNN library.

On the hardware front, the important thing is to have enough memory across all devices to fit the model. For example, running Llama 3.1 8B (FP16) requires 16GB of RAM, so any of the following configurations would work:

  • 2x 8GB M3 MacBook Airs, or
  • 1x 16GB NVIDIA RTX 4070 Ti laptop, or
  • 2x Raspberry Pi 400 with 4GB of RAM each (running on CPU) plus 1x 8GB Mac Mini; heterogeneous architectures can work together.

If I understand correctly, it would be possible to run DeepSeek R1 (full 671B, FP16) on a cluster of Raspberry Pi boards with 8GB or 16GB of RAM each, as long as there’s a combined ~1.3TB of RAM, or about 170 Raspberry Pi 5 boards with 8GB RAM. Performance would likely be horrendous, better expressed in tokens per hour, but it should work in theory… The developers explain that “adding less capable devices will slow down individual inference latency but will increase the overall throughput of the cluster.”
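
As a quick back-of-the-envelope check (FP16 stores about 2 bytes per parameter; the figures below ignore KV cache and per-device runtime overhead, which is why the ~170 estimate above is slightly higher):

```python
# FP16 weights need ~2 bytes per parameter
params = 671e9                     # DeepSeek R1, 671 billion parameters
bytes_needed = params * 2          # ~1.342e12 bytes, i.e. ~1.34 TB
boards = bytes_needed / (8 * 1e9)  # number of 8GB Raspberry Pi 5 boards
print(f"~{bytes_needed / 1e12:.2f} TB -> ~{boards:.0f} boards")
# -> ~1.34 TB -> ~168 boards
```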

Exo needs to be installed from source on all machines as follows:
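
The steps below follow the source install from the exo README on GitHub (a git clone followed by an editable pip install):

```
git clone https://github.com/exo-explore/exo.git
cd exo
pip install -e .
```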


Once done, you only need to run the following command on all machines:
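
According to the project documentation, starting a node takes no arguments; each machine discovers the others automatically:

```
exo
```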


exo terminal windows

That’s it! A ChatGPT-like WebUI will be started at http://localhost:52415. This is what it looks like on my machine (I only installed it on a single Ubuntu 24.04 machine).

tinychat dashboard
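
The same port also serves the ChatGPT-compatible API mentioned above, so a standard OpenAI-style request runs against your own hardware. A minimal sketch, where the model id “llama-3.2-1b” is an assumption and should be replaced with whatever model you have downloaded:

```python
# Query exo's ChatGPT-compatible endpoint (default port 52415)
import requests

resp = requests.post(
    "http://localhost:52415/v1/chat/completions",
    json={
        "model": "llama-3.2-1b",  # assumed model id, substitute your own
        "messages": [{"role": "user", "content": "What is exo?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```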

You can download various Llama models directly from the interface. It first failed when I tried due to an “llvmlite” error, but I fixed it after installing the missing library:
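
Since llvmlite is a Python package, the fix was presumably a simple pip install (my assumption; the exact command may differ):

```
pip install llvmlite
```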


I installed two Llama 1B models from the web interface, but they still use a lot of memory (all of my 16GB), and my laptop will typically hang when all memory is used… Even after rebooting my laptop and only running exo and the web interface, it still hangs. Note that exo is also experimental software, so that might explain it.

exo llama 3.2 1B demo

You’ll find the source code, extra instructions for macOS, and developer documentation (API) on GitHub.

Thanks to Onebir for the tip.




8 Replies to “exo software – A distributed LLM solution running on a cluster of computers, smartphones, or SBCs”

  1. OK, I have 4 Odroid-MC1s plus one Odroid HC1 lying around somewhere. That would be 40GB of RAM total. Looks like I’ll have a very busy weekend.

    1. Please post back with your results. I have a gaming laptop that runs 4x8b and 13b models quite well with its 40 GB of RAM. I can fully accelerate 8b models on the Nvidia 3070 GPU, and I generally get 27 to 50 tokens per second depending on the model. This cluster solution interests me because my main server is a Ryzen 5950X with an AMD 7900GRE GPU, and I have several other mini PCs in the cluster that span Ryzen 3400U up to Ryzen 8945U. That should make for decent performance over a 5 GbE dedicated network. Most of the servers sit at 10% utilization so this may be a way to run a 34B model faster than just running it in RAM (with some GPU acceleration) on a single system.

  2. TV boxes supporting Armbian are the cheapest Arm boards ever to run something like that. The Raspberry Pi became a luxury item some time ago.

    1. I can imagine someone with hundreds of decommissioned STBs trying to wind this thing up. I wonder if it’s worth it.

      1. One dollar more or less in one device is meaningless.
        One dollar more or less in one thousand devices becomes meaningful.

  3. “The developers explain that “adding less capable devices will slow down individual inference latency but will increase the overall throughput of the cluster.””

    I think this mainly applies under “ring memory weighted partitioning” if the compute of the less capable devices is low relative to their RAM. If there’s spare RAM across the system, making it possible to scale down the RAM usage on those devices, that ought to mitigate the problem (e.g. by bringing the effective compute/RAM ratio of all devices more closely into line).


    This guy tries out exo with 5 high-end Macs:
    https://www.youtube.com/watch?v=Ju0ndy2kwlw

