Introducing literegistry
Overview
literegistry is a lightweight service discovery system for distributed ML inference. It's like Kubernetes service discovery, but without the containers, YAML files, or complexity. Built for HPC research clusters where nodes come and go (checkpoint resources, Slurm preemption, spot instances). Just add --registry redis://your-host:6379 to your vLLM/SGLang commands and get automatic failover, load balancing, and dynamic scaling. pip install literegistry and you're done.

What research computing looks like today
If you're a researcher or student in a modern ML lab, this scenario will sound familiar:
Your lab has a Slurm cluster. You get some dedicated GPU resources—maybe a few nodes that are "yours" most of the time. But for the big jobs? You're competing for a shared pool of checkpoint resources. These GPUs can be allocated to you when they're free, but at any moment, they can be taken away for higher-priority jobs. Your training run that was humming along on 32 GPUs suddenly drops to 8. Or worse, that inference server you stood up on a borrowed node just disappears.
Infrastructure in academic and industrial research computing is constantly in flux. Jobs start and stop at any moment. Resources come and go. The cluster you have at 2 PM looks nothing like the cluster you have at 2 AM.
Traditional distributed computing tools weren't built for dynamic research environments. They assume a stable cluster where you request N nodes, get N nodes, and keep those N nodes until your job finishes. They treat node failures as exceptional cases, not everyday occurrences. They require manual intervention to handle dynamic resource allocation.
What about Kubernetes and other orchestration tools?
You might be thinking: "Doesn't Kubernetes solve the problem?" And you'd be right—for cloud environments. Kubernetes, Docker Swarm, and similar tools have been the standard for service discovery and orchestration in cloud computing for years. They're battle-tested, feature-rich, and widely adopted.
They were designed for containerized cloud environments, not HPC research clusters.
These tools bring enormous complexity:
- Container orchestration when you just need to run Python scripts
- YAML configuration files that are a full-time job to maintain
- Docker/container knowledge that not every researcher has
- Complete infrastructure overhaul to adopt
- Complexity designed for microservices, not ML workloads
When you're a grad student who just wants to run vLLM on some GPUs and have it work when nodes come and go, containerizing your entire workflow is overkill. You don't want to learn Kubernetes, write Dockerfiles, set up a container registry, and completely change how you deploy your research code.
What if you could get the core benefits—service discovery, health checking, automatic failover—without changing how you work?
literegistry exists to solve this problem.
literegistry gives you the orchestration capabilities you need, written in Python, designed for ML workloads, and simple enough to add to your existing Slurm scripts without rewriting everything. No containers required. No YAML files. No infrastructure overhaul. Just pip install literegistry and get back to your research.
What is literegistry?
literegistry is a lightweight service registry and discovery system built specifically for distributed model inference deployments. It's designed from the ground up to handle modern research computing where resources come and go, jobs scale up and down, and failures are part of normal operation.
Think of it as a phonebook for your distributed model servers—one that automatically updates itself when servers appear or disappear, routes traffic intelligently, and handles the chaos of a shared HPC environment so you don't have to.
With first-class support for popular inference engines like vLLM and SGLang, literegistry makes it simple to:
- Deploy model servers that automatically register themselves
- Route requests to healthy, available servers
- Scale your deployment up or down as resources change
- Handle failures without manual intervention
- Monitor your entire distributed system from a single dashboard or CLI
Built for your research environment
If your lab has a mix of dedicated nodes and checkpoint/preemptible nodes (like most academic clusters do), literegistry is built for you. Submit separate Slurm jobs for your dedicated partition and your checkpoint partition. When checkpoint nodes get preempted, literegistry automatically removes them from the pool. When they come back, resubmit the job and they rejoin automatically. Your clients never need to know what's happening behind the scenes.
How literegistry works
literegistry consists of four main components that work together:
1. The Registry (Key-Value Store)
At the heart of literegistry is a distributed key-value store that tracks all your model servers. Think of it as a phonebook for your cluster—it knows which models are running, where they're located, their health status, and performance metrics.
The registry supports two backends. The FileSystem Backend is perfect for single-node setups or HPC clusters with shared filesystems (NFS). It's simple to set up with zero dependencies, but can bottleneck under high concurrency when you have many services registering and querying simultaneously.
The Redis Backend is recommended for production deployments, especially when running across multiple nodes without shared storage. It provides high-performance concurrent access and is built for distributed systems.
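Whichever backend you choose, the rest of the stack talks to it through the same key-value interface, so switching is a configuration change rather than a code change. Here's a minimal sketch using the get_kvstore, FileSystemKVStore, and RedisKVStore names that appear later in this post (exact signatures may vary by version):
from literegistry import get_kvstore, FileSystemKVStore, RedisKVStore

# Let the URI decide: a redis:// URL selects the Redis backend,
# a plain path selects the filesystem backend.
store = get_kvstore("redis://cpu-node:6379")   # production / multi-node
# store = get_kvstore("/shared/registry")      # development / shared NFS

# Or construct a backend explicitly:
fs_store = FileSystemKVStore("/shared/cluster/registry")
redis_store = RedisKVStore("redis://cpu-node:6379")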
2. Model Server Wrappers (vLLM & SGLang)
literegistry provides wrappers for popular inference engines that handle all the registration complexity for you. When you launch a vLLM or SGLang server through literegistry, it automatically:
- Registers itself with the registry on startup
- Sends heartbeats to maintain its active status
- Reports metrics like request counts and latency
- Deregisters gracefully on shutdown
You can spin up and tear down servers dynamically, and the system adapts automatically.
3. Gateway Server
The Gateway is an HTTP reverse proxy that sits between your clients and model servers. It provides:
- OpenAI-compatible API endpoints (/v1/completions, /v1/chat/completions, /v1/models)
- Automatic load balancing based on server latency
- Smart routing based on the model parameter in requests
- Health monitoring and failover
The Gateway continuously tracks which servers are healthy and routes requests to the fastest available instance. If a server fails, it automatically retries on another replica.
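Because the endpoints are OpenAI-compatible, you should be able to point the standard openai Python client at the Gateway instead of hand-writing HTTP calls. A sketch under two assumptions: the Gateway is reachable at gateway-host:8080 (as in the examples below) and doesn't check API keys, so a placeholder is fine:
from openai import OpenAI

client = OpenAI(
    base_url="http://gateway-host:8080/v1",  # the Gateway's OpenAI-compatible API
    api_key="not-needed",                    # placeholder; the Gateway does the routing
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # the Gateway routes on this field
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    max_tokens=100,
)
print(response.choices[0].message.content)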
4. Client Library & CLI
For programmatic access, literegistry provides RegistryClient to register servers and query available models, and RegistryHTTPClient to make requests with automatic failover and retry logic.
For monitoring, there's a CLI tool:
literegistry summary --registry redis://your-host:6379
Getting started: installation
Installation is straightforward via pip:
pip install literegistry
Running literegistry: complete workflow
Let's walk through deploying a distributed inference cluster on an HPC system.
Step 1: start the registry
First, you need a central registry. For production deployments, use Redis:
# Start Redis server (or use an existing Redis instance)
literegistry redis --port 6379
The Redis server will run on your login node or a dedicated service node. All other components will connect to this registry.
For development or shared filesystem environments, you can skip this step and use a filesystem path instead (e.g., /shared/registry).
Step 2: launch model servers
Now spin up your vLLM or SGLang servers. The beauty of literegistry is that you can use all standard vLLM/SGLang arguments—the wrapper is transparent.
Using vLLM:
literegistry vllm \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--registry redis://cpu-node:6379 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9
Using SGLang:
literegistry sglang \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--registry redis://cpu-node:6379 \
--tp-size 4 \
--mem-fraction-static 0.9
You can launch multiple instances of the same model across different nodes for load balancing, or run different models—literegistry handles all the tracking automatically.
The only thing that changed from your normal vLLM/SGLang command is adding --registry redis://cpu-node:6379. That's it. You don't rewrite your launch scripts, you don't containerize anything, you don't change your Slurm submission workflow. You just add one flag and get automatic service discovery.
Example: running multiple replicas
# On GPU node 1
literegistry vllm --model "meta-llama/Llama-3.1-8B-Instruct" --registry redis://cpu-node:6379
# On GPU node 2
literegistry vllm --model "meta-llama/Llama-3.1-8B-Instruct" --registry redis://cpu-node:6379
# On GPU node 3
literegistry vllm --model "mistralai/Mixtral-8x7B-Instruct-v0.1" --registry redis://cpu-node:6379
All three servers automatically register and start sending heartbeats.
Step 3: start the gateway
Launch the Gateway server to handle client requests:
literegistry gateway \
--registry redis://cpu-node:6379 \
--host 0.0.0.0 \
--port 8080
The Gateway immediately queries the registry and starts routing traffic to available servers.
Step 4: monitor your cluster
Use the CLI to check cluster status:
literegistry summary --registry redis://cpu-node:6379
Output:
meta-llama/Llama-3.1-8B-Instruct: 2
mistralai/Mixtral-8x7B-Instruct-v0.1: 1
You have 2 replicas of Llama running and 1 Mixtral instance.
Using the gateway API
Once your cluster is running, clients can send requests to the Gateway using the OpenAI-compatible API:
curl -X POST http://gateway-host:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain quantum computing in simple terms",
"max_tokens": 100
}'
The Gateway automatically:
- Looks up available servers for that model
- Selects the server with the lowest latency
- Routes the request
- Returns the response
- Updates metrics
If that server is down, it automatically tries the next available replica.
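If you want to confirm what the Gateway is serving before sending traffic, the /v1/models endpoint is the place to look. A small sketch with the requests library, assuming the response follows the usual OpenAI list shape ({"data": [{"id": ...}, ...]}):
import requests

resp = requests.get("http://gateway-host:8080/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))  # e.g., meta-llama/Llama-3.1-8B-Instruct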
Writing code: server-side integration
Registering a custom server
If you're not using vLLM or SGLang, you can still register any HTTP service with literegistry:
from literegistry import RegistryClient, get_kvstore
import asyncio

async def register_my_server():
    # Connect to registry (Redis or filesystem)
    store = get_kvstore("redis://localhost:6379")
    # Or for filesystem: store = get_kvstore("/shared/registry")
    client = RegistryClient(store, service_type="model_path")

    # Register your server
    await client.register(
        port=8000,
        metadata={
            "model_path": "my-custom-model",
            "model_type": "custom-transformer"
        }
    )
    print("Server registered! Starting heartbeats...")

    # Keep server alive with heartbeats
    while True:
        await asyncio.sleep(10)  # Heartbeat every 10 seconds
        await client.heartbeat(port=8000)

asyncio.run(register_my_server())
This pattern lets you integrate any HTTP model server into the literegistry ecosystem.
Querying available models
from literegistry import RegistryClient, get_kvstore
import asyncio

async def list_models():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    # Get all available models and their servers
    models = await client.models()
    for model_name, servers in models.items():
        print(f"\n{model_name}:")
        for server in servers:
            print(f"  - {server['base_url']}")
            print(f"    Last heartbeat: {server.get('last_heartbeat_time')}")
            print(f"    Request stats: {server.get('request_stats', {})}")

asyncio.run(list_models())
Writing code: client-side usage
Basic HTTP client with automatic failover
The RegistryHTTPClient provides automatic failover and retry logic:
from literegistry import RegistryClient, RegistryHTTPClient, get_kvstore
import asyncio

async def make_request():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    # Create HTTP client for a specific model
    async with RegistryHTTPClient(
        client,
        "meta-llama/Llama-3.1-8B-Instruct"
    ) as http_client:
        # Make request with automatic retry and rotation
        result, server_url = await http_client.request_with_rotation(
            endpoint="v1/completions",
            payload={
                "prompt": "Write a haiku about distributed systems",
                "max_tokens": 50
            },
            timeout=30,
            max_retries=3
        )
        print(f"Response from {server_url}:")
        print(result)

asyncio.run(make_request())
If the first server fails or times out, the client automatically tries the next available replica.
Batch processing with parallel requests
For high-throughput workloads, process multiple requests in parallel:
from literegistry import RegistryClient, RegistryHTTPClient, get_kvstore
import asyncio

async def batch_inference():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    # Prepare batch of requests
    prompts = [
        {"prompt": f"Question {i}: Tell me about AI", "max_tokens": 50}
        for i in range(100)
    ]

    async with RegistryHTTPClient(
        client,
        "meta-llama/Llama-3.1-8B-Instruct"
    ) as http_client:
        # Process 100 requests with max 5 concurrent
        results = await http_client.parallel_requests(
            endpoint="v1/completions",
            payloads_list=prompts,
            max_parallel_requests=5,
            timeout=30,
            max_retries=3
        )

        print(f"Processed {len(results)} requests")
        for i, (result, server) in enumerate(results):
            print(f"Request {i} served by {server}")

asyncio.run(batch_inference())
The client automatically distributes load across available replicas and handles failures gracefully.
Storage backend trade-offs
Choosing between FileSystem and Redis backends depends on your deployment:
Filesystem backend
Use when:
- Running on a single machine for development/testing
- All nodes share a filesystem (common in HPC with NFS)
- You want zero additional dependencies
- You have a small, stable deployment (5-10 servers)
Limitations:
- Can bottleneck with many concurrent services/clients
- File locking overhead increases with scale
- Not ideal for 50+ services or high query rates
- Can struggle when many nodes are rapidly joining/leaving (e.g., checkpoint nodes being preempted and restarted frequently)
Example:
from literegistry import FileSystemKVStore
store = FileSystemKVStore("/shared/cluster/registry")
Redis backend (recommended for production)
Use when:
- Running across multiple nodes without shared storage
- Need high-concurrency access (many clients/services)
- Running production workloads with 10+ services
- Using checkpoint/preemptible nodes that frequently come and go
- Deploying on cloud infrastructure with spot instances
Example:
from literegistry import RedisKVStore
store = RedisKVStore("redis://cpu-node:6379")
Example deployment
Here's how a typical research group might deploy literegistry on an HPC cluster with both dedicated and checkpoint resources:
Setup:
- 4 dedicated GPU nodes (guaranteed), each with 4x A100 GPUs
- 6-12 checkpoint GPU nodes (come and go based on cluster load), each with 4x A100 GPUs
- Shared NFS storage for datasets
- Redis running on login node for coordination
- Slurm scheduler managing all allocations
Deployment:
# 1. Start Redis on login node (persistent)
literegistry redis --port 6379
# 2. Submit SLURM job for DEDICATED nodes (partition=dedicated)
# These 4 nodes are guaranteed to stay up
sbatch --partition=dedicated --nodes=4 --gres=gpu:4 --wrap="
literegistry vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--registry redis://cpu-node:6379 \
--tensor-parallel-size 4
"
# 3. Submit SLURM job for CHECKPOINT nodes (partition=checkpoint)
# These can be preempted at any time - that's okay!
sbatch --partition=checkpoint --nodes=8 --gres=gpu:4 --wrap="
literegistry vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--registry redis://cpu-node:6379 \
--tensor-parallel-size 4
"
# 4. Start Gateway on login node (persistent)
literegistry gateway \
--registry redis://cpu-node:6379 \
--host 0.0.0.0 \
--port 8080
What happens:
- Your 4 dedicated nodes register immediately and stay up
- The 8 checkpoint nodes register as they're allocated (might take a few minutes)
- When checkpoint nodes are preempted, they disappear from the registry automatically
- The gateway keeps routing traffic to whatever nodes are currently available
- When checkpoint nodes become available again, you resubmit the job and they rejoin automatically
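If you'd like to watch this happen programmatically (the Python equivalent of running literegistry summary in a loop), a small polling script against the registry is enough. A sketch that uses only the RegistryClient.models() call shown earlier; the 30-second interval is arbitrary:
from literegistry import RegistryClient, get_kvstore
import asyncio

async def watch_replicas():
    store = get_kvstore("redis://cpu-node:6379")
    client = RegistryClient(store, service_type="model_path")
    while True:
        models = await client.models()
        for model_name, servers in models.items():
            print(f"{model_name}: {len(servers)} replica(s)")
        print("---")
        await asyncio.sleep(30)  # poll every 30 seconds

asyncio.run(watch_replicas())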
Metrics and monitoring
literegistry tracks detailed metrics for each server:
- Request counts (last 5/15/60 minutes)
- Latency percentiles (p50, p90, p99)
- Error rates
- Throughput
Access these via the client:
from literegistry import RegistryClient, get_kvstore
import asyncio

async def get_metrics():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    models = await client.models()
    for model_name, servers in models.items():
        for server in servers:
            stats = server.get('request_stats', {})
            print(f"{model_name} @ {server['base_url']}:")
            print(f"  Last 15min requests: {stats.get('last_15_minutes', 0)}")
            print(f"  Avg latency: {stats.get('last_15_minutes_latency', 0):.2f}ms")

asyncio.run(get_metrics())
Why literegistry? (and why not just use Kubernetes?)
At this point you might be thinking: "We already have Kubernetes/Consul/Nomad for service discovery."
Yes, these are powerful orchestration platforms. They absolutely can handle service discovery and failover. But consider the cost:
What Kubernetes requires:
- Learning container orchestration
- Dockerizing all your ML code
- Setting up and maintaining a Kubernetes cluster (or paying for managed K8s)
- Writing deployment YAML for each service
- Understanding pods, deployments, services, ingresses
- Debugging container networking issues
- Complete infrastructure migration
What literegistry requires:
- pip install literegistry
- Add one argument to your existing vLLM/SGLang launch command: --registry redis://your-host:6379
- That's it
Getting started today
literegistry is open source and ready to use:
# Install
pip install literegistry
# Start serving
literegistry redis --port 6379
literegistry vllm --model meta-llama/Llama-3.1-8B-Instruct --registry redis://localhost:6379
literegistry gateway --registry redis://localhost:6379 --port 8080
# Check status
literegistry summary --registry redis://localhost:6379
Whether you're running a single GPU workstation or a 100-node HPC cluster, literegistry provides the coordination layer you need for distributed model inference.
Conclusion
If you're tired of babysitting distributed inference deployments, manually updating configs when nodes go down, and explaining to users why their requests are failing, literegistry might be exactly what you need.
It won't make checkpoint resources stop getting preempted. But it will make dealing with that preemption automatic, transparent, and painless.
Try it out on your next project. Whether you're running 4 GPUs or 400, whether your infrastructure is rock-solid or constantly changing, literegistry can help.
Resources:
- GitHub: github.com/goncalorafaria/literegistry
- Documentation: docs.literegistry.org
- Examples: github.com/goncalorafaria/literegistry/examples
Citation:
If you use literegistry in your research, please cite:
@software{literegistry2025,
  title={literegistry: Lightweight Service Discovery for Distributed Model Inference},
  author={Faria, Gonçalo and Smith, Noah},
  year={2025},
  url={https://github.com/goncalorafaria/literegistry}
}