Introducing literegistry
Overview
literegistry is a lightweight service discovery system for distributed ML inference. It's like Kubernetes service discovery, but without the containers, YAML files, or complexity. Built for HPC research clusters where nodes come and go (checkpoint resources, Slurm preemption, spot instances). Just add --registry redis://your-host:6379 to your vLLM/SGLang commands and get automatic failover, load balancing, and dynamic scaling. pip install literegistry and you're done.

What research computing looks like today
If you're a researcher or student in a modern ML lab, this scenario will sound familiar:
Your lab has a Slurm cluster. You get some dedicated GPU resources—maybe a few nodes that are "yours" most of the time. But for the big jobs? You're competing for a shared pool of checkpoint resources. These GPUs can be allocated to you when they're free, but at any moment, they can be taken away for higher-priority jobs. Your training run that was humming along on 32 GPUs suddenly drops to 8. Or worse, that inference server you stood up on a borrowed node just disappears.
Infrastructure in academic and industrial research computing is constantly in flux. Jobs start and stop at any moment. Resources come and go. The cluster you have at 2 PM looks nothing like the cluster you have at 2 AM.
Traditional distributed computing tools weren't built for dynamic research environments. They assume a stable cluster where you request N nodes, get N nodes, and keep those N nodes until your job finishes. They treat node failures as exceptional cases, not everyday occurrences. They require manual intervention to handle dynamic resource allocation.
What about Kubernetes and other orchestration tools?
You might be thinking: "Doesn't Kubernetes solve the problem?" And you'd be right—for cloud environments. Kubernetes, Docker Swarm, and similar tools have been the standard for service discovery and orchestration in cloud computing for years. They're battle-tested, feature-rich, and widely adopted.
They were designed for containerized cloud environments, not HPC research clusters.
These tools bring enormous complexity:
- Container orchestration when you just need to run Python scripts
- YAML configuration files that are a full-time job to maintain
- Docker/container knowledge that not every researcher has
- Complete infrastructure overhaul to adopt
- Complexity designed for microservices, not ML workloads
When you're a grad student who just wants to run vLLM on some GPUs and have it work when nodes come and go, containerizing your entire workflow is overkill. You don't want to learn Kubernetes, write Dockerfiles, set up a container registry, and completely change how you deploy your research code.
What if you could get the core benefits—service discovery, health checking, automatic failover—without changing how you work?
literegistry exists to solve this problem.
literegistry gives you the orchestration capabilities you need, written in Python, designed for ML workloads, and simple enough to add to your existing Slurm scripts without rewriting everything. No containers required. No YAML files. No infrastructure overhaul. Just pip install literegistry and get back to your research.
What is literegistry?
literegistry is a lightweight service registry and discovery system built specifically for distributed model inference deployments. It's designed from the ground up to handle modern research computing where resources come and go, jobs scale up and down, and failures are part of normal operation.
Think of it as a phonebook for your distributed model servers—one that automatically updates itself when servers appear or disappear, routes traffic intelligently, and handles the chaos of a shared HPC environment so you don't have to.
With first-class support for popular inference engines like vLLM and SGLang, literegistry makes it simple to:
- Deploy model servers that automatically register themselves
- Route requests to healthy, available servers
- Scale your deployment up or down as resources change
- Handle failures without manual intervention
- Monitor your entire distributed system from a single dashboard or CLI
Built for your research environment
If your lab has a mix of dedicated nodes and checkpoint/preemptible nodes (like most academic clusters do), literegistry is built for you. Submit separate Slurm jobs for your dedicated partition and your checkpoint partition. When checkpoint nodes get preempted, literegistry automatically removes them from the pool. When they come back, resubmit the job and they rejoin automatically. Your clients never need to know what's happening behind the scenes.
How literegistry works
literegistry consists of four main components that work together:
1. The Registry (Key-Value Store)
At the heart of literegistry is a distributed key-value store that tracks all your model servers. Think of it as a phonebook for your cluster—it knows which models are running, where they're located, their health status, and performance metrics.
The registry supports two backends. The FileSystem Backend is perfect for single-node setups or HPC clusters with shared filesystems (NFS). It's simple to set up with zero dependencies, but can bottleneck under high concurrency when you have many services registering and querying simultaneously.
The Redis Backend is recommended for production deployments, especially when running across multiple nodes without shared storage. It provides high-performance concurrent access and is built for distributed systems.
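Whichever backend you choose, the rest of the stack talks to it through the same key-value interface, so switching is a configuration change rather than a code change. Here's a minimal sketch using the get_kvstore, FileSystemKVStore, and RedisKVStore names that appear later in this post (exact signatures may vary by version):
from literegistry import get_kvstore, FileSystemKVStore, RedisKVStore

# Let the URI decide: a redis:// URL selects the Redis backend,
# a plain path selects the filesystem backend.
store = get_kvstore("redis://cpu-node:6379")   # production / multi-node
# store = get_kvstore("/shared/registry")      # development / shared NFS

# Or construct a backend explicitly:
fs_store = FileSystemKVStore("/shared/cluster/registry")
redis_store = RedisKVStore("redis://cpu-node:6379")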
2. Model Server Wrappers (vLLM & SGLang)
literegistry provides wrappers for popular inference engines that handle all the registration complexity for you. When you launch a vLLM or SGLang server through literegistry, it automatically:
- Registers itself with the registry on startup
- Sends heartbeats to maintain its active status
- Reports metrics like request counts and latency
- Deregisters gracefully on shutdown
You can spin up and tear down servers dynamically, and the system adapts automatically.
3. Gateway Server
The Gateway is an HTTP reverse proxy that sits between your clients and model servers. It provides:
- OpenAI-compatible API endpoints (/v1/completions, /v1/chat/completions, /v1/models)
- Automatic load balancing based on server latency
- Smart routing based on the model parameter in requests
- Health monitoring and failover
The Gateway continuously tracks which servers are healthy and routes requests to the fastest available instance. If a server fails, it automatically retries on another replica.
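Because the endpoints are OpenAI-compatible, you should be able to point the standard openai Python client at the Gateway instead of hand-writing HTTP calls. A sketch under two assumptions: the Gateway is reachable at gateway-host:8080 (as in the examples below) and doesn't check API keys, so a placeholder is fine:
from openai import OpenAI

client = OpenAI(
    base_url="http://gateway-host:8080/v1",  # the Gateway's OpenAI-compatible API
    api_key="not-needed",                    # placeholder; the Gateway does the routing
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # the Gateway routes on this field
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    max_tokens=100,
)
print(response.choices[0].message.content)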
4. Client Library & CLI
For programmatic access, literegistry provides RegistryClient to register servers and query available models, and RegistryHTTPClient to make requests with automatic failover and retry logic.
For monitoring, there's a CLI tool:
literegistry summary --registry redis://your-host:6379
Getting started: installation
Installation is straightforward via pip:
pip install literegistry
Running literegistry: complete workflow
Let's walk through deploying a distributed inference cluster on an HPC system.
Step 1: start the registry
First, you need a central registry. For production deployments, use Redis:
# Start Redis server (or use an existing Redis instance)
literegistry redis --port 6379
The Redis server will run on your login node or a dedicated service node. All other components will connect to this registry.
For development or shared filesystem environments, you can skip this step and use a filesystem path instead (e.g., /shared/registry).
Step 2: launch model servers
Now spin up your vLLM or SGLang servers. The beauty of literegistry is that you can use all standard vLLM/SGLang arguments—the wrapper is transparent.
Using vLLM:
literegistry vllm \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--registry redis://cpu-node:6379 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9
Using SGLang:
literegistry sglang \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--registry redis://cpu-node:6379 \
--tp-size 4 \
--mem-fraction-static 0.9
You can launch multiple instances of the same model across different nodes for load balancing, or run different models—literegistry handles all the tracking automatically.
The only thing that changed from your normal vLLM/SGLang command is adding --registry redis://cpu-node:6379. That's it. You don't rewrite your launch scripts, you don't containerize anything, you don't change your Slurm submission workflow. You just add one flag and get automatic service discovery.
Example: running multiple replicas
# On GPU node 1
literegistry vllm --model "meta-llama/Llama-3.1-8B-Instruct" --registry redis://cpu-node:6379
# On GPU node 2
literegistry vllm --model "meta-llama/Llama-3.1-8B-Instruct" --registry redis://cpu-node:6379
# On GPU node 3
literegistry vllm --model "mistralai/Mixtral-8x7B-Instruct-v0.1" --registry redis://cpu-node:6379
All three servers automatically register and start sending heartbeats.
Step 3: start the gateway
Launch the Gateway server to handle client requests:
literegistry gateway \
--registry redis://cpu-node:6379 \
--host 0.0.0.0 \
--port 8080
The Gateway immediately queries the registry and starts routing traffic to available servers.
Step 4: monitor your cluster
Use the CLI to check cluster status:
literegistry summary --registry redis://cpu-node:6379
Output:
meta-llama/Llama-3.1-8B-Instruct: 2
mistralai/Mixtral-8x7B-Instruct-v0.1: 1
You have 2 replicas of Llama running and 1 Mixtral instance.
Using the gateway API
Once your cluster is running, clients can send requests to the Gateway using the OpenAI-compatible API:
curl -X POST http://gateway-host:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain quantum computing in simple terms",
"max_tokens": 100
}'
The Gateway automatically:
- Looks up available servers for that model
- Selects the server with the lowest latency
- Routes the request
- Returns the response
- Updates metrics
If that server is down, it automatically tries the next available replica.
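If you want to confirm what the Gateway is serving before sending traffic, the /v1/models endpoint is the place to look. A small sketch with the requests library, assuming the response follows the usual OpenAI list shape ({"data": [{"id": ...}, ...]}):
import requests

resp = requests.get("http://gateway-host:8080/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))  # e.g., meta-llama/Llama-3.1-8B-Instruct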
Writing code: server-side integration
Registering a custom server
If you're not using vLLM or SGLang, you can still register any HTTP service with literegistry:
from literegistry import RegistryClient, get_kvstore
import asyncio

async def register_my_server():
    # Connect to registry (Redis or filesystem)
    store = get_kvstore("redis://localhost:6379")
    # Or for filesystem: store = get_kvstore("/shared/registry")
    client = RegistryClient(store, service_type="model_path")

    # Register your server
    await client.register(
        port=8000,
        metadata={
            "model_path": "my-custom-model",
            "model_type": "custom-transformer"
        }
    )
    print("Server registered! Starting heartbeats...")

    # Keep server alive with heartbeats
    while True:
        await asyncio.sleep(10)  # Heartbeat every 10 seconds
        await client.heartbeat(port=8000)

asyncio.run(register_my_server())
This pattern lets you integrate any HTTP model server into the literegistry ecosystem.
Querying available models
from literegistry import RegistryClient, get_kvstore
import asyncio

async def list_models():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    # Get all available models and their servers
    models = await client.models()
    for model_name, servers in models.items():
        print(f"\n{model_name}:")
        for server in servers:
            print(f"  - {server['base_url']}")
            print(f"    Last heartbeat: {server.get('last_heartbeat_time')}")
            print(f"    Request stats: {server.get('request_stats', {})}")

asyncio.run(list_models())
Writing code: client-side usage
Basic HTTP client with automatic failover
The RegistryHTTPClient provides automatic failover and retry logic:
from literegistry import RegistryClient, RegistryHTTPClient, get_kvstore
import asyncio

async def make_request():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    # Create HTTP client for a specific model
    async with RegistryHTTPClient(
        client,
        "meta-llama/Llama-3.1-8B-Instruct"
    ) as http_client:
        # Make request with automatic retry and rotation
        result, server_url = await http_client.request_with_rotation(
            endpoint="v1/completions",
            payload={
                "prompt": "Write a haiku about distributed systems",
                "max_tokens": 50
            },
            timeout=30,
            max_retries=3
        )
        print(f"Response from {server_url}:")
        print(result)

asyncio.run(make_request())
If the first server fails or times out, the client automatically tries the next available replica.
Batch processing with parallel requests
For high-throughput workloads, process multiple requests in parallel:
from literegistry import RegistryClient, RegistryHTTPClient, get_kvstore
import asyncio

async def batch_inference():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    # Prepare batch of requests
    prompts = [
        {"prompt": f"Question {i}: Tell me about AI", "max_tokens": 50}
        for i in range(100)
    ]

    async with RegistryHTTPClient(
        client,
        "meta-llama/Llama-3.1-8B-Instruct"
    ) as http_client:
        # Process 100 requests with max 5 concurrent
        results = await http_client.parallel_requests(
            endpoint="v1/completions",
            payloads_list=prompts,
            max_parallel_requests=5,
            timeout=30,
            max_retries=3
        )

        print(f"Processed {len(results)} requests")
        for i, (result, server) in enumerate(results):
            print(f"Request {i} served by {server}")

asyncio.run(batch_inference())
The client automatically distributes load across available replicas and handles failures gracefully.
Storage backend trade-offs
Choosing between FileSystem and Redis backends depends on your deployment:
Filesystem backend
Use when:
- Running on a single machine for development/testing
- All nodes share a filesystem (common in HPC with NFS)
- You want zero additional dependencies
- You have a small, stable deployment (5-10 servers)
Limitations:
- Can bottleneck with many concurrent services/clients
- File locking overhead increases with scale
- Not ideal for 50+ services or high query rates
- Can struggle when many nodes are rapidly joining/leaving (e.g., checkpoint nodes being preempted and restarted frequently)
Example:
from literegistry import FileSystemKVStore
store = FileSystemKVStore("/shared/cluster/registry")
Redis backend (recommended for production)
Use when:
- Running across multiple nodes without shared storage
- Need high-concurrency access (many clients/services)
- Running production workloads with 10+ services
- Using checkpoint/preemptible nodes that frequently come and go
- Deploying on cloud infrastructure with spot instances
Example:
from literegistry import RedisKVStore
store = RedisKVStore("redis://cpu-node:6379")
Example deployment
Here's how a typical research group might deploy literegistry on an HPC cluster with both dedicated and checkpoint resources:
Setup:
- 4 dedicated GPU nodes (guaranteed), each with 4x A100 GPUs
- 6-12 checkpoint GPU nodes (come and go based on cluster load), each with 4x A100 GPUs
- Shared NFS storage for datasets
- Redis running on login node for coordination
- Slurm scheduler managing all allocations
Deployment:
# 1. Start Redis on login node (persistent)
literegistry redis --port 6379
# 2. Submit SLURM job for DEDICATED nodes (partition=dedicated)
# These 4 nodes are guaranteed to stay up
sbatch --partition=dedicated --nodes=4 --gres=gpu:4 --wrap="
literegistry vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--registry redis://cpu-node:6379 \
--tensor-parallel-size 4
"
# 3. Submit SLURM job for CHECKPOINT nodes (partition=checkpoint)
# These can be preempted at any time - that's okay!
sbatch --partition=checkpoint --nodes=8 --gres=gpu:4 --wrap="
literegistry vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--registry redis://cpu-node:6379 \
--tensor-parallel-size 4
"
# 4. Start Gateway on login node (persistent)
literegistry gateway \
--registry redis://cpu-node:6379 \
--host 0.0.0.0 \
--port 8080
What happens:
- Your 4 dedicated nodes register immediately and stay up
- The 8 checkpoint nodes register as they're allocated (might take a few minutes)
- When checkpoint nodes are preempted, they disappear from the registry automatically
- The gateway keeps routing traffic to whatever nodes are currently available
- When checkpoint nodes become available again, you resubmit the job and they rejoin automatically
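If you'd like to watch this happen programmatically (the Python equivalent of running literegistry summary in a loop), a small polling script against the registry is enough. A sketch that uses only the RegistryClient.models() call shown earlier; the 30-second interval is arbitrary:
from literegistry import RegistryClient, get_kvstore
import asyncio

async def watch_replicas():
    store = get_kvstore("redis://cpu-node:6379")
    client = RegistryClient(store, service_type="model_path")
    while True:
        models = await client.models()
        for model_name, servers in models.items():
            print(f"{model_name}: {len(servers)} replica(s)")
        print("---")
        await asyncio.sleep(30)  # poll every 30 seconds

asyncio.run(watch_replicas())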
Metrics and monitoring
literegistry tracks detailed metrics for each server:
- Request counts (last 5/15/60 minutes)
- Latency percentiles (p50, p90, p99)
- Error rates
- Throughput
Access these via the client:
from literegistry import RegistryClient, get_kvstore
import asyncio

async def get_metrics():
    store = get_kvstore("redis://localhost:6379")
    client = RegistryClient(store, service_type="model_path")

    models = await client.models()
    for model_name, servers in models.items():
        for server in servers:
            stats = server.get('request_stats', {})
            print(f"{model_name} @ {server['base_url']}:")
            print(f"  Last 15min requests: {stats.get('last_15_minutes', 0)}")
            print(f"  Avg latency: {stats.get('last_15_minutes_latency', 0):.2f}ms")

asyncio.run(get_metrics())
Why literegistry? (and why not just use Kubernetes?)
At this point you might be thinking: "We already have Kubernetes/Consul/Nomad for service discovery."
Yes, these are powerful orchestration platforms. They absolutely can handle service discovery and failover. But consider the cost:
What Kubernetes requires:
- Learning container orchestration
- Dockerizing all your ML code
- Setting up and maintaining a Kubernetes cluster (or paying for managed K8s)
- Writing deployment YAML for each service
- Understanding pods, deployments, services, ingresses
- Debugging container networking issues
- Complete infrastructure migration
What literegistry requires:
- pip install literegistry
- Add one argument to your existing vLLM/SGLang launch command: --registry redis://your-host:6379
- That's it
Getting started today
literegistry is open source and ready to use:
# Install
pip install literegistry
# Start serving
literegistry redis --port 6379
literegistry vllm --model meta-llama/Llama-3.1-8B-Instruct --registry redis://localhost:6379
literegistry gateway --registry redis://localhost:6379 --port 8080
# Check status
literegistry summary --registry redis://localhost:6379
Whether you're running a single GPU workstation or a 100-node HPC cluster, literegistry provides the coordination layer you need for distributed model inference.
Conclusion
If you're tired of babysitting distributed inference deployments, manually updating configs when nodes go down, and explaining to users why their requests are failing, literegistry might be exactly what you need.
It won't make checkpoint resources stop getting preempted. But it will make dealing with that preemption automatic, transparent, and painless.
Try it out on your next project. Whether you're running 4 GPUs or 400, whether your infrastructure is rock-solid or constantly changing, literegistry can help.
Resources:
- GitHub: github.com/goncalorafaria/literegistry
- Documentation: docs.literegistry.org
- Examples: github.com/goncalorafaria/literegistry/examples
Citation:
If you use literegistry in your research, please cite:
@software{literegistry2025,
  title={literegistry: Lightweight Service Discovery for Distributed Model Inference},
  author={Faria, Gonçalo and Smith, Noah},
  year={2025},
  url={https://github.com/goncalorafaria/literegistry}
}