How we built a clean, lightweight container management system with support for local and remote Docker deployments, environment caching, and type-safe definitions.
The Challenge
When building Affine, we faced a significant infrastructure challenge: how do you run arbitrary RL environments securely, reproducibly, and at scale?
Traditional approaches had drawbacks:
- Kubernetes — Too complex for our needs, steep learning curve for miners
- Docker Compose — Limited orchestration capabilities, no built-in caching
- Serverless platforms — Expensive, cold start issues, limited GPU support
We needed something purpose-built for AI workloads.
Introducing Affinetes
Affinetes is our lightweight container orchestration system designed specifically for RL environment execution. It provides:
1. Clean Container Management
```python
from affinetes import Environment, Container, Resources

env = Environment(
    name="ded-v2",
    image="affine/ded-v2:latest",
    resources=Resources(cpu=4, memory="16Gi", gpu=1),
)

async with Container(env) as container:
    result = await container.evaluate(model)
```

The API is intentionally simple: no YAML files, no complex configuration, just Python code.
2. Local and Remote Deployment
Affinetes seamlessly switches between local Docker and remote deployments:
```python
# Local development
client = AffinetesClient(mode="local")

# Production
client = AffinetesClient(
    mode="remote",
    endpoint="https://compute.affine.io",
)
```

The same code works in both environments.
3. Environment Caching
RL environments can be expensive to initialize. Affinetes maintains a warm pool of ready environments:
```python
client = AffinetesClient(
    cache_size=10,   # keep 10 environments warm
    cache_ttl=3600,  # expire after 1 hour of inactivity
)
```

This dramatically reduces evaluation latency from minutes to seconds.
4. Type-Safe Definitions
All environment definitions are fully typed:
```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class EvaluationResult:
    score: float
    steps: int
    metadata: dict[str, Any]

class Environment(Protocol):
    async def evaluate(self, model: Model) -> EvaluationResult:
        ...
```

Type errors are caught at development time, not in production.
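Because `Environment` is a `Protocol`, any class with a matching `evaluate` method satisfies it structurally, with no inheritance required. A self-contained sketch of how that plays out (the `DedV2Environment` class and the simplified signatures are hypothetical, and the definitions are repeated so the snippet runs on its own):

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable

@dataclass
class EvaluationResult:
    score: float
    steps: int
    metadata: dict[str, Any] = field(default_factory=dict)

@runtime_checkable
class Environment(Protocol):
    async def evaluate(self, model: Any) -> EvaluationResult: ...

class DedV2Environment:
    """Hypothetical concrete environment; note: no Environment base class."""
    async def evaluate(self, model: Any) -> EvaluationResult:
        return EvaluationResult(score=1.0, steps=1)

# Structural check: method presence is verified at runtime;
# full signatures are verified statically by mypy/pyright.
assert isinstance(DedV2Environment(), Environment)

result = asyncio.run(DedV2Environment().evaluate(model=None))
```

Static checkers flag a missing or mis-typed `evaluate` at the call site, which is what keeps these errors out of production.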
Architecture
Container Lifecycle
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Created   │─────▶│   Running   │─────▶│  Completed  │
└─────────────┘      └─────────────┘      └─────────────┘
       │                    │                    │
       │                    ▼                    │
       │             ┌─────────────┐             │
       └────────────▶│   Cached    │◀────────────┘
                     └─────────────┘
```

Containers can be cached after completion for rapid reuse.
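The lifecycle above maps naturally onto a small state machine that rejects illegal transitions. A minimal sketch, assuming cached containers can be returned to `RUNNING` on reuse (the `State` and `transition` names are hypothetical, not the actual Affinetes API):

```python
from enum import Enum

class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    COMPLETED = "completed"
    CACHED = "cached"

# Allowed transitions from the lifecycle diagram; CACHED -> RUNNING
# models "rapid reuse" of a cached container (an assumption here).
TRANSITIONS = {
    State.CREATED: {State.RUNNING, State.CACHED},
    State.RUNNING: {State.COMPLETED, State.CACHED},
    State.COMPLETED: {State.CACHED},
    State.CACHED: {State.RUNNING},
}

def transition(current: State, target: State) -> State:
    """Advance the lifecycle, raising on any edge not in the diagram."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the diagram as data makes every lifecycle change auditable in one place.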
Resource Management
Affinetes implements fair scheduling across multiple concurrent evaluations:
```python
scheduler = ResourceScheduler(
    max_concurrent=4,
    gpu_memory_limit="40Gi",
    priority_queue=True,
)
```

Higher-priority evaluations (e.g., validator requests) are processed first.
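Priority ordering with FIFO tie-breaking can be sketched with a heap. `PriorityScheduler` below is a hypothetical stand-in for `ResourceScheduler`, showing the scheduling policy rather than the real implementation:

```python
import heapq
import itertools

class PriorityScheduler:
    """Illustrative scheduler: serves the highest-priority job first,
    breaking ties in submission (FIFO) order."""

    def __init__(self, max_concurrent: int = 4):
        self.max_concurrent = max_concurrent
        self._counter = itertools.count()  # monotonic FIFO tie-breaker
        self._heap: list[tuple[int, int, str]] = []

    def submit(self, job: str, priority: int = 0) -> None:
        # heapq is a min-heap, so negate priority: larger = served sooner
        heapq.heappush(self._heap, (-priority, next(self._counter), job))

    def next_batch(self) -> list[str]:
        """Take up to max_concurrent jobs for this scheduling round."""
        batch = []
        while self._heap and len(batch) < self.max_concurrent:
            _, _, job = heapq.heappop(self._heap)
            batch.append(job)
        return batch
```

The FIFO tie-breaker is what keeps the queue fair: equal-priority miners are served in arrival order rather than starving behind each other.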
Security Model
Running arbitrary code is inherently risky. Affinetes implements defense in depth:
- Container isolation — Each evaluation runs in its own container
- Network restrictions — Containers cannot access the internet by default
- Resource limits — CPU, memory, and GPU are strictly bounded
- Time limits — Evaluations are killed after a configurable timeout
- Read-only filesystems — Models cannot modify the evaluation environment
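Most of these restrictions map directly onto standard Docker flags. A sketch of a host-side launcher under that assumption (the function names are hypothetical, and the actual Affinetes flag values may differ):

```python
import subprocess

def sandbox_cmd(image: str, cpus: int = 4, memory: str = "16g") -> list[str]:
    """Build a docker invocation enforcing the restrictions listed above."""
    return [
        "docker", "run", "--rm",
        "--network", "none",    # network restrictions: no internet by default
        "--cpus", str(cpus),    # resource limits: CPU
        "--memory", memory,     # resource limits: memory
        "--read-only",          # read-only root filesystem
        "--pids-limit", "256",  # bound the number of processes
        image,
    ]

def run_sandboxed(image: str, timeout_s: int = 600) -> subprocess.CompletedProcess:
    # Time limit enforced from the host side: the container process
    # is killed when the subprocess timeout expires.
    return subprocess.run(sandbox_cmd(image), timeout=timeout_s, capture_output=True)
```

Enforcing the timeout from the host rather than inside the container matters: untrusted code cannot disable a limit it never sees.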
Performance Benchmarks
We compared Affinetes against alternatives for a typical evaluation workload:
| System | Cold Start | Warm Start | Throughput |
|---|---|---|---|
| Kubernetes | 45s | 12s | 100/hr |
| Docker Compose | 30s | 8s | 120/hr |
| Affinetes | 25s | 2s | 200/hr |
The caching system cuts Affinetes start-up time from 25s (cold) to 2s (warm), a better-than-12x improvement.
Future Plans
We're continuing to improve Affinetes:
- Multi-region deployment — Run evaluations closer to miners
- Spot instance support — Reduce costs with preemptible compute
- Custom environment SDK — Allow community-contributed environments
Open Source
Affinetes is part of the Affine open source project. Check it out at github.com/AffineFoundation/affine.
We welcome contributions!