Infrarix Deploy: Effortless AI Model Deployment
A comprehensive guide to Infrarix Deploy — ship AI models to production with one command, zero DevOps, and automatic scaling across 12+ global regions.
Introduction
Deploying AI models to production is still one of the hardest parts of the ML lifecycle. Teams spend weeks configuring infrastructure, writing Dockerfiles, setting up load balancers, and debugging GPU drivers — all before serving a single inference request.
Infrarix Deploy eliminates this complexity. Push your model, define your endpoints, and Deploy handles everything: containerization, scaling, GPU allocation, health checks, and global distribution.
The Deployment Gap
Most AI teams hit the same bottlenecks:
- Infrastructure complexity: Kubernetes, GPU drivers, CUDA versions, Docker images
- Scaling challenges: Cold starts, auto-scaling policies, resource waste
- No standardization: Every model has a different serving framework
- Cost overruns: GPU instances running 24/7 even with intermittent traffic
- Security gaps: Exposed endpoints, no auth, no rate limiting
How Infrarix Deploy Works
1. Push Your Model
Upload model weights, a serving configuration, and optional pre/post processing scripts. Deploy supports PyTorch, TensorFlow, ONNX, and any custom inference server.
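If your model needs custom pre/post processing, those hooks usually reduce to a load step and a predict step. Below is a minimal sketch of what such a script might look like for a PyTorch model; the handler.py name, the load()/predict() signatures, and the model.pt filename are illustrative assumptions, not the documented Infrarix Deploy interface.

```python
# handler.py: hypothetical pre/post-processing hooks for a PyTorch model.
# The load()/predict() names and signatures are illustrative assumptions,
# not the documented Infrarix Deploy handler contract.
import torch

_model = None
_device = "cpu"

def load(model_dir: str) -> None:
    """Called once at container start: load weights onto a GPU if one is present."""
    global _model, _device
    _device = "cuda" if torch.cuda.is_available() else "cpu"
    _model = torch.jit.load(f"{model_dir}/model.pt", map_location=_device)
    _model.eval()

def predict(payload: dict) -> dict:
    """Called per request: turn JSON into tensors, run inference, return JSON-serializable output."""
    inputs = torch.tensor(payload["inputs"], device=_device)
    with torch.no_grad():
        outputs = _model(inputs)
    return {"outputs": outputs.tolist()}
```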
2. Define Endpoints
Specify your endpoint configuration: hardware requirements (CPU, GPU type, memory), scaling policies (min/max replicas, scale-to-zero), and request settings.
3. Deploy & Serve
Deploy handles containerization, GPU scheduling, health checks, load balancing, and SSL termination. Your model is live in seconds with a production-ready API endpoint.
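Once the endpoint is up, serving a prediction is a plain HTTPS call. The sketch below assumes an API-key header named X-API-Key and a simple JSON schema, both illustrative; the URL follows the pattern shown in the Quick Start below.

```python
# Minimal client call against a deployed endpoint. The X-API-Key header name
# and the JSON request/response schema are illustrative assumptions.
import requests

API_KEY = "your-api-key"  # issued for your deployment (placeholder)
URL = "https://my-llm.deploy.infrarix.com/v1/predict"

resp = requests.post(
    URL,
    headers={"X-API-Key": API_KEY},
    json={"inputs": "Summarize the launch notes in one sentence."},
    timeout=30,  # matches the 30s endpoint timeout in the config example below
)
resp.raise_for_status()
print(resp.json())
```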
Quick Start
# Install CLI
npm install -g @infrarix/cli
# Login
infrarix login
# Deploy a model
infrarix deploy \
--name my-llm \
--framework pytorch \
--gpu a100 \
--replicas 1-10 \
--scale-to-zero \
./model/

Your model is now live at:
https://my-llm.deploy.infrarix.com/v1/predict

Configuration File
# infrarix.deploy.yaml
name: my-text-classifier
framework: pytorch
runtime: python3.11
hardware:
  gpu: a100
  memory: 16Gi
scaling:
  min_replicas: 0
  max_replicas: 10
  target_concurrency: 5
  scale_to_zero_after: 300s
endpoints:
  - path: /predict
    method: POST
    timeout: 30s
auth:
  type: api_key
  rate_limit: 100/min

Key Features
Scale to Zero
When traffic drops to zero, Deploy scales down to zero replicas — you only pay for actual usage. Cold starts are optimized with pre-warmed containers.
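Because a scaled-to-zero endpoint may still be warming up when the first request arrives, a generous client timeout plus a short retry loop is a sensible guard. The pattern below is a client-side sketch, not an Infrarix Deploy feature, and reuses the illustrative X-API-Key header from earlier.

```python
# Client-side guard for cold starts on a scale-to-zero endpoint:
# a generous timeout plus a short retry loop with linear backoff.
# Hypothetical client pattern, not an Infrarix Deploy feature.
import time
import requests

def predict_with_retry(url: str, api_key: str, payload: dict,
                       attempts: int = 3, backoff_s: float = 2.0) -> dict:
    for attempt in range(attempts):
        try:
            resp = requests.post(
                url,
                headers={"X-API-Key": api_key},  # header name is an assumption
                json=payload,
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s * (attempt + 1))  # wait longer before each retry
```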
GPU Scheduling
Deploy intelligently schedules workloads across GPU types (A100, H100, T4) based on your model requirements and budget preferences.
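The scheduling decision is essentially a fit-and-cost match between the model's requirements and the available GPU types. The toy sketch below shows that matching logic with invented hourly prices; it is not Deploy's actual scheduler or pricing.

```python
# Toy illustration of GPU selection: pick the cheapest GPU type that fits the
# model's memory footprint. Prices are invented for the example; this is not
# Deploy's actual scheduler.
GPU_TYPES = {
    "t4":   {"memory_gb": 16, "usd_per_hr": 0.55},
    "l4":   {"memory_gb": 24, "usd_per_hr": 0.80},
    "a100": {"memory_gb": 80, "usd_per_hr": 3.00},
    "h100": {"memory_gb": 80, "usd_per_hr": 5.00},
}

def cheapest_fit(model_memory_gb: float) -> str:
    candidates = [(spec["usd_per_hr"], name)
                  for name, spec in GPU_TYPES.items()
                  if spec["memory_gb"] >= model_memory_gb]
    if not candidates:
        raise ValueError("no single GPU type fits this model")
    return min(candidates)[1]

print(cheapest_fit(14))  # t4: the cheapest card with enough memory
print(cheapest_fit(40))  # a100: smallest price among cards that fit
```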
Blue/Green Deployments
Roll out new model versions with zero downtime. Canary deployments let you route a percentage of traffic to the new version before full rollout.
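Conceptually, a canary is just a weighted split of traffic between the stable and candidate versions. The toy sketch below illustrates the split only; Deploy's load balancer performs this server-side, and none of this code touches an Infrarix API.

```python
# Toy weighted canary split: route roughly 10% of requests to the new version.
# Conceptual only; Deploy's load balancer does this for you server-side.
import random

def pick_version(canary_weight: float = 0.10) -> str:
    return "v2-canary" if random.random() < canary_weight else "v1-stable"

counts = {"v1-stable": 0, "v2-canary": 0}
for _ in range(10_000):
    counts[pick_version()] += 1
print(counts)  # roughly 9,000 stable vs 1,000 canary
```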
Built-in Observability
Every inference request is logged with latency, throughput, GPU utilization, and error rates. Set up alerts for SLA breaches.
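To make the alerting concrete, an SLA check typically evaluates a percentile of recent latencies against a target. The sketch below computes a p95 over a small window using made-up sample values and a hypothetical 250 ms target; Deploy computes these metrics server-side, so this only illustrates the alert condition.

```python
# Hypothetical SLA alert condition: breach when p95 latency over a recent
# window exceeds the target. Sample latencies and the 250 ms target are
# made up for illustration.
import statistics

def p95(latencies_ms: list[float]) -> float:
    return statistics.quantiles(latencies_ms, n=20)[-1]  # 95th percentile

window = [112.0, 98.5, 130.2, 101.7, 540.9, 95.3, 88.8, 120.4, 99.1, 105.6]
SLA_P95_MS = 250.0

if p95(window) > SLA_P95_MS:
    print(f"SLA breach: p95 = {p95(window):.1f} ms > {SLA_P95_MS} ms")
else:
    print("within SLA")
```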
Supported Frameworks
| Framework | Versions | GPU Support |
|---|---|---|
| PyTorch | 2.0+ | A100, H100, T4, L4 |
| TensorFlow | 2.12+ | A100, H100, T4 |
| ONNX Runtime | 1.16+ | All GPUs + CPU |
| vLLM | 0.3+ | A100, H100 |
| TGI (Text Gen Inference) | 1.0+ | A100, H100 |
| Custom Docker | Any | All GPUs + CPU |
Frequently Asked Questions
How fast are cold starts?
Cold starts depend on model size. Small models (<1GB) start in under 5 seconds. Large models (>10GB) use pre-warmed containers to start in under 15 seconds.
Can I use my own GPU instances?
Enterprise plans support bring-your-own-cloud (BYOC) with Deploy managing orchestration on your infrastructure.
Is there a free tier?
Yes. The free tier includes 100 GPU-hours per month on T4 instances, perfect for development and testing.
Get Started
Deploy your first model in under 5 minutes. Learn more about Infrarix Deploy or read the comparison with Replicate.