Skip to content

Nebula Integration

Nebula is a distributed GPU inference cluster management system. xtrace serves as its lightweight observability backend for historical metrics and traces.

Architecture

Nebula ComponentReports To xtrace
NodeGPU metrics (utilization, temperature, memory, KV cache) via POST /v1/metrics/batch
GatewayRequest traces via POST /v1/l/batch (trace + observation per request)
RouterReads real-time metrics via GET /api/public/metrics/query for routing decisions

Metrics Reported by Node

Each Node reports metrics on every heartbeat cycle (~5 seconds):

MetricLabelsValueDescription
gpu_utilizationnode_id, gpu_index0–100GPU utilization percentage
gpu_temperaturenode_id, gpu_indexCelsiusGPU temperature
gpu_memory_used_mbnode_id, gpu_indexMBGPU memory used
gpu_memory_total_mbnode_id, gpu_indexMBGPU memory total
kv_cache_usagemodel_uid, replica_id, node_id0.0–1.0KV cache utilization ratio
pending_requestsmodel_uid, replica_id, node_idcountPending request count
prefix_cache_hit_ratemodel_uid, replica_id, node_id0.0–1.0Prefix cache hit ratio

Example: Node Reporting

bash
NOW=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
curl -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"metrics\":[
    {\"name\":\"gpu_utilization\",\"labels\":{\"node_id\":\"node-1\",\"gpu_index\":\"0\"},\"value\":85.0,\"timestamp\":\"$NOW\"},
    {\"name\":\"kv_cache_usage\",\"labels\":{\"model_uid\":\"qwen-72b\",\"replica_id\":\"0\",\"node_id\":\"node-1\"},\"value\":0.45,\"timestamp\":\"$NOW\"}
  ]}" \
  "http://127.0.0.1:8742/v1/metrics/batch"

Gateway Traces

Each Gateway request produces a trace with an observation:

  • Trace: name = request path, userId = requester, tags = ["gateway"]
  • Observation (type=GENERATION): model, input/output, promptTokens/completionTokens, latency

Router Queries

The Router queries xtrace for real-time routing signals:

bash
# Latest pending requests for a model
curl -H "Authorization: Bearer $TOKEN" \
  "http://127.0.0.1:8742/api/public/metrics/query?\
name=pending_requests&\
step=1m&agg=last&\
labels=%7B%22model_uid%22%3A%22qwen-72b%22%7D"

Freshness Check

Use meta.latest_ts to detect stale data:

if (now - meta.latest_ts) > 15s → treat as stale, use fallback
ConsumerFreshness ThresholdCache TTL
Router15 seconds30 seconds
Scheduler30 seconds30 seconds

Signal Contract

Key specifications include:

  • Metric value ranges and semantics
  • Query limits (50 series, 1000 points)
  • Error model and retry behavior
  • Acceptance criteria
SettingRouterScheduler
Max QPS≤ 10≤ 5
Connection timeout5s5s
Query timeout10s10s
Backoff on 429Use Retry-After + jitterUse Retry-After + jitter
Fallback on failureCache last result (30s TTL)Cache last result (30s TTL)

Released under the MIT License.