Copilot
AI-powered operational assistant that orchestrates reasoning, tool execution, and evidence gathering.
How it Works
OpsOrch Copilot is not just a chatbot. It is a reasoning engine that invokes Tools exposed by the MCP server.
Reasoning Engine
The Copilot uses an iterative Plan-Act-Observe loop to solve complex operational problems.
Planning
The LLM analyzes your question ("Why is checkout slow?") and breaks it down into required data steps.
Tool Execution
It invokes read-only tools via MCP (e.g., query_metrics, query_logs) to gather evidence.
Synthesis & Iteration
It observes the results. If the data is insufficient (e.g., no logs found), it iterates with a new plan to expand the time window or check a different service.
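The sketch below illustrates one pass through this loop. Everything in it is a hypothetical stand-in: the helper names (plan_next_step, invoke, is_sufficient), the canned tool outputs, and the three-attempt window schedule are invented for illustration and are not the Copilot's actual internals.

# Illustrative Plan-Act-Observe loop. All helpers and outputs are
# hypothetical stand-ins; the real engine drives each step with an LLM.
from typing import Any, Optional

def plan_next_step(question: str, evidence: list) -> Optional[tuple]:
    """Stub planner: widen the log query's time window on each iteration."""
    if len(evidence) >= 3:
        return None  # stop planning after three attempts
    window = ["15m", "1h", "6h"][len(evidence)]
    return ("query_logs", {"query": "service:payment status:500", "window": window})

def invoke(tool: str, args: dict) -> list:
    """Stub MCP call: pretend matching logs only appear in the widest window."""
    return [{"msg": "connection pool exhausted"}] if args["window"] == "6h" else []

def is_sufficient(evidence: list) -> bool:
    """Stub check: any non-empty observation counts as enough evidence."""
    return any(result for _, _, result in evidence)

def investigate(question: str) -> str:
    evidence: list = []
    while (step := plan_next_step(question, evidence)) is not None:
        tool, args = step                      # Plan: pick the next tool call
        result = invoke(tool, args)            # Act: execute it via MCP
        evidence.append((tool, args, result))  # Observe: record the result
        if is_sufficient(evidence):
            break                              # enough evidence to answer
    found = [r for _, _, rs in evidence for r in rs]
    return f"Root cause candidates: {found}" if found else "No evidence found."

print(investigate("Why is checkout slow?"))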
Capabilities
- Contextual Analysis: Retrieve and summarize incidents with full context (metrics, logs, tickets).
- Correlation: Connect spikes in metrics to recent deployments or log error bursts.
- Investigation: Run multi-step investigations to find root causes across systems.
- Runbook Discovery: Proactively suggest orchestration plans related to incidents or services.
- Answer with Evidence: Provide citations and deep links to source data in the Console.
Deep Links & Runbook Actions
Copilot responses include structured references that power Console deep links and action cards. Runbook suggestions link directly to orchestration plans so operators can launch runs without hunting.
{
"actions": [
{ "type": "orchestration_plan", "id": "db-failover", "name": "DB Failover", "reason": "Applies to the current outage." }
],
"references": {
"incidents": ["inc-404"],
"services": ["payments-api"],
"orchestrationPlans": ["db-failover"]
}
}
Safety & Resilience
Read-Only by Default
The Copilot is designed to be safe. It prioritizes "Read" operations. Any "Write" operation (e.g., restarting a pod) requires explicit user confirmation via a Human-in-the-Loop flow.
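A minimal sketch of that confirmation gate follows. The read-only tool set, the restart_pod tool name, and the console prompt are all hypothetical; they illustrate the Human-in-the-Loop pattern, not the actual Copilot API.

# Hypothetical Human-in-the-Loop gate for write operations.
# Tool names and the read-only set are illustrative, not the real API.
READ_ONLY_TOOLS = {"query_metrics", "query_logs", "get_incident"}

def guarded_invoke(tool: str, args: dict) -> str:
    if tool not in READ_ONLY_TOOLS:
        # Write operation: require explicit operator confirmation first.
        answer = input(f"Copilot wants to run '{tool}' with {args}. Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            return "Cancelled by operator."
    return f"Executed {tool}."  # stand-in for the real MCP call

print(guarded_invoke("restart_pod", {"pod": "payments-api-7d9f"}))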
Resilience Patterns
The engine handles provider API failures gracefully with the following patterns (a sketch follows the list):
- Exponential Backoff: Retries failed provider calls automatically, with increasing delays between attempts.
- Window Expansion: Automatically widens the query time range when a metrics query comes back empty.
- Circuit Breaking: Temporarily stops calling providers that are down, so unhealthy backends do not add latency to every request.
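The sketch below shows the exponential-backoff pattern in isolation, assuming a hypothetical flaky provider call; the attempt count and delay values are illustrative defaults, not OpsOrch's documented settings.

# Hypothetical exponential-backoff wrapper around a flaky provider call.
# Attempt counts and delays are illustrative, not OpsOrch's settings.
import random
import time

def with_backoff(fn, max_attempts=4, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: let the circuit breaker take over
            # Full-jitter backoff: 0.5s, 1s, 2s, ... scaled by randomness.
            time.sleep(base_delay * (2 ** attempt) * random.random())

def flaky_query(_attempts=[0]):
    """Stub provider call that fails twice, then succeeds."""
    _attempts[0] += 1
    if _attempts[0] < 3:
        raise ConnectionError("provider timeout")
    return {"series": [1, 2, 3]}

print(with_backoff(flaky_query))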
Deployment
To enable Copilot in your self-hosted instance, you must provide an LLM API key.
Supported Models
- OpenAI GPT-4o / GPT-4 Turbo
- Anthropic Claude 3.5 Sonnet
- Google Gemini 3.0 Flash
- AWS Bedrock (Claude / Titan)
Configuration
# In your opsorch-copilot env or secrets:
LLM_PROVIDER="openai" # or "anthropic", "gemini", "bedrock"
OPENAI_API_KEY="sk-..."
# For Gemini:
# LLM_PROVIDER="gemini"
# GEMINI_API_KEY="your-api-key"
# GEMINI_MODEL="gemini-3-flash-preview" # optional, this is the default
# Optional: Specialized Model Selection
LLM_MODEL_PLANNER="gpt-4o"
LLM_MODEL_FAST="gpt-3.5-turbo"
Example Flow
A typical investigation flow looks like this:
invoke("query_logs", {"query": "service:payment status:500"})
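To show how a full investigation might chain such calls, the runnable trace below fakes the tool layer with canned outputs; the invoke stub, the extra tool call, and every result are invented for illustration.

# Hypothetical end-to-end trace for "Why is checkout slow?".
# The invoke stub and all outputs below are invented for illustration.
def invoke(tool: str, args: dict) -> str:
    canned = {
        "query_metrics": "latency_p99 spike starting 14:02 UTC",
        "query_logs": "burst of 'connection pool exhausted' errors from 14:02",
    }
    return canned.get(tool, "no data")

print(invoke("query_metrics", {"service": "payments-api", "metric": "latency_p99"}))
print(invoke("query_logs", {"query": "service:payment status:500"}))
# Synthesis: the latency spike coincides with the error burst, so the
# Copilot cites both observations and suggests a related orchestration plan.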