You Can't Watch It All Day
Your trading bot runs 24/7. You can't. You need systems that watch for you and alert you when something needs attention.
What to Monitor
System Health:
- Process status (running, crashed, restarting)
- Memory and CPU usage
- Disk space
- Network connectivity
Trading Metrics:
- P&L (daily, weekly, by strategy)
- Win rate (rolling)
- Drawdown (current, maximum)
- Position exposure
Execution Quality:
- Order success rate
- Average slippage
- Fill time
- Rejected orders
Data Quality:
- Data freshness
- Missing data points
- Anomalous values
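Two of the trading metrics above, drawdown and rolling win rate, are easy to compute from data you already have. The sketch below is one way to do it, assuming an equity curve of positive account values and a list of closed-trade P&Ls. The names `drawdowns` and `RollingWinRate` are illustrative, not from any library.

```python
from collections import deque


def drawdowns(equity_curve):
    """Return (current_drawdown, max_drawdown) as fractions of the running peak."""
    peak = equity_curve[0]
    current = 0.0
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)            # highest equity seen so far
        current = (peak - value) / peak    # drop from that peak, as a fraction
        worst = max(worst, current)
    return current, worst


class RollingWinRate:
    """Win rate over the last `window` closed trades (hypothetical helper)."""

    def __init__(self, window=50):
        self.results = deque(maxlen=window)  # old trades fall off automatically

    def record(self, pnl):
        self.results.append(pnl > 0)

    def value(self):
        if not self.results:
            return None                      # no trades yet, so no rate to report
        return sum(self.results) / len(self.results)
```

Counting a break-even trade as a loss is a choice made here for simplicity; adjust it to match how you score trades elsewhere.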
Monitoring Architecture
The pattern: Your Trading System --> Metrics Collector --> Time Series Database --> Dashboard + Alerting
Metrics Collector: Gathers the metrics your code emits (counters, gauges, histograms).
Time Series Database: Stores metrics over time (InfluxDB, Prometheus, TimescaleDB).
Dashboard: Visualizes metrics (Grafana is a common choice).
Alerting: Triggers notifications when thresholds are breached.
Key Metrics to Track
Counters (things that only go up):
- Total signals generated
- Total orders placed
- Total errors
Gauges (point-in-time values):
- Current P&L
- Current position size
- Account balance
- Open order count
Histograms (distributions):
- Order fill time
- Slippage amounts
- Signal processing latency
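To make the three metric types concrete, here is a minimal in-process sketch using only the standard library. In practice you would likely use the client library of whichever time series database you pick; the class names and bucket bounds below are assumptions for illustration.

```python
import bisect


class Counter:
    """Monotonically increasing count, e.g. total orders placed."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Point-in-time value, e.g. open order count."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into fixed buckets, e.g. fill time in seconds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value


# Usage: one metric of each type (bucket bounds in seconds are made up).
orders_placed = Counter()
open_orders = Gauge()
fill_time = Histogram([0.1, 0.5, 1.0])

orders_placed.inc()
open_orders.set(3)
fill_time.observe(0.32)
```

The split matters for alerting: you alert on the rate of change of a counter, on the level of a gauge, and on the tail of a histogram.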
Building Dashboards
A good trading dashboard shows at a glance:
Top Row: Overall system health (green/yellow/red indicators)
Second Row: P&L chart (daily, cumulative), current positions, account balance
Third Row: Signal and execution metrics, error rates, latency
Fourth Row: Per-strategy breakdown
Keep it simple. If you need to scroll to find critical information, redesign.
Alerting Rules
Good alerts are actionable. Bad alerts are ignored.
Good Alert: "Position size is 2x the expected maximum. Current: 0.5 BTC. Expected max: 0.25 BTC"
Bad Alert: "Error occurred" (Which error? Where? What impact?)
Alert Fatigue: When too many alerts fire, all of them get ignored. Be selective about what triggers notifications.
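One way to keep alerts actionable is to build the message from the rule itself, so the current value and the limit are always included. A minimal sketch, assuming a simple upper-limit rule; `ThresholdRule` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    name: str     # what is being measured, e.g. "Position size"
    limit: float  # expected maximum
    unit: str     # unit shown in the message, e.g. "BTC"

    def check(self, current: float) -> Optional[str]:
        """Return an alert message if the limit is exceeded, else None."""
        if current <= self.limit:
            return None
        ratio = current / self.limit
        return (
            f"{self.name} exceeds limit ({ratio:.1f}x). "
            f"Current: {current:g} {self.unit}. "
            f"Expected max: {self.limit:g} {self.unit}"
        )


rule = ThresholdRule(name="Position size", limit=0.25, unit="BTC")
message = rule.check(0.5)
```

Returning `None` when nothing is wrong keeps the caller simple: only a non-empty result gets passed on to notifications.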
Notification Channels
Different urgency needs different channels:
SMS/Phone Call: Critical issues requiring immediate action. Position mismatch, system down, unusual P&L.
Telegram/Discord: Important but not emergency. Execution issues, high error rates, risk limit warnings.
Email: Daily summaries, reports, non-urgent information.
Dashboard: Everything else. Details available when you look, no push notification.
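The mapping from urgency to channel can live in one small table. The sketch below assumes four severity levels and a handful of event type names, all invented for illustration; the channel strings are labels only, not calls to any real messaging API.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"    # needs immediate action
    IMPORTANT = "important"  # needs attention soon
    ROUTINE = "routine"      # summaries and reports
    DETAIL = "detail"        # visible on the dashboard only


# Hypothetical event types, taken from the examples in the text above.
EVENT_SEVERITY = {
    "position_mismatch": Severity.CRITICAL,
    "system_down": Severity.CRITICAL,
    "execution_issue": Severity.IMPORTANT,
    "high_error_rate": Severity.IMPORTANT,
    "risk_limit_warning": Severity.IMPORTANT,
    "daily_summary": Severity.ROUTINE,
}

CHANNELS = {
    Severity.CRITICAL: ["sms"],
    Severity.IMPORTANT: ["telegram"],
    Severity.ROUTINE: ["email"],
    Severity.DETAIL: ["dashboard"],
}


def classify(event_type):
    """Unknown event types default to dashboard-only detail."""
    return EVENT_SEVERITY.get(event_type, Severity.DETAIL)


def channels_for(event_type):
    return CHANNELS[classify(event_type)]
```

Defaulting unknown events to the dashboard means a new event type never pages you by accident; you promote it deliberately.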
Building a Notification System
Simple architecture: Event --> Notification Router --> Channel-Specific Senders --> Telegram/Discord/Email/SMS
Router Logic:
- Classify event severity
- Apply throttling (don't send 100 alerts per minute)
- Route to appropriate channel(s)
Throttling is Critical: If an error occurs 1000 times per minute, you don't need 1000 notifications. Aggregate and summarize.
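Throttling with aggregation can be sketched in a few lines: send the first occurrence, suppress repeats inside a time window, and report how many were suppressed on the next send. The `Throttler` class and the 60-second window are assumptions. The caller passes `now` explicitly, for example from `time.monotonic()`, which also makes the logic easy to test.

```python
class Throttler:
    """Allow at most one notification per key per window; count the rest."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.last_sent = {}   # key -> time the last notification went out
        self.suppressed = {}  # key -> repeats swallowed since then

    def submit(self, key, message, now):
        """Return the text to send, or None if this event is throttled."""
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return None
        skipped = self.suppressed.pop(key, 0)
        self.last_sent[key] = now
        if skipped:
            return f"{message} (+{skipped} similar suppressed since last alert)"
        return message


throttler = Throttler(window_seconds=60)
first = throttler.submit("db_timeout", "DB timeout", now=0)
second = throttler.submit("db_timeout", "DB timeout", now=1)
third = throttler.submit("db_timeout", "DB timeout", now=2)
later = throttler.submit("db_timeout", "DB timeout", now=61)
```

One gap in this sketch: if the error stops before the window ends, the suppressed count is never reported. A fuller router would also flush pending counts on a timer.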
What Notifications to Send
Always Notify:
- System start/stop
- Trade executions (entry and exit)
- Risk limit breaches
- Position mismatches
- Significant P&L changes
Conditionally Notify:
- Signal generation (optional, can be noisy)
- Minor errors (aggregate into digest)
- Performance metrics (daily summary)
Never Notify:
- Routine operations
- Debug information
- Expected errors that self-resolve
Signal Notifications
For each trade signal, include:
- Direction (LONG/SHORT)
- Asset and exchange
- Entry price
- Stop loss level
- Position size
- Strategy/edge that generated it
Example: "LONG BTC @ $95,000 | Stop: $93,100 | Size: 0.1 BTC | Edge: DPO_PVOL_2h"
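A small formatter keeps every signal message in the same shape. This sketch reproduces the example line above; the `Signal` dataclass and its field names are hypothetical, the exchange is optional because the example omits it, and rounding prices to whole dollars is an assumption that will not suit low-priced assets.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    direction: str      # "LONG" or "SHORT"
    asset: str
    entry: float        # entry price
    stop: float         # stop loss level
    size: float         # position size in units of the asset
    edge: str           # strategy/edge that generated the signal
    exchange: str = ""  # optional; left out of the message when empty


def format_signal(s):
    venue = f" ({s.exchange})" if s.exchange else ""
    return (
        f"{s.direction} {s.asset}{venue} @ ${s.entry:,.0f} | "
        f"Stop: ${s.stop:,.0f} | Size: {s.size:g} {s.asset} | Edge: {s.edge}"
    )


text = format_signal(Signal("LONG", "BTC", 95000, 93100, 0.1, "DPO_PVOL_2h"))
```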
Daily Summary Reports
Send daily at a consistent time:
- P&L for the day
- Win/loss count
- Current positions
- Notable events
- System health summary
Automate these. Manual reporting means it won't happen consistently.
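The report body itself is simple to generate once the inputs are collected. A minimal sketch, assuming you can supply the day's closed-trade P&Ls, a mapping of open positions, a list of notable events, and a health string; the function name and argument shapes are illustrative. Scheduling (cron or similar) and delivery are left out.

```python
def daily_summary(date, trade_pnls, positions, events, health):
    """Build a plain-text daily report from already-collected data."""
    wins = sum(1 for pnl in trade_pnls if pnl > 0)
    losses = sum(1 for pnl in trade_pnls if pnl < 0)
    position_text = ", ".join(
        f"{asset} {qty:+g}" for asset, qty in positions.items()
    )
    lines = [
        f"Daily summary for {date}",
        f"P&L: {sum(trade_pnls):+,.2f}",
        f"Wins/Losses: {wins}/{losses}",
        f"Positions: {position_text or 'none'}",
        f"Events: {'; '.join(events) or 'none'}",
        f"Health: {health}",
    ]
    return "\n".join(lines)


report = daily_summary(
    date="2025-01-15",
    trade_pnls=[120.0, -45.5, 30.0],
    positions={"BTC": 0.1},
    events=[],
    health="OK",
)
```

Printing "none" for empty sections, rather than dropping them, makes a quiet day distinguishable from a report that failed to load its data.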
Monitoring Your Monitoring
Meta, but important: What happens if your monitoring system fails?
- Have a heartbeat: If you don't receive a "system healthy" message every hour, something's wrong
- Use external monitoring: A third-party service that checks if your systems are reachable
- Redundant channels: If Telegram is down, alerts should fall back to email
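The heartbeat and fallback ideas can be sketched together. The watchdog should run somewhere other than the trading machine, otherwise it dies along with the thing it watches. `HeartbeatWatchdog`, `send_with_fallback`, and the one-hour default are assumptions for illustration; each sender is any callable that raises an exception on failure.

```python
class HeartbeatWatchdog:
    """Flags the system as stale if no heartbeat arrives within the limit."""

    def __init__(self, max_silence_seconds=3600.0):
        self.max_silence = max_silence_seconds
        self.last_beat = None

    def beat(self, now):
        self.last_beat = now

    def is_stale(self, now):
        if self.last_beat is None:
            return True  # never heard from the system at all
        return now - self.last_beat > self.max_silence


def send_with_fallback(message, senders):
    """Try each (name, send) pair in order; return the name that succeeded."""
    for name, send in senders:
        try:
            send(message)
            return name
        except Exception:
            continue     # this channel failed, so try the next one
    return None          # every channel failed


# Usage with stand-in senders: the first one simulates an outage.
def broken_telegram(message):
    raise ConnectionError("telegram unreachable")


delivered = []
used = send_with_fallback(
    "No heartbeat for over an hour",
    [("telegram", broken_telegram), ("email", delivered.append)],
)
```

A `None` return from `send_with_fallback` means every channel failed, which is itself worth logging locally.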
Takeaway
Good monitoring is invisible when things work and invaluable when they don't. Invest in dashboards that show system health at a glance and alerts that tell you exactly what's wrong and what to do about it.