Embrace the Chaos
In live trading, things will go wrong. Networks fail. Exchanges have outages. Your code has bugs. The difference between a resilient system and a disaster is how you handle errors.
Error Categories
Transient Errors: Temporary failures (network timeouts, rate limits). Response: Retry with backoff.
Persistent Errors: Won't fix themselves (invalid credentials, insufficient balance). Response: Alert and halt.
Data Errors: Unexpected or invalid data (malformed response, missing fields). Response: Log, alert, use fallback.
Logic Errors: Bugs in your code. Response: Alert immediately, implement defensive coding.
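One lightweight way to make these categories actionable is to encode them as exception types so handlers can branch on them. A minimal Python sketch (the class names are illustrative, not from any particular library):

```python
# Illustrative taxonomy: raise the narrowest type that fits, and let
# handlers decide between retry, halt, or fallback based on the class.
class TransientError(Exception):
    """Temporary failure (network timeout, rate limit): retry with backoff."""

class PersistentError(Exception):
    """Won't fix itself (bad credentials, insufficient balance): alert and halt."""

class DataError(Exception):
    """Unexpected or invalid data (malformed response, missing field): log, alert, fall back."""
```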
The Retry Pattern
Use exponential backoff with random jitter. If the exchange is overloaded, everyone retrying immediately makes it worse. Spreading retries over time helps recovery. Jitter prevents synchronized retry storms.
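A minimal sketch of the pattern in Python, assuming the transient failures surface as ConnectionError or TimeoutError (the function name, attempt count, and delay values are illustrative):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a callable on transient errors with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):        # transient errors only
            if attempt == max_attempts:
                raise                                  # out of attempts: let the caller alert
            # Exponential backoff: 0.5s, 1s, 2s, 4s... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the backoff ceiling so
            # clients don't retry in lockstep and hammer a recovering exchange.
            time.sleep(random.uniform(0, delay))

# Usage (hypothetical client object):
# price = retry_with_backoff(lambda: client.fetch_price("BTCUSDT"))
```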
Circuit Breakers
When errors pile up, stop trying. The circuit breaker pattern has three states: Closed (normal operation), Open (failing fast after too many errors), and Half-Open (testing again after a timeout). A success in Half-Open returns the breaker to Closed; a failure sends it back to Open.
Use circuit breakers per exchange. If Binance is failing, don't let it affect Bybit operations.
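A compact sketch of that state machine in Python (the class name, failure threshold, and reset timeout are illustrative assumptions):

```python
import time

class CircuitBreaker:
    """Per-exchange circuit breaker: Closed -> Open -> Half-Open -> Closed."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"   # timeout elapsed: allow one test call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"        # test call failed or threshold hit
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"          # success closes the circuit again
            return result

# One breaker per exchange so a failing venue trips only its own circuit.
breakers = {"binance": CircuitBreaker(), "bybit": CircuitBreaker()}
```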
Graceful Degradation
When components fail, degrade gracefully rather than crash completely:
- Can't get real-time data? Fall back to polling.
- Execution failing? Queue signals for later.
- Database down? Log to files temporarily.
The goal: stay alive and preserve capital until you can fix the problem.
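One way this looks in code, sketched in Python with caller-supplied data sources (the parameter names are assumptions, not a specific API):

```python
def get_market_data(symbol, stream, rest_poll, cache):
    """Degrade through progressively slower data sources instead of crashing.

    `stream`, `rest_poll`, and `cache` are caller-supplied (hypothetical names):
    wire in your own websocket feed, REST client, and last-known-good cache.
    """
    try:
        return stream(symbol)            # preferred: real-time websocket feed
    except Exception:
        try:
            return rest_poll(symbol)     # fallback: slower REST polling
        except Exception:
            return cache.get(symbol)     # last resort: stale but usable value
```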
Dead Letter Queues
When a message can't be processed after retries, don't lose it. Move it to a dead letter queue for later investigation.
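A minimal file-backed sketch in Python (the path, attempt count, and JSON-lines format are illustrative assumptions; the message is assumed to be JSON-serializable):

```python
import json
import time

def process_with_dlq(message, handler, dead_letter_path="dead_letters.jsonl", max_attempts=3):
    """Try to process a message; after repeated failures, park it instead of dropping it."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = str(exc)
    # All retries exhausted: append to a file-based dead letter queue for later review.
    with open(dead_letter_path, "a") as f:
        f.write(json.dumps({
            "failed_at": time.time(),
            "error": last_error,
            "message": message,
        }) + "\n")
    return False
```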
Idempotency
Make operations safe to retry. If you process the same signal twice, you shouldn't open two positions. Use unique signal IDs and check if already processed before executing.
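A bare-bones sketch in Python. In production the "already processed" set would live in a database or Redis, and the check-and-mark should be atomic (for example a unique-constraint insert) so a crash mid-execution can't cause a double fill:

```python
processed_signal_ids = set()   # in production, persist this in a database or Redis

def execute_once(signal, place_order):
    """Execute a trading signal at most once, even if it's delivered or retried twice.

    `place_order` is a caller-supplied callable (assumed name) that submits the order.
    """
    signal_id = signal["id"]              # every signal must carry a unique ID
    if signal_id in processed_signal_ids:
        return False                      # already processed: the retry becomes a no-op
    place_order(signal)
    processed_signal_ids.add(signal_id)   # mark done only after successful execution
    return True
```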
Health Checks
Actively verify system health with liveness checks (is the process running?), readiness checks (can it handle traffic?), and dependency checks (are databases and exchanges reachable?).
Run health checks on schedule and alert when they fail.
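A simple scheduled-check sketch in Python (the check names, interval, and alert callable are illustrative assumptions; in practice the loop would run in a background thread or scheduler):

```python
import time

def check_health(checks):
    """Run named health checks and return the ones that failed.

    `checks` maps a name to a zero-argument callable that raises on failure,
    e.g. ping the database, hit an exchange's status endpoint.
    """
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            failures[name] = str(exc)
    return failures

def health_loop(checks, alert, interval=30):
    """Run the checks on a schedule and alert whenever any of them fail."""
    while True:
        failures = check_health(checks)
        if failures:
            alert(failures)               # caller-supplied alerting callable
        time.sleep(interval)
```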
Alerting Strategy
Not all errors deserve 3 AM phone calls. Categorize:
Critical (page immediately): Position mismatch, unusual P&L, system completely down.
High (alert within minutes): Execution failures, data feed issues, balance warnings.
Medium (alert within hours): Rate limit warnings, minor errors, performance degradation.
Low (daily digest): Informational events, minor warnings.
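One illustrative way to wire those four levels to channels in Python (the channel names and delivery function are assumptions, not a specific alerting API):

```python
# Route each severity to its channels; "critical" is the only one that pages.
SEVERITY_ROUTES = {
    "critical": ["pagerduty", "sms"],      # page immediately, wake someone up
    "high":     ["slack", "email"],        # needs eyes within minutes
    "medium":   ["slack"],                 # review within hours
    "low":      ["daily_digest"],          # batched into a daily summary
}

def route_alert(severity, message, send):
    """Deliver `message` to every channel configured for `severity`.

    `send(channel, message)` is a caller-supplied delivery function.
    """
    for channel in SEVERITY_ROUTES.get(severity, ["daily_digest"]):
        send(channel, message)
```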
Error Logging Best Practices
Include: timestamp, error type, full stack trace, relevant context (order details, market state), and a unique request ID for correlation.
Format: Use structured logging (JSON) so you can query and analyze logs later instead of grepping free text.
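A minimal structured-logging sketch in Python using only the standard library (the logger name and record fields mirror the list above; adapt them to your stack):

```python
import json
import logging
import traceback
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("trading")

def log_error(exc, context):
    """Emit a single structured (JSON) error record that's easy to query later."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(exc).__name__,
        "stack_trace": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "request_id": str(uuid.uuid4()),   # correlation ID across related log lines
        "context": context,                # e.g. order details, market state
    }
    logger.error(json.dumps(record))
```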
Recovery Procedures
Document how to recover from common failures:
- Exchange API key expired: Generate new key, update config, restart
- Position mismatch: Run reconciliation script, verify, adjust
- Circuit breaker open: Check exchange status, reset if resolved
Having documented procedures prevents panic-driven mistakes.
Defensive Coding
Assume everything external is unreliable:
- Validate all input data
- Check for null/undefined values
- Use timeouts on all external calls
- Bound resource usage (memory, connections)
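A small validation sketch in Python covering the first two checks (the field names are illustrative assumptions). For timeouts, pass an explicit value on every external call, e.g. requests.get(url, timeout=5) with the requests library.

```python
def validate_fill(fill):
    """Reject malformed exchange data before it reaches trading logic.

    The expected fields ("price", "qty") are illustrative assumptions.
    """
    if not isinstance(fill, dict):
        raise ValueError("fill must be a dict")
    for field in ("price", "qty"):
        value = fill.get(field)
        if value is None:
            raise ValueError(f"missing field: {field}")
        if not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(f"invalid {field}: {value!r}")
    return fill
```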
Takeaway
Errors aren't exceptions; they're expected. Design your system to handle them gracefully. The goal isn't to prevent all errors; it's to contain their blast radius and recover quickly.