Embrace the Chaos
In live trading, things will go wrong. Networks fail. Exchanges have outages. Your code has bugs. The difference between a resilient system and a disaster is how you handle errors.
Error Categories
Transient Errors: Temporary failures (network timeouts, rate limits). Response: Retry with backoff.
Persistent Errors: Won't fix themselves (invalid credentials, insufficient balance). Response: Alert and halt.
Data Errors: Unexpected or invalid data (malformed response, missing fields). Response: Log, alert, use fallback.
Logic Errors: Bugs in your code. Response: Alert immediately, implement defensive coding.
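One lightweight way to make these categories actionable is to encode them as exception types so handlers can branch on them. A minimal Python sketch (the class names are illustrative, not from any particular library):

```python
# Illustrative taxonomy: raise the narrowest type that fits, and let
# handlers decide between retry, halt, or fallback based on the class.
class TransientError(Exception):
    """Temporary failure (network timeout, rate limit): retry with backoff."""

class PersistentError(Exception):
    """Won't fix itself (bad credentials, insufficient balance): alert and halt."""

class DataError(Exception):
    """Unexpected or invalid data (malformed response, missing field): log, alert, fall back."""
```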
The Retry Pattern
Use exponential backoff with random jitter. If the exchange is overloaded, everyone retrying immediately makes it worse. Spreading retries over time helps recovery. Jitter prevents synchronized retry storms.
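A minimal sketch of the pattern in Python, assuming the transient failures surface as ConnectionError or TimeoutError (the function name, attempt count, and delay values are illustrative):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a callable on transient errors with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):        # transient errors only
            if attempt == max_attempts:
                raise                                  # out of attempts: let the caller alert
            # Exponential backoff: 0.5s, 1s, 2s, 4s... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the backoff ceiling so
            # clients don't retry in lockstep and hammer a recovering exchange.
            time.sleep(random.uniform(0, delay))

# Usage (hypothetical client object):
# price = retry_with_backoff(lambda: client.fetch_price("BTCUSDT"))
```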
Circuit Breakers
When errors pile up, stop trying. The circuit breaker pattern has three states: Closed (normal operation), Open (failing fast after too many errors), and Half-Open (testing again after a timeout). A success in Half-Open returns the breaker to Closed; a failure sends it back to Open.
Use circuit breakers per exchange. If Binance is failing, don't let it affect Bybit operations.
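A compact sketch of that state machine in Python (the class name, failure threshold, and reset timeout are illustrative assumptions):

```python
import time

class CircuitBreaker:
    """Per-exchange circuit breaker: Closed -> Open -> Half-Open -> Closed."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"   # timeout elapsed: allow one test call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"        # test call failed or threshold hit
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"          # success closes the circuit again
            return result

# One breaker per exchange so a failing venue trips only its own circuit.
breakers = {"binance": CircuitBreaker(), "bybit": CircuitBreaker()}
```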
Graceful Degradation
When components fail, degrade gracefully rather than crash completely:
- Can't get real-time data? Fall back to polling.
- Execution failing? Queue signals for later.
- Database down? Log to files temporarily.
The goal: stay alive and preserve capital until you can fix the problem.
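One way this looks in code, sketched in Python with caller-supplied data sources (the parameter names are assumptions, not a specific API):

```python
def get_market_data(symbol, stream, rest_poll, cache):
    """Degrade through progressively slower data sources instead of crashing.

    `stream`, `rest_poll`, and `cache` are caller-supplied (hypothetical names):
    wire in your own websocket feed, REST client, and last-known-good cache.
    """
    try:
        return stream(symbol)            # preferred: real-time websocket feed
    except Exception:
        try:
            return rest_poll(symbol)     # fallback: slower REST polling
        except Exception:
            return cache.get(symbol)     # last resort: stale but usable value
```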
Dead Letter Queues
When a message can't be processed after retries, don't lose it. Move it to a dead letter queue for later investigation.
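A minimal file-backed sketch in Python (the path, attempt count, and JSON-lines format are illustrative assumptions; the message is assumed to be JSON-serializable):

```python
import json
import time

def process_with_dlq(message, handler, dead_letter_path="dead_letters.jsonl", max_attempts=3):
    """Try to process a message; after repeated failures, park it instead of dropping it."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = str(exc)
    # All retries exhausted: append to a file-based dead letter queue for later review.
    with open(dead_letter_path, "a") as f:
        f.write(json.dumps({
            "failed_at": time.time(),
            "error": last_error,
            "message": message,
        }) + "\n")
    return False
```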
Idempotency
Make operations safe to retry. If you process the same signal twice, you shouldn't open two positions. Use unique signal IDs and check if already processed before executing.
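A bare-bones sketch in Python. In production the "already processed" set would live in a database or Redis, and the check-and-mark should be atomic (for example a unique-constraint insert) so a crash mid-execution can't cause a double fill:

```python
processed_signal_ids = set()   # in production, persist this in a database or Redis

def execute_once(signal, place_order):
    """Execute a trading signal at most once, even if it's delivered or retried twice.

    `place_order` is a caller-supplied callable (assumed name) that submits the order.
    """
    signal_id = signal["id"]              # every signal must carry a unique ID
    if signal_id in processed_signal_ids:
        return False                      # already processed: the retry becomes a no-op
    place_order(signal)
    processed_signal_ids.add(signal_id)   # mark done only after successful execution
    return True
```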
Health Checks
Actively verify system health with liveness checks (is the process running?), readiness checks (can it handle traffic?), and dependency checks (are databases and exchanges reachable?).
Run health checks on schedule and alert when they fail.
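A simple scheduled-check sketch in Python (the check names, interval, and alert callable are illustrative assumptions; in practice the loop would run in a background thread or scheduler):

```python
import time

def check_health(checks):
    """Run named health checks and return the ones that failed.

    `checks` maps a name to a zero-argument callable that raises on failure,
    e.g. ping the database, hit an exchange's status endpoint.
    """
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            failures[name] = str(exc)
    return failures

def health_loop(checks, alert, interval=30):
    """Run the checks on a schedule and alert whenever any of them fail."""
    while True:
        failures = check_health(checks)
        if failures:
            alert(failures)               # caller-supplied alerting callable
        time.sleep(interval)
```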
Alerting Strategy
Not all errors deserve 3 AM phone calls. Categorize:
Critical (page immediately): Position mismatch, unusual P&L, system completely down.
High (alert within minutes): Execution failures, data feed issues, balance warnings.
Medium (alert within hours): Rate limit warnings, minor errors, performance degradation.
Low (daily digest): Informational events, minor warnings.
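One illustrative way to wire those four levels to channels in Python (the channel names and delivery function are assumptions, not a specific alerting API):

```python
# Route each severity to its channels; "critical" is the only one that pages.
SEVERITY_ROUTES = {
    "critical": ["pagerduty", "sms"],      # page immediately, wake someone up
    "high":     ["slack", "email"],        # needs eyes within minutes
    "medium":   ["slack"],                 # review within hours
    "low":      ["daily_digest"],          # batched into a daily summary
}

def route_alert(severity, message, send):
    """Deliver `message` to every channel configured for `severity`.

    `send(channel, message)` is a caller-supplied delivery function.
    """
    for channel in SEVERITY_ROUTES.get(severity, ["daily_digest"]):
        send(channel, message)
```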
Error Logging Best Practices
Include: timestamp, error type, full stack trace, relevant context (order details, market state), and a unique request ID for correlation.
Format: Use structured logging (JSON) so you can query and analyze logs later instead of grepping free text.
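A minimal structured-logging sketch in Python using only the standard library (the logger name and record fields mirror the list above; adapt them to your stack):

```python
import json
import logging
import traceback
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("trading")

def log_error(exc, context):
    """Emit a single structured (JSON) error record that's easy to query later."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(exc).__name__,
        "stack_trace": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "request_id": str(uuid.uuid4()),   # correlation ID across related log lines
        "context": context,                # e.g. order details, market state
    }
    logger.error(json.dumps(record))
```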
Recovery Procedures
Document how to recover from common failures:
- Exchange API key expired: Generate new key, update config, restart
- Position mismatch: Run reconciliation script, verify, adjust
- Circuit breaker open: Check exchange status, reset if resolved
Having documented procedures prevents panic-driven mistakes.
Defensive Coding
Assume everything external is unreliable:
- Validate all input data
- Check for null/undefined values
- Use timeouts on all external calls
- Bound resource usage (memory, connections)
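A small validation sketch in Python covering the first two checks (the field names are illustrative assumptions). For timeouts, pass an explicit value on every external call, e.g. requests.get(url, timeout=5) with the requests library.

```python
def validate_fill(fill):
    """Reject malformed exchange data before it reaches trading logic.

    The expected fields ("price", "qty") are illustrative assumptions.
    """
    if not isinstance(fill, dict):
        raise ValueError("fill must be a dict")
    for field in ("price", "qty"):
        value = fill.get(field)
        if value is None:
            raise ValueError(f"missing field: {field}")
        if not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(f"invalid {field}: {value!r}")
    return fill
```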
Takeaway
Errors aren't exceptions; they're expected. Design your system to handle them gracefully. The goal isn't to prevent all errors; it's to contain their blast radius and recover quickly.