The Scale of Our Discovery
Here's a number that might surprise you: we routinely test millions of indicator combinations.
With 50+ normalized indicators, multiple threshold levels per indicator, two directions (LONG/SHORT), multiple holding periods, and regime filters—the combinatorial space is enormous.
You can't test this manually. You can't even eyeball results. You need systematic infrastructure to generate, test, filter, and track candidates at scale.
This lesson reveals how we do it, including how we avoid the trap of finding "edges" that are pure statistical noise.
The Grid Search Architecture
Our discovery system uses grid search with smart filtering:
Layer 1: Indicator Selection
Start with ~50 z-score normalized indicators. Not all combinations make sense, so we define allowed pairings based on indicator categories.
Layer 2: Threshold Grid
For each indicator, test multiple threshold levels:
- Strong threshold: z > 2.0 or z < -2.0
- Medium threshold: z > 1.5 or z < -1.5
- Mild threshold: z > 1.0 or z < -1.0
Layer 3: Direction
Each combination is tested for both LONG and SHORT signals.
Layer 4: Holding Period
Multiple target horizons: 1h, 2h, 4h, 8h, 24h.
Layer 5: Regime Filter
Apply regime filters: bull-only, bear-only, high-vol-only, no filter.
The math for this coarse grid alone: 50 primary indicators × 50 secondary indicators × 3 thresholds × 3 thresholds × 2 directions × 5 horizons × 4 regime filters comes to 900,000 combinations per asset. Add finer threshold levels and multiple assets, and the raw space runs into the tens of millions.
We don't test that entire space; smart filtering reduces it dramatically. But we still test millions.
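To make the layered grid concrete, here is a minimal sketch of how the candidate space can be enumerated. The indicator names, the pairing rule, and the grid values are illustrative stand-ins, not the production configuration.

```python
from itertools import product

# Illustrative grids -- names and values are assumptions, not the production set.
indicators = ["funding_z", "oi_z", "cvd_z"]   # ~50 in practice
thresholds = [1.0, 1.5, 2.0]                  # mild / medium / strong
directions = ["LONG", "SHORT"]
horizons_h = [1, 2, 4, 8, 24]
regimes = ["bull", "bear", "high_vol", None]  # None = no regime filter

def allowed_pair(primary, secondary):
    """Hypothetical category rule; here we only forbid pairing an indicator with itself."""
    return primary != secondary

def generate_candidates():
    """Yield one candidate edge definition per allowed grid point."""
    for primary, secondary in product(indicators, indicators):
        if not allowed_pair(primary, secondary):
            continue
        for pt, st, direction, horizon, regime in product(
            thresholds, thresholds, directions, horizons_h, regimes
        ):
            yield {
                "primary": primary, "primary_threshold": pt,
                "secondary": secondary, "secondary_threshold": st,
                "direction": direction, "horizon_h": horizon,
                "regime": regime,
            }

print(sum(1 for _ in generate_candidates()))  # size of this toy grid
```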
Computational Considerations
Testing millions of combinations requires efficiency:
Vectorized Backtesting
No for-loops over candles. Everything operates on numpy arrays and pandas dataframes. A single edge backtest should take milliseconds, not seconds.
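Concretely, a minimal sketch of a vectorized edge test, assuming a dataframe with precomputed z-score columns, a regime label column, and one forward-return column per horizon (e.g. fwd_ret_4h). The column names are assumptions, and the signal condition is simplified to the positive threshold side.

```python
import numpy as np
import pandas as pd

def backtest_edge(df: pd.DataFrame, spec: dict) -> dict:
    """Vectorized evaluation of one candidate edge; no loops over candles.

    `df` is assumed to hold precomputed z-scores, a "regime" label column,
    and forward returns per horizon. `spec` is one candidate from the grid
    sketch above.
    """
    signal = (
        (df[spec["primary"]] > spec["primary_threshold"])
        & (df[spec["secondary"]] > spec["secondary_threshold"])
    )
    if spec["regime"] is not None:
        signal &= df["regime"] == spec["regime"]      # assumed label column
    fwd = df.loc[signal, f"fwd_ret_{spec['horizon_h']}h"]
    if spec["direction"] == "SHORT":
        fwd = -fwd
    n = len(fwd)
    return {
        "samples": n,
        "win_rate": float((fwd > 0).mean()) if n else np.nan,
        "avg_return": float(fwd.mean()) if n else np.nan,
    }
```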
Parallel Processing
Multiple CPU cores testing different combination subsets simultaneously. We use Python multiprocessing to parallelize across cores.
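A sketch of how the grid can be spread across cores with the standard library. Each worker loads the precomputed dataframe once via the pool initializer and reuses backtest_edge and generate_candidates from the sketches above; the chunk size, core count, and parquet path are assumptions.

```python
from itertools import islice
from multiprocessing import Pool

import pandas as pd

DATA = None  # loaded once per worker so the dataframe isn't pickled per task

def init_worker(parquet_path: str) -> None:
    global DATA
    DATA = pd.read_parquet(parquet_path)

def evaluate_chunk(chunk):
    """Run the vectorized backtest over a list of candidate specs."""
    return [{**spec, **backtest_edge(DATA, spec)} for spec in chunk]

def chunks(iterable, size):
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

if __name__ == "__main__":
    with Pool(processes=32, initializer=init_worker,
              initargs=("sol_features.parquet",)) as pool:  # hypothetical path
        results = []
        for part in pool.imap_unordered(evaluate_chunk,
                                        chunks(generate_candidates(), 10_000)):
            results.extend(part)
```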
Precomputed Data
All z-scores, regime labels, and target returns computed once upfront. Discovery queries precomputed arrays rather than recalculating per test.
Result Storage
Results go into a database, not files. PostgreSQL with proper indexes allows quick queries: "Show me all LONG edges with >70% win rate and >500 samples in bull regime."
Incremental Discovery
Track what's been tested. When new data arrives, only test combinations on new data, then merge with existing results.
With good infrastructure, testing 1 million combinations takes hours, not days.
The Multiple Comparison Problem
Here's the danger: when you test millions of combinations, you WILL find some with amazing backtest results purely by chance.
This is the multiple comparison problem. If you flip a coin 1,000 times and look for any 20-flip stretch, you'll find runs that look like a biased coin. They're not—they're random noise.
Same with edge discovery. Test a million combinations and 1% will show 65%+ win rates by chance. That's 10,000 false edges.
How do we handle this?
Statistical Safeguards
1. Sample Size Requirements
More samples = less likely to be noise. We require minimum sample sizes:
- At least 500 events for initial consideration
- At least 1,000 events for high confidence
A 75% win rate over 50 trades is suspicious. A 65% win rate over 2,000 trades is meaningful.
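One way to see why, as a sketch: under a no-edge null of a 50% win rate, the binomial tail gives the probability that a record at least this good arises by luck. Independent trades and scipy are assumed.

```python
from scipy.stats import binom

def luck_probability(wins: int, trades: int, null_p: float = 0.5) -> float:
    """P(at least `wins` wins in `trades` trades) when the true win rate is null_p."""
    return float(binom.sf(wins - 1, trades, null_p))

# 38/50 (76%): rare for a single test, but expected to show up many times
# across a million tested combinations.
print(luck_probability(38, 50))
# 1300/2000 (65%): vanishingly unlikely even after millions of tests.
print(luck_probability(1300, 2000))
```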
2. Walk-Forward Validation
Split data temporally:
- Discovery period: First 70% of data
- Validation period: Last 30% of data
Edge must work in BOTH periods. Curve-fitted edges fail out-of-sample.
We do this multiple times with different split points. If an edge only works in one split configuration, it's probably noise.
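A minimal sketch of the temporal split and the both-periods requirement, reusing the backtest helper from earlier; the per-segment win-rate and sample-size bars, and the alternative split points, are illustrative.

```python
def walk_forward_check(df, spec, split_frac=0.7,
                       min_win_rate=0.60, min_samples=500):
    """The edge must clear the bar in both the discovery and validation periods."""
    split = int(len(df) * split_frac)
    for segment in (df.iloc[:split], df.iloc[split:]):
        stats = backtest_edge(segment, spec)   # helper from the sketch above
        if stats["samples"] < min_samples or stats["win_rate"] < min_win_rate:
            return False
    return True

def stable_across_splits(df, spec, split_fracs=(0.6, 0.7, 0.8)):
    """Repeat with different split points; noise tends to fail at least one."""
    return all(walk_forward_check(df, spec, f) for f in split_fracs)
```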
3. Bonferroni-Style Correction
The classic statistical approach: if you test N combinations, divide your significance level by N.
In practice, we don't apply strict Bonferroni (too conservative), but we do raise the bar for what counts as "interesting." Testing a million combinations means a single 65% win rate edge isn't impressive—we need 70%+.
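As an illustration of how the bar rises with the number of tests, this sketch converts a Bonferroni-adjusted significance level into the minimum win rate needed at a given sample size. The 50% no-edge null and the 5% base alpha are assumptions.

```python
from scipy.stats import binom

def required_win_rate(n_trades: int, n_tests: int, alpha: float = 0.05,
                      null_p: float = 0.5) -> float:
    """Smallest win rate over n_trades that clears a Bonferroni-adjusted alpha."""
    adj_alpha = alpha / n_tests
    wins = 0
    # Find the smallest win count whose upper binomial tail is below adj_alpha.
    while binom.sf(wins - 1, n_trades, null_p) > adj_alpha:
        wins += 1
    return wins / n_trades

print(required_win_rate(1000, n_tests=1))          # a single test
print(required_win_rate(1000, n_tests=1_000_000))  # after a million tests, the bar rises
```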
4. Parameter Stability Testing
If an edge works at threshold 2.0 but fails at 1.9 and 2.1, it's probably overfitted to exactly 2.0.
We test parameter neighborhoods. Robust edges show good performance across a range of nearby parameters.
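A sketch of such a neighborhood check, reusing the backtest helper from earlier; the ±0.25 threshold offsets and the pass rule are illustrative assumptions.

```python
import math

def parameter_stability(df, spec, deltas=(-0.25, 0.0, 0.25),
                        min_win_rate=0.60) -> bool:
    """Require acceptable performance across a neighborhood of thresholds,
    not just at the exact grid point where the edge was discovered."""
    for dp in deltas:
        for ds in deltas:
            nearby = dict(spec,
                          primary_threshold=spec["primary_threshold"] + dp,
                          secondary_threshold=spec["secondary_threshold"] + ds)
            wr = backtest_edge(df, nearby)["win_rate"]  # helper from above
            if math.isnan(wr) or wr < min_win_rate:
                return False
    return True
```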
5. Economic Rationale Filter
Does the edge make sense? Can you explain WHY it should work?
"Extreme negative funding combined with rising OI in bull markets leads to bounces" has a story: overleveraged shorts get squeezed.
"The 17-period RSI crossing below 23.4 on BTC" has no story. It's probably noise.
We reject edges that pass statistical tests but lack rational explanation.
Separating Discovery from Validation
Critical principle: the data that finds edges cannot be the data that validates them.
Our workflow:
- Discovery Phase: Run grid search on discovery dataset
- Filter Phase: Apply statistical filters to survivors
- Validation Phase: Test filtered survivors on held-out validation data
- Robustness Phase: Test stability across parameters and time periods
- Paper Phase: Forward test on live data without real money
Only after all phases does an edge get real capital.
The validation data is sacred. It doesn't touch discovery at all. This prevents subtle data leakage where validation performance is inflated by information that leaked from discovery.
Result Tracking and Management
With millions of tested combinations, organization matters:
Result Database Schema
- Edge ID (unique identifier)
- Primary indicator + threshold
- Secondary indicator + threshold
- Direction (LONG/SHORT)
- Regime filter
- Holding period
- Discovery date
- Sample size
- Win rate
- Profit factor
- Sharpe ratio
- Validation status
- Notes
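As a sketch, that schema might look like the following PostgreSQL DDL executed from Python; the column types, the index choice, and the connection string are assumptions, not our exact production schema.

```python
import psycopg2  # assumes a reachable PostgreSQL instance

EDGE_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS edges (
    edge_id              BIGSERIAL PRIMARY KEY,
    primary_indicator    TEXT NOT NULL,
    primary_threshold    DOUBLE PRECISION NOT NULL,
    secondary_indicator  TEXT,
    secondary_threshold  DOUBLE PRECISION,
    direction            TEXT CHECK (direction IN ('LONG', 'SHORT')),
    regime_filter        TEXT,
    holding_period_h     INTEGER,
    discovery_date       DATE,
    sample_size          INTEGER,
    win_rate             DOUBLE PRECISION,
    profit_factor        DOUBLE PRECISION,
    sharpe_ratio         DOUBLE PRECISION,
    validation_status    TEXT,
    notes                TEXT
);
CREATE INDEX IF NOT EXISTS idx_edges_perf
    ON edges (direction, regime_filter, win_rate, sample_size);
"""

with psycopg2.connect("dbname=edges_db") as conn:   # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(EDGE_TABLE_DDL)
```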
Querying Patterns
- •"All edges with >65% win rate and >1000 samples in bear regime"
- •"Best LONG edges using funding as primary indicator"
- •"Edges that passed walk-forward validation"
Version Control
When we add new indicators or change normalization, we track which edges were discovered under which data version. Old edges might need re-validation.
Duplicate Detection
Similar edges often appear multiple times with minor variations. We detect and merge near-duplicates to avoid counting the same edge repeatedly.
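One simple way to flag near-duplicates, as a sketch: compare the sets of timestamps at which two edges fire, and treat a very high overlap as the same edge. The Jaccard measure and the 0.9 cutoff are illustrative assumptions.

```python
def signal_overlap(ts_a: set, ts_b: set) -> float:
    """Jaccard overlap of two edges' entry timestamps."""
    if not ts_a or not ts_b:
        return 0.0
    return len(ts_a & ts_b) / len(ts_a | ts_b)

def is_near_duplicate(ts_a, ts_b, threshold=0.9) -> bool:
    return signal_overlap(set(ts_a), set(ts_b)) >= threshold
```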
The Discovery Pipeline in Practice
Real example of our process:
Week 1: Data Refresh
New HyBlock data arrives. We update parquet files, recalculate z-scores, regenerate regime labels.
Week 2: Grid Search
Launch discovery on SOL with the latest data. Test 2 million combinations across all indicator pairs. Takes ~6 hours on 32 cores.
Week 3: Initial Filtering
Query results: 15,000 combinations show >60% win rate. Apply the sample size filter (>500): down to 3,200. Apply walk-forward: down to 890.
Week 4: Robustness Testing
Test parameter stability on the 890 survivors. 340 pass. Check for economic rationale: 240 make sense.
Week 5: Final Validation
Run the 240 candidates through the held-out validation period. 67 perform acceptably.
Week 6: Promotion Decisions
Review the 67 validated edges. Select the best 10-15 for paper trading based on low correlation among edges, regime diversity, and signal frequency.
From millions of combinations to dozens of tradeable edges. That's the funnel.
Common Failure Modes
Over-filtering: Requirements so strict that nothing passes. Result: No edges promoted, opportunity cost.
Under-filtering: Requirements too loose, noise gets through. Result: False edges blow up in production.
Confirmation Bias: Favoring edges that match your priors. Result: Miss edges that work for reasons you didn't expect.
Recency Bias: Over-weighting recent performance in validation. Result: Regime-specific edges mistaken for universal edges.
Complexity Creep: Adding more factors to improve backtest numbers. Result: Overfitted edges that fail forward.
Balance is key. Be rigorous but not paralyzed. Filter noise but accept that some uncertainty is unavoidable.
Key Takeaways
- Systematic discovery requires testing millions of combinations—manual testing won't work
- Grid search across indicators, thresholds, directions, horizons, and regimes
- Multiple comparison problem is real—most "good" backtests are noise
- Safeguards: large sample sizes, walk-forward validation, parameter stability, economic rationale
- Strict separation between discovery data and validation data
- Database-driven result tracking for querying and organization
- The funnel is harsh: millions of candidates → dozens of tradeable edges
Scale creates opportunity—the more you test, the more real edges you find. But scale also creates danger—the more you test, the more noise looks like signal. Rigorous statistical hygiene is what separates productive discovery from random data mining.