Intended audience: Senior, staff, and principal engineers building execution infrastructure or research/production loops around it; quant developers who own the optimizer side of an SOR; trading-system architects asking whether their adaptive routing layer is doing real work or repainting marketing copy; and engineers evaluating where, exactly, GPU and quantum acceleration should plug into a deterministic trading stack without breaking replay. Working familiarity with the JVM, electronic-market microstructure, and the layered SOR design from the prior article is assumed.

Reference implementation: Adaptive Quantum SOR implements the architecture described here, from a Java-only Phase 1 reference through Phase 7 robust policy selection. Specifications, completion reports, scenario catalog, native CUDA/CUDA-Q artifacts, JMH benchmarks, and Jupyter research notebooks live at github.com/rueishi/adaptive-quantum-sor. The full source tracks adaptive_quantum_sor_spec_v1.md and adaptive_quantum_sor_spec_phase7.md.


A parent order arrives at 14:31:07.412. Buy 800,000 shares of AAPL, VWAP-equivalent execution over the next forty minutes, risk-capped at 12% participation per venue. The execution algorithm slices it into a child intent for the current minute bucket: roughly 18,000 shares to be worked now, across the venue universe the firm is connected to. The router has to decide, in microseconds, which venues to consider, what relative weights to give them, what fill-probability assumption to use for each, what fee schedule each is currently on, what reject penalty to assess after the last three minutes of session quality, and what queue-survival prior to apply to passive children. All of this for a venue universe that, ten minutes ago, the firm’s regime detector reclassified from “normal” to “stressed.”

The router cannot run the optimization at decision time. The decision budget is a few microseconds; the optimization, done honestly, is a constrained quadratic assignment problem with a learned signal layer, a venue-correlation matrix, and a set of regime-conditioned penalty terms. The honest answer is to run the optimization continuously on a separate path: feature stats every few seconds, tactical tuning every minute, strategic venue subsets every five to thirty minutes, robust scenario evaluation at publication. The optimizer compiles its decisions into an immutable policy artifact that the execution path can read in nanoseconds.

This article is about that separation. The central claim: an adaptive SOR should never optimize on the order-time path. Optimization happens off-path on its own cadence, the result is an immutable versioned policy artifact, the execution path reads it in nanoseconds, and GPU and quantum backends earn their place only when they respect that boundary. Everything else follows: what each layer of the optimization stack does, where GPU/CUDA acceleration belongs, where quantum-style QUBO/Ising solvers belong, and how robust policy selection (choosing one policy from a candidate set under an explicit objective over a declared scenario set) replaces the silently-fragile “publish the one candidate that scored best on yesterday’s expected conditions.”

Figure 1. The layered policy optimization stack, from L6 replay at the top down to L0 execution at the bottom. Native backends sit behind L3, L3.5, and L4 only. Cadence slows upward; L0 reads exactly one immutable policy per decision. The publication boundary is the only thing the warm path gives the hot path, and the only thing the hot path reads.

Key terms
L0–L6
The seven layers of the policy optimization stack. L0 is the hot path (per-order, microseconds, Java only); L6 is the historical/replay layer (offline, scenario catalog).
Hot path / warm path / cold path
Three latency tiers. Hot is order-time and microsecond-bounded; warm is the optimizer-cycle path (seconds to minutes); cold is audit, training, and analysis (no latency budget).
Route key
Composite key (instrumentId, regimeId, urgencyId) that indexes every per-policy routing parameter. Dense integer encoding lets the hot path read by array offset.
HotRouteBook
Primitive-array representation of the compiled policy that the L0 routing path reads. No objects, no boxing, no allocation.
SorPolicy
The immutable, versioned, audit-stamped routing policy artifact. Compiled by L1 and published atomically.
Publication boundary
The single AtomicReference<SorPolicy> that holds the active policy. The only handoff between optimizer and executor; the only thing the hot path reads from the optimizer side.
QUBO
Quadratic Unconstrained Binary Optimization. Combinatorial objective with binary variables and quadratic pair coupling. The native problem shape for quantum annealers and QAOA solvers.
Pair coefficient
The x[u]·x[v] term in a QUBO that penalizes co-selection of correlated or jointly-toxic venues. Distinguishes a real QUBO from a top-K sort.
CUDA-Q
NVIDIA's open-source hybrid quantum-classical platform. Runs quantum-style algorithms either on a real QPU or simulated on a GPU.
CVAR_K
Phase 7 robust-selection objective. Selects the candidate with the highest mean score over the worst K% of scenarios; interpolates between EXPECTED (K = 100) and MIN_MAX (K = 100/n, the single worst of n scenarios).
MIN_REGRET / MIN_MAX / EXPECTED
The other three Phase 7 objectives. EXPECTED is the mean; MIN_MAX is strict worst-case; MIN_REGRET minimizes the maximum per-scenario gap to the best alternative.
Adequacy gate
Phase 7 check that the scenario set covers required failure categories (regime, liquidity, venue-health, failure-negative). Bounds the strength of the published policy's robustness claim.
Fail-closed
Backend failure (CUDA, CUDA-Q, batch allocator) falls back to the previously approved result. The executor never sees the failure; the operator sees a counter increment.
Reference solver
Deterministic Java or CPU implementation that ships alongside every native backend. Serves as both correctness oracle for small problems and always-available fallback.
Native boundary / ABI
The contract between the JVM and a C++/CUDA backend: versioned struct layout, direct buffers across JNI, primitive status codes, Java-side fallback on any non-OK return.
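The four Phase 7 objectives defined above can be sketched as pure functions over a score matrix. This is a minimal illustration, not the reference implementation: the class name RobustObjectives, the scores[candidate][scenario] layout, the higher-is-better convention, first-wins tie-breaking, and rounding K% up to a whole number of scenarios are all assumptions.

```java
import java.util.Arrays;

/** Hypothetical sketch: scores[c][s] is candidate c's score in scenario s, higher is better. */
final class RobustObjectives {
    private RobustObjectives() {}

    /** EXPECTED: candidate with the highest mean over all scenarios. */
    static int expected(double[][] scores) {
        return bestByTailMean(scores, scores[0].length);
    }

    /** MIN_MAX: candidate with the best single worst-case score. */
    static int minMax(double[][] scores) {
        return bestByTailMean(scores, 1);
    }

    /** CVAR_K: candidate with the highest mean over its own worst kPercent of scenarios. */
    static int cvarK(double[][] scores, double kPercent) {
        int n = scores[0].length;
        // Assumption: round up, and keep at least one scenario in the tail.
        int tail = (int) Math.ceil(kPercent / 100.0 * n);
        return bestByTailMean(scores, Math.max(1, Math.min(n, tail)));
    }

    /** MIN_REGRET: candidate with the smallest maximum gap to the per-scenario best. */
    static int minRegret(double[][] scores) {
        int best = -1;
        double bestMaxRegret = Double.POSITIVE_INFINITY;
        for (int c = 0; c < scores.length; c++) {
            double maxRegret = 0.0;
            for (int s = 0; s < scores[c].length; s++) {
                double bestInScenario = Double.NEGATIVE_INFINITY;
                for (double[] row : scores) bestInScenario = Math.max(bestInScenario, row[s]);
                maxRegret = Math.max(maxRegret, bestInScenario - scores[c][s]);
            }
            if (maxRegret < bestMaxRegret) { bestMaxRegret = maxRegret; best = c; }
        }
        return best;
    }

    private static int bestByTailMean(double[][] scores, int tail) {
        int best = -1;
        double bestMean = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < scores.length; c++) {
            double[] sorted = scores[c].clone();
            Arrays.sort(sorted);                       // ascending: worst scenarios first
            double sum = 0.0;
            for (int i = 0; i < tail; i++) sum += sorted[i];
            if (sum / tail > bestMean) { bestMean = sum / tail; best = c; }
        }
        return best;
    }
}
```

With the toy matrix {{10, 10, 10, -6}, {5, 5, 5, 1}, {7, 7, 3, 0}}, EXPECTED picks candidate 0 (mean 6), while MIN_MAX, CVAR_K at K = 50, and MIN_REGRET all pick candidate 1: the candidate that looks best on average is the one with the worst tail.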

Why this article exists

The integration gap

SOR is one of the few subjects in trading infrastructure where the gap between what the system actually has to do and what most write-ups describe is this wide. A serious router reasons over fragmented liquidity across tens to hundreds of venues, time-bucketed fill probability distributions per (instrument, venue, regime, side, size, price-distance), per-venue fee tiers and rebate schedules, post-fill mark-out distributions at multiple horizons, queue-survival priors, latency tables, reject-rate decays, parent-level conservation invariants, cross-parent capacity sharing, regime classification, model-signal lineage, risk envelopes, and a publication gate that has to decide whether to swap a new policy in atomically, all on cadences that span microseconds to tens of minutes. The data points alone fill a small textbook. The interactions between them fill a larger one: what changes when a regime flips, what tightens when a venue’s reject rate climbs, what the right participation cap is when two parents arrive in the same minute window.

Most existing treatments of SOR pick one slice (best-execution compliance, latency engineering, market microstructure, ML signals) and leave the rest as background. The result is that engineers building real systems inherit a fragmented picture and re-derive the integration from scratch. Another article on SOR earns its place only if it integrates the slices, and an article on adaptive SOR earns its place only if it integrates them on the cadences a real production system actually runs at.

Why this matters now, not five years ago

The reason is the cost-of-optimization curve. Adaptive routing is the modern real-world SOR trend not because the math became more interesting but because the optimization that defines adaptiveness has finally acquired a credible compute substrate. By adaptiveness I mean continuously recomputing venue weights, fill priors, toxicity penalties, strategic subsets, and cross-parent allocations from streaming evidence. On CPUs alone, an honest end-to-end recomputation of a production-scale policy (thousands of route keys, dense venue universes, regime-conditioned scoring, pairwise venue penalties) is a multi-hour job. Firms that ran this on CPUs learned to cheat: graph-theoretic decomposition over thousands of nodes, dependency tracking that recomputes only the parts of the policy a change can possibly affect, dirty-flag propagation through layered caches, partial-recompute schedulers that interleave full rebuilds across hours of small updates. The technique works; these systems are in production. They are also expensive to build, fragile under non-stationary inputs (a regime shift invalidates so much of the cached graph that the bypass logic is useless exactly when adaptiveness matters most), and hard to keep correct as the venue universe and signal pipeline evolve.

GPU acceleration changes the trade. A constrained QP across all active route keys is a single batched kernel; a QUBO across the strategic subsets is a parallel evaluation; a cross-parent batch allocation is a structured optimization with hardware support. The cost of full recomputation collapses to something that fits inside a 1-to-5-minute cadence, and the brittle graph-theoretic bypass logic stops being load-bearing. The design stays simple because the hardware made the simple design cheap enough to ship.

GPU and quantum as the new substrate

There is a broader hardware trend the design rides rather than fights. GPU/CUDA compute is no longer optional for problems with the right shape. Dense linear algebra, batched independent subproblems, sparse-matrix-vector products, constrained QP, and large-N pair-coupled QUBO instances are the shape where well-designed GPU kernels deliver real one-to-two orders of magnitude throughput gains over CPU at production sizes. The published evidence is concrete and consistent: a GPU-accelerated implementation of OSQP, the production-grade quadratic-programming solver, has been measured up to two orders of magnitude (≈100×) faster than its CPU equivalent on large QP problems; GPU-parallel quadratic-assignment solvers have been measured at order-of-magnitude average speedups with up to 63× on specific QAPLIB instances; NVIDIA’s cuOpt LP barrier method reports an average speedup over a leading open-source CPU solver and over a popular commercial CPU solver on a large public test set. The same OSQP-GPU paper is explicit that GPUs are slower than CPUs on small QP instances, and the QAP work concludes that “both algorithmic choice and the shape of the input data sets are key factors” in whether the speedup materializes.

The honest claim is one-to-two orders of magnitude when the problem size, sparsity pattern, kernel design, and host-to-device transfer cost are all well matched, and that is what this design is sized for. For the layers where the optimization is naturally batched (L3 tactical across thousands of route keys, L3.5 batch allocation across concurrent parents, L4 strategic QUBO across pair-coupled venue subsets), that throughput is the difference between a cadence that fits inside the warm path and one that does not. We use the speedup where it pays and ignore it where it does not, and we make the choice visible per layer rather than hide it inside vendor copy.

The quantum side is an extension of the same trend, intentionally staged. The QUBO formulations at L4 and L3.5 are not aspirational; they are evaluated today on classical reference solvers and on quantum-simulated backends running on NVIDIA GPUs through CUDA-Q. CUDA-Q is NVIDIA’s open-source hybrid quantum-classical platform, explicit in its own documentation about “offering GPU-accelerated simulations when adequate quantum hardware isn’t available” and being “qubit-agnostic” when it is. Simulating an Ising/QUBO instance on a GPU is not the same as running it on a real QPU. It is, however, the integration that exercises the same backend boundary, the same input layout, the same fail-closed contract, and the same audit lineage that a real QPU will use.
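To make the QUBO shape concrete, here is a minimal sketch of the kind of exhaustive classical reference solver that can serve as an oracle on small instances. The class name QuboReference, the lower-is-better energy convention, and the toy coefficients are assumptions for illustration, not the project's formulation. As a simplification, the cardinality bound is enforced by filtering candidate subsets rather than folded into the objective as a penalty term.

```java
/** Hypothetical exhaustive QUBO solver; only practical for small venue universes. */
final class QuboReference {
    private QuboReference() {}

    /** Energy = sum_u linear[u]*x[u] + sum_{u<v} pair[u][v]*x[u]*x[v]; lower is better. */
    static double energy(double[] linear, double[][] pair, int mask) {
        double e = 0.0;
        for (int u = 0; u < linear.length; u++) {
            if ((mask & (1 << u)) == 0) continue;          // venue u not selected
            e += linear[u];
            for (int v = u + 1; v < linear.length; v++) {
                if ((mask & (1 << v)) != 0) e += pair[u][v]; // co-selection penalty
            }
        }
        return e;
    }

    /** Returns the bitmask of the lowest-energy subset with exactly k venues (0 if none). */
    static int solve(double[] linear, double[][] pair, int k) {
        int best = 0;
        double bestEnergy = Double.POSITIVE_INFINITY;
        for (int mask = 0; mask < (1 << linear.length); mask++) {
            if (Integer.bitCount(mask) != k) continue;     // cardinality filter
            double e = energy(linear, pair, mask);
            if (e < bestEnergy) { bestEnergy = e; best = mask; }
        }
        return best;
    }
}
```

With linear terms {-5, -4, -3} and no pair terms, selecting two venues returns venues 0 and 1, which is exactly what a top-K sort would do. Adding a pair coefficient of 3 between venues 0 and 1 moves the answer to venues 0 and 2. That shift is the work the pair coefficient does.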

The horizon for that real QPU is no longer fully speculative. IBM’s publicly published quantum roadmap targets the Starling fault-tolerant system by 2029, with 200 logical qubits and 100-million-gate circuit depth, and Blue Jay by 2033 with 2,000 logical qubits and one-billion-gate depth; IBM has hit its prior roadmap milestones in public and now claims “the most viable path to realize fault-tolerant quantum computing.” Whether or not those specific milestones land on the published schedule, the direction is firm, and the application class for which quantum hardware is most credibly a fit is narrow. Combinatorial optimization over pair-coupled binary variables (which is what L4 strategic subset selection is) has been among the most heavily studied “killer-app” candidates in both the NISQ and the early fault-tolerant literature for the last several years, with active research lines on QAOA, variational quantum eigensolvers, quantum walks, and quantum-inspired annealing applied directly to Markowitz-style portfolio and combinatorial selection problems. None of that body of work has demonstrated quantum advantage at industrial scale today, and this article does not claim it has. What we claim is much smaller and more defensible: the integration seams are in place, the classical reference solver is the correctness oracle for every small enough instance, the backend boundary fails closed when the exotic hardware is unavailable, and the firm can experiment with a QPU without committing to a vendor before its trust has been earned. Designing for that adaptability is what “quantum-ready” should mean: preparedness, not prediction.

Layering the policy optimization is what lets us use GPU and quantum compute without putting any of it on the order-time path. Without the layering, GPU-accelerated optimization either has to run on every order (a latency disaster and a determinism disaster) or has to be wedged into some ad-hoc warm-path loop with its own boundaries and its own failure modes. The layered model gives every cadence its own home, every backend its own boundary, every failure its own fallback, and every published policy its own audit lineage. The GPU at L3, the QUBO seam at L4, the batch allocator at L3.5, the ML pipeline at L5: each is a clean integration point with an explicit Java-side interface and an explicit reference implementation. The layering is what lets a real adaptive SOR exist on real hardware without re-implementing graph-theoretic bypass logic and without ever putting an exotic backend on the path that has to respond to a fill in microseconds.

What “adaptive” and “quantum-ready” should mean

The prior article drew a clear boundary. It described the SOR as a deterministic execution strategy: parent intent in, scored venues, allocated children, partial-fill reroute, terminal state out. The optimizer side (how venue weights, fill-probability priors, latency penalties, and toxicity scores actually get there) was treated as a cold-path input, gestured at but not designed. Most production write-ups of SOR design do exactly the same thing, and for the same reason: the execution-side discipline is hard enough on its own.

The omission is consequential. The execution layer is only as good as the policy it executes. A perfectly deterministic, allocation-free routing path that consumes a stale, miscalibrated, or fragile policy will produce perfectly deterministic, allocation-free bad fills. And the optimization stack that produces the policy is where almost all the “adaptive,” “AI-driven,” “quantum-accelerated” marketing of modern SOR vendors lives, usually attached to one box on a diagram that decision-makers never get to inspect.

There is a second reason. Adaptive and quantum are two of the most over-marketed terms in execution-strategy vendor copy. Real adaptiveness is structural: the policy actually changes with the market, the changes are evidenced by post-trade transaction-cost analysis, and the change cadence is principled. Real quantum-readiness is architectural: the system has a backend boundary that a QUBO/Ising solver or GPU-accelerated cuOpt run can plug into, with the same input shape and the same fail-closed contract as the reference solver next to it. Most systems that claim either have neither. This article tries to make the difference concrete for engineers who have to ship.

Recap from the prior article: the invariants this design assumes

The execution side from the prior article gives us the substrate. Compressed:

  • Deterministic single-writer hot path. The order-time routing path runs on one logical writer, allocation-free, replay-stable.
  • Hot/cold path split. Cold paths allocate, fail, retry, learn. Hot paths read primitive arrays and write child orders.
  • Parent state with conservation invariants. Cumulative fill, leaves quantity, and live-children residual sum to original quantity at every observable state.
  • Risk gate is pre-trade per child, with parent-level conservation. Every child passes through the risk engine; SOR-internal invariants catch double-allocation that per-child risk cannot see.
  • Evidence over assertion. Determinism is proven by replay-divergence tests; allocation-freeness is proven by JMH -prof gc; release readiness is gated by completion reports that name every implemented and known-incomplete acceptance criterion.
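The conservation invariant in the list above can be sketched as a check that runs after every state transition. This is a hypothetical illustration: the class and field names are invented here, and it assumes "leaves" means quantity not yet allocated to any child, which may differ from the prior article's exact definition.

```java
/** Hypothetical parent state; quantities are in shares, names are illustrative. */
final class ParentState {
    final long originalQty;
    long cumFillQty;         // filled so far across all children
    long liveChildResidual;  // open quantity resting in live child orders
    long leavesQty;          // assumption: quantity not yet allocated to any child

    ParentState(long originalQty) {
        this.originalQty = originalQty;
        this.leavesQty = originalQty;
    }

    /** Moves quantity from leaves into a live child; rejects over-allocation. */
    void allocateChild(long qty) {
        if (qty <= 0 || qty > leavesQty) throw new IllegalArgumentException("bad child qty");
        leavesQty -= qty;
        liveChildResidual += qty;
        checkConservation();
    }

    /** Applies a fill against live child quantity. */
    void onFill(long qty) {
        if (qty <= 0 || qty > liveChildResidual) throw new IllegalArgumentException("bad fill qty");
        liveChildResidual -= qty;
        cumFillQty += qty;
        checkConservation();
    }

    /** Returns a cancelled or rejected child's unfilled quantity to leaves for reroute. */
    void onChildClosed(long unfilledQty) {
        if (unfilledQty < 0 || unfilledQty > liveChildResidual) throw new IllegalArgumentException("bad residual");
        liveChildResidual -= unfilledQty;
        leavesQty += unfilledQty;
        checkConservation();
    }

    boolean conserved() {
        return cumFillQty + liveChildResidual + leavesQty == originalQty;
    }

    private void checkConservation() {
        if (!conserved()) throw new IllegalStateException("parent conservation violated");
    }
}
```

The guard in allocateChild is the parent-level check that per-child risk cannot perform: each child may pass its own risk limits while the sum of children still exceeds what the parent has left to give.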

The current article builds the layer above. Everything below assumes those invariants hold and never touches them.


How this article is organized

The structure follows the prior article. We start by being precise about what an adaptive SOR does that a static SOR does not. We then walk the layered optimization stack from the streaming feature layer at the bottom up to the historical replay layer at the top, with explicit attention to where GPU/CUDA work happens and where QUBO/CUDA-Q work happens. We unpack the cross-parent batch allocation problem, the one part of the stack where quadratic coupling between parents and venues breaks the per-parent-independent assumption. We cover the publication gate that turns optimizer output into an atomically-swappable immutable policy, and the Phase 7 robust selection wrapper that adds objective-and-scenario provenance to it. We close with the discipline that makes the whole thing trustworthy (deterministic replay, scenario catalogs as test surfaces, JMH allocation gates on the hot path, completion reports as evidence bundles) and with what the design explicitly does not do.

The article moves from the architectural principle through each optimization layer to the cross-layer disciplines that hold it together. Readers familiar with multi-tier optimization stacks can skim Part I, readers shipping the GPU side can land on Part III, readers focused on quantum/QUBO/CUDA-Q integration on Part IV, and readers reviewing release evidence on Part VI.

Part I, The core principle: separating optimization intelligence from execution mechanics. §1 names what makes an SOR adaptive versus static. §2 unpacks why the obvious single-tier “adaptive on every order” design is wrong. §3 introduces the compiled-policy artifact as the discipline that makes adaptiveness safe.

Part II, The layered policy optimization stack. §4 surveys the L0–L6 layers as a coherent stack. §5 covers the streaming feature layer (L2). §6 covers the ML/feature-model layer (L5). §7 covers the tactical layer (L3) and the native CUDA boundary. §8 covers the cross-parent batch allocation layer (L3.5) and its quadratic objective. §9 covers the strategic layer (L4), the QUBO/Ising formulation, and the CUDA-Q boundary. §10 covers the historical/synthetic replay layer (L6) as both validation source and policy-quality scoreboard.

Part III, GPU/CUDA integration. §11 covers the native boundary contract: ABI, direct buffers, status codes, fallback policy. §12 walks the CUDA tactical optimizer integration end to end. §13 covers profiling, metrics, and the discipline of treating native failure as a first-class observable event.

Part IV, Quantum-future readiness. §14 covers what makes the strategic layer quantum-friendly: the route-level QUBO formulation, why pair coefficients matter, and why cardinality-bounded subset selection is a natural QPU workload. §15 covers the CUDA-Q strategic bridge and the path from a classical exhaustive reference solver to a QUBO/Ising/QPU backend. §16 covers how the batch allocator’s quadratic objective gives a second quantum-friendly seam, and why cuOpt, QUBO, and a QPU all plug into the same BatchAllocationBackend boundary. §17 is honest about what quantum advantage we are and are not claiming.

Part V, Robust policy selection (Phase 7). §18 covers why expected-condition tuning silently accepts tail fragility. §19 covers candidate sets, the scenario-sweep scorecard, and the four pure-function objectives (EXPECTED, MIN_MAX, CVAR_K, MIN_REGRET). §20 covers scenario adequacy as a strength-of-claim gate. §21 covers decision provenance and the audit trail.

Part VI, Discipline and evidence. §22 covers deterministic replay across phases. §23 covers the JMH allocation gate on the L0 hot path. §24 covers scenario catalogs as test surfaces. §25 enumerates what Adaptive Quantum SOR explicitly does not do.


Part I — The core principle: separating optimization intelligence from execution mechanics

1. What “adaptive” actually means

A static SOR routes against a fixed policy. Venue weights, fill-probability priors, latency penalties, toxicity penalties, child-size limits, participation caps: all are configured at deploy time and changed only by redeploy. The router’s job is to read the policy, score eligible venues, allocate the parent, and submit children. This is not a bad SOR. For many proprietary trading workloads, where the venue universe is small and stable and the firm’s signal pipeline already encodes most of the relevant edge into the trading-strategy side, a static SOR can produce execution quality competitive with anything else in the market.

An adaptive SOR routes against a policy that changes with the market. Venue weights shift when a venue’s recent rolling fill probability drifts. Toxicity penalties rise when post-fill mark-out worsens. Strategic venue subsets are reselected when a regime detector reclassifies the market state from “normal” to “stressed.” The policy is regenerated on a cadence that is fast enough to track real microstructure change and slow enough to remain reproducible, auditable, and stable against transient noise.

Two specific things make an SOR genuinely adaptive, as opposed to “static with some knobs”:

It consumes structured feedback from realized fills. A static SOR’s penalty terms are configured. An adaptive SOR’s penalty terms are derived (at least in part) from rolling statistics on what the firm actually observed at each venue in the recent past. Fill probability is bucketed by venue, instrument, regime, side, and size, and updated from execution-report outcomes. Toxicity is measured as post-fill mark-out by venue, in basis points, across configurable horizons. Latency is measured per-venue with the same monotonic clock the execution path uses, not the wall clock. None of these statistics enter the order-time path directly; they enter the optimizer path, where they get rolled into the next policy candidate.

Its policy is reproducibly versioned. A policy that adapts has to be auditable. Which policy was in force at the moment that 800,000-share AAPL parent was routed? Which model-signal version produced its toxicity penalties? Which strategic-optimizer run produced its eligible venue subset? Which scenario-sweep score matrix justified its selection over the next-best candidate? Without these answers, “adaptive” is just a euphemism for “we change things sometimes and hope nobody asks.” With them, the firm has an audit trail strong enough to defend a routing decision in a post-mortem, a regulator interaction, or a TCA dispute with a counterparty.

The combination of structured feedback in and immutable versioned policy out is what Adaptive Quantum SOR implements. Every executed parent order’s audit record references the policy version that produced its routing decision; every policy version’s lineage stamps the strategic-optimizer run ID, the tactical-optimizer run ID, the model-signal version, the input snapshot ID, and (when robust selection is enabled) the score-matrix handle and the selection objective. The lineage is dense, machine-readable, and replay-stable.

Key Principle: Adaptiveness is a discipline, not a feature. An SOR is adaptive when its policy is recompiled from streaming evidence on a principled cadence, when every published policy carries its full optimizer lineage, and when the execution path consumes one immutable policy version per decision. Without the lineage, “adaptive” is unreviewable. Without the immutable version, “adaptive” is non-deterministic.

2. Why the obvious design is wrong

The naive design, “run the optimizer every time a parent arrives,” fails on three independent grounds, each of which is sufficient to kill it.

The first is latency. A serious tactical optimizer over a real venue universe involves a constrained quadratic program: tune weights, penalties, child-size limits, and participation caps subject to risk constraints and fee constraints. Even with cuOpt or a hand-rolled CUDA kernel, this takes milliseconds, not microseconds. By the time the optimizer returns, the displayed book has moved through several updates, the venue’s queue position the optimizer assumed has changed, and the resulting “optimal” routing decision is computed against a market that no longer exists. The order-time decision budget is microseconds; a single optimizer solve costs milliseconds and belongs on a cadence of minutes. Conflating the two budgets produces an order-time path that is either too slow to trade or too fast to optimize, and usually both.

The second is determinism. A deterministic SOR is one whose routing decisions are byte-identical under replay of the same input stream. This is the property that makes deterministic-replay testing work, that makes leader/follower cluster recovery sound, that makes post-mortem reproduction possible, and that makes the JMH zero-allocation discipline meaningful in the first place. An optimizer that runs on the order-time path takes inputs from cold-path state (fill-probability posteriors, model signals, regime classifications) and feeds them into the routing decision. If any of those inputs are timing-dependent (when did the model artifact get imported, when did the regime detector flip, when did the GC pause complete?), the routing decision becomes timing-dependent, which means it is not deterministic, which means it is not replayable, which means it is not auditable.

The third is failure modes. A GPU is a separate device with its own driver, runtime, and memory. A quantum-style QUBO solver is, today, either a separate process or a separate vendor service. A model artifact comes from a Python training pipeline that may or may not have run successfully. Each of these is allowed to fail. An order-time path that depends on any of them must, when they fail, either route badly or refuse to route. Neither outcome is acceptable for an execution layer that has to deliver on every parent intent the trading strategy emits. The fix is to keep the failure surface entirely off the order-time path: the optimizer may fail, the GPU may be unavailable, the model artifact may be rejected, and the order-time path continues to route on the most recent approved policy without noticing.
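The fallback behaviour described here can be sketched as a small warm-path wrapper. This is a hypothetical illustration rather than the project's backend interface: the class name, the use of a plain Supplier to stand in for a native call, and the treatment of a null result as a failure are assumptions. It also assumes a single warm-path thread, so no synchronization is shown.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical fail-closed wrapper around one optimizer backend (warm path only). */
final class FailClosedBackend<T> {
    private final Supplier<T> backend;  // stands in for a native CUDA or CUDA-Q call
    private T lastApproved;             // most recent result that was accepted
    private long failureCount;          // what the operator sees instead of an outage

    FailClosedBackend(Supplier<T> backend, T initialApproved) {
        this.backend = Objects.requireNonNull(backend);
        this.lastApproved = Objects.requireNonNull(initialApproved);
    }

    /** Runs one optimizer cycle; on any backend failure, returns the previous result. */
    T runCycle() {
        try {
            T result = backend.get();
            if (result != null) {       // assumption: null signals a non-OK status
                lastApproved = result;
                return result;
            }
        } catch (RuntimeException e) {
            // Swallowed on purpose: the failure is reported through the counter below.
        }
        failureCount++;
        return lastApproved;
    }

    long failureCount() {
        return failureCount;
    }
}
```

The executor never calls this class. It sits upstream of the publication step, so a failed cycle simply means no new candidate is produced and the active policy stays in force.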

These three pressures resolve into one design decision: the order-time path reads exactly one immutable policy artifact per decision, the optimizer side compiles new policy artifacts on a separate cadence, and the boundary between them is an atomic reference swap. Every layer of streaming stats, every GPU-backed tactical tune, every CUDA-Q strategic subset selection, every cross-parent batch allocation, every robust scenario sweep, all of these live on one side or the other of that boundary, and never cross it on the order-time path.

The naive design fails because it tries to do “intelligent” work at decision time. The honest design moves the intelligence to where it belongs (warm path, separate cadence, separate failure budget) and leaves the execution path doing the one thing it has to do at microsecond speed: read a compiled policy, score eligible venues with the precomputed weights and penalties in it, allocate, and submit.

3. Policy as a compiled, versioned artifact

Figure 2. The publication boundary. One atomic reference holds the immutable SorPolicy: the warm path publishes via publisher.publish(newPolicy), and the hot path reads via activePolicy() once per routing decision. Everything quantum, GPU-accelerated, or experimental sits above. Everything microsecond and deterministic sits below. The interface between them is one atomic reference.

The boundary needs a representation. Adaptive Quantum SOR uses an immutable SorPolicy snapshot containing:

policyVersion              monotonic per-publication
createdAtEpochNanos        ingestion clock at compile time
effectiveFromEpochNanos    earliest decision-time validity
policyHash64               fast 64-bit fingerprint for audit
policyHashSha256           strong 32-byte fingerprint for governance
isingResultVersion         strategic-optimizer run lineage
cudaTuningVersion          tactical-optimizer run lineage
optimizerType              which backend produced this policy
policyState                ACTIVE / CANARY / SHADOW
hotRouteBook               pre-ranked execution-side route lists
fullPolicyMatrix           dense audit/replay view

The HotRouteBook is what the order-time path actually reads. It is a set of parallel primitive arrays (short[] routeVenueId, int[] weightBps, int[] latencyPenaltyNanos, int[] fillProbabilityBps, int[] toxicityPenaltyBps, long[] minChildQty, long[] maxChildQty, int[] maxParticipationBps, and so on) indexed by a flat routeKey derived from (instrumentId, regimeId, urgencyId). There are no maps, no boxed types, no String keys, no allocation, no method dispatch through generic interfaces. The executor resolves the route key for the incoming parent, reads the offset range, walks the venue list in order, applies the precomputed scoring, and writes child orders into a caller-owned buffer. The entire path is ArchUnit-enforced as allocation-free in source and JMH-verified as allocation-free at runtime.
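A heavily simplified sketch of that layout follows. Only routeVenueId and weightBps come from the description above; the class name, the routeOffset array, and the row-major key encoding are assumptions about one plausible way to realize "dense integer encoding" and "reads the offset range".

```java
/** Hypothetical, simplified route book: parallel primitive arrays indexed by route key. */
final class HotRouteBookSketch {
    private final int regimeCount;
    private final int urgencyCount;
    private final int[] routeOffset;     // venues for key k live in [routeOffset[k], routeOffset[k + 1])
    private final short[] routeVenueId;  // pre-ranked within each range
    private final int[] weightBps;       // parallel to routeVenueId

    HotRouteBookSketch(int regimeCount, int urgencyCount,
                       int[] routeOffset, short[] routeVenueId, int[] weightBps) {
        this.regimeCount = regimeCount;
        this.urgencyCount = urgencyCount;
        this.routeOffset = routeOffset;
        this.routeVenueId = routeVenueId;
        this.weightBps = weightBps;
    }

    /** Row-major dense key; assumes all three ids are small, non-negative, and in range. */
    int routeKey(int instrumentId, int regimeId, int urgencyId) {
        return (instrumentId * regimeCount + regimeId) * urgencyCount + urgencyId;
    }

    /** Copies the ranked venues for a key into caller-owned buffers; returns the count. */
    int venuesFor(int routeKey, short[] outVenueId, int[] outWeightBps) {
        int from = routeOffset[routeKey];
        int to = routeOffset[routeKey + 1];
        int n = Math.min(to - from, outVenueId.length);
        for (int i = 0; i < n; i++) {
            outVenueId[i] = routeVenueId[from + i];
            outWeightBps[i] = weightBps[from + i];
        }
        return n;
    }
}
```

The read path touches only primitive arrays and caller-owned buffers, so it has nothing to allocate. An empty range (two equal consecutive offsets) is how a key with no eligible venues would be represented in this sketch.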

The FullPolicyMatrix is what the warm path and audit layers read. It carries the same data plus diagnostic context (venue eligibility flags, per-cell quality scores, optimizer trace metadata) that the order-time path never needs. It is the audit and replay view; it never reaches L0.

PolicyPublisher owns the boundary. It exposes one method to the execution path:

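import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

public final class PolicyPublisher {
    // The single handoff point between the optimizer side and the executor.
    private final AtomicReference<SorPolicy> activePolicy;

    // Constructor added so the final field is initialized; the bootstrap policy is an assumption.
    public PolicyPublisher(SorPolicy initialPolicy) {
        this.activePolicy = new AtomicReference<>(Objects.requireNonNull(initialPolicy));
    }

    // Hot path: one volatile read per routing decision.
    public SorPolicy activePolicy() {
        return activePolicy.get();
    }

    // Warm path: atomically swap in a validated, immutable policy.
    public void publish(SorPolicy newPolicy) {
        activePolicy.set(Objects.requireNonNull(newPolicy));
    }
}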

The atomic reference is the entire synchronization story. The executor reads activePolicy() once per routing decision, captures the reference, and uses that snapshot for the full decision regardless of whether the publisher swaps in a newer policy mid-decision. Subsequent decisions pick up the new policy; the current decision sees a consistent snapshot. There is no lock, no copy, no version-comparison logic on the hot path. The version question ("which policy made this routing decision?") is answered by recording policyVersion and policyHash64 into the per-decision RouteAuditEvent, which is written into a caller-owned audit buffer that flushes asynchronously to a cold-path writer.
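The read-once discipline can be sketched with stand-in types. Everything here is hypothetical: the nested Policy class carries only the fields the sketch needs, the two-slot long[] stands in for the caller-owned audit buffer, and the midDecisionHook parameter exists only to simulate a publish landing while a decision is in flight.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of one routing decision reading the active policy exactly once. */
final class SnapshotReadSketch {
    /** Stand-in for SorPolicy; immutable by convention in this sketch. */
    static final class Policy {
        final long policyVersion;
        final long policyHash64;
        final int[] weightBps;

        Policy(long policyVersion, long policyHash64, int[] weightBps) {
            this.policyVersion = policyVersion;
            this.policyHash64 = policyHash64;
            this.weightBps = weightBps;
        }
    }

    private final AtomicReference<Policy> active;

    SnapshotReadSketch(Policy initial) {
        this.active = new AtomicReference<>(initial);
    }

    void publish(Policy next) {
        active.set(next);
    }

    /** Returns a toy score (sum of weights) and stamps the audit slots with the policy used. */
    long route(long[] auditOut, Runnable midDecisionHook) {
        Policy p = active.get();      // the only read of the shared reference
        midDecisionHook.run();        // a concurrent publish may land here
        long score = 0;
        for (int w : p.weightBps) {   // all later reads go through the local snapshot
            score += w;
        }
        auditOut[0] = p.policyVersion;
        auditOut[1] = p.policyHash64;
        return score;
    }
}
```

If a hook publishes version 2 during a decision that started on version 1, the decision still scores against version 1 and stamps version 1 into the audit slots. The next call to route picks up version 2.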

This is the same architectural pattern the prior article used for the hot/cold split, scaled up one level. There, the executor side was hot and everything else was cold. Here, the executor is hot, the publisher boundary is atomic, and everything upstream of the publisher (feature aggregation, model artifact import, tactical tuning, batch allocation, strategic subset selection, robust selection) is allowed to be slow, to fail, and to take its time getting right. The price of that freedom is the discipline of compiling, validating, and publishing only an immutable, audit-stamped policy artifact. The benefit is that the entire optimizer stack, whether GPU-accelerated, quantum-future, ML-driven, or all three, can evolve without ever touching the deterministic execution path.

The publication gate runs a diff before it swaps. Not every compiled candidate justifies a new published policy version; thrash on the publication boundary is itself a cost (audit volume grows, post-trade attribution becomes harder, and downstream consumers of the policy stream see noise instead of signal). The gate computes a PolicyDiff between the candidate and the current active policy across the dense matrix and asks whether the difference is material. A representative materiality rule:

If the diff is immaterial, the gate records the candidate, declines to publish, and the active policy continues. If it is material, the candidate proceeds to the rest of the gate (lint, validation, expected-improvement, churn limits, and, when robust selection is enabled, scenario sweep evaluation). The materiality rule is configured, not hard-coded; firms that want every cycle to publish can set the thresholds to zero, firms that want aggressive thrash suppression can raise them. The choice is auditable; the diff is persisted; the rule is replayable.
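One possible shape for a configurable materiality check, as a sketch only. The class name `PolicyDiffSketch` and the two thresholds (`minChangedCells`, `minMaxDeltaBps`) are hypothetical and are not taken from the project; the sketch compares a single dense weight array rather than the full matrix.

```java
// Hypothetical diff over two dense weight arrays of equal length.
final class PolicyDiffSketch {
    final int changedCells;   // number of cells whose weight differs
    final int maxAbsDeltaBps; // largest absolute change in any one cell

    private PolicyDiffSketch(int changedCells, int maxAbsDeltaBps) {
        this.changedCells = changedCells;
        this.maxAbsDeltaBps = maxAbsDeltaBps;
    }

    static PolicyDiffSketch between(int[] activeWeightBps, int[] candidateWeightBps) {
        if (activeWeightBps.length != candidateWeightBps.length) {
            throw new IllegalArgumentException("policies must share the same dense shape");
        }
        int changed = 0;
        int maxDelta = 0;
        for (int i = 0; i < activeWeightBps.length; i++) {
            int delta = Math.abs(candidateWeightBps[i] - activeWeightBps[i]);
            if (delta != 0) {
                changed++;
            }
            maxDelta = Math.max(maxDelta, delta);
        }
        return new PolicyDiffSketch(changed, maxDelta);
    }

    // Material when either threshold is met. With both thresholds at zero,
    // every candidate is material, so every cycle publishes.
    boolean isMaterial(int minChangedCells, int minMaxDeltaBps) {
        return changedCells >= minChangedCells || maxAbsDeltaBps >= minMaxDeltaBps;
    }
}
```

Because the diff is a plain value object, it can be persisted alongside the thresholds that were applied, which is what makes the decision replayable.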

Key Principle: The publication boundary is the only thing the optimizer stack can give the executor, and the only thing the executor reads from the optimizer stack. Everything quantum, GPU, learned, or experimental happens on one side of that boundary. Everything microsecond, deterministic, and risk-gated happens on the other.


Part II — The layered policy optimization stack

4. Layers as a coherent stack

Treating “the optimizer” as one layer is a mistake. The optimization work has at least three independent cadences (seconds, minutes, tens of minutes), at least three independent computational shapes (streaming aggregation, constrained QP, constrained quadratic assignment / subset selection), and at least three independent backend requirements (CPU, GPU/CUDA, QPU/QUBO/CUDA-Q). Bundling them into one layer either makes the bundle as slow as the slowest cadence or as fragile as the most exotic backend. Splitting them lets each layer have its own cadence, its own failure budget, and its own backend.

Adaptive Quantum SOR uses seven numbered layers (L0 through L6) plus one intermediate layer, L3.5, added in Phase 6. They are numbered the way a market-data engineer would expect: L0 at the order-time hot path, L6 at the historical replay top. The numbering matters; it gives every cross-layer reference a stable name and forbids upward references (L3 may not read from L1, because that would invert the cadence).

| Layer | Name | Cadence | Computational shape | Backend |
| --- | --- | --- | --- | --- |
| L6 | Historical / synthetic simulation | hourly / manual | replay over scenario catalog | CPU |
| L5 | ML / feature model | 5–60 min / stub | offline training, online inference of signals | Python training, JVM import |
| L4 | Ising / CUDA-Q strategic optimizer | 5–30 min | route-level QUBO / Ising subset selection | reference CPU + CUDA-Q + QPU |
| L3.5 | Cross-parent batch allocation | warm path / event-driven | constrained quadratic assignment | reference CPU + cuOpt + QUBO |
| L3 | CUDA / cuOpt tactical optimizer | 1–5 min | constrained QP over numeric policy parameters | reference Java + CUDA / cuOpt |
| L2 | Streaming feature aggregation | 1–60 sec | rolling stats over fills and venue events | CPU |
| L1 | Policy compiler / robust publisher | on candidate | validation, lint, optional robust selection | CPU |
| L0 | CPU SOR execution | per order | deterministic primitive-array route | CPU, Java-only, allocation-free |

Read top to bottom for cadence (slow → fast) and read bottom to top for trust (executed → speculative). The order-time path (L0) reads exactly one immutable policy compiled and published by L1. L1 reads candidate policies from L3 / L3.5 / L4. L3 / L3.5 / L4 read from L2 (streaming stats), L5 (model signals), and L6 (replay-validated quality scoreboard), and L4 exchanges strategic subset results with L3 and L3.5. Every cross-layer interaction is a write into a fresh snapshot followed by an atomic swap of the snapshot reference; no layer ever holds a reference to another layer’s mutable state.

This separation lets us put GPU/CUDA at L3 and L3.5, CUDA-Q / QUBO / QPU at L4 and L3.5, and Python ML training at L5, without putting any of them on the L0 order-time path. Each backend lives behind a Java-side interface that produces an immutable result for the next layer up to consume, and each backend has an explicit fail-closed contract: when the backend is unavailable, slow, or returns an invalid result, the last approved result from that layer remains in force and execution continues.

The next sections walk each layer in order. We start at L2 (streaming features) and move up. L0 is covered in detail in the prior article and recapped briefly in §22; L1 is covered as a publication discipline in Part V (robust selection).

5. L2 — Streaming feature aggregation

L2 is the heartbeat. Every second (or every five, configurable per deployment), L2 advances a window of rolling statistics over the streams it consumes: market-book updates, execution-report outcomes, route-audit events, child-order state changes, venue-session events. Its output is a dense VenueStatsState indexed by (instrumentId, venueId, regimeId) and exposing a primitive-array surface that the optimizer layers above it can read in microseconds:

public final class VenueStatsState {
    public final int instrumentCount;
    public final int venueCount;
    public final int regimeCount;

    public final int[] latencyNanos;
    public final int[] fillProbabilityBps;
    public final int[] toxicityBps;
    public final int[] rejectRateBps;
    public final int[] feePenaltyTicks;
    public final int[] marketImpactBps;
    // ... and several more
}

The indexing is (instrumentId * venueCount + venueId) * regimeCount + regimeId. The arrays are primitive int because every statistic is expressed as a scaled value (*Bps for basis-point rates, *Nanos for time, *Ticks for tick-counted fees). No double, no BigDecimal, no boxing. The state is dense rather than sparse because optimizer-side scoring loops are O(eligible venues), and a dense layout with primitive offsets is faster than any sparse alternative for the venue counts the design targets (100–500 venues).
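The index arithmetic is worth seeing once in code. The helper below is a minimal sketch; the class name `VenueStatsIndex` is hypothetical, and only the formula itself comes from the text.

```java
// Hypothetical helper that maps (instrumentId, venueId, regimeId) to a flat array offset.
final class VenueStatsIndex {
    private final int instrumentCount;
    private final int venueCount;
    private final int regimeCount;

    VenueStatsIndex(int instrumentCount, int venueCount, int regimeCount) {
        this.instrumentCount = instrumentCount;
        this.venueCount = venueCount;
        this.regimeCount = regimeCount;
    }

    // Total number of cells each statistic array must hold.
    int size() {
        return instrumentCount * venueCount * regimeCount;
    }

    // Regime varies fastest, then venue, then instrument.
    int cell(int instrumentId, int venueId, int regimeId) {
        return (instrumentId * venueCount + venueId) * regimeCount + regimeId;
    }
}
```

With this layout, all regimes of one (instrument, venue) pair are adjacent in memory, and all venues of one instrument occupy one contiguous block, which suits a scoring loop that walks venues for a fixed instrument.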

Three statistics deserve specific attention because they appear in nearly every optimizer scoring formula and because they are the place a naive implementation will silently corrupt later decisions.

Fill probability is bucketed, not averaged. A naive implementation maintains one EWMA of “did the last child fill or not” per venue. That estimator is monotonically pulled toward whichever venue gets the most flow, regardless of whether the recent flow was a representative sample. The correct estimator is bucketed by (instrument, venue, regime, side, size-bucket, price-distance-bucket), each bucket maintaining its own posterior, with smoothing across adjacent buckets where data is thin. Even bucketed, it remains a primitive-array lookup at consumption time; the sophistication is on the update side, which is L2’s job and not the optimizer’s.

Toxicity is measured against multiple horizons. Post-fill mark-out at 50ms, 500ms, and 5s tells different stories. A venue with positive mark-out at 50ms and negative at 5s is faster than the firm’s signal but slower than information flow; a venue with consistently negative mark-out at all horizons is genuinely toxic. L2 maintains all three horizons; the optimizer layers above pick which one their scoring function consumes.

Regime is a per-instrument state, not a global one. Different instruments transition regimes at different times; correlated regime shifts (a volatility burst that hits all instruments simultaneously) are observable as joint transitions but are not enforced as joint by the regime detector itself. The regimeId axis in VenueStatsState lets every higher-layer policy be regime-conditioned without forcing every regime detector to agree about the world.
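For the bucketed fill-probability estimator described above, a minimal sketch follows. It assumes a Beta-Bernoulli posterior per bucket expressed as integer pseudo-counts, which is one reasonable reading of "each bucket maintaining its own posterior" and not necessarily the project's choice. The class name `BucketedFillEstimator` is hypothetical, buckets are addressed by a precomputed flat index, and cross-bucket smoothing is omitted.

```java
// Hypothetical per-bucket fill estimator: posterior mean of a Beta-Bernoulli model, in bps.
final class BucketedFillEstimator {
    private final long[] fills;       // observed fills per bucket
    private final long[] attempts;    // observed child orders per bucket
    private final long priorFills;    // prior pseudo-count of fills
    private final long priorAttempts; // prior pseudo-count of attempts (must be > 0)

    BucketedFillEstimator(int bucketCount, long priorFills, long priorAttempts) {
        if (priorAttempts <= 0 || priorFills < 0 || priorFills > priorAttempts) {
            throw new IllegalArgumentException("prior must satisfy 0 <= priorFills <= priorAttempts, priorAttempts > 0");
        }
        this.fills = new long[bucketCount];
        this.attempts = new long[bucketCount];
        this.priorFills = priorFills;
        this.priorAttempts = priorAttempts;
    }

    // Update side: only the bucket that actually saw the child order moves.
    void record(int bucket, boolean filled) {
        attempts[bucket]++;
        if (filled) {
            fills[bucket]++;
        }
    }

    // Consumption side: integer arithmetic only, result in [0, 10000].
    int fillProbabilityBps(int bucket) {
        long numerator = (fills[bucket] + priorFills) * 10_000L;
        long denominator = attempts[bucket] + priorAttempts;
        return (int) (numerator / denominator);
    }
}
```

A bucket with no observations reports the prior, so heavy flow in one bucket cannot drag the estimate for a bucket that has seen nothing.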

The regime detector itself sits inside L2’s update path, not as a separate layer. Adaptive Quantum SOR ships a deterministic threshold-based detector as the reference (bucketed on rolling realized volatility, spread regime, and a liquidity-stability score), and exposes a RegimeState snapshot that all upper layers read. An ML-driven regime classifier (HMM, change-point detector, or learned classifier from L5’s model artifact pipeline) plugs in behind the same RegimeState interface, with the same fail-closed contract: when the learned classifier is unavailable or returns a low-confidence label, the reference detector’s regime is used and the rejection is recorded. The same discipline that protects L0 from GPU and QPU failures also protects the upper optimizer layers from a bad regime model.
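The fallback rule for a learned regime classifier fits in one function. The sketch below is illustrative: `RegimeFallback`, the `UNAVAILABLE` sentinel, and the confidence threshold in bps are all assumptions introduced here, and the rejection counter stands in for whatever lifecycle event the real system records.

```java
// Hypothetical selector between the reference detector and a learned classifier.
final class RegimeFallback {
    static final int UNAVAILABLE = -1; // sentinel: learned classifier produced no label

    // Returns the learned regime only when it is present and confident enough;
    // otherwise returns the reference regime and counts the rejection.
    static int select(int referenceRegimeId,
                      int learnedRegimeId,
                      int learnedConfidenceBps,
                      int minConfidenceBps,
                      int[] rejectionCounter) {
        if (learnedRegimeId == UNAVAILABLE || learnedConfidenceBps < minConfidenceBps) {
            rejectionCounter[0]++;
            return referenceRegimeId;
        }
        return learnedRegimeId;
    }
}
```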

L2 itself is CPU-only. Streaming aggregation does not benefit from GPU offload at the data rates we care about (tens of thousands of events per second is comfortably inside CPU memory bandwidth), and the determinism cost of a GPU round-trip would not pay for itself. The interesting design decision at L2 is not what hardware to use but what to put in the rolling window: the choices made here (which horizons, which buckets, which decay) propagate up through every optimizer layer and ultimately into every routing decision.

6. L5 — Model signals as compiled artifacts

L5 is where machine learning lives, and where it stops. Adaptive Quantum SOR keeps Python in the training and research path only. The L0 execution layer does not depend on Python, on a model server, on pandas, or on a training process. The discipline that achieves this is the same compiled-artifact pattern L1 uses for policies: Python training jobs produce a versioned, checksummed, schema-validated artifact, and the JVM imports the artifact through a strict validator that rejects anything malformed and preserves the previous approved signals on rejection.

The artifact contract is intentionally narrow:

model_metadata.properties     featureSchemaVersion, modelVersion,
                              predictionFile, predictionChecksumSha256,
                              validationScore, producer
predictions.csv               dense (instrument, venue, regime) → scaled bps signals
validation_metrics.csv        per-cell holdout metrics for diagnostic review

The Java-side ModelArtifactImporter validates the manifest, recomputes the SHA-256 over predictions.csv, rejects on mismatch, parses the predictions into a primitive-array ModelSignalState, validates that every value is within configured bounds (non-NaN, non-infinite, non-negative where required, bounded by configured caps where applicable), and either accepts the new signal state or returns the previous approved one with an explicit rejection reason recorded in lifecycle and audit events.
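The checksum and bounds steps can be sketched with the standard `java.security.MessageDigest` API. This is not the project's `ModelArtifactImporter`; the class `ArtifactChecksum`, the method `importOrKeep`, and the single `capBps` bound are hypothetical simplifications, and manifest parsing and rejection-reason recording are left out.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical accept-or-keep-previous import step.
final class ArtifactChecksum {

    // Lowercase hex SHA-256 of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns the candidate signals only if the checksum matches and every value
    // lies in [0, capBps]; otherwise the previous approved signals stay in force.
    static int[] importOrKeep(byte[] predictionBytes,
                              String expectedSha256,
                              int[] candidateBps,
                              int capBps,
                              int[] previousApprovedBps) {
        if (!sha256Hex(predictionBytes).equalsIgnoreCase(expectedSha256)) {
            return previousApprovedBps;
        }
        for (int value : candidateBps) {
            if (value < 0 || value > capBps) {
                return previousApprovedBps;
            }
        }
        return candidateBps;
    }
}
```

Because signals are already scaled integers, the NaN and infinity checks the text mentions belong to the parsing step that converts CSV fields, before values reach a check like this one.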

Three things are intentionally not in this contract.

There is no model registry on the JVM side. The artifact is the registry: the metadata file names its own version, the previous version is whatever was last imported successfully, and rollback means reimporting the older artifact. A real production registry is a fine future addition; in Adaptive Quantum SOR it would be a cold-path service that the importer talks to, never something the execution path or even the optimizers consume directly.

There is no online inference. The model produces signals offline (or on a periodic batch cadence), the signals are written to a versioned artifact, and the JVM imports the artifact. Online inference inside the JVM would couple the execution path’s failure budget to Python’s failure budget; we are not doing that. If a future signal genuinely needs lower latency than 5–60 minutes, the right answer is to ship a JVM-resident inference path with its own benchmark and zero-allocation discipline, not to import Python at runtime.

There is no implicit feature staleness exemption. A model trained on yesterday’s features and applied to today’s market is silently fragile. L5 records the training-data window in metadata, L1’s validator checks that the window is recent enough for the configured drift policy, and stale model artifacts are rejected at import. The previous approved signals remain in force; the optimizer cycle that consumes them is recorded with the older signal version, which is honest about what the policy was actually informed by.

The signals themselves are coarse on purpose. A typical artifact carries one venue-quality score per (instrument, venue, regime) cell, sometimes broken out into fill, toxicity, slippage, and regime sub-signals. This is enough to inform the tactical and strategic optimizers without producing a signal so dense that small training-data fluctuations cause large policy thrashing downstream. Fine-grained signals (per-size, per-time-of-day, per-counterparty-cluster) are interesting research targets but should not enter L4 / L3 inputs without explicit policy-thrashing controls at L1.

7. L3 — Tactical optimization and the CUDA boundary

L3 is the first layer where GPU acceleration becomes real. It runs every one to five minutes (one minute is a reasonable default) and tunes the continuous parameters of the policy candidate: per-(instrument, venue, regime) venue weights, latency penalties, toxicity penalties, fill-probability scores, reject penalties, queue-survival adjustments, slippage and market-impact penalties, child-size limits, and participation caps. It reads the streaming venue statistics from L2, the imported model signals from L5, the replay-validated quality scoreboard from L6, and the current strategic subset from L4.

It writes a TacticalPolicyResult: dense primitive arrays of weights, penalties, and limits aligned with the strategic subset. The compiler at L1 turns the strategic subset and the tactical result into a MutablePolicyCandidate, which becomes a compiled SorPolicy after lint and validation.

The shape of the tactical optimization is well-suited to GPU offload. For each (instrument, regime, urgency) route key, the objective is a constrained QP-like form: minimize a weighted sum of fee-net cost, toxicity penalty, latency penalty, market-impact penalty, slippage penalty, and queue-survival penalty, subject to the strategic subset’s eligibility, the configured risk envelope, fee-tier constraints, and venue-side capacity bounds. The problem is independent per route key (which lets us batch across route keys easily), and the inner work per route key is a small matrix computation over the subset’s eligible venues. Replicated across hundreds of (instrument, regime, urgency) combinations, the total work fits a GPU naturally.

But the JVM is the system of record. The optimizer interface lives in Java, the result is consumed in Java, the failure handling is in Java, and the audit lineage is in Java. The CUDA work happens on the other side of an intentionally narrow native boundary.

7.1 The native boundary, in shape

The boundary has four hard requirements:

A stable, versioned ABI. The Java side and the C++/CUDA side agree on a fixed input layout (TacticalOptimizerNativeInput) and a fixed output layout (TacticalOptimizerNativeOutput), both expressed as direct-buffer-friendly C structs. The layout is documented in a header (tactical_optimizer_layout.h) that both sides compile against. Layout drift between versions is caught by a layout test on the native side and a CTest target that Gradle runs as part of check.

Direct buffers, not pinned objects. The Java side passes java.nio.ByteBuffer direct buffers across the boundary; the C++ side reads them as raw pointers. No JVM heap object crosses the JNI line; no GC pressure interacts with native execution; no array-marshalling cost contaminates the optimizer’s runtime profile.

Primitive status codes, not exceptions. The native function returns a TacticalOptimizerNativeStatus enum:

OK
LIBRARY_MISSING
INVALID_INPUT
GPU_UNAVAILABLE
TIMEOUT
OUTPUT_INVALID
NATIVE_FAILURE

Each is a primitive int on the wire, mapped to a Java enum on the JVM side. There is no native exception thrown across the boundary, no std::runtime_error to translate, no UTF-8 conversion of error strings on the warm path. A failure is a status code, recorded as a counter and a lifecycle event, and the layer above L3 reads the latest approved tactical result instead.

A fail-closed fallback. When LIBRARY_MISSING, GPU_UNAVAILABLE, TIMEOUT, OUTPUT_INVALID, or NATIVE_FAILURE is returned, the bridge does not crash the JVM, does not silently substitute random output, and does not block the publication cycle for the next cadence interval. It records the failure, exposes it through CudaOptimizerHealth and CudaOptimizerMetrics, and TacticalPolicyOptimizer falls back to the deterministic Java reference implementation (CudaTacticalOptimizerStub) that ships alongside the native backend. The reference is allowed to be slower; it is required to be correct.
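The status-code and fallback requirements combine into a small pattern, sketched below. The status names come from the text; everything else is an assumption made for illustration: the wire codes are taken to equal the enum ordinals, unknown codes are mapped to NATIVE_FAILURE, and `NativeStatus`, `TacticalBackend`, and `FailClosedTactical` are hypothetical names rather than the project's types.

```java
// Status names from the text; the ordinal-equals-wire-code mapping is an assumption.
enum NativeStatus {
    OK, LIBRARY_MISSING, INVALID_INPUT, GPU_UNAVAILABLE, TIMEOUT, OUTPUT_INVALID, NATIVE_FAILURE;

    private static final NativeStatus[] VALUES = values();

    // Never throws: an unrecognized code is treated as a native failure.
    static NativeStatus fromWire(int code) {
        return (code >= 0 && code < VALUES.length) ? VALUES[code] : NATIVE_FAILURE;
    }
}

// Hypothetical backend shape: fills outWeightBps and returns a primitive status code.
interface TacticalBackend {
    int optimize(int[] inputBps, int[] outWeightBps);
}

final class FailClosedTactical {
    private final TacticalBackend nativeBackend;
    private final TacticalBackend javaReference;
    int fallbackCount; // exposed as a metric in a real system

    FailClosedTactical(TacticalBackend nativeBackend, TacticalBackend javaReference) {
        this.nativeBackend = nativeBackend;
        this.javaReference = javaReference;
    }

    int[] run(int[] inputBps) {
        int[] nativeOut = new int[inputBps.length];
        NativeStatus status = NativeStatus.fromWire(nativeBackend.optimize(inputBps, nativeOut));
        if (status == NativeStatus.OK) {
            return nativeOut;
        }
        // Any non-OK status: count it and recompute on the deterministic reference.
        fallbackCount++;
        int[] referenceOut = new int[inputBps.length];
        javaReference.optimize(inputBps, referenceOut);
        return referenceOut;
    }
}
```

The native output buffer is discarded on any non-OK status, so a partially written result can never leak into the tactical result.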

The MVP CUDA scoring kernel is intentionally simple (a bounded sum-and-clamp on a few input bps values), enough to prove the ABI, the buffer layout, the status codes, the metrics surface, and the fallback path end to end. Real cuOpt-backed QP solving or a hand-rolled CUDA kernel is the natural next step once the boundary is trusted; the boundary itself does not care which backend sits behind it, because everything from the input layout to the output validation is unchanged.

7.2 Why the boundary is the discipline

The interesting claim is not that Adaptive Quantum SOR runs CUDA. The interesting claim is that it runs CUDA without making CUDA load-bearing for the order-time path. Three independent failure modes are absorbed entirely on the warm path and never reach L0:

  1. The native library is missing. Production hosts that do not ship a CUDA toolkit produce LIBRARY_MISSING. The bridge falls back to the Java reference; the cycle completes; the policy publishes; the order-time path is unaffected.

  2. The GPU is busy or unavailable. A GPU_UNAVAILABLE or TIMEOUT status produces the same fallback. The cycle records the failure for operator review; the next cycle retries; the policy in force remains the last approved one.

  3. The native output is invalid. OUTPUT_INVALID is the case we protect most aggressively. The Java validator on the result rejects anything outside configured bounds (negative weights, weights summing outside the valid range, child-size limits inconsistent with parent quantities), the rejection is recorded with the run ID and the specific failed check, and the previous approved tactical result remains in force.

In each case the L0 hot path sees no change. The order-time decision continues to read the most recent approved SorPolicy and route against it. The optimizer side absorbs the GPU’s failure surface entirely.
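The OUTPUT_INVALID checks listed above can be sketched as a validator that names the first failed check. The class `TacticalResultValidator`, the check names, and the configured sum range are hypothetical; the child-size check is simplified here to "minimum not above maximum" rather than a comparison against parent quantities.

```java
// Hypothetical validator for a native tactical result.
final class TacticalResultValidator {

    // Returns null when the result is acceptable, else the name of the first failed check.
    static String firstFailedCheck(int[] weightBps,
                                   long[] minChildQty,
                                   long[] maxChildQty,
                                   int minWeightSumBps,
                                   int maxWeightSumBps) {
        if (weightBps.length != minChildQty.length || weightBps.length != maxChildQty.length) {
            return "SHAPE_MISMATCH";
        }
        long sum = 0;
        for (int w : weightBps) {
            if (w < 0) {
                return "NEGATIVE_WEIGHT";
            }
            sum += w;
        }
        if (sum < minWeightSumBps || sum > maxWeightSumBps) {
            return "WEIGHT_SUM_OUT_OF_RANGE";
        }
        for (int i = 0; i < minChildQty.length; i++) {
            if (minChildQty[i] < 0 || minChildQty[i] > maxChildQty[i]) {
                return "CHILD_QTY_INCONSISTENT";
            }
        }
        return null;
    }
}
```

Returning the specific check name is what allows the rejection to be recorded "with the run ID and the specific failed check" instead of as a bare failure.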

The corollary is operational: GPU/CUDA availability becomes an observability concern, not a correctness concern. The firm can see in real time which fraction of recent tactical cycles ran natively versus on the fallback; the firm can set alarms on the fallback rate; the firm can investigate GPU degradation without anyone needing to halt trading. This is the boundary that lets GPU integration stay honest, and the same boundary lets a future cuOpt or QPU backend slot in without ever touching L0.

8. L3.5 — Cross-parent batch allocation with quadratic coupling

[Figure 3: two side-by-side scenarios. Left: both parents independently pick the same best-displayed venue and pay impact twice. Right: a joint solver splits the parents across multiple venues for lower aggregate impact.]

Figure 3. Why quadratic coupling matters. Independent per-parent routing converges on the displayed-best venue. The joint solver diversifies, paying linear cost in exchange for lower quadratic impact and leakage.

Most of the SOR design space treats each parent independently. Phases 1 through 5 of Adaptive Quantum SOR do exactly this. They tune per-(instrument, regime, urgency) policy parameters and let L0 route each parent against the resulting policy without considering its sibling parents. For most workloads, this is correct: parents are largely independent, the policy already captures most of the cross-parent effects through aggregate venue-stat updates, and the engineering cost of solving a joint problem on every parent arrival is not justified.

Phase 6 adds a separate layer (L3.5) for the cases where independence fails. Three specific cases:

Same-venue self-impact. Two concurrent parent orders sized to take meaningful liquidity at the same venue will interact: parent A’s child fills push the price unfavorably for parent B’s child at the same venue. Independent per-parent routing, even with a perfect tactical policy, can route both parents to the same best venue and pay the impact twice. The joint problem can split them across the two best venues at a smaller aggregate impact cost than either parent’s independent optimum.

Shared venue capacity. Some venues impose participation caps in basis points of recent volume, or rate limits on child-order submission, or capacity bounds on resting orders. These caps apply at the firm level, not the per-parent level. Two parents that, independently routed, would each use 8% of venue V’s capacity together reach 16% and exceed the 12% cap. Independent routing has no way to allocate the cap fairly; a joint allocator does.

Correlated venue information leakage. Venues V and W are operated by related entities, or share a common information processor, or correlate strongly in displayed-quote movements. A parent that splits across V and W leaks more information than the same notional split across V and an uncorrelated venue. The leakage is pairwise; the cost depends on the simultaneous use of correlated venues by the same parent or by multiple parents in the same batch.

L3.5 turns these into one optimization. Let q[p,v] be the quantity from parent p allocated to venue v. The objective is:

minimize
    Σ linearCost[p,v] · q[p,v]
  + Σ sameVenuePairCost[p,r,v] · min(q[p,v], q[r,v])          # p, r: parent indices
  + Σ venueCorrelationPairCost[v,w] · min(q[p,v], q[r,w])

subject to
    Σ_v q[p,v] = parentQuantity[p]    for each parent p
    Σ_p q[p,v] ≤ venueCapacity[v]     for each venue v
    Σ_p q[p,v] ≤ participationCap[v]  for each venue v

The first term is what an independent per-parent optimizer would compute. The second and third are what only a joint optimizer can see: the same-venue pair term penalizes two parents using the same venue, and the cross-venue correlation pair term penalizes parents using correlated venues. Both terms couple pairs of decision variables; under a unit-quantity binary encoding, min(a, b) reduces to the product a · b, so both become quadratic and the whole problem is a constrained quadratic assignment problem (QAP).

8.1 When to invoke the joint solver

The architectural question is separate from the mathematical one: even granting that the joint problem is real, when should the system pay the cost of solving it? Running L3.5 on every parent arrival is wasteful (most parents are independent and small enough that the joint optimum equals the per-parent optimum) and conflates the cadence with L0’s hot path. Never running it leaves the quadratic effects on the table when they matter most.

Adaptive Quantum SOR treats L3.5 as event-driven on the warm path, with three triggers:

Material parent-order overlap. A scheduler tracks the active parent set: parents in WORKING state whose horizons overlap the current cycle window. The joint solver runs when the active set changes meaningfully: a new large parent enters, a parent terminates with non-trivial residual, or the aggregate active notional crosses a configured threshold. “Meaningful” is configured per strategy, not hard-coded; small noise in the active set does not trigger a recompute.

Material market-state change. The same regime detector that feeds the upper layers (see §5) is an L3.5 trigger source. A regime transition for any instrument in the active set re-evaluates the joint problem, because regime change can flip same-venue impact curves and cross-venue correlation estimates within a single tick range.

Configurable periodic recompute. A floor cadence (default: once per minute, suppressible to once per five minutes for low-volatility regimes) ensures the joint plan does not drift further than the cadence interval from the active state, even when neither overlap nor regime changes have triggered a recompute. This is the architectural equivalent of a watchdog timer; it bounds the staleness of the active batch plan independent of triggering heuristics.

Two operational guards complete the triggering policy. A cooldown prevents a flapping active set from triggering many joint recomputes per second; a recompute starts a cooldown window during which only the periodic recompute can override. A budget caps the absolute joint-solve rate per strategy per minute, with overflow falling through to the previous approved plan; this protects the warm-path executor from being overwhelmed by a misconfigured trigger.
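The two guards can be sketched as a single admission gate. This is an assumption-laden illustration: `JointSolveGate` is a hypothetical name, time is passed in explicitly so the logic stays deterministic, the budget uses a simple fixed one-minute window, and the periodic recompute is allowed to bypass the cooldown but not the budget.

```java
// Hypothetical admission gate for joint recomputes: cooldown plus per-minute budget.
final class JointSolveGate {
    private static final long MINUTE_NANOS = 60_000_000_000L;

    private final long cooldownNanos;
    private final int maxSolvesPerMinute;
    private long windowStartNanos;
    private int solvesInWindow;
    private long lastSolveNanos;
    private boolean hasSolved;

    JointSolveGate(long cooldownNanos, int maxSolvesPerMinute, long startNanos) {
        this.cooldownNanos = cooldownNanos;
        this.maxSolvesPerMinute = maxSolvesPerMinute;
        this.windowStartNanos = startNanos;
    }

    // Returns true if a recompute may start now. A false result means the
    // previous approved plan stays in force.
    boolean tryAcquire(long nowNanos, boolean periodic) {
        if (nowNanos - windowStartNanos >= MINUTE_NANOS) {
            windowStartNanos = nowNanos; // start a fresh budget window
            solvesInWindow = 0;
        }
        if (solvesInWindow >= maxSolvesPerMinute) {
            return false; // budget exhausted, applies to every trigger
        }
        if (!periodic && hasSolved && nowNanos - lastSolveNanos < cooldownNanos) {
            return false; // event-driven trigger inside the cooldown window
        }
        hasSolved = true;
        lastSolveNanos = nowNanos;
        solvesInWindow++;
        return true;
    }
}
```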

The corollary: the joint plan is always available when L1 compiles a new policy candidate, but the plan that is available may be tens of seconds to a few minutes old. That is fine. The joint plan informs the next policy’s venue weights and participation caps, not the order-time decision. The cadence asymmetry is the architectural feature, not a defect.

8.2 Reference solver plus backend boundary

Quadratic assignment is NP-hard in general. The relevant question is not “is QAP hard” but “is the instance we actually solve hard at the scale we actually need?” For the parent counts and venue counts a single firm sees in a single warm-path interval (single-digit parents, single-digit venues per parent’s eligible subset, tens of venues total), a deterministic enumeration over feasible unit-quantity allocations is fast enough and exact enough.

Adaptive Quantum SOR ships DeterministicBatchVenueAllocator as that reference solver. It enumerates feasible unit-quantity allocations, evaluates the full quadratic objective on each, and returns the deterministic minimum (ties broken by lexicographic ordering). It is the reference in two senses: it is the always-available correctness oracle, and it is the comparison baseline that every other backend must match on small problems.
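To make the enumeration concrete, here is a reduced sketch of a unit-quantity enumerator. It is not `DeterministicBatchVenueAllocator`: the class `UnitEnumerationAllocator` is hypothetical, the same-venue pair cost is simplified to one coefficient per venue shared by all parent pairs, and the cross-venue correlation term and the participation cap are omitted. At least one parent and one venue are assumed.

```java
// Hypothetical exhaustive allocator over unit quantities with a same-venue pair penalty.
final class UnitEnumerationAllocator {
    private final int[] parentUnits;        // units each parent must place
    private final int[] venueCapacity;      // max units per venue across all parents
    private final long[][] linearCost;      // [parent][venue], cost per unit
    private final long[] sameVenuePairCost; // [venue], cost per overlapping unit of a parent pair
    private final int[][] q;                // working allocation
    private int[][] best;                   // null until a feasible plan is found
    private long bestCost = Long.MAX_VALUE;

    UnitEnumerationAllocator(int[] parentUnits, int[] venueCapacity,
                             long[][] linearCost, long[] sameVenuePairCost) {
        this.parentUnits = parentUnits;
        this.venueCapacity = venueCapacity;
        this.linearCost = linearCost;
        this.sameVenuePairCost = sameVenuePairCost;
        this.q = new int[parentUnits.length][venueCapacity.length];
    }

    int[][] solve() {
        search(0, 0, parentUnits[0]);
        return best;
    }

    long bestCost() {
        return bestCost;
    }

    // Visits allocations in lexicographic order of (q[0][0], q[0][1], ..., q[P-1][V-1]).
    private void search(int p, int v, int remaining) {
        if (p == parentUnits.length) {
            evaluate();
            return;
        }
        if (v == venueCapacity.length - 1) {
            q[p][v] = remaining; // last venue takes whatever is left
            int next = p + 1;
            search(next, 0, next < parentUnits.length ? parentUnits[next] : 0);
            return;
        }
        for (int units = 0; units <= remaining; units++) {
            q[p][v] = units;
            search(p, v + 1, remaining - units);
        }
    }

    private void evaluate() {
        long cost = 0;
        for (int v = 0; v < venueCapacity.length; v++) {
            int used = 0;
            for (int p = 0; p < parentUnits.length; p++) {
                used += q[p][v];
                cost += linearCost[p][v] * q[p][v];
                for (int r = p + 1; r < parentUnits.length; r++) {
                    cost += sameVenuePairCost[v] * Math.min(q[p][v], q[r][v]);
                }
            }
            if (used > venueCapacity[v]) {
                return; // infeasible: venue capacity exceeded
            }
        }
        if (cost < bestCost) { // strict '<' keeps the lexicographically first minimum
            bestCost = cost;
            best = new int[q.length][];
            for (int p = 0; p < q.length; p++) {
                best[p] = q[p].clone();
            }
        }
    }
}
```

With two one-unit parents, per-unit costs of 1 and 2 at two venues, and a pair cost of 5, each parent on its own prefers the cheaper venue, but sharing it costs 1 + 1 + 5 = 7 while splitting costs 1 + 2 = 3. The enumerator returns the split.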

Larger problems (more parents, larger venue subsets, finer unit quantization) quickly exceed what enumeration can do. The BatchAllocationBackend interface is the seam where larger backends plug in:

public interface BatchAllocationBackend {
    BatchAllocationBackendResult allocate(BatchAllocationProblem problem);
}

Three backend families fit naturally here, each with its own engineering trade-offs:

cuOpt. NVIDIA’s combinatorial optimization library handles constrained MILP / QAP-shape problems efficiently on GPU. The integration is a standard JNI bridge: pack the parent-venue costs, the pair-cost matrices, the capacity constraints, and the participation bounds into direct buffers, hand off to cuOpt, validate the returned plan against the reference solver on every small enough problem, and accept it for larger problems with a configured trust margin.

QUBO / Ising. The problem maps cleanly to a QUBO formulation: binary variables x[p,v,k] indicating whether parent p allocates the k-th unit to venue v, with quadratic penalty terms encoding the pair costs and the constraints encoded via Lagrangian penalty multipliers. A QUBO/Ising solver (classical simulated-annealing, parallel-tempering, or a dedicated annealing accelerator) operates on the same Java-side problem representation through the same backend interface, with the same fail-closed contract.

A QPU or CUDA-Q hybrid solver. Once the QUBO formulation is in place, the same problem can be handed to a quantum-or-hybrid solver via CUDA-Q: a variational quantum eigensolver, a quantum annealer, or a hybrid classical-quantum optimization loop. The backend interface does not change; the validation against the reference solver does not change; the fail-closed contract does not change. What changes is which physical resource solves the QUBO instance on the other side of the boundary.

Three properties make this future-proofing real rather than aspirational:

The reference solver is always present. Any backend’s result is validated against the reference on every problem small enough to enumerate. A backend that produces invalid results (infeasible plans, objective-value worse than the reference, non-deterministic outputs on identical inputs) is rejected and its result is not approved. The latest approved BatchVenueAllocationPlan remains in force, the next cycle retries, and the warm path absorbs the failure without disturbing the executor.

The native bridge fails closed. BatchAllocatorNativeBridge is the placeholder seam for native backends. Unavailable backends produce BACKEND_UNAVAILABLE and fall through to the reference. Timeouts produce TIMEOUT and fall through. Invalid native results produce INVALID_RESULT and fall through. The fall-through path is the reference solver on small problems and the previous approved plan on larger ones.

The execution path never reads the batch allocator. L0 routes each parent against the active SorPolicy, which encodes the route-key-level policy that L3 tuned and L4 selected the subset for. The batch allocation plan is an additional warm-path input the policy compiler may use to bias venue weights or participation caps in the next cycle; it does not enter the order-time decision directly. Failure of the batch allocator at any backend, in any failure mode, never blocks L0.

The architectural lesson is the same one §7 made for L3: the value of putting GPU and quantum work behind a stable backend boundary is that you can change the backend without touching anything else. Every claim about quantum advantage, GPU speedup, or solver correctness is provable against the reference on small problems and observable as a fallback rate on larger ones. We do not claim quantum advantage; we claim a quantum-ready seam, which is a much smaller and much more defensible claim.

9. L4 — Strategic venue subset selection, QUBO/Ising, and CUDA-Q

L4 is where the design is most explicitly built for a quantum-style backend. It runs every 5–30 minutes (5 is a reasonable default) and answers, per (instrument, regime, urgency) route key, exactly one question: which subset of the configured venue universe should be even eligible for this route? The tactical layer below it tunes continuous parameters within that subset; the executor below that routes within the same subset. L4 is the structural layer.

The answer is a constrained subset-selection problem. Let x[v] ∈ {0,1} indicate whether venue v is eligible. The objective is:

minimize
    Σ linearCoefficient[v] · x[v]
  + Σ pairCoefficient[u,v] · x[u] · x[v]

subject to
    minSubsetSize ≤ Σ x[v] ≤ maxSubsetSize

where the linear coefficient is the negative of a per-venue quality score, so a better venue contributes lower energy, and the pair coefficient penalizes co-selection of toxic-or-correlated venue pairs. The cardinality bounds prevent both empty subsets and over-large subsets that defeat the structural-eligibility purpose.

This is a QUBO. The binary-variable formulation is exact, the objective is quadratic in the variables, and the constraints (cardinality) can be encoded into the objective via a quadratic cardinality penalty. The same physics-flavor problem, finding the minimum-energy spin configuration of an Ising system, is what every quantum annealer, variational solver, and QUBO accelerator on the market is designed to solve.
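The penalty encoding can be written out for the simplest case. The expansion below assumes the special case minSubsetSize = maxSubsetSize = K and a penalty weight λ > 0, both introduced here for illustration. It uses the identity x_v² = x_v, which holds for binary variables.

```latex
\lambda \Bigl( \sum_{v} x_v - K \Bigr)^{2}
  = \lambda \Bigl[ (1 - 2K) \sum_{v} x_v
      + 2 \sum_{u < v} x_u x_v
      + K^{2} \Bigr]
```

The first sum adds λ(1 − 2K) to every linear coefficient, the second adds 2λ to every pair coefficient, and the constant λK² does not affect the minimizer. A true range constraint (minSubsetSize < maxSubsetSize) needs additional binary slack variables to turn the inequality into an equality before the same expansion applies.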

9.1 Why pair coefficients matter

A naive strategic layer chooses the top-K venues by quality score and stops. The result is independent-ranking subset selection: each venue’s contribution is evaluated in isolation, the K-best are taken, the cardinality constraint is automatically satisfied. No QUBO needed; a sort suffices.

This is wrong, in the same way that independent per-parent routing was wrong in §8. The substantive cases are the cases where the combination of selected venues matters. Two venues that are individually high-quality but are operated by related entities should not both be eligible for the same route, because their joint information leakage is greater than the sum of their independent costs. Two venues that have correlated outage histories (sharing infrastructure, sharing a market-data vendor, sharing a settlement counterparty) should not both be in a subset whose redundancy story depends on them being independent. Two venues whose recent reject histories are both elevated should not both be picked, because their joint contribution to expected route failure is worse than either individually would suggest.

These are pairwise effects. They cannot be expressed by a linear objective. They are what the pair coefficients are for, and they are what makes the strategic layer a real QUBO rather than a glorified top-K sort.

Adaptive Quantum SOR builds the pair coefficients from streaming feature stats. For each pair (u, v), the pair penalty is:

penalty(u,v) =
    sharedToxicityRisk(u, v)         # both high-toxicity
  + sharedRejectRisk(u, v)           # both elevated reject-rate
  + sharedMarketImpactRisk(u, v)     # both elevated impact
  + profileDistancePenalty(u, v)     # similar venue profiles = correlated risk

The first three are bilinear terms (“both high → joint penalty”); the fourth is a similarity term that penalizes co-selection of venues with near-identical observable profiles. All four are computed from VenueStatsState at cycle time and bundled into the QUBO instance.
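One way the four terms could be combined is sketched below. This is a minimal illustration under stated assumptions, not the project's formula: the VenueFeatures fields, the basis-point scaling, the product form of the bilinear terms, and the L1-distance similarity term are all hypothetical stand-ins for whatever VenueStatsState actually exposes.

```java
// Hypothetical sketch: field names, scaling, and term shapes are assumptions.
final class PairPenaltySketch {
    /** Hypothetical per-venue feature snapshot, all values in basis points. */
    static final class VenueFeatures {
        final int toxicityBps;
        final int rejectBps;
        final int impactBps;

        VenueFeatures(int toxicityBps, int rejectBps, int impactBps) {
            this.toxicityBps = toxicityBps;
            this.rejectBps = rejectBps;
            this.impactBps = impactBps;
        }
    }

    /** Bilinear "both high" term: product of two bps values, rescaled back to bps. */
    static long bilinear(int aBps, int bBps) {
        return ((long) aBps * bBps) / 10_000L;
    }

    /** Similarity term: largest when the two profiles are identical, zero when far apart. */
    static long profileDistancePenalty(VenueFeatures u, VenueFeatures v, int maxPenaltyBps) {
        long distance = Math.abs(u.toxicityBps - v.toxicityBps)
                      + Math.abs(u.rejectBps - v.rejectBps)
                      + Math.abs(u.impactBps - v.impactBps);
        return Math.max(0L, maxPenaltyBps - distance);
    }

    /** Sum of the three bilinear terms and the similarity term; symmetric in (u, v). */
    static long penalty(VenueFeatures u, VenueFeatures v, int maxSimilarityPenaltyBps) {
        return bilinear(u.toxicityBps, v.toxicityBps)
             + bilinear(u.rejectBps, v.rejectBps)
             + bilinear(u.impactBps, v.impactBps)
             + profileDistancePenalty(u, v, maxSimilarityPenaltyBps);
    }
}
```

The property worth noticing is that each bilinear term vanishes when either venue scores zero on that feature, so a single noisy venue does not create a pair penalty by itself; only co-selection of two elevated venues does.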

9.2 The CUDA-Q backend, with a CPU reference

Today, the production-grade default at L4 is a classical exhaustive solver on small problems. Adaptive Quantum SOR’s C++ reference (sor_cudaq_strategic_optimize) enumerates all 2^n venue masks for n ≤ 30, evaluates the QUBO energy on each, applies the cardinality constraint as a hard filter, and returns the minimum-energy feasible mask with deterministic tie-breaking. At n = 30 this is ~1B configurations, which is on the edge of what is feasible at the 5–30 minute cadence; at n ≤ 20 it is comfortably under a second.
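The enumeration described above can be sketched in a few lines. The shipped reference is C++; the Java version below is only an illustration of the technique, and several details are assumptions: the class and method names are hypothetical, the pair matrix is read from its upper triangle only, and ties are broken toward the numerically smallest mask.

```java
// Illustrative Java sketch of an exhaustive QUBO subset solver (not the shipped C++ reference).
final class ExhaustiveSubsetSolver {
    /**
     * Returns the minimum-energy mask whose popcount lies in [minSize, maxSize],
     * or -1 if no mask satisfies the bounds. Bit v set means venue v is eligible.
     */
    static int solve(int[] linear, int[] pairRowMajor, int minSize, int maxSize) {
        int n = linear.length;
        if (n < 1 || n > 30 || pairRowMajor.length != n * n) {
            throw new IllegalArgumentException("expected 1..30 venues and an n*n pair matrix");
        }
        int bestMask = -1;
        long bestEnergy = Long.MAX_VALUE;
        for (int mask = 0; mask < (1 << n); mask++) {
            int size = Integer.bitCount(mask);
            if (size < minSize || size > maxSize) {
                continue; // cardinality applied as a hard filter, not a penalty
            }
            long e = energy(mask, linear, pairRowMajor, n);
            // Strict '<' plus ascending enumeration: ties go to the smallest mask.
            if (e < bestEnergy) {
                bestEnergy = e;
                bestMask = mask;
            }
        }
        return bestMask;
    }

    /** QUBO energy: selected linear terms plus upper-triangle pair terms. */
    static long energy(int mask, int[] linear, int[] pairRowMajor, int n) {
        long e = 0;
        for (int u = 0; u < n; u++) {
            if (((mask >>> u) & 1) == 0) {
                continue;
            }
            e += linear[u];
            for (int v = u + 1; v < n; v++) {
                if (((mask >>> v) & 1) != 0) {
                    e += pairRowMajor[u * n + v];
                }
            }
        }
        return e;
    }
}
```

With made-up inputs linear = {-100, -90, -80} and a pair penalty of 60 between venues 0 and 1 (5 for the other pairs), a size-2 solve returns mask 0b101 with energy -175 rather than the top-2-by-score mask 0b011 with energy -130. That is the §9.1 argument in miniature: the pair term changes the answer a sort would give.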

Two backend integration paths sit behind the same Java-side interface:

A CUDA-Q variational backend. CUDA-Q lets a quantum or hybrid solver be invoked from the same C++ layer. The strategic problem is small enough (typical venue universes per route key are well under 30) that a variational quantum eigensolver or QAOA-style hybrid loop is a credible benchmark target rather than an aspirational claim. The reference exhaustive solver gives every CUDA-Q result a ground truth to compare against, and that comparison is what makes claims about quantum performance defensible.

A QUBO / Ising annealer. Simulated annealing, parallel tempering, or a dedicated annealing accelerator (D-Wave-style) operates on the same QUBO instance. The bridge passes the linear and pair coefficient arrays in the same layout; the backend returns a binary mask plus an objective energy; the Java side validates the result against the reference on small problems and accepts it on larger ones with the same fail-closed discipline as L3 and L3.5.

The native bridge surface mirrors L3’s:

SOR_STRATEGIC_OK
SOR_STRATEGIC_BACKEND_UNAVAILABLE
SOR_STRATEGIC_INVALID_INPUT
SOR_STRATEGIC_TIMEOUT
SOR_STRATEGIC_INVALID_RESULT
SOR_STRATEGIC_NATIVE_FAILURE

Every status code maps to a Java enum; every failure produces an audit record; every fallback uses the latest approved strategic result. The strategic audit lineage records, for every accepted result: optimizer run ID, optimizer type, input snapshot ID, model signal version, current policy version, route key coordinates, the linear and pair coefficients of the QUBO instance, the subset size bounds, the selected venue IDs, the result version, and the policy diff reference. Rejected attempts record the rejection reason without replacing the latest approved result.

A subtle but consequential design choice: accepted strategic results and rejected strategic attempts are stored separately. The tactical optimizer consumes the latest approved result, not merely the latest attempted result. A failed CUDA-Q run does not retract the previous accepted subset; a rejected backend output does not poison the next cycle. This separation lets a failure on the strategic side stay absorbable without cascading into the tactical layer or the policy compiler.
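The separation can be illustrated with a small store that keeps the two streams apart. The class, its fields, and the version counter are hypothetical; the real lineage record carries far more (run ID, snapshot ID, coefficients, and so on), which this sketch omits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: accepted results and rejected attempts live in separate structures.
final class StrategicResultStore {
    static final class Approved {
        final long resultVersion;
        final int venueMask;

        Approved(long resultVersion, int venueMask) {
            this.resultVersion = resultVersion;
            this.venueMask = venueMask;
        }
    }

    static final class Rejected {
        final long runId;
        final String reason;

        Rejected(long runId, String reason) {
            this.runId = runId;
            this.reason = reason;
        }
    }

    private Approved latestApproved;                       // what the tactical layer reads
    private final List<Rejected> rejected = new ArrayList<>();
    private long nextVersion = 1;

    /** Installs a validated result as the new latest approved subset. */
    Approved accept(int venueMask) {
        latestApproved = new Approved(nextVersion++, venueMask);
        return latestApproved;
    }

    /** Records a rejected attempt for audit; the latest approved result is untouched. */
    void reject(long runId, String reason) {
        rejected.add(new Rejected(runId, reason));
    }

    Approved latestApproved() {
        return latestApproved;
    }

    List<Rejected> rejectedAttempts() {
        return Collections.unmodifiableList(rejected);
    }
}
```

Because reject(...) never writes latestApproved, a run of failed backend calls can grow the audit list indefinitely without changing what the tactical optimizer sees.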

9.3 What L4 is, and is not, claiming

What L4 is claiming: the strategic subset-selection problem is a real QUBO with non-trivial pair structure, the backend boundary lets QUBO / Ising / CUDA-Q / QPU backends plug in without changing anything above or below, and the reference exhaustive solver gives every backend a ground truth to validate against.

What L4 is not claiming: that any specific quantum backend produces a useful speedup at production-relevant instance sizes today, that quantum annealing always finds the global optimum, or that the difference between an exhaustive classical solution and a near-optimal heuristic one is even economically significant in the routing decisions the policy informs. These are empirical questions that depend on the firm’s venue universe, its actual pair structure, and its tolerance for sub-optimal subsets. We are honest about the questions; the answers are workload-specific.

Key Principle: The right place for quantum-style optimization in a deterministic trading stack is at a layer whose cadence (5–30 minutes), output (a stable subset, not a per-order decision), and failure budget (fall back to the previous approved subset) all tolerate a non-deterministic, possibly-unavailable, possibly-slow backend. The wrong place is the order-time hot path. The design decides this; the marketing copy follows.

10. L6 — Replay and scenario catalogs as a quality scoreboard

L6 sits at the top of the stack and answers a different question than the layers below it. L2–L5 produce forward-looking artifacts: rolling stats, model signals, tactical tunes, strategic subsets, batch allocations. L6 produces backward-looking validation: did the policy this firm published last week, last month, last quarter actually execute the way it was supposed to, against the scenarios it was supposed to handle?

The mechanism is a versioned catalog of replayable scenarios. Adaptive Quantum SOR’s catalog includes 60+ scenarios across categories: baseline replay, regime transitions, liquidity disappearance, stale feeds, venue outages, toxic venues, lineage and replay determinism, failure-negative cases, live-reset modes, ML/feature dataset behavior, risk and throttle and capacity tests, multi-instrument behavior, and the Phase 6 cross-parent batch cases. Each scenario is a fully declarative YAML spec: market state, venue profiles, simulator seeds, parent-order generation rules, expected invariants, and the categories the scenario covers. Each runs through the same simulator and executor stack the live system uses, with a separate scenario-engine context that does not contaminate live state.

Two uses of L6 deserve attention.

As a regression surface. Every commit that touches the optimizer or execution stack runs the catalog under Gradle’s scenario test profile. A scenario whose summary checksum changes is either a defect or an intended behavior change; in either case it is visible, reviewable, and gated. The catalog is the deterministic-replay equivalent of an integration test surface: it catches non-determinism, allocation regressions on the hot path, optimizer instability under regime shifts, and silent corruption of policy lineage. The scenarios are kept user-readable (YAML, not generated fixtures) so they double as documentation for the failure modes the system claims to handle.

As Phase 7’s scoring substrate. When robust policy selection is enabled (next section), every candidate policy in the candidate set is evaluated against the declared scenario set through the same L6 machinery. The output is a ScoreMatrix whose rows are candidates and columns are scenarios. The matrix is the input to the robust-selection objective; the catalog gives the matrix its meaning. A robustness claim is only as strong as the scenarios it was evaluated against, which is why scenario-set adequacy is itself a gate, covered in §20.

L6 is entirely off the order-time path. It runs offline, at human or scheduled cadence, against either live state (with explicit reset modes for Jupyter-style notebook exploration) or isolated scenario contexts. It influences L0 only by validating the policy candidate that L1 will publish. The order-time path never reads from L6 directly.


Part III — GPU/CUDA integration, in detail

11. The native boundary contract

[Diagram: the canonical Java-to-native contract. The Java side packs a direct buffer and calls the JNI bridge; the native side reads via a void pointer, runs the kernel, and returns a primitive status code; Java maps the status to an enum and either accepts the result or falls back to the reference implementation.]

Figure 4. One contract, three bridges. CUDA tactical, CUDA-Q strategic, and the cross-parent batch allocator all use the same shape: versioned ABI, direct buffers, status-code error surface, and fail-closed fallback to a deterministic Java reference.

Sections 7, 8, and 9 each invoked a “native boundary” without quite explaining what makes one work. This section consolidates the contract. Every native backend in Adaptive Quantum SOR (CUDA tactical, CUDA-Q strategic, batch allocator) uses the same four-property shape:

A versioned ABI. A C header (tactical_optimizer_layout.h, cudaq_strategic_optimizer.h, equivalents for batch) defines the input and output structs. Both Java and C++ compile against the same definitions. Field ordering, scalar widths, and array sizing are explicit. A layout test on the C++ side asserts struct offsets at compile time; a Gradle CTest target runs the layout test as part of check, which means any mismatched layout fails the build before any execution path runs.

Direct buffers across the boundary. The Java side allocates ByteBuffer.allocateDirect(...) for input and output payloads, configures them with the platform byte order, and passes them across JNI. The native side receives void* pointers and reads/writes the documented layout. No JVM-heap data, no array copy, no String allocation. The buffer is reused across cycles; the bridge is allocation-free per call.

Status-code-based error surface. The native function returns a primitive int status, mapped to a Java enum on return. The status codes are the same shape across all three bridges: OK, LIBRARY_MISSING or BACKEND_UNAVAILABLE, INVALID_INPUT, TIMEOUT, OUTPUT_INVALID or INVALID_RESULT, NATIVE_FAILURE. No exception ever crosses JNI; no std::exception is translated; no error message string is marshalled. The bridge’s job is to map the primitive into a diagnostic Java enum, increment the appropriate counter, and return to the optimizer.

Fail-closed Java-side fallback. Every native bridge has an associated Stub or reference implementation in Java that the optimizer falls back to when the native call returns anything other than OK. The fallback is deterministic, correct, and allowed to be slower. It is what runs on hosts without the native toolkit installed (developer laptops, CI runners without GPUs, hosts where the library failed to load), and it is what runs in production when the native backend is briefly unavailable or returns invalid output. Having the fallback at all is what lets the system be GPU-accelerated without being GPU-dependent.

The discipline these four properties enforce is the same as the hot/cold split, applied to a different boundary. The boundary is permeable in one direction (Java reads from native; native never reads from Java state mid-call) and the failure budget is bounded (any failure produces a primitive status, an audit record, and a fallback). Native code is allowed to be sophisticated; the bridge is required to be simple.
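The four properties fit together roughly as in the sketch below. It is a minimal illustration, not the project's bridge: the NativeCall interface stands in for a real JNI entry point, the numeric status mapping is an assumption, and the payload is a single int where the real layout is a versioned struct.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a fail-closed bridge: direct buffers in, primitive status out.
final class FailClosedBridgeSketch {
    enum Status {
        OK, BACKEND_UNAVAILABLE, INVALID_INPUT, TIMEOUT, INVALID_RESULT, NATIVE_FAILURE;

        /** Assumed numeric mapping; unknown codes collapse to NATIVE_FAILURE. */
        static Status fromCode(int code) {
            Status[] all = values();
            return (code >= 0 && code < all.length) ? all[code] : NATIVE_FAILURE;
        }
    }

    /** Stand-in for the JNI entry point: reads 'in', writes 'out', returns a status int. */
    interface NativeCall {
        int invoke(ByteBuffer in, ByteBuffer out);
    }

    /** Deterministic Java reference used whenever the native call is not OK. */
    interface Reference {
        int solve(int input);
    }

    // Allocated once and reused, so the per-call path does not allocate.
    private final ByteBuffer in = ByteBuffer.allocateDirect(4).order(ByteOrder.nativeOrder());
    private final ByteBuffer out = ByteBuffer.allocateDirect(4).order(ByteOrder.nativeOrder());
    private long fallbackCount;
    private Status lastStatus = Status.OK;

    int run(int input, NativeCall nativeCall, Reference reference) {
        in.putInt(0, input);
        Status status;
        try {
            status = Status.fromCode(nativeCall.invoke(in, out));
        } catch (UnsatisfiedLinkError e) {
            status = Status.BACKEND_UNAVAILABLE; // library not loaded on this host
        }
        lastStatus = status;
        if (status == Status.OK) {
            return out.getInt(0);
        }
        fallbackCount++;                          // counter increment, not a panic
        return reference.solve(input);
    }

    long fallbackCount() {
        return fallbackCount;
    }

    Status lastStatus() {
        return lastStatus;
    }
}
```

The caller never learns which path produced the answer unless it asks the metrics; that is the sense in which the bridge is simple and the backend is allowed to be sophisticated.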

There is one failure mode the in-process JNI shape does not handle: a hard crash inside the native backend (a segfault, an uncaught native exception, a GPU driver fault that takes down the host process) propagates to the JVM. The architectural answer, when the firm’s tolerance for warm-path JVM restarts becomes too tight, is to move the native backend into a separate worker process and let the bridge talk to it over shared memory or a Unix-domain socket carrying the same direct-buffer payload. The Java side does not change; the native ABI does not change; what changes is that a native crash now kills the worker process, not the trading JVM, and a process supervisor restarts the worker for the next cycle. This is the standard next step for native crash isolation and is explicitly tracked as a follow-on engineering task rather than shipped in the current release.

12. CUDA tactical, end to end

To make the abstraction concrete, the L3 CUDA tactical path runs as follows on each warm-path cycle:

The optimizer reads StrategicVenueSubsetResult from the latest approved strategic run, PolicyOptimizationInput (snapshot of L2, L5, risk, metadata), and MarketBookState. It builds a TacticalOptimizerNativeInput direct buffer per route key, populating venue counts, eligible venue IDs, current weights, current penalties, fee schedules, risk caps, child-size bounds, and the scoring weights configured for the cycle.

JniTacticalOptimizerNativeBridge.optimize(...) is invoked. The bridge calls the loaded shared library; the library does its work; the call returns. The MVP scoring kernel computes a bounded sum-and-clamp on the input bps values (this is a placeholder; a production deployment would replace it with cuOpt-backed constrained QP, a hand-rolled CUDA kernel, or an equivalent CPU implementation, behind the same ABI). The output buffer carries the updated weights, penalties, limits, and a status code.

CudaOptimizerMetrics records the elapsed time of the native call (measured with the cluster’s monotonic deterministic clock, not the wall clock), the return status, and per-call counters. CudaOptimizerHealth exposes the rolling failure rate, the last status, and the time since last successful call. These are operational observables, not order-time inputs.

The result is validated by TacticalPolicyOptimizer: every output weight is in [0, 10000] bps, every penalty in configured ranges, every limit consistent with the parent envelope. If validation passes, the result becomes the new TacticalPolicyResult for the cycle. If validation fails, the result is rejected, the rejection reason is recorded, the previous approved tactical result remains in force, and the next cycle retries.

TacticalPolicyResult flows into the policy compiler at L1, which combines it with the strategic subset to produce a MutablePolicyCandidate, which becomes a compiled SorPolicy, which is lint-checked, validated, and (Phase 7) optionally robust-evaluated before publication.
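The validation step in this walkthrough, where a bad result is rejected and the previous approved result stays in force, can be sketched as follows. The class is hypothetical and checks only the [0, 10000] bps weight range named above; the penalty-range and parent-envelope checks are omitted.

```java
// Hypothetical sketch of result validation with "previous approved stays in force" semantics.
final class TacticalResultGate {
    private int[] approvedWeightsBps;
    private String lastRejectionReason;

    TacticalResultGate(int[] initialApprovedWeightsBps) {
        this.approvedWeightsBps = initialApprovedWeightsBps.clone();
    }

    /** Installs the candidate and returns true only if every weight is in [0, 10000] bps. */
    boolean offer(int[] candidateWeightsBps) {
        for (int i = 0; i < candidateWeightsBps.length; i++) {
            int w = candidateWeightsBps[i];
            if (w < 0 || w > 10_000) {
                // Record why, leave the approved result untouched, let the next cycle retry.
                lastRejectionReason = "weight[" + i + "]=" + w + " outside [0, 10000] bps";
                return false;
            }
        }
        approvedWeightsBps = candidateWeightsBps.clone();
        lastRejectionReason = null;
        return true;
    }

    int[] approved() {
        return approvedWeightsBps.clone();
    }

    String lastRejectionReason() {
        return lastRejectionReason;
    }
}
```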

Three pieces of evidence make this trustworthy and reviewable:

A representative Phase2CompletionReport summarizes the implemented acceptance criteria, the failure modes covered, the metric and health surfaces, and the known limits (process-level native crash isolation is documented as out of scope; Nsight-level deep GPU profiling is out of scope for the MVP). The report is part of the release evidence; the implementation does not ship without it.

13. Profiling, observability, and the discipline of native failure

A GPU-accelerated layer that nobody observes is no better than the fallback. Three observable surfaces make the difference between “we use CUDA” and “we use CUDA, and we know how it is performing”:

Per-call runtime. CudaOptimizerMetrics.lastNativeRuntimeNanos, runCount, and maxNativeRuntimeNanos are exposed through the warm-path metrics API. A regression in native runtime, usually a symptom of GPU contention, a driver issue, or a kernel-side regression, is visible even when no failure rate has changed.

Per-call status. Every native call records its status code. The distribution of statuses over time (OK, TIMEOUT, GPU_UNAVAILABLE, etc.) is the operational dashboard for GPU health. A spike in TIMEOUT is a contention or load issue; a spike in OUTPUT_INVALID is a backend correctness bug; a spike in LIBRARY_MISSING is a deployment issue. None of them blocks trading; all of them are visible.

Fallback rate. The fraction of recent cycles that ran on the Java fallback is a single composite metric for “how GPU-dependent is this layer right now?” A fallback rate near zero says the GPU is doing the work; a rate near 100% says the system has gracefully degraded to the reference implementation; intermediate values say the system is in a partial-degradation state and an operator should look. The same metric exists for the strategic layer (CUDA-Q fallback to exhaustive reference) and the batch layer (native backend fallback to deterministic reference), giving a coherent observability surface across all three native-backed layers.
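A rolling fallback rate of this kind needs nothing more than a fixed-size ring buffer. The sketch below is one possible shape, with a hypothetical class name and a basis-point output chosen to match the integer style used elsewhere in the article; it is not taken from the project's metrics code.

```java
// Hypothetical sketch: rolling fallback rate over the last N warm-path cycles.
final class FallbackRateWindow {
    private final boolean[] usedFallback;   // ring buffer of recent cycles
    private int next;                       // slot the next cycle will overwrite
    private int filled;                     // number of cycles currently in the window
    private int fallbackCount;              // fallbacks currently in the window

    FallbackRateWindow(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        usedFallback = new boolean[windowSize];
    }

    void record(boolean cycleUsedFallback) {
        if (filled == usedFallback.length) {
            if (usedFallback[next]) {
                fallbackCount--;            // evict the oldest cycle
            }
        } else {
            filled++;
        }
        usedFallback[next] = cycleUsedFallback;
        if (cycleUsedFallback) {
            fallbackCount++;
        }
        next = (next + 1) % usedFallback.length;
    }

    /** Fallback rate in basis points over the cycles seen so far (0 when empty). */
    int rateBps() {
        return filled == 0 ? 0 : (int) ((10_000L * fallbackCount) / filled);
    }
}
```

The window is allocation-free after construction and is only ever read by dashboards, which keeps it consistent with the rule that metrics do not gate execution.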

A subtler discipline: the metrics themselves do not gate execution. A high fallback rate is visible but does not halt the warm path. The warm path’s correctness comes from the fallback being correct, not from the GPU being available. The operator runbook for a high fallback rate is to investigate the GPU side without urgency; the trading side is unaffected.

This is what we mean by treating native failure as a first-class observable event. In a less disciplined architecture, GPU failure is a panic: the system either halts or starts producing garbage. In Adaptive Quantum SOR, GPU failure is a counter increment. The fallback handles correctness; the metric handles awareness; the operator handles diagnosis.


Part IV — Quantum-future readiness, with honest scope

14. Why the strategic layer is the natural quantum seam

Three properties make the L4 strategic subset-selection problem a credible target for QUBO / Ising / quantum-style backends, and the absence of those same three properties is what makes almost any other layer of the SOR stack a bad target.

The objective is a real quadratic. The pair coefficients are not optional. Without them, subset selection degenerates to a sort; with them, it is the kind of problem that physics-flavored solvers are built for. A QUBO with non-trivial pair structure is what quantum annealers, variational quantum eigensolvers, and QAOA-style hybrid loops are designed to attack. Most “quantum SOR” marketing copy applies QUBO to problems that are not QUBO-shaped at all: venue ranking is not, fee comparison is not, fill-probability estimation is not. Strategic subset selection genuinely is.

The cadence tolerates non-determinism. L4 runs every 5–30 minutes. A quantum or quantum-hybrid solver that takes 30 seconds to converge, or that returns a different (but valid) minimum on different runs because of sampling variance, is acceptable here in a way it would not be on the order-time path or even at the 1-minute tactical cadence. The strategic layer can absorb the temporal cost of a quantum-style solve; the cadence below it cannot.

The output is small and discrete. A strategic result is a venue mask plus an energy value: a few tens of bits per route key. The information density across the JNI / native / quantum boundary is tiny. Compared to the dense weight matrices L3 needs, or the per-tick venue stats L2 produces, the strategic layer’s I/O profile is trivially compatible with backends that have slow or expensive I/O.

The three properties together are why Adaptive Quantum SOR puts the subset-selection QUBO seam at L4 and nowhere else on the policy path (the L3.5 batch allocator, covered in §16, is the one other quadratic seam). Not at L0, not at L1, not at L2, not at L3. Putting it anywhere else would either break the order-time path’s determinism (because quantum sampling is non-deterministic) or fail to use the quantum solver for the thing it is good at (combinatorial subset selection with non-trivial pair interactions).

15. From the exhaustive reference to a QUBO/CUDA-Q backend, in stages

[Diagram: four backends behind a shared ABI: the classical reference solver always present, a Stage 1 classical QUBO solver running today, a Stage 2 CUDA-Q hybrid simulated on an NVIDIA GPU running today, and a Stage 3 real QPU expected by 2029.]

Figure 5. The staged backend integration at L4. Same input layout. Same output layout. Same fail-closed contract. Only the backend changes.

The reference implementation we ship (sor_cudaq_strategic_optimize) is classical and exhaustive. For each route key, it enumerates 2^n venue masks for n ≤ 30, filters by the cardinality bounds, evaluates the QUBO energy, and returns the minimum-energy feasible mask. This is correct, deterministic, and bounded-fast for the venue counts that appear in real route keys. It is the ground truth.

The QUBO / Ising / CUDA-Q backend integration path is staged:

Stage 0: ABI parity. The QUBO-aware backend uses the same SorStrategicQuboInput / SorStrategicQuboOutput layout the exhaustive reference uses. Linear coefficients are an int[venueCount] array; pair coefficients are an int[venueCount * venueCount] matrix; cardinality bounds and the cardinality penalty are scalars. The backend reads what the reference reads and writes what the reference writes.

Stage 1: Classical QUBO solver. Simulated annealing, parallel tempering, or tabu search runs against the QUBO instance. The result is a binary mask plus an energy value. The Java side compares against the exhaustive reference on every problem with n ≤ K for some configured K (the “trust margin”) and rejects results that are worse than the reference. Problems with n > K are accepted with a configured trust policy; the audit lineage records the trust decision.

Stage 2: CUDA-Q hybrid solver. A QAOA or VQE-style hybrid loop runs the same QUBO through a quantum-classical optimizer using CUDA-Q. The result is, again, a binary mask plus an energy value. The same reference-comparison gate applies for n ≤ K. The same fail-closed contract applies for everything else: BACKEND_UNAVAILABLE falls back to Stage 1 (or directly to the reference); TIMEOUT falls back; INVALID_RESULT falls back.

Stage 3: A dedicated QPU. A quantum annealer, or a fault-tolerant gate-model machine when available, runs the QUBO through the same C++ bridge. Nothing in the Java side or the policy compiler or L0 changes. The bridge maps to a different physical resource on the other side; the input layout is the same; the output layout is the same; the validation against the reference is the same.
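The reference-comparison gate that Stages 1 through 3 share can be sketched as a pure function. The names below are hypothetical, and the sketch assumes the Java side has already recomputed the backend's energy from the returned mask and checked the cardinality bounds, so the only remaining question is how the result compares with the exhaustive reference.

```java
// Hypothetical sketch of the trust-margin gate: cross-validate for n <= K, trust policy above K.
final class TrustMarginGate {
    enum Decision { ACCEPT_VALIDATED, ACCEPT_UNVALIDATED, REJECT_WORSE_THAN_REFERENCE }

    /**
     * @param venueCount      problem size n
     * @param trustMargin     K; instances with n <= K are compared against the reference
     * @param backendEnergy   energy of the backend's mask, recomputed on the Java side
     * @param referenceEnergy energy of the exhaustive reference's mask (read only when n <= K)
     */
    static Decision decide(int venueCount, int trustMargin,
                           long backendEnergy, long referenceEnergy) {
        if (venueCount > trustMargin) {
            // Too large to cross-validate; the trust decision itself goes into lineage.
            return Decision.ACCEPT_UNVALIDATED;
        }
        // The reference is exact, so a feasible backend result can at best tie it.
        return backendEnergy <= referenceEnergy
                ? Decision.ACCEPT_VALIDATED
                : Decision.REJECT_WORSE_THAN_REFERENCE;
    }
}
```

Keeping the gate a pure function of four numbers is what makes the trust decision replayable: the same inputs in the audit record always reproduce the same accept or reject.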

Three properties make this future-proof rather than aspirational:

The reference is always present, always correct, and always cheap to run on the problems where its correctness is most provable. Every backend’s result is validated against it on small enough instances; backends that fail validation are rejected and the system continues on the previous approved subset.

The bridge is the only thing that changes between stages. Java-side optimizer code, the lineage stamping, the policy compiler, the publication gate, the L0 execution path: none of these care which backend is on the other side of the JNI call. A team that wants to evaluate a new QUBO backend writes a cudaq_strategic_optimizer.cpp that conforms to the existing ABI, plugs it in via build configuration, and runs the catalog of CUDA-Q failure-and-fallback tests. There is no need to touch Java.

The trust margin is configurable, not hard-coded. The firm decides how big a problem must be before the system stops cross-validating against the exhaustive reference; that decision is recorded in policy lineage, replay-tested, and reviewable. Skeptics about quantum claims can set the margin to “always cross-validate”; aggressive deployments can set it lower. We do not force a position.

16. Why the batch allocator is a second quantum seam

The L3.5 cross-parent batch allocator has the same quadratic structure as the strategic layer, scaled up by an order of magnitude. Where L4’s variables are binary venue-selection bits, L3.5’s variables are unit-quantity allocations across parent-venue pairs. Where L4’s pair coefficients are venue-pair penalties, L3.5’s pair coefficients are same-venue self-impact and cross-venue correlated leakage. Where L4’s constraints are cardinality bounds, L3.5’s constraints are parent quantity allocation, venue capacity, and participation caps.

Both are constrained quadratic problems. Both are NP-hard in general. Both are tractable at the instance sizes a single firm sees in a single cycle. Both benefit from solver acceleration when the firm runs at scale. And, as §8.2 already covered, the same BatchAllocationBackend interface admits the same family of backends (cuOpt, QUBO/Ising, CUDA-Q hybrid) behind the same fail-closed discipline.

The point worth making here is the two-seam observation: L4 strategic and L3.5 batch are two independent places where quantum-style acceleration earns its keep, two independent places where the firm can experiment with different backends, and two independent places where the fail-closed discipline protects the executor. They do not have to use the same backend, and in practice they probably should not: the strategic problem is a pure subset-selection QUBO, the batch problem is a constrained quadratic assignment with non-binary variables; the right tools for each are likely different even within the same backend family.

17. What we are and are not claiming about quantum advantage

Marketing copy about quantum-accelerated trading systems is uniformly aggressive about claims, uniformly thin on evidence, and uniformly opaque about scope. Adaptive Quantum SOR takes the opposite position. The claims, in descending order of confidence:

We claim, with evidence: the strategic and batch layers have genuinely quadratic objectives with non-trivial pair structure, a backend boundary that QUBO / Ising / CUDA-Q / QPU implementations can plug into with the same ABI as the classical reference, a deterministic reference solver that gives every backend a correctness oracle on small problems, a fail-closed fallback policy that protects the executor on every documented native failure mode, and a benchmark and audit surface that makes any backend’s behavior reviewable and replayable.

We do not claim: that any specific quantum backend, today, produces a useful speedup over the classical reference at the problem sizes a single trading firm sees in a single cycle. That is an empirical question, it depends sharply on the firm’s venue universe and pair structure, and the design supports either answer.

We explicitly disclaim: quantum advantage as a marketing position. The system is quantum-ready in the precise sense that switching backends does not require touching any other part of the system. Whether the switch is worth making is up to the workload, the backend availability, and the firm’s willingness to operate a backend with a different failure profile than its classical equivalent. We do not pretend to have answered the question; we make the question askable.

The honest framing is the one the project’s specifications use: “CUDA, CUDA-Q, Ising, cuOpt, cuRAND, or StyleGAN-style image generators are not required for canonical market-data generation.” The native backends are integration targets, not load-bearing dependencies. The system runs end-to-end on a host with no GPU, no quantum hardware, and no Python: the JMH benchmarks, the scenario catalog, the policy compilation, and the L0 execution path all work without any of them. That is the property that makes the quantum-readiness claim defensible: the quantum side is genuinely optional, and the optionality is enforced by the build and the test catalog, not by promises.

The contrast with a meaningful slice of “quantum trading” vendor architectures is worth being explicit about. The pattern most vendors converge on, when their literature is read carefully, is quantum-or-GPU-accelerated routing: a backend that produces a routing decision per order, on the order-time path, by invoking a vendor SDK that wraps a QPU, an FPGA, or a tightly-coupled GPU kernel. The claimed value is microsecond-scale “AI-optimal” routing. The actual properties of such designs, when examined, are: (a) the order-time path’s failure budget is now coupled to the vendor’s backend availability; (b) the routing decision is non-deterministic, because the underlying solver is sampling-based, so replay is at best probabilistic; (c) the audit trail is opaque, because the decision’s provenance lives inside vendor code; and (d) the integration is irreversible, because removing the vendor means rebuilding the routing logic from scratch. These are not features. They are the failure modes the design in this article is built to avoid. The quantum and GPU work belongs on a separate cadence, behind a fail-closed boundary, producing an artifact the deterministic executor reads. Not on the order-time path, and not as a vendor-coupled black box producing per-order decisions.

Key Principle: Future-proof design is not the same as future-vendor-lock-in. The right architecture for quantum-ready trading infrastructure is one whose order-time path is correctly Java-only and CPU-only, whose warm-path optimizers have clean Java-side interfaces with deterministic reference implementations, and whose backend boundaries fail closed when the exotic backend is unavailable. Anything more aggressive (putting GPU on the order-time path, putting QPU on a 1-minute cadence, requiring CUDA-Q at startup) trades correctness and operability for marketing position. The design takes the trade in the opposite direction.


Part V — Robust policy selection (Phase 7)

18. Why expected-condition tuning silently accepts tail fragility

Through Phases 1–6, the publication path produces one candidate policy per cycle and publishes it after lint, validation, expected-improvement, cadence, and churn checks. The implicit selection objective is “is the new candidate better than the active policy under expected conditions?”, where “expected conditions” is some implicit average over the venue stats and model signals at cycle time.

This is fragile. A policy that looks excellent under expected conditions can be poor under regime shifts, liquidity fades, stale feeds, or venue outages, exactly the conditions where the cost of a bad routing decision is highest. Single-candidate publication selects on the part of the world that is easiest to measure and silently accepts the rest. A firm that publishes only on expected-improvement has no operational signal at publication time that it is taking tail risk; it discovers the tail risk only when the tail event happens.

Phase 7 of Adaptive Quantum SOR changes this by making the publication objective explicit. The system can be configured to produce a candidate set instead of one candidate, evaluate every candidate against a declared scenario set through the L6 replay machinery, and select the published candidate by a configured objective over the resulting candidate × scenario score matrix. The objective is one of four pure functions (EXPECTED, MIN_MAX, CVAR_K, MIN_REGRET); the scenario set is one of several configurable sets drawn from the catalog; the matrix is persisted as a governance artifact; the selection decision is stamped into the policy ledger.

The key word is explicit. The objective is named, the scenario set is named, the matrix is persisted, the selection is replayable. A firm that publishes under CVAR_K(k=10) over a scenario set covering regime, liquidity, venue-health, and failure-negative categories is making a configured trade between expected-condition quality and worst-decile robustness, and the configuration is auditable. A firm that publishes under EXPECTED is also making a trade, toward expected quality and away from robustness, and is recording that choice in the ledger. The point is not that any one objective is correct; the point is that the trade is no longer hidden.
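To make the four objectives concrete, here is a sketch of how each could reduce a candidate × scenario matrix to a single winner. The definitions are common textbook readings and are assumptions here, not the specification's: scores are treated as costs (lower is better), scenarios are equally weighted, CVAR_K averages the worst k percent of scenarios (at least one), and MIN_REGRET minimizes the largest gap to the best candidate in each scenario.

```java
import java.util.Arrays;

// Hypothetical sketch of the four selection objectives over a cost matrix.
final class RobustObjectives {
    enum Objective { EXPECTED, MIN_MAX, CVAR_K, MIN_REGRET }

    /** cost[c][s] is the cost of candidate c under scenario s. Returns the winning row index. */
    static int select(double[][] cost, Objective objective, int kPercent) {
        int best = -1;
        double bestValue = Double.POSITIVE_INFINITY;
        for (int c = 0; c < cost.length; c++) {
            double value = evaluate(cost, c, objective, kPercent);
            if (value < bestValue) {          // ties keep the lowest candidate index
                bestValue = value;
                best = c;
            }
        }
        return best;
    }

    static double evaluate(double[][] cost, int c, Objective objective, int kPercent) {
        double[] row = cost[c];
        switch (objective) {
            case EXPECTED:
                return Arrays.stream(row).average().orElse(Double.NaN);
            case MIN_MAX:
                return Arrays.stream(row).max().orElse(Double.NaN);
            case CVAR_K: {
                double[] sorted = row.clone();
                Arrays.sort(sorted);          // ascending, so the worst costs are last
                int tail = (int) Math.ceil(row.length * kPercent / 100.0);
                tail = Math.min(row.length, Math.max(1, tail));
                double sum = 0;
                for (int i = sorted.length - tail; i < sorted.length; i++) {
                    sum += sorted[i];
                }
                return sum / tail;
            }
            case MIN_REGRET: {
                double worstRegret = 0;
                for (int s = 0; s < row.length; s++) {
                    double bestInScenario = Double.POSITIVE_INFINITY;
                    for (double[] other : cost) {
                        bestInScenario = Math.min(bestInScenario, other[s]);
                    }
                    worstRegret = Math.max(worstRegret, row[s] - bestInScenario);
                }
                return worstRegret;
            }
            default:
                throw new IllegalArgumentException("unknown objective");
        }
    }
}
```

On a made-up matrix with rows {1, 1, 1, 10}, {4, 4, 4, 4}, and {3, 3, 3, 6}, EXPECTED picks the first row, MIN_MAX picks the second, and MIN_REGRET picks the third. Three objectives, three different published policies from the same evidence, which is the reason the choice of objective has to be explicit.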

19. Candidate sets, scorecards, and the four objectives

[Figure: a vertical five-step flow of candidate generation, scenario sweep evaluator, score matrix of candidates against scenarios, four objective options (EXPECTED, MIN_MAX, CVAR_K, MIN_REGRET), and final publication with provenance stamped into the audit ledger.]

Figure 6. The Phase 7 robust selection flow. Candidates are generated explicitly, scored against every scenario in the catalog, evaluated under one of four objectives, and published with the full selection rationale stamped into the audit ledger.

The Phase 7 contract between the optimizer coordinator and the publication gate is a deterministic PolicyCandidateSet. Each candidate has:

Candidates are pairwise distinct by canonical hash; duplicates are removed; invalid candidates (those that fail lint or validation individually) are filtered out before the set is submitted to the gate.

19.1 Candidate generation: grid, then evolution

The initial candidate generation uses a deterministic configured grid over existing optimizer knobs (riskScaleBps = [7500, 10000, 12500], concentrationPenaltyScaleBps = [7500, 10000, 12500]), yielding up to nine candidates per cycle. The grid is intentionally simple: it gives the robust selector a non-degenerate candidate set with minimal engineering surface, and it lets the entire Phase 7 path be replay-tested deterministically because every candidate is a pure function of the configured grid points and the cycle’s input snapshot.
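A minimal Java sketch of the grid step, under stated assumptions: the GridPoint record and the enumerate helper are hypothetical names, not taken from the repository. It shows only how a row-major walk over the two configured knob arrays yields a stable candidate order. Compiling each point into a policy, hashing, and deduplication are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: deterministic grid over the two optimizer knobs named in the text. */
final class CandidateGridSketch {

    /** One grid point; candidateId is assigned in enumeration order (an assumption). */
    record GridPoint(int candidateId, int riskScaleBps, int concentrationPenaltyScaleBps) {}

    /** Row-major enumeration: the same configured arrays always yield the same ordered list. */
    static List<GridPoint> enumerate(int[] riskScales, int[] concentrationScales) {
        List<GridPoint> points = new ArrayList<>(riskScales.length * concentrationScales.length);
        int nextId = 0;
        for (int risk : riskScales) {
            for (int concentration : concentrationScales) {
                points.add(new GridPoint(nextId++, risk, concentration));
            }
        }
        return List.copyOf(points); // immutable view for downstream stages
    }

    public static void main(String[] args) {
        int[] scales = {7500, 10000, 12500};
        enumerate(scales, scales).forEach(System.out::println);
    }
}
```

Because the output depends only on the configured arrays, replaying a cycle with the same configuration reproduces the same ordered candidate list before any dedup or validation filter is applied.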

The grid is also intentionally a starting point, not a final answer. Two natural extensions sit behind the same PolicyCandidateSet contract:

Search-based generation. A directed search (local search, simulated annealing, or basin-hopping over the optimizer’s knob space) produces a small set of candidates that are more diverse than the grid samples but still deterministic given a configured seed. The search runs as a cold-path step in the optimizer coordinator and emits candidates with generationLabel describing the search path. The downstream gate consumes the set identically.

Evolutionary generation. A small evolutionary loop over candidates from the previous N cycles (recombination, mutation, selection against the score matrix from the prior robust cycle) produces a candidate set that is biased toward what scored well recently. The evolutionary state is itself an audit artifact: the lineage of every cycle’s candidates includes the parent cycles whose candidates seeded the search. The mechanism is more powerful than the grid but introduces a non-trivial state-management burden across cycles (what survives across cycles, what is purged on regime change, how the search is reset after policy rollbacks) and so is treated as a follow-on rather than a Phase 7 deliverable.

In all three generation modes, the contract is the same: a deterministic, lint-and-validation-clean, hash-distinct set of candidates submitted to the publication gate. The robust objective and the scoring scenarios do not care how the candidates were generated; they care only that the set is well-formed.

19.2 The scorecard, with a path to real TCA

The ScenarioScorecardV1 formula produces a deterministic per-(candidate, scenario) score from a ScenarioSummary:

baseScore =
    fullFillCount * 10000
  + partialFillCount * 2500
  - rejectCount * 5000
  - residualQty
  - staleEventCount * 250
  - outageEventCount * 1000

candidateAdjustment = weightedMean over eligible IVRU cells of:
    fillProbabilityBps
  + queueSurvivalBps / 2
  + venueWeightBps / 4
  + maxParticipationBps / 20
  - toxicityPenaltyBps
  - rejectPenaltyBps
  - slippagePenaltyBps / 2
  - marketImpactPenaltyBps / 2
  - latencyPenaltyNanos / 1000

score = baseScore + candidateAdjustment

The scorecard is intentionally simple, deterministic, and replayable. It is not a production TCA model; it is enough to give the candidate × scenario matrix non-trivial structure (candidate adjustments differ; scenario outcomes differ; the joint matrix is genuinely two-dimensional). A later phase can replace the scorecard with a richer TCA formula; the scorecard version is stamped into every score-matrix artifact so older matrices remain comparable to themselves.
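The formula above can be sketched in Java as follows. This is an illustration, not the repository's ScenarioScorecardV1: the Summary and Cell records are trimmed, hypothetical stand-ins, the per-cell weight field is an assumption (the text does not say what the weighted mean is weighted by), coefficients are inlined for brevity, and integer division truncating toward zero is an assumed rounding rule.

```java
import java.util.List;

/** Hypothetical sketch of the V1 scorecard arithmetic using integer math only. */
final class ScenarioScorecardSketch {

    /** Scenario-side counters; a trimmed stand-in for ScenarioSummary. */
    record Summary(long fullFillCount, long partialFillCount, long rejectCount,
                   long residualQty, long staleEventCount, long outageEventCount) {}

    /** One eligible policy cell; weight is an assumed weighting input. */
    record Cell(long weight, long fillProbabilityBps, long queueSurvivalBps, long venueWeightBps,
                long maxParticipationBps, long toxicityPenaltyBps, long rejectPenaltyBps,
                long slippagePenaltyBps, long marketImpactPenaltyBps, long latencyPenaltyNanos) {}

    static long baseScore(Summary s) {
        return s.fullFillCount() * 10_000
             + s.partialFillCount() * 2_500
             - s.rejectCount() * 5_000
             - s.residualQty()
             - s.staleEventCount() * 250
             - s.outageEventCount() * 1_000;
    }

    /** Per-cell term, mirroring the divisors in the formula. */
    static long cellTerm(Cell c) {
        return c.fillProbabilityBps()
             + c.queueSurvivalBps() / 2
             + c.venueWeightBps() / 4
             + c.maxParticipationBps() / 20
             - c.toxicityPenaltyBps()
             - c.rejectPenaltyBps()
             - c.slippagePenaltyBps() / 2
             - c.marketImpactPenaltyBps() / 2
             - c.latencyPenaltyNanos() / 1_000;
    }

    /** Integer weighted mean over cells; 0 when no cell carries weight. */
    static long candidateAdjustment(List<Cell> cells) {
        long weighted = 0, totalWeight = 0;
        for (Cell c : cells) {
            weighted += c.weight() * cellTerm(c);
            totalWeight += c.weight();
        }
        return totalWeight == 0 ? 0 : weighted / totalWeight;
    }

    static long score(Summary s, List<Cell> cells) {
        return baseScore(s) + candidateAdjustment(cells);
    }

    public static void main(String[] args) {
        Summary s = new Summary(10, 4, 2, 300, 1, 0);
        Cell c = new Cell(1, 8000, 6000, 4000, 1200, 500, 200, 100, 60, 20000);
        System.out.println(score(s, List.of(c)));
    }
}
```

Keeping every term in long arithmetic avoids floating-point formatting or ordering effects, which is one straightforward way to keep a score byte-stable across replays.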

Two notes on the formula. residualQty is subtracted directly because it represents unexecuted intent at scenario end (quantity the policy committed to but did not work off), which is unambiguously a cost regardless of how the executed portion performed. And every coefficient in the formula (the 10000 on fullFillCount, the 5000 reject penalty, the 250 staleness penalty, the 1000 outage penalty, the per-cell adjustment divisors) is a configured scalar, not a hard-coded constant: a firm whose execution priorities weight reject avoidance higher than fill completion, or whose tolerance for stale-feed routing is stricter, configures the coefficients accordingly. The scorecard version stamped into the artifact captures the configured coefficient set so matrices produced under different configurations remain self-describing.

The path from V1 to a production TCA scorecard runs through three substantive additions, each gated by the discipline that has applied to every other layer of the architecture:

Implementation-shortfall accounting. V1 counts fills; a V2 scorecard accounts for what those fills cost. The natural extension is a per-scenario implementation-shortfall score: arrival-price benchmark, realized fill VWAP, slippage in basis points, all derived from the simulator’s deterministic execution outcomes and the parent’s intent. The cost side becomes first-class; a candidate that fills 100% but pays 18 bps slippage scores worse than one that fills 95% at 6 bps. The scorecard formula stays a pure function of ScenarioSummary (extended with the shortfall fields) so determinism and replayability survive the upgrade.

Venue-specific impact curves. V1 treats venue impact symmetrically across the candidate’s eligible subset; a V3 scorecard extends ScenarioSummary with per-venue realized-impact estimates derived from the simulator’s venue-behavior model, and the candidate adjustment becomes venue-aware. The same data structure that L2 maintains for live VenueStatsState can be reused for scenario-side impact accounting; the scorecard reads from a ScenarioVenueStats flyweight per scenario per venue rather than from aggregate scenario fields.

Short-horizon P&L attribution. V1 has no notion of post-fill mark-out within a scenario; a V4 scorecard could include a configurable horizon’s mark-out per fill, computed from the simulator’s deterministic mid-price evolution. The scorecard becomes a TCA model in the strict sense: it scores candidates on what their fills cost the firm at horizons relevant to the trading strategy, not just on whether the fills happened.

Each step adds modeling sophistication. None of them adds non-determinism, runtime cost on the L0 path, or coupling to live state. The scorecard runs on the warm path inside the L1 publication gate, against L6’s replay machinery, with the same fail-closed contract as every other layer: if the scorecard upgrade is buggy or unavailable, the previous scorecard version remains in force and the lineage records which scorecard version evaluated which matrix. Versioning the scorecard keeps the evolution honest: a score from V1 and a score from V4 are not comparable, and the artifact metadata says so.
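As an illustration of the implementation-shortfall step, the sketch below computes slippage against the arrival price in basis points from a list of fills. It is a hypothetical helper, not part of the reference implementation: prices are assumed to be integer ticks, the result truncates toward zero, and there is no overflow guard for very large notionals.

```java
/** Hypothetical sketch: arrival-price slippage in basis points using integer math. */
final class ShortfallSketch {

    /**
     * Slippage versus arrival in basis points, positive when fills cost more than arrival.
     * Prices are integer ticks; the result truncates toward zero.
     */
    static long slippageBps(boolean isBuy, long arrivalPrice, long[] fillPrices, long[] fillQtys) {
        long notional = 0, qty = 0;
        for (int i = 0; i < fillPrices.length; i++) {
            notional += fillPrices[i] * fillQtys[i];
            qty += fillQtys[i];
        }
        if (qty == 0) return 0; // nothing executed, nothing to attribute
        long arrivalNotional = arrivalPrice * qty;
        long signedCost = isBuy ? notional - arrivalNotional : arrivalNotional - notional;
        return signedCost * 10_000 / arrivalNotional;
    }

    public static void main(String[] args) {
        long[] prices = {20010, 20026};
        long[] qtys = {1000, 1000};
        System.out.println(slippageBps(true, 20000, prices, qtys));
    }
}
```

Comparing notionals rather than dividing out a VWAP first keeps the calculation in integers, so a V2 scorecard built this way would remain a pure, replayable function of the extended summary fields.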

The four robust objectives are pure functions over the score matrix produced by the scorecard in §19.2:

EXPECTED selects the candidate with the highest mean score across scenarios. It is the baseline, and it is what ADEQUACY_FALLBACK reverts to when the scenario set fails adequacy gates. It captures expected-condition quality.

MIN_MAX selects the candidate with the highest minimum score across scenarios (strict worst-case). It is the most conservative objective and is explicit opt-in only (allowMinMaxObjective=true), because pure worst-case selection can be dominated by a single pathological scenario and drag realized expected quality far below what a less conservative objective would have delivered.

CVAR_K selects the candidate with the highest mean over the worst k percent of scenarios. With k=10, this is a conditional-value-at-risk objective over the worst-decile scenarios; it interpolates between EXPECTED (k=100) and MIN_MAX (k = 100/n for n scenarios, i.e. only the single worst scenario). It is the default robust objective: more conservative than expected, less brittle than worst-case.

MIN_REGRET computes, for each candidate, the maximum regret (the per-scenario gap between the best candidate’s score on that scenario and this candidate’s score) and selects the candidate with the smallest maximum regret. It is the objective most aligned with “I would not want to look back and discover I picked a candidate that was meaningfully worse than the best available on any one scenario.”

All four are pure functions of ScoreMatrix. Ties break by lowest candidateId. The selection is byte-deterministic given the same matrix.
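The four objectives can be sketched as pure functions over a plain long[][] matrix. This is a hedged illustration rather than the repository's code: the row index stands in for candidateId, the CVaR tail size is rounded up to at least one scenario (an assumed rounding rule), and class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of the four objectives; scores[c][s] = candidate c on scenario s. */
final class RobustObjectivesSketch {

    /** Index of the largest value; strict '>' keeps the lowest index on ties. */
    static int argMax(long[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }

    /** EXPECTED: rows share one length, so comparing sums equals comparing means. */
    static int expected(long[][] scores) {
        long[] sums = new long[scores.length];
        for (int c = 0; c < scores.length; c++) sums[c] = Arrays.stream(scores[c]).sum();
        return argMax(sums);
    }

    /** MIN_MAX: highest worst-case score. */
    static int minMax(long[][] scores) {
        long[] worst = new long[scores.length];
        for (int c = 0; c < scores.length; c++) worst[c] = Arrays.stream(scores[c]).min().getAsLong();
        return argMax(worst);
    }

    /** CVAR_K: highest sum over the worst ceil(k% * n) scenarios, at least one. */
    static int cvar(long[][] scores, int kPercent) {
        int n = scores[0].length;
        int tail = Math.max(1, (n * kPercent + 99) / 100);
        long[] tailSums = new long[scores.length];
        for (int c = 0; c < scores.length; c++) {
            long[] sorted = scores[c].clone();
            Arrays.sort(sorted); // ascending: worst scenarios first
            for (int i = 0; i < tail; i++) tailSums[c] += sorted[i];
        }
        return argMax(tailSums);
    }

    /** MIN_REGRET: smallest maximum gap to the per-scenario best candidate. */
    static int minRegret(long[][] scores) {
        int n = scores[0].length;
        long[] negatedMaxRegret = new long[scores.length]; // regret >= 0, so 0 is a safe start
        for (int s = 0; s < n; s++) {
            long best = Long.MIN_VALUE;
            for (long[] row : scores) best = Math.max(best, row[s]);
            for (int c = 0; c < scores.length; c++) {
                negatedMaxRegret[c] = Math.min(negatedMaxRegret[c], -(best - scores[c][s]));
            }
        }
        return argMax(negatedMaxRegret);
    }

    public static void main(String[] args) {
        long[][] m = {{100, 100, 100, 0}, {60, 60, 60, 60}, {90, 90, 90, 25}};
        System.out.printf("EXPECTED=%d MIN_MAX=%d CVAR_50=%d MIN_REGRET=%d%n",
                expected(m), minMax(m), cvar(m, 50), minRegret(m));
    }
}
```

On the three-candidate matrix in main, the objectives disagree: EXPECTED picks candidate 0 (best mean, one catastrophic scenario), MIN_MAX and CVAR_K at k=50 pick candidate 1 (flat scores), and MIN_REGRET picks candidate 2 (never far from the best on any scenario). That disagreement is the reason the configured objective is worth recording.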

20. Scenario adequacy as a strength-of-claim gate

Robustness is relative to the scenarios it was evaluated against. A candidate that wins CVAR_K over a scenario set covering only baseline and normal-regime cases is not robust; it is well-calibrated for the easy part of the world. Phase 7 makes this visible through a scenario-adequacy gate.

The default adequacy config is:

adequacy:
  minScenarioCount: 12
  requiredCategories:
    - regime
    - liquidity
    - venue-health
    - failure-negative
  strictBlock: false

minScenarioCount ensures enough scenarios are evaluated to make the matrix’s worst-decile or worst-case statistics meaningful. requiredCategories ensures the scenario set spans the failure modes that matter operationally: regime transitions, liquidity disappearance, venue health degradation, and explicit failure-negative cases. strictBlock decides whether adequacy failure blocks publication or falls back to EXPECTED.

When adequacy passes, the configured objective is used and the result is stamped with adequacyStatus=ADEQUATE. When adequacy fails with strictBlock=false, the selection objective becomes EXPECTED (the safest baseline objective), the provenance records robustObjective=ADEQUACY_FALLBACK, the missing categories are recorded, and a warning lifecycle event is emitted. When adequacy fails with strictBlock=true, publication is blocked entirely; no active policy swap occurs; the operator sees an explicit governance gate failure.
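The three outcomes described above can be sketched as a small pure function. The type and method names here (Config, Result, Outcome, evaluate) are hypothetical and chosen for illustration; only the config field names and the pass, fallback, and block behavior come from the text.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the scenario-adequacy decision. */
final class AdequacyGateSketch {

    enum Outcome { ADEQUATE, FALLBACK_TO_EXPECTED, BLOCKED }

    record Config(int minScenarioCount, Set<String> requiredCategories, boolean strictBlock) {}

    /** missingCategories is sorted so the recorded provenance is order-stable. */
    record Result(Outcome outcome, Set<String> missingCategories) {}

    /** scenarioCategories holds one category label per evaluated scenario. */
    static Result evaluate(Config cfg, List<String> scenarioCategories) {
        Set<String> missing = new TreeSet<>(cfg.requiredCategories());
        missing.removeAll(scenarioCategories);
        boolean adequate = missing.isEmpty()
                && scenarioCategories.size() >= cfg.minScenarioCount();
        Outcome outcome = adequate ? Outcome.ADEQUATE
                : cfg.strictBlock() ? Outcome.BLOCKED
                : Outcome.FALLBACK_TO_EXPECTED;
        return new Result(outcome, missing);
    }

    public static void main(String[] args) {
        Config cfg = new Config(12,
                Set.of("regime", "liquidity", "venue-health", "failure-negative"), false);
        List<String> onlyEasyCases = List.of("regime", "regime", "liquidity");
        System.out.println(evaluate(cfg, onlyEasyCases));
    }
}
```

A caller would map FALLBACK_TO_EXPECTED to robustObjective=ADEQUACY_FALLBACK in the ledger and BLOCKED to a governance gate failure with no policy swap.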

The point of the adequacy gate is not to make every publication robust. Most cycles will be ADEQUATE; some will fall back; very few should be strictBlock-blocked. The point is that the strength of the robustness claim is bounded by the gate’s outcome. A policy published with adequacyStatus=ADEQUATE and robustObjective=CVAR_K is making a defensible robustness claim. A policy published with adequacyStatus=INADEQUATE and robustObjective=ADEQUACY_FALLBACK is not claiming robustness; it is claiming expected-condition quality, and the lineage says so. The system does not overclaim.

21. Decision provenance and the audit trail

Every Phase 7 publication path stamps provenance into PolicyChangeLedgerEntry:

robustObjective                  EXPECTED | MIN_MAX | CVAR_K | MIN_REGRET
                                 | SINGLE_CANDIDATE_FALLBACK
                                 | ADEQUACY_FALLBACK
robustObjectiveParameters        e.g. cvarKPercent=10
scenarioSetId                    e.g. default-robust-v1
scenarioSetVersion               monotonic
scenarioCount                    actual scenarios evaluated
candidateCount                   actual candidates after dedup
scoreMatrixHandle                governance handle pointing to the persisted artifact
adequacyStatus                   ADEQUATE | INADEQUATE
adequacyMissingCategories        list when adequacy failed

The score matrix itself is persisted to disk as a CSV under build/robust-selection/score-matrix/<matrixId>.csv, with columns:

matrixId,scenarioSetId,scenarioSetVersion,scorecardVersion,candidateId,
candidatePolicyHash64,candidatePolicyHashSha256Hex,
scenarioId,scenarioCategory,score

matrixId is composed from scenario-set identity, candidate IDs, candidate hashes, scenario IDs, and score content so distinct runs do not overwrite prior evidence. The artifact is the receipt: a published policy carries a ledger entry that points to a matrix that contains the full evidence behind the selection decision. A post-mortem investigation, a regulator inquiry, or a TCA dispute can replay the cycle from the lineage and reproduce the matrix exactly.
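One way to derive a content-addressed matrixId is to hash the identifying fields and every score in a canonical order. The sketch below is an assumption about how such an ID could be built, not the repository's scheme: the Row record, the zero-byte field separator, and the truncation to 16 hex characters are all illustrative choices, and the caller is assumed to pass rows in a canonical order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

/** Hypothetical sketch: derive a matrix ID from scenario-set identity and row content. */
final class MatrixIdSketch {

    record Row(String candidateId, String candidatePolicyHashSha256Hex,
               String scenarioId, String scenarioCategory, long score) {}

    static String matrixId(String scenarioSetId, long scenarioSetVersion, List<Row> rows) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            update(sha, scenarioSetId);
            update(sha, Long.toString(scenarioSetVersion));
            for (Row r : rows) { // rows must arrive in canonical order
                update(sha, r.candidateId());
                update(sha, r.candidatePolicyHashSha256Hex());
                update(sha, r.scenarioId());
                update(sha, r.scenarioCategory());
                update(sha, Long.toString(r.score()));
            }
            return HexFormat.of().formatHex(sha.digest()).substring(0, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }

    /** Feeds one field followed by a zero byte so adjacent fields cannot run together. */
    private static void update(MessageDigest sha, String field) {
        sha.update(field.getBytes(StandardCharsets.UTF_8));
        sha.update((byte) 0);
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("c0", "ab12", "s0", "regime", 100L));
        System.out.println(matrixId("default-robust-v1", 3L, rows));
    }
}
```

Because the ID is a function of the content, rerunning an identical cycle yields the same file name, while any change in a candidate hash or a score yields a new one, which matches the stated goal of never overwriting prior evidence.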

When robust selection is disabled (robustSelection.enabled=false, the repository default), the system continues to publish single candidates through the existing gate. The ledger entry records robustObjective=SINGLE_CANDIDATE_FALLBACK, which is honest about the absence of a robust claim. The disabled mode is required for Phase 1–6 regression safety and for staged rollout; flipping the flag does not break any prior phase’s behavior. Robust selection is additive, not invasive.


Part VI — Discipline and evidence

[Figure: three columns showing ArchUnit rules at build time, JMH benchmarks at test time, and scenario catalog + completion reports at release time, composing into a single release gate that blocks publication when any layer regresses.]

Figure 7. Three-layer defense. No single check stops a regression; each layer catches what the others miss; together they are what publication has to pass.

22. Deterministic replay across the full stack

The prior article required the execution-side to be deterministic and replayable. Adaptive Quantum SOR extends the requirement across every layer of the stack. Every random-number stream is seeded; every wall-clock read uses the cluster’s monotonic deterministic clock; every map iteration uses primitive int keys; every native backend has a deterministic reference fallback; every cycle records the input snapshot ID, optimizer run ID, model signal version, and (Phase 7) score-matrix handle.

The verification surface is a scenario catalog. Each scenario specifies seed, tick range, expected behaviors, and a checksum on the resulting ScenarioSummary. A commit that changes any layer’s behavior in a way that affects routing decisions changes one or more scenario checksums; the change is either an intended behavior update (in which case the scenario expectations are updated alongside the code) or a determinism regression (in which case the build fails and the change is reviewed). The catalog runs under Gradle’s scenario test profile; CI runs it on every change; the failure is visible before merge.
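A checksum over a scenario summary can be as simple as a CRC over a canonical string of its fields. The sketch below is illustrative only: the trimmed Summary record, the field labels in the canonical string, and the choice of CRC32 are assumptions, and the repository may use a different encoding or hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch: checksum a scenario summary and compare it to the recorded value. */
final class ScenarioChecksumSketch {

    record Summary(long fullFillCount, long partialFillCount, long rejectCount,
                   long residualQty, long staleEventCount, long outageEventCount) {}

    /** Fixed field order and explicit labels keep the encoding stable across refactors. */
    static String canonical(Summary s) {
        return "fullFills=" + s.fullFillCount()
             + ";partialFills=" + s.partialFillCount()
             + ";rejects=" + s.rejectCount()
             + ";residualQty=" + s.residualQty()
             + ";staleEvents=" + s.staleEventCount()
             + ";outageEvents=" + s.outageEventCount();
    }

    static long checksum(Summary s) {
        CRC32 crc = new CRC32();
        crc.update(canonical(s).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** True when a replayed summary still matches the checksum recorded in the scenario file. */
    static boolean matches(Summary replayed, long recordedChecksum) {
        return checksum(replayed) == recordedChecksum;
    }

    public static void main(String[] args) {
        Summary s = new Summary(10, 4, 2, 300, 1, 0);
        System.out.println(canonical(s) + " -> " + checksum(s));
    }
}
```

A scenario test would fail the build when matches returns false, leaving the author to either fix the regression or update the recorded checksum together with the code change.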

Three properties make this discipline real:

The hot path (L0) is JMH-tested for zero allocation on the routing-decision path. The strict allocation evidence is gc.alloc.rate.norm ≈ 10⁻⁴ B/op on PolicyDrivenSorExecutioner.routeInto(...), the caller-owned routing API. The B/op is effectively zero after warmup; the deviation is the JMH framework’s own bookkeeping noise. Allocation regressions on this path are build breaks.

The warm path (L1–L4) is unit-tested for replay determinism. Same inputs produce same compiled policy hashes; same QUBO instances produce same strategic subset results (under the deterministic reference); same scenario seeds produce same summaries. Inter-cycle determinism is tested via PolicyOptimizerCoordinatorTest and the policy lineage tests; intra-cycle determinism is tested via the simulator and scenario test suites.

The L6 catalog is the canonical test surface for end-to-end behavior. A scenario whose summary checksum drifts is the deterministic-replay equivalent of a unit-test failure: investigated, attributed, fixed or accepted with an explicit update, never silently ignored.

23. The JMH allocation gate on the hot path

The L0 hot path’s allocation discipline is enforced by a strict JMH benchmark, run with -prof gc:

JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64 ./gradlew jmh \
  -PjmhInclude=PolicyDrivenSorJmhBenchmark.strictRouteInto \
  -PjmhWarmupIterations=3 \
  -PjmhMeasurementIterations=5 \
  -PjmhForks=1

The benchmark targets routeInto(...), which routes into caller-owned MutableRouteDecisionResult and RouteAuditEvent objects. The convenience route(...) API that returns immutable result objects is intentionally not the benchmark target; that path is for tests and notebook flows, where allocation is acceptable. The strict variant is the production-relevant path.
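The difference between the two API shapes can be illustrated with a toy router. Everything here is a hypothetical stand-in: MutableDecision plays the role of a caller-owned result object, and the routing rule (pick the highest-weight venue) is a placeholder, not the SOR's actual decision logic.

```java
/** Hypothetical sketch: caller-owned output versus an allocating convenience API. */
final class CallerOwnedRoutingSketch {

    /** Reusable output owned by the caller; stands in for a mutable result object. */
    static final class MutableDecision {
        int venueId;
        long quantity;

        void set(int venueId, long quantity) {
            this.venueId = venueId;
            this.quantity = quantity;
        }
    }

    /** Immutable convenience result; each call to route(...) allocates one. */
    record Decision(int venueId, long quantity) {}

    private final int[] venueWeightBps; // index = venueId

    CallerOwnedRoutingSketch(int[] venueWeightBps) {
        this.venueWeightBps = venueWeightBps.clone();
    }

    /** Strict path: reads primitives from a preallocated array and writes into out. */
    void routeInto(long quantity, MutableDecision out) {
        int best = 0;
        for (int v = 1; v < venueWeightBps.length; v++) {
            if (venueWeightBps[v] > venueWeightBps[best]) best = v;
        }
        out.set(best, quantity);
    }

    /** Convenience path for tests and notebooks: allocates a scratch object and a result. */
    Decision route(long quantity) {
        MutableDecision scratch = new MutableDecision();
        routeInto(quantity, scratch);
        return new Decision(scratch.venueId, scratch.quantity);
    }

    public static void main(String[] args) {
        CallerOwnedRoutingSketch router = new CallerOwnedRoutingSketch(new int[] {2000, 5000, 3000});
        MutableDecision reused = new MutableDecision(); // allocated once, outside the loop
        for (long qty = 100; qty <= 300; qty += 100) {
            router.routeInto(qty, reused);
            System.out.println(reused.venueId + " " + reused.quantity);
        }
    }
}
```

The caller allocates the output once and reuses it for every decision, so the strict method has nothing to allocate per call; the convenience wrapper trades that property for an immutable return value.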

The reported numbers from a representative run:

PolicyDrivenSorJmhBenchmark.strictRouteInto:
  average      = ~72 ns/op
  gc.alloc.rate.norm = ~10⁻⁴ B/op
  gc.count     = ~0 counts

The interpretation: under the configured warmup and measurement, the strict routing path is sub-100ns on the benchmark hardware, with effectively zero allocation per operation. The B/op value is benchmark-framework noise, not a real allocation; the more important number is gc.count, which confirms that no garbage collection occurred during the measurement window. Scope, stated explicitly: this benchmark measures the in-process policy-driven routing function only (PolicyDrivenSorExecutioner.routeInto(...) reading the active SorPolicy and writing a ChildOrderBuffer). It excludes external venue I/O, kernel/network latency, FIX or binary encode/decode, market-data ingestion, exchange acknowledgments, and every other component of the full order lifecycle. The claim is bounded and specific: the routing-decision function itself is sub-100ns and zero-allocation. The full lifecycle is dominated by network and venue latency that this benchmark does not, and could not, measure.

The discipline this number enforces is the discipline the prior article spelled out for the execution side. ArchUnit rules in source forbid the patterns that would cause allocation; JMH proves the runtime does not allocate; the preflight gate composes both into a release-blocking check. A change that increases B/op on this benchmark is a build break, not a performance discussion.

24. Scenario catalogs as a test surface

The scenario catalog under scenarios/<category>/*.yaml is both documentation and test surface. Each scenario file is a declarative YAML that the ScenarioDefinitionLoader parses into a ScenarioSpec, which the ScenarioRunner runs against an isolated ScenarioEngineContext. The output is a ScenarioSummary with stable, deterministic fields: fill counts, reject counts, residual quantities, stale events, outage events, lifecycle annotations.

The categories cover the failure modes the design claims to handle:

The catalog is the test surface that lets the design make defensible claims about its behavior under stress. A claim like “L0 routes safely when liquidity disappears” is backed by liquidity/zero_liquidity_safe_route.yaml plus its checksummed expected summary. A claim like “the strategic layer’s pair coefficients change subset selection vs. independent ranking” is backed by optimizer-policy/quadratic_pair_override.yaml. A claim like “the batch allocator handles correlated venue leakage” is backed by optimizer-policy/batch_correlated_venue_leakage.yaml. The claim and the evidence travel together.

25. What Adaptive Quantum SOR explicitly does not do

In the spirit of the prior article’s §17, here are the boundaries the design respects:

It does not put GPU work on the order-time path. L0 is Java-only, allocation-free, JMH-gated. CUDA, CUDA-Q, cuOpt, and any other native backend live exclusively on the warm path and exclusively behind fail-closed boundaries. A production deployment without any GPU runs L0 identically to one with GPUs available.

It does not claim quantum advantage at production sizes today. The design is quantum-ready (the seams exist, the QUBO/Ising formulations are real, the CUDA-Q bridge has the same ABI as the classical reference) but the empirical performance of any specific QPU or CUDA-Q backend at the firm’s actual instance sizes is a workload question. The reference exhaustive solver is the always-available default.

It does not depend on Python at runtime. Python is used for offline training, research notebooks, and the Jupyter control plane. The JVM imports model artifacts and policies that Python produced; it never calls Python during execution or optimization cycles.

It does not provide cross-cluster strategic coordination. Each cluster runs an independent optimizer stack and an independent policy publisher. Federated strategic optimization across multiple firms or clusters is out of scope.

It does not implement real exchange connectivity. The simulator layer is the canonical source of market data and execution outcomes. Real venue plugins, real FIX gateways, real co-location latency, real settlement, and real regulatory reporting are deferred to follow-on integration work behind the existing venue and protocol axes.

It does not include FPGA acceleration. L0 is fast enough in Java; warm-path optimizers do not have the latency budget pressure where FPGA pays. If a future workload changes this, the right answer is a new task card with its own evidence.

It does not promise a single hot-path policy versioning protocol. The current discipline is atomic-reference swap on an immutable SorPolicy. Cross-cluster policy synchronization, partial-policy updates, and live A/B routing on the same instrument are explicit follow-ons.

It does not implement a production model registry, feature store, or MLOps platform. The model artifact contract is intentionally narrow (manifest, predictions, checksum, validator). Production-grade ML infrastructure is a separate engineering investment.

It does not provide native crash isolation for the bridges. A native segfault inside the CUDA, CUDA-Q, or batch backends can in principle take down the JVM. Process-level native crash isolation (separate native worker processes, IPC-based bridges) is documented as a follow-on engineering task.

The list is intentionally specific because the boundaries are part of the architecture, not embarrassments at the edge of it. A serious engineering team reading this article should be able to understand what the current release claims and what it does not, and a serious engineering team reading the next release’s report should be able to see precisely which of these boundaries moved.

Key Principle: Future-proof design is a discipline of explicit boundaries. The system is quantum-ready because the QUBO seams are real, the backend interfaces are stable, and the reference solvers are always present, not because any specific quantum backend has shipped to production. The same discipline applies to GPU/CUDA, to ML, and to every other “advanced” capability the design supports. Where the boundary moves, the evidence has to move with it.


Closing: what this architecture earns

The previous article showed how to make an SOR’s order-time path microsecond-fast, allocation-free, deterministic, and benchmark-gated. This article shows how to make the optimization stack above it cadence-correct, GPU-integrated, quantum-ready, and audit-stamped, without putting any of the optimization stack’s failure surface on the order-time path.

The architecture earns three things from that separation.

It earns the right to call itself adaptive: every routing decision references a policy version, every policy version references its optimizer lineage, every optimizer lineage references its input snapshot, and every input snapshot can be replayed end-to-end to reproduce the original decision. The audit trail is dense, machine-readable, and replay-stable. A regulator, a TCA dispute, or a post-mortem investigation can reconstruct any historical decision from the ledger.

It is quantum-ready in the preparedness sense the article spent Part IV defining: the strategic subset-selection problem is a genuine QUBO with non-trivial pair structure, the batch allocation problem is a genuine constrained quadratic assignment with same-venue and cross-venue coupling, the backend boundaries have stable ABIs and fail-closed fallbacks, and the reference solvers give every backend a correctness oracle on small problems. Switching a backend (to a hand-rolled CUDA kernel, to cuOpt, to a CUDA-Q hybrid solver, to a future QPU) requires changing the backend; nothing else moves.

It is future-proof in the only sense the word should mean: the boundaries that contain today’s experimental capabilities (CUDA at L3, CUDA-Q at L4, cuOpt at L3.5, ML at L5, robust selection at L1) are the same boundaries that will contain tomorrow’s. A new strategic backend is a new C++ implementation against the existing header. A new tactical backend is a new shared library against the existing ABI. A new robust objective is a new pure function over the existing ScoreMatrix. A new scenario category is a new directory in scenarios/. None of these touch L0, and none of these break replay.

The L0 path is sub-100ns per route, zero-allocation, and Java-only. Everything above it is allowed to be slow, exotic, and replaceable. That asymmetry is not a compromise; it is the design.

A staged adoption path

[Figure: five horizontal stages showing Java-only reference, GPU tactical, robust selection, strategic QUBO and batch, and quantum experimentation. Each adds one capability while keeping L0 byte-identical.]

Figure 8. The staged adoption path. Each stage adds one capability behind a boundary; none requires the previous stage to be torn down; L0 is byte-identical across all five.

For an architect evaluating whether to migrate a static or single-tier SOR onto this design, the phasing in Adaptive Quantum SOR’s spec is itself the adoption path. The same staging works for a from-scratch build:

No stage breaks the stage below it. A firm that stops at Stage 1 has an adaptive SOR. A firm that gets to Stage 5 has a quantum-experimental adaptive SOR whose order-time path is byte-identical to Stage 1’s. The discipline of stable boundaries and reference fallbacks is what makes the path traversable.

Final Principle: A trading system’s architecture is most honest when it is most explicit about which layers are allowed to fail and which are not. The order-time path is not allowed to fail. Everything else (the GPU, the QPU, the model artifact, the scenario sweep, the robust objective, the strategic subset solver) is allowed to fail, and each has a fallback. The space between “allowed to fail” and “actually fails” is where engineering rigor lives. The catalog of fallbacks, references, and evidence bundles is what makes the rigor reviewable.


Appendix A — Indicative scale and timing

Specific numbers belong in workload-specific evaluations, not in architecture documents, because they depend sharply on instrument count, venue count, regime granularity, and the hardware the firm actually runs on. That said, the design is sized for a specific shape, and being concrete about the shape is more useful than being vague.

Route-key cardinality. A representative firm-scale deployment in Adaptive Quantum SOR’s target envelope sees on the order of 50–500 instruments, 3–8 regime classes, and 2–4 urgency tiers. The route-key product therefore ranges from a few hundred (50 × 3 × 2 = 300) to the low tens of thousands (500 × 8 × 4 = 16,000): small enough that the L3 tactical optimizer can iterate over all active route keys in a single warm-path cycle on a single GPU, and small enough that the L4 strategic optimizer can run a per-route-key QUBO solve at a 5–30 minute cadence without queueing. The dense indexing model ((instrumentId × regimeCount + regimeId) × urgencyCount + urgencyId) keeps the memory layout linear in this count.
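The dense indexing formula translates directly into code. The class below is a hypothetical sketch (the name RouteKeyIndexSketch and the bounds checks are additions for illustration); only the index expression itself comes from the text.

```java
import java.util.Objects;

/** Hypothetical sketch of the dense route-key index described above. */
final class RouteKeyIndexSketch {

    private final int instrumentCount;
    private final int regimeCount;
    private final int urgencyCount;

    RouteKeyIndexSketch(int instrumentCount, int regimeCount, int urgencyCount) {
        this.instrumentCount = instrumentCount;
        this.regimeCount = regimeCount;
        this.urgencyCount = urgencyCount;
    }

    /** Number of slots a flat per-route-key array needs. */
    int size() {
        return instrumentCount * regimeCount * urgencyCount;
    }

    /** (instrumentId * regimeCount + regimeId) * urgencyCount + urgencyId, with bounds checks. */
    int index(int instrumentId, int regimeId, int urgencyId) {
        Objects.checkIndex(instrumentId, instrumentCount);
        Objects.checkIndex(regimeId, regimeCount);
        Objects.checkIndex(urgencyId, urgencyCount);
        return (instrumentId * regimeCount + regimeId) * urgencyCount + urgencyId;
    }

    public static void main(String[] args) {
        RouteKeyIndexSketch keys = new RouteKeyIndexSketch(500, 8, 4);
        long[] perKeyState = new long[keys.size()]; // one flat array, no maps, no boxing
        perKeyState[keys.index(1, 2, 3)] = 42L;
        System.out.println(keys.size() + " slots, index(1,2,3)=" + keys.index(1, 2, 3));
    }
}
```

At the upper end of the stated envelope (500 instruments, 8 regimes, 4 urgency tiers) the flat array has 16,000 slots, and every lookup is two multiplications and two additions. A hot-path variant would likely skip the bounds checks in favor of validation at policy-compile time.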

Venue universe per route key. The target envelope is 100–500 configured venues globally with subsets of 10–30 venues active per route key after L4 strategic selection. The exhaustive QUBO reference at L4 handles up to n=30 with 2^n mask enumeration. Order-of-magnitude wall times on a modern CPU core, per route key:

These are per route key; thousands of route keys aggregate but most reuse identical or near-identical input structure and are amenable to batched evaluation across keys. n = 30 is the regime where a CUDA-Q or QPU backend has the most room to earn its place: the reference solver remains correct but expensive, the backend boundary is already in place, and the firm’s eligible-venue counts in stressed regimes are precisely where n pushes toward the upper bound.
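For intuition on the 2^n mask enumeration, the sketch below brute-forces a tiny QUBO. It is an illustration under assumptions, not the reference solver: it minimizes energy, stores linear terms on the diagonal and pair terms in the upper triangle, breaks ties toward the lowest mask, and uses hypothetical names throughout.

```java
/** Hypothetical sketch: exhaustive QUBO minimization by enumerating all 2^n bit masks. */
final class ExhaustiveQuboSketch {

    /** Energy of one subset: diagonal terms for chosen venues plus upper-triangle pair terms. */
    static long energy(long[][] q, int mask) {
        long e = 0;
        for (int i = 0; i < q.length; i++) {
            if ((mask & (1 << i)) == 0) continue;
            e += q[i][i];
            for (int j = i + 1; j < q.length; j++) {
                if ((mask & (1 << j)) != 0) e += q[i][j];
            }
        }
        return e;
    }

    /** Returns the lowest-energy mask; strict '<' keeps the lowest mask on ties. */
    static int bestMask(long[][] q) {
        int n = q.length;
        if (n > 30) throw new IllegalArgumentException("int masks support n <= 30");
        int best = 0;
        long bestEnergy = 0; // the empty subset has energy 0
        for (int mask = 1; mask < (1 << n); mask++) {
            long e = energy(q, mask);
            if (e < bestEnergy) {
                bestEnergy = e;
                best = mask;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Venues 0 and 1 look best individually but carry a pair penalty of 6 together.
        long[][] q = {
            {-5, 6, 1},
            { 0, -4, 1},
            { 0, 0, -3},
        };
        int mask = bestMask(q);
        System.out.println(Integer.toBinaryString(mask) + " energy=" + energy(q, mask));
    }
}
```

In the three-venue example, ranking venues independently would pick 0 and 1, but the pair penalty makes the subset {0, 2} the minimum. The loop visits 2^n masks, which is roughly a billion at n = 30, so the cost doubles with every added venue.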

L0 routing latency. The JMH -prof gc evidence is ~72 ns/op average on PolicyDrivenSorExecutioner.routeInto(...) with gc.alloc.rate.norm ≈ 10⁻⁴ B/op (effectively zero after warmup, dominated by JMH bookkeeping). Scope: this measures the in-process policy-driven routing function only; it excludes market-data ingestion, protocol encoding, kernel/network latency, venue I/O, exchange acknowledgments, and full order lifecycle latency. These numbers are from the reference benchmark on a representative developer host; production hosts with CPU pinning, NUMA affinity, and dedicated cores may see comparable or better numbers, but the claim should be revalidated on the target deployment hardware. The load-bearing claim is bounded and specific: the routing-decision function itself is sub-microsecond and zero-allocation, regardless of how exotic the optimizer stack above it gets.

L3 tactical cycle. The MVP CUDA kernel shipped in the repo is intentionally trivial (a bounded sum-and-clamp on a few bps values) and runs in single-digit microseconds per route key, with the bulk of the cycle dominated by Java-side input marshalling and validation. A production replacement (a constrained QP per route key, batched across route keys on the GPU) sits in the millisecond-to-tens-of-milliseconds range per cycle for the route-key counts above, which fits comfortably inside the 1–5 minute cadence and well inside the GPU-side timeout budget.

L3.5 batch allocation. The deterministic reference solver is fast for the parent counts the trigger heuristics actually produce (typically single-digit active large parents per cycle), with unit-quantity enumeration completing in single-digit milliseconds. Larger batches of tens of parents quickly exceed reference-solver feasibility and are the natural home for cuOpt or QUBO backends, with the reference acting as the small-problem correctness oracle as described in §8.

These are indicative envelopes, not contractual benchmarks. The architectural claim is that the cadences are correct for the shape; the realized timing on any specific deployment is a benchmark question with a benchmark answer.


Appendix B — Pointers into the reference implementation

For readers reviewing or extending the reference implementation, the relevant entry points are:


Appendix C — Sources for the quantitative claims in the introduction

The intro’s empirical claims about GPU speedups and the quantum hardware horizon are not handwaves; they map to specific public sources.

GPU speedup for QP-class optimization. Schubiger, Banjac, and Lygeros, “GPU acceleration of ADMM for large-scale quadratic programming” (arXiv:1912.04263; Journal of Parallel and Distributed Computing, 2020), report a CUDA-C GPU implementation of OSQP measured “up to two orders of magnitude faster than the CPU implementation” on large QP problems, with the explicit caveat that “GPUs are not suited for solving small problems for which the CPU implementation is generally much faster.”

GPU speedup for QAP-class combinatorial optimization. Novoa and Qasem, “GPU-accelerated Parallel Solutions to the Quadratic Assignment Problem” (arXiv:2307.11248), report order-of-magnitude average speedups with up to 63× on specific QAPLIB instances, and conclude that “both algorithmic choice and the shape of the input data sets are key factors in finding efficient implementations.”

GPU LP barrier solver. NVIDIA’s cuOpt LP barrier method, in internal benchmarks published on the NVIDIA Developer Blog (November 2025), reports “over 8x average speedup compared to a leading open source CPU solver and over 2x average speedup compared to a popular commercial CPU solver” on a public test set of large linear programs.

CUDA-Q. NVIDIA’s own platform documentation describes CUDA-Q as “qubit-agnostic — seamlessly integrating with all QPUs and qubit modalities and offering GPU-accelerated simulations when adequate quantum hardware isn’t available.”

Quantum hardware roadmap. IBM’s published quantum roadmap (https://www.ibm.com/roadmaps/quantum/) and the June 2025 announcement “IBM lays out clear path to fault-tolerant quantum computing” (https://www.ibm.com/quantum/blog/large-scale-ftqc) target IBM Quantum Starling — 200 logical qubits, 100 million gates — by 2029, and IBM Quantum Blue Jay — 2,000 logical qubits, 1 billion gates — by 2033. The November 2025 announcement reiterates the path to “quantum advantage by the end of 2026 and fault-tolerant quantum computing by 2029.”

Quantum portfolio and combinatorial optimization in finance. The “killer-app” candidate framing for combinatorial optimization on near-term and fault-tolerant quantum hardware in finance has an active multi-year literature; representative entries include Hodson et al. (arXiv:1911.05296) on QAOA-based portfolio rebalancing; Slate et al. (arXiv:2011.08057) on quantum-walk-based portfolio optimization; Egger et al. and follow-ons on NISQ-HHL formulations; and recent VQE / Dicke-state ansatz work (arXiv:2403.04296) for portfolio optimization. None of this work demonstrates quantum advantage at industrial scale today; collectively it identifies pair-coupled binary combinatorial optimization as one of the cleanest application targets for emerging fault-tolerant hardware.