Schema Validation Frameworks

A day-ahead confirmation that arrives with price serialized as the string "31.1000000000004" and a delivery_hour of 24 will not fail loudly — it will settle silently against the wrong interval, and the break surfaces a week later as a counterparty invoice dispute no one can trace. That is the failure mode this component exists to eliminate: structurally malformed or out-of-contract payloads that clear ingestion, contaminate the reconciliation ledger, and only reveal themselves as settlement disputes, margin discrepancies, or FERC recordkeeping gaps after the window has closed. Within the Trade Ingestion & Matching Workflows domain, schema validation is the gatekeeper stage: it sits between raw acquisition and deterministic matching, admits only records that satisfy an explicit typed contract, and diverts everything else to a dead-letter queue for remediation. Get it right and every downstream stage — pandas transformation, matching, the Settlement Calculation & Validation Engines that turn matched trades into invoices — inherits a clean, auditable, reproducible input.

The flowchart below shows how the validation gateway routes each incoming payload: clean records advance to reconciliation, transient connectivity failures replay with backoff, and structural defects divert to a dead-letter queue for manual remediation.

Validation Architecture and Enforcement Layers

Schema validation for energy trade data is not a single check; it is three enforcement layers stacked at the ingestion boundary, each catching a different failure class. The first is structural: is this JSON well-formed, are the required keys present, is price a number and not null? The second is type and domain: does delivery_hour fall in [1, 24] (or 25 on the DST fall-back day), is settlement_interval one of the operator’s defined granularities, does volume_mwh parse into Decimal space without loss? The third is cross-field and regulatory: does the delivery date align with the published market calendar, does the product_type match the units, does the counterparty LEI resolve against the master registry? A record must clear all three to reach matching; a failure at any layer routes it to quarantine with a structured, machine-readable reason rather than a silent coercion.

While JSON Schema establishes a reliable baseline for structural and type checking, production-grade energy environments demand runtime type coercion, nested model composition, and cross-field dependency validation that a static schema cannot express. This is why the concrete implementation guidance for Implementing Pydantic for energy trade validation treats the typed contract as the source of truth: it enforces precise decimal precision for volumetric quantities, validates delivery-hour granularity, and cross-references contract identifiers against master data, eliminating the silent float truncation that turns 31.10 into an IEEE-754 artifact before it ever reaches the ledger. The transport and pagination concerns that deliver these payloads belong to ETRM API Integration Patterns; this component owns only what happens to a payload once it is in hand.

Validation is deliberately stateless and idempotent. The same record validated twice yields the same verdict, which is what lets an Async Batch Processing Pipelines worker replay a batch after a network partition without changing outcomes. Financial fields are checked entirely in Decimal space — a half-cent price adjustment that survives as a float rounding error is itself a settlement break.
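To make the Decimal-space rule concrete, here is a minimal sketch using only the standard library. The 0.10 adder and the ten-interval sum are illustrative values, not figures from any market feed.

```python
from decimal import Decimal

# A float literal is stored as the nearest binary fraction, not the decimal itself.
stored = Decimal(31.10)          # exact value the float actually holds
intended = Decimal("31.10")      # exact decimal parsed from a string
representation_error = stored - intended

# Summing a 0.10 adder over ten intervals: the float total drifts, Decimal does not.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

print(representation_error != 0)   # the float never held 31.10 exactly
print(float_total == 1.0)          # False: rounding error has accumulated
print(decimal_total)               # exact decimal total
```

The same effect, multiplied across thousands of intervals and counterparties, is what the string-to-Decimal parsing rule is meant to prevent.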

Specification and Standards Reference

Validation rules are not a free-form engineering choice; the fields, formats, and cadences are pinned by market rules, and a validator that ignores them produces records that are structurally valid but unsettleable.

NAESB WEQ Business Practice Standards define the electronic scheduling and OASIS data formats US wholesale markets exchange, including the transaction codes and quantity/price conventions the validator must accept as canonical.
FERC recordkeeping (18 CFR Part 125) and the Electric Quarterly Report require every ingested transaction to be retained and reproducible; validation satisfies this only if each payload is hashed and audit-logged before transformation, so the retained record is provably the one that was received.
ISO/RTO settlement calendars publish the delivery dates and interval granularities a record must align to. The concrete parsing rules for those extracts live in ISO/RTO Data Format Standards, and the window arithmetic in Settlement Cycle Mapping.
REMIT / MiFID II RTS 22 attach field-level obligations to EU power and gas trades — LEIs, ISO 8601 UTC timestamps, and standardized product codes that must be normalized at the validation boundary rather than deferred to reporting time.
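One way to approach the hash-before-transformation requirement is to digest the raw bytes exactly as received, before any parsing. The sketch below assumes an in-memory list stands in for the audit log; `audit_received_payload`, the entry fields, and the `hypothetical_etrm` source name are illustrative, not part of any standard or library.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_received_payload(raw_bytes: bytes, source_system: str, audit_log: list) -> str:
    """Hash the payload exactly as received and log it before any parsing."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    audit_log.append({
        "sha256": digest,
        "source_system": source_system,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(raw_bytes),
    })
    return digest

audit_log: list = []
raw = b'{"trade_id": "T1", "price": "31.10"}'
digest = audit_received_payload(raw, "hypothetical_etrm", audit_log)
parsed = json.loads(raw)   # parsing happens only after the hash is logged
```

A production system would write the entry to durable, append-only storage rather than a Python list.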

The practical consequence is that the ingestion boundary is also a compliance boundary. A record that clears validation but fails a downstream regulatory field check is already a reporting breach, so those checks belong at ingress, embedded in the same contract that enforces structure.

Field	Contract rule	Governing standard
`price`	`Decimal`, band `[-1000, 100000]` $/MWh	NAESB WEQ quantity/price
`delivery_hour`	integer `1..24` (or `1..25` on DST fall-back)	ISO/RTO calendar
`volume_mwh`	`Decimal`, `>= 0`, non-null	NAESB WEQ
`delivery_period_start`	ISO 8601, timezone-aware	REMIT / RTS 22
`transaction_code`	enum of NAESB codes	NAESB WEQ
`counterparty_lei`	20-char, resolves in registry	REMIT
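Several rows of the table can be checked with the standard library alone. The sketch below is one possible encoding, not the contract itself: the function names are hypothetical, the LEI check covers only format and the ISO 17442 mod-97 check digits, and the registry lookup is not shown.

```python
import re
from datetime import datetime
from decimal import Decimal

PRICE_MIN, PRICE_MAX = Decimal("-1000"), Decimal("100000")   # $/MWh band from the table
LEI_PATTERN = re.compile(r"[A-Z0-9]{18}[0-9]{2}")

def lei_checksum_ok(lei: str) -> bool:
    """ISO 17442 check: letters map to 10..35 and the resulting number must be 1 mod 97."""
    if not LEI_PATTERN.fullmatch(lei):
        return False
    return int("".join(str(int(ch, 36)) for ch in lei)) % 97 == 1

def field_violations(price: Decimal, start: datetime, lei: str) -> list[str]:
    """Collect every rule the record breaks instead of stopping at the first."""
    reasons = []
    if not PRICE_MIN <= price <= PRICE_MAX:
        reasons.append("price_out_of_band")
    if start.tzinfo is None or start.utcoffset() is None:
        reasons.append("delivery_period_start_not_timezone_aware")
    if not lei_checksum_ok(lei):
        reasons.append("counterparty_lei_malformed")
    return reasons

violations = field_violations(Decimal("250000"), datetime(2026, 3, 8, 5, 0), "TOO_SHORT")
# → ['price_out_of_band', 'delivery_period_start_not_timezone_aware', 'counterparty_lei_malformed']
```

Returning a list of reason codes rather than raising on the first failure keeps the dead-letter entry complete, so remediation sees every defect in one pass.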

Step-by-Step Implementation

A production validation gate assembles in five ordered steps. Each is independently testable and carries its own failure semantics.

1. Declare the typed contract

The contract is the single source of truth. Encode every field’s type, domain, and financial precision once, so structural and type validation are one call rather than a scatter of hand-rolled if checks.

from decimal import Decimal
from datetime import datetime
from pydantic import BaseModel, field_validator, ConfigDict

class TradeRecord(BaseModel):
    model_config = ConfigDict(extra="forbid")   # unknown fields = schema drift, reject

    trade_id: str
    node_id: str
    delivery_period_start: datetime
    settlement_interval: str
    delivery_hour: int
    volume_mwh: Decimal
    price: Decimal
    transaction_code: str

    @field_validator("delivery_hour")
    @classmethod
    def hour_in_range(cls, v: int) -> int:
        if not 1 <= v <= 25:                     # 25 admits the DST fall-back hour
            raise ValueError("delivery_hour_out_of_range")
        return v

2. Enforce financial precision in Decimal space

Never let money touch float. Parse price and volume from their string representations directly into Decimal, and quantize volume to the operator’s settlement granularity so a downstream groupby().sum() cannot accumulate representation error across thousands of intervals.

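from decimal import Decimal, InvalidOperation, ROUND_HALF_EVEN

VOLUME_QUANTUM = Decimal("0.001")   # kWh-level precision on MWh quantities
PRICE_MIN, PRICE_MAX = Decimal("-1000"), Decimal("100000")   # plausibility band, $/MWh

def coerce_money(raw_price: str, raw_volume: str) -> tuple[Decimal, Decimal]:
    try:
        # str() keeps an upstream float from carrying its binary expansion into Decimal.
        price = Decimal(str(raw_price))
        volume = Decimal(str(raw_volume)).quantize(VOLUME_QUANTUM, rounding=ROUND_HALF_EVEN)
    except InvalidOperation as exc:
        raise ValueError(f"non_decimal_financial_field:{exc}") from exc
    if not (price.is_finite() and volume.is_finite()):   # "NaN" and "Infinity" parse as Decimal
        raise ValueError("non_finite_financial_field")
    if volume < 0:
        raise ValueError("negative_volume")
    if not PRICE_MIN <= price <= PRICE_MAX:   # negatives are admitted; only the absurd is rejected
        raise ValueError("price_out_of_band")
    return price, volume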

3. Run cross-field and regulatory checks

Structural validity is not enough — a record can be well-typed and still violate a market rule. Verify that the delivery date is an open settlement date, that the transaction_code is in the NAESB enum, and that price and product are mutually consistent.

# Illustrative placeholder values; load the real code list from the market's reference data.
NAESB_TX_CODES = {"BUY", "SELL", "TRANSFER", "CURTAIL"}

def validate_business_rules(record: TradeRecord, open_dates: set) -> None:
    if record.transaction_code not in NAESB_TX_CODES:
        raise ValueError(f"unknown_transaction_code:{record.transaction_code}")
    if record.delivery_period_start.date() not in open_dates:
        raise ValueError("delivery_date_outside_open_settlement_window")
    # A CURTAIL leg with positive volume is internally inconsistent.
    if record.transaction_code == "CURTAIL" and record.volume_mwh > 0:
        raise ValueError("curtailment_with_positive_volume")

4. Stamp lineage and an idempotency key

Before persistence, stamp each validated record with a content hash that doubles as its idempotency key and its audit anchor. Re-delivery of the same payload then collapses to a no-op, and because the hash is computed over a canonical serialization (sorted keys, fixed separators), it shows at audit time that the settled record has the same content as the one retained under 18 CFR Part 125.

import hashlib
import json
from datetime import datetime, timezone

def stamp_lineage(record: dict, source_system: str) -> dict:
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    idempotency_key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        **record,
        "source_system": source_system,
        "idempotency_key": idempotency_key,
        "validated_at": datetime.now(timezone.utc).isoformat(),
    }

5. Route the verdict — advance or dead-letter

The gate has exactly two outputs: a clean record forwarded toward matching, or a structured rejection routed to the dead-letter store. Never a silently coerced third state. The rejection carries the machine-readable reason, so remediation and alerting can classify the failure without re-parsing the payload.

from pydantic import ValidationError

def validate_and_route(raw: dict, open_dates: set, source_system: str) -> dict:
    try:
        record = TradeRecord(**raw)
        record.price, record.volume_mwh = coerce_money(raw["price"], raw["volume_mwh"])
        validate_business_rules(record, open_dates)
    except (ValidationError, ValueError) as exc:
        return {"status": "dead_letter", "reason": str(exc), "payload": raw}
    clean = stamp_lineage(record.model_dump(), source_system)
    return {"status": "accepted", "record": clean}

The vectorized transformation that consumes these accepted records — interval alignment, merge_asof matching, tolerance routing — is documented in Pandas for Trade Data Processing.

Edge Cases and Failure Modes

The happy path is trivial; the settlement breaks live entirely in the edge cases below, each needing explicit handling rather than a blanket try/except.

Daylight-saving boundaries. A day-ahead award timestamped 2026-03-08T02:30:00 in US Eastern is a nonexistent wall-clock time on the spring-forward date, and 2026-11-01T01:30:00 is ambiguous — it occurs twice. Naive parsing either raises or silently folds the record into the wrong interval, unmatching it. Normalize to UTC-aware timestamps at ingestion and reject the impossible; admit the fall-back hour as delivery_hour = 25.

from zoneinfo import ZoneInfo
from datetime import datetime

def normalize_delivery_ts(local_ts: str, iso_tz: str) -> datetime:
    naive = datetime.fromisoformat(local_ts)
    aware = naive.replace(tzinfo=ZoneInfo(iso_tz))
    # A nonexistent spring-forward time round-trips through UTC to a different local wall-clock time.
    if aware.astimezone(ZoneInfo("UTC")).astimezone(ZoneInfo(iso_tz)) != aware:
        raise ValueError("nonexistent_local_time_dst_gap")
    return aware.astimezone(ZoneInfo("UTC"))
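The round-trip check above rejects the spring-forward gap but does not flag the ambiguous fall-back hour. One possible way to tell the three cases apart, sketched here with a hypothetical helper `classify_local_time`, is to compare the UTC offsets that `zoneinfo` reports for `fold=0` and `fold=1`.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def classify_local_time(local_ts: str, iso_tz: str) -> str:
    """Return 'normal', 'ambiguous' (fall-back overlap) or 'nonexistent' (spring-forward gap)."""
    tz = ZoneInfo(iso_tz)
    naive = datetime.fromisoformat(local_ts)
    first = naive.replace(tzinfo=tz, fold=0).utcoffset()
    second = naive.replace(tzinfo=tz, fold=1).utcoffset()
    if first == second:
        return "normal"
    # In an overlap the first occurrence carries the larger (pre-transition) offset;
    # in a gap fold=0 reports the smaller pre-transition offset instead.
    return "ambiguous" if first > second else "nonexistent"

print(classify_local_time("2026-11-01T01:30:00", "America/New_York"))
print(classify_local_time("2026-03-08T02:30:00", "America/New_York"))
```

An ambiguous result is the cue to require an explicit offset or a delivery_hour of 25 from the source rather than guessing which occurrence was meant.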

Negative LMPs. Congestion and oversupply routinely drive locational marginal prices below zero. A validator that rejects negative prices as “invalid” silently drops legitimate curtailment settlements — the band check in step 2 deliberately admits negatives and bounds only the physically absurd. LMP itself decomposes as $LMP_n = \lambda + \mu_n + \nu_n$, the sum of energy, congestion, and loss components, any of which can push the nodal price negative.
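A short worked example of the decomposition, with made-up component values for a congested node, shows how a negative nodal price arises and still passes the plausibility band. The band limits repeat the defaults quoted in this document; the helper names are hypothetical.

```python
from decimal import Decimal

PRICE_MIN, PRICE_MAX = Decimal("-1000"), Decimal("100000")   # plausibility band, $/MWh

def nodal_lmp(energy: Decimal, congestion: Decimal, loss: Decimal) -> Decimal:
    """LMP_n = lambda + mu_n + nu_n, all in $/MWh."""
    return energy + congestion + loss

def price_in_band(price: Decimal) -> bool:
    return PRICE_MIN <= price <= PRICE_MAX

# Made-up components: positive energy price, strongly negative congestion, small loss term.
lmp = nodal_lmp(Decimal("18.50"), Decimal("-34.20"), Decimal("-0.75"))
print(lmp, price_in_band(lmp))   # -16.45 True
```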

Zero-volume intervals. A confirmation with volume_mwh = 0 is not garbage — it is frequently a legitimate curtailed or scheduled-but-undelivered interval. Preserve it and let matching decide; discarding zero-volume rows manufactures a completeness shortfall the reconciliation gate will then flag.

Schema drift. A counterparty adds a column or renames deliv_pt to delivery_point between cycles. The extra="forbid" config in step 1 catches the addition at ingress and quarantines the batch with a structured error, rather than letting a silently-null field corrupt reconciliation. Renames surface as a missing required field — an explicit, alertable failure.
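For alerting, it can help to describe the drift explicitly rather than rely on the exception text alone. The sketch below compares a payload's keys with the field names of the TradeRecord contract; `describe_drift` is a hypothetical helper, and the renamed `node` key is an invented example.

```python
EXPECTED_FIELDS = frozenset({
    "trade_id", "node_id", "delivery_period_start", "settlement_interval",
    "delivery_hour", "volume_mwh", "price", "transaction_code",
})

def describe_drift(payload: dict) -> dict:
    """Report which fields were added or went missing relative to the contract."""
    seen = set(payload)
    return {
        "unexpected_fields": sorted(seen - EXPECTED_FIELDS),
        "missing_fields": sorted(EXPECTED_FIELDS - seen),
    }

# A rename shows up as one unexpected field plus one missing field.
payload = {name: "x" for name in EXPECTED_FIELDS}
payload["node"] = payload.pop("node_id")
drift = describe_drift(payload)
```

Seeing an unexpected field and a missing field in the same report is a strong hint that the upstream change was a rename rather than two unrelated edits.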

Duplicate re-delivery. An SFTP poll overlaps its predecessor, or a REST retry re-sends a page. The idempotency key from step 4 makes the second copy a no-op; without it, a re-delivered confirmation double-counts and fabricates a phantom tolerance breach downstream.
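The no-op behaviour can be sketched as a lookup against previously seen keys. This example assumes an in-memory set as the key store and hypothetical helper names; a real deployment would need a durable store shared across workers.

```python
import hashlib
import json

def content_key(record: dict) -> str:
    """Hash a canonical serialization so key order does not change the key."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def accept_once(record: dict, seen_keys: set) -> bool:
    """Return True the first time a payload is seen, False on re-delivery."""
    key = content_key(record)
    if key in seen_keys:
        return False
    seen_keys.add(key)
    return True

seen: set = set()
first = accept_once({"trade_id": "T1", "price": "31.10"}, seen)
replay = accept_once({"price": "31.10", "trade_id": "T1"}, seen)    # same content, new key order
amended = accept_once({"trade_id": "T1", "price": "31.15"}, seen)   # changed content
```

Because the key is derived from content, an amended confirmation gets a new key and is processed, while a byte-for-byte or reordered replay is dropped.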

Mixed unit conventions. A gas leg quoted in MMBtu and an electric leg in MWh sharing one feed must be validated against product-specific unit rules; a validator that assumes one unit silently misprices the other. Bind the unit check to transaction_code and product class, not to a global default.
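A minimal sketch of a product-bound unit check follows. The product classes, the mapping, and the `check_unit` helper are hypothetical; a real feed would take them from the market's product reference data.

```python
# Hypothetical product classes and the unit each one settles in.
UNIT_BY_PRODUCT = {"power": "MWh", "gas": "MMBtu"}

def check_unit(product_class: str, unit: str) -> None:
    """Raise with a machine-readable reason when the unit does not fit the product."""
    expected = UNIT_BY_PRODUCT.get(product_class)
    if expected is None:
        raise ValueError(f"unknown_product_class:{product_class}")
    if unit != expected:
        raise ValueError(f"unit_mismatch:{product_class}_expects_{expected}_got_{unit}")

check_unit("power", "MWh")   # passes silently
```

There is deliberately no fallback unit: an unknown product class is rejected rather than assumed to be power.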

Threshold and Alerting Configuration

Every parameter that governs how strict the gate is — and when a human is paged — is configuration, not code, so it can be retuned per market without a redeploy.

Parameter	Typical default	Alert tier	Escalation
Dead-letter rate	< 0.5% of batch	Warning at 1%, Critical at 5%	halt run above Critical
Schema-drift rejections	0 per cycle	Critical on first	page ingestion owner
Validation latency budget	< 5% of ingest time	Warning at 8%	profile model cache
Price band `[min, max]`	`[-1000, 100000]` $/MWh	—	widen only with market sign-off
Completeness ratio $R$	$\geq 0.995$	Critical below gate	block settlement, route exceptions
Contract schema version	pinned per market	Critical on mismatch	freeze ingest, review upgrade

The completeness gate decides whether a run is settlement-eligible at all: with $N$ expected records for an interval and $M$ that clear validation, $R = \dfrac{M}{N}$ must meet or exceed the operator gate before the batch advances, and every shortfall routes to exceptions rather than being dropped. A sudden spike in the schema-drift counter is the highest-signal alert on this component, because it almost always means an upstream contract changed silently. The discipline for tuning these bands and routing escalations without a redeploy belongs to Threshold Tuning & Alerts.
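The gate arithmetic is small enough to sketch directly. The defaults below repeat the table values; the `GateConfig` class and function names are hypothetical, and the sketch treats a rate exactly at a threshold as having reached that tier.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class GateConfig:
    completeness_gate: Decimal = Decimal("0.995")
    dead_letter_warning: Decimal = Decimal("0.01")
    dead_letter_critical: Decimal = Decimal("0.05")

def completeness_ratio(expected: int, cleared: int) -> Decimal:
    """R = M / N for one interval."""
    if expected <= 0:
        raise ValueError("expected_record_count_must_be_positive")
    return Decimal(cleared) / Decimal(expected)

def settlement_eligible(expected: int, cleared: int, cfg: GateConfig) -> bool:
    return completeness_ratio(expected, cleared) >= cfg.completeness_gate

def dead_letter_tier(batch_size: int, dead_lettered: int, cfg: GateConfig) -> str:
    if batch_size <= 0:
        raise ValueError("batch_size_must_be_positive")
    rate = Decimal(dead_lettered) / Decimal(batch_size)
    if rate >= cfg.dead_letter_critical:
        return "critical"
    if rate >= cfg.dead_letter_warning:
        return "warning"
    return "ok"

cfg = GateConfig()
print(settlement_eligible(1000, 994, cfg), dead_letter_tier(1000, 10, cfg))
```

Keeping the thresholds in a frozen dataclass mirrors the configuration-not-code principle: a per-market override is a new `GateConfig` instance, not a code change.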

Testing and Reconciliation Verification

A validation gate that passes its own unit tests but silently coerces one field type under load is worse than no gate, so verification centers on conservation and on pinning every edge case.

Record conservation. Every payload must land in exactly one terminal state. Assert ingested == accepted + dead_lettered per batch; any drift means a worker swallowed an exception or a rejection was dropped.

def assert_no_record_loss(ingested: int, accepted: int, dead_lettered: int) -> None:
    assert ingested == accepted + dead_lettered, (
        f"record loss: {ingested} in, {accepted + dead_lettered} accounted for"
    )

Edge-case unit tests. Pin the failure modes above with explicit cases — the DST gap, the negative price, the zero-volume interval, the drifted schema, the duplicate — so a refactor cannot quietly regress them.

import pytest
from decimal import Decimal
from pydantic import ValidationError

# TradeRecord, coerce_money, and normalize_delivery_ts come from the module under test.

def test_dst_gap_is_rejected():
    with pytest.raises(ValueError, match="nonexistent_local_time_dst_gap"):
        normalize_delivery_ts("2026-03-08T02:30:00", "America/New_York")

def test_negative_lmp_is_admitted():
    price, volume = coerce_money("-12.44", "5.000")
    assert price == Decimal("-12.44") and volume == Decimal("5.000")

def test_unknown_field_is_schema_drift():
    # Assert the specific error so an unrelated failure cannot pass the test.
    with pytest.raises(ValidationError, match="extra_forbidden"):
        TradeRecord(trade_id="T1", node_id="PJM.WEST",
                    delivery_period_start="2026-03-08T05:00:00+00:00",
                    settlement_interval="0100", delivery_hour=1,
                    volume_mwh="5", price="31.10", transaction_code="BUY",
                    surprise_column="x")   # extra=forbid must reject

Shadow validation. Before promoting a contract change, run the new model in shadow against a live feed and diff its verdicts against the incumbent. A record the new schema rejects that the old one accepted is either a caught defect or a regression — the diff forces a human to decide which before the change ships. This is the same shadow-calculation discipline the settlement engines apply to pricing changes, and it is what makes a schema upgrade safe mid-cycle.
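The verdict diff can be sketched as a function that runs both validators over the same payloads and keeps only the disagreements. The two toy validators below are stand-ins invented for illustration; in practice each would wrap a full contract model.

```python
from typing import Callable, Iterable

Verdict = str  # "accepted" or "dead_letter"

def shadow_diff(
    payloads: Iterable[dict],
    incumbent: Callable[[dict], Verdict],
    candidate: Callable[[dict], Verdict],
) -> list[dict]:
    """Return every payload on which the two validators disagree."""
    disagreements = []
    for payload in payloads:
        old, new = incumbent(payload), candidate(payload)
        if old != new:
            disagreements.append({"payload": payload, "incumbent": old, "candidate": new})
    return disagreements

# Toy validators: the candidate admits the DST fall-back hour, the incumbent does not.
def incumbent_rule(p: dict) -> Verdict:
    return "accepted" if 1 <= p["delivery_hour"] <= 24 else "dead_letter"

def candidate_rule(p: dict) -> Verdict:
    return "accepted" if 1 <= p["delivery_hour"] <= 25 else "dead_letter"

feed = [{"delivery_hour": 1}, {"delivery_hour": 25}, {"delivery_hour": 30}]
diff = shadow_diff(feed, incumbent_rule, candidate_rule)
```

Only the hour-25 record appears in the diff, which is exactly the set a reviewer needs to classify as caught defect or regression before the change ships.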

Frequently Asked Questions

Why use Pydantic instead of plain JSON Schema for trade validation?

JSON Schema is excellent for structural and type checks but cannot express runtime coercion, cross-field business rules, or Decimal-precise financial parsing. Pydantic models encode all three in one contract, coerce and validate in a single call, and emit structured errors that route directly to the dead-letter store — which is what production energy ingestion needs.

Should a negative LMP be treated as a validation error?

No. Negative locational marginal prices are legitimate outcomes of congestion and oversupply, and rejecting them silently drops real curtailment settlements. Validation should admit negative prices and bound only physically impossible values, such as a price outside a wide plausibility band.

How do I stop schema drift from corrupting reconciliation?

Set the model to reject unknown fields (extra="forbid") so an added or renamed column fails loudly at ingress instead of arriving as a silent null. Pin a contract schema version per market and alert Critical on the first drift rejection, since it almost always signals an upstream contract change.

Why validate financial fields in Decimal rather than float?

IEEE-754 floats cannot represent most decimal fractions exactly, so a price like 31.10 is stored as roughly 31.100000000000001, and that error accumulates across thousands of summed intervals. Parsing straight from string into Decimal and quantizing to the settlement granularity keeps every financial figure exact and audit-reproducible.
