ISO/RTO Data Format Standards

A T+30 statement fails to reconcile by a few dollars per node, and the root cause is not a pricing error at all. A parser silently read ERCOT’s positional flat file with the wrong column offset after the market operator inserted a field, so the marginal loss component landed in the congestion column; because the two values were only swapped, the components still summed to the reported total and a naive sum check passed. That failure mode, schema drift corrupting a settlement component before any business logic runs, is exactly what disciplined format handling exists to prevent. Within the Core Architecture & Market Taxonomy for Energy Settlements framework, this layer is the ingestion boundary: it translates each market operator’s raw output, from settlement-grade telemetry and bid and award results to locational marginal prices (LMPs) and financial settlement statements, into deterministic, typed records before a single dollar is posted. Parsing these feeds is not routine data engineering; it is a compliance and financial-accuracy control, because every downstream number inherits whatever this layer got wrong.

The diagram below shows the normalization pipeline that collapses divergent regional formats into a single canonical settlement model through schema validation and temporal alignment.

Specification and Standards Reference

Format handling begins with the authoritative specification for each market, not with the sample file a vendor happened to send. Every ISO and RTO publishes a data dictionary or schema that is versioned against a tariff effective date, and the parser must be pinned to that version rather than inferring structure at runtime. The table below maps the delivery formats, governing specifications, and component naming conventions across the major North American markets — the naming column is where most silent misalignments originate, because the same economic quantity carries a different label in each schema.

Market	Primary format	Governing spec	Total price (or energy) / congestion / loss labels
PJM	Namespaced XML	Data Miner 2 dictionary, Manual 28	`total_lmp` / `congestion_price` / `marginal_loss_price`
ERCOT	Positional flat file / CSV	Nodal Protocols §4, §6	`LMP` / `LMPCongestion` / `LMPLoss` (derived)
CAISO	OASIS CSV	BPM for Settlements, NAESB WEQ-002	`MEC` / `MCC` / `MCL`
ISO-NE	CSV with file-level metadata	Market Rule 1, Manual M-28	`LMP` / `Congestion_Component` / `Loss_Component`
MISO	XML / CSV	BPM-005, BPM-004	`LMP` / `MCC` / `MLC`
NYISO	CSV	Settlement Data Exchange, Manual 12	`LBMP` / `Congestion` / `Losses`
SPP	CSV / EDI	Market Protocols §4	`LMP` / `MCC` / `MLC`

Two cross-cutting standards sit above the per-market dictionaries. NAESB WEQ-002 governs the OASIS interface through which most operators expose public market data, standardizing request templates and column headers for CAISO, MISO, and others. For bilateral and retail-adjacent flows, the ANSI X12 EDI transaction sets — 867 (meter usage) and 820 (payment remittance) — define the interchange envelope, and the same interval data may arrive both as an OASIS CSV and an EDI 867 that must agree to the interval. On the regulatory side, the data these feeds carry ultimately rolls up into the FERC Electric Quarterly Report (EQR), so the labels and units captured here must survive all the way to the filing without re-interpretation. The invariant that ties the naming table together is the LMP decomposition every market shares, even when the columns are spelled differently:

$$LMP_n = \lambda + \mu_n + \nu_n$$

where $\lambda$ is the system energy price, $\mu_n$ the congestion component at node $n$, and $\nu_n$ the marginal loss component. A canonical model stores these three quantities under one set of names regardless of the source schema, and validates that they sum to the reported nodal price before persistence. Getting the mapping right here is the prerequisite for the ISO-NE vs CAISO reporting schema differences reconciliation and for every calculation in Pricing Logic Implementation downstream.
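
As a quick worked instance of the identity, using the illustrative hub values that appear in the unit tests later on this page (test fixtures, not published prices):

$$LMP_n = \lambda + \mu_n + \nu_n = 28.80 + (-1.02) + 0.36 = 28.14$$

A negative congestion term is consistent with the identity; only the sum is constrained, not the sign of any component.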

Step-by-Step Ingestion Implementation

A production ingestion path runs the same five stages for every market, differing only in the format-specific reader at stage two. The sequence is deliberately ordered so that no record reaches the canonical model until it has been structurally validated and temporally anchored.

1. Pin the schema version to a tariff effective date

Never let the pipeline discover structure from the payload. Resolve the expected schema from a registry keyed by market and effective date, so that a mid-quarter format revision produces an explicit version mismatch rather than a silent column shift.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SchemaBinding:
    market: str
    version: str
    effective: date
    xsd_path: str | None      # populated for XML markets
    column_order: tuple[str, ...] | None  # populated for positional flat files

SCHEMA_REGISTRY: dict[str, list[SchemaBinding]] = {
    "ERCOT": [
        SchemaBinding("ERCOT", "NP6-905-v3", date(2026, 1, 1), None,
                      ("delivery_date", "hour_ending", "settlement_point",
                       "lmp", "lmp_congestion", "lmp_loss")),
    ],
}

def resolve_schema(market: str, operating_day: date) -> SchemaBinding:
    candidates = [b for b in SCHEMA_REGISTRY[market] if b.effective <= operating_day]
    if not candidates:
        raise LookupError(f"No pinned schema for {market} on {operating_day}")
    return max(candidates, key=lambda b: b.effective)
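
To see how effective-date resolution behaves around a mid-quarter revision, here is a minimal, self-contained sketch. `Binding`, `REGISTRY`, the version labels, and the dates are all hypothetical stand-ins for the registry above, chosen only to illustrate the lookup.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Binding:  # trimmed, hypothetical stand-in for SchemaBinding
    version: str
    effective: date

# Hypothetical version labels and effective dates, for illustration only.
REGISTRY = [Binding("v3", date(2026, 1, 1)), Binding("v4", date(2026, 5, 15))]

def resolve(operating_day: date) -> Binding:
    # Latest binding whose effective date is on or before the operating day.
    candidates = [b for b in REGISTRY if b.effective <= operating_day]
    if not candidates:
        raise LookupError(f"No pinned schema on {operating_day}")
    return max(candidates, key=lambda b: b.effective)

before = resolve(date(2026, 5, 14))   # day before the revision takes effect
after = resolve(date(2026, 5, 15))    # first day under the revised layout
```

Resettlements of older operating days keep resolving to the layout that was in force on that day, which is the point of keying the registry by date rather than by "latest".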

2. Validate at the boundary before any business logic

For XML markets such as PJM, validate against the pinned XSD with lxml; for positional feeds, assert the header matches the pinned column order exactly. Structured modeling with Schema Validation Frameworks — pydantic in particular — then coerces each field into a typed record and rejects anything that does not conform. This mirrors the approach in Implementing Pydantic for energy trade validation, applied to price statements rather than trades.

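from decimal import Decimal, InvalidOperation
from pydantic import BaseModel, field_validator

class SettlementPointPrice(BaseModel):
    settlement_point: str
    hour_ending: int
    lmp: Decimal
    lmp_congestion: Decimal
    lmp_loss: Decimal

    @field_validator("lmp", "lmp_congestion", "lmp_loss", mode="before")
    @classmethod
    def _to_decimal(cls, v: object) -> Decimal:
        # Parse straight to Decimal from the raw string, never via float,
        # or sub-cent representation error enters at the boundary.
        if isinstance(v, Decimal):
            return v   # already exact, e.g. when constructed in tests
        if not isinstance(v, str):
            raise ValueError(f"price must be str or Decimal, got {type(v).__name__}")
        try:
            return Decimal(v.strip())
        except InvalidOperation as exc:
            # Re-raise as ValueError so pydantic reports a validation error.
            raise ValueError(f"unparseable price {v!r}") from exc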

    @field_validator("hour_ending")
    @classmethod
    def _valid_he(cls, v: int) -> int:
        if not 1 <= v <= 25:   # 25 accommodates the fall-back operating day
            raise ValueError(f"hour_ending {v} outside 1..25")
        return v

def validate_flat_row(raw: dict[str, str], binding: SchemaBinding) -> SettlementPointPrice:
    if tuple(raw.keys()) != binding.column_order:
        raise ValueError(f"Column drift: {tuple(raw.keys())} != {binding.column_order}")
    return SettlementPointPrice(**raw)
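
For delimited feeds, the header assertion can happen while the file is being read, before any row is turned into a dict. The sketch below uses only the standard library `csv` module; `read_pinned_csv` and the sample payload are illustrative assumptions, not an operator's actual file layout.

```python
import csv
import io

PINNED = ("delivery_date", "hour_ending", "settlement_point",
          "lmp", "lmp_congestion", "lmp_loss")

def read_pinned_csv(text: str, expected: tuple[str, ...]) -> list[dict[str, str]]:
    """Return rows as dicts, refusing the file if the header deviates at all."""
    reader = csv.reader(io.StringIO(text))
    header = tuple(next(reader, ()))
    if header != expected:
        raise ValueError(f"Column drift: {header} != {expected}")
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != len(expected):
            raise ValueError(
                f"Line {line_no}: {len(fields)} fields, expected {len(expected)}")
        rows.append(dict(zip(expected, fields)))
    return rows

# Illustrative payload only.
sample = (
    "delivery_date,hour_ending,settlement_point,lmp,lmp_congestion,lmp_loss\n"
    "2026-06-01,3,HB_HOUSTON,28.14,-1.02,0.36\n"
)
rows = read_pinned_csv(sample, PINNED)
```

Values stay as strings here on purpose, so the typed model remains the only place where numeric parsing happens.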

3. Anchor every timestamp to UTC

ISO and RTO reports mix UTC, local market time, and daylight saving transitions, frequently omitting explicit offset metadata. Parse the reported operating day and hour-ending into a timezone-aware instant using the standard library’s zoneinfo, then store the UTC instant alongside the original local label for the audit trail. This is the same temporal contract enforced by Settlement Cycle Mapping, which consumes these anchored intervals.

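from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo

MARKET_TZ = {"ERCOT": ZoneInfo("America/Chicago"), "CAISO": ZoneInfo("America/Los_Angeles"),
             "PJM": ZoneInfo("America/New_York"), "ISO-NE": ZoneInfo("America/New_York")}

def anchor_to_utc(operating_day: date, hour_ending: int, market: str) -> datetime:
    tz = MARKET_TZ[market]
    # hour_ending N covers the interval [N-1, N); anchor on the interval start.
    # Assumes hour-endings are numbered sequentially from 1 (up to 25 on fall-back).
    # Convert local midnight to UTC first, then add the elapsed hours: adding a
    # timedelta to an aware local datetime is wall-clock arithmetic and would
    # skip the repeated fall-back hour.
    local_midnight = datetime(operating_day.year, operating_day.month,
                              operating_day.day, tzinfo=tz)
    return local_midnight.astimezone(timezone.utc) + timedelta(hours=hour_ending - 1)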
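
One subtlety is worth checking explicitly: in Python, adding a timedelta to an aware local datetime is wall-clock arithmetic, so elapsed hours are safest counted in UTC. A minimal sketch, assuming hour-endings are numbered sequentially 1..25 on the fall-back day (`interval_start_utc` and `CENTRAL` are illustrative names):

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo

CENTRAL = ZoneInfo("America/Chicago")

def interval_start_utc(operating_day: date, hour_ending: int, tz: ZoneInfo) -> datetime:
    """UTC start of hour-ending N, counting elapsed hours from local midnight."""
    local_midnight = datetime(operating_day.year, operating_day.month,
                              operating_day.day, tzinfo=tz)
    return local_midnight.astimezone(timezone.utc) + timedelta(hours=hour_ending - 1)

fall_back = date(2026, 11, 1)  # US clocks fall back on this date
starts = [interval_start_utc(fall_back, he, CENTRAL) for he in range(1, 26)]

# For contrast: wall-clock arithmetic on the local datetime jumps from the first
# 01:00 straight to 02:00 standard time and never visits the repeated hour.
wall_clock_he3 = (datetime(2026, 11, 1, tzinfo=CENTRAL)
                  + timedelta(hours=2)).astimezone(timezone.utc)
```

With the UTC approach the 25 starts are strictly one hour apart; the wall-clock result for the third hour lands one hour late.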

4. Map into the canonical component model

With clean, typed, UTC-anchored rows, project each source-specific record onto the shared energy/congestion/loss model, translating labels via the naming table. All arithmetic stays in Decimal.

from decimal import Decimal, getcontext, ROUND_HALF_EVEN

getcontext().rounding = ROUND_HALF_EVEN

CANONICAL_MAP = {
    "ERCOT": {"energy": "lmp_energy_derived", "congestion": "lmp_congestion", "loss": "lmp_loss"},
    "CAISO": {"energy": "MEC", "congestion": "MCC", "loss": "MCL"},
}

def to_canonical(row: SettlementPointPrice, market: str) -> dict[str, str | Decimal]:
    # ERCOT does not publish a standalone energy component; derive it so the
    # decomposition is complete: energy = lmp - congestion - loss.
    energy = (row.lmp - row.lmp_congestion - row.lmp_loss)
    return {
        "node_id": row.settlement_point,
        "energy_price": energy.quantize(Decimal("0.01")),
        "congestion_price": row.lmp_congestion.quantize(Decimal("0.01")),
        "loss_price": row.lmp_loss.quantize(Decimal("0.01")),
        "total_lmp": row.lmp.quantize(Decimal("0.01")),
    }
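
For a market that publishes all three components under its own labels, the projection can be driven by the label map instead of being hard-coded. The following is a sketch under stated assumptions: `LABELS`, `project`, and the sample values are hypothetical, and the total is recomputed from the components here rather than read from a published column.

```python
from decimal import Decimal

# Label map in the spirit of the naming table; CAISO labels as listed there.
LABELS = {"CAISO": {"energy": "MEC", "congestion": "MCC", "loss": "MCL"}}
CENT = Decimal("0.01")

def project(raw: dict[str, str], market: str) -> dict[str, Decimal]:
    """Translate source-specific labels into the canonical component names."""
    labels = LABELS[market]
    energy = Decimal(raw[labels["energy"]])
    congestion = Decimal(raw[labels["congestion"]])
    loss = Decimal(raw[labels["loss"]])
    return {
        "energy_price": energy.quantize(CENT),
        "congestion_price": congestion.quantize(CENT),
        "loss_price": loss.quantize(CENT),
        # Assumption: total derived from the parts, not taken from the feed.
        "total_lmp": (energy + congestion + loss).quantize(CENT),
    }

canonical = project({"MEC": "31.20", "MCC": "-2.15", "MCL": "0.45"}, "CAISO")
```

Adding a market then means adding one entry to the map, and an unmapped market fails loudly with a `KeyError` instead of falling through to another market's labels.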

5. Persist with an audit hash

Write the canonical record together with a content hash of the raw source bytes and the resolved schema version, so any later dispute can be replayed against the exact input that produced it — a requirement that flows directly from the Security & Access Boundaries applied across settlement environments.

import hashlib

def audit_envelope(canonical: dict, raw_bytes: bytes, binding: SchemaBinding) -> dict:
    return {
        **canonical,
        "schema_version": binding.version,
        "source_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "ingested_at": datetime.now(ZoneInfo("UTC")).isoformat(),
    }
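
Replaying a dispute starts by proving that the archived bytes are the ones the record was built from. A minimal sketch of that check; `source_matches` is a hypothetical helper and the sample bytes are illustrative:

```python
import hashlib

def source_matches(envelope: dict, raw_bytes: bytes) -> bool:
    """True when raw_bytes are the exact input the envelope was built from."""
    return hashlib.sha256(raw_bytes).hexdigest() == envelope["source_sha256"]

raw = b"2026-06-01,3,HB_HOUSTON,28.14,-1.02,0.36\n"
envelope = {"schema_version": "NP6-905-v3",
            "source_sha256": hashlib.sha256(raw).hexdigest()}

intact = source_matches(envelope, raw)
tampered = source_matches(envelope, raw.replace(b"28.14", b"28.41"))
```

Hashing the raw bytes rather than the parsed values means even a whitespace or encoding change in the archived file is detected.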

Edge Cases and Failure Modes

The five stages above handle well-formed data; the reconciliation-breaking failures live in the edge cases, and each one needs explicit handling code rather than a defensive try/except that swallows the problem.

Negative LMPs and negative congestion. Prices routinely go negative when renewable oversupply or transmission constraints invert the economics, so a validator that rejects or floors negative values will corrupt the settlement. The component check must be sign-aware: verify the decomposition sums, not that any component is positive.

def assert_decomposition(c: dict[str, Decimal], tol: Decimal = Decimal("0.01")) -> None:
    recomposed = c["energy_price"] + c["congestion_price"] + c["loss_price"]
    if abs(recomposed - c["total_lmp"]) > tol:
        raise ValueError(
            f"Decomposition break at {c['node_id']}: "
            f"{recomposed} != {c['total_lmp']} (delta {recomposed - c['total_lmp']})")

DST boundary days. The spring-forward operating day has 23 hours and the fall-back day has 25; hour-numbered statements surface this as a missing hour-ending or a duplicated one. zoneinfo does not disambiguate the repeated fall-back hour on its own: adding a timedelta to an aware local datetime is wall-clock arithmetic, so the hour offset has to be applied after local midnight is converted to UTC (or the fold attribute set explicitly) for the two fall-back hours to land on distinct UTC instants. The interval count per day must likewise never be hard-coded to 24. Assert the expected count from the calendar, not from a constant.

def expected_intervals(operating_day: date, market: str) -> int:
    tz = MARKET_TZ[market]
    start = datetime(operating_day.year, operating_day.month, operating_day.day, tzinfo=tz)
    nxt = start + timedelta(days=1)
    hours = round((nxt.astimezone(ZoneInfo("UTC")) - start.astimezone(ZoneInfo("UTC")))
                  / timedelta(hours=1))
    return hours   # 23 on spring-forward, 25 on fall-back, else 24

Zero-volume and null-price intervals. A settlement point with no scheduled quantity may emit a zero-volume row or omit the interval entirely. Treat an omitted interval and a genuine zero as different states: the former is a gap to flag, the latter a valid record. Reconcile the received interval set against expected_intervals and route gaps to the exception queue rather than forward-filling them.
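
The gap check itself is a set difference. A minimal sketch, assuming hour-endings are numbered 1..N and that the expected count comes from a calendar-derived function such as expected_intervals (`missing_hour_endings` is an illustrative name):

```python
def missing_hour_endings(received: set[int], expected_count: int) -> list[int]:
    """Hour-endings in 1..expected_count that never arrived."""
    return sorted(set(range(1, expected_count + 1)) - received)

# A genuine zero-volume row still appears in `received`; an omitted one does not.
received = {he for he in range(1, 25) if he != 7}   # HE7 never arrived
gaps = missing_hour_endings(received, expected_count=24)
```

The returned hour-endings are what gets routed to the exception queue; nothing is synthesized to fill them.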

Stale telemetry. Real-time feeds occasionally republish the previous interval’s values under a new timestamp during a market operator outage. Detect this by comparing the source publish timestamp against the operating interval and rejecting records whose data age exceeds a market-specific staleness bound, escalating through the tiers defined in Threshold Tuning & Alerts.
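
A staleness check can be sketched as a timestamp comparison. The definition of data age used here is an assumption: how far the source publish timestamp lags behind the end of the interval the record claims to describe. `is_stale` and the sample times are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def is_stale(published_at: datetime, interval_end: datetime, bound_seconds: int) -> bool:
    """True when the publish timestamp lags the interval end by more than the bound."""
    # Assumption: a fresh record is published close to (or after) its interval end.
    age = interval_end - published_at
    return age > timedelta(seconds=bound_seconds)

interval_end = datetime(2026, 6, 1, 15, 15, tzinfo=timezone.utc)
republished = datetime(2026, 6, 1, 14, 55, tzinfo=timezone.utc)  # 20 minutes old
fresh = datetime(2026, 6, 1, 15, 14, tzinfo=timezone.utc)

stale_flag = is_stale(republished, interval_end, bound_seconds=900)
fresh_flag = is_stale(fresh, interval_end, bound_seconds=900)
```

Both timestamps must be timezone-aware UTC instants for the subtraction to be meaningful, which is one more reason to anchor before any other check runs.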

Schema drift. The failure that opened this page. Because stage one pins the schema and stage two asserts the header, an inserted or reordered column raises a Column drift error at the boundary instead of a mis-mapped component thirty days later. When drift is detected, quarantine the batch and require a human to register a new SchemaBinding before ingestion resumes.
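
When a batch is quarantined, the reviewer needs to know what changed, not just that something did. A small sketch of a header diff; `describe_drift` and the `dst_flag` column are hypothetical names used only for illustration:

```python
def describe_drift(expected: tuple[str, ...], actual: tuple[str, ...]) -> dict[str, object]:
    """Summarise how an incoming header differs from the pinned column order."""
    added = [c for c in actual if c not in expected]
    removed = [c for c in expected if c not in actual]
    # Compare the relative order of the columns both headers share.
    shared_expected = [c for c in expected if c in actual]
    shared_actual = [c for c in actual if c in expected]
    return {"added": added, "removed": removed,
            "reordered": shared_expected != shared_actual}

pinned = ("delivery_date", "hour_ending", "settlement_point",
          "lmp", "lmp_congestion", "lmp_loss")
incoming = ("delivery_date", "hour_ending", "dst_flag", "settlement_point",
            "lmp", "lmp_congestion", "lmp_loss")
report = describe_drift(pinned, incoming)
```

Attaching this report to the quarantine ticket gives whoever registers the new binding a starting point for the revised column order.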

Threshold and Alerting Configuration

Format-handling failures are noisy in aggregate, so the boundary needs tiered thresholds rather than a single pass/fail switch. The parameters below are the ones worth externalizing into configuration so operations can tune them per market without a code change.

Parameter	Default	Tier on breach	Escalation
`decomposition_tolerance`	$0.01/MWh	Warn	Log to variance dashboard
`column_drift`	any mismatch	Critical	Quarantine batch, page on-call
`staleness_seconds` (RT)	900	Major	Switch to fallback feed
`missing_interval_pct`	> 2% of expected	Major	Open exception ticket
`negative_lmp_rate`	> 40% of nodes	Info	Annotate, do not block

The distinction that matters is between a data-quality signal and a data-integrity fault. A high negative-LMP rate is normal market behavior and should never block a run; a column drift is an integrity fault that must halt persistence. Wiring these tiers into the same escalation router used by the rest of the reconciliation stack keeps format alerts from becoming a separate, ignored channel.

from enum import Enum

class Tier(str, Enum):
    INFO = "info"
    WARN = "warn"
    MAJOR = "major"
    CRITICAL = "critical"

def classify(event: str) -> Tier:
    return {
        "column_drift": Tier.CRITICAL,
        "stale_telemetry": Tier.MAJOR,
        "missing_intervals": Tier.MAJOR,
        "decomposition_break": Tier.WARN,
        "negative_lmp": Tier.INFO,
    }.get(event, Tier.WARN)
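
One way to externalize the table is a frozen dataclass of defaults with per-market overrides. This is a sketch, not a prescribed layout: `BoundaryThresholds`, `OVERRIDES`, and the 600-second ERCOT figure are hypothetical, while the default values mirror the table above.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class BoundaryThresholds:
    decomposition_tolerance: Decimal = Decimal("0.01")  # $/MWh
    staleness_seconds: int = 900
    missing_interval_pct: Decimal = Decimal("2")
    negative_lmp_rate: Decimal = Decimal("40")

DEFAULTS = BoundaryThresholds()
# Hypothetical per-market override; the figure is illustrative only.
OVERRIDES = {"ERCOT": BoundaryThresholds(staleness_seconds=600)}

def thresholds_for(market: str) -> BoundaryThresholds:
    return OVERRIDES.get(market, DEFAULTS)

def missing_intervals_breached(missing: int, expected: int, t: BoundaryThresholds) -> bool:
    """True when the share of missing intervals exceeds the configured percentage."""
    return Decimal(missing) * 100 / Decimal(expected) > t.missing_interval_pct

one_hour_missing = missing_intervals_breached(1, 24, thresholds_for("PJM"))
```

Note that with hourly data a single missing interval is already about 4.2% of the day, so the 2% default effectively means "any gap" for hourly feeds and only becomes a real tolerance at finer granularity.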

Testing and Reconciliation Verification

Verify the ingestion path with a shadow calculation: re-derive the canonical components from an independently sourced copy of the same interval — for example the OASIS CSV against the EDI 867, or the public Data Miner export against the settlement statement — and diff the two. Any node where the shadow and primary decompositions disagree beyond decomposition_tolerance is a parser defect, not a market event.
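
The shadow comparison can be sketched as a keyed, per-component diff. The key shape, `COMPONENTS`, and `shadow_diff` are assumptions made for illustration; keys present in only one source are a coverage gap and would be handled separately.

```python
from decimal import Decimal

Key = tuple[str, str]  # (node_id, UTC interval start as an ISO string)
COMPONENTS = ("energy_price", "congestion_price", "loss_price", "total_lmp")

def shadow_diff(primary: dict[Key, dict[str, Decimal]],
                shadow: dict[Key, dict[str, Decimal]],
                tol: Decimal = Decimal("0.01")) -> list[tuple[Key, str, Decimal]]:
    """List (key, component, delta) wherever the two sources disagree beyond tol."""
    breaks = []
    for key in sorted(primary.keys() & shadow.keys()):
        for name in COMPONENTS:
            delta = primary[key][name] - shadow[key][name]
            if abs(delta) > tol:
                breaks.append((key, name, delta))
    return breaks

key = ("HB_HOUSTON", "2026-06-01T07:00:00+00:00")
# Primary has congestion and loss swapped; its components still sum to the total.
primary = {key: {"energy_price": Decimal("28.80"), "congestion_price": Decimal("0.36"),
                 "loss_price": Decimal("-1.02"), "total_lmp": Decimal("28.14")}}
shadow = {key: {"energy_price": Decimal("28.80"), "congestion_price": Decimal("-1.02"),
                "loss_price": Decimal("0.36"), "total_lmp": Decimal("28.14")}}
breaks = shadow_diff(primary, shadow)
```

The example is the swapped-column case from the top of the page: a sum check passes on both records, and only the component-level comparison exposes the defect.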

Unit tests should target the edge cases directly, because they are the ones that never appear in a happy-path sample:

from datetime import date
from decimal import Decimal

def test_ercot_energy_derivation():
    row = SettlementPointPrice(settlement_point="HB_HOUSTON", hour_ending=3,
                               lmp=Decimal("28.14"), lmp_congestion=Decimal("-1.02"),
                               lmp_loss=Decimal("0.36"))
    c = to_canonical(row, "ERCOT")
    assert c["energy_price"] == Decimal("28.80")   # 28.14 - (-1.02) - 0.36
    assert_decomposition(c)                          # must not raise

def test_fall_back_day_has_25_intervals():
    assert expected_intervals(date(2026, 11, 1), "ERCOT") == 25

def test_column_drift_is_rejected():
    binding = resolve_schema("ERCOT", date(2026, 6, 1))
    bad = {"delivery_date": "2026-06-01", "settlement_point": "X"}  # columns missing
    try:
        validate_flat_row(bad, binding)
    except ValueError:
        return   # drift was caught at the boundary
    raise AssertionError("column drift should have raised ValueError")

def test_negative_congestion_survives_validation():
    c = {"node_id": "N1", "energy_price": Decimal("30.00"),
         "congestion_price": Decimal("-4.50"), "loss_price": Decimal("0.50"),
         "total_lmp": Decimal("26.00")}
    assert_decomposition(c)   # negative components are valid; must not raise

For a full-portfolio check, aggregate the canonical output by operating day and settlement point and reconcile the interval-weighted total against the operator’s own statement summary line. A clean run means every node’s decomposition sums within tolerance, the interval count matches the calendar-derived expectation, and no batch remains in quarantine. That result is what makes the normalized feed a trustworthy input for ETRM System Architecture and the position and settlement modules that depend on it.
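
A sketch of that aggregation, assuming each canonical row also carries a settled quantity. The `mwh` and `operating_day` fields, the shape of `statement`, and both function names are hypothetical, since the statement summary format varies by operator.

```python
from collections import defaultdict
from decimal import Decimal

DayNode = tuple[str, str]  # (operating_day, node_id)

def daily_totals(rows: list[dict]) -> dict[DayNode, Decimal]:
    """Sum quantity times price per operating day and settlement point."""
    totals: dict[DayNode, Decimal] = defaultdict(Decimal)
    for r in rows:
        totals[(r["operating_day"], r["node_id"])] += r["mwh"] * r["total_lmp"]
    return dict(totals)

def variances(totals: dict[DayNode, Decimal], statement: dict[DayNode, Decimal],
              tol: Decimal = Decimal("0.01")) -> dict[DayNode, Decimal]:
    """Computed minus stated amount, for every statement line outside tolerance."""
    out = {}
    for key, stated in statement.items():
        delta = totals.get(key, Decimal("0")) - stated
        if abs(delta) > tol:
            out[key] = delta
    return out

rows = [
    {"operating_day": "2026-06-01", "node_id": "HB_HOUSTON",
     "mwh": Decimal("10"), "total_lmp": Decimal("28.14")},
    {"operating_day": "2026-06-01", "node_id": "HB_HOUSTON",
     "mwh": Decimal("5"), "total_lmp": Decimal("30.00")},
]
totals = daily_totals(rows)
clean = variances(totals, {("2026-06-01", "HB_HOUSTON"): Decimal("431.40")})
```

An empty result from `variances` is the clean-run condition for this check; any entry names the day and node to investigate along with the signed difference.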

Frequently Asked Questions

Why does my ERCOT LMP not include an energy component column?

ERCOT publishes the total settlement-point price, the congestion component, and the loss component, but not a standalone system energy price. Derive it as energy = lmp - congestion - loss so the three-part decomposition is complete, then validate that the parts sum back to the reported total. Storing only the raw columns leaves every downstream calculation to re-derive the energy term inconsistently.

How should I handle the duplicated hour on a fall-back day?

Do not deduplicate by local hour label: the two 1 a.m. hours are genuinely different intervals. Convert local midnight of the operating day to UTC and add the elapsed hours there, or set fold explicitly on the local timestamp; either way the repeat resolves to two distinct UTC instants. Assert 25 intervals for that operating day from the calendar rather than assuming 24, and the fall-back hour reconciles correctly.

Why must format parsing use Python’s decimal module instead of float?

Binary floating point cannot represent most decimal fractions exactly, so parsing "0.36" through float stores the nearest binary approximation rather than 0.36. Those errors carry through every multiplication and sum, and across millions of intervals they can surface as cent-level differences between the ledger and the operator's statement that an audit cannot explain. Parse the raw string straight into Decimal at the validation boundary and keep every component and total in Decimal with explicit ROUND_HALF_EVEN quantization.

What is the fastest way to detect that a market operator changed its file format?

Pin the expected schema — the XSD for XML markets, the exact column order for positional feeds — to a tariff effective date and assert the incoming header against it before parsing. A reordered or inserted column then raises a drift error at the boundary and quarantines the batch, instead of silently mapping a component into the wrong column and surfacing weeks later as an unexplained settlement variance.
