Parsing CSV vs XML trade feeds with pandas

A counterparty CSV that silently loses forty rows to an unquoted comma, and an ISO XML drop that returns an empty DataFrame because its namespace was never mapped, produce the same downstream symptom: a settlement statement that will not tie out and no record of what went missing. This page solves that specific failure mode — the un-audited transformation between a raw feed and a typed DataFrame — for both flat and hierarchical formats. It sits under Pandas for Trade Data Processing within the broader Trade Ingestion & Matching Workflows framework, and produces the single normalized frame that the matching layer expects: every source row either lands typed in the output or is quarantined with a logged reason, and the two counts reconcile exactly.

The flowchart below contrasts the two parse paths — the flat CSV route handling encoding and bad lines, and the hierarchical XML route resolving namespaces and flattening nested legs — converging on a single timezone-normalized, audit-ready DataFrame.

Prerequisites

Python 3.11+ with pandas>=2.1, lxml>=5.0, and chardet>=5.2. read_xml defaults to the lxml backend, which supports full XPath 1.0 expressions; the stdlib etree backend also accepts a namespaces mapping but handles only a limited XPath subset, so the examples on this page assume lxml.
Read access to the raw feed drop — an SFTP mount or object-store prefix where the ISO/RTO, counterparty, and clearinghouse files land. No transformation happens in place; the parser only reads.
A quarantine sink — a writable path or table for rows the parser rejects, so on_bad_lines events and schema violations survive as evidence rather than console noise.
The source schema baseline — the expected CSV header set and the XML element/namespace map for each feed, versioned alongside the ISO/RTO Data Format Standards your desk ingests, so schema drift trips an alert instead of a silent column shift.
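A header check against the baseline can be sketched in a few lines. This is a minimal illustration, not the desk's actual validator: the HeaderDrift class, the diff_header helper, and the baseline column list are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderDrift:
    """Result of comparing a feed header against the versioned baseline."""
    missing: list[str]      # baseline columns absent from the feed
    unexpected: list[str]   # feed columns the baseline does not know
    reordered: bool         # same names, different positions

    @property
    def has_drift(self) -> bool:
        return bool(self.missing or self.unexpected or self.reordered)


def diff_header(actual: list[str], baseline: list[str]) -> HeaderDrift:
    actual_set, baseline_set = set(actual), set(baseline)
    return HeaderDrift(
        missing=sorted(baseline_set - actual_set),
        unexpected=sorted(actual_set - baseline_set),
        # Identical names in a different order still break positional consumers
        reordered=(actual_set == baseline_set and actual != baseline),
    )


# Hypothetical baseline for one counterparty feed
BASELINE = ["trade_id", "contract_type", "settlement_price", "mw_quantity", "trade_ts"]
drift = diff_header(
    ["trade_id", "contract_type", "settle_px", "mw_quantity", "trade_ts"], BASELINE
)
```

Here a renamed price column surfaces as one missing and one unexpected name, which is the signal that should raise an alert before any row is parsed.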

CSV feed ingestion: precision over convenience

While pd.read_csv() appears straightforward, energy trade CSVs frequently violate RFC 4180. Common failure modes include UTF-8 BOM prefixes, inconsistent quoting around OTC contract descriptions, embedded newlines in narrative fields, and dynamic header rows that shift position across daily drops. Unhandled anomalies trigger silent data corruption or a catastrophic pipeline halt — both compromise settlement accuracy.
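The BOM case is easy to reproduce with the standard library alone. The sketch below assumes a feed that is either UTF-8 (with or without a BOM) or Windows-1252; the decode_feed_bytes helper and the candidate order are illustrative choices, not part of pandas or chardet.

```python
# A UTF-8 byte-order mark followed by a two-column header and one row
BOM_SAMPLE = b"\xef\xbb\xbftrade_id,price\nT1,10.50\n"


def decode_feed_bytes(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used), trying strict candidates in order."""
    # utf-8-sig strips a leading BOM when present and otherwise acts like utf-8
    for candidate in ("utf-8-sig", "cp1252"):
        try:
            return raw.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise ValueError("Feed bytes match none of the expected encodings")


# Decoding as plain utf-8 keeps the BOM glued to the first column name
plain_text = BOM_SAMPLE.decode("utf-8")
clean_text, used = decode_feed_bytes(BOM_SAMPLE)
```

With plain utf-8 the first header reads as "\ufefftrade_id", which is exactly the kind of invisible difference that makes a later column lookup fail.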

Handling encoding and malformed rows

Production parsers must declare parsing parameters explicitly and detect encoding safely. Relying on defaults risks misinterpreting exports that mix ASCII, UTF-8, and Windows-1252 across the same feed. Read everything as str first: no numeric coercion happens at read time, so a stray non-numeric character in a price column cannot corrupt the whole column’s dtype.

import logging
from pathlib import Path

import chardet
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("trade_ingest")

def load_trade_csv(feed_path: str, chunk_rows: int = 50_000) -> pd.DataFrame:
    csv_path = Path(feed_path)
    if not csv_path.exists():
        raise FileNotFoundError(f"Trade feed missing: {feed_path}")

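    # Sniff encoding from a header sample rather than the whole file
    with open(csv_path, "rb") as handle:
        encoding = chardet.detect(handle.read(10_000))["encoding"] or "utf-8"
    # A pure-ASCII sample says nothing about later bytes; widen to its superset
    # so the first accented character past the sample does not crash the read.
    if encoding.lower() == "ascii":
        encoding = "utf-8"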

    raw_frames: list[pd.DataFrame] = []
    for chunk in pd.read_csv(
        csv_path,
        encoding=encoding,
        sep=",",
        quotechar='"',
        skipinitialspace=True,
        on_bad_lines="warn",   # never "error" (halts) or "skip" (silent)
        dtype=str,             # defer ALL typing; read everything as text
        low_memory=False,
        chunksize=chunk_rows,
    ):
        # In production, route on_bad_lines warnings to the quarantine sink
        # so dropped rows are auditable rather than lost to the console.
        raw_frames.append(chunk)

    if not raw_frames:
        raise ValueError(f"No valid rows parsed from {feed_path}")

    trade_df = pd.concat(raw_frames, ignore_index=True)
    logger.info("Ingested %d rows from CSV feed %s", len(trade_df), csv_path.name)
    return trade_df
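
One way to capture rejected rows directly, rather than scraping warnings, is to pass a callable to on_bad_lines. This is a sketch under stated assumptions: the callable form requires engine="python", it is invoked only for rows with too many fields (rows with too few are padded with NaN), and the read_with_quarantine helper and sample data are hypothetical.

```python
import io

import pandas as pd


def read_with_quarantine(source) -> tuple[pd.DataFrame, list[list[str]]]:
    """Read a CSV as text and return (frame, rejected_rows)."""
    rejected: list[list[str]] = []

    def capture(bad_line: list[str]) -> None:
        # Keep the offending fields as evidence; returning None skips the row
        rejected.append(bad_line)
        return None

    frame = pd.read_csv(source, dtype=str, engine="python", on_bad_lines=capture)
    return frame, rejected


# The unquoted thousands separator in row T2 yields three fields, not two
sample = io.StringIO(
    "trade_id,settlement_price\n"
    "T1,10.50\n"
    "T2,1,050.25\n"
    "T3,12.00\n"
)
clean, rejected = read_with_quarantine(sample)
```

The rejected list can then be written to the quarantine sink with a reject reason, and its length feeds the row-count invariant alongside len(clean).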

For files larger than RAM, do not accumulate raw_frames — process and write each chunk out incrementally, keeping only the running row counts in memory. The chunked loop is also the natural place to apply per-chunk filtering before assembly, the same batching discipline used across Async Batch Processing Pipelines.
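The incremental pattern might look like the following sketch. The stream_chunks helper is hypothetical, and it writes to any file-like sink purely for illustration; a production version would apply the casting and quarantine steps to each chunk before writing.

```python
import io

import pandas as pd


def stream_chunks(source, sink, chunk_rows: int = 50_000) -> int:
    """Copy a CSV to `sink` chunk by chunk and return the number of rows seen."""
    total_rows = 0
    reader = pd.read_csv(source, dtype=str, chunksize=chunk_rows)
    for position, chunk in enumerate(reader):
        # Write the header once, then append data rows only
        chunk.to_csv(sink, index=False, header=(position == 0))
        total_rows += len(chunk)  # only the running count stays in memory
    return total_rows


source = io.StringIO(
    "trade_id,settlement_price\n"
    "T1,10.50\nT2,11.25\nT3,9.75\nT4,12.00\nT5,8.10\n"
)
sink = io.StringIO()
rows_seen = stream_chunks(source, sink, chunk_rows=2)
```

Because each chunk is released after it is written, peak memory is bounded by chunk_rows rather than by the size of the file.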

Deferred type casting with Decimal

Energy-specific fields like settlement_price, mw_quantity, and lmp_component require deferred casting. Premature float conversion introduces binary rounding error and collapses blank, placeholder, and malformed values into one indistinguishable NaN, hiding the null indicators settlement teams rely on for exception handling. Cast after load, isolate the coercion failures, and use Decimal, never binary float, for anything that enters a settlement calculation.

from decimal import Decimal, InvalidOperation

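def to_decimal(value: object) -> Decimal | None:
    # With dtype=str, empty CSV fields arrive as NaN (a float), not as None
    if value is None or pd.isna(value) or str(value).strip() == "":
        return None
    try:
        parsed = Decimal(str(value).replace(",", ""))
    except InvalidOperation:
        return None  # flagged below as a quarantine candidate
    # Decimal accepts "NaN" and "Infinity"; neither is a valid money value
    return parsed if parsed.is_finite() else None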

def cast_trade_types(trade_df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    money_cols = ["settlement_price", "mw_quantity", "lmp_component"]
    for col in money_cols:
        trade_df[col] = trade_df[col].map(to_decimal)

    # Any row where a required money field failed to parse is a break,
    # not a value — quarantine it with its reason instead of dropping it.
    parse_failed = trade_df[money_cols].isna().any(axis=1)
    quarantined = trade_df[parse_failed].assign(reject_reason="non_numeric_money_field")
    clean_df = trade_df[~parse_failed].copy()

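    # Normalize every timestamp to UTC in one pass, post-cast. errors="coerce"
    # turns unparseable values into NaT, so quarantine those rows as well
    # instead of letting them ride along in the clean frame.
    raw_ts = clean_df["trade_ts"]
    clean_df["trade_ts"] = pd.to_datetime(raw_ts, utc=True, errors="coerce")
    ts_failed = clean_df["trade_ts"].isna()
    bad_ts = clean_df[ts_failed].assign(
        trade_ts=raw_ts[ts_failed],  # keep the raw text as evidence
        reject_reason="unparseable_timestamp",
    )
    quarantined = pd.concat([quarantined, bad_ts])
    clean_df = clean_df[~ts_failed]
    logger.info("Cast %d clean rows; quarantined %d", len(clean_df), len(quarantined))
    return clean_df, quarantined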

This precision-first pass is what lets the downstream matching engine compare a counterparty price against an internal book value to the exact cent, and it hands the same typed contract to the Schema Validation Frameworks that gate the record before it reaches the ledger.
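The difference between binary float and Decimal shows up in arithmetic as small as the sketch below; the values are arbitrary illustrations, not figures from any feed.

```python
from decimal import Decimal

# Binary float cannot represent 0.1 or 0.2 exactly, so the sum drifts
float_total = 0.1 + 0.2
float_matches = float_total == 0.3          # False

# Decimal built from strings keeps the exact decimal digits
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_matches = decimal_total == Decimal("0.30")   # True

# Repeated accumulation makes the float error visible at settlement scale
float_sum = sum([0.1] * 10)                  # not exactly 1.0
decimal_sum = sum([Decimal("0.1")] * 10)     # exactly Decimal("1.0")
```

Note that Decimal must be constructed from the original string; Decimal(0.1) would faithfully copy the float's rounding error.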

XML feed ingestion: navigating hierarchical complexity

XML submissions from clearinghouses and market operators embed nested elements, repeating transaction blocks, and strict namespace declarations. Unlike CSV, XML preserves relational context but adds parsing overhead. Parsers that build the whole tree in memory struggle with multi-gigabyte day-ahead market files, and that includes pd.read_xml() in its default mode. It remains the convenient entry point for extracting structured trade records with lxml, but it does not validate against an XSD, so schema validation stays a separate step.
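For drops too large to hold as a tree, lxml's iterparse can stream one record at a time. The following is a minimal sketch: the Feed and TradeRecord element names, the urn:example:trades namespace, and the iter_trade_records helper are all hypothetical, and it assumes each record's children are simple text elements.

```python
import io

from lxml import etree

SAMPLE_XML = b"""<Feed xmlns="urn:example:trades">
  <TradeRecord><trade_id>T1</trade_id><settlement_price>10.50</settlement_price></TradeRecord>
  <TradeRecord><trade_id>T2</trade_id><settlement_price>11.25</settlement_price></TradeRecord>
</Feed>"""


def iter_trade_records(source, record_tag: str):
    """Yield one dict per record element without keeping the full tree."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=record_tag):
        # Strip the namespace from each child tag and keep the text as-is
        yield {etree.QName(child).localname: child.text for child in elem}
        # Free the processed element and any earlier siblings still attached
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]


rows = list(
    iter_trade_records(io.BytesIO(SAMPLE_XML), "{urn:example:trades}TradeRecord")
)
```

The yielded dicts can be batched into DataFrames of a fixed size, which gives the XML path the same bounded-memory behavior that chunksize gives the CSV path.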

Namespace resolution and XPath extraction

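Market XML feeds typically declare multiple namespaces (xmlns, xmlns:ns1, and so on). Failing to map them means the XPath matches nothing: a raw lxml query returns an empty list without complaint, and recent pandas versions raise a ValueError from read_xml saying the xpath returned no nodes. The silent variant is the most dangerous outcome, because zero rows looks like a quiet day. Extract the namespace map from the document itself before extraction, and bind a default (unprefixed) namespace to a temporary prefix that the XPath then uses.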

from lxml import etree

def parse_trade_xml(feed_path: str, record_xpath: str = ".//TradeRecord") -> pd.DataFrame:
    xml_path = Path(feed_path)
    logger.info("Parsing XML feed %s", xml_path.name)

    try:
        # Parse inside the try block so a malformed file is logged, not just raised
        tree = etree.parse(str(xml_path))
        declared = tree.getroot().nsmap
        # XPath has no notion of a default namespace, so bind the default (None)
        # entry to a temporary "doc" prefix instead of dropping it. The caller's
        # record_xpath must then use it, e.g. ".//doc:TradeRecord".
        ns_map = {prefix: uri for prefix, uri in declared.items() if prefix}
        if None in declared:
            ns_map.setdefault("doc", declared[None])

        xml_df = pd.read_xml(
            str(xml_path),
            xpath=record_xpath,
            namespaces=ns_map,
            parser="lxml",
            dtype=str,      # same deferred-typing discipline as the CSV path
        )
    except etree.XMLSyntaxError as exc:
        logger.error("XML syntax error in %s: %s", xml_path.name, exc)
        raise ValueError("Malformed XML feed; validate against the ISO XSD before retry") from exc
    except ValueError as exc:
        # Recent pandas versions raise ValueError when the XPath matches no nodes
        logger.error("read_xml rejected XPath %s (check the namespace map): %s", record_xpath, exc)
        raise

    if xml_df.empty:
        logger.warning("XPath %s matched zero records; check the namespace map", record_xpath)
    logger.info("Extracted %d trade records from XML hierarchy", len(xml_df))
    return xml_df
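
The default-namespace trap can be reproduced with lxml alone. In this sketch the urn:example:trades URI and the "doc" prefix are arbitrary placeholders; any prefix works as long as the XPath and the mapping agree.

```python
from lxml import etree

SAMPLE_XML = (
    b'<Feed xmlns="urn:example:trades">'
    b"<TradeRecord><trade_id>T1</trade_id></TradeRecord>"
    b"</Feed>"
)
root = etree.fromstring(SAMPLE_XML)

# Without a prefix the path looks for TradeRecord in *no* namespace: no match,
# and no error either.
unprefixed = root.xpath(".//TradeRecord")

# Bind the default namespace (key None in nsmap) to a temporary prefix
ns_map = {(prefix or "doc"): uri for prefix, uri in root.nsmap.items()}
prefixed = root.xpath(".//doc:TradeRecord", namespaces=ns_map)
trade_id = prefixed[0].findtext("doc:trade_id", namespaces=ns_map)
```

The same prefix-to-URI dict is the shape that read_xml's namespaces= argument expects, with the XPath written against the temporary prefix.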

Flattening nested trade legs

XML feeds nest pricing components, delivery intervals, and counterparty metadata inside parent <Transaction> blocks. Flattening requires selecting the child columns you need and forward-filling the parent identifiers so trade lineage survives the denormalization. Verify that trade_id and contract_type propagate to every child row before merging with position systems.

def flatten_trade_legs(xml_df: pd.DataFrame, lineage_cols=("trade_id", "contract_type")) -> pd.DataFrame:
    flat_df = xml_df.copy()
    # Parent identifiers appear only on the first leg of each transaction;
    # forward-fill so every interval row carries its trade lineage.
    for col in lineage_cols:
        flat_df[col] = flat_df[col].ffill()

    missing_lineage = flat_df[list(lineage_cols)].isna().any(axis=1).sum()
    if missing_lineage:
        raise ValueError(f"{missing_lineage} leg rows have no parent trade_id after ffill")
    return flat_df

Flattening early keeps the record denormalized and reconciliation-ready before it flows into the matching stage — the same shape the downstream Trade Ingestion & Matching Workflows join expects.
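An alternative to forward-filling is to carry the parent identifiers explicitly while walking the tree, so a leg can never inherit lineage from the wrong transaction. The sketch below assumes a hypothetical layout in which trade_id and contract_type are attributes of Transaction and each Leg holds simple text children; real feeds will differ.

```python
import pandas as pd
from lxml import etree

SAMPLE_XML = b"""<Feed>
  <Transaction trade_id="T1" contract_type="DA">
    <Leg><interval>HE01</interval><mw_quantity>25</mw_quantity></Leg>
    <Leg><interval>HE02</interval><mw_quantity>30</mw_quantity></Leg>
  </Transaction>
  <Transaction trade_id="T2" contract_type="RT">
    <Leg><interval>HE01</interval><mw_quantity>10</mw_quantity></Leg>
  </Transaction>
</Feed>"""


def legs_with_lineage(xml_bytes: bytes) -> pd.DataFrame:
    """Emit one row per leg, each stamped with its own parent's identifiers."""
    root = etree.fromstring(xml_bytes)
    rows = []
    for txn in root.iterfind(".//Transaction"):
        lineage = {
            "trade_id": txn.get("trade_id"),
            "contract_type": txn.get("contract_type"),
        }
        for leg in txn.iterfind("Leg"):
            leg_fields = {child.tag: child.text for child in leg}
            rows.append({**lineage, **leg_fields})
    return pd.DataFrame(rows)


legs_df = legs_with_lineage(SAMPLE_XML)
```

A transaction with a missing trade_id then shows up as None on its own legs only, which the lineage check can catch, instead of silently borrowing the previous transaction's identifier.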

Choosing between CSV and XML

The format decision is driven by source-system constraints, data volume, and reconciliation latency. CSV is optimal for high-frequency, flat settlement statements where row throughput dominates; XML wins for complex multi-leg derivatives and ISO market results where relational integrity and schema validation are non-negotiable.

Dimension	CSV feed	XML feed
Structure	Flat, one row per record	Hierarchical, nested legs per transaction
Typical source	Counterparty confirms, flat settlement statements	ISO/RTO market results, clearinghouse submissions
pandas entry point	`read_csv(..., chunksize=)`	`read_xml(..., xpath=, namespaces=)`
Dominant failure mode	Bad quoting / encoding → dropped or corrupt rows	Unmapped namespace → zero records matched
Memory profile	Streams cleanly via `chunksize`	Load-then-parse; use `iterparse` for multi-GB drops
Schema validation	Header baseline diff	XSD / namespace map validation
Relational fidelity	Lost — must be rejoined externally	Preserved in parent/child element tree

Verifying the parsed output

A parse is only trustworthy if you can prove nothing vanished. Enforce a row-count invariant across the whole ingest: the number of source records must equal the clean rows plus the quarantined rows.

$$n_{\text{source}} = n_{\text{clean}} + n_{\text{quarantined}}$$

def verify_ingest(source_rows: int, clean_df: pd.DataFrame, quarantined: pd.DataFrame) -> None:
    # Raise explicitly: assert statements are stripped when Python runs with -O
    if source_rows != len(clean_df) + len(quarantined):
        raise ValueError(
            f"Row conservation broken: {source_rows} source "
            f"!= {len(clean_df)} clean + {len(quarantined)} quarantined"
        )
    # Shape check: the matching layer expects a fixed column contract
    expected_cols = {"trade_id", "contract_type", "settlement_price", "mw_quantity", "trade_ts"}
    missing_cols = expected_cols - set(clean_df.columns)
    if missing_cols:
        raise ValueError(f"Column contract violated; missing {sorted(missing_cols)}")
    # Type check: money columns must be Decimal, not float
    for col in ("settlement_price", "mw_quantity"):
        if not clean_df[col].map(lambda x: isinstance(x, Decimal)).all():
            raise TypeError(f"{col} contains non-Decimal values")
    logger.info("Ingest verified: %d clean rows, contract intact", len(clean_df))
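
The source_rows figure has to come from somewhere other than the parser being audited. One option, sketched here with the standard csv module, counts logical records so that a quoted field with an embedded newline still counts once; count_source_records is a hypothetical helper, and it assumes the quoting in the feed is well-formed enough for csv.reader to follow.

```python
import csv
import io


def count_source_records(handle, has_header: bool = True) -> int:
    """Count logical CSV records, independent of pandas."""
    reader = csv.reader(handle)
    # csv.reader yields an empty list for blank lines; skip those
    records = sum(1 for row in reader if row)
    return max(records - 1, 0) if has_header else records


# Row T1 carries a newline inside a quoted narrative field
sample_text = 'trade_id,note\nT1,"line one\nline two"\nT2,plain\n'
record_count = count_source_records(io.StringIO(sample_text))

# A naive physical line count overstates the number of records
naive_count = len(sample_text.splitlines()) - 1
```

Passing record_count as source_rows means a row dropped by the pandas read shows up as a broken invariant rather than going unnoticed.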

Concretely, confirm three things before releasing the frame: clean_df.shape matches the expected (rows, columns) contract; a content hash of the frame, taken after sorting on a stable key and excluding the index so that row order and row labels cannot change it (for example pd.util.hash_pandas_object(sorted_df, index=False).sum()), is stable across a replay of the same input; and a reconciliation diff against the prior cycle’s row count for the same trade date shows no unexplained drop. If any check fails, halt and route to the quarantine sink rather than passing a partial frame downstream.
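The replay-stable hash can be sketched as below. The frame_fingerprint helper and its key_cols parameter are hypothetical; values are cast to text first so Decimal and timestamp columns hash by their printed form, and summing row hashes is a lightweight fingerprint rather than a cryptographic digest.

```python
import pandas as pd


def frame_fingerprint(frame: pd.DataFrame, key_cols: list[str]) -> int:
    """Order-insensitive fingerprint of a frame's contents."""
    canonical = (
        frame.astype(str)                       # hash the printed form of every value
        .sort_values(key_cols, kind="stable")   # canonical row order
        .reset_index(drop=True)
    )
    canonical = canonical[sorted(canonical.columns)]  # canonical column order
    # index=False keeps row labels out of the hash; only cell contents count
    row_hashes = pd.util.hash_pandas_object(canonical, index=False)
    return int(row_hashes.sum())


baseline = pd.DataFrame(
    {"trade_id": ["T1", "T2", "T3"], "settlement_price": ["10.50", "11.25", "9.75"]}
)
replayed = baseline.iloc[::-1]  # same rows, different arrival order
altered = baseline.assign(settlement_price=["10.50", "11.25", "9.76"])

baseline_print = frame_fingerprint(baseline, ["trade_id"])
replayed_print = frame_fingerprint(replayed, ["trade_id"])
altered_print = frame_fingerprint(altered, ["trade_id"])
```

Storing the fingerprint next to the row counts gives the audit trail a compact way to show that a replay produced the same frame.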

Compliance note

FERC and NERC standards require traceable data lineage for every settlement adjustment, so the parser — not just the calculation engine — is a compliance surface. Whichever format you ingest, enforce four controls:

Immutable audit trails. Log parsing warnings, quarantined rows, and every type-coercion event to a centralized compliance store keyed by feed name and trade date. The row-count invariant above is the evidence that ingest was complete.
Memory-efficient chunking. Process files exceeding available RAM with chunksize (CSV) or iterparse (XML); never load a multi-gigabyte market drop into a single frame without memory profiling.
Schema drift detection. Validate the CSV header set and the XML namespace/element map against the versioned baseline on every run. An unexpected column or namespace change must raise an alert, not shift a column silently.
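Timezone normalization. Energy trades span PT, CT, and ET; convert every timestamp to UTC immediately post-parse so an interval never misaligns during netting. pd.to_datetime(..., utc=True) converts offset-aware strings correctly but treats naive timestamps as if they were already UTC, so localize naive local times to the source market's zone before converting.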
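The timezone control deserves a concrete illustration, because naive local timestamps are the common case in counterparty files. The sketch below assumes the feed's zone is known from its schema baseline; the to_utc helper and the America/Chicago example are illustrative.

```python
import pandas as pd


def to_utc(local_ts: pd.Series, source_tz: str) -> pd.Series:
    """Localize naive local timestamps to their market zone, then convert to UTC."""
    naive = pd.to_datetime(local_ts)
    # Fail loudly on DST gaps and overlaps instead of guessing an offset
    localized = naive.dt.tz_localize(source_tz, ambiguous="raise", nonexistent="raise")
    return localized.dt.tz_convert("UTC")


local_strings = pd.Series(["2024-01-15 08:00:00"])

# Central Standard Time is UTC-6 in January, so 08:00 local is 14:00 UTC
chicago_utc = to_utc(local_strings, "America/Chicago")

# Pitfall: utc=True on a naive string labels it as UTC without shifting it
assumed_utc = pd.to_datetime(local_strings, utc=True)
```

The six-hour gap between the two results is the size of the misalignment that would otherwise reach netting for a Central-time feed.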

For parameter tuning, consult the official pandas.read_csv documentation and the pandas.read_xml documentation. Align XML schema expectations with the ISO 20022 XML Messaging Standards to stay forward-compatible with evolving market-operator specifications.

Frequently Asked Questions

Why does my read_xml call return an empty DataFrame?

Almost always an unmapped namespace. When the document declares xmlns namespaces, an XPath like .//TradeRecord matches nothing unless the path uses a prefix that the namespaces= dict maps to the right URI. Extract the map from tree.getroot().nsmap, bind the default None entry to a temporary prefix such as doc (XPath cannot address an unprefixed namespace), and query .//doc:TradeRecord. Recent pandas versions raise a ValueError when the XPath matches no nodes, while raw lxml queries return an empty list; either way, treat a zero-row result as an error to investigate, not a quiet trading day, and log it.

Should I set dtype at read time or cast afterward?

Cast afterward. Read every column as str first so a single non-numeric character in a price field cannot corrupt the whole column’s dtype or, worse, coerce silently to NaN. After load, map money columns through a Decimal converter, quarantine the rows that fail to parse with a reject reason, and only then normalize timestamps to UTC. Deferred casting keeps parse failures visible and auditable instead of hiding them inside pandas’ inference.

How do I keep rows from being dropped silently by on_bad_lines?

Never use on_bad_lines="skip" — it discards malformed rows with no trace. Use "warn" and route the warnings to a quarantine sink, then enforce the invariant n_source == n_clean + n_quarantined. If the counts disagree, the parser lost a row it never accounted for, and the frame must not be released to matching until the discrepancy is explained.
