The Pre-Ingestion Problem: Why Most AI Compliance Tools Fail Before Analysis Begins

Most tools break on their input data, not their algorithms. Here is what it takes to solve the problem upstream.

The conversation about AI-assisted project controls compliance focuses almost entirely on the analysis layer: which algorithms detect which anomalies, how machine learning models are trained on historical data, what findings the system produces. This focus is understandable. The analysis layer is where the interesting work happens and where vendors compete.

It is also, in most production implementations, not where the systems fail.

The failure point that determines whether an AI compliance tool produces reliable findings or unreliable noise is pre-ingestion: the process of acquiring, parsing, normalizing, and validating the input data before any analysis occurs. On capital programs with real-world data environments, pre-ingestion is where most of the technical difficulty lives, yet most commercial tools treat it as a solved problem. It is not.

What the Data Actually Looks Like

The idealized version of the pre-ingestion problem assumes clean, consistently structured input data: a P6 schedule in standard XER format, a cost report in a defined schema, a contract document in accessible digital form. Given these inputs, the analysis layer can proceed on known foundations.

The actual data environment on a complex capital program looks different. The schedule may exist in multiple tool formats — a P6 XER exported by the prime contractor, an MS Project file maintained by a major subcontractor, an Asta Powerproject schedule from a European subconsultant. These files do not share a common activity coding structure. WBS codes that are consistent within each file may not align across files. Resource codes may overlap in ways that create false equivalences or missed connections.

The cost data may exist in a contractor's internal cost system, a government IPMR CPR Format 1 deliverable, a separate subcontract cost tracking spreadsheet, and a procurement system with its own data structure. These sources do not share a common work package identifier. The connection between a cost variance in the CPR and the corresponding schedule activity requires a mapping that must be constructed and validated before any cross-modal analysis can occur.

The contract document may be a PDF scan of a signed contract with handwritten amendments. The change order log may be a spreadsheet maintained by the contracts manager with inconsistent date formatting. The notice log — if it exists at all — may be a folder of email threads rather than a structured record.

This is the real data environment. An AI compliance tool that has not solved the pre-ingestion problem for this environment is a tool that will produce findings based on partial, inconsistently structured, or incorrect input data.

Parsing Is Not Trivial

The parsing of P6 XER files — the most common schedule data format in federal capital program environments — is more complex than the format's tabular structure suggests. XER files encode schedule data across dozens of linked tables: activities, relationships, resources, calendars, WBS elements, risk assignments, baseline data. The relationships between these tables must be correctly traversed to reconstruct the schedule network. Missing table entries, encoding inconsistencies, and tool-specific formatting variations produce parsing results that are structurally valid but semantically incorrect — a schedule that has been read without error but does not accurately represent the schedule that was submitted.
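
To make the "structurally valid but semantically incorrect" failure concrete, here is a minimal sketch of a cross-table check. It assumes the commonly seen XER text layout (tab-delimited lines tagged %T for a table name, %F for field names, %R for a row) and the TASK and TASKPRED tables with task_id and pred_task_id fields. The function names and the sample data are hypothetical, and a production parser would need to handle many more tables and edge cases than this.

```python
from collections import defaultdict

def parse_xer(text):
    """Parse XER-style text into {table_name: [row_dict, ...]}."""
    tables = defaultdict(list)
    table, fields = None, []
    for line in text.splitlines():
        parts = line.split("\t")
        tag = parts[0]
        if tag == "%T":
            table, fields = parts[1], []
            tables[table]  # register the table even if it has no rows
        elif tag == "%F":
            fields = parts[1:]
        elif tag == "%R" and table is not None:
            values = parts[1:]
            # Design choice for this sketch: fail loudly on a short or long row
            # instead of silently shifting values into the wrong columns.
            if len(values) != len(fields):
                raise ValueError(
                    f"{table}: expected {len(fields)} values, got {len(values)}"
                )
            tables[table].append(dict(zip(fields, values)))
    return dict(tables)

def dangling_relationships(tables):
    """Return TASKPRED rows that reference an activity missing from TASK."""
    task_ids = {row["task_id"] for row in tables.get("TASK", [])}
    return [
        row for row in tables.get("TASKPRED", [])
        if row["task_id"] not in task_ids or row["pred_task_id"] not in task_ids
    ]

# Hypothetical sample: the second relationship points at activity 999,
# which does not exist in TASK. The file still parses without error.
SAMPLE_XER = "\n".join([
    "\t".join(["%T", "TASK"]),
    "\t".join(["%F", "task_id", "task_code"]),
    "\t".join(["%R", "100", "A1000"]),
    "\t".join(["%R", "101", "A1010"]),
    "\t".join(["%T", "TASKPRED"]),
    "\t".join(["%F", "task_pred_id", "task_id", "pred_task_id", "pred_type", "lag_hr_cnt"]),
    "\t".join(["%R", "1", "101", "100", "PR_FS", "0"]),
    "\t".join(["%R", "2", "101", "999", "PR_FS", "8"]),
    "%E",
])

if __name__ == "__main__":
    for row in dangling_relationships(parse_xer(SAMPLE_XER)):
        print("dangling relationship:", row)
```

The point of the sketch is that the parse step succeeds on the sample; only the separate validation pass reveals that the reconstructed network is missing a predecessor.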

MS Project files introduce different parsing challenges. The XML-based MSPDI format and the binary MPP format each require different parsing approaches. Relationship types and lag values that are encoded differently between P6 and MS Project must be normalized to a common representation. Calendar definitions that drive date calculations must be preserved and applied correctly.
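
As an illustration of relationship normalization, the sketch below maps both encodings onto one common record. It rests on assumptions that should be verified against the vendor documentation: that P6 stores relationship types as PR_FS, PR_SS, PR_FF, and PR_SF with lag in hours (lag_hr_cnt), and that MSPDI stores the link type as an integer (0 = FF, 1 = FS, 2 = SF, 3 = SS) with LinkLag in tenths of a minute. The Link class and the two converter functions are hypothetical names, and the sketch ignores percentage lags, elapsed-time lags, and calendar effects.

```python
from dataclasses import dataclass

# Assumed encodings; confirm against the P6 and MSPDI schema documentation.
P6_TYPES = {"PR_FS": "FS", "PR_SS": "SS", "PR_FF": "FF", "PR_SF": "SF"}
MSPDI_TYPES = {0: "FF", 1: "FS", 2: "SF", 3: "SS"}

@dataclass(frozen=True)
class Link:
    """Tool-neutral relationship record (hypothetical common representation)."""
    pred: str
    succ: str
    kind: str         # one of FS, SS, FF, SF
    lag_hours: float

def from_p6(row):
    """Convert a TASKPRED-style row (dict of strings) to a Link."""
    return Link(
        pred=row["pred_task_id"],
        succ=row["task_id"],
        kind=P6_TYPES[row["pred_type"]],   # KeyError on an unknown type is intentional
        lag_hours=float(row["lag_hr_cnt"]),
    )

def from_mspdi(pred_uid, succ_uid, link_type, link_lag):
    """Convert MSPDI predecessor-link values to a Link."""
    return Link(
        pred=str(pred_uid),
        succ=str(succ_uid),
        kind=MSPDI_TYPES[int(link_type)],
        lag_hours=int(link_lag) / 600.0,   # 600 tenths of a minute per hour
    )

if __name__ == "__main__":
    p6 = from_p6({"task_id": "101", "pred_task_id": "100",
                  "pred_type": "PR_FS", "lag_hr_cnt": "8"})
    msp = from_mspdi(100, 101, 1, 4800)
    print(p6 == msp)
```

Raising on an unrecognized type code, instead of defaulting to finish-to-start, keeps a parsing gap from turning into a silently wrong network.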

A system that cannot parse all major schedule file formats accurately, with validation that catches parsing errors before they propagate into analysis, is a system that will produce incorrect findings on a predictable subset of real-world inputs.

The Normalization Challenge

Even correctly parsed data from multiple sources must be normalized before cross-modal analysis is possible. Normalization means establishing consistent identifiers, consistent units, consistent date formats, and consistent coding structures across data streams that were created and maintained independently.
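
Date formats are the simplest case of this problem. The following sketch normalizes free-form date strings, such as those in a hand-maintained change order log, to a single representation. The list of accepted formats is an assumption for illustration, and the sketch reads slash-separated dates as month/day/year; a real deployment would need to confirm the convention per source, because a value like 03/04/2024 is ambiguous.

```python
from datetime import date, datetime

# Assumed set of formats for illustration; order matters when formats overlap.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y")

def normalize_date(raw):
    """Return a date for a recognized format, or None if nothing matches."""
    text = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def normalize_log(rows):
    """Split raw date strings into normalized dates and rejects for review."""
    clean, rejects = [], []
    for raw in rows:
        parsed = normalize_date(raw)
        if parsed is None:
            rejects.append(raw)
        else:
            clean.append(parsed)
    return clean, rejects

if __name__ == "__main__":
    clean, rejects = normalize_log(
        ["2024-01-15", "01/15/2024", "15-Jan-2024", "mid January"]
    )
    print(clean, rejects)
```

Returning unparseable values as rejects, instead of guessing, leaves the judgment call with a person and keeps it visible.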

The activity ID that identifies a work package in the P6 schedule may or may not be the same identifier used in the cost system. If it is not — and on programs where the schedule and cost systems were implemented by different teams at different times, it frequently is not — the correspondence between schedule activities and cost elements must be established through a mapping process. That mapping must account for one-to-many and many-to-one relationships: a single cost element may cover multiple schedule activities, or a single schedule activity may draw costs from multiple cost elements.
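
One way to represent such a mapping is as a list of (activity, cost element) pairs indexed in both directions, which handles one-to-many and many-to-one cases uniformly. This is a minimal sketch; the identifiers and function names are hypothetical, and it says nothing about how the pairs are established in the first place.

```python
from collections import defaultdict

def build_crosswalk(pairs):
    """Index (activity_id, cost_id) pairs in both directions."""
    by_activity, by_cost = defaultdict(set), defaultdict(set)
    for activity_id, cost_id in pairs:
        by_activity[activity_id].add(cost_id)
        by_cost[cost_id].add(activity_id)
    # Return plain dicts so later lookups cannot create empty entries.
    return dict(by_activity), dict(by_cost)

def unmapped(ids, index):
    """Return identifiers with no counterpart in the crosswalk."""
    return sorted(i for i in ids if i not in index)

if __name__ == "__main__":
    pairs = [
        ("A1000", "CA-01"), ("A1010", "CA-01"),   # one cost element, two activities
        ("A1020", "CA-02"), ("A1020", "CA-03"),   # one activity, two cost elements
    ]
    by_activity, by_cost = build_crosswalk(pairs)
    print(sorted(by_cost["CA-01"]))
    print(unmapped(["A1000", "A1030"], by_activity))
```

The unmapped list matters as much as the mapping itself: an activity with no cost counterpart is a gap that should be resolved or documented before any cross-stream finding relies on it.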

WBS normalization is a related challenge. The WBS in the schedule, the WBS in the cost system, and the WBS in the contract may all differ, having evolved independently as the program was executed, as scope changes were incorporated, and as subcontracts were structured in ways that did not map cleanly to the original WBS. Normalizing these to a common WBS requires judgment calls about how to handle the gaps and overlaps, and those judgment calls affect the analysis results.

Version Control and Audit Trail

The pre-ingestion layer is also where the version control discipline that makes retroactive change detection possible must be established. Detecting whether baseline data has been modified between reporting periods requires that prior-period schedule versions have been ingested, parsed, and stored in a manner that preserves their exact historical state.

This sounds simple. In practice, it requires that the system receive and process each reporting period's schedule submission before that submission is superseded: that it ingest the January version of the schedule in January, not in March when someone provides a historical file that may or may not be the original. It requires that the ingested versions be stored in an immutable, versioned format that prevents later modification. And it requires that the version comparison logic correctly handle changes in activity IDs, WBS restructuring, and other legitimate structural changes that should not be flagged as retroactive modifications.
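
The storage requirement can be sketched as a store that fingerprints each submission at ingestion time and refuses to overwrite a period once it is recorded. This is an in-memory illustration only: the SnapshotStore class and its methods are hypothetical, and real immutability would depend on the underlying storage (for example write-once media or an append-only log), which the sketch does not provide.

```python
import hashlib
from datetime import datetime, timezone

class SnapshotStore:
    """Append-only record of schedule submissions keyed by reporting period."""

    def __init__(self):
        self._records = {}

    def ingest(self, period, payload):
        """Record the raw bytes for a period; reject a different file later."""
        digest = hashlib.sha256(payload).hexdigest()
        existing = self._records.get(period)
        if existing is not None:
            if existing["sha256"] != digest:
                raise ValueError(
                    f"period {period} already ingested with different content"
                )
            return existing  # re-ingesting the identical file is a no-op
        record = {
            "period": period,
            "sha256": digest,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "payload": payload,
        }
        self._records[period] = record
        return record

    def verify(self, period):
        """Recompute the hash to confirm the stored bytes are unchanged."""
        record = self._records[period]
        return hashlib.sha256(record["payload"]).hexdigest() == record["sha256"]

if __name__ == "__main__":
    store = SnapshotStore()
    record = store.ingest("2024-01", b"january schedule bytes")
    print(record["sha256"][:12], store.verify("2024-01"))
```

Recording the ingestion timestamp alongside the hash is what lets the system later show that the January file was received in January and has not changed since.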

A system without this version control discipline cannot produce defensible retroactive change findings. It can compare two files and identify differences. It cannot establish that those differences represent unauthorized modifications to locked historical data rather than legitimate structural evolution.
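
The difference between a raw file comparison and a defensible one can be illustrated with a comparison that takes a record of authorized activity ID changes into account. The sketch below is a simplification under stated assumptions: each version is reduced to a mapping of activity ID to baseline finish date, and the renames argument stands in for whatever documented change record the program maintains. All names and data are hypothetical.

```python
def compare_baselines(prior, current, renames=None):
    """Compare baseline values between two versions.

    prior, current: {activity_id: baseline_finish} for each version.
    renames: {old_id: new_id} for documented, authorized ID changes.
    Returns a list of (prior_id, status, prior_value, current_value).
    """
    renames = renames or {}
    findings = []
    for old_id, old_value in prior.items():
        new_id = renames.get(old_id, old_id)
        if new_id not in current:
            findings.append((old_id, "missing", old_value, None))
        elif current[new_id] != old_value:
            findings.append((old_id, "changed", old_value, current[new_id]))
    return findings

if __name__ == "__main__":
    prior = {"A1000": "2024-03-01", "A1010": "2024-04-15", "A1020": "2024-05-01"}
    current = {"A1000": "2024-03-01", "B1010": "2024-04-15", "A1020": "2024-06-01"}

    # Without the rename record, the re-coded activity looks like a deletion.
    print(compare_baselines(prior, current))
    # With it, only the genuinely altered baseline date remains.
    print(compare_baselines(prior, current, renames={"A1010": "B1010"}))
```

In the sample, the naive comparison reports two findings and the rename-aware comparison reports one, which is the distinction between a list of differences and a finding about modified baseline data.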

Why This Is a Competitive Differentiator

The pre-ingestion problem is technically unglamorous. It does not produce impressive demonstrations. It does not involve novel algorithms or publishable research. It requires painstaking engineering work on edge cases, encoding inconsistencies, and data quality problems that vary by client environment.

This is precisely why it is a competitive differentiator. A tool that handles the pre-ingestion layer correctly (one that parses all major formats accurately, normalizes across data streams reliably, and maintains version-controlled history) produces findings that are based on what the data actually says. A tool that does not handle it correctly produces findings that are based on whatever the parsing layer happened to produce, with error modes that are difficult to detect and difficult to explain.

The analysis layer gets the attention. The pre-ingestion layer determines whether that analysis is worth anything.

The Forensic Intelligence Engine's ingestion layer handles P6 XER, MS Project, and Asta Powerproject formats with full cross-table validation, normalizes across schedule and cost data streams, and maintains immutable version history that supports defensible retroactive change detection.