AI-Generated Invoice Validator
Upload a PDF invoice; a vision LLM extracts its structure, a second synthesises a bespoke validator on the fly, and that code executes in a sandboxed runtime with no network access. Cached after two stable runs. Streamed live.
AI-Generated Invoice Validators, Running at the Edge
At a certain point in a senior technical leadership career, the calendar starts filling with things that were never quite in the job description. One of them is invoices, as actual verification work: cross-referencing line items against commercial agreements and checking rates before sign-off.
Automating this looks simple from the outside but is fiddly up close. Introducing a new system to track and automate the entire process is time-consuming and not budget-friendly. It would need to capture the structure of every vendor, since vendors invoice differently and aren’t aligned to the FinOps FOCUS spec. A rule engine that handles AWS or GCP does not handle a smaller vendor that generates a bespoke, spreadsheet-exported PDF.
This project builds and validates a different approach.
What was built
Upload a PDF invoice. A vision-capable language model extracts its structure. A second model call synthesises a bespoke validation module on the fly: one that understands this vendor’s specific categories, units, currency, and billing periods. That module is then executed inside a sandboxed runtime with no network access and no visibility into the surrounding infrastructure; structural isolation, not policy. Prompting a model not to perform dangerous operations is not a security boundary. A container with no network interface and no visibility into surrounding state is. The distinction matters when the code being executed is itself generated by a model.
The entire pipeline — extraction, code generation, sandboxed execution, report generation — streams back to the user in real time. Events arrive as each stage completes.
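One way to picture the per-stage streaming is a generator that yields an event as soon as each stage returns. This is a minimal sketch, not the project's actual code: the `StageEvent` type, the `run_pipeline` function, and the four stage callables are hypothetical names, and the stages are stubbed rather than calling real models or a real sandbox.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator


@dataclass
class StageEvent:
    stage: str    # name of the stage that just completed
    payload: Any  # that stage's output


def run_pipeline(
    pdf_bytes: bytes,
    extract: Callable[[bytes], dict],      # hypothetical: vision model call
    generate: Callable[[dict], str],       # hypothetical: code-generation model call
    execute: Callable[[str, dict], dict],  # hypothetical: sandboxed execution
    report: Callable[[dict], str],         # hypothetical: report builder
) -> Iterator[StageEvent]:
    """Yield one event per stage so the caller can forward it immediately."""
    structure = extract(pdf_bytes)
    yield StageEvent("extraction", structure)
    code = generate(structure)
    yield StageEvent("code_generation", code)
    result = execute(code, structure)
    yield StageEvent("sandboxed_execution", result)
    yield StageEvent("report_generation", report(result))


# Stubbed stages for illustration only.
for event in run_pipeline(
    b"%PDF-",
    extract=lambda pdf: {"vendor": "ExampleVendor", "line_items": []},
    generate=lambda structure: "def validate(invoice): return []",
    execute=lambda code, structure: {"errors": []},
    report=lambda result: "FAIL" if result["errors"] else "PASS",
):
    print(event.stage)  # a real server would push each event to the client here
```

Because the generator is lazy, nothing downstream of a stage runs until the consumer has received the previous event, which is what lets a front end show progress stage by stage.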
This is a proof of concept. What it demonstrates is a concrete pattern: a machine reads an unfamiliar document, reasons about its structure, writes code to validate that structure, and executes that code in isolation.
See it working
The demo below shows the same invoice processed twice. On the left, no rules applied; the invoice passes. On the right, a single rule is specified in plain text (“unit rate was agreed to 10$”); the generated validator catches that the actual rate of $164.99 does not match, and fails with a precise field-level error. Both runs stream their results live, and the right-hand run shows the validator being reused from cache rather than regenerated.
The screenshot below shows a single run in detail; the cached validator badge confirms it was reused rather than regenerated, and the activity stream shows the field-level error surfaced from the sandboxed execution.
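To make the agreed-rate rule concrete, a generated validator for it might look roughly like the sketch below. Everything here is an assumption for illustration: the `validate_unit_rates` function, the invoice dictionary layout, and the error fields are hypothetical, and only the 10 and 164.99 figures come from the demo described above.

```python
from decimal import Decimal


def validate_unit_rates(invoice: dict, agreed_rate: Decimal) -> list:
    """Return one field-level error per line item whose unit rate differs from the agreed rate."""
    errors = []
    for index, item in enumerate(invoice["line_items"]):
        # Decimal avoids float rounding surprises when comparing money.
        actual = Decimal(str(item["unit_rate"]))
        if actual != agreed_rate:
            errors.append({
                "field": f"line_items[{index}].unit_rate",
                "expected": str(agreed_rate),
                "actual": str(actual),
            })
    return errors


# Hypothetical extracted structure; the field names are assumptions.
invoice = {
    "currency": "USD",
    "line_items": [{"description": "Example item", "unit_rate": "164.99"}],
}
errors = validate_unit_rates(invoice, agreed_rate=Decimal("10"))
status = "FAIL" if errors else "PASS"
print(status, errors)
```

Returning the path of the offending field, rather than a bare pass or fail, is what allows the error to be reported at field level.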
The caching layer from a governance perspective
Vendor invoice structures are generally static; the line items are usually the only part that changes from period to period. The same vendor’s invoices can therefore reuse the generated validator across multiple runs, and when the structure does change, the validator is regenerated. A second independent run on a structurally identical invoice must produce consistent results before the validator is promoted. From the third run onward, the cached version is served directly.
This promotion gate means AI-generated code only reaches steady state after demonstrating stability. Each new vendor structure creates exactly one verification window, after which the system runs the same code it has already confirmed works.
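The promotion gate can be sketched as a small two-stage cache keyed by a fingerprint of the invoice structure. This is a minimal sketch under stated assumptions, not the project's implementation: `structure_fingerprint` and `PromotionGate` are hypothetical names, the fingerprint hashes field names only and ignores all values, and "consistent" is simplified to "the second run's outcome equals the first".

```python
import hashlib
import json


def structure_fingerprint(structure: dict) -> str:
    """Hash field names only, so changing values (rates, line counts) keep the same key."""
    def shape(node):
        if isinstance(node, dict):
            return {key: shape(value) for key, value in node.items()}
        if isinstance(node, list):
            return [shape(node[0])] if node else []  # assumes list items share one shape
        return "value"  # every leaf collapses to the same placeholder
    canonical = json.dumps(shape(structure), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


class PromotionGate:
    def __init__(self):
        self._candidates = {}  # fingerprint -> (code, outcome of the first run)
        self._promoted = {}    # fingerprint -> code confirmed by a second run

    def lookup(self, fingerprint):
        """Return promoted validator code, or None while the structure is unverified."""
        return self._promoted.get(fingerprint)

    def record(self, fingerprint, code, outcome):
        """Record an unpromoted run; promote when a second run matches the first."""
        previous = self._candidates.get(fingerprint)
        if previous is not None and previous[1] == outcome:
            self._promoted[fingerprint] = code
            del self._candidates[fingerprint]
            return True
        self._candidates[fingerprint] = (code, outcome)  # start or restart the window
        return False


gate = PromotionGate()
key = structure_fingerprint({"vendor": "ExampleVendor", "line_items": [{"unit_rate": "164.99"}]})
code = "def validate(invoice): return []"  # stand-in for generated code
print(gate.record(key, code, "PASS"))      # run 1: candidate only
print(gate.record(key, code, "PASS"))      # run 2: consistent, so promoted
print(gate.lookup(key) is not None)        # run 3 onward: served from cache
```

A structural change produces a different fingerprint, so the new structure misses the cache and opens its own verification window.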
The platform
This stack runs entirely on serverless infrastructure: an HTTPS gateway that supports WebSockets for real-time communication, pipeline orchestration via long-lived stateful compute, object storage, and a relational layer. Each generated validator runs in an isolated, sandboxed container that cannot make outbound network calls or access any surrounding state, so its security is structural.
What this was really about
This PoC is a proxy for a broader question: can AI actually fit into enterprise workflows that look automatable but are harder to automate with generic tooling?
Off-the-shelf AI assistants (Copilot, standalone Claude, ChatGPT) can read an invoice and summarise it. That helps, but it could be slower when what you need is a consistent, auditable validation layer across a portfolio of vendors with different commercial arrangements. Instead of prompting your way through each invoice, you can have per-vendor rule enforcement that persists across runs, compares against agreed rates, and degrades gracefully when the structure changes.
The question this project was probing is whether LLM output can be a component within a larger machine, where generated code, rather than conversational text, is the mechanism by which AI adapts to and supports enterprise workflows without a human rewriting rules every time. I think the answer is yes.