Guest Column | May 18, 2026

A Multi-Agent Audit Intelligence Framework For CDMO Quality Oversight

A conversation between Gourav Pandey, Takeda, and Life Science Connect's Jon O'Connell


Artificial intelligence agents working in tandem can provide powerful insights for biopharmaceutical compliance activities, but only with a tightly controlled level of autonomy, and, importantly, no ability to infiltrate GMP environments.

Gourav Pandey, a quality lead at Takeda, built a decision support system that relies on multiple function-specific agents, managed by an agentic overseer, to assist in CDMO audit preparation. The network deftly acknowledges AI's shortcomings and relies on highly traceable source material.

Pandey will discuss his project and what he learned at the inaugural 2026 ISPE AI in Life Sciences Summit – Powered by GAMP in June. Ahead of the event, he spoke to Life Science Connect about the project.

Before we get started, a brief overview of the audit intelligence architecture you built would be helpful. Can you describe the problem statement and your overall approach using AI agents?

Pandey: Preparing for a CDMO audit after a multi-year gap means reconstructing compliance history from five or more disconnected systems: quality management systems (QMS), electronic document management systems (EDMS), supplier portals, regulatory databases, and shared mailboxes. A deviation trend visible only when you cross-reference QMS events with supplier change notifications does not surface when data stays siloed. That is how systemic issues reach an inspector before they reach you.

My approach was to build a multi-agent decision-support system that is explicitly not an automated decision-making tool. Five specialized agents, coordinated by an orchestrator, each query a defined compliance domain. The architecture uses retrieval-augmented generation (RAG) backed by a hybrid knowledge base: vector embeddings for semantic search and a graph database for relational reasoning across document relationships. Every output is cited to a specific source document, version, and retrieval path. The system is designed for private deployment to address data residency requirements, with read-only API connections to enterprise systems and role-based access controls.

What does each specialized agent actually do, and how does the orchestration work? What role does the virtual audit lead play?

Pandey: The orchestrator functions as the audit lead. It parses each query, determines which specialists to engage, sequences their execution, and synthesizes outputs into a consolidated report. Citation enforcement is implemented as a programmatic post-processing check — not a prompt instruction — so no output reaches the user without a verified source reference.
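
The programmatic citation check described above could look something like the following. This is a minimal sketch, not the project's actual code: the `Claim` and `Citation` types, the `enforce_citations` function, the uncertainty wording, and the idea of checking against a set of known document versions are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    document_id: str
    version: str
    retrieval_path: str


@dataclass(frozen=True)
class Claim:
    text: str
    citation: Optional[Citation]


# Hypothetical wording for the explicit statement of uncertainty.
UNCERTAINTY_NOTICE = "No verifiable source was found for this statement."


def is_verified(citation: Optional[Citation], known_sources: set) -> bool:
    """A citation counts as verified only if it is complete and points at a known document version."""
    if citation is None:
        return False
    if not (citation.document_id and citation.version and citation.retrieval_path):
        return False
    return (citation.document_id, citation.version) in known_sources


def enforce_citations(claims: list, known_sources: set) -> list:
    """Post-processing step: replace any claim without a verified citation with an uncertainty notice."""
    checked = []
    for claim in claims:
        if is_verified(claim.citation, known_sources):
            checked.append(claim)
        else:
            checked.append(Claim(text=UNCERTAINTY_NOTICE, citation=None))
    return checked


# Example with synthetic identifiers (hypothetical).
known = {("SOP-001", "3.0")}
draft = [
    Claim("Cleaning SOP was revised after a deviation.", Citation("SOP-001", "3.0", "sop_agent/hybrid")),
    Claim("The supplier changed its resin lot.", None),
    Claim("An older revision applies.", Citation("SOP-001", "2.0", "sop_agent/hybrid")),
]
released = enforce_citations(draft, known)
```

Because the check runs in code after generation, an uncited or mis-cited claim cannot pass simply because the model ignored an instruction.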

Five specialist agents each own a domain with measurable performance criteria.

The SOP agent retrieves revision histories and change rationales across controlled documents.
The internal audit agent identifies recurring findings using pattern-matching against historical audit data.
The quality systems agent cross-references CAPA trends, deviations, and minor deviation comments in batch records that often reveal systemic issues.
The web scraper agent retrieves current FDA and EMA guidance on a controlled update schedule, with content versioning to flag stale sources.
The external conference agent indexes PDA and ISPE proceedings for practical regulatory interpretations.
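
As a rough illustration of how the web scraper agent's content versioning might flag changed or stale sources, here is a minimal sketch. The `SourceSnapshot` type, the SHA-256 fingerprint, the 30-day maximum age, and the placeholder URL are assumptions made for the example; the interview does not specify them.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SourceSnapshot:
    url: str
    content_hash: str
    retrieved_at: datetime


def fingerprint(content: str) -> str:
    """Stable fingerprint of the retrieved guidance text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def has_changed(snapshot: SourceSnapshot, new_content: str) -> bool:
    """True when the freshly retrieved text differs from the stored version."""
    return fingerprint(new_content) != snapshot.content_hash


def is_stale(snapshot: SourceSnapshot, now: datetime, max_age: timedelta) -> bool:
    """True when the snapshot is older than the allowed update interval."""
    return now - snapshot.retrieved_at > max_age


# Example with a placeholder URL and a hypothetical 30-day update schedule.
snapshot = SourceSnapshot(
    url="https://example.org/guidance",
    content_hash=fingerprint("Guidance text, revision 1"),
    retrieved_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
check_time = datetime(2026, 3, 1, tzinfo=timezone.utc)
stale = is_stale(snapshot, check_time, max_age=timedelta(days=30))
changed = has_changed(snapshot, "Guidance text, revision 2")
```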

When agents produce conflicting outputs, the orchestrator follows a documented resolution protocol: flag the conflict, present both sources, and require human adjudication. This is transparent escalation, not silent reconciliation.
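
The escalation protocol above can be sketched as follows. This is an illustrative assumption of how it might be coded, not the prototype itself: the `AgentFinding` and `Escalation` types are hypothetical, and comparing conclusions by simple string equality stands in for whatever conflict test a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentFinding:
    agent: str
    topic: str
    conclusion: str
    source: str


@dataclass(frozen=True)
class Escalation:
    topic: str
    findings: tuple
    status: str = "PENDING_HUMAN_ADJUDICATION"


def detect_conflicts(findings: list) -> list:
    """Group findings by topic and escalate any topic where agents disagree.

    Nothing is merged or discarded: every conflicting finding is kept,
    with its source, for the human reviewer.
    """
    by_topic = {}
    for finding in findings:
        by_topic.setdefault(finding.topic, []).append(finding)

    escalations = []
    for topic, group in by_topic.items():
        if len({f.conclusion for f in group}) > 1:
            escalations.append(Escalation(topic=topic, findings=tuple(group)))
    return escalations


# Example with synthetic findings (hypothetical identifiers).
findings = [
    AgentFinding("sop_agent", "cleaning validation", "current", "SOP-014 v5"),
    AgentFinding("quality_systems_agent", "cleaning validation", "open CAPA", "CAPA-2031"),
    AgentFinding("internal_audit_agent", "training records", "no findings", "AUD-2024-07"),
]
escalated = detect_conflicts(findings)
```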

The project used synthetic data rather than a deployed system. What's the value of working at that conceptual level? What can you learn from a synthetic-data prototype that you can't from jumping straight to a production pilot?

Pandey: Synthetic data solves a fundamental GxP constraint: you cannot iterate on production quality records during development. Real batch records, audit findings, and regulatory correspondence carry confidentiality and data integrity obligations that prohibit experimental use.

But synthetic data is not a shortcut. It must be constructed with production-realistic complexity. Our synthetic dataset includes multi-product CDMO scenarios, overlapping CAPAs, ambiguous deviation classifications, and conflicting supplier notifications, modeled on realistic GxP patterns. The construction methodology is documented as part of the data qualification stage recommended by the ISPE GAMP AI Guide.

Synthetic data answers some questions decisively: Does the architecture distinguish critical from minor deviations? Does the citation mechanism trace reliably to source? Does the orchestrator sequence agents correctly? You know the ground truth because you defined it, which is precisely why this validates design logic, not real-world edge cases.

Synthetic data cannot answer questions about production-scale performance, legacy system integration, or user adoption dynamics. That is why this work is positioned as validation feasibility: it confirms the system can be validated within an existing GAMP 5 framework, not that it is ready for operational deployment. The architecture is designed so that transitioning to real data requires re-qualifying the knowledge base, not re-engineering the system.

RAG is central to the model. Can you describe how that works and what makes it more or less complex from a validation standpoint?

Pandey: The most important thing to understand about RAG is its failure mode. A perfectly reasoning model paired with poor retrieval will produce confidently cited but wrong answers. That risk, authoritative-sounding outputs grounded in the wrong source documents, is the central validation challenge.

RAG works by intercepting the model's reasoning with a retrieval step. Before generating a response, the system searches a curated knowledge base and injects the most relevant document segments into the prompt as context. The model generates answers grounded in those retrieved facts rather than its training data alone.

In our implementation, we moved beyond single-method retrieval. We run Best Matching 25 (BM25) keyword search in parallel with dense vector embeddings and then merge results through Reciprocal Rank Fusion (RRF), which rewards documents appearing in both result sets. A cross-encoder re-ranking step then re-scores the top candidates against the specific query before the final context reaches the language model. In testing, this surfaces exact regulatory citations like 21 CFR 211.68 ahead of semantically similar but imprecise results, a distinction that matters when an auditor needs the precise clause, not a paraphrase.
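
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion. Each document's fused score is the sum of 1 / (k + rank) over every ranking in which it appears. The constant k = 60 is a commonly used default rather than a value given in the interview, and the document IDs are hypothetical. The BM25, embedding, and cross-encoder stages are not shown.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document IDs into one ranking.

    Each ranking is a list of IDs, best first. Returns (doc_id, score)
    pairs sorted by descending fused score, with ties broken by ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Example: the keyword search ranks the exact clause first, while the dense
# search ranks a paraphrase first (all IDs are hypothetical).
bm25_ranking = ["cfr_211_68", "sop_12", "guide_a"]
dense_ranking = ["guide_b", "cfr_211_68", "guide_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
fused_order = [doc_id for doc_id, _ in fused]
```

In this example `cfr_211_68` ranks first after fusion because it appears near the top of both lists, while `guide_b`, which only the dense search retrieved, drops below documents that both methods agree on.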

From a validation standpoint, this means you must validate the entire retrieval pipeline. Chunking strategy, embedding quality, fusion logic, and re-ranking accuracy each require defined test protocols. The ISPE GAMP AI Guide recommends data qualification as a lifecycle stage preceding traditional IQ/OQ/PQ, confirming the knowledge base is complete, current, and correctly indexed. Re-indexing when source documents change is automated through change control-triggered pipelines to prevent drift.

You describe mapping probabilistic AI components to GAMP 5 and CSV controls, which were built for deterministic systems. How does that mapping work in practice?

Pandey: Consider the SOP agent as an example. Under GAMP 5 Category 5, you start with a user requirements specification (URS): the agent must retrieve active and historical SOPs with revision context and change rationale for a defined CDMO scope. Operational qualification tests retrieval accuracy against known document sets and asks whether the agent returns the correct SOP as the top-ranked result, with the correct version and document ID.
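
An operational qualification check of this kind could be expressed as a top-1 accuracy test over a known document set. The sketch below is an assumption about how such a test might be structured: `DocRef`, `OqCase`, and the stub retriever are hypothetical stand-ins, and a real test would call the SOP agent's retrieval function instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DocRef:
    document_id: str
    version: str


@dataclass(frozen=True)
class OqCase:
    query: str
    expected: DocRef


def top1_accuracy(cases: list, retrieve: Callable) -> float:
    """Fraction of cases where the top-ranked result has the expected ID and version."""
    if not cases:
        raise ValueError("at least one test case is required")
    hits = 0
    for case in cases:
        results = retrieve(case.query)
        if results and results[0] == case.expected:
            hits += 1
    return hits / len(cases)


# Stub retriever over synthetic documents (hypothetical), standing in for the agent.
def stub_retrieve(query: str) -> list:
    index = {
        "gowning procedure": [DocRef("SOP-007", "4.0"), DocRef("SOP-007", "3.0")],
        "line clearance": [DocRef("SOP-019", "1.0")],  # wrong version ranked first
    }
    return index.get(query, [])


oq_cases = [
    OqCase("gowning procedure", DocRef("SOP-007", "4.0")),
    OqCase("line clearance", DocRef("SOP-019", "2.0")),
]
accuracy = top1_accuracy(oq_cases, stub_retrieve)
```

Matching on both document ID and version means a superseded revision returned in first place counts as a miss, which is the behavior the requirement describes.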

The critical adaptation for probabilistic systems is that acceptance criteria must be functionally bounded and risk-justified through quality risk management aligned with ICH Q9. The system declines to answer below the defined confidence level, routing to mandatory human review.
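
A confidence gate of this kind can be very small. The following sketch assumes a confidence score between 0 and 1 and a hypothetical threshold of 0.75; in practice the threshold would come from the risk assessment, and how the score is computed is outside the scope of this example.

```python
from enum import Enum


class Route(Enum):
    DECLINE_AND_ESCALATE = "decline_and_escalate"  # no answer shown; sent to mandatory human review
    STANDARD_REVIEW = "standard_review"            # answer shown; still reviewed by a human


# Hypothetical value: the real threshold would be risk-justified and documented.
CONFIDENCE_THRESHOLD = 0.75


def route_output(confidence: float, threshold: float = CONFIDENCE_THRESHOLD) -> Route:
    """Decide how an output is handled based on its confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < threshold:
        return Route.DECLINE_AND_ESCALATE
    return Route.STANDARD_REVIEW
```

Both branches end with a human reviewer. The gate only decides whether the system offers an answer at all.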

In practice, we built this into a continuous validation infrastructure. A GxP-controlled "golden evaluation dataset," with stable IDs, semantic versioning, and required metadata, including change control references, serves as the permanent benchmark. An automated evaluation script runs every golden query through the live RAG pipeline and computes four metrics (context recall, faithfulness, answer relevance, and citation coverage), each with predefined acceptance thresholds. If fewer than 80% of samples produce valid metrics, the pipeline forces a FAIL regardless of individual scores, preventing a broken system from generating a vacuous pass.
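
The gate logic, including the 80% validity safeguard, might be sketched like this. The four metric names and the 80% rule come from the description above. The individual thresholds, the use of a simple mean to aggregate each metric, and the `GateResult` type are assumptions made for illustration, and the metric computation itself is not shown.

```python
import math
from dataclasses import dataclass

# Hypothetical acceptance thresholds; real values would be predefined and risk-justified.
THRESHOLDS = {
    "context_recall": 0.85,
    "faithfulness": 0.90,
    "answer_relevance": 0.80,
    "citation_coverage": 1.00,
}
MIN_VALID_FRACTION = 0.80  # the 80% safeguard described in the text


@dataclass(frozen=True)
class GateResult:
    disposition: str
    valid_fraction: float
    means: dict
    reason: str


def _is_valid(sample: dict) -> bool:
    """A sample is valid only if every metric is present and is a real number."""
    for name in THRESHOLDS:
        value = sample.get(name)
        if value is None or math.isnan(value):
            return False
    return True


def evaluate_gate(samples: list) -> GateResult:
    """Return PASS only if enough samples are valid and every mean metric meets its threshold."""
    if not samples:
        return GateResult("FAIL", 0.0, {}, "no evaluation samples")
    valid = [s for s in samples if _is_valid(s)]
    valid_fraction = len(valid) / len(samples)
    if valid_fraction < MIN_VALID_FRACTION:
        return GateResult("FAIL", valid_fraction, {}, "too few samples produced valid metrics")
    means = {name: sum(s[name] for s in valid) / len(valid) for name in THRESHOLDS}
    failing = sorted(name for name, floor in THRESHOLDS.items() if means[name] < floor)
    if failing:
        return GateResult("FAIL", valid_fraction, means, "below threshold: " + ", ".join(failing))
    return GateResult("PASS", valid_fraction, means, "all metrics met thresholds")


# Example with synthetic scores: a run where 3 of 10 samples failed to score.
good = {"context_recall": 0.95, "faithfulness": 0.95, "answer_relevance": 0.95, "citation_coverage": 1.0}
broken = dict(good, faithfulness=None)
degraded_run = evaluate_gate([good] * 7 + [broken] * 3)
healthy_run = evaluate_gate([good] * 8 + [broken] * 2)
```

Without the validity check, the degraded run would pass on the strength of its seven healthy samples. With it, the run fails before any averages are computed.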

This is operationalized through a continuous integration and continuous delivery/deployment (CI/CD) gate: every code change touching agents, prompts, or the knowledge base automatically triggers the evaluation. A change that causes any metric to fall below threshold is blocked from deployment. This makes validation continuous rather than periodic, which is consistent with computer software assurance (CSA) principles emphasizing critical thinking over exhaustive scripted testing and keeps the validation burden manageable as the system scales across all six agents (the orchestrator and five specialists).

What guardrails does your architecture include to mitigate the risk of an AI system surfacing misleading connections or missing critical observations?

Pandey: A system that confidently presents a wrong connection is more dangerous than no system at all, because it carries the appearance of thoroughness. The architecture addresses this at multiple levels.

Programmatic citation enforcement: Every claim must reference a specific source document, version, and retrieval path. If the system cannot cite a source, it returns an explicit statement of uncertainty rather than generating unsupported content.

Retrieval quality controls: The hybrid search with cross-encoder re-ranking ensures the system retrieves the precise document, not merely a semantically similar one. In testing, chunks containing exact regulatory language consistently rank above higher-scoring but off-topic results.

Cross-agent validation: When agents produce conflicting outputs, the orchestrator flags the conflict and presents both sources, rather than silently resolving disagreements.

Confidence gating: Lower-confidence outputs are routed to an escalated review queue. Critically, all outputs require human review before entering any GxP record, regardless of confidence score.

Last, and most importantly, automated regression protection: The CI/CD evaluation gate runs predefined acceptance criteria on every system change. A safeguard forces a FAIL disposition if fewer than 80% of evaluation samples produce valid metrics, preventing a degraded pipeline from passing undetected. Threshold breaches trigger investigation analogous to an out-of-specification procedure for a calibrated instrument.

Enforcing human-in-the-loop is structural. The system has no write access to any GxP system. Outputs are delivered through a controlled interface with role-based access, full audit logging, and no pathway into quality records without a documented review step. Reviewers follow a defined verification checklist to mitigate automation bias. Override decisions are logged as part of the audit trail. System outputs comply with ALCOA+ principles and access controls consistent with 21 CFR Part 11 and EU GMP Annex 11 requirements.

What milestones must an organization reach before moving from this kind of conceptual evaluation toward a validated, GMP-production deployment?

Pandey: There are five milestones, but with one prerequisite: cross-functional governance starts at day zero. Quality, CSV, IT, AI engineering, cybersecurity, and vendor management must have defined decision rights before the first milestone begins. AI in GxP is not owned by any single function.

Demonstrate feasibility with synthetic data to confirm the architecture works, citation is reliable, and the validation mapping is defensible within GAMP 5 Category 5. That is what this project accomplished.
Conduct a controlled pilot on a low-risk process such as internal audit preparation with parallel manual execution. Success criteria are predefined: retrieval accuracy, citation reliability, and concordance with manual findings. Where AI and manual results disagree, the manual result is the reference standard.
Complete formal validation per agent, including data qualification per the ISPE GAMP AI Guide and risk-based testing aligned with computer software assurance (CSA) principles. Vendor qualification for the underlying LLM is part of this milestone, including contractual controls for model update notification and impact assessment. Build the golden evaluation dataset and automated regression infrastructure so that validation evidence is generated continuously.
Establish continuous monitoring: output quality scoring against the golden dataset, human override rate tracking, input drift detection using statistical distribution monitoring, and periodic ground-truth review on a defined cadence. Threshold breaches trigger documented investigation with root cause analysis and corrective action before the system returns to validated use, analogous to managing an out-of-specification event for a calibrated instrument.
Scale and sustain. The organization is production-ready when governance is mature enough to detect and manage its imperfections in real time. The CI/CD gate, the golden dataset, the continuous metrics, and the cross-functional governance structure together create a system that gets more reliable over time, rather than silently degrading.
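
Two of the monitoring signals named in the fourth milestone, human override rate and input drift, can be illustrated with a short sketch. The Population Stability Index (PSI) used here is one common way to compare distributions and is an assumption on our part; the interview only says "statistical distribution monitoring." The bucket counts are synthetic.

```python
import math


def override_rate(overridden: list) -> float:
    """Fraction of reviewed outputs where the human reviewer overrode the system."""
    if not overridden:
        raise ValueError("at least one review decision is required")
    return sum(1 for flag in overridden if flag) / len(overridden)


def population_stability_index(expected: list, actual: list, eps: float = 1e-6) -> float:
    """PSI between two histograms that share the same buckets.

    expected: bucket counts from the qualified baseline period.
    actual: bucket counts from the current monitoring window.
    A value of 0 means identical distributions; larger values mean more drift.
    """
    if not expected or len(expected) != len(actual):
        raise ValueError("histograms must be non-empty and the same length")
    expected_total = sum(expected)
    actual_total = sum(actual)
    if expected_total == 0 or actual_total == 0:
        raise ValueError("histograms must contain at least one observation")
    psi = 0.0
    for e_count, a_count in zip(expected, actual):
        e_frac = max(e_count / expected_total, eps)  # eps avoids log(0) for empty buckets
        a_frac = max(a_count / actual_total, eps)
        psi += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return psi


# Example: query counts per compliance domain (synthetic, hypothetical buckets).
baseline_counts = [10, 20, 30]
current_counts = [30, 20, 10]
drift = population_stability_index(baseline_counts, current_counts)
no_drift = population_stability_index(baseline_counts, baseline_counts)
rate = override_rate([True, False, False, False])
```

What PSI value or override rate counts as a threshold breach would be set through the same risk-based process as the other acceptance criteria.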

Editor's Notes:

The opinions expressed in this article are solely those of the author and do not necessarily represent those of any current or former employer. No proprietary or confidential information was used. All data referenced is synthetic, generated exclusively for research purposes. No employer guarantees the accuracy or reliability of the information provided herein.

This work was conducted in the author's personal capacity using synthetic, non-production data. The multi-agent system described is a research prototype; no proprietary employer data, system names, or confidential processes are referenced.

About The Expert

Gourav Pandey is an R&D quality lead at Takeda where he oversees end-to-end quality for investigational medicinal products in areas including clinical batch disposition, CDMO governance, and regulatory compliance. His work includes embedding AI into GMP processes where he advocates for transparency and validation. He received his master's degree in regulatory affairs from Northeastern University.