How We Built a Clinical AI Agent

Published on June 30, 2025

Agent workflows have quickly emerged as one of the most compelling frontiers in AI application development—but building them for real-world production use cases is still a nuanced and evolving challenge. At Stride, we recently partnered with a healthcare client to redesign a complex patient-facing workflow using LangGraph, LangChain’s graph-based orchestration framework. In doing so, we moved from a brittle traditional rules engine to a robust, LLM-driven agent system that offers flexibility, auditability, and future extensibility—all while meeting real-world clinical and compliance constraints.

This wasn’t a demo or a lab experiment. Our system is now supporting patients in a highly sensitive use case: at-home treatment for early pregnancy loss. This required more than just clever prompting. It demanded thoughtful orchestration, rigorous error handling, human-in-the-loop safeguards, and a purpose-built evaluation framework. In this post, we’ll walk through what made this project unique, how we solved for complex operational requirements, and what we learned from deploying LangGraph in a production-grade healthcare environment.

From Static Rules to Adaptive Agents

The initial implementation of our client’s patient workflow relied on conventional software: brittle logic trees, hardcoded decision rules, and predefined conversational flows. It worked, but it lacked the flexibility needed to support personalized care or scale to new treatment domains. Any updates required manual code changes. Even minor modifications were time-consuming, fragile, and often blocked by limited engineering capacity.

We knew we needed a more flexible architecture—one that empowered domain experts, not just developers, to adapt and improve workflows. That’s where LangGraph came in. LangGraph offers a lightweight but powerful framework for orchestrating stateful, event-driven interactions with LLMs. We used it to embed Claude Sonnet at the center of our workflow and enable dynamic, contextual reasoning.
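To make that shape concrete, here is a minimal sketch of a LangGraph workflow with an LLM node at its center. This is not our production graph; the state fields, node names, and model identifier are illustrative assumptions.

```python
# Minimal sketch only: state fields, node names, and the model id are assumptions.
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class WorkflowState(TypedDict):
    # Conversation turns accumulate through the add_messages reducer.
    messages: Annotated[list, add_messages]


llm = ChatAnthropic(model="claude-3-5-sonnet-latest")  # illustrative model id


def agent_node(state: WorkflowState) -> dict:
    # The LLM reads the current state and proposes the next step.
    response = llm.invoke(state["messages"])
    return {"messages": [response]}


graph = StateGraph(WorkflowState)
graph.add_node("agent", agent_node)
graph.add_edge(START, "agent")
graph.add_edge("agent", END)
app = graph.compile()
```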

In our architecture, the LLM has full access to unsent messages and treatment state anchors, limited access to patient data (only what it needs to make decisions and craft good messaging: no phone numbers, no medical history), and read-only access to message history. It can propose next steps, call tools, and update the conversation, but it cannot edit historical data or override the past. This constraint is crucial in healthcare, where auditability and trust are non-negotiable. Every action is logged and traceable. The result is an AI system that augments, rather than undermines, clinical reliability.
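As a rough illustration of those boundaries, the sketch below separates read-only history from the mutable working set and whitelists which fields an agent update may touch. The field names are assumptions, not our actual schema.

```python
# Illustrative only: field names are assumptions, not the production schema.
from typing import Any, TypedDict


class TreatmentState(TypedDict):
    message_history: tuple[dict, ...]  # read-only: prior turns are never rewritten
    unsent_messages: list[dict]        # mutable: drafts the agent may still edit
    treatment_anchor: str              # where the patient is in the regimen
    patient_context: dict              # limited context only: no phone numbers, no medical history


def apply_agent_update(state: TreatmentState, update: dict[str, Any]) -> TreatmentState:
    # Only whitelisted, non-historical fields can be changed by the agent.
    # In production, every accepted change is also logged for auditability.
    allowed = {"unsent_messages", "treatment_anchor"}
    return {**state, **{k: v for k, v in update.items() if k in allowed}}
```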

Real-Time Recovery and Controlled Autonomy

One of the most important design goals was graceful failure recovery. In real-world deployments, things go wrong: malformed tool calls, unrecognized inputs, ambiguous messages, silent model failures. We needed a system that didn’t just break or stall when things got messy.

To address this, we introduced a "retry node" pattern within the LangGraph structure. When a tool call fails (e.g., invalid JSON), the system automatically deletes the faulty message and loops back to the calling node, instructing the LLM to try again. This prevents hallucinated tool responses and keeps the state machine clean. Similarly, if the LLM ends its turn without a proper response, the retry node kicks in to enforce completion. These patterns make the system more robust and resilient to common model quirks.
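A rough sketch of that loop, reusing the state shape from the earlier sketch, might look like the following. Node names and routing labels are assumptions, and the real graph enforces a retry cap before escalating.

```python
# Sketch of the retry-node pattern; node names and routing labels are assumed.
from langchain_core.messages import AIMessage, RemoveMessage


def retry_node(state: WorkflowState) -> dict:
    # Delete the malformed assistant turn so it never pollutes the history,
    # then loop back to the calling node for another attempt.
    bad_message = state["messages"][-1]
    return {"messages": [RemoveMessage(id=bad_message.id)]}


def route_after_agent(state: WorkflowState) -> str:
    last = state["messages"][-1]
    if isinstance(last, AIMessage):
        if last.invalid_tool_calls:               # e.g. arguments that were not valid JSON
            return "retry"
        if not last.tool_calls and not last.content:
            return "retry"                        # the model ended its turn without responding
        if last.tool_calls:
            return "tools"
    return "respond"


# graph.add_node("retry", retry_node)
# graph.add_conditional_edges("agent", route_after_agent,
#                             {"retry": "retry", "tools": "tools", "respond": "respond"})
# graph.add_edge("retry", "agent")  # loop back to the calling node
```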

We also needed to manage ambiguity. Not every LLM output should be trusted, even if it’s syntactically valid. That’s where our confidence threshold comes in. We tune the model to evaluate its own confidence. If it falls below a given threshold, we escalate to a human reviewer. This isn’t about catching "wrong" answers—it’s about flagging moments where complexity is high and nuance or clinical judgment is required. Importantly, we designed this to distinguish confidence from correctness: an answer can be accurate but uncertain, or confident but flawed. Our system reacts based on operational risk, not blind trust.
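A simplified version of that gate might look like the sketch below; the threshold value, schema, and routing targets are assumptions.

```python
# Sketch only: threshold, schema, and routing targets are assumed values.
from pydantic import BaseModel, Field

CONFIDENCE_THRESHOLD = 0.8  # assumed; tuned per workflow in practice


class DraftReply(BaseModel):
    message: str = Field(description="Proposed reply to the patient")
    confidence: float = Field(ge=0.0, le=1.0, description="Model's self-assessed confidence")


def route_on_confidence(state: dict) -> str:
    draft: DraftReply = state["draft"]
    # Low confidence flags complexity or clinical nuance, not necessarily a
    # wrong answer; those cases go to a human reviewer instead of the patient.
    if draft.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "send_message"


# structured_llm = llm.with_structured_output(DraftReply)
# state["draft"] = structured_llm.invoke(conversation_so_far)
```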

Evaluation Beyond Accuracy

Traditional software testing falls short in LLM workflows. We needed a way to validate the system's behavior not just at the output level, but across the entire conversation state. Our approach combined LangSmith for data handling with a custom evaluation harness written in Python.

We built multiple datasets: "happy path" flows for ideal interactions, as well as targeted edge cases and failure scenarios. These were run through LangGraph locally and evaluated using GPT-4o as a judge. We didn’t rely on exact output matching; instead, we used an LLM-optimized rubric with graded tiers that caught significant deviations. This let us track behavior over time and refine system prompts, confidence thresholds, and tool logic.
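The judging step, in outline, looks something like the sketch below. The rubric wording, tier names, and helper signatures are assumptions; only the general approach (run the graph locally, then grade the transcript with GPT-4o against graded tiers) reflects what we actually did.

```python
# Sketch of the LLM-as-judge step; rubric wording and tier names are assumed.
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

RUBRIC = """Grade the agent transcript against the expected behavior.
- pass: follows the expected workflow step and is clinically appropriate
- minor_deviation: different wording, same outcome
- fail: wrong step, unsafe guidance, or a missed escalation"""


class Grade(BaseModel):
    tier: str = Field(description="pass, minor_deviation, or fail")
    reasoning: str


judge = ChatOpenAI(model="gpt-4o").with_structured_output(Grade)


def grade_example(transcript: str, expected_behavior: str) -> Grade:
    prompt = f"{RUBRIC}\n\nExpected behavior:\n{expected_behavior}\n\nTranscript:\n{transcript}"
    return judge.invoke(prompt)
```

Dataset examples live in LangSmith and are pulled down and run through the graph locally before grading.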

Time-sensitive logic posed a special challenge. Most workflows involved date calculations or time-based triggers, and we wanted to understand whether the model was accurately managing time. To keep evals consistent, we pre-processed timestamps or instructed the model on how to normalize time references. This blend of automation and controlled fuzziness helped us catch regressions without overfitting to specific outputs.
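One way to pre-process timestamps for deterministic evals is to preserve each event's offset from "now" while pinning "now" to a fixed anchor, roughly as sketched below (the anchor date here is an arbitrary assumption).

```python
# Sketch of timestamp normalization for evals; the anchor date is arbitrary.
from datetime import datetime, timezone

FROZEN_NOW = datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc)  # assumed anchor


def normalize_timestamp(original: datetime, original_now: datetime) -> datetime:
    # Keep the offset from "now" rather than the absolute wall-clock time,
    # so date math behaves identically on every eval run.
    return FROZEN_NOW + (original - original_now)
```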

A Modular Platform for the Future

Perhaps the most compelling outcome was how reusable the system became. Once we had a functioning LangGraph workflow for an early pregnancy loss drug regimen, we were able to spin up new treatments with almost no engineering effort. Using Cline (a leading agentic coding tool), we generated new treatment configurations, plugged them into the graph, and deployed without changing the core logic of the treatment agents.

For example, we added a treatment flow for Ozempic using a few lines of prompt-driven configuration. The agent reviewed clinical dosing guidelines, synthesized a new workflow, and integrated it into the graph. The frontend and backend remained unchanged. Only the treatment blueprint was modified—a testament to the power of abstraction and modularity in agent system design.
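For a sense of what a "treatment blueprint" means here, the sketch below shows one possible declarative shape. The schema, step names, and the build_treatment_graph factory are assumptions, and the steps are placeholders rather than real dosing guidance.

```python
# Hypothetical blueprint shape; steps are placeholders, not clinical guidance.
OZEMPIC_BLUEPRINT = {
    "treatment_id": "ozempic",
    "steps": [
        {"name": "confirm_start", "trigger": "enrollment",
         "goal": "Confirm the patient has the medication and a planned start date."},
        {"name": "weekly_check_in", "trigger": "every_7_days",
         "goal": "Check adherence and side effects; escalate anything concerning."},
    ],
    "escalation_rules": ["severe_side_effects", "unreachable_after_reminder"],
}

# The same agent graph serves every treatment; only the blueprint changes.
# app = build_treatment_graph(OZEMPIC_BLUEPRINT)  # hypothetical factory
```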

We also engineered for the reality of asynchronous, multi-message conversations. Patients often send multiple texts in a row. We added configurable delays and thread invalidation logic to ensure we capture the full message burst before responding. The system discards partial responses and recalculates based on the most recent state—delivering more coherent, accurate replies.
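The burst handling reduces to a debounce with version checking, roughly as sketched below; the delay value and class shape are assumptions.

```python
# Sketch of burst debouncing with invalidation; names and delay are assumed.
import asyncio

DEBOUNCE_SECONDS = 20  # assumed; configurable per workflow


class ThreadBuffer:
    def __init__(self) -> None:
        self.pending: list[str] = []
        self.version = 0  # bumped on every inbound message

    async def on_message(self, text: str, respond) -> None:
        self.pending.append(text)
        self.version += 1
        my_version = self.version
        await asyncio.sleep(DEBOUNCE_SECONDS)
        if my_version != self.version:
            # A newer message arrived while we waited: this run is stale,
            # so discard it and let the latest invocation reply instead.
            return
        await respond(list(self.pending))  # respond to the full burst at once
        self.pending.clear()
```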

Lessons for Technical Leaders

This case study offers several takeaways for CTOs and technical leaders exploring LLM-powered agents:

  • Frameworks like LangGraph accelerate development but still require deep thinking about state, error handling, and trust.

  • Retries and human escalation aren't edge cases—they're central to production-readiness.

  • Confidence scores must be calibrated for operational utility, not just model accuracy.

  • Modularity is key: isolate treatments, tools, and prompts so new flows don’t require rewrites.

  • Eval systems need to be fuzzy but grounded: LLMs won’t always say the same thing, but the spirit and outcome must be consistent.

We built what we estimate to be a year’s worth of conventional software in a few months, not because we used AI recklessly, but because we used it intelligently and thoughtfully. This wasn’t about pushing a model to do everything. It was about designing a system where the model knew exactly what it could and couldn’t do, and where humans were still firmly in the loop.

Want to learn more about Stride's agent system architecture or how we use LangGraph in production? Reach out—we love talking about this stuff.

Dan Mason
Head of AI
