Our Approach to Evals at Monk

January 6, 2026
2 min read
Engineering

Why We Invest in Evals at Monk

We use LLMs across a variety of our products, including contract extraction, payment reconciliation, and our best-in-class intelligent collections agent.

LLMs unlock a huge amount of value by efficiently handling complex, ambiguous tasks, but they also introduce new challenges.

With a wide range of possible inputs and non-deterministic outputs, it can be difficult to understand how performance changes when we switch models or adjust our prompts.

Therefore, we rely on evals to measure quality in a consistent way, giving us the confidence to iterate and ship quickly.

How Evals Power Our Payment Reconciliation Engine

One example of this in action is our payment reconciliation module, which matches incoming bank transactions to the correct invoices. Like most of our workflows, it's a mix of deterministic guardrails and LLMs.

When a transaction can't be deterministically matched with 100% confidence, we use LLMs to suggest potential matches, which are then reviewed by a human.
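Conceptually, the flow looks something like the sketch below. This is a simplified illustration rather than our production code; the data model and the suggest_matches_with_llm helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    amount: float
    reference: str

@dataclass
class Invoice:
    id: str
    amount_due: float
    reference: str

def suggest_matches_with_llm(txn: Transaction, invoices: list[Invoice]) -> list[str]:
    # Hypothetical stand-in for the LLM call that ranks likely invoice matches.
    return []

def reconcile(txn: Transaction, open_invoices: list[Invoice]) -> dict:
    # Deterministic guardrail: a unique exact match on reference and amount
    # is safe to reconcile automatically.
    exact = [
        inv for inv in open_invoices
        if inv.reference == txn.reference and inv.amount_due == txn.amount
    ]
    if len(exact) == 1:
        return {"status": "auto_matched", "invoice_id": exact[0].id}

    # Otherwise, ask an LLM for ranked suggestions and route them to a human reviewer.
    return {"status": "needs_review", "suggestions": suggest_matches_with_llm(txn, open_invoices)}
```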

Each human approval or rejection becomes a labeled data point used in our eval system.

Over time, this dataset grows, giving us a powerful way to measure performance. When we make changes to our system, we run them against this real-world dataset to understand their impact before shipping.
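Each of these data points pairs the context the model saw with the suggestion it made and the decision the reviewer took. Roughly (the field names here are illustrative):

```python
# One labeled example captured from a single human review (illustrative fields).
labeled_example = {
    "input": {
        "transaction": {"amount": 1250.00, "reference": "ACME OCT"},
        "candidate_invoices": ["INV-0041", "INV-0057"],
    },
    "output": {"suggested_invoice_id": "INV-0041"},  # what the LLM proposed
    "expected": {"invoice_id": "INV-0041"},          # what the reviewer approved
}
```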

Creating a Tight Feedback Loop

Because every real-world action is a signal that can be used to improve our systems, we needed a platform that could easily capture this data and feed it into our evals. We chose Braintrust because it makes this process straightforward. Using their LLM logging and experimentation products, we've built a tight feedback loop where we ship features, capture real-world outcomes, evaluate performance, and ship again.
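As a rough sketch of what that loop can look like with Braintrust's Python SDK (the project name, fields, scorer, and candidate task below are illustrative placeholders, not our actual setup):

```python
import braintrust

# Capture real-world outcomes: once a reviewer approves or rejects a suggested
# match, log the model's suggestion alongside the reviewer's decision.
logger = braintrust.init_logger(project="payment-reconciliation")

def record_review(transaction: dict, suggested_invoice_id: str, approved_invoice_id: str):
    logger.log(
        input={"transaction": transaction},
        output={"suggested_invoice_id": suggested_invoice_id},
        expected={"invoice_id": approved_invoice_id},
        metadata={"source": "human_review"},
    )

# Before shipping a change, run the candidate prompt/model against the
# accumulated labeled dataset as an experiment and compare it to the baseline.
def picked_same_invoice(input, output, expected):
    return 1 if output == expected["invoice_id"] else 0

def candidate_task(input):
    # Placeholder for the new prompt/model under test; returns an invoice id.
    return input["candidate_invoices"][0]

braintrust.Eval(
    "payment-reconciliation",
    data=lambda: [
        {
            "input": {"transaction": {"amount": 1250.00}, "candidate_invoices": ["INV-0041"]},
            "expected": {"invoice_id": "INV-0041"},
        },
    ],
    task=candidate_task,
    scores=[picked_same_invoice],
)
```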

Building a Culture Around Experimentation

Because experimentation and evaluation are necessary to improve LLM-driven systems, we believe evals can't be an afterthought; they have to be a core part of how we build.

That means starting evals early, even when the dataset is imperfect, and growing it over time.

Failures and edge cases are perfect candidates for improving the dataset.

We treat evaluation as a shared responsibility across the company, not just an engineering task. Braintrust’s UI makes this possible by enabling non-technical teammates to run experiments and contribute to datasets without needing to write code.

This culture allows us to build resilient agentic systems.