Most AI agents work on day one. That is the easy part — the part everyone budgets for, the part the demo is built around. The interesting story is what happens between day one and day 90, when the accuracy number that looked clean at launch starts doing things you did not plan for.
We run a lot of agents in production for clients. The 90-day arc is consistent enough that it is worth describing — both for the teams running agents themselves and for the buyers trying to figure out whether to keep one in-house or hand it to us.
Days 1–14: the honeymoon ends
The first two weeks are mostly good. Accuracy is sitting near where it was in the pilot. Users are still learning to trust the agent and are double-checking everything, which means edge cases get caught by humans before they cause downstream problems. Throughput is up; nobody is unhappy.
What is actually happening: the agent is seeing real-world data for the first time in volume. The first edge cases that did not show up in the pilot batch are starting to arrive. Most of them get handled by the exception queue and nobody escalates. But the eval set is starting to look stale, because real volume always surfaces document shapes and inputs the pilot batch did not include.
If you do nothing in this window, you have a problem coming in week 4. If you treat it as the moment to refresh the eval set with the new edge cases, you have a smooth quarter ahead.
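As a rough illustration of that refresh, here is a minimal sketch that folds human-reviewed exception-queue items into an existing eval set. The `EvalCase` fields, the `refresh_eval_set` name, and the choice to deduplicate on raw input text are all assumptions made for the example, not a description of any particular tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    input_text: str   # raw input the agent saw (hypothetical field)
    expected: str     # human-corrected answer (hypothetical field)
    slice_name: str   # e.g. document type or customer (hypothetical field)


def refresh_eval_set(eval_set, reviewed_exceptions):
    """Return a new list: eval_set plus reviewed exceptions not already covered.

    Deduplicates on input_text, which is an assumption; a real pipeline might
    key on a document hash or a case ID instead.
    """
    seen = {case.input_text for case in eval_set}
    merged = list(eval_set)
    for case in reviewed_exceptions:
        if case.input_text not in seen:
            merged.append(case)
            seen.add(case.input_text)
    return merged
```

The function returns a new list rather than mutating the old one, so the launch-day eval set stays available for before-and-after comparisons.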
Days 15–45: the silent drift
Around week 3, users stop double-checking. The agent has earned enough trust that downstream consumers — your underwriters, your operators, your customers — start taking the output at face value.
This is when the silent drift starts being expensive. If a model update from your vendor changes behavior on a niche document type, or if your data distribution shifts (new state added, new product line, new format from a single big customer), the accuracy on that slice quietly degrades. Aggregate accuracy still looks fine. The slice does not.
The teams that survive this window are the ones watching accuracy by slice, not just in aggregate. The teams that watch only the aggregate find out about the drift from a customer or an auditor.
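To make the aggregate-versus-slice gap concrete, here is a small sketch that computes both from the same graded records. The function name, the `(slice_name, is_correct)` record shape, and the numbers in the usage note are illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute overall and per-slice accuracy.

    records: iterable of (slice_name, is_correct) pairs, where slice_name is
    whatever dimension you segment on (document type, state, customer).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    if not totals:
        raise ValueError("no records to score")
    per_slice = {name: hits[name] / totals[name] for name in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_slice
```

With made-up numbers: 950 common documents with 940 graded correct, plus 50 niche documents with 30 graded correct, gives an aggregate of 97% while the niche slice sits at 60%. The aggregate alone would not raise a flag.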
Days 46–90: the first model migration
Somewhere in this window, a new model ships that changes the math. It is cheaper, or smarter, or faster. The default reaction is to migrate immediately and capture the savings. The better reaction is to evaluate against your golden set first.
We have seen new models win on every aggregate benchmark but lose on the specific slice that matters most for a customer's workflow. We have also seen new models save 40% on cost with zero accuracy hit. You do not know which case you are in until you run the evals. The teams that skip this step are the ones who roll forward into a 5-percentage-point accuracy regression that takes a week to notice and a month to undo.
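One way to turn "evaluate against your golden set first" into a gate is to compare per-slice accuracy for the current and candidate models and refuse to migrate if any slice drops too far. The sketch below assumes you already have per-slice accuracy for both models; the `migration_regressions` name and the one-point default tolerance are placeholders, not recommendations.

```python
def migration_regressions(current, candidate, max_drop=0.01):
    """Return slices where the candidate model regresses beyond max_drop.

    current, candidate: dicts mapping slice name -> accuracy in [0, 1],
    both measured on the same golden set.
    A slice missing from the candidate results is treated as accuracy 0.0,
    an assumption chosen so that gaps in coverage fail loudly.
    """
    regressions = {}
    for slice_name, current_acc in current.items():
        candidate_acc = candidate.get(slice_name, 0.0)
        drop = current_acc - candidate_acc
        if drop > max_drop:
            regressions[slice_name] = round(drop, 4)
    return regressions
```

An empty result means no slice regressed beyond tolerance, so the cost or speed gain can be taken. A non-empty result names the slices to investigate before rolling forward.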
By day 90, a well-managed agent has typically gone through at least one model migration, two or three rounds of prompt and guardrail tuning, and a refresh of the eval set with edge cases that did not exist at launch. Accuracy is higher than it was at day 1, not lower. Throughput is up. Exception rate is down.
What this looks like as an operating cadence
Daily: accuracy and exception-rate monitoring; alerting if either crosses a threshold.
Weekly: review of edge cases that landed in the exception queue, with corrections fed back into the eval set.
Monthly: prompt-and-guardrail tuning pass; eval against the latest models on the market; report to stakeholders showing the trend.
Quarterly: structural review of where the agent fits in the larger operation. Is the work it is doing still the right work? Are there adjacent workflows it should now own? Should it be retired?
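The daily item in the cadence above can be as simple as a threshold check on two numbers. The sketch below is one possible shape; the function name and the default thresholds are hypothetical, and real values depend on the workflow and what the pilot established.

```python
def check_thresholds(accuracy, exception_rate,
                     min_accuracy=0.95, max_exception_rate=0.05):
    """Return a list of alert messages; an empty list means nothing fired.

    The default thresholds are placeholders for illustration only.
    """
    alerts = []
    if accuracy < min_accuracy:
        alerts.append(
            f"accuracy {accuracy:.1%} is below the {min_accuracy:.1%} floor"
        )
    if exception_rate > max_exception_rate:
        alerts.append(
            f"exception rate {exception_rate:.1%} is above the "
            f"{max_exception_rate:.1%} ceiling"
        )
    return alerts
```

Running the same check per slice, rather than only on the aggregate, is what catches the quiet degradation described for days 15–45.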
This is what Managed Agents actually is. Not the model — anyone can pick a model. The cadence. That is the part that compounds, and that is the part most teams do not staff for.