Leading at the Speed of the Stack
What a receipt emailer and a local RAG system taught me about agentic AI
I know your calendar. It’s back-to-back calls, business trips, a research team waiting on architectural decisions, and fourteen things that needed your attention yesterday. I jumped into agentic AI early, but reading papers and following along wasn’t cutting it. I needed to understand what operationalization actually feels like.
If you lead a data science or AI team deploying agentic systems, you actually can’t afford not to build one yourself. Not for production. Not to ship. But to understand what happens when components break, when validation fails, when the pieces of your architecture meet reality. That understanding changes how you see the work your team is doing.
Hands-on software development was never my lane. Those who’ve been following along know where I usually operate. But agentic AI is new enough that it isn’t fully anyone’s lane yet. It helps to know what the wiring feels like, and you don’t get that from diagrams or code reviews. You get it from building something, watching it fail, and fixing it.
This post is about how I got started and what I learned that I couldn’t have learned any other way. Maybe it’ll give you permission to try.
Start Small. Embarrassingly Small.
One of my first app builds was a receipt emailer. I took a photo of a receipt, let a vision-capable model extract the vendor, date, total, and line items, and had it draft and send a summary to my inbox.
That was the entire system, nothing enterprise-worthy. But building it taught me more about agentic AI workflows than reading three papers would have.
Here’s what’s happening under the hood. To keep the terminology honest, what I built were model-driven workflows: components with defined roles, connected by tools and control logic. That is the practical sense in which I’m using “agentic” here.
Rather than relying on a single model to do everything, the system splits the work across an orchestrator, task-scoped components, and tools.
The orchestrator is the coordinator. It receives a goal, breaks it into steps, decides which agent or tool to invoke, and manages the sequencing. In simple implementations, this can be application code wrapped around an LLM with a structured prompt and a tool-use loop. In more complex systems, the orchestrator maintains state, handles retries, and decides when to escalate or abort. Think of it as the project manager who never complains about scope creep; it sequences steps and routes to the right agent, but it doesn’t make judgment calls beyond what you’ve coded it to do.
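A minimal sketch of that coordinating role, assuming plain application code rather than any particular framework. The names Step, Orchestrator, max_attempts, and flaky_parse are hypothetical and chosen for illustration; a real step would wrap a model or tool call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]  # receives shared state, returns updates to merge in


@dataclass
class Orchestrator:
    steps: list
    max_attempts: int = 3
    log: list = field(default_factory=list)

    def execute(self, goal: dict) -> dict:
        state = dict(goal)  # the orchestrator owns the state between steps
        for step in self.steps:
            for attempt in range(1, self.max_attempts + 1):
                try:
                    state.update(step.run(state))
                    self.log.append(f"{step.name}: ok on attempt {attempt}")
                    break
                except Exception as exc:
                    # Record every failure so retries are never silent.
                    self.log.append(f"{step.name}: attempt {attempt} failed: {exc}")
            else:
                # No attempt succeeded: abort instead of passing bad state along.
                raise RuntimeError(f"aborting: {step.name} failed {self.max_attempts} times")
        return state


# Hypothetical step that fails once, then succeeds.
attempts = {"count": 0}


def flaky_parse(state: dict) -> dict:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ValueError("transient failure")
    return {"parsed": True}


orchestrator = Orchestrator(steps=[Step("parse", flaky_parse)])
final_state = orchestrator.execute({"image_path": "receipt.jpg"})
```

The orchestrator here only sequences, retries, and aborts. Every judgment about what counts as success lives in the steps you hand it.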
Each component is scoped to a specific task. In the receipt emailer, one step handled image interpretation. A second structured the extracted data. A third composed the email body.
Tools are external capabilities the system can call, whether that’s a file system, an API endpoint, or another service. Tools are how models reach outside themselves, but their inputs and outputs still need validation, retries, and error handling.
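One way to make that validation explicit is a thin wrapper around every tool call. This is a sketch under assumptions: ToolResult, call_tool, and fake_send_email are hypothetical names, and the fake tool stands in for a real email API. Retries are left to the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: str = ""


def call_tool(tool: Callable[[dict], Any], payload: dict,
              required_inputs: tuple, required_outputs: tuple) -> ToolResult:
    # Validate inputs before touching the external dependency.
    missing = [key for key in required_inputs if not payload.get(key)]
    if missing:
        return ToolResult(ok=False, error=f"missing inputs: {missing}")
    try:
        raw = tool(payload)
    except Exception as exc:
        return ToolResult(ok=False, error=f"tool raised: {exc}")
    # Validate outputs before anything downstream consumes them.
    if not isinstance(raw, dict):
        return ToolResult(ok=False, error=f"expected dict, got {type(raw).__name__}")
    absent = [key for key in required_outputs if key not in raw]
    if absent:
        return ToolResult(ok=False, error=f"missing outputs: {absent}")
    return ToolResult(ok=True, value=raw)


def fake_send_email(payload: dict) -> dict:
    """Hypothetical stand-in for a real email-sending tool."""
    return {"message_id": "demo-1", "accepted": True}


good = call_tool(fake_send_email, {"to": "me@example.com", "body": "hi"},
                 required_inputs=("to", "body"), required_outputs=("message_id",))
bad = call_tool(fake_send_email, {"to": "", "body": "hi"},
                required_inputs=("to", "body"), required_outputs=("message_id",))
```

Returning an explicit ok flag with an error string, rather than a bare value, means the caller never has to guess whether a call succeeded.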
For the receipt emailer, the orchestrator received the image path and instructions, then invoked a vision step that used a multimodal model to parse the receipt into structured JSON. That output went to a drafting step that composed the email body. An email-sending tool then sent the message.
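The flow above can be sketched end to end with stand-ins for the model and the mail service. Everything here is illustrative: the function names, the REQUIRED_FIELDS tuple, and the sample receipt values are made up, extract_receipt returns canned JSON where a multimodal model call would go, and draft_email uses a template where the original used a model.

```python
import json

REQUIRED_FIELDS = ("vendor", "date", "total", "line_items")


def extract_receipt(image_path: str) -> str:
    """Stand-in for the vision step; a real version would call a multimodal model."""
    return json.dumps({
        "vendor": "Example Cafe",
        "date": "2024-01-15",
        "total": 12.50,
        "line_items": [
            {"item": "Sandwich", "price": 9.00},
            {"item": "Coffee", "price": 3.50},
        ],
    })


def parse_receipt(raw: str) -> dict:
    # Model output is untrusted text until it parses and has the expected fields.
    receipt = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in receipt]
    if missing:
        raise ValueError(f"receipt is missing fields: {missing}")
    return receipt


def draft_email(receipt: dict) -> str:
    lines = [f"- {entry['item']}: {entry['price']:.2f}" for entry in receipt["line_items"]]
    header = f"Receipt from {receipt['vendor']} on {receipt['date']}"
    return "\n".join([header, *lines, f"Total: {receipt['total']:.2f}"])


def send_email(body: str, outbox: list) -> None:
    """Stand-in for the email tool; appends to a list instead of sending."""
    outbox.append(body)


def run_pipeline(image_path: str, outbox: list) -> str:
    receipt = parse_receipt(extract_receipt(image_path))
    body = draft_email(receipt)
    send_email(body, outbox)
    return body


outbox = []
body = run_pipeline("receipt.jpg", outbox)
```

The parse step between vision and drafting is the piece that is easiest to skip and most useful to keep, since it is where malformed model output gets caught.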
The pipeline ran in under ten seconds. It failed several times before it worked reliably, and I learned what I needed to know through those failures. Prompt brittleness shows up fast when your input is a crumpled thermal print, and tool errors surface the moment an external dependency misbehaves and you haven’t planned for it. Context bleed becomes obvious when you realize you’ve been passing full conversation history through every agent call and quietly inflating token usage. Beyond the cost, large contexts increase latency and scatter model attention, which often means slower inference and poorer reasoning. These are the failure modes that surface first in practice.
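A small sketch of one way to limit context bleed: hand each step only the fields it needs instead of the whole accumulated state. The helper names and sample state are hypothetical, and serialized length is used as a crude proxy for token count, not a real tokenizer.

```python
import json


def scoped_context(state: dict, needed: tuple) -> dict:
    # Pass along only the keys this step actually uses.
    return {key: state[key] for key in needed if key in state}


def rough_size(payload: dict) -> int:
    # Crude proxy for prompt size; a real system would count tokens.
    return len(json.dumps(payload))


state = {
    "receipt": {"vendor": "Example Cafe", "total": 12.50},
    "history": [f"message {i}: earlier back-and-forth" for i in range(50)],
}
drafting_input = scoped_context(state, needed=("receipt",))
```

Comparing rough_size(drafting_input) with rough_size(state) makes the overhead of forwarding full history visible before it shows up on a bill.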
Building this made the abstractions concrete. Orchestrator, agent, tool, and inter-agent validation stopped being vocabulary and became design decisions with real tradeoffs. In practice, that validation often requires an explicit verification or reflection step. I found that a critique loop, in which the orchestrator forces self-correction when the drafting agent deviates from the vision data, was essential for grounding the system. These choices feel different once you watch your own pipeline fall over.
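A critique loop of that kind can be sketched as follows. This is an assumption-laden illustration, not the original implementation: the check is a deterministic string comparison rather than a model-based critic, and find_discrepancies, critique_loop, and fake_drafter are hypothetical names. The fake drafter deliberately gets the total wrong on its first try.

```python
def find_discrepancies(receipt: dict, draft: str) -> list:
    # Compare the draft against the structured data extracted by the vision step.
    problems = []
    if receipt["vendor"] not in draft:
        problems.append(f"vendor should be {receipt['vendor']}")
    total_text = f"{receipt['total']:.2f}"
    if total_text not in draft:
        problems.append(f"total should be {total_text}")
    return problems


def critique_loop(receipt: dict, drafter, max_rounds: int = 3):
    feedback = []
    for round_number in range(1, max_rounds + 1):
        draft = drafter(receipt, feedback)
        feedback = find_discrepancies(receipt, draft)
        if not feedback:
            return draft, round_number
    raise RuntimeError(f"draft still inconsistent after {max_rounds} rounds: {feedback}")


def fake_drafter(receipt: dict, feedback: list) -> str:
    """Stand-in for the drafting model: wrong total until it receives feedback."""
    total = receipt["total"] if feedback else 21.50
    return f"You spent {total:.2f} at {receipt['vendor']}."


receipt = {"vendor": "Example Cafe", "total": 12.50}
final_draft, rounds_used = critique_loop(receipt, fake_drafter)
```

The bounded round count matters as much as the check itself, because a loop that can never give up turns one bad draft into an unbounded bill.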
The Second Build: A 100% Local Specialist Adviser
After the receipt emailer, I wanted something more representative of what enterprise teams are actually building. I built a specialist adviser, which is a system that ingests documents, reasons over them, and provides domain-specific guidance. It ran fully local, with no calls to a commercial API.
I chose to run it locally to understand the constraints. Cloud-based versions already exist and are capable. But running everything on local infrastructure forces you to make tradeoffs that API wrappers can hide. It also gives you tighter control over data, logging, and observability across the whole stack.
The stack was deliberately constrained. Documents were chunked locally, embedded with a Sentence Transformers embedding model, and stored in a local Chroma vector store running in-process. At query time, the question was embedded with the same model, Chroma retrieved the nearest chunks by vector similarity, and those chunks were passed as context to a small language model served locally through Ollama’s API. In practice, that part was surprisingly painless: you pull a model, it runs on your machine, and your code talks to it through a local endpoint like any other model call. The system let the user choose from a set of small language models depending on the task and available compute. A retrieval stage and an answering stage handled those steps, coordinated by a lightweight Python orchestrator.
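To show the shape of the retrieval stage without requiring any installs, here is a standard-library toy. It is explicitly not the stack described above: a bag-of-words counter stands in for the Sentence Transformers embedding, an in-memory list stands in for Chroma, and the generation call to a local model is omitted. ToyVectorStore and the sample policy chunks are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real system would use a learned embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self.items = []  # (chunk_id, text, vector)

    def add(self, chunk_id: str, text: str) -> None:
        self.items.append((chunk_id, text, embed(text)))

    def query(self, question: str, k: int = 2) -> list:
        # Embed the question the same way as the chunks, then rank by similarity.
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(chunk_id, text) for chunk_id, text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("policy-1", "Customer data retention period is seven years.")
store.add("policy-2", "Access reviews are performed every quarter.")
store.add("policy-3", "Incident reports must be filed within two days.")

top = store.query("What is the data retention period?", k=1)
```

The one property this toy shares with the real stack is the important one: the question and the chunks go through the same embedding function, so similarity scores are comparable.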
The key architectural decision was to keep retrieval and generation separate. The retrieval layer handles information lookup, and the model’s job is to synthesize what retrieval surfaces, keeping factual grounding anchored in retrieved context rather than leaning too heavily on parametric memory. When that boundary is respected, a small model can be sufficient for many narrow, document-grounded tasks. When it isn’t, you get confident-sounding answers with no grounding in your documents. Because even small models can sound plausible, retrieval failures can slip past casual review.
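One way to enforce that boundary in code is to build the prompt only from retrieved chunks and to skip generation entirely when retrieval comes back empty. This is a sketch with hypothetical names (build_prompt, answer, NO_CONTEXT_REPLY) and an assumed prompt wording; retrieve and generate are passed in as plain callables, so any vector store or local model could sit behind them.

```python
NO_CONTEXT_REPLY = "I could not find this in the provided documents."


def build_prompt(question: str, chunks: list) -> str:
    # Label each chunk with its id so answers can be traced back to a source.
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    return (
        "Answer using only the context below and cite chunk ids in brackets.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str, retrieve, generate) -> str:
    chunks = retrieve(question)
    if not chunks:
        # Retrieval found nothing: do not let the model answer from memory alone.
        return NO_CONTEXT_REPLY
    return generate(build_prompt(question, chunks))


# Stand-ins for the retrieval layer and the local model.
seen_prompts = []


def fake_generate(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "Seven years [policy-1]."


grounded = answer(
    "How long is data retained?",
    lambda q: [("policy-1", "Customer data is retained for seven years.")],
    fake_generate,
)
ungrounded = answer("What is the travel policy?", lambda q: [], fake_generate)
```

Logging the prompts that reach the model, as seen_prompts does here, is also what makes the system auditable: you can check what the model was shown, not just what it said.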
The system could take a stack of policy documents, governance frameworks, or technical reports and respond to specific questions with grounded, referenced answers. Not perfect, but functional and auditable. And building it made visible a set of decisions that enterprise teams are making right now, often without fully understanding their downstream consequences.
Why I Built These
I didn’t build these to ship production systems. I have a team for that. I built them so that when we debate architecture or failure modes, I’m not speaking from theory.
When engineers debate whether to use 256- or 512-token chunks, it helps to understand why that choice matters. When they propose a two-agent architecture, it helps to see whether the orchestrator design fits the use case or whether a simpler pipeline would suffice. And when something breaks in production, knowing which layer to inspect first significantly shortens the feedback loop.
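The chunk-size tradeoff is easier to reason about with a concrete splitter. This sketch assumes a pre-tokenized document, with a list of placeholder strings standing in for real model tokens; chunk_tokens and the overlap parameter are illustrative choices, not a prescribed method.

```python
def chunk_tokens(tokens: list, size: int, overlap: int = 0) -> list:
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    if not tokens:
        return []
    step = size - overlap
    # Stop early enough that the final chunk is never fully contained in the previous one.
    starts = range(0, max(len(tokens) - overlap, 1), step)
    return [tokens[start:start + size] for start in starts]


document = [f"tok{i}" for i in range(1000)]  # placeholder for a tokenized document

small_chunks = chunk_tokens(document, size=256)
large_chunks = chunk_tokens(document, size=512)
overlapping = chunk_tokens(document, size=256, overlap=32)
```

For this 1,000-token placeholder, 256-token chunks give four pieces and 512-token chunks give two. Smaller chunks retrieve more precisely but carry less surrounding context, and each retrieved chunk consumes part of a small model's limited window, which is why the number is worth a debate.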
The tools have gotten genuinely good. Claude Code, Cursor, and GitHub Copilot all scaffold quickly, suggest sensible structure, and significantly cut the time from idea to a running system. But they don’t think through your architecture, question your data assumptions, or tell you when your validation logic is missing. That responsibility still sits with you.
When I built the specialist adviser, the models suggested a single pipeline; I had to override that and decompose it into retrieval and reasoning agents instead. The tools gave me working code. I had to give it the right structure.
There’s also something about agentic systems that most tutorials gloss over. The real risk arises from interactions between components. An orchestrator that passes unvalidated outputs, an agent that retries silently, or a tool that returns ambiguous status codes may seem minor on their own, but in a chained system, they compound quickly. After building one yourself, you can spot those failure modes earlier and with less guesswork.
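A back-of-the-envelope way to see the compounding, under the simplifying assumption that each step succeeds independently with the same probability. The 95% figure is invented for illustration, not a measurement from either build.

```python
def chain_success(per_step: float, steps: int) -> float:
    # Probability that every step in a chain succeeds, assuming independence.
    return per_step ** steps


single_step = chain_success(0.95, 1)   # 0.95
five_steps = chain_success(0.95, 5)    # roughly 0.77
```

A step that looks fine at 95% in isolation leaves a five-step chain succeeding only about three times in four, which is why unvalidated handoffs that seem minor are not.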
And this goes beyond the technical. Product development at this level is a chain of compounding decisions. You have to be clear about the problem you’re solving, what the user genuinely needs, and where the system should end. Beyond that point, human judgment takes over. These are questions of intent and design, not strictly engineering questions. They must be answered before anyone writes a line of code. Without these answers, you end up with systems that are technically functional but misaligned with the organization’s needs. That misalignment is expensive to fix later, and it’s almost always invisible until something goes wrong.
The window for learning doesn’t need to be wide; it just needs to exist. Start with something small, learn from where it breaks, and then build something harder.




