From Prototype to Production: Best Practices for Building Production‑Ready AI Systems
- Rahul Kumar
- Apr 15
- 11 min read
In the journey from a neat AI prototype to a production-ready system, a lot changes. We recently built a financial aggregation platform that uses LLMs (Large Language Models) to perform large-scale entity extraction, running serverless on Kubernetes with Azure Container Apps Jobs. Along the way, we picked up several best practices to make our AI system robust, efficient, and easy to maintain. This blog shares those insights in a comfortable, conversational way so that you can apply them to your own projects.
1. Structure Your Inputs (Why Markdown Helps)
One surprising trick we learned: format your input data in Markdown. LLMs love structured content. When you present information with clear headings, bullet points, tables, etc., the model can parse and interpret it more accurately than a blob of plain text. For example, using Markdown:
Headings (e.g. ## Summary) help the model recognize context or sections.
Bullet lists make it obvious what items are separate points.
Code blocks or tables can preserve formatting for data or examples.
Content structured in Markdown (or similarly structured text) is easier for the model to interpret, which reduces misreadings. For instance, a bulleted list in Markdown is clearly a list of items, so the LLM won't run it together into one long sentence. In our project, we converted input documents (like financial statements) into simplified Markdown text. The result? More consistent entity extraction, because the model wasn't struggling to find patterns in a wall of text.
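To illustrate the idea, here's a small sketch of how a parsed record might be rendered as Markdown before it goes into the prompt. The field names and figures are made up for illustration; the point is the headings and table structure:

# Hypothetical example: turn a parsed record into Markdown so the LLM sees
# a clear heading and table instead of a wall of text.
record = {
    "company": "Acme Corp",
    "period": "Q1 2024",
    "line_items": [("Revenue", "$1.2M"), ("Operating costs", "$800K")],
}

table_rows = "\n".join(f"| {name} | {amount} |" for name, amount in record["line_items"])
prompt = (
    f"## Financial Statement: {record['company']} ({record['period']})\n\n"
    "| Line item | Amount |\n"
    "| --- | --- |\n"
    f"{table_rows}\n\n"
    "Extract the company, each line item, and its amount."
)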
2. Enforce Structured Output with JSON Schemas
Getting a human-like answer is great until you need to parse the output in code. In a production system, you usually want the AI’s output in a structured format (like JSON) so your software can reliably use it. We found that explicitly asking the LLM to output JSON according to a schema dramatically improves reliability.
For example, if your app needs the AI to extract {"company": ..., "amount": ..., "date": ...}, you should prompt the model to only return JSON with those fields. Even better, use tools to validate that output:
Use JSON mode (if supported): Providers like OpenAI offer a JSON mode that restricts the output to valid JSON. This prevents formatting drift and eliminates the need for fragile post-processing.
Pydantic (in Python) allows you to define a data model (schema) and will parse/validate JSON against it.
Libraries like Instructor patch into the OpenAI client to directly return a Pydantic model. If the model’s first attempt isn’t valid, Instructor can even auto-retry the query behind the scenes.
Instead of hoping the LLM magically yields perfect JSON, we check and enforce it. In our pipeline, after getting the model’s response, we run it through Pydantic for validation. This catches any errors or format issues immediately. Instructor (built on Pydantic) made this easy – it essentially makes the LLM output play nicely with our data classes, turning “LLM luck” into a deterministic process. The benefit is huge: downstream components of our platform could trust the data format, avoiding a bunch of fragile string parsing code.
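To make this concrete, here's a minimal sketch of the Pydantic + Instructor approach. The Transaction fields and the example text are made up for illustration; adapt them to your own schema:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Transaction(BaseModel):
    company: str
    amount: float
    date: str

# Instructor patches the OpenAI client so responses come back as validated Pydantic objects
client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=Transaction,   # the schema the output must satisfy
    max_retries=2,                # re-ask the model if validation fails
    messages=[{"role": "user", "content": "Acme Corp paid $1,200 on 2024-03-01."}],
)
print(result.model_dump())        # e.g. {'company': 'Acme Corp', 'amount': 1200.0, 'date': '2024-03-01'}

If validation fails on the first attempt, Instructor feeds the error back to the model and retries, which is exactly the "check and enforce" loop described above.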
3. Reliable Remote Calls: Use Exponential Backoff (with Code Example)
When your application calls an LLM service (like the OpenAI API), network hiccups and rate limits are inevitable. A simple retry loop might be good enough for a prototype, but in production you need a resilient strategy that handles errors without hammering the API. The answer is exponential backoff, a fancy way of saying "wait a bit longer after each retry."
Why use exponential backoff? It prevents overwhelming the API and improves success rates:
Automatic Recovery: The system will keep trying on transient errors (like a momentary network issue or a rate limit hit) without manual intervention.
Efficient Timing: Quick retries are attempted first, but if the issue persists, the wait between attempts grows, giving the service time to recover.
Random Jitter: A bit of randomness in wait times avoids a thundering herd where many retries happen at once.
Here’s a code snippet in Python using the backoff library and OpenAI SDK:
import backoff
from openai import OpenAI, RateLimitError, APIError, APITimeoutError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@backoff.on_exception(backoff.expo, (RateLimitError, APIError, APITimeoutError), max_time=30)
def call_with_backoff(prompt):
    # Retries with exponentially growing (jittered) delays on transient errors
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
# Usage:
response = call_with_backoff("Extract entities from the following text ...")
In this snippet, if the OpenAI API call hits a rate limit or transient error, it will retry with exponentially increasing delays. The max_time=30 means it keeps retrying for up to 30 seconds before giving up (you could also use max_tries to limit the number of attempts). This approach saved our bacon whenever Azure OpenAI transiently reported it was too busy: the job would patiently retry and usually succeed on the next go. No more crashes due to a single 429 response!
4. Bulk Extraction Tips: Batching, Tokens, and Using RAG or Big Models
Our platform had to extract entities from hundreds of documents at a time. Doing this at scale taught us a few tricks for bulk processing:
Batch your work: Instead of one massive prompt with 100 documents, send 10 prompts with 10 documents each (or similar). Batching prevents hitting token limits and can be done in parallel jobs for speed.
Stay under token limits: Every LLM has a context size limit (for example, GPT-4o supports roughly 128k tokens per request). Always make sure each prompt plus its expected answer fits within the model's limit, with some buffer. In practice, we split large texts into chunks (e.g., by paragraph or section) so that each LLM call stayed within limits; see the chunking sketch after this list. This chunking is essential because models simply cannot ingest more than their maximum tokens in one go.
Use RAG for long docs: When facing very large documents or knowledge bases, Retrieval-Augmented Generation (RAG) is your friend. RAG means you index your text (using something like Elasticsearch or a vector DB) and retrieve only the most relevant snippets to feed into the LLM. That way, the prompt stays small no matter how large the total data is. It's how you bypass context size constraints by providing only the pieces the model needs to see (a toy retrieval sketch appears at the end of this section).
Consider large-context models: Newer models are pushing context window boundaries. For instance, Google's Gemini 2.5 Pro boasts a staggering 1-million-token context window (roughly 8× the size of GPT-4o's 128k-token context!). If you have access to such models, they can handle huge inputs in one shot. In practice, we found large-context models intriguing but still somewhat experimental: they can take in all the data, but response quality may vary for truly massive inputs. Use them judiciously, perhaps in combination with RAG (to keep the signal-to-noise ratio high).
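As referenced in the token-limit bullet, here's a minimal token-based chunking sketch using tiktoken. We actually split by paragraph or section, but the idea is the same; this assumes a tiktoken version that recognizes gpt-4o:

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 3000, model: str = "gpt-4o"):
    """Split text into pieces that each fit within max_tokens of the model's tokenizer."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]

# Usage: send each chunk as its own LLM call (possibly spread across parallel jobs)
# chunks = chunk_by_tokens(long_document_text)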
In our case, running on Azure Container Apps Jobs (a serverless K8s), we spun up multiple jobs in parallel, each handling a batch of documents. This parallel batching approach kept each LLM call fast and within limits, and the overall throughput was high. Whenever we tried to shove too much into one request, either the model refused (hit limit) or the latency shot up. Small, manageable chunks with a smart retrieval strategy won the day.
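Speaking of retrieval, here's a deliberately tiny in-memory RAG sketch. Our real setup used a proper index (Elasticsearch or a vector DB), and the embedding model name here is just an example of the pattern:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Embed a list of strings with an OpenAI embedding model
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def retrieve_top_k(question, chunks, k=3):
    """Return the k chunks most similar to the question (cosine similarity)."""
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    scores = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Only the retrieved snippets go into the prompt, keeping it well under the token limit
# context = "\n\n".join(retrieve_top_k("What was Q1 revenue?", chunks))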
5. Scaling Out: Load Balance Your LLM Endpoints (Azure API Management)
As usage grew, we needed to ensure our AI service stayed snappy and didn’t go down if one backend failed. One clever solution was to use Azure API Management (APIM) as a gateway in front of our LLM endpoints. APIM let us do two important things: load balancing and failover.
With APIM, you can register multiple backend endpoints (e.g., two different Azure OpenAI Service deployments or even one OpenAI and one Anthropic API) and set rules for how to route calls between them. We configured a round-robin distribution for load balancing, so each request would go to a different backend in turn, effectively sharing the load. This prevented any single model instance from being overwhelmed with traffic.
We also set up priority-based routing as a fallback. For example, our primary backend was a high-performance (but more costly) GPT-4o instance, and a secondary backend was a slightly cheaper model. APIM can be set so that if the primary fails (e.g., service unavailable), it automatically routes to the secondary. Combined with APIM’s circuit breaker policies, this meant if one of our LLM endpoints started erroring out, it’d be taken out of rotation for a bit while the others carried the traffic—users never noticed a thing.
The beauty of this approach is that it’s all managed at the API gateway level. Our application just calls one API endpoint (the APIM gateway), and behind the scenes, APIM handles the distribution. Microsoft’s GenAI gateway features in APIM provide token-based rate limiting, quota management, and multi-endpoint distribution, which are perfect for LLM services. In short, if you plan to scale or need high availability, don’t rely on a single AI endpoint—load balance it!
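To show how little the application itself has to know, here's a sketch of calling Azure OpenAI through an APIM gateway with the standard SDK. The gateway URL, the key header setup, and the API version are assumptions; match them to your own APIM configuration:

from openai import AzureOpenAI

# The app only knows about the APIM gateway; APIM routes the call to an actual backend.
# Assumes the gateway accepts the subscription key via the "api-key" header
# (the header the Azure OpenAI SDK sends); adjust to your APIM policy if it differs.
client = AzureOpenAI(
    azure_endpoint="https://my-apim-gateway.azure-api.net",  # hypothetical gateway URL
    api_key="<apim-subscription-key>",
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # deployment name; APIM decides which backend actually serves it
    messages=[{"role": "user", "content": "Extract entities from ..."}],
)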
6. Observability and Cost Control: Don’t Fly Blind
Once an AI system is in production, observability (seeing what's going on) and cost tracking become critical. We integrated several tools to keep ourselves informed and costs in check:
Langfuse for LLM Observability: We instrumented our prompts and responses with Langfuse, an open-source LLM observability platform. Langfuse gave us detailed traces of each LLM call: we could see which prompt version was used, the model's raw output, any errors, and token usage. It's like having an x-ray for your AI's decision process. One big plus is prompt versioning: Langfuse can store and compare different prompt templates, so we could A/B test tweaks and see which worked better. It also tracks key metrics like latency and cost per call out of the box. For instance, our Langfuse dashboard would show the average latency of the extraction calls and how much each call cost (in tokens or dollars). We could then optimize slow prompts or identify outlier expensive requests. Being able to version prompts and monitor cost/latency in real time gave us the confidence to iterate quickly without losing track of performance.
Azure Cost Management: Because our system ran in Azure, we set up Azure’s cost monitoring to keep an eye on expenses from the Azure OpenAI service. We defined budgets/alerts—e.g., if this month’s AI spend goes 10% above forecast, send an alert. It’s easy to forget during development that each API call has a real price. In production, these tools ensured there were no surprise bills; we always knew roughly how much each feature was spending. If costs started creeping up, we’d catch it early and investigate (maybe a prompt got longer, or a bug caused repeat calls).
Grafana + Prometheus for Metrics: For system-level observability, we pushed metrics to Grafana. We logged things like the number of requests, processing time per job, token counts, etc., via Prometheus. Grafana’s dashboards let us visualize trends over time (e.g., tokens processed per hour, error rate, memory usage of the container jobs). We also set up alerts. For example, one useful alert was on response latency: if the 95th percentile latency > 3s for 5 minutes, alert us. This caught slowdowns quickly (sometimes due to external API slowness or too many retries). Grafana and Prometheus are great for live monitoring—they won’t give you the deep trace of each prompt like Langfuse, but they will tell you if, say, your average response time doubled after a new release or if the system is approaching a throughput bottleneck. In short, logs and metrics are your best friends in production. Don’t launch an AI service without a way to measure its behavior and costs—you’ll thank yourself later when debugging that one weird failure that only happens in production.
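As a concrete example of the Prometheus side, here's a sketch of the kind of counters and histograms we exported. The metric names are illustrative, and it reuses the call_with_backoff helper from the retry section:

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; exposed on :8000 for Prometheus to scrape
LLM_REQUESTS = Counter("llm_requests_total", "LLM calls made", ["status"])
LLM_LATENCY = Histogram("llm_request_seconds", "LLM call latency in seconds")
LLM_TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["kind"])

start_http_server(8000)

def extract_entities(prompt):
    with LLM_LATENCY.time():                       # record end-to-end call latency
        try:
            response = call_with_backoff(prompt)   # from the earlier backoff example
        except Exception:
            LLM_REQUESTS.labels(status="error").inc()
            raise
    LLM_REQUESTS.labels(status="ok").inc()
    LLM_TOKENS.labels(kind="prompt").inc(response.usage.prompt_tokens)
    LLM_TOKENS.labels(kind="completion").inc(response.usage.completion_tokens)
    return response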
7. High-Level Frameworks for Productivity: LiteLLM and Friends
Tying all the above together might sound like a lot of work, and it is if you implement everything from scratch. Luckily, the AI dev ecosystem is evolving rapidly, and some frameworks bundle many of these best practices for you. One very handy framework we discovered was LiteLLM.
LiteLLM is an open-source tool that serves as a unified proxy/client for working with LLM providers. It wraps various APIs (OpenAI, Azure, Anthropic, Cohere, etc.) and provides a common interface. What's neat is that many of the features we've discussed come baked in:
Multiple Provider Support & Load Balancing: You can configure multiple LLM backends in LiteLLM (even mixing providers), and it will route or fall back seamlessly, similar to our APIM setup. This gives you provider flexibility: switch between OpenAI, Azure, or others without changing your code (see the Router sketch after this list).
Automatic Retries with Backoff: If a call fails due to a rate limit or error, LiteLLM can automatically retry with an appropriate delay. You don’t have to explicitly use a backoff decorator; the library’s got your back.
Token Management (Trimming): LiteLLM can help manage token counts. For example, if you have a conversation history, it can trim older parts when you approach the model’s token limit (to avoid errors). This ensures you don’t exceed context sizes.
Async and Batch Support: Need to fire off many requests in parallel? LiteLLM provides async methods to do so, which is great for bulk processing use cases. Higher throughput without juggling threads or asyncio yourself.
Observability Integration: LiteLLM can integrate with observability tools (like sending events to Langfuse or logging frameworks). In fact, there’s an integration where the LiteLLM OpenAI wrapper works with Langfuse to automatically capture token counts, latencies, and errors for each request.
Usage Logging and Rate Limiting: It logs requests and responses, tracks token usage per request, and can even enforce your own rate limits or budget. For example, you could configure it to ensure no more than X tokens are used per day, or to round-robin across API keys. It essentially provides a higher-level API key management – great for cost control and scaling. LiteLLM’s proxy can track spend by project and even apply per-key limits (tokens per minute, etc.).
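For concreteness, here's a minimal sketch of LiteLLM's Router doing client-side load balancing across two Azure deployments. The deployment names, endpoints, keys, and API version below are placeholders:

from litellm import Router

# Two hypothetical Azure OpenAI deployments registered under one logical model name;
# the Router spreads traffic across them and retries on transient failures.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o-primary",       # placeholder deployment
                "api_base": "https://primary.openai.azure.com",
                "api_key": "<primary-key>",
                "api_version": "2024-06-01",
            },
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o-secondary",     # placeholder deployment
                "api_base": "https://secondary.openai.azure.com",
                "api_key": "<secondary-key>",
                "api_version": "2024-06-01",
            },
        },
    ],
    num_retries=3,
)

response = router.completion(
    model="gpt-4o",  # the logical name; the Router picks which backend serves it
    messages=[{"role": "user", "content": "Extract entities from ..."}],
)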
All of these features come down to saving developer time and preventing mistakes. In our financial data project, we started with basic OpenAI calls and gradually added our own retries, parsing, etc. We later found that much of this reinvention could be avoided by using a robust library like LiteLLM which already has these capabilities. It’s like using a smart SDK that handles the messy parts (errors, load balancing, logging) so you focus on your application logic. We highly recommend exploring such high-level wrappers when going to production—they encapsulate best practices and have been battle-tested by others.
Final Thoughts: Production-Ready AI Systems
Taking an AI system from a prototype to production-ready is an eye-opening experience. You realize that getting good model outputs is only the starting point—what truly matters is ensuring those outputs can be trusted, scaled effectively, and maintained over time in real-world environments. To recap, some key takeaways for building production-ready AI systems include:
Use structured inputs and outputs: Give LLMs well-formatted (e.g. Markdown) text to chew on, and demand structured JSON back. This yields clearer, more usable results.
Build in robustness: Network calls fail, APIs get rate-limited. Implement retries with exponential backoff (or use tools that do it for you) so your app is resilient.
Think at scale: Split big tasks into batches, respect token limits with chunking or RAG, and consider advanced models or techniques for handling lots of data.
Design for scale-out and HA: If one model endpoint isn’t enough or could go down, load balance across many (services like Azure APIM or libraries like LiteLLM make this easier).
Observe and optimize: You can’t improve what you don’t measure. Invest in logging, tracing, and monitoring your LLM’s performance and cost. Tools like Langfuse give deep insight into prompt efficacy and usage stats, while Grafana and cloud cost monitors keep you on top of system metrics and spending.
Leverage frameworks: Don’t waste time re-coding common solutions. Use Pydantic for validation, Instructor for structured output, or LiteLLM for an all-in-one LLM SDK that handles retries, caching, multi-provider, etc. It will accelerate your development and reduce bugs.
Building our financial AI platform was challenging, but following these practices made it robust and production-ready. We went from a prototype that worked “most of the time” to a system that works all the time (and if it encounters an issue, it knows how to recover or at least alert us!). By structuring prompts, enforcing formats, engineering for failure, and monitoring everything, you can confidently deploy AI systems that serve users reliably.
Hopefully, these insights empower you to take your own AI projects to the next level. Production doesn’t have to be scary—with the right practices and tools, you can harness the full power of AI in real-world applications. Happy building! 🚀
Sources:
The practices above are informed by a range of real-world experiences and community knowledge, including OpenAI's guidance on rate limiting and retries (Openai-python Requests Retry Example | Restackio), observations on using Markdown for LLM input (Boosting AI Performance: The Power of LLM-Friendly Content in Markdown | Webex Developers Blog), Microsoft's documentation on scaling AI services with APIM (Improve LLM backend resiliency with load balancer and circuit breaker rules in Azure API Management | Microsoft Community Hub), and open-source tools documentation (Langfuse, Instructor, LiteLLM) that we referenced for implementing structured outputs and robust LLM integration (From Chaos to Order: Structured JSON with Pydantic and Instructor in LLMs) (Building Robust LLM Applications for Production grade scale using LiteLLM. | by M K Pavan Kumar | Medium).
Each technique has proven invaluable in making AI systems smart and dependable in production. Enjoy building!