

In 2025, considered by many to be the year of AI agents, almost any LLM-based system performs wonderfully in a Proof of Concept (PoC). But what happens when it reaches production?
Inconsistent answers, “hallucinations,” and silent performance degradation can turn a promising PoC into a maintenance nightmare.
When a user receives an incorrect answer, a series of questions arises: how do you diagnose the root cause? Was it a change in the prompt, an issue with the data used as context (RAG), or a failure of the model itself?
In this article, the goal is to step away from the “build and launch” mindset and dive into the “continuous improvement” cycle that defines professional GenAI projects. We will demonstrate how implementing an observability framework transforms LLM systems from unpredictable black boxes into robust, maintainable, and auditable applications.
The real complexity of Generative AI (GenAI) applications doesn’t lie in the initial demo, but in their behavior once deployed in a dynamic, uncontrolled, and large-scale production environment. When a system moves from the sterile confines of a test dataset to the unpredictability of real user interactions, latent problems emerge quickly.
Traditional software systems fail in predictable ways: unhandled exceptions, logic errors, concurrency issues. LLM systems introduce a new taxonomy of failures that are often more subtle and difficult to diagnose: hallucinations, inconsistent answers, irrelevant retrieved context, and silent performance degradation.
In traditional software engineering, monitoring and observability are often afterthoughts. In the GenAI world, this approach is a recipe for failure. Observability must not be an add-on; it must be a fundamental system design requirement from day one. To build robust, maintainable, and enterprise-trustworthy AI applications, a mindset shift is essential: move away from “build and launch” toward a rigorous “measure, evaluate, and iterate” cycle which, when adopted properly, doesn’t have to be slower than the former.
The most common and dangerous starting point in LLM application development is to treat prompts as simple text strings embedded directly in the source code.
# Anti-pattern: the prompt lives hardcoded inside the application code.
def classify_support_ticket(query: str) -> str:
    prompt = f"""
    You are an expert support ticket classifier.
    Classify the following ticket into one of these categories: 'Technical', 'Billing', 'Sales'.
    Ticket: "{query}"
    Category:
    """
    # ... LLM call logic ...

This approach, while fast for prototyping, introduces massive technical debt. Hardcoded prompts are fragile; a change in business requirements (e.g., adding a new category) requires a code modification and a full deployment cycle. They are difficult to track, with no clear history of why a prompt was changed or who changed it. And they create a barrier to collaboration with non-technical domain experts, such as product specialists or the prompt engineers themselves, who shouldn't need to modify the application code to refine LLM prompts.
To overcome these limitations, we must adopt the "Prompts-as-Code" philosophy. This paradigm consists of treating prompts not as strings, but as first-class software artifacts with their own development lifecycle, managed with the same rigor and tools as the application code. This approach is based on several key practices:
Versioning: Every prompt should be under version control, preferably in a Git repository. Clear conventions, such as Semantic Versioning (SemVer: MAJOR.MINOR.PATCH), should be used to communicate the nature of each change.
Structured Documentation: Each version of a prompt should be accompanied by essential metadata: the change author, the justification, the expected behavior, and input/output examples that illustrate its functionality. This documentation is invaluable for debugging and onboarding new team members.
Testing and Validation: No change to a prompt should be deployed blindly. It is essential to have a suite of automated regression tests that verify that the new version does not degrade performance in known and critical use cases. This set of tests acts as a safety net against unexpected changes in the model's behavior.
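As a minimal illustration of that safety net, a regression suite for the ticket classifier shown earlier could look like the sketch below. It assumes a hypothetical my_app module that wraps the LLM call behind classify_support_ticket; the example tickets are also illustrative.

# test_prompt_regression.py
# Sketch of a prompt regression suite; the import path and example tickets are hypothetical.
import pytest

from my_app import classify_support_ticket  # hypothetical module wrapping the LLM call

# Known, critical cases that every new prompt version must keep handling correctly.
REGRESSION_CASES = [
    ("I can't access my account, I've forgotten my password.", "Technical"),
    ("I was charged twice this month.", "Billing"),
    ("Do you offer discounts for annual plans?", "Sales"),
]

@pytest.mark.parametrize("ticket,expected_category", REGRESSION_CASES)
def test_classifier_does_not_regress(ticket, expected_category):
    # If a prompt change breaks any of these known cases, the suite fails before deployment.
    assert classify_support_ticket(ticket).strip() == expected_category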
Prompt management, when done correctly, is not simply a text storage problem. It's a complete DevOps microcycle. It involves writing the prompt, validating its syntax, evaluating its performance, promoting it to an environment, making it available to the application (deploying it), and monitoring (observing its effectiveness). Working on the problem in this way aligns it with mature, widely validated, and documented software engineering practices.
A Prompt Management System, in essence a CMS (Content Management System) for prompts, is an engineering tool that enables the implementation of the "Prompts-as-Code" philosophy at scale. Its primary function is decoupling: it separates instructional logic (the prompt) from business logic (the application code).
This decoupling is an architectural transformation that enables a "two-speed architecture": the application code keeps evolving at the pace of the engineering release cycle, where every change means a git commit or docker build plus a full deployment, while prompts can be edited, tested, and promoted in the CMS in minutes, without touching the codebase. This agility is crucial for responding quickly to production issues. If a newly deployed prompt causes problems, reverting it is as simple as changing a label in the CMS, an operation that takes seconds, instead of a costly and time-consuming code rollback.
Furthermore, a CMS centralizes governance and collaboration: non-technical domain experts can review and refine prompts without touching the application code, and every change is recorded with its author, justification, and version history.
Langfuse, an observability platform for LLMs, includes powerful prompt management functionality that serves as a CMS.
First, in the Langfuse user interface, you can create a new prompt. You assign it a unique name (e.g., customer-support-classifier), write the prompt content using the {{variable}} template syntax, and assign it labels such as development, staging, or production. These labels determine which version of the prompt is considered active in each environment.
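The same prompt can also be registered programmatically. A minimal sketch, assuming the create_prompt method of the Langfuse Python SDK and the credentials configured via environment variables:

# register_prompt.py
# Sketch: registering the prompt via the SDK instead of the UI
# (assumes the SDK exposes create_prompt; check your Langfuse version).
from langfuse import get_client

langfuse = get_client()  # reads LANGFUSE_* credentials from environment variables

langfuse.create_prompt(
    name="customer-support-classifier",
    type="text",
    prompt=(
        "You are an expert support ticket classifier.\n"
        "Classify the following ticket into one of these categories: "
        "'Technical', 'Billing', 'Sales'.\n"
        'Ticket: "{{query}}"\n'
        "Category:"
    ),
    labels=["development"],  # promote to 'staging' or 'production' once validated
)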
Next, the application code is modified to dynamically retrieve the prompt from Langfuse instead of having it hardcoded. Let's see how this is implemented in practice:
# main_app.py
import os
from langfuse import get_client
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# It's good practice to configure credentials using environment variables
# os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
# os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
# os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

# Initialize the Langfuse client.
# The SDK will read the credentials from the environment variables.
langfuse = get_client()

# Get the version of the prompt labeled 'production' from the Langfuse CMS.
# This is the key to decoupling.
try:
    prompt_from_fuse = langfuse.get_prompt("customer-support-classifier", label="production")
except Exception as e:
    print(f"Error getting the Langfuse prompt: {e}")
    # Implement fallback logic, such as using a default prompt
    # ...

# Langfuse uses the syntax {{variable}}, while LangChain uses {variable}.
# The Langfuse SDK provides a utility method for easy conversion.
langchain_prompt_str = prompt_from_fuse.get_langchain_prompt()

# Create a LangChain ChatPromptTemplate from the converted string.
chat_prompt = ChatPromptTemplate.from_template(langchain_prompt_str)

# The application continues its logic using the dynamically obtained prompt.
model = ChatOpenAI(model="gpt-4o")
chain = chat_prompt | model

# Example invocation
query = "I can't access my account, I've forgotten my password."
response = chain.invoke({"query": query})
print(f"Ticket classification: {response.content}")

This sample code demonstrates the decoupling capabilities. The prompts team can now experiment with new versions of customer-support-classifier in Langfuse. They can test a version in staging and, once validated, promote it simply by applying the production label. The application, without any changes to its code, will automatically use the updated prompt on the next run. A critical bottleneck is eliminated, enabling unprecedented iteration speed and responsiveness.
In LLM-based systems, especially in complex architectures like RAG, using print() or traditional logging is fundamentally insufficient for debugging. A log can record discrete events: "User query received," "Documents retrieved," "Error generating response." However, it doesn't capture the causal relationship between these events. It doesn't answer the crucial question: Why did the generation fail? Was it because the retrieved documents were irrelevant?
This is where tracing, the core of observability, becomes indispensable. A trace not only records events but connects them in a hierarchical structure that represents the entire execution flow. It shows how the output of one step becomes the input of the next, creating an auditable and debuggable record of the entire interaction. This causal chain is the key to moving from the generic complaint "the LLM made a mistake" to the precise diagnosis "the retriever failed to find document X because the user query was misinterpreted in the rewrite step."
Langfuse offers seamless and minimally invasive integration with LangChain through its CallbackHandler. This handler hooks into the execution lifecycle of a LangChain chain and automatically reports each step as part of a cohesive trace. Setup is simple. First, the handler is initialized, optionally passing metadata such as the user or session ID to enrich the traces.
# rag_pipeline_setup.py
import os
from langfuse.langchain import CallbackHandler

# Configure Langfuse credentials (usually in environment variables)

# Initialize the handler. Metadata can be passed to identify the session or user.
# This information will be visible in the Langfuse UI and is useful for filtering and analysis.
langfuse_handler = CallbackHandler(
    user_id="user-456",
    session_id="session-xyz-789",
    # Custom metadata can also be added
    metadata={"environment": "production"}
)

Once initialized, the handler is passed to the chain at invocation time via the config parameter. LangChain takes care of the rest, calling the handler at each step of execution.
# rag_pipeline_invoke.py
# ... (here the complete RAG chain is defined, with its retriever, prompt and model)
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = ...
prompt = ...
model = ...

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# The magic happens here: the handler is passed in the invocation configuration.
response = rag_chain.invoke(
    "What are the key evaluation metrics for RAG?",
    config={"callbacks": [langfuse_handler]}
)

With just a few lines of code, as in the example above, each execution of rag_chain will generate a detailed and navigable trace in the Langfuse user interface.
Opening a trace in Langfuse presents a hierarchical view that reveals the complete anatomy of the RAG system's execution.
Level 1: The Root Trace (Trace)
This is the main container representing the complete invocation of rag_chain.invoke(). In the UI, it displays crucial aggregate metrics: the total latency of the request, the total token consumption and its associated cost, and the metadata attached to it, such as user_id and session_id, which allows for grouping and filtering traces.

Level 2: Nested Spans (Spans)
Within the root trace, each component of the LangChain pipeline is represented as a nested "span." For a typical RAG chain, the key spans to analyze are the following:

retriever (or its specific name): This span is the first diagnostic point. Its input is the user's original query. Its output is a list of the documents or chunks retrieved from the vector database. Inspecting this span allows you to immediately answer the question: "Was the correct information retrieved?"

stuff_documents_chain (or similar): This span shows how the retrieved chunks are formatted and combined to be inserted into the final prompt. It is useful for debugging formatting or context truncation issues.

ChatOpenAI (or the model name): This is the generation span. Its input is the final prompt, with the context already injected. Its output is the textual response generated by the LLM. This span is where problems directly related to the model are diagnosed.

The trace structure isn't just for visualization; it's a surgical diagnostic tool.
Latency Diagnostics
The timeline view (Gantt chart) in Langfuse shows the duration and sequence of each span. This allows you to visually identify which step is the bottleneck. If the total trace takes 5 seconds, the timeline might reveal that 4 of those seconds were spent on the retriever span, indicating a performance issue in the vector database, not the LLM.
Cost Diagnostics
Each LLM generation span accurately records the number of input and output tokens in its usage field. This enables granular cost attribution. In a complex agent that might make multiple calls to the LLM (for example, to rewrite the question, plan steps, and then generate the final response), the trace reveals exactly which call is the most expensive. This level of detail transforms cost optimization from an end-of-month invoice analysis task into a proactive engineering decision. If a "planning" step is found to consume 80% of the tokens, a smaller, more cost-effective model can be tested specifically for that task.
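As a back-of-the-envelope illustration of this kind of attribution, the sketch below sums per-span cost from token usage; the span names, token counts, and per-token prices are hypothetical.

# cost_attribution_sketch.py
# Illustrative only: attributing cost to individual spans of a trace.
# Span names, token counts, and prices below are hypothetical.
PRICE_PER_1K_INPUT_TOKENS = 0.0025
PRICE_PER_1K_OUTPUT_TOKENS = 0.0100

spans = [
    {"name": "rewrite-question", "input_tokens": 350, "output_tokens": 60},
    {"name": "plan-steps", "input_tokens": 4200, "output_tokens": 900},
    {"name": "final-answer", "input_tokens": 800, "output_tokens": 250},
]

total_cost = 0.0
for span in spans:
    cost = (span["input_tokens"] / 1000) * PRICE_PER_1K_INPUT_TOKENS
    cost += (span["output_tokens"] / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    total_cost += cost
    print(f"{span['name']}: ${cost:.4f}")

print(f"total: ${total_cost:.4f}")
# With numbers like these, 'plan-steps' dominates the bill, making it the first
# candidate for a smaller, cheaper model.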
Error Diagnostics
If a component in the chain fails (for example, an external API becomes unresponsive), the corresponding span will be highlighted in red in the Langfuse UI, displaying the complete traceback of the error. This immediately isolates the failure to a specific component in the chain, saving hours of debugging.
Once we have visibility into the performance of our RAG system, the next step is to measure its quality. Here, we face a fundamental challenge: traditional machine learning metrics, such as accuracy, are largely useless for generative systems. Accuracy presupposes a single correct answer. In GenAI, there can be infinitely many valid answers to a single question, all with different styles, levels of detail, and wording. We need a more nuanced approach, a system of metrics that can assess the quality of a generated response from multiple dimensions.
The LLM research and development community has converged on a powerful evaluation framework known as the "RAG Triad." This set of three metrics, popularized by frameworks like Ragas, allows for a holistic and, more importantly, diagnostic evaluation of a RAG system.
Context Relevance: Does the retrieved context actually pertain to the user's question? This metric evaluates the retrieval step in isolation; a low score points to a problem in the Retriever (poor embeddings, inadequate chunking, or a misinterpreted query).

Faithfulness (Fidelity/Groundedness): Is the answer grounded in the retrieved context? An answer is considered faithful if every claim it makes can be inferred from the context provided; a low score indicates that the Generator is hallucinating or adding unsupported information.

Answer Relevance: Does the answer actually address the question that was asked? An answer can be perfectly faithful to the context and still miss the point; a low score here also points to a problem in the Generator.
The open-source framework Ragas integrates easily into an evaluation workflow. It can take as input a dataset containing the questions, retrieved contexts, and generated responses (information that can be extracted from Langfuse traces) and automatically calculate the triad metrics.
# evaluate_rag_outputs.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_relevancy, faithfulness, answer_relevancy

# Assuming we have run our RAG system on a set of questions
# and we have extracted the results from the Langfuse traces.
# The values below are illustrative placeholders.
data_samples = {
    'question': ["What does a low faithfulness score indicate?"],
    'answer': ["A low faithfulness score points to a problem in the Generator: the answer contains claims that cannot be inferred from the retrieved context."],
    'contexts': [[
        "Faithfulness refers to the idea that the answer should be grounded in the given context. An answer is considered faithful if the claims made in the answer can be inferred from the context... A low score here points to a problem in the Generator."
    ]],
    # 'ground_truth' is optional for these metrics, but required for others such as 'answer_correctness'.
    'ground_truth': ["A low faithfulness score indicates a Generator problem: the answer is not grounded in the retrieved context."]
}
dataset = Dataset.from_dict(data_samples)

# Configure the models to use for the evaluation (these can be different from those in the app)
# os.environ["OPENAI_API_KEY"] = "..."

# Evaluate the dataset using the triad of metrics. Ragas will use LLMs as judges underneath.
result = evaluate(
    dataset=dataset,
    metrics=[
        context_relevancy,
        faithfulness,
        answer_relevancy,
    ]
)

# The result is a dictionary-like object with the aggregated scores.
print(result)
# Example output:
# {'context_relevancy': 0.95, 'faithfulness': 0.88, 'answer_relevancy': 0.92}

This script, inspired by Ragas's guidelines, demonstrates how qualitative assessment can be quantified and automated.
Having a system of observability and evaluation metrics is necessary, but not sufficient. To build truly robust, enterprise-level GenAI systems, we must automate the application of these quality standards. This is achieved by integrating our evaluation framework into a Continuous Integration and Continuous Deployment (CI/CD) pipeline. The goal is to create automated "quality guardians" that prevent performance-degrading changes from reaching production.
The heart of an automated evaluation system is the Golden Dataset. This isn't simply a set of test data: it's the executable specification of the system's expected behavior, the quality contract that defines what "working well" means for our application. A Golden Dataset for a RAG system typically consists of a curated list of tuples (question, ground_truth_answer, ground_truth_context). These tuples should represent a diverse mix of cases: frequent user questions, important edge cases, and examples drawn from past production failures.
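For concreteness, here is a sketch of what such a dataset file could look like; the file path mirrors the one used in the CI workflow later on, and the entries themselves are illustrative.

# build_golden_dataset.py
# Illustrative sketch: writing a small Golden Dataset to the path the CI workflow expects.
import json

golden_dataset = [
    {
        "question": "What are the key evaluation metrics for RAG?",
        "ground_truth_answer": "The RAG Triad: Context Relevance, Faithfulness, and Answer Relevance.",
        "ground_truth_context": "The RAG Triad is composed of Context Relevance, Faithfulness, and Answer Relevance...",
    },
    # ... more curated tuples: frequent questions, edge cases, past production failures
]

with open("data/golden_dataset.json", "w", encoding="utf-8") as f:
    json.dump(golden_dataset, f, ensure_ascii=False, indent=2)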
The idea of creating a "perfect" and static Golden Dataset all at once is a myth. In reality, it's a living entity that must evolve with the application. The creation and maintenance process is iterative: start with a small, manually curated seed set, then continuously enrich it with real questions and failure cases surfaced by production traces.
The Golden Dataset, therefore, becomes the LLM's behavioral contract. It encodes stakeholder expectations into a machine-readable format. When a developer proposes a change (a new prompt, a different model), the CI/CD pipeline acts as an arbiter, running the system against this "contract." Evaluation metrics determine whether the contract has been fulfilled or broken, transforming subjective debates about whether a change is "better" into an objective, data-driven decision.
A CI/CD pipeline for an LLM application merges traditional software testing practices with the rigor of scientific experimentation. While traditional CI/CD tests deterministic results (e.g., 2 + 2 should always equal 4), CI/CD for LLMs tests the quality of a probabilistic result against an established baseline. Each pull request that modifies the LLM's behavior is not just a code change; it's an experiment. The pipeline automates the execution of this experiment and compares its results to a control group (the performance of the main branch on the same Golden Dataset).
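To make the experiment-versus-control idea concrete, the baseline comparison can be as simple as the sketch below; the file names and the regression margin are assumptions, not part of the workflow shown afterwards.

# compare_to_baseline.py
# Sketch: flag any metric that regresses against the main-branch baseline.
# File names and the tolerance value are hypothetical.
import json

ALLOWED_REGRESSION = 0.02  # tolerated drop per metric

with open("results/eval_scores.json", encoding="utf-8") as f:
    candidate = json.load(f)   # scores produced by this pull request
with open("results/baseline_scores.json", encoding="utf-8") as f:
    baseline = json.load(f)    # scores of the main branch on the same Golden Dataset

regressions = [
    metric for metric, base_score in baseline.items()
    if candidate.get(metric, 0.0) < base_score - ALLOWED_REGRESSION
]

if regressions:
    raise SystemExit(f"Quality regression detected in: {', '.join(regressions)}")
print("No regression against the main-branch baseline.")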
Example of a typical workflow using GitHub Actions:

Trigger: every pull request that modifies the prompt files or the RAG pipeline code.

Workflow Steps: check out the code, install dependencies, run the RAG pipeline on the Golden Dataset, evaluate the generated answers, and enforce the quality thresholds.
# .github/workflows/llm_quality_gate.yml
name: LLM Quality Gate

# The workflow is executed on every pull request that modifies files in 'src/prompts/'
# or the file 'src/rag_pipeline.py'.
on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/rag_pipeline.py'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      # This step simulates running the RAG application with the Golden Dataset.
      # The script 'run_on_dataset.py' would take the dataset and produce a JSON file with the responses.
      - name: Run RAG pipeline on Golden Dataset
        id: run_app
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
        run: python scripts/run_on_dataset.py --dataset ./data/golden_dataset.json --output ./results/generated_answers.json

      # This step runs the evaluation using Ragas or another tool.
      # The 'evaluate_results.py' script would calculate the metrics and save them to JSON.
      - name: Evaluate generated answers
        id: run_eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_results.py --generated ./results/generated_answers.json --output ./results/eval_scores.json

      # The Quality Gate. Compare scores with thresholds.
      - name: Quality Gate Check
        run: |
          # Ensure jq is installed
          sudo apt-get install -y jq
          FAITHFULNESS=$(jq '.faithfulness' ./results/eval_scores.json)
          CONTEXT_RELEVANCE=$(jq '.context_relevancy' ./results/eval_scores.json)
          ANSWER_RELEVANCE=$(jq '.answer_relevancy' ./results/eval_scores.json)
          echo "Evaluation Scores:"
          echo "Faithfulness: $FAITHFULNESS"
          echo "Context Relevance: $CONTEXT_RELEVANCE"
          echo "Answer Relevance: $ANSWER_RELEVANCE"
          # Using 'bc' for floating-point comparison in bash
          if (( $(echo "$FAITHFULNESS < 0.85" | bc -l) )); then
            echo "❌ Quality Gate Failed: Faithfulness score ($FAITHFULNESS) is below the 0.85 threshold."
            exit 1
          fi
          if (( $(echo "$CONTEXT_RELEVANCE < 0.90" | bc -l) )); then
            echo "❌ Quality Gate Failed: Context Relevance score ($CONTEXT_RELEVANCE) is below the 0.90 threshold."
            exit 1
          fi
          echo "✅ All quality gates passed!"

This example, inspired by CI/CD best practices for LLMs, is the simplest practical deliverable of this framework. It unites all the previous concepts (versioned prompts, metrics-based evaluation, and the Golden Dataset) into an automation artifact that actively protects product quality.
We have come a long way from the opacity of the "black box" to the clarity of a "control panel." This observability framework is built on four interconnected pillars that, together, transform LLM systems development from an experimental art to a rigorous engineering discipline:
Prompts as Code: Systematically managing prompts as versioned, tested software artifacts decoupled from application code provides the foundation for stability and agility.
Anatomy of a Trace: Detailed traceability of each execution, enabled by tools like Langfuse, gives us the ability to perform surgical diagnoses, attributing latency, costs, and errors to specific components.
The RAG Evaluation Triad: Meaningful metrics such as Context Relevance, Faithfulness, and Answer Relevance allow us to quantify quality in a nuanced way and, more importantly, isolate failures in the retriever or generator.
Automated Quality Guardians: Integrating these assessments into a CI/CD pipeline, using a Golden Dataset as a quality contract, creates a safety net that prevents performance degradation and ensures continuous improvement.
Adopting this framework represents a fundamental paradigm shift. It moves teams away from "black box" engineering, where changes are made in the hope that they will work, and toward a mature software engineering discipline characterized by measurement, automation, and data-driven decisions. We no longer ask, "Is this answer good?" but rather, "What is the Faithfulness score of this answer compared to the baseline, according to our Golden Dataset?"