Jul 24, 2025
Your conversational AI agent might pass every standard test, yet still fail at the one thing that truly matters: completing real tasks for real users.
It may answer isolated questions like “What’s my refund status?” with perfect accuracy. But in production, things fall apart. The agent forgets context, invokes the wrong tool, or fails to coordinate with the right internal component. The result? A task that starts but never finishes, leaving users confused and unsupported.
This breakdown happens because most testing still focuses on single-turn accuracy. But in complex systems where agents manage workflows, trigger tools, and interact with subsystems, prompt-by-prompt correctness isn’t enough. What matters is whether the entire system can stay on track and deliver end-to-end outcomes.
Galtea’s testing framework is built for exactly this. Our multi-turn, scenario-based evaluations measure task success, not just response quality. We simulate realistic user journeys where your agent must hold context across turns, invoke the right tools, and coordinate with the right internal components to see the task through.
With Galtea’s Conversation Simulator and Scenario Generator, you can uncover the breakdowns that traditional tests miss before they affect your users. Because in real-world deployments, passing a test is not enough. Completing the task is what counts.
Simple Q&A testing misses the real challenges that break task completion in multi-turn, conversational AI agents:
| Challenge | Key Question |
|---|---|
| Task focus | Can the agent stay focused on helping complete a multi-step task? |
| Context drift | Does your AI remember what the user said three turns ago? |
| Persona consistency | Does it maintain its role and tone throughout? |
| Graceful handling | How does it respond when users change direction or provide unclear input? |
These failures often go undetected until production, where they directly impact user experience and business outcomes.
Galtea’s Conversation Simulator is a framework designed specifically for testing conversational AI systems in realistic dialogue scenarios. Instead of isolated Q&A pairs, it generates dynamic user messages that simulate authentic multi-turn conversations guided by structured test scenarios.
Before diving into the details, take a quick look at our demo below to see how the Conversation Simulator works in just under 2 minutes! The demo walks through the same workflow covered in the rest of this post: wrapping your agent, running simulations, and reviewing the results.
Each simulation runs your AI through a complete conversational flow, driven by a structured test scenario that defines a user persona, a goal, and success criteria. Each run lets you probe four dimensions:
- **Dialogue Flow Testing:** Ensure your AI produces coherent, logical conversations that feel natural to users.
- **Role Adherence:** Verify your agent maintains its intended personality, expertise level, and communication style throughout extended interactions.
- **Task Completion:** Test whether your AI can successfully guide users through complex, multi-step processes.
- **Robustness Testing:** Discover how your AI handles unexpected user behavior, topic changes, or ambiguous requests.
Integration Ready: The simulator integrates seamlessly into CI/CD pipelines using the Galtea Python SDK, enabling continuous testing as your AI evolves.
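As a rough sketch of what such a CI check might look like, the hypothetical pytest test below chains together the SDK calls walked through in the rest of this post; the stub agent, the environment variable name, and the choice of `result.finished` as the pass criterion are illustrative assumptions, not official Galtea conventions.

```python
# Hypothetical CI check built from the SDK calls shown later in this post.
# The stub agent, env var name, and pass criterion are assumptions.
import os

import galtea

class EchoAgent(galtea.Agent):
    """Trivial stand-in; in CI you'd use your real wrapper (next section)."""
    def call(self, input_data: galtea.AgentInput) -> galtea.AgentResponse:
        return galtea.AgentResponse(content=input_data.last_user_message_str())

def test_all_scenarios_finish():
    client = galtea.Galtea(api_key=os.environ["GALTEA_API_KEY"])
    product = client.products.get_by_name("your-product-name")
    version = client.versions.get_by_name(
        product_id=product.id, version_name="your-version-name"
    )
    test = client.tests.create(
        product_id=product.id, name="CI Conversation Test", type="SCENARIOS"
    )
    for test_case in client.test_cases.list(test_id=test.id):
        session = client.sessions.create(
            version_id=version.id, test_case_id=test_case.id
        )
        result = client.simulator.simulate(
            session_id=session.id, agent=EchoAgent(), max_turns=10
        )
        assert result.finished, f"Unfinished scenario: {test_case.scenario}"
```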
Before diving into simulations, you’ll need to wrap your conversational AI using Galtea’s Agent interface. This simple Python class ensures your agent receives full conversation context and can be evaluated under realistic conditions.
```python
import galtea

class MyGalteaAgent(galtea.Agent):
    def __init__(self):
        # Initialize your model here (e.g., load your LLM or service client)
        self.agent = MyAgent()

    def call(self, input_data: galtea.AgentInput) -> galtea.AgentResponse:
        # Access the latest user message
        user_message = input_data.last_user_message_str()

        # Generate a response using your own logic/model
        response = self.agent.generate_response(user_message)

        # Return a structured response (optionally with metadata)
        return galtea.AgentResponse(content=response)
```
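To make the wrapper concrete, here is a minimal sketch that backs the interface with an OpenAI chat call. The OpenAI client and model name are illustrative assumptions, not part of the Galtea SDK; only `galtea.Agent`, `AgentInput.last_user_message_str()`, and `AgentResponse` come from the example above.

```python
# Illustrative sketch: backing the Galtea wrapper with an OpenAI chat call.
# The OpenAI usage and model name are assumptions; the Galtea interface
# (Agent, AgentInput, AgentResponse) matches the example above.
import galtea
from openai import OpenAI

class OpenAIGalteaAgent(galtea.Agent):
    def __init__(self):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def call(self, input_data: galtea.AgentInput) -> galtea.AgentResponse:
        user_message = input_data.last_user_message_str()
        completion = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful support agent."},
                {"role": "user", "content": user_message},
            ],
        )
        return galtea.AgentResponse(content=completion.choices[0].message.content)
```

Note that this sketch only forwards the latest user message; in production you would pass along the full conversation context that `AgentInput` carries, though the exact accessor isn’t shown in this post.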
Once your agent is ready, running simulations is straightforward:
```python
# Initialize the Galtea client
galtea_client = galtea.Galtea(api_key="YOUR_API_KEY")

# Get the product and version you want to evaluate
product = galtea_client.products.get_by_name("your-product-name")
version = galtea_client.versions.get_by_name(
    product_id=product.id,
    version_name="your-version-name"
)

# Create a test suite
test = galtea_client.tests.create(
    product_id=product.id,
    name="Multi-turn Conversation Test",
    type="SCENARIOS"
)

# Get your test scenarios
test_cases = galtea_client.test_cases.list(test_id=test.id)

# Create your agent
agent = MyGalteaAgent()

# Run simulations
for test_case in test_cases:
    session = galtea_client.sessions.create(
        version_id=version.id,
        test_case_id=test_case.id
    )

    result = galtea_client.simulator.simulate(
        session_id=session.id,
        agent=agent,
        max_turns=test_case.max_iterations or 10,
        log_inference_results=True
    )

    # Analyze results
    print(f"Scenario: {test_case.scenario}")
    print(f"Completed {result.total_turns} turns")
    print(f"Success: {result.finished}")
    if result.stopping_reason:
        print(f"Ended because: {result.stopping_reason}")

    # Evaluate the session against your chosen metrics
    evaluation_task = galtea_client.evaluation_tasks.create(
        session_id=session.id,
        metrics=["Role Adherence", "other_metrics"],  # Replace with your metrics
    )
    print(f"Evaluation task created: {evaluation_task.id}")
```
After running simulations, you’ll get detailed insights into your AI’s performance based on the metrics you choose to evaluate it on, with visibility into how many turns each conversation took, whether the task was completed, why each simulation ended, and how the session scored on each metric.
Creating comprehensive test scenarios manually is time-consuming and often misses edge cases. Galtea’s Scenario Generator solves this by automatically creating diverse, product-specific test cases tailored to your use case.
Our generator creates scenarios that are both grounded in your product’s functionality and diverse enough to catch unexpected failure modes. The goal is comprehensive coverage without the manual effort of writing hundreds of test cases.
When creating a test, simply select “Generated by Galtea” instead of uploading your own CSV. Our system analyzes your product information and generates scenarios like:
| User Persona | Goal | Scenario | Success Criteria |
|---|---|---|---|
| Sarah Chen is a 34-year-old software engineer who joined 3 months ago. She’s analytical, detail-oriented, and somewhat skeptical of AI systems. As a new employee, she prefers researching independently before contacting HR directly. | Understand parental leave policy and begin planning maternity leave starting in 6 months. | Sarah recently discovered she’s pregnant and needs comprehensive information about the company’s parental leave policies, including eligibility, duration, pay structure, and application process. She’s particularly concerned about confidentiality and doesn’t want her manager notified yet. | Sarah receives complete parental leave information AND confidentiality concerns are addressed AND she understands the application timeline. |
This scenario was generated for an HR Support Bot.
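If you do author scenarios yourself, the CSV you upload presumably mirrors the four columns above. As a sketch (the exact header names Galtea expects aren’t specified in this post, so treat them as assumptions), you might generate such a file like this:

```python
# Illustrative only: writing a scenario CSV with the four columns from the
# table above. The exact header names Galtea expects are assumptions.
import csv

with open("scenarios.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_persona", "goal", "scenario", "success_criteria"])
    writer.writerow([
        "Sarah Chen, a detail-oriented software engineer who joined 3 months ago",
        "Understand parental leave policy and plan maternity leave in 6 months",
        "Sarah needs eligibility, duration, pay, and application details, and "
        "wants her manager kept out of the loop for now",
        "Complete policy info delivered AND confidentiality addressed AND "
        "application timeline understood",
    ])
```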
Custom Persona Focus
Steer scenario generation toward specific functionalities by providing a focus description:
"Focus on employees that need leave requests longer then 2 months."
Knowledge Base Integration [Coming Soon...]
This upcoming feature will let you upload your knowledge base to create even more tailored and grounded personas that reflect your actual use cases and interactions. Stay tuned!
Ready to move beyond single-turn testing? Book a demo with us: Galtea Demo
Multi-turn conversations are the future of AI interactions. Don’t let single-turn testing leave you blind to the failures that matter most to your users.