
Understanding Results

After a test run completes, Mibo gives you a clear picture of how your AI system performed — from high-level quality scores down to individual test case details.

Your project dashboard shows three key numbers:

Overall Score. A percentage rating of how well your system performed across all tests. Higher is better. Track this number over time to see whether quality is improving or regressing after updates.

Determinism. How consistently your system gives the same answer to the same input. A high determinism score means your system is predictable and reliable. A low score means users might get different answers to the same question — which can be confusing and erode trust.
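One simple way to quantify determinism — a sketch, not necessarily the formula Mibo uses — is to run the same input several times and take the share of responses that match the most common answer:

```python
from collections import Counter

def determinism_score(responses: list[str]) -> float:
    """Fraction of responses matching the most common answer.

    1.0 means every run produced the same output; lower values
    mean the system varies across identical inputs.
    """
    if not responses:
        return 0.0
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / len(responses)

# Example: 4 of 5 runs agree on the same answer.
print(determinism_score(["A", "A", "B", "A", "A"]))  # 0.8
```

Exact string matching is the strictest possible comparison; in practice a judge may treat semantically equivalent answers as the same.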

Hallucination Rate. How often your system makes things up or provides incorrect information. This is critical for systems that handle factual queries — a high hallucination rate means it’s inventing answers instead of admitting it doesn’t know.
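The metric itself is just a ratio. As a sketch, assuming per-test results exported with a hallucination flag (the record shape here is hypothetical; Mibo computes this for you):

```python
def hallucination_rate(results: list[dict]) -> float:
    """Percentage of test cases flagged as hallucinated."""
    if not results:
        return 0.0
    flagged = sum(1 for r in results if r["hallucinated"])
    return 100 * flagged / len(results)

# Hypothetical exported results: one hallucination in four cases.
results = [
    {"test": "faq-01", "hallucinated": False},
    {"test": "faq-02", "hallucinated": True},
    {"test": "faq-03", "hallucinated": False},
    {"test": "faq-04", "hallucinated": False},
]
print(hallucination_rate(results))  # 25.0
```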

The Failure Matrix is one of Mibo’s most powerful features. It shows you not just which tests failed, but where in the process things went wrong.

Every response goes through a pipeline of stages. When something fails, the matrix tells you which stage broke:

Routing failures. What it means: The system chose the wrong path.

For example, a billing question was sent to the technical support flow, or the agent tried to book a flight when the user asked about hotel availability.

Parameter failures. What it means: The system called the right tool, but with the wrong inputs.

For example, it searched for the right product but in the wrong category, or it booked a reservation for the wrong date because it misunderstood the user’s input.

Tool selection failures. What it means: The system used the wrong tool entirely.

For example, it used “cancel order” instead of “check order status,” or it triggered a payment when the user was just asking about pricing.

Communication failures. What it means: The system had the right information but communicated it poorly.

For example, the answer was technically correct but confusing, incomplete, or in the wrong tone for your brand.
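When triaging, a per-stage tally is often all you need to find the weakest link. A sketch, assuming failure records exported with a stage field (the record shape and stage names below are illustrative, not Mibo's export format):

```python
from collections import Counter

# Hypothetical failure records from a test run.
failures = [
    {"test": "billing-01", "stage": "routing"},
    {"test": "search-07", "stage": "parameters"},
    {"test": "search-09", "stage": "parameters"},
    {"test": "refund-02", "stage": "communication"},
]

# Tally failures per pipeline stage to find the biggest problem area.
by_stage = Counter(f["stage"] for f in failures)
worst_stage, count = by_stage.most_common(1)[0]
print(f"Most failures in: {worst_stage} ({count})")
# Most failures in: parameters (2)
```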

For each test case, Mibo’s AI Judge provides three scores:

Faithfulness. Did the system stick to the facts? A high faithfulness score means the response was accurate and grounded in real information. A low score indicates it may be making things up.

  • Green (90%+): Excellent — factually solid.
  • Yellow (70-89%): Needs attention — some inaccuracies.
  • Red (below 70%): Critical — significant factual issues.

Relevancy. Was the response on-topic and useful? A high relevancy score means the system addressed what the user actually asked. A low score means it went off-topic or provided irrelevant information.

  • Green (90%+): Spot on.
  • Yellow (70-89%): Partially relevant.
  • Red (below 70%): Off-topic.

Tone. Was the response appropriate in style and language? This checks whether the system matches the tone you expect — professional, friendly, formal, or whatever fits your brand.

  • Green (90%+): Matches expected tone.
  • Yellow (70-89%): Slightly off.
  • Red (below 70%): Significantly mismatched.
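All three scores share the same color bands, so a single threshold function captures the mapping described above:

```python
def score_band(score: float) -> str:
    """Map a 0-100 judge score to its color band."""
    if score >= 90:
        return "green"   # excellent
    if score >= 70:
        return "yellow"  # needs attention
    return "red"         # critical

print(score_band(93))  # green
print(score_band(75))  # yellow
print(score_band(42))  # red
```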

Click on any test case to see its detailed results:

  • Pass/fail status — did it meet all the expected behaviors?
  • The input — what Mibo sent to your system.
  • The response — what your system answered.
  • Each check result — a breakdown of every rule-based and AI-powered check, showing which passed and which failed.
  • The AI Judge’s reasoning — a written explanation of why the response was scored the way it was.

Here’s a practical approach to working with your results:

  1. Check the Failure Matrix first. Look for the stage with the most failures — that’s your biggest area for improvement.
  2. Review the Hallucination Rate. If it’s high, your system likely needs better source data or stricter grounding.
  3. Compare Determinism across runs. If determinism drops after an update, your changes may have introduced instability.
  4. Dig into individual failures. Open failing test cases to see exactly what went wrong and use that to guide your fixes.
  5. Re-run after changes. After updating your system, run the same tests again to verify your fixes worked.
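Steps 2 and 3 above amount to a simple between-run comparison. A minimal sketch, assuming you record the headline metrics from each run yourself (the metric names and tolerance are illustrative):

```python
def check_regression(previous: dict, current: dict,
                     tolerance: float = 2.0) -> list[str]:
    """Warn about metrics that moved the wrong way between runs.

    Flags changes larger than `tolerance` percentage points.
    """
    warnings = []
    # Higher is better for these two metrics.
    for metric in ("overall_score", "determinism"):
        if previous[metric] - current[metric] > tolerance:
            warnings.append(f"{metric} dropped: "
                            f"{previous[metric]} -> {current[metric]}")
    # Lower is better for hallucination rate.
    delta = current["hallucination_rate"] - previous["hallucination_rate"]
    if delta > tolerance:
        warnings.append(f"hallucination_rate rose: "
                        f"{previous['hallucination_rate']} -> "
                        f"{current['hallucination_rate']}")
    return warnings

prev = {"overall_score": 88, "determinism": 95, "hallucination_rate": 4}
curr = {"overall_score": 84, "determinism": 96, "hallucination_rate": 5}
print(check_regression(prev, curr))
# ['overall_score dropped: 88 -> 84']
```

Wiring a check like this into CI turns step 5 (re-run after changes) into an automatic gate rather than a manual dashboard review.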