LLM Evaluation

Validating LLMs to a model-risk standard

I evaluate and validate large-language-model outputs for truthfulness, robustness, bias, and strict instruction-following - treating model-output review with the same rigour as model-risk validation, aligned to FINMA Guidance 08/2024 and the EU AI Act.

Approach

Methodology

1
Define rule-based evaluation guidelines
Turn quality expectations - truthfulness, robustness, bias, conciseness, and strict instruction-following - into explicit, rule-based guidelines mapped to FINMA 08/2024 and EU AI Act model-risk and explainability expectations.
2
Score outputs and calibrate judgments
Evaluate model outputs against the rubric and calibrate scoring to improve human–AI alignment, explainability of judgments, and reproducibility of reviews across evaluators.
3
Document for audit-readiness
Author evaluation documentation and refine evaluation frameworks so findings are traceable and defensible - a second-line, three-lines-of-defence validation mindset carried over from regulated GxP / ICH-GCP / 21 CFR Part 11 work.
4
Monitor and govern over time
Track regressions across model and prompt versions and translate governance rules into automated, validated checks so changes are evidenced rather than assumed.

In Practice

Code Snippets

rubric_eval.pypython

RUBRIC = ["truthfulness", "robustness", "bias", "instruction_following"]


def evaluate(model, dataset, judge):
    """Score model outputs against the rule-based rubric for audit-ready review."""
    rows = []
    for ex in dataset:
        output = model.generate(ex.prompt)
        scores = {dim: judge(ex, output, dim) for dim in RUBRIC}
        rows.append({"id": ex.id, **scores})
    return rows

judge_prompt.txttext

You are validating an assistant answer against rule-based guidelines.
Rate 1-5 for truthfulness and instruction-following; flag any bias.
Penalise unsupported claims and reward conciseness.
Justify each score in one sentence for explainability and reproducibility.
Return JSON: {"truthfulness": N, "instruction_following": N, "bias_flag": bool, "rationale": "..."}.

Results

Case Studies

LLM output validation at Outlier

Problem: Large-language-model outputs needed rigorous, repeatable validation for truthfulness, robustness, bias, and strict instruction-following - directly analogous to model-risk review.
Approach: Evaluated outputs against rule-based guidelines, authored evaluation documentation, and refined the evaluation frameworks; built scalable multilingual (German / English) NLP data-ingestion and ETL workflows in Databricks and Azure to feed the reviews.
Result: Improved human–AI alignment, explainability of judgments, and reproducibility of reviews across the evaluation programme (Jan–Sep 2025).

Regulated validation → AI governance

Problem: Financial-services AI demands independent, auditable validation - the same discipline FINMA 08/2024 and the EU AI Act now require, long established under GxP, ICH-GCP, and 21 CFR Part 11.
Approach: Translated complex regulatory and data-governance rules into automated, validated checks; authored Data Validation Plans and SOPs and conducted independent code reviews as a second-line control.
Result: A traceable, audit-ready validation framework that maps cleanly onto model-risk, explainability, and documentation expectations for LLMs.

Methodology

Define rule-based evaluation guidelines

Score outputs and calibrate judgments

Document for audit-readiness

Monitor and govern over time

Code Snippets

Case Studies

LLM output validation at Outlier

Regulated validation → AI governance