What are Offline Evals

Offline evals offer quick, automated grading of model outputs on a fixed test set. They catch wins and regressions early, before any real users are exposed. For example, compare a new support bot's replies to gold (human-curated) answers to decide whether it is good enough to ship. Steps to do this on Statsig:
  1. Create a Prompt. This contains the prompt for your task (e.g. classify tickets as high, medium, or low urgency based on the ticket text).
  2. Upload a sample dataset with example inputs and ideal answers (e.g. Ticket 1 text, High; Ticket 2 text, Low).
  3. Run your AI on that dataset to produce outputs (e.g. classify each ticket in the example).
  4. Grade or score the outputs by comparing the ideal answer in the dataset with the output your AI generated (a minimal sketch follows this list).
  5. Create multiple versions of your prompt, compare scores across versions, and promote the best one to Live.
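
For illustration, here is a minimal Python sketch of steps 2 through 4. The dataset filename, the column names, and the `classify_ticket` stub are hypothetical placeholders for your own data and model call; none of this is Statsig SDK code.

```python
import csv

def classify_ticket(ticket_text: str) -> str:
    # Placeholder for your AI call, e.g. an LLM prompted to return High, Medium, or Low.
    return "High"

# Step 2: load a small dataset of (input, ideal answer) pairs from a CSV
# with hypothetical columns "ticket_text" and "ideal_urgency".
with open("tickets.csv", newline="") as f:
    dataset = [(row["ticket_text"], row["ideal_urgency"]) for row in csv.DictReader(f)]

# Steps 3 and 4: run the AI on each input and grade by exact match against the ideal answer.
correct = 0
for ticket_text, ideal in dataset:
    output = classify_ticket(ticket_text)
    correct += int(output.strip().lower() == ideal.strip().lower())

print(f"Score: {correct}/{len(dataset)} exact matches")
```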

Create/analyze an offline eval in 10 minutes

1. Create a Prompt within Statsig. This captures the instruction you provide to an LLM to accomplish your task. You can use the Statsig Node or Python Server Core SDKs to retrieve this prompt within your app and use it. You can create multiple versions of the prompt as you iterate, and choose which one is "live" (the version retrieved by the SDK).
2. Create a dataset you can use to evaluate LLM completions for your prompt. For a translation prompt, this might be a list of words alongside known-good French translations. Small lists can be entered by hand, or you can upload a CSV.
3. Create a grader that will grade LLM completions for your prompt. Configure a grader that compares the LLM completion text with the reference output. You can use one of the out-of-the-box string evaluators, or configure an LLM-as-a-Judge evaluator that mimics a human's grading rubric (a rough sketch of both grader styles follows this list).
4. Run the evaluation. Run an evaluation on a version of the prompt; you should see results within a few minutes. You can click into any row of the dataset to understand more about the evaluation for that row. You can also categorize your dataset and break scores out by category. If you have scores for multiple versions, you can compare them to see what changed between versions.
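
To make the grader step concrete, below is a rough Python sketch of the two grader styles mentioned above: a simple string comparison and an LLM-as-a-Judge rubric. The `ask_llm` callable and the rubric text are hypothetical stand-ins; in practice you configure these graders in the Statsig console rather than writing them yourself.

```python
def exact_match(completion: str, reference: str) -> float:
    # String grader: 1.0 if the completion matches the reference, ignoring case and whitespace.
    return float(completion.strip().lower() == reference.strip().lower())

JUDGE_RUBRIC = """You are grading a French translation.
Reference: {reference}
Candidate: {candidate}
Reply with only PASS if the candidate means the same as the reference, otherwise FAIL."""

def llm_judge(completion: str, reference: str, ask_llm) -> float:
    # LLM-as-a-Judge grader: `ask_llm` is any callable you supply that sends a prompt
    # to your LLM provider and returns the reply text.
    reply = ask_llm(JUDGE_RUBRIC.format(reference=reference, candidate=completion))
    return float(reply.strip().upper() == "PASS")

# Usage of the string grader:
print(exact_match("Bonjour", " bonjour"))  # 1.0
```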