AI Evals are currently in beta; reach out in Slack to get access.
What are AI Evals?
Statsig AI Evals have a few core components to help you iterate on and serve your LLM apps in production.
- Prompts: A Prompt represents your LLM prompt along with its associated LLM config (model provider, model, temperature, etc.). It typically corresponds to a task you are asking the LLM to do (e.g. "Classify this ticket to a triage queue" or "Summarize this text"). You can version prompts, choose in the Statsig console which version is currently live, and retrieve and use that live version in production via the Statsig server SDKs (see the first sketch after this list). You can also use Prompts as the control plane for your LLM apps without adopting the rest of the Evals product suite.
- Offline Evals: Offline evals give you quick, automated grading of model outputs on a fixed test set, catching wins and regressions early, before any real users are exposed. For example, compare a new support bot's replies against gold (human-curated) answers to decide whether it is good enough to ship (see the second sketch after this list). You can also grade outputs without a golden dataset, e.g. by having an LLM validate English-to-French translations.
- Online Evals: Online evals let you grade your model output in production, on real-world use cases. You run the "live" version of a prompt for users, and can also shadow-run "candidate" versions without exposing users to them (see the third sketch after this list). Grading works directly on the model output and has to work without a ground truth to compare against.
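In code, the Prompts workflow might look roughly like the sketch below, assuming the legacy Statsig Python server SDK and the OpenAI client. The `get_prompt` call and the fields on the returned object (`text`, `model`, `temperature`) are assumptions for illustration only, not the documented SDK surface; check the server SDK reference for the actual method names.

```python
from statsig import statsig
from statsig.statsig_user import StatsigUser
from openai import OpenAI

statsig.initialize("server-secret-key")
llm = OpenAI()

user = StatsigUser("user-123")

# Assumed helper: fetch whichever version of the "ticket_triage" Prompt is
# currently marked live in the Statsig console.
prompt = statsig.get_prompt(user, "ticket_triage")  # hypothetical name/signature

response = llm.chat.completions.create(
    model=prompt.model,              # e.g. "gpt-4o-mini", from the Prompt's LLM config
    temperature=prompt.temperature,  # also from the Prompt's LLM config
    messages=[
        {"role": "system", "content": prompt.text},
        {"role": "user", "content": "My card was charged twice this month."},
    ],
)
print(response.choices[0].message.content)
```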
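To make the offline-eval idea concrete, here is a minimal, self-contained sketch that runs a stand-in candidate model over a tiny golden test set and computes a pass rate with a simple exact-match grader. It illustrates the pattern only; it does not use the Statsig API, and names like `candidate_model` and `grade` are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Example:
    input: str
    gold: str  # human-curated expected answer

TEST_SET = [
    Example("Reset my password please", "account"),
    Example("I was double charged this month", "billing"),
]

def candidate_model(text: str) -> str:
    # Stand-in for a real LLM call using the candidate prompt version.
    return "billing" if "charged" in text else "account"

def grade(output: str, gold: str) -> bool:
    # Exact match works for classification tasks; free-form answers
    # typically need a model-graded rubric instead.
    return output.strip().lower() == gold.strip().lower()

results = [grade(candidate_model(ex.input), ex.gold) for ex in TEST_SET]
score = sum(results) / len(results)
print(f"Pass rate: {score:.0%}")  # gate the ship decision on this number
```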
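Finally, a minimal sketch of the online-eval pattern: serve the live prompt's output to the user, shadow-run a candidate version off the request path, and grade outputs with a reference-free grader (here a trivial length heuristic; an LLM judge is another option). All function names below are illustrative, not the Statsig SDK.

```python
from concurrent.futures import ThreadPoolExecutor

_shadow_pool = ThreadPoolExecutor(max_workers=2)

def run_prompt(version: str, user_input: str) -> str:
    # Stand-in for an LLM call using the given prompt version.
    return f"[{version}] summary of: {user_input}"

def reference_free_grade(output: str) -> float:
    # No ground truth is available online, so grade the output on its own,
    # e.g. with a heuristic or an LLM judge. Here: non-empty and not too long.
    return 1.0 if 0 < len(output) <= 500 else 0.0

def handle_request(user_input: str) -> str:
    live_output = run_prompt("live", user_input)

    # Shadow-run the candidate prompt; its output is only graded, never returned.
    _shadow_pool.submit(lambda: reference_free_grade(run_prompt("candidate", user_input)))

    print("live score:", reference_free_grade(live_output))
    return live_output

if __name__ == "__main__":
    print(handle_request("My order arrived damaged, can I get a refund?"))
```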