Handle Vertex TokenLimits error in ActiveContext
What does this MR do and why?
As part of Semantic search: chat with your codebase (&16910), the `Ai::ActiveContext::BulkProcessWorker` generates embeddings for code snippets using `Gitlab::Llm::VertexAi::Embeddings::Text`. The embeddings generation is done as a bulk request, meaning that a single embeddings generation request can contain multiple inputs.
This can result in a "token limits exceeded" error in the call to the Vertex text embedding model.
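For a rough sense of why a bulk request overflows the limit, here is an illustrative back-of-the-envelope calculation. All numbers below are assumptions for illustration only, not Vertex AI's actual limits:

```ruby
# All values here are assumed for illustration; they are NOT the real limits.
token_limit_per_request = 20_000
avg_tokens_per_snippet  = 200
batch_size              = 250

total_tokens = avg_tokens_per_snippet * batch_size
puts total_tokens                            # => 50000
puts total_tokens > token_limit_per_request  # => true

# A safe batch size derived from the average token count (solution step 1):
safe_batch = token_limit_per_request / avg_tokens_per_snippet
puts safe_batch # => 100
```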
Call stack
1. `Ai::ActiveContext::BulkProcessWorker` runs every minute and processes jobs in any `ActiveContext` queue. For this specific scenario, we are concerned with the `Ai::ActiveContext::Queues::Code` queue.
2. The `BulkProcessWorker` processes a batch of `Ai::ActiveContext::References::Code`.
3. `Ai::ActiveContext::References::Code` calls `ActiveContext::Preprocessors::Embeddings::apply_embeddings`.
4. `ActiveContext::Preprocessors::Embeddings` calls `Ai::ActiveContext::Embeddings::Code::VertexText.generate_embeddings`. (It determines the correct embeddings generation class to call via `Ai::ActiveContext::Collections::Code::MODELS`.)
5. `Ai::ActiveContext::Embeddings::Code::VertexText.generate_embeddings` calls `Gitlab::Llm::VertexAi::Embeddings::Text#execute`.
Solution
This problem has a two-pronged solution:
1. Calculate the average token count of the code snippets, and set a batch size for processing to make sure we do not exceed token limits.
2. Token count calculation is not fully accurate, so we still need to handle the "token limits exceeded" error by:
   1. Having `Gitlab::Llm::VertexAi::Embeddings::Text` raise a specific error class for "token limits exceeded".
   2. In `Ai::ActiveContext::Embeddings::Code::VertexText.generate_embeddings`, catching the "token limits exceeded" error class and retrying the call to `Gitlab::Llm::VertexAi::Embeddings::Text` with a smaller batch.

This MR specifically addresses Step 2-2.
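The retry-with-a-smaller-batch behaviour can be sketched as follows. This is a simplified illustration, not the actual implementation: `TokenLimitExceededError`, the token limit, and the fake `embed` call all stand in for the real `Gitlab::Llm::VertexAi` classes.

```ruby
# Hypothetical sketch of Step 2-2: halve the batch on a token-limit error.
class TokenLimitExceededError < StandardError; end

TOKEN_LIMIT = 100 # assumed limit, for illustration only

# Fake embedding call: raises when the batch's total token count is too large.
# The real code calls Gitlab::Llm::VertexAi::Embeddings::Text#execute instead.
def embed(batch)
  total_tokens = batch.sum { |s| s.split.size }
  raise TokenLimitExceededError if total_tokens > TOKEN_LIMIT

  batch.map { |s| [s.split.size.to_f] } # one stand-in "embedding" per input
end

# Recursively split the batch in half until each half fits within the limit.
def generate_embeddings(contents)
  embed(contents)
rescue TokenLimitExceededError
  raise if contents.size <= 1 # a single input that still exceeds the limit

  mid = contents.size / 2
  generate_embeddings(contents[0...mid]) + generate_embeddings(contents[mid..])
end

inputs = ["a b c"] * 60 # 180 tokens total, over the assumed 100-token limit
embeddings = generate_embeddings(inputs)
puts embeddings.length # => 60, one embedding per input, order preserved
```

Concatenating the two halves' results preserves input order, which is why the caller can still zip embeddings back onto the original `contents`.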
References
Screenshots or screen recordings
N/A
How to set up and validate locally
Option 1: Validate through the BulkProcessWorker for Ai::ActiveContext::References::Code
You can use this validation if you have a ready list of refs / Elasticsearch docs that follow the schema detailed in this migration.
# ref_ids should be a large batch, so that you're sure the _total tokens count_ exceeds Vertex AI's limits
ref_ids = [the,ids,of,the,elasticsearch,docs]
::Ai::ActiveContext::Collections::Code.track_refs!(routing: "1", hashes: ref_ids)
::Ai::ActiveContext::BulkProcessWorker.new.perform("Ai::ActiveContext::Queues::Code", 0)
# The call to `::Ai::ActiveContext::BulkProcessWorker` should not result in any logged ERROR or WARNING.
Option 2: Validate by directly calling Ai::ActiveContext::Embeddings::Code::VertexText.generate_embeddings
# Create a single long input, then replicate it 250 times (the batch size limit)
# so the total token count is very large
str = (["The quick brown fox jumps over the lazy dog"] * 50).join("\n")
contents = [str] * 250
# First, let's test on the Gitlab::Llm::VertexAi::Embeddings::Text
# This is the LLM class, and it should result in a TokenLimitExceededError
generate_embeddings = Gitlab::Llm::VertexAi::Embeddings::Text.new(
contents,
user: User.first,
tracking_context: { action: 'embedding' },
unit_primitive: ::Ai::ActiveContext::References::Code::UNIT_PRIMITIVE,
model: 'text-embedding-005',
).execute
# Now, let's test the same input on Ai::ActiveContext::Embeddings::Code::VertexText.
# This is where the token limit exceeded error is handled, with the `contents` input
# being halved recursively until the total token counts of each contents batch is within limits.
# This should not result in an error
results = Ai::ActiveContext::Embeddings::Code::VertexText.generate_embeddings(
contents,
unit_primitive: ::Ai::ActiveContext::References::Code::UNIT_PRIMITIVE,
model: 'text-embedding-005',
user: User.first
)
# Check results length is the same as the length of the original `contents` input
results.length
=> 250
# Check that each `results` element is an array of vector embeddings:
results.all?(Array)
=> true
results.all? { |ems| ems.all?(Float) }
=> true
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #551002 (closed)