Improve retry mechanism when objects are stale for pipeline cancellation

What does this MR do and why?

Related to #382065 (closed) - Closing a merge train MR doesn't cancel pipelines that contain child pipelines

This code improves the reliability of canceling CI/CD pipelines by adding better error handling and retry logic.

The main changes include:

  1. Improved retry mechanism around the entire service - When canceling pipelines fails due to database conflicts (multiple processes trying to update the same record simultaneously), the system now retries the entire cancellation process up to 3 times instead of only retrying the individual pipeline.
  • Before this MR: We used retry_lock around each pipeline within the service. retry_lock wraps each pipeline's update work in a transaction, so that if 3 jobs fail to update, it rolls back the whole pipeline and stops executing the service. This could leave child pipelines un-canceled.

  • After this MR: We use retry_lock around each job, because that shortens the transaction and lets each job retry individually if another job update is in progress. If the StaleObjectError still bubbles up, we retry the entire cancellation process, because that allows us to make further progress and try to update the child pipelines. We can't use retry_lock for the full service retry in this case because it opens a transaction around all the work. That transaction would prevent us from making progress on the retry. It would also prevent us from starting new workers to cancel children, since we can't enqueue workers in a transaction.

  2. Added validation before canceling - The system now verifies that a job is actually cancelable before attempting to cancel it, preventing state machine errors.

These changes make the pipeline cancellation process more robust by handling race conditions and conflicts that can occur when multiple users or automated systems try to update the same jobs simultaneously. The result is fewer failed cancellation attempts and more reliable cleanup of running build processes.
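
The two levels of retry and the cancelable check can be illustrated with a minimal, Rails-free sketch. It is not the code in this MR: FakeJob, FakePipeline, with_retries, cancel_jobs and cancel_pipeline are hypothetical names, StaleObjectError here is a local stand-in for ActiveRecord::StaleObjectError, and children are canceled by plain recursion rather than by enqueuing workers.

```ruby
# Local stand-in so the sketch runs without Rails (assumption, not the real class).
class StaleObjectError < StandardError; end

# Hypothetical job. `stale_failures` is how many cancel attempts hit a lock conflict.
class FakeJob
  attr_reader :status

  def initialize(status: :running, stale_failures: 0)
    @status = status
    @stale_failures = stale_failures
  end

  # Only jobs that are still active may be canceled.
  def cancelable?
    %i[running pending].include?(@status)
  end

  def cancel!
    if @stale_failures > 0
      @stale_failures -= 1
      raise StaleObjectError
    end
    @status = :canceled
  end
end

# Hypothetical pipeline: a list of jobs plus a list of child pipelines.
FakePipeline = Struct.new(:jobs, :children)

# Run the block up to `attempts` times, re-raising once attempts run out.
def with_retries(attempts)
  tries = 0
  begin
    tries += 1
    yield
  rescue StaleObjectError
    retry if tries < attempts
    raise
  end
end

# Inner level: each job gets its own short retry scope.
def cancel_jobs(pipeline)
  pipeline.jobs.each do |job|
    with_retries(3) do
      # Validate first, so already-finished jobs are skipped on a re-run.
      job.cancel! if job.cancelable?
    end
  end
  pipeline.children.each { |child| cancel_jobs(child) }
end

# Outer level: if a conflict still bubbles up, re-run the whole cancellation.
def cancel_pipeline(pipeline, service_attempts: 3)
  with_retries(service_attempts) { cancel_jobs(pipeline) }
  true
rescue StaleObjectError
  false
end
```

Because each outer retry re-runs the whole cancellation, the cancelable check is what keeps a second pass from trying to cancel jobs that the first pass already canceled.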

Logs

The logs show that Ci::CancelPipelineService can trigger the StaleObjectError:

Screenshot_2025-08-04_at_2.01.30_PM

Edited by Allison Browne
