[Slow Job] AuthorizedProjectsWorker

AuthorizedProjectsWorker jobs often take 10-47+ seconds to complete, far exceeding our 10 second target and causing Sidekiq Apdex violations. Those violations are the second-largest contributor to our failing error budget.

Investigation Query: json.class:"AuthorizedProjectsWorker" AND json.job_status: "done" AND json.duration_s > 10

Sample execution times from production logs:

  • 47.7s - User 28643938 (SAML group sync)
  • 47.2s - User 10628100 (SAML group sync)
  • 32.9s - User 19424578 (SAML group sync)
  • 18.7s - User 28644009 (SAML group sync)
  • 14.0s - User 21729670 (Group member update)
  • 10.7s - User 20218997 (SCIM user deletion)
  • 10.1s - User 14639539 (Group member removal)

Four of the seven slowest jobs were triggered by SAML group sync operations (Groups::OmniauthCallbacksController#group_saml).

The Elasticsearch logs show where the time is being spent:

Example 1 -> 47.2s job (User 10628100):

"json.duration_s": [47.164967],           // Total job time: 47.2s
"json.db_duration_s": [35.65518],         // Database time: 35.7s
"json.db_main_count": [586],              // 586 main database queries
"json.mem_bytes": [258875260],            // Memory: 259MB

35.7s ÷ 47.2s = 76% of total time spent in database

Example 2 -> 47.7s job (User 28643938):

"json.duration_s": [47.661427],           // Total job time: 47.7s
"json.db_duration_s": [38.7763],          // Database time: 38.8s
"json.db_main_count": [612],              // 612 main database queries  
"json.mem_bytes": [348452733],            // Memory: 348MB

38.8s ÷ 47.7s = 81% of total time spent in database

Example 3 -> 32.9s job (User 19424578):

"json.duration_s": [32.92141],            // Total job time: 32.9s  
"json.db_duration_s": [25.465672],        // Database time: 25.5s
"json.db_main_count": [585],              // 585 main database queries
"json.mem_bytes": [276107060],            // Memory: 276MB

25.5s ÷ 32.9s = 77% of total time spent in database

Across these samples, 76-81% of execution time is spent in database operations, so the bottleneck isn't application code, it's the database queries themselves.
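
As a quick sanity check, the database share can be recomputed directly from the structured log fields. A minimal Ruby sketch (the numbers are copied from the log excerpts above; the script is just illustrative arithmetic, not part of the codebase):

# Illustrative only: recompute the DB share from the log excerpts above.
samples = [
  { user_id: 10628100, duration_s: 47.164967, db_duration_s: 35.65518 },
  { user_id: 28643938, duration_s: 47.661427, db_duration_s: 38.7763 },
  { user_id: 19424578, duration_s: 32.92141,  db_duration_s: 25.465672 }
]

samples.each do |s|
  # Share of total job time spent waiting on the database
  share = (s[:db_duration_s] / s[:duration_s] * 100).round
  puts "user #{s[:user_id]}: #{share}% of job time in the database"
end
# => 76%, 81%, 77% respectively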

Following the execution flow:

  • AuthorizedProjectsWorker is a simple wrapper that calls Users::RefreshAuthorizedProjectsService (a simplified sketch follows the code excerpt below)
  • Users::RefreshAuthorizedProjectsService calls AuthorizedProjectUpdate::FindRecordsDueForRefreshService
  • Inside AuthorizedProjectUpdate::FindRecordsDueForRefreshService we have:
def fresh_access_levels_per_project
  fresh_authorizations.each_with_object({}) do |row, hash|
    hash[row.project_id] = row.access_level
  end
end

def fresh_authorizations
  Gitlab::ProjectAuthorizations.new(user).calculate
end
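
For context on the entry point, the worker itself is only a thin wrapper around the service, roughly this shape (a simplified sketch of the flow described above, not the exact source):

# Simplified sketch of the worker described above (not the exact source)
class AuthorizedProjectsWorker
  include ApplicationWorker

  # user_id comes from whatever triggered the refresh
  # (SAML group sync, member update, SCIM deletion, ...)
  def perform(user_id)
    user = User.find_by(id: user_id)
    return unless user

    # Delegates all the heavy lifting to the service
    Users::RefreshAuthorizedProjectsService.new(user).execute
  end
end

So the worker adds essentially no overhead of its own; the cost sits in the service and, further down, in the authorization calculation.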

So the slow part is calculating the fresh authorizations, not comparing or updating them. Looking at Gitlab::ProjectAuthorizations#calculate:

  • This method builds and executes a complex recursive SQL query that:
    • Starts with user's direct group memberships
    • Recursively finds all nested subgroups
    • Includes shared groups via GroupGroupLink relationships
    • Calculates effective access levels (for example, resolving which level applies when a user is a Developer in a parent group but a Maintainer in a child group; see the sketch after this list)
    • Unions 4 separate queries for different authorization sources:
      • Direct project memberships
      • Personal projects
      • Projects from group memberships
      • Projects shared with my groups
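
To make the last two points concrete, here is a minimal Ruby sketch of the reduction the query performs: union the four authorization sources and keep the highest access level per project. This is illustrative only; the real Gitlab::ProjectAuthorizations#calculate does this inside a single SQL query, and the sample data below is made up:

# Illustrative sketch only: the real calculation happens in a single SQL query.
# Each source yields [project_id, access_level] pairs.
# GitLab access levels: 20 Reporter, 30 Developer, 40 Maintainer, 50 Owner.
def effective_access_levels(sources)
  sources.flatten(1).each_with_object({}) do |(project_id, access_level), acc|
    # Keep the highest access level seen for each project
    acc[project_id] = [acc[project_id] || 0, access_level].max
  end
end

direct_memberships = [[101, 30]]            # Developer via direct membership
personal_projects  = [[102, 50]]            # Owner of a personal project
group_projects     = [[101, 40], [103, 20]] # Maintainer via a child group, Reporter elsewhere
shared_projects    = [[103, 30]]            # Developer via a group-to-group share

effective_access_levels([direct_memberships, personal_projects, group_projects, shared_projects])
# => {101=>40, 102=>50, 103=>30}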