[Slow Job] AuthorizedProjectsWorker
AuthorizedProjectsWorker
jobs are often taking 10-47+ seconds to complete which significantly exceeds our 10 second target and is contributing to Sidekiq Apdex violations. This is a big contributor to our failing error budget because the Sidekiq Apdex violations are the second-largest contributor to our error budget failures.
Investigation Query:
json.class:"AuthorizedProjectsWorker" AND json.job_status: "done" AND json.duration_s > 10
Sample execution times from production logs:
- 47.7s - User 28643938 (SAML group sync)
- 47.2s - User 10628100 (SAML group sync)
- 32.9s - User 19424578 (SAML group sync)
- 18.7s - User 28644009 (SAML group sync)
- 14.0s - User 21729670 (Group member update)
- 10.7s - User 20218997 (SCIM user deletion)
- 10.1s - User 14639539 (Group member removal)
So we are seeing that 4 out of 7 slowest jobs were triggered by SAML group sync operations (Groups::OmniauthCallbacksController#group_saml).
By looking at the Elasticsearch logs we can see where time was being spent:
Example 1 -> 47.2s job (User 10628100):
"json.duration_s": [47.164967], // Total job time: 47.2s
"json.db_duration_s": [35.65518], // Database time: 35.7s
"json.db_main_count": [586], // 586 main database queries
"json.mem_bytes": [258875260], // Memory: 259MB
35.7s ÷ 47.2s = 76% of total time spent in database
Example 2 -> 47.7s job (User 28643938):
"json.duration_s": [47.661427], // Total job time: 47.7s
"json.db_duration_s": [38.7763], // Database time: 38.8s
"json.db_main_count": [612], // 612 main database queries
"json.mem_bytes": [348452733], // Memory: 348MB
38.8s ÷ 47.7s = 81% of total time spent in database
Example 3 -> 32.9s job (User 19424578):
"json.duration_s": [32.92141], // Total job time: 32.9s
"json.db_duration_s": [25.465672], // Database time: 25.5s
"json.db_main_count": [585], // 585 main database queries
"json.mem_bytes": [276107060], // Memory: 276MB
25.5s ÷ 32.9s = 77% of total time spent in database
So what we see is that 75-80% of execution time is database operations. So we know that the bottleneck isn't in the application code it's database operations.
Following the execution flow:
- AuthorizedProjectsWorker is a simple wrapper that calls
Users::RefreshAuthorizedProjectsService
-
Users::RefreshAuthorizedProjectsService
calls `AuthorizedProjectUpdate::FindRecordsDueForRefreshService - Inside
AuthorizedProjectUpdate::FindRecordsDueForRefreshService
we have
def fresh_access_levels_per_project
fresh_authorizations.each_with_object({}) do |row, hash|
hash[row.project_id] = row.access_level
end
end
def fresh_authorizations
Gitlab::ProjectAuthorizations.new(user).calculate
end
So what's slow here is calculating fresh authorizations, not comparing or updating them.
Gitlab::ProjectAuthorizations#calculate
- This method builds and executes a complex recursive SQL query that:
- Starts with user's direct group memberships
- Recursively finds all nested subgroups
- Includes shared groups via GroupGroupLink relationships
- Calculates access levels (so for example if I'm a Developer in parent group but Maintainer in child group what's my access level?)
- Unions 4 separate queries for different authorization sources:
- Direct project memberships
- Personal projects
- Projects from group memberships
- Projects shared with my groups