Review and document health_check queue priority
Problem
The health_check queue currently has weight 4 (the highest priority, tied with critical customer-facing operations). Its jobs run every minute to monitor the health of external services (GitLab.com, Zuora) and to manage maintenance mode.
Questions to consider:
- Should health monitoring have the same priority as customer provisioning?
- Does every-minute execution need to compete with customer-facing operations?
- Could health checks run at slightly lower priority without impacting system reliability?
- Are there scenarios where health checks could delay critical customer operations?
Current Behavior
Jobs in queue:
- HealthCheckCron::CheckGitlabJob: runs every minute
- HealthCheckCron::CheckZuoraJob: runs every minute
What they do:
- Check if external services are reachable
- Enable maintenance mode if services are down
- Disable maintenance mode when services recover
- Pause/resume Sidekiq queues based on service health
Current priority: Weight 4 (same as gitlab, zuora, salesforce, zuora_callback)
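To make the competition between queues concrete, the sketch below estimates how often each queue would be polled first. It assumes Sidekiq's weighted queue ordering, where a queue's chance of being checked first is proportional to its weight. The five weight-4 queues are the ones named in this issue; the `default` and `low` entries and their weights are hypothetical placeholders, not the real configuration.

```ruby
# Sketch: share of "first poll" slots each queue gets under weighted ordering.
# Assumption: the chance that a queue is checked first is weight / total weight.
# The `default` and `low` entries are hypothetical; replace the hash with the
# real contents of config/sidekiq.yml before drawing conclusions.
QUEUE_WEIGHTS = {
  "health_check"   => 4,
  "gitlab"         => 4,
  "zuora"          => 4,
  "salesforce"     => 4,
  "zuora_callback" => 4,
  "default"        => 2, # hypothetical
  "low"            => 1  # hypothetical
}.freeze

# Returns a hash of queue name => fraction of first-poll slots (rounded).
def first_poll_share(weights)
  total = weights.values.sum.to_f
  weights.transform_values { |w| (w / total).round(3) }
end

shares = first_poll_share(QUEUE_WEIGHTS)
shares.each { |queue, share| puts format("%-15s %.1f%%", queue, share * 100) }
```

The share only matters while health check jobs are actually enqueued. With two short jobs per minute the queue is likely empty most of the time, which is worth confirming with the queue depth data requested under Implementation Steps.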
Options to Consider
Option 1: Keep Current Priority (Weight 4, or 8-10 on the new scale)
Rationale:
- System stability is paramount
- Quick detection of outages is critical
- Maintenance mode prevents cascading failures
- Health checks are very fast (< 1 second)
Pros:
- Fastest possible outage detection
- Immediate maintenance mode activation
- No risk of delayed health checks
Cons:
- Competes with customer-facing operations
- May not be necessary to check every minute with highest priority
- Could delay critical provisioning during high load
Option 2: Slightly Lower Priority (Weight 3, or 6-7 on the new scale)
Rationale:
- Health checks are monitoring, not customer-facing
- 1-minute frequency provides buffer for slight delays
- Still high priority, just not highest
- Allows critical customer operations to take precedence
Pros:
- Customer operations never delayed by health checks
- Still runs frequently enough for quick outage detection
- More appropriate priority for monitoring vs. operations
Cons:
- Slightly slower outage detection (seconds, not immediate)
- Could delay maintenance mode activation
- May miss brief outages
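The detection trade-off in Option 2 can be put into rough numbers. This sketch assumes worst-case detection time is the check interval plus queue wait plus job duration. The 60-second interval and the 1-second duration bound come from this issue; the queue wait values are illustrative, not measurements.

```ruby
# Sketch: worst-case time from an outage starting to a health check noticing it.
# Assumption: the outage begins right after a check has run, so a full interval
# passes before the next job is enqueued; that job then waits in the queue
# and runs to completion.
def worst_case_detection_seconds(interval:, queue_wait:, job_duration:)
  interval + queue_wait + job_duration
end

# queue_wait values are hypothetical; substitute measured queue latency.
highest = worst_case_detection_seconds(interval: 60, queue_wait: 0, job_duration: 1)
lowered = worst_case_detection_seconds(interval: 60, queue_wait: 5, job_duration: 1)

puts "Highest priority: #{highest}s"
puts "Lower priority:   #{lowered}s"
puts "Added delay:      #{lowered - highest}s"
```

Under these assumptions the schedule interval dominates the total, so a few seconds of extra queue wait changes worst-case detection only modestly. Real queue latency during peak load would be needed to confirm this.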
Option 3: Separate Critical vs. Routine Health Checks
Rationale:
- Some health checks are more critical than others
- Could have different frequencies and priorities
Structure:
- health_check_critical (weight 8): GitLab.com, Zuora (every minute)
- health_check_routine (weight 5): Other services (every 5 minutes)
Pros:
- Granular control over monitoring priorities
- Can optimize frequency per service
- Critical services monitored with highest priority
Cons:
- More complex configuration
- May be over-engineering for current needs
- Harder to maintain
Recommendation Needed
We need input from the team on:
- Observed behavior: Have health checks ever delayed customer operations?
- Outage scenarios: How quickly do we need to detect outages?
- Maintenance mode: How critical is immediate activation?
- Job duration: Confirm health checks are consistently fast (< 1 second)
- Frequency: Is every-minute checking necessary, or could we reduce to every 2-3 minutes?
Implementation Steps
- Gather data:
  - Review health check job duration (P50, P95, P99)
  - Analyze queue depth during peak times
  - Check if health checks have ever been delayed
  - Review past outage detection times
- Discuss with team:
  - SRE perspective on monitoring priorities
  - Engineering perspective on customer impact
  - Historical incidents related to health checks
- Make decision:
  - Document rationale for chosen priority
  - Consider trade-offs between monitoring and operations
  - Align with overall queue priority strategy
- Update configuration (if needed):

      # config/sidekiq.yml (the options are alternatives; apply only one)
      :queues:
        # Option 1: Keep highest priority
        - [health_check, 10]
        # Option 2: Slightly lower
        - [health_check, 7]
        # Option 3: Split by criticality
        - [health_check_critical, 9]
        - [health_check_routine, 5]
- Monitor after change:
  - Track outage detection time
  - Monitor maintenance mode activation speed
  - Watch for any customer impact
- Document decision:
  - Why this priority was chosen
  - Trade-offs considered
  - Conditions that might warrant re-evaluation
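For the data-gathering step above, job duration percentiles could be computed along these lines. This is a minimal sketch using the nearest-rank method; the sample durations are made up, and real values would have to come from job logs or metrics.

```ruby
# Sketch: nearest-rank percentiles over job durations in seconds.
# The sample values below are invented for illustration only.
def percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  # Nearest-rank: the smallest value with at least pct% of samples at or below it.
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

durations = [0.21, 0.25, 0.19, 0.30, 0.22, 0.95, 0.24, 0.27, 0.23, 1.40]

summary = [50, 95, 99].to_h { |p| ["P#{p}", percentile(durations, p)] }
puts summary
```

With only a handful of samples, P95 and P99 both collapse to the maximum value, so a meaningful tail estimate needs a much larger sample, for example a week of every-minute runs.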
Success Criteria
- Clear understanding of health check priority requirements
- Documented rationale for chosen priority
- No degradation in outage detection or system reliability
- Appropriate balance between monitoring and customer operations
Related
- Parent epic: &19587
- Related: #14268 (closed) (weight granularity)
- Related: #14271 (job duration analysis)