Review and document health_check queue priority

Problem

The health_check queue currently has weight 4 (highest priority, tied with critical customer-facing operations). Its jobs run every minute to monitor external service health (GitLab.com, Zuora) and manage maintenance mode.

Questions to consider:

Should health monitoring have the same priority as customer provisioning?
Does every-minute execution need to compete with customer-facing operations?
Could health checks run at slightly lower priority without impacting system reliability?
Are there scenarios where health checks could delay critical customer operations?

Current Behavior

Jobs in queue:

HealthCheckCron::CheckGitlabJob - Runs every minute
HealthCheckCron::CheckZuoraJob - Runs every minute

What they do:

Check if external services are reachable
Enable maintenance mode if services are down
Disable maintenance mode when services recover
Pause/resume Sidekiq queues based on service health
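
The behavior listed above can be sketched in plain Ruby. This is an illustration only, not the actual HealthCheckCron implementation: the HealthState class, the service-to-queue mapping, and the rule that maintenance mode stays on while any service is down are all assumptions, and real code would call the maintenance mode and Sidekiq pause mechanisms instead of updating in-memory sets.

```ruby
require "set"

# Hypothetical in-memory stand-in for maintenance mode and paused queues.
class HealthState
  attr_reader :down_services, :paused_queues

  def initialize
    @down_services = Set.new
    @paused_queues = Set.new
  end

  # Assumption: maintenance mode is on while at least one service is down.
  def maintenance?
    !@down_services.empty?
  end

  # Record the result of one health check.
  # service: name of the external service that was probed
  # reachable: true when the probe succeeded
  # dependent_queues: queues that cannot do useful work while the service is down
  #   (assumed not to be shared between services)
  def record(service, reachable, dependent_queues)
    if reachable
      @down_services.delete(service)
      @paused_queues.subtract(dependent_queues)
    else
      @down_services.add(service)
      @paused_queues.merge(dependent_queues)
    end
    maintenance?
  end
end

state = HealthState.new
state.record(:zuora, false, %w[zuora zuora_callback]) # Zuora is down
state.record(:zuora, true, %w[zuora zuora_callback])  # Zuora recovered
```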

Current priority: Weight 4 (same as gitlab, zuora, salesforce, zuora_callback)
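
To make the weight concrete, here is a rough sketch of what it means under contention. It assumes Sidekiq's weighted mode gives each queue a chance of being checked first equal to its weight divided by the sum of all weights. The five weight-4 queues come from this issue; the "default" and "mailers" queues and their weights are hypothetical placeholders for whatever else is configured.

```ruby
# Fraction of fetches in which each queue is checked first,
# assuming first-pick probability = weight / sum of all weights.
def first_pick_share(weights)
  total = weights.values.sum.to_f
  weights.transform_values { |w| (w / total).round(3) }
end

# The five weight-4 queues are named in this issue; "default" and "mailers"
# (and their weights) are hypothetical placeholders for the remaining queues.
CURRENT_WEIGHTS = {
  "health_check" => 4, "gitlab" => 4, "zuora" => 4,
  "salesforce" => 4, "zuora_callback" => 4,
  "default" => 2, "mailers" => 1
}.freeze

current = first_pick_share(CURRENT_WEIGHTS)
lowered = first_pick_share(CURRENT_WEIGHTS.merge("health_check" => 3))

puts "health_check first-pick share at weight 4: #{current['health_check']}"
puts "health_check first-pick share at weight 3: #{lowered['health_check']}"
```

The share only matters when several queues have work waiting at the same time. On an otherwise idle system a health check is picked up right away regardless of its weight.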

Options to Consider

Option 1: Keep Current Priority (Weight 4 or 8-10 in new scale)

Rationale:

System stability is paramount
Quick detection of outages is critical
Maintenance mode prevents cascading failures
Health checks are very fast (< 1 second)

Pros:

Fastest possible outage detection
Immediate maintenance mode activation
Lowest risk of delayed health checks

Cons:

Competes with customer-facing operations
May not be necessary to check every minute with highest priority
Could delay critical provisioning during high load

Option 2: Slightly Lower Priority (Weight 3 or 6-7 in new scale)

Rationale:

Health checks are monitoring, not customer-facing
1-minute frequency provides buffer for slight delays
Still high priority, just not highest
Allows critical customer operations to take precedence

Pros:

Customer operations less likely to be delayed by health checks
Still runs frequently enough for quick outage detection
More appropriate priority for monitoring vs. operations

Cons:

Slightly slower outage detection (seconds, not immediate)
Could delay maintenance mode activation
May miss brief outages
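
The "seconds, not immediate" trade-off can be estimated with simple arithmetic. This is a sketch, assuming worst-case detection time is the wait for the next scheduled run, plus the time the job sits in the queue, plus the job's own duration. The 5-second queue wait is an illustrative number, not a measurement.

```ruby
# Worst-case seconds between an outage starting and a health check noticing it.
# interval: seconds between scheduled runs
# queue_wait: seconds the job waits in the queue before a worker picks it up
# job_duration: seconds the check itself takes
def worst_case_detection(interval:, queue_wait:, job_duration:)
  interval + queue_wait + job_duration
end

highest_priority = worst_case_detection(interval: 60, queue_wait: 0, job_duration: 1)
lower_priority   = worst_case_detection(interval: 60, queue_wait: 5, job_duration: 1)
every_three_min  = worst_case_detection(interval: 180, queue_wait: 0, job_duration: 1)

puts "highest priority, every minute: #{highest_priority}s"
puts "lower priority, every minute:   #{lower_priority}s"
puts "highest priority, every 3 min:  #{every_three_min}s"
```

Under these assumptions the check interval dominates: a few seconds of queue wait moves the worst case from 61 to 66 seconds, while checking every 3 minutes moves it to 181 seconds.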

Option 3: Separate Critical vs. Routine Health Checks

Rationale:

Some health checks are more critical than others
Could have different frequencies and priorities

Structure:

health_check_critical (weight 8): GitLab.com, Zuora (every minute)
health_check_routine (weight 5): Other services (every 5 minutes)
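
One way to picture this structure is a small routing table. This is a sketch only: HealthCheckPlan, plan_for, and due? are hypothetical names, and "salesforce" merely stands in for the unspecified "other services".

```ruby
# Hypothetical routing table for Option 3. GitLab.com and Zuora come from this
# issue; any other service name is treated as a routine check.
HealthCheckPlan = Struct.new(:queue, :interval_minutes)

CRITICAL = HealthCheckPlan.new("health_check_critical", 1).freeze
ROUTINE  = HealthCheckPlan.new("health_check_routine", 5).freeze

CRITICAL_SERVICES = %w[gitlab zuora].freeze

# Pick the queue and frequency for a service.
def plan_for(service)
  CRITICAL_SERVICES.include?(service) ? CRITICAL : ROUTINE
end

# True when a check for this service is due at the given minute of the hour.
def due?(service, minute)
  (minute % plan_for(service).interval_minutes).zero?
end

puts plan_for("zuora").queue      # critical queue, every minute
puts plan_for("salesforce").queue # routine queue, every 5 minutes
```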

Pros:

Granular control over monitoring priorities
Can optimize frequency per service
Critical services monitored with highest priority

Cons:

More complex configuration
May be over-engineering for current needs
Harder to maintain

Recommendation Needed

We need input from the team on:

Observed behavior: Have health checks ever delayed customer operations?
Outage scenarios: How quickly do we need to detect outages?
Maintenance mode: How critical is immediate activation?
Job duration: Confirm health checks are consistently fast (< 1 second)
Frequency: Is every-minute checking necessary, or could we reduce to every 2-3 minutes?

Implementation Steps

Gather data:
- Review health check job duration (P50, P95, P99)
- Analyze queue depth during peak times
- Check if health checks have ever been delayed
- Review past outage detection times
Discuss with team:
- SRE perspective on monitoring priorities
- Engineering perspective on customer impact
- Historical incidents related to health checks
Make decision:
- Document rationale for chosen priority
- Consider trade-offs between monitoring and operations
- Align with overall queue priority strategy

Update configuration (if needed):

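# config/sidekiq.yml
# Apply exactly one option. Option 1 is left active here only so the file
# stays valid; the alternatives are commented out.
:queues:
  # Option 1: Keep highest priority
  - [health_check, 10]

  # Option 2: Slightly lower
  # - [health_check, 7]

  # Option 3: Split by criticality
  # - [health_check_critical, 9]
  # - [health_check_routine, 5]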
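
Because the snippet above shows the options side by side, it is worth checking that the final file names each queue only once. A minimal sketch using Ruby's standard YAML library; duplicate_queues is a hypothetical helper, not part of Sidekiq, and the inline YAML is a made-up example of options pasted together rather than a real config file.

```ruby
require "yaml"

# Returns queue names that appear more than once in a Sidekiq-style config.
# Hypothetical helper, not part of Sidekiq.
def duplicate_queues(yaml_text)
  config = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
  # Entries are either "name" or [name, weight]; keep only the name.
  names = config.fetch(:queues).map { |entry| Array(entry).first.to_s }
  names.tally.select { |_name, count| count > 1 }.keys
end

# Made-up example: two options pasted into the same file by mistake.
pasted_as_is = <<~CONFIG
  :queues:
    - [health_check, 10]
    - [health_check, 7]
    - [health_check_routine, 5]
CONFIG

puts duplicate_queues(pasted_as_is).inspect
```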

Monitor after change:
- Track outage detection time
- Monitor maintenance mode activation speed
- Watch for any customer impact
Document decision:
- Why this priority was chosen
- Trade-offs considered
- Conditions that might warrant re-evaluation
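
For the duration review in the "Gather data" step, percentiles can be computed from exported job durations. A minimal sketch using the nearest-rank method; the sample durations are made up, and real numbers would come from whatever job metrics or logs are available.

```ruby
# Nearest-rank percentile: the smallest sample such that at least pct percent
# of all samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, "samples must not be empty" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up job durations in seconds, for illustration only.
durations = [0.21, 0.18, 0.25, 0.30, 0.19, 0.22, 0.95, 0.20, 0.24, 1.40]

p50 = percentile(durations, 50)
p95 = percentile(durations, 95)
p99 = percentile(durations, 99)
puts format("P50=%.2fs P95=%.2fs P99=%.2fs", p50, p95, p99)
```

With this made-up sample the median is fast but the slowest runs push the tail above 1 second, which is the kind of pattern the review should look for before relying on the "< 1 second" assumption.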

Success Criteria

Clear understanding of health check priority requirements
Documented rationale for chosen priority
No degradation in outage detection or system reliability
Appropriate balance between monitoring and customer operations

Parent epic: &19587
Related: #14268 (closed) (weight granularity)
Related: #14271 (job duration analysis)
