Review and document health_check queue priority

Problem

The health_check queue currently has weight 4, the highest priority, which ties it with critical customer-facing queues. Its jobs run every minute to monitor the health of external services (GitLab.com, Zuora) and to manage maintenance mode.

Questions to consider:

  1. Should health monitoring have the same priority as customer provisioning?
  2. Does every-minute execution need to compete with customer-facing operations?
  3. Could health checks run at slightly lower priority without impacting system reliability?
  4. Are there scenarios where health checks could delay critical customer operations?

Current Behavior

Jobs in queue:

  • HealthCheckCron::CheckGitlabJob - Runs every minute
  • HealthCheckCron::CheckZuoraJob - Runs every minute

What they do:

  • Check if external services are reachable
  • Enable maintenance mode if services are down
  • Disable maintenance mode when services recover
  • Pause/resume Sidekiq queues based on service health
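The flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual job code: `MaintenanceState`, `run_health_check`, the `probe` callable, and `dependent_queues` are all hypothetical names, and the state is kept in memory (one instance per monitored service) rather than wherever the real application stores it.

```ruby
# Hypothetical sketch: none of these names come from the actual codebase.
# In-memory stand-in for whatever stores the maintenance flag and paused queues.
class MaintenanceState
  attr_reader :paused_queues

  def initialize
    @enabled = false
    @paused_queues = []
  end

  def enabled?
    @enabled
  end

  def enable!(queues)
    @enabled = true
    @paused_queues = queues.dup
  end

  def disable!
    @enabled = false
    @paused_queues = []
  end
end

# Runs one health check. `probe` is any callable returning true when the
# external service is reachable; errors are treated as "unreachable".
def run_health_check(state, probe:, dependent_queues:)
  reachable = begin
    probe.call
  rescue StandardError
    false
  end

  if reachable
    state.disable! if state.enabled?
  else
    state.enable!(dependent_queues) unless state.enabled?
  end
  reachable
end

state = MaintenanceState.new
run_health_check(state, probe: -> { raise IOError, "timeout" }, dependent_queues: ["zuora"])
puts state.enabled?   # maintenance mode is on after a failed probe
run_health_check(state, probe: -> { true }, dependent_queues: ["zuora"])
puts state.enabled?   # and off again once the service recovers
```

Because each run is a single probe plus a flag update, the job itself should be short. What matters for the priority discussion is how long the job waits in the queue before it runs.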

Current priority: Weight 4 (same as gitlab, zuora, salesforce, zuora_callback)
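To make the effect of a weight concrete, the sketch below estimates how often each queue would be checked first. It assumes Sidekiq's weighted (non-strict) queue mode, where queues are ordered randomly on each fetch with a bias proportional to weight. The five weight-4 queues come from this issue. The `default` and `low` queues and their weights are hypothetical placeholders for the rest of the configuration.

```ruby
# Share of fetches in which each queue is checked first, assuming weighted
# random ordering: share = weight / sum of all weights.
def first_check_share(weights)
  total = weights.values.sum.to_f
  weights.transform_values { |w| w / total }
end

# The five weight-4 queues are from this issue; `default` and `low` and their
# weights are hypothetical placeholders for the rest of the configuration.
weights = {
  "health_check" => 4, "gitlab" => 4, "zuora" => 4,
  "salesforce" => 4, "zuora_callback" => 4,
  "default" => 2, "low" => 1
}

first_check_share(weights).each do |queue, share|
  puts format("%-15s %.1f%%", queue, share * 100)
end
```

Under this assumption a weight is a relative share rather than a strict ordering, and a queue only takes a worker's time when it actually holds a job. With two health check jobs enqueued per minute, the health_check queue is likely to be empty on most fetches.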

Options to Consider

Option 1: Keep Current Priority (weight 4, or 8-10 on the new scale)

Rationale:

  • System stability is paramount
  • Quick detection of outages is critical
  • Maintenance mode prevents cascading failures
  • Health checks are expected to be very fast (< 1 second, to be confirmed)

Pros:

  • Fastest possible outage detection
  • Immediate maintenance mode activation
  • No risk of delayed health checks

Cons:

  • Competes with customer-facing operations
  • May not be necessary to check every minute with highest priority
  • Could delay critical provisioning during high load

Option 2: Slightly Lower Priority (weight 3, or 6-7 on the new scale)

Rationale:

  • Health checks are monitoring, not customer-facing
  • 1-minute frequency provides buffer for slight delays
  • Still high priority, just not highest
  • Allows critical customer operations to take precedence

Pros:

  • Customer operations are less likely to wait behind health checks
  • Still runs frequently enough for quick outage detection
  • More appropriate priority for monitoring vs. operations

Cons:

  • Slightly slower outage detection (seconds, not immediate)
  • Could delay maintenance mode activation
  • May miss brief outages
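The detection trade-off can be estimated with simple arithmetic. The sketch below is a rough model, not a measurement: the 60-second interval and the roughly 1-second job duration come from this issue, while the queue wait values are hypothetical and should be replaced with observed queue latency.

```ruby
# Worst-case seconds between an outage starting and a check noticing it:
# the outage begins right after one check, so detection waits for the next
# scheduled run, plus time spent queued, plus the check's own duration.
def worst_case_detection_seconds(interval:, queue_wait:, job_duration:)
  interval + queue_wait + job_duration
end

# Interval (60 s) and duration (about 1 s) come from this issue; the queue
# wait values are hypothetical and should be replaced with measured latency.
[0, 5, 30].each do |wait|
  total = worst_case_detection_seconds(interval: 60, queue_wait: wait, job_duration: 1)
  puts "queue wait #{wait}s -> worst case #{total}s"
end
```

In this model the check interval dominates unless the queue wait grows large, so a lower weight adds only as much detection delay as the extra time spent queued. An outage shorter than the interval can fall between two checks at any priority.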

Option 3: Separate Critical vs. Routine Health Checks

Rationale:

  • Some health checks are more critical than others
  • Could have different frequencies and priorities

Structure:

  • health_check_critical (weight 8): GitLab.com, Zuora (every minute)
  • health_check_routine (weight 5): Other services (every 5 minutes)

Pros:

  • Granular control over monitoring priorities
  • Can optimize frequency per service
  • Critical services monitored with highest priority

Cons:

  • More complex configuration
  • May be over-engineering for current needs
  • Harder to maintain

Recommendation Needed

We need input from the team on:

  1. Observed behavior: Have health checks ever delayed customer operations?
  2. Outage scenarios: How quickly do we need to detect outages?
  3. Maintenance mode: How critical is immediate activation?
  4. Job duration: Confirm health checks are consistently fast (< 1 second)
  5. Frequency: Is every-minute checking necessary, or could we reduce to every 2-3 minutes?
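For the job duration question, percentiles are more informative than an average, because a single slow check (for example, one that hits a timeout) can hide behind many fast ones. Below is a minimal nearest-rank percentile sketch. The duration samples are hypothetical; real values would come from job logs or metrics.

```ruby
# Nearest-rank percentile: the smallest sample such that at least `pct`
# percent of samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, "samples must not be empty" if samples.empty?

  sorted = samples.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical job durations in seconds; real values would come from logs
# or metrics.
durations = [0.12, 0.15, 0.11, 0.90, 0.14, 0.13, 2.40, 0.16, 0.12, 0.18]

[50, 95, 99].each do |pct|
  puts "P#{pct}: #{percentile(durations, pct)}s"
end
```

If P99 is well above 1 second while P50 stays low, the "< 1 second" assumption holds only for the typical case, and the slow tail is what would compete with customer-facing jobs.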

Implementation Steps

  1. Gather data:

    • Review health check job duration (P50, P95, P99)
    • Analyze queue depth during peak times
    • Check if health checks have ever been delayed
    • Review past outage detection times
  2. Discuss with team:

    • SRE perspective on monitoring priorities
    • Engineering perspective on customer impact
    • Historical incidents related to health checks
  3. Make decision:

    • Document rationale for chosen priority
    • Consider trade-offs between monitoring and operations
    • Align with overall queue priority strategy
  4. Update configuration (if needed):

    # config/sidekiq.yml
    # The options are alternatives: keep exactly one uncommented.
    :queues:
      # Option 1: Keep highest priority
      - [health_check, 10]

      # Option 2: Slightly lower
      # - [health_check, 7]

      # Option 3: Split by criticality
      # - [health_check_critical, 9]
      # - [health_check_routine, 5]
  5. Monitor after change:

    • Track outage detection time
    • Monitor maintenance mode activation speed
    • Watch for any customer impact
  6. Document decision:

    • Why this priority was chosen
    • Trade-offs considered
    • Conditions that might warrant re-evaluation

Success Criteria

  • Clear understanding of health check priority requirements
  • Documented rationale for chosen priority
  • No degradation in outage detection or system reliability
  • Appropriate balance between monitoring and customer operations
