Draft: POC - Geo Troubleshooting Rake Task

What

  • This is a proof-of-concept (POC) draft of a rake task to help improve the troubleshooting experience for Geo replication issues.
  • The change introduces a module that lists known Geo errors and maps each one to its existing documented workaround.
  • This provides an easily extensible way to record known issues as they arise, and makes it easier to find documentation and existing issues with solutions.
  • This lets customers and support staff troubleshoot known issues more quickly, which should reduce support tickets.
  • This provides a shorter-term solution to the problem that the Geo Observability UI project is working to resolve.
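
A minimal sketch of how such a known-error module might look, in plain Ruby. The module name `KnownGeoErrors`, the `KnownError` struct, the `lookup` method, and the regular-expression matching strategy are all assumptions made for illustration, not the code in this merge request. The descriptions, documentation anchors, and workaround text are copied from the example output further down.

```ruby
# Hypothetical sketch: names and matching strategy are assumptions,
# not the actual module introduced by this merge request.
module KnownGeoErrors
  KnownError = Struct.new(:pattern, :description, :documentation, :workaround, keyword_init: true)

  DOCS = "https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html"

  # Recording a newly discovered issue means appending one entry here.
  ERRORS = [
    KnownError.new(
      pattern: /The file is missing on the Geo primary site/,
      description: "File exists in database but is missing on primary site's filesystem",
      documentation: "#{DOCS}#message-the-file-is-missing-on-the-geo-primary-site",
      workaround: "Clean up orphaned records or restore missing files from backup"
    ),
    KnownError.new(
      pattern: /The model which owns this Upload is missing/,
      description: "Parent record has been deleted but the associated file record still exists",
      documentation: "#{DOCS}#failed-verification-of-uploads-on-the-primary-geo-site",
      workaround: "Delete orphaned upload records using the provided script"
    )
  ].freeze

  # Returns the first entry whose pattern matches the failure message, or nil
  # when the message is not a known issue.
  def self.lookup(message)
    ERRORS.find { |error| error.pattern.match?(message) }
  end
end
```

Matching on a pattern rather than the full string is one possible design choice: it lets a single entry cover messages that embed record-specific IDs, such as "Upload ID#123, Project ID#456".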

Why

Troubleshooting Geo issues is time-consuming, and the errors are often already known issues.

In theory, if we add known issues and their workarounds to this list as they arise, we can direct people more effectively to the solution or best current workaround, and minimise the number of duplicate RFH and customer support tickets.

Example Output

=== Geo Replication Error Troubleshooting ===
Analyzing sync and verification failures across all replicable types...

Current node: US-West Secondary (Secondary)
================================================================================

Found 129 total failures across 5 categories

▶ Uploads - Sync

  Error: The file is missing on the Geo primary site
  Count: 23 affected records
  Known Issue: Yes
  Description: File exists in database but is missing on primary site's filesystem
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#message-the-file-is-missing-on-the-geo-primary-site
  Workaround: Clean up orphaned records or restore missing files from backup
  Debug: Run the following to get affected record IDs:
    Geo::UploadRegistry.where(last_sync_failure: "The file is missing on the Geo primary site").limit(10).pluck(:file_id)

--------------------------------------------------------------------------------

▶ Uploads - Verification

  Error: Error during verification: The model which owns this Upload is missing. Upload ID#123, Project ID#456
  Count: 8 affected records
  Known Issue: Yes
  Description: Parent record has been deleted but the associated file record still exists
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#failed-verification-of-uploads-on-the-primary-geo-site
  Workaround: Delete orphaned upload records using the provided script
  Debug: Run the following to get affected record IDs:
    Geo::UploadRegistry.verification_failed.where(verification_failure: "Error during verification: The model which owns this Upload is missing. Upload ID#123, Project ID#456").limit(10).pluck(:file_id)

  Error: Error during verification: undefined method `underscore' for NilClass:Class
  Count: 3 affected records
  Known Issue: Yes
  Description: Orphaned upload with missing parent model
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#failed-verification-of-uploads-on-the-primary-geo-site
  Workaround: Delete orphaned upload records using the provided script
  Debug: Run the following to get affected record IDs:
    Geo::UploadRegistry.verification_failed.where(verification_failure: "Error during verification: undefined method `underscore' for NilClass:Class").limit(10).pluck(:file_id)

--------------------------------------------------------------------------------

▶ Project Repositories - Sync

  Error: Error syncing repository: 13:fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled
  Count: 45 affected records
  Known Issue: Yes
  Description: JWT authentication failing during Git operations
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#error-error-syncing-repository-13fatal-could-not-read-username
  Workaround: Check system clock sync or upgrade to GitLab 17.1.0+, 17.0.5+, or 16.11.7+
  Debug: Run the following to get affected record IDs:
    Geo::ProjectRepositoryRegistry.where(last_sync_failure: "Error syncing repository: 13:fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled").limit(10).pluck(:project_id)

  Error: fetch remote: signal: terminated: context deadline exceeded
  Count: 12 affected records
  Known Issue: Yes
  Description: Git fetch timeout after exactly 3 hours
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#error-fetch-remote-signal-terminated-context-deadline-exceeded-at-exactly-3-hours
  Workaround: Increase gitlab_shell_git_timeout in gitlab.rb
  Debug: Run the following to get affected record IDs:
    Geo::ProjectRepositoryRegistry.where(last_sync_failure: "fetch remote: signal: terminated: context deadline exceeded").limit(10).pluck(:project_id)

  Error: Error syncing repository: fatal: 'geo' does not appear to be a git repository
  Count: 7 affected records
  Known Issue: Yes
  Description: Geo remote is missing from repository config
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#related-error-does-not-appear-to-be-a-git-repository
  Workaround: Re-sync affected repositories using the Rails console
  Debug: Run the following to get affected record IDs:
    Geo::ProjectRepositoryRegistry.where(last_sync_failure: "Error syncing repository: fatal: 'geo' does not appear to be a git repository").limit(10).pluck(:project_id)

  Error: Connection timed out - connect(2) for "geo-secondary.internal" port 22
  Count: 15 affected records
  Known Issue: No - This may be a new issue
  Action: Check logs for more details or contact GitLab support
  Debug: Run the following to get affected record IDs:
    Geo::ProjectRepositoryRegistry.where(last_sync_failure: "Connection timed out - connect(2) for \"geo-secondary.internal\" port 22").limit(10).pluck(:project_id)

--------------------------------------------------------------------------------

▶ CI Job Artifacts - Verification

  Error: Error during verification: File is not checksummable
  Count: 5 affected records
  Known Issue: Yes
  Description: File cannot be checksummed due to being missing or inaccessible
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#message-error-during-verificationerrorfile-is-not-checksummable
  Workaround: Follow instructions for cleaning up missing files on primary
  Debug: Run the following to get affected record IDs:
    Geo::JobArtifactRegistry.verification_failed.where(verification_failure: "Error during verification: File is not checksummable").limit(10).pluck(:artifact_id)

--------------------------------------------------------------------------------

▶ LFS Objects - Sync

  Error: The file is missing on the Geo primary site
  Count: 11 affected records
  Known Issue: Yes
  Description: File exists in database but is missing on primary site's filesystem
  Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#message-the-file-is-missing-on-the-geo-primary-site
  Workaround: Clean up orphaned records or restore missing files from backup
  Debug: Run the following to get affected record IDs:
    Geo::LfsObjectRegistry.where(last_sync_failure: "The file is missing on the Geo primary site").limit(10).pluck(:lfs_object_id)

--------------------------------------------------------------------------------

=== Summary ===
Total failures: 129
Unique error types: 9
Known errors: 8
Unknown errors: 1

ℹ Recommendations:
  • Review the documentation links above for known issues
  • Apply suggested workarounds where applicable
  • Consider opening a support ticket for unknown errors
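
The summary figures above could be tallied from the grouped failures roughly as follows. This is a sketch under stated assumptions: the `Failure` struct and the `summarize` method are hypothetical names, and in practice the input would come from the Geo registry tables rather than a hard-coded array. Only three of the example errors are included, so the totals differ from the full output.

```ruby
# Hypothetical sketch: Failure and summarize are illustrative names only.
# Each Failure is one distinct error message within one category.
Failure = Struct.new(:category, :message, :count, :known, keyword_init: true)

# Tallies totals in the shape the "=== Summary ===" block reports them.
def summarize(failures)
  known, unknown = failures.partition(&:known)
  {
    total_failures: failures.sum(&:count),
    categories: failures.map(&:category).uniq.size,
    unique_error_types: failures.size,
    known_errors: known.size,
    unknown_errors: unknown.size
  }
end

failures = [
  Failure.new(category: "Uploads - Sync",
              message: "The file is missing on the Geo primary site",
              count: 23, known: true),
  Failure.new(category: "Project Repositories - Sync",
              message: "fetch remote: signal: terminated: context deadline exceeded",
              count: 12, known: true),
  Failure.new(category: "Project Repositories - Sync",
              message: "Connection timed out - connect(2) for \"geo-secondary.internal\" port 22",
              count: 15, known: false)
]

summary = summarize(failures)
puts "Total failures: #{summary[:total_failures]}"
puts "Unique error types: #{summary[:unique_error_types]}"
puts "Known errors: #{summary[:known_errors]}"
puts "Unknown errors: #{summary[:unknown_errors]}"
```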
Edited by Scott Murray
