Draft: POC - Geo Troubleshooting Rake Task
What
- This is a POC/draft of a rake task to improve the troubleshooting experience for Geo replication issues.
- The change introduces a module that lists known Geo errors and maps each one to its existing documented workaround.
- This provides an easily extendable way to record known issues as they arise, and makes it easier to find documentation and existing issues with solutions.
- This lets customers and support staff troubleshoot known issues more quickly, which should reduce support tickets.
- This is a shorter-term solution to the problem that the Geo Observability UI project is working to resolve.
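As a rough illustration of the known-errors module described above, the mapping could look something like the sketch below. The names `KnownGeoErrors`, `KnownError`, `ERRORS`, and `lookup` are hypothetical and not taken from the actual change; the entries reuse messages and workarounds from the example output in this description.

```ruby
# Hypothetical sketch only: module, struct, and method names are illustrative,
# not the actual code in this MR.
module KnownGeoErrors
  DOCS_BASE = "https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html"

  KnownError = Struct.new(:pattern, :description, :documentation, :workaround, keyword_init: true)

  # Each entry maps a failure-message pattern to its documented workaround.
  # Recording a new known issue means appending one more entry here.
  ERRORS = [
    KnownError.new(
      pattern: /The file is missing on the Geo primary site/,
      description: "File exists in database but is missing on primary site's filesystem",
      documentation: "#{DOCS_BASE}#message-the-file-is-missing-on-the-geo-primary-site",
      workaround: "Clean up orphaned records or restore missing files from backup"
    ),
    KnownError.new(
      pattern: /The model which owns this Upload is missing/,
      description: "Parent record has been deleted but the associated file record still exists",
      documentation: "#{DOCS_BASE}#failed-verification-of-uploads-on-the-primary-geo-site",
      workaround: "Delete orphaned upload records using the provided script"
    ),
    KnownError.new(
      pattern: /fetch remote: signal: terminated: context deadline exceeded/,
      description: "Git fetch timeout after exactly 3 hours",
      documentation: "#{DOCS_BASE}#error-fetch-remote-signal-terminated-context-deadline-exceeded-at-exactly-3-hours",
      workaround: "Increase gitlab_shell_git_timeout in gitlab.rb"
    )
  ].freeze

  # Returns the first matching known error, or nil when the message is unrecognised.
  def self.lookup(message)
    ERRORS.find { |error| error.pattern.match?(message.to_s) }
  end
end

message = "Error during verification: The model which owns this Upload is missing. Upload ID#123, Project ID#456"
hit = KnownGeoErrors.lookup(message)
puts(hit ? "Known Issue: Yes (#{hit.workaround})" : "Known Issue: No - This may be a new issue")
```

Matching on a pattern rather than the full string is one possible design choice: it lets messages that embed record IDs (such as the Upload example) resolve to a single known issue.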
Why
Troubleshooting Geo issues is time-consuming, and the errors encountered are often known issues.
In theory, if we add known issues and workarounds to this list as they arise, we can direct people more effectively to the solution or best current workaround, and minimise the number of duplicate RFH/customer support tickets.
Example Output
=== Geo Replication Error Troubleshooting ===
Analyzing sync and verification failures across all replicable types...
Current node: US-West Secondary (Secondary)
================================================================================
Found 129 total failures across 5 categories
▶ Uploads - Sync
Error: The file is missing on the Geo primary site
Count: 23 affected records
Known Issue: Yes
Description: File exists in database but is missing on primary site's filesystem
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#message-the-file-is-missing-on-the-geo-primary-site
Workaround: Clean up orphaned records or restore missing files from backup
Debug: Run the following to get affected record IDs:
Geo::UploadRegistry.where(last_sync_failure: "The file is missing on the Geo primary site").limit(10).pluck(:file_id)
--------------------------------------------------------------------------------
▶ Uploads - Verification
Error: Error during verification: The model which owns this Upload is missing. Upload ID#123, Project ID#456
Count: 8 affected records
Known Issue: Yes
Description: Parent record has been deleted but the associated file record still exists
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#failed-verification-of-uploads-on-the-primary-geo-site
Workaround: Delete orphaned upload records using the provided script
Debug: Run the following to get affected record IDs:
Geo::UploadRegistry.verification_failed.where(verification_failure: "Error during verification: The model which owns this Upload is missing. Upload ID#123, Project ID#456").limit(10).pluck(:file_id)
Error: Error during verification: undefined method `underscore' for NilClass:Class
Count: 3 affected records
Known Issue: Yes
Description: Orphaned upload with missing parent model
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#failed-verification-of-uploads-on-the-primary-geo-site
Workaround: Delete orphaned upload records using the provided script
Debug: Run the following to get affected record IDs:
Geo::UploadRegistry.verification_failed.where(verification_failure: "Error during verification: undefined method `underscore' for NilClass:Class").limit(10).pluck(:file_id)
--------------------------------------------------------------------------------
▶ Project Repositories - Sync
Error: Error syncing repository: 13:fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled
Count: 45 affected records
Known Issue: Yes
Description: JWT authentication failing during Git operations
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#error-error-syncing-repository-13fatal-could-not-read-username
Workaround: Check system clock sync or upgrade to GitLab 17.1.0+, 17.0.5+, or 16.11.7+
Debug: Run the following to get affected record IDs:
Geo::ProjectRepositoryRegistry.where(last_sync_failure: "Error syncing repository: 13:fatal: could not read Username for 'https://gitlab.example.com': terminal prompts disabled").limit(10).pluck(:project_id)
Error: fetch remote: signal: terminated: context deadline exceeded
Count: 12 affected records
Known Issue: Yes
Description: Git fetch timeout after exactly 3 hours
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#error-fetch-remote-signal-terminated-context-deadline-exceeded-at-exactly-3-hours
Workaround: Increase gitlab_shell_git_timeout in gitlab.rb
Debug: Run the following to get affected record IDs:
Geo::ProjectRepositoryRegistry.where(last_sync_failure: "fetch remote: signal: terminated: context deadline exceeded").limit(10).pluck(:project_id)
Error: Error syncing repository: fatal: 'geo' does not appear to be a git repository
Count: 7 affected records
Known Issue: Yes
Description: Geo remote is missing from repository config
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#related-error-does-not-appear-to-be-a-git-repository
Workaround: Re-sync affected repositories using the Rails console
Debug: Run the following to get affected record IDs:
Geo::ProjectRepositoryRegistry.where(last_sync_failure: "Error syncing repository: fatal: 'geo' does not appear to be a git repository").limit(10).pluck(:project_id)
Error: Connection timed out - connect(2) for "geo-secondary.internal" port 22
Count: 15 affected records
Known Issue: No - This may be a new issue
Action: Check logs for more details or contact GitLab support
Debug: Run the following to get affected record IDs:
Geo::ProjectRepositoryRegistry.where(last_sync_failure: "Connection timed out - connect(2) for \"geo-secondary.internal\" port 22").limit(10).pluck(:project_id)
--------------------------------------------------------------------------------
▶ CI Job Artifacts - Verification
Error: Error during verification: File is not checksummable
Count: 5 affected records
Known Issue: Yes
Description: File cannot be checksummed due to being missing or inaccessible
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#message-error-during-verificationerrorfile-is-not-checksummable
Workaround: Follow instructions for cleaning up missing files on primary
Debug: Run the following to get affected record IDs:
Geo::JobArtifactRegistry.verification_failed.where(verification_failure: "Error during verification: File is not checksummable").limit(10).pluck(:artifact_id)
--------------------------------------------------------------------------------
▶ LFS Objects - Sync
Error: The file is missing on the Geo primary site
Count: 11 affected records
Known Issue: Yes
Description: File exists in database but is missing on primary site's filesystem
Documentation: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting/synchronization_verification.html#message-the-file-is-missing-on-the-geo-primary-site
Workaround: Clean up orphaned records or restore missing files from backup
Debug: Run the following to get affected record IDs:
Geo::LfsObjectRegistry.where(last_sync_failure: "The file is missing on the Geo primary site").limit(10).pluck(:lfs_object_id)
--------------------------------------------------------------------------------
=== Summary ===
Total failures: 129
Unique error types: 9
Known errors: 8
Unknown errors: 1
ℹ Recommendations:
• Review the documentation links above for known issues
• Apply suggested workarounds where applicable
• Consider opening a support ticket for unknown errors
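For illustration, the summary counts and the "Debug" lines in the output above could be produced along the following lines. This is a minimal sketch in plain Ruby, assuming failures have already been grouped by message; `Failure`, `summarize`, and `debug_command` are hypothetical names, and a struct stands in for the real Geo registry queries.

```ruby
# Hypothetical sketch: a plain struct stands in for grouped Geo registry failures.
Failure = Struct.new(:category, :message, :count, :known, keyword_init: true)

# Tallies the figures shown in the "Summary" section.
def summarize(failures)
  {
    total_failures: failures.sum(&:count),
    unique_error_types: failures.size,
    known_errors: failures.count(&:known),
    unknown_errors: failures.count { |failure| !failure.known }
  }
end

# Builds the Rails console snippet printed on each "Debug" line.
# String#inspect wraps the message in double quotes and escapes embedded ones.
def debug_command(registry:, column:, message:, id_column:, scope: nil)
  receiver = [registry, scope].compact.join(".")
  "#{receiver}.where(#{column}: #{message.inspect}).limit(10).pluck(:#{id_column})"
end

failures = [
  Failure.new(category: "Uploads - Sync", message: "The file is missing on the Geo primary site", count: 23, known: true),
  Failure.new(category: "Project Repositories - Sync",
              message: 'Connection timed out - connect(2) for "geo-secondary.internal" port 22',
              count: 15, known: false)
]

puts summarize(failures)
puts debug_command(
  registry: "Geo::ProjectRepositoryRegistry",
  column: "last_sync_failure",
  message: failures.last.message,
  id_column: "project_id"
)
```

The optional `scope:` argument covers the verification case, where the example output chains `verification_failed` before `where`.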
Edited by Scott Murray