From f21435236af95f1bad4db7df3290c80ff194e0b8 Mon Sep 17 00:00:00 2001
From: John Gaughan
Date: Fri, 27 Jun 2025 09:29:25 -0400
Subject: [PATCH 1/4] create empty on-call training file

---
 .gitlab/issue_templates/On-call Training.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 .gitlab/issue_templates/On-call Training.md

diff --git a/.gitlab/issue_templates/On-call Training.md b/.gitlab/issue_templates/On-call Training.md
new file mode 100644
index 00000000000..e69de29bb2d
-- 
GitLab

From 841725080307a05b3fc4b21d1cf25a69668a84e1 Mon Sep 17 00:00:00 2001
From: John Gaughan <12921979-jgaughan@users.noreply.gitlab.com>
Date: Tue, 1 Jul 2025 10:55:37 -0400
Subject: [PATCH 2/4] Raw Claude output

---
 .gitlab/issue_templates/On-call Training.md | 125 ++++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/.gitlab/issue_templates/On-call Training.md b/.gitlab/issue_templates/On-call Training.md
index e69de29bb2d..b48548d8597 100644
--- a/.gitlab/issue_templates/On-call Training.md
+++ b/.gitlab/issue_templates/On-call Training.md
@@ -0,0 +1,125 @@
+# Gitaly Team on-call Training for Self-Managed customer emergencies
+
+Welcome to your Gitaly on-call training! This training is designed to help you become familiar with troubleshooting Gitaly-related issues in self-managed GitLab environments. Rather than reading through documentation, you'll learn by exploring and discovering the tools and techniques that make troubleshooting faster and more effective.
+
+## Training Goals
+
+By the end of this training, you should be comfortable with:
+- Using GitLabSOS and SOS Parser to gather and analyze system information
+- Identifying and examining relevant log files for Gitaly issues
+- Leveraging system monitoring tools for performance troubleshooting
+- Using fast-stats to quickly identify patterns and issues in logs
+
+## Part 1: Understanding Your Environment
+
+Before diving into tools, let's understand what we're working with.
+
+### Questions to Explore:
+1. **Environment Discovery**: When a customer reports a Gitaly issue, what are the first 3 pieces of information you need to understand about their environment?
+
+2. **Log Location Mapping**: Where would you expect to find Gitaly logs on a typical Linux package installation? What about Praefect logs?
+
+*Take a moment to think through these questions before moving to the next section.*
+
+## Part 2: GitLabSOS - Your Data Collection Swiss Army Knife
+
+GitLabSOS is a unified method of gathering information and logs from GitLab and the system it's running on. Think of it as creating a comprehensive snapshot of a customer's system at a moment in time.
+
+### Exploration Tasks:
+3. **Tool Discovery**: Navigate to the GitLabSOS repository (https://gitlab.com/gitlab-com/support/toolbox/gitlabsos). What types of information does GitLabSOS collect? List at least 5 categories of data it gathers.
+
+4. **Gitaly-Specific Collection**: Looking at the GitLabSOS output, which files or data points would be most relevant when investigating a Gitaly performance issue?
+
+5. **Practical Application**: If a customer reports "Git pushes are extremely slow," what specific GitLabSOS outputs would you want to examine first?
+
+## Part 3: SOS Parser - Making Sense of the Data
+
+Raw GitLabSOS output can be overwhelming. SOS Parser helps you analyze and make sense of this data.
+
+### Exploration Questions:
+6. **Parser Purpose**: Visit the SOS Parser repository (https://gitlab.com/gitlab-com/support/toolbox/sosparser). Based on what you can find, what problem does SOS Parser solve?
+
+7. **Analysis Workflow**: How might you use SOS Parser in conjunction with GitLabSOS when troubleshooting a complex Gitaly issue?
+
+## Part 4: Essential Log Files and What They Tell You
+
+Understanding which files to examine and what they reveal is crucial for effective troubleshooting.
+
+### Investigation Tasks:
+8. **Log File Mapping**: For each scenario below, identify which log files you would examine first:
+ - Repository corruption reported by users
+ - Slow clone operations
+ - Praefect replication issues
+ - High CPU usage on Gitaly nodes
+
+9. **Beyond Gitaly Logs**: What system-level files (hint: think `ps`, `iostat`, etc.) would help you understand performance issues that might be affecting Gitaly?
+
+## Part 5: fast-stats - Your Log Analysis Accelerator
+
+fast-stats summarizes errors and resource-intensive usage statistics quickly, to help debug performance and configuration problems.
+
+### Practical Exploration:
+10. **Top Users Discovery**: You suspect a few users are overwhelming the system with constant pushes. How would you use `fast-stats` to identify these users? What command would you run?
+
+11. **Error Pattern Recognition**: A customer reports intermittent failures. How would you use `fast-stats` to quickly identify error patterns in their logs?
+
+12. **Performance Bottleneck Identification**: Looking at `fast-stats` output, what metrics would help you identify:
+ - Which RPC calls are taking the longest?
+ - Which operations are being called most frequently?
+ - Whether there are specific timeouts occurring?
+
+### Challenge Questions:
+13. **RPC Analysis**: If `fast-stats` shows that `PostReceivePack` operations are taking 30+ seconds on average, what are three potential causes you would investigate?
+
+14. **Pattern Recognition**: You notice `fast-stats errors` shows repeated `context deadline exceeded` errors. What does this suggest, and what would be your next troubleshooting steps?
+
+## Part 6: Putting It All Together
+
+### Scenario-Based Problem Solving:
+15. **Complete Investigation Flow**: A customer reports: "Our developers can't push to repositories. Some pushes work, others timeout after 60 seconds."
+
+ Outline your investigation approach using the tools you've learned about:
+ - What data would you collect first?
+ - Which tools would you use and in what order?
+ - What specific outputs would you examine?
+ - How would you validate your findings?
+
+16. **Performance Deep Dive**: You've identified that certain Gitaly RPC calls are consistently slow. Walk through how you would:
+ - Use GitLabSOS to capture comprehensive system state
+ - Apply SOS Parser to analyze the data
+ - Leverage fast-stats to identify patterns
+ - Correlate system metrics with Gitaly performance
+
+## Part 7: Reflection and Knowledge Check
+
+### Self-Assessment Questions:
+17. **Tool Selection**: For each scenario, which tool would be most effective as your starting point?
+ - Customer reports general "slowness"
+ - Specific error messages appearing in application
+ - Need to understand system resource utilization
+ - Want to identify top consumers of Gitaly resources
+
+18. **Confidence Check**: Rate your confidence (1-10) in:
+ - Collecting comprehensive diagnostic data from a customer environment
+ - Analyzing Gitaly and Praefect logs effectively
+ - Using fast-stats to identify performance patterns
+ - Correlating system metrics with Gitaly performance issues
+
+### Next Steps:
+19. **Knowledge Gaps**: What aspects of Gitaly troubleshooting do you feel need more exploration?
+
+20. **Practice Plan**: What type of scenarios would you like to practice with these tools?
+
+## Resources and References
+
+- [GitLabSOS Repository](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos)
+- [SOS Parser Repository](https://gitlab.com/gitlab-com/support/toolbox/sosparser)
+- [fast-stats Repository](https://gitlab.com/gitlab-com/support/toolbox/fast-stats)
+- [GitLab Log System Documentation](https://docs.gitlab.com/administration/logs/)
+- [Diagnostics Tools Documentation](https://docs.gitlab.com/administration/troubleshooting/diagnostics_tools/)
+
+## Training Completion
+
+Once you've worked through these questions and feel comfortable with the concepts, consider yourself ready for Gitaly on-call duties! Remember, the best way to solidify this knowledge is through hands-on practice with real scenarios.
+
+**Trainer Note**: This training is designed to be self-paced and exploration-based. Encourage trainees to actually visit the repositories, explore the tools, and think through scenarios rather than just reading through the questions.
\ No newline at end of file
-- 
GitLab

From 9fbd6ff3fce131435d7f0173cac6a7567c3dbae3 Mon Sep 17 00:00:00 2001
From: John Gaughan <12921979-jgaughan@users.noreply.gitlab.com>
Date: Tue, 1 Jul 2025 11:20:27 -0400
Subject: [PATCH 3/4] Update intro & list ideas for Setup

---
 .gitlab/issue_templates/On-call Training.md | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/.gitlab/issue_templates/On-call Training.md b/.gitlab/issue_templates/On-call Training.md
index b48548d8597..6678d152126 100644
--- a/.gitlab/issue_templates/On-call Training.md
+++ b/.gitlab/issue_templates/On-call Training.md
@@ -1,14 +1,24 @@
-# Gitaly Team on-call Training for Self-Managed customer emergencies
+# Gitaly Team on-call Training for customer emergencies
 
-Welcome to your Gitaly on-call training! This training is designed to help you become familiar with troubleshooting Gitaly-related issues in self-managed GitLab environments. Rather than reading through documentation, you'll learn by exploring and discovering the tools and techniques that make troubleshooting faster and more effective.
+This is a collection of items that should prepare new team members to be effective in understanding customer emergency issues and thus join the [on-call rotation](https://about.gitlab.com/handbook/engineering/infrastructure-platforms/data-access/gitaly#gitaly-oncall-rotation).
 
-## Training Goals
+While this can be started at any time, team members should complete their onboarding first, and have some experience in the codebase before completing this process.
 
 By the end of this training, you should be comfortable with:
-- Using GitLabSOS and SOS Parser to gather and analyze system information
+- Using [GitLabSOS](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos) to gather and analyze system information
 - Identifying and examining relevant log files for Gitaly issues
 - Leveraging system monitoring tools for performance troubleshooting
-- Using fast-stats to quickly identify patterns and issues in logs
+- Using [fast-stats](https://gitlab.com/gitlab-com/support/toolbox/fast-stats) to quickly identify patterns and issues in logs
+
+**This is not a test, it's an interactive learning guide.** It's quite normal and expected to ask for help and to discuss different approaches.
+
+## Setup
+
+- They need a GitLabSOS to work with. How can we provide that?
+ - Have them spin up an instance and generate one
+ - Pros: it's live, up-to-date, and allows them to play around with configurations & generate new SOSs
+ - Cons: that's significantly more overhead.
+ - What do we do in the SM training for support engineers?
 
 ## Part 1: Understanding Your Environment
 
-- 
GitLab

From f97bf5b485b2eebf241628ccb5fc2d92dfd53812 Mon Sep 17 00:00:00 2001
From: John Gaughan
Date: Thu, 11 Sep 2025 11:06:30 -0400
Subject: [PATCH 4/4] Replace SOS Parse w/ fast-stats, improve Setup section

---
 .gitlab/issue_templates/On-call Training.md | 81 +++++++++++----------
 1 file changed, 42 insertions(+), 39 deletions(-)

diff --git a/.gitlab/issue_templates/On-call Training.md b/.gitlab/issue_templates/On-call Training.md
index 6678d152126..af2dce9fddb 100644
--- a/.gitlab/issue_templates/On-call Training.md
+++ b/.gitlab/issue_templates/On-call Training.md
@@ -5,88 +5,90 @@ This is a collection of items that should prepare new team members to be effecti
 While this can be started at any time, team members should complete their onboarding first, and have some experience in the codebase before completing this process.
 
 By the end of this training, you should be comfortable with:
+
 - Using [GitLabSOS](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos) to gather and analyze system information
-- Identifying and examining relevant log files for Gitaly issues
-- Leveraging system monitoring tools for performance troubleshooting
 - Using [fast-stats](https://gitlab.com/gitlab-com/support/toolbox/fast-stats) to quickly identify patterns and issues in logs
+- Identifying and examining relevant log files for Gitaly issues
 
 **This is not a test, it's an interactive learning guide.** It's quite normal and expected to ask for help and to discuss different approaches.
 
 ## Setup
 
-- They need a GitLabSOS to work with. How can we provide that?
- - Have them spin up an instance and generate one
- - Pros: it's live, up-to-date, and allows them to play around with configurations & generate new SOSs
- - Cons: that's significantly more overhead.
- - What do we do in the SM training for support engineers?
+- [ ] Set the title to `Gitaly Team on-call training: `
+- [ ] Set up a single-node Omnibus GitLab instance ([Sandbox Cloud Realm](https://handbook.gitlab.com/handbook/company/infrastructure-standards/realms/sandbox/))
 
-## Part 1: Understanding Your Environment
+## Understanding Your Environment
 
 Before diving into tools, let's understand what we're working with.
 
 ### Questions to Explore:
-1. **Environment Discovery**: When a customer reports a Gitaly issue, what are the first 3 pieces of information you need to understand about their environment?
 
-2. **Log Location Mapping**: Where would you expect to find Gitaly logs on a typical Linux package installation? What about Praefect logs?
+1. **Environment Discovery**: When a self managed customer reports a Gitaly issue, what are the first 3 pieces of information you need to understand about their environment?
 
-*Take a moment to think through these questions before moving to the next section.*
+2. **Log Location Mapping**: Where would you expect to find Gitaly logs on a typical Linux package installation? What about Praefect logs?
 
-## Part 2: GitLabSOS - Your Data Collection Swiss Army Knife
+## GitLabSOS
 
 GitLabSOS is a unified method of gathering information and logs from GitLab and the system it's running on. Think of it as creating a comprehensive snapshot of a customer's system at a moment in time.
 
-### Exploration Tasks:
-3. **Tool Discovery**: Navigate to the GitLabSOS repository (https://gitlab.com/gitlab-com/support/toolbox/gitlabsos). What types of information does GitLabSOS collect? List at least 5 categories of data it gathers.
+### Exploration Tasks
+
+1. **Tool Discovery**: Go to the GitLabSOS repository (https://gitlab.com/gitlab-com/support/toolbox/gitlabsos). What types of information does GitLabSOS collect? List at least 5 categories of data it gathers.
 
-4. **Gitaly-Specific Collection**: Looking at the GitLabSOS output, which files or data points would be most relevant when investigating a Gitaly performance issue?
+2. **Gitaly-Specific Collection**: Looking at the GitLabSOS output, which files or data points would be most relevant when investigating a Gitaly performance issue?
 
-5. **Practical Application**: If a customer reports "Git pushes are extremely slow," what specific GitLabSOS outputs would you want to examine first?
+3. **Practical Application**: If a customer reports "Git pushes are extremely slow," what specific GitLabSOS outputs would you want to examine first?
 
-## Part 3: SOS Parser - Making Sense of the Data
+## fast-stats
 
-Raw GitLabSOS output can be overwhelming. SOS Parser helps you analyze and make sense of this data.
+Raw GitLabSOS output can be overwhelming. fast-stats helps you analyze and make sense of this data.
 
 ### Exploration Questions:
-6. **Parser Purpose**: Visit the SOS Parser repository (https://gitlab.com/gitlab-com/support/toolbox/sosparser). Based on what you can find, what problem does SOS Parser solve?
 
-7. **Analysis Workflow**: How might you use SOS Parser in conjunction with GitLabSOS when troubleshooting a complex Gitaly issue?
+1. **Parser Purpose**: Visit the fast-stats repository (https://gitlab.com/gitlab-com/support/toolbox/fast-stats). Based on what you can find, what problem does fast-stats solve?
+
+2. **Analysis Workflow**: How might you use fast-stats in conjunction with GitLabSOS when troubleshooting a complex Gitaly issue?
 
-## Part 4: Essential Log Files and What They Tell You
+## Essential Log Files and What They Tell You
 
 Understanding which files to examine and what they reveal is crucial for effective troubleshooting.
 
 ### Investigation Tasks:
-8. **Log File Mapping**: For each scenario below, identify which log files you would examine first:
+
+1. **Log File Mapping**: For each scenario below, identify which log files you would examine first:
  - Repository corruption reported by users
  - Slow clone operations
  - Praefect replication issues
  - High CPU usage on Gitaly nodes
 
-9. **Beyond Gitaly Logs**: What system-level files (hint: think `ps`, `iostat`, etc.) would help you understand performance issues that might be affecting Gitaly?
+2. **Beyond Gitaly Logs**: What system-level files (hint: think `ps`, `iostat`, etc.) would help you understand performance issues that might be affecting Gitaly?
 
 ## Part 5: fast-stats - Your Log Analysis Accelerator
 
 fast-stats summarizes errors and resource-intensive usage statistics quickly, to help debug performance and configuration problems.
 
 ### Practical Exploration:
-10. **Top Users Discovery**: You suspect a few users are overwhelming the system with constant pushes. How would you use `fast-stats` to identify these users? What command would you run?
 
-11. **Error Pattern Recognition**: A customer reports intermittent failures. How would you use `fast-stats` to quickly identify error patterns in their logs?
+1. **Top Users Discovery**: You suspect a few users are overwhelming the system with constant pushes. How would you use `fast-stats` to identify these users? What command would you run?
+
+2. **Error Pattern Recognition**: A customer reports intermittent failures. How would you use `fast-stats` to quickly identify error patterns in their logs?
 
-12. **Performance Bottleneck Identification**: Looking at `fast-stats` output, what metrics would help you identify:
+3. **Performance Bottleneck Identification**: Looking at `fast-stats` output, what metrics would help you identify:
  - Which RPC calls are taking the longest?
  - Which operations are being called most frequently?
  - Whether there are specific timeouts occurring?
 
 ### Challenge Questions:
-13. **RPC Analysis**: If `fast-stats` shows that `PostReceivePack` operations are taking 30+ seconds on average, what are three potential causes you would investigate?
 
-14. **Pattern Recognition**: You notice `fast-stats errors` shows repeated `context deadline exceeded` errors. What does this suggest, and what would be your next troubleshooting steps?
+1. **RPC Analysis**: If `fast-stats` shows that `PostReceivePack` operations are taking 30+ seconds on average, what are three potential causes you would investigate?
+
+2. **Pattern Recognition**: You notice `fast-stats errors` shows repeated `context deadline exceeded` errors. What does this suggest, and what would be your next troubleshooting steps?
 
 ## Part 6: Putting It All Together
 
 ### Scenario-Based Problem Solving:
-15. **Complete Investigation Flow**: A customer reports: "Our developers can't push to repositories. Some pushes work, others timeout after 60 seconds."
+
+1. **Complete Investigation Flow**: A customer reports: "Our developers can't push to repositories. Some pushes work, others timeout after 60 seconds."
 
  Outline your investigation approach using the tools you've learned about:
  - What data would you collect first?
@@ -94,36 +96,37 @@ fast-stats summarizes errors and resource-intensive usage statistics quickly, to
  - What specific outputs would you examine?
  - How would you validate your findings?
 
-16. **Performance Deep Dive**: You've identified that certain Gitaly RPC calls are consistently slow. Walk through how you would:
+2. **Performance Deep Dive**: You've identified that certain Gitaly RPC calls are consistently slow. Walk through how you would:
  - Use GitLabSOS to capture comprehensive system state
- - Apply SOS Parser to analyze the data
- - Leverage fast-stats to identify patterns
+ - Use fast-stats to identify patterns
  - Correlate system metrics with Gitaly performance
 
 ## Part 7: Reflection and Knowledge Check
 
-### Self-Assessment Questions:
-17. **Tool Selection**: For each scenario, which tool would be most effective as your starting point?
+### Self-Assessment Questions
+
+1. **Tool Selection**: For each scenario, which tool would be most effective as your starting point?
  - Customer reports general "slowness"
  - Specific error messages appearing in application
  - Need to understand system resource utilization
  - Want to identify top consumers of Gitaly resources
 
-18. **Confidence Check**: Rate your confidence (1-10) in:
+2. **Confidence Check**: Rate your confidence (1-10) in:
  - Collecting comprehensive diagnostic data from a customer environment
  - Analyzing Gitaly and Praefect logs effectively
  - Using fast-stats to identify performance patterns
  - Correlating system metrics with Gitaly performance issues
 
-### Next Steps:
-19. **Knowledge Gaps**: What aspects of Gitaly troubleshooting do you feel need more exploration?
+### Next Steps
+
+1. **Knowledge Gaps**: What aspects of Gitaly troubleshooting do you feel need more exploration?
 
-20. **Practice Plan**: What type of scenarios would you like to practice with these tools?
+2. **Practice Plan**: What type of scenarios would you like to practice with these tools?
 
 ## Resources and References
 
 - [GitLabSOS Repository](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos)
-- [SOS Parser Repository](https://gitlab.com/gitlab-com/support/toolbox/sosparser)
+- [fast-stats Repository](https://gitlab.com/gitlab-com/support/toolbox/sosparser)
 - [fast-stats Repository](https://gitlab.com/gitlab-com/support/toolbox/fast-stats)
 - [GitLab Log System Documentation](https://docs.gitlab.com/administration/logs/)
 - [Diagnostics Tools Documentation](https://docs.gitlab.com/administration/troubleshooting/diagnostics_tools/)
-- 
GitLab