diff --git a/.gitlab/issue_templates/On-call Training.md b/.gitlab/issue_templates/On-call Training.md
new file mode 100644
index 0000000000000000000000000000000000000000..af2dce9fddb2f017e1bac1abd544b6935864e654
--- /dev/null
+++ b/.gitlab/issue_templates/On-call Training.md
@@ -0,0 +1,138 @@
+# Gitaly Team on-call Training for customer emergencies
+
+This is a collection of items that should prepare new team members to be effective in understanding customer emergency issues and thus join the [on-call rotation](https://about.gitlab.com/handbook/engineering/infrastructure-platforms/data-access/gitaly#gitaly-oncall-rotation).
+
+While this can be started at any time, team members should complete their onboarding first and have some experience in the codebase before completing this process.
+
+By the end of this training, you should be comfortable with:
+
+- Using [GitLabSOS](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos) to gather and analyze system information
+- Using [fast-stats](https://gitlab.com/gitlab-com/support/toolbox/fast-stats) to quickly identify patterns and issues in logs
+- Identifying and examining relevant log files for Gitaly issues
+
+**This is not a test; it's an interactive learning guide.** It's quite normal and expected to ask for help and to discuss different approaches.
+
+## Setup
+
+- [ ] Set the title to `Gitaly Team on-call training: `
+- [ ] Set up a single-node Omnibus GitLab instance ([Sandbox Cloud Realm](https://handbook.gitlab.com/handbook/company/infrastructure-standards/realms/sandbox/))
+
+## Understanding Your Environment
+
+Before diving into tools, let's understand what we're working with.
+
+### Questions to Explore:
+
+1. **Environment Discovery**: When a self-managed customer reports a Gitaly issue, what are the first 3 pieces of information you need to understand about their environment?
+
+2. **Log Location Mapping**: Where would you expect to find Gitaly logs on a typical Linux package installation?
What about Praefect logs?
+
+## GitLabSOS
+
+GitLabSOS is a unified method of gathering information and logs from GitLab and the system it's running on. Think of it as creating a comprehensive snapshot of a customer's system at a moment in time.
+
+### Exploration Tasks
+
+1. **Tool Discovery**: Go to the GitLabSOS repository (https://gitlab.com/gitlab-com/support/toolbox/gitlabsos). What types of information does GitLabSOS collect? List at least 5 categories of data it gathers.
+
+2. **Gitaly-Specific Collection**: Looking at the GitLabSOS output, which files or data points would be most relevant when investigating a Gitaly performance issue?
+
+3. **Practical Application**: If a customer reports "Git pushes are extremely slow," what specific GitLabSOS outputs would you want to examine first?
+
+## fast-stats
+
+Raw GitLabSOS output can be overwhelming. fast-stats helps you analyze and make sense of this data.
+
+### Exploration Questions:
+
+1. **Parser Purpose**: Visit the fast-stats repository (https://gitlab.com/gitlab-com/support/toolbox/fast-stats). Based on what you can find, what problem does fast-stats solve?
+
+2. **Analysis Workflow**: How might you use fast-stats in conjunction with GitLabSOS when troubleshooting a complex Gitaly issue?
+
+## Essential Log Files and What They Tell You
+
+Understanding which files to examine and what they reveal is crucial for effective troubleshooting.
+
+### Investigation Tasks:
+
+1. **Log File Mapping**: For each scenario below, identify which log files you would examine first:
+   - Repository corruption reported by users
+   - Slow clone operations
+   - Praefect replication issues
+   - High CPU usage on Gitaly nodes
+
+2. **Beyond Gitaly Logs**: What system-level files (hint: think `ps`, `iostat`, etc.) would help you understand performance issues that might be affecting Gitaly?
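When working through the log-file scenarios above, it can help to see what a per-RPC summary of a Gitaly log looks like before reaching for dedicated tooling. The following is a minimal sketch, not part of GitLabSOS or fast-stats. It assumes the log is line-delimited JSON and that entries carry `grpc.method`, `grpc.code`, and `grpc.time_ms` fields; verify those field names against a real log before relying on them. The `SAMPLE` lines are invented for illustration only.

```python
import json
from collections import defaultdict


def summarize_rpcs(lines):
    """Group JSON log lines by gRPC method and return count, average duration, and error count.

    Assumed (not guaranteed) field names: grpc.method, grpc.code, grpc.time_ms.
    """
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "errors": 0})
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (for example, startup banners)
        if not isinstance(entry, dict):
            continue
        method = entry.get("grpc.method")
        if method is None:
            continue  # not an RPC log entry
        bucket = totals[method]
        bucket["count"] += 1
        bucket["total_ms"] += float(entry.get("grpc.time_ms") or 0)
        if entry.get("grpc.code", "OK") != "OK":
            bucket["errors"] += 1
    return {
        method: {
            "count": b["count"],
            "avg_ms": b["total_ms"] / b["count"],
            "errors": b["errors"],
        }
        for method, b in totals.items()
    }


# Hypothetical sample lines, made up for this sketch.
SAMPLE = [
    '{"grpc.method": "PostReceivePack", "grpc.code": "OK", "grpc.time_ms": 31250.0}',
    '{"grpc.method": "PostReceivePack", "grpc.code": "DeadlineExceeded", "grpc.time_ms": 60000.0}',
    '{"grpc.method": "FindCommit", "grpc.code": "OK", "grpc.time_ms": 4.5}',
    "not json",
]

if __name__ == "__main__":
    summary = summarize_rpcs(SAMPLE)
    # Slowest methods first, since slow RPCs are usually the starting point.
    for method, s in sorted(summary.items(), key=lambda kv: kv[1]["avg_ms"], reverse=True):
        print(f"{method}: count={s['count']} avg_ms={s['avg_ms']:.1f} errors={s['errors']}")
```

To try it on a real log, pass an open file handle instead of `SAMPLE`, for example `summarize_rpcs(open(path))` where `path` points at a Gitaly log extracted from a GitLabSOS archive.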
+
+## fast-stats - Your Log Analysis Accelerator
+
+fast-stats summarizes errors and resource-intensive usage statistics quickly, to help debug performance and configuration problems.
+
+### Practical Exploration:
+
+1. **Top Users Discovery**: You suspect a few users are overwhelming the system with constant pushes. How would you use `fast-stats` to identify these users? What command would you run?
+
+2. **Error Pattern Recognition**: A customer reports intermittent failures. How would you use `fast-stats` to quickly identify error patterns in their logs?
+
+3. **Performance Bottleneck Identification**: Looking at `fast-stats` output, what metrics would help you identify:
+   - Which RPC calls are taking the longest?
+   - Which operations are being called most frequently?
+   - Whether there are specific timeouts occurring?
+
+### Challenge Questions:
+
+1. **RPC Analysis**: If `fast-stats` shows that `PostReceivePack` operations are taking 30+ seconds on average, what are three potential causes you would investigate?
+
+2. **Pattern Recognition**: You notice `fast-stats errors` shows repeated `context deadline exceeded` errors. What does this suggest, and what would be your next troubleshooting steps?
+
+## Putting It All Together
+
+### Scenario-Based Problem Solving:
+
+1. **Complete Investigation Flow**: A customer reports: "Our developers can't push to repositories. Some pushes work, others time out after 60 seconds."
+
+   Outline your investigation approach using the tools you've learned about:
+   - What data would you collect first?
+   - Which tools would you use and in what order?
+   - What specific outputs would you examine?
+   - How would you validate your findings?
+
+2. **Performance Deep Dive**: You've identified that certain Gitaly RPC calls are consistently slow.
Walk through how you would:
+   - Use GitLabSOS to capture comprehensive system state
+   - Use fast-stats to identify patterns
+   - Correlate system metrics with Gitaly performance
+
+## Reflection and Knowledge Check
+
+### Self-Assessment Questions
+
+1. **Tool Selection**: For each scenario, which tool would be most effective as your starting point?
+   - Customer reports general "slowness"
+   - Specific error messages appearing in application
+   - Need to understand system resource utilization
+   - Want to identify top consumers of Gitaly resources
+
+2. **Confidence Check**: Rate your confidence (1-10) in:
+   - Collecting comprehensive diagnostic data from a customer environment
+   - Analyzing Gitaly and Praefect logs effectively
+   - Using fast-stats to identify performance patterns
+   - Correlating system metrics with Gitaly performance issues
+
+### Next Steps
+
+1. **Knowledge Gaps**: What aspects of Gitaly troubleshooting do you feel need more exploration?
+
+2. **Practice Plan**: What type of scenarios would you like to practice with these tools?
+
+## Resources and References
+
+- [GitLabSOS Repository](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos)
+- [sosparser Repository](https://gitlab.com/gitlab-com/support/toolbox/sosparser)
+- [fast-stats Repository](https://gitlab.com/gitlab-com/support/toolbox/fast-stats)
+- [GitLab Log System Documentation](https://docs.gitlab.com/administration/logs/)
+- [Diagnostics Tools Documentation](https://docs.gitlab.com/administration/troubleshooting/diagnostics_tools/)
+
+## Training Completion
+
+Once you've worked through these questions and feel comfortable with the concepts, consider yourself ready for Gitaly on-call duties! Remember, the best way to solidify this knowledge is through hands-on practice with real scenarios.
+
+**Trainer Note**: This training is designed to be self-paced and exploration-based.
Encourage trainees to actually visit the repositories, explore the tools, and think through scenarios rather than just reading through the questions.
\ No newline at end of file