Observability Improvement

Summary

Logs

When gitlab-pages cannot fulfill a request because the TLS handshake fails, the logs give us very little information. Example:

2021/06/30 18:53:19 http: TLS handshake error from 34.73.184.90:33570: remote error: tls: bad certificate

The only useful information is the source IP and the reason gitlab-pages could not complete the request. We have no details about which domain was requested, which prevents us from identifying and targeting a potential bad actor. The only method currently available is to run a packet capture, hope that another such request comes in, and inspect the PROXY protocol data for the domain the client requested.

Metrics

In the same situation, these requests also do not appear in our metrics. During incident gitlab-com/gl-infra/production#5050 (closed), this error was showing up 4500 times per 10 seconds, yet our own dashboard reported no requests during the same time period. This is captured in this thread: gitlab-com/gl-infra/production#5050 (comment 615703006)

Steps to reproduce

  1. Create a Pages project using TLS with an intentionally malformed certificate that the client will reject.
  2. Observe the logs from gitlab-pages when attempting to reach the site.
  3. Observe the metrics from the gitlab-pages service when attempting to reach the site.

What is the expected correct behavior?

Logs

  • Log output should be in structured JSON format. Only successful requests are currently structured; the message above is not.
  • Log output should include the requested domain.
  • This specific log entry should be at warning level. In most cases the failure is no fault of ours: this service doesn't maintain the certificates, which are the responsibility of the end user who configures this feature.

Metrics

  • Metrics should capture legitimate requests regardless of whether they complete successfully. Perhaps add a counter called gitlab_pages_failed_tls_connect, or something similar. We can then add it to our existing metric that captures request rate.

Output of checks

This feature request is for GitLab.com

~"devops::release" ~"group::release" Category:Pages