Tune GitLab Pages caches to increase hit-rates on gitlab.com

We currently have very low hit-rates on zip-archive caches in gitlab-pages: https://thanos-query.ops.gitlab.net/graph?g0.expr=sum(gitlab_pages_zip_cached_entries%7Benv%3D%22gprd%22%7D)%20by%20(op)&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=sum(rate(gitlab_pages_zip_cache_requests%7Benv%3D%22gprd%22%2Ccache%3D%22hit%22%7D%5B1h%5D))%20by%20(op%20)%20%2F%20sum(rate(gitlab_pages_zip_cache_requests%7Benv%3D%22gprd%22%20%7D%5B1h%5D))%20by%20(op%20)&g1.tab=0&g1.stacked=0&g1.range_input=1w&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D:

(screenshot: zip-archive cache hit-rate graphs)

There are 3 different hit rates there. The main one is "archive", and we have configuration options for it:

	zipCacheExpiration = flag.Duration("zip-cache-expiration", 60*time.Second, "Zip serving archive cache expiration interval")
	zipCacheCleanup    = flag.Duration("zip-cache-cleanup", 30*time.Second, "Zip serving archive cache cleanup interval")
	zipCacheRefresh    = flag.Duration("zip-cache-refresh", 30*time.Second, "Zip serving archive cache refresh interval")

These options work like this:

  1. every zipCacheCleanup interval we scan the cache and remove every entry that was added longer than zipCacheExpiration ago
  2. on every cache hit, if the archive was added longer than zipCacheRefresh ago, we update its timestamp - basically making it look like it was just added to the cache

so the tradeoffs work like this:

  1. longer zipCacheExpiration - takes more memory, but increases the hit rate, as we keep more data in the cache.
  2. longer zipCacheCleanup - saves CPU time, but increases memory consumption, as we garbage-collect less often
  3. longer zipCacheRefresh - saves some CPU by doing fewer memory operations, but makes the cache less "up to date", as we can evict cache entries that were accessed recently. (TBH, I don't know why we can't just make this 0; I'm not sure how big the CPU impact of that would be)

However, the pages daemon already uses quite a lot of memory, up to 1.7 GB on gitlab.com, and 90% of that memory is used by the zip-archive cache. There is an issue to optimize this. So if we want to keep more data in this cache, we need to do it very carefully.

I think the default 30 sec for zipCacheCleanup and zipCacheRefresh is OK, but I want to try increasing zipCacheExpiration from the default 1 min to 2, or maybe 5-10 minutes, depending on the memory impact of the change.
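For illustration, this is what the first step would look like in terms of the daemon's flags (flag names come from the code above; the actual change goes through the k8s-workloads config MRs referenced below, not a direct CLI invocation):

```shell
# Hypothetical first rollout step: bump expiration to 1.5 min,
# keep cleanup and refresh at their 30s defaults.
gitlab-pages -zip-cache-expiration=90s \
             -zip-cache-cleanup=30s \
             -zip-cache-refresh=30s
```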

Success criteria:

  1. the hit-rate for the archive operation on the graph above goes up
  2. the 95th percentile of duration_ms for files under 1 MB goes down
  3. TTFB (from the same dashboard) goes down

I'll set the weight to 4 and carefully increase zipCacheExpiration step by step:

  1. production to 1.5 min
  2. production to 2 min
  3. production to 5 min
  4. production to 10 min

We can stop after any step if it doesn't improve duration_ms or if memory usage grows too high. (We also need to set the same values on stg/pre every time, so there will be 8 MRs if we go all the way, but I don't want to put the scary weight of 8 on the issue.)

References of similar MRs:

  1. changing pages config on staging/pre: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1499 (merged)
  2. in production - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1500 (merged)