Frequent 500 errors on https://ci.chromium.org/p/chromium/g/chromium.clang/console
Issue description
We have a dashboard set up to display https://ci.chromium.org/p/chromium/g/chromium.clang/console?reload=60. However, it frequently, as in multiple times per hour, fails with "GetHistory: failed to fetch ... status code 500". See the attached screenshot.
Dec 8 2017
Timeout of what? If it's the gitiles request -- not good, that'll make our console slower :( SGTM for cached data though.
Dec 8 2017
Gitiles timeout usually. According to this query:
https://pantheon.corp.google.com/logs/viewer?project=luci-milo&resource=gae_app%2Fmodule_id%2Fdefault%2Fversion_id%2F2452-d123a4e&minLogLevel=0&expandAll=false&timestamp=2017-12-08T23:01:17.900964000Z&dateRangeStart=2017-12-08T22:46:36.135Z&dateRangeEnd=2017-12-08T23:46:36.135Z&interval=PT1H&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.module_id%3D%22default%22%0Aresource.labels.version_id%3D%222452-d123a4e%22%0AlogName%3D%22projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log%22%0AprotoPayload.resource:%22console%22%0AprotoPayload.status!%3D200%0AprotoPayload.resource!%3D%22%2Fstatic%2Fcommon%2Fcss%2Fconsole.css%22%0AprotoPayload.status!%3D302%0AprotoPayload.status%3D500
This has happened 8 times in the last hour (7 times where it hit a 60s timeout, 1 time when we got a 502 from gitiles). I'll see if I can create a metric from there.
Dec 8 2017
How about we always try to contact gitiles, but if it does not respond within a minute, we use the potentially stale memcached data? And if it does reply, we refresh memcache.
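
A rough sketch of what that could look like (hypothetical package and helper names; only the google.golang.org/appengine/memcache calls are real APIs, and Milo's actual code may be structured quite differently):

// Sketch: call gitiles with a one-minute deadline, fall back to whatever
// memcache already has if that fails, and refresh memcache whenever gitiles
// does answer. historyKey and fetchHistory are made-up stand-ins.
package historycache

import (
    "context"
    "errors"
    "time"

    "google.golang.org/appengine/memcache"
)

// historyKey is a hypothetical cache key helper.
func historyKey(repo, ref string) string { return "history|" + repo + "|" + ref }

// fetchHistory stands in for the direct gitiles log RPC.
var fetchHistory = func(ctx context.Context, repo, ref string) ([]byte, error) {
    return nil, errors.New("gitiles RPC not wired up in this sketch")
}

func getHistory(ctx context.Context, repo, ref string) ([]byte, error) {
    key := historyKey(repo, ref)

    // Always try gitiles first, but give up after a minute.
    gctx, cancel := context.WithTimeout(ctx, time.Minute)
    defer cancel()

    if data, err := fetchHistory(gctx, repo, ref); err == nil {
        // Gitiles replied: refresh memcache and serve fresh data.
        memcache.Set(ctx, &memcache.Item{Key: key, Value: data})
        return data, nil
    }

    // Gitiles timed out or errored: serve the potentially stale cached copy.
    if item, err := memcache.Get(ctx, key); err == nil {
        return item.Value, nil
    }
    return nil, errors.New("gitiles unavailable and nothing cached")
}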
Dec 8 2017
^ I think that's exactly what I'm thinking. Anyway, quick and dirty metric: https://app.google.stackdriver.com/metrics-explorer?project=luci-milo&metric=logging.googleapis.com%2Fuser%2Fconsole-500
Dec 8 2017
Ah, so it's not 500s for the most part, it's a timeout. I wish we could tell users that it's not MILO but Gitiles that is slow. +1 for metrics. What would it take to serve the cached request immediately (if available), but also task_queue.enque(gitiles.query_with_looooong_timeout_and_update_cache(ref))?
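
A minimal sketch of that variant, assuming the same hypothetical package and historyKey/fetchHistory helpers as the sketch above, with net/url and google.golang.org/appengine/taskqueue added to the imports; the handler path and queue name are invented:

func getHistoryCacheFirst(ctx context.Context, repo, ref string) ([]byte, error) {
    key := historyKey(repo, ref)
    if item, err := memcache.Get(ctx, key); err == nil {
        // Serve the (possibly stale) cached copy immediately, and let a task
        // queue worker re-query gitiles with a long timeout and update the cache.
        t := taskqueue.NewPOSTTask("/internal/refresh-history",
            url.Values{"repo": {repo}, "ref": {ref}})
        taskqueue.Add(ctx, t, "refresh-history")
        return item.Value, nil
    }
    // Nothing cached yet: fall back to a direct (slow) gitiles fetch.
    return fetchHistory(ctx, repo, ref)
}

Note this naive version enqueues a task on every cache hit, which is exactly the task-frequency problem discussed below.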
Dec 8 2017
Yeah, tandrii, that's better. The serving code path always reads memcache and falls back to a direct gitiles RPC only if the data is not in memcache. A cron refreshes memcache periodically.
Dec 8 2017
> cron refresh

I don't think adding a minute-cron is justified here for most consoles. That's why I was specifically recommending adding a new task to the task queue (with de-duplication) instead. This will scale gerrit load proportionally to actual usage (modulo deduplication), while cron will be O(consoles).
Dec 9 2017
SG, but note that it requires capping task frequency, i.e. if we have 10 consoles with the same repo and they are accessed 10/s, we'd create 100 tasks per second unless we cap tasks. We could use memcache again for capping, per repo. E.g.
key := repo + "|" + now().Truncate(time.Minute).String()
if err := memcache.Add(key); err != ErrAlreadyExists {
    if err := enqueueTask(); err != nil {
        memcache.Remove(key)
    }
}
Dec 9 2017
Exactly, that's what I meant by deduping. Plus, to ensure this also works without memcache, the TQ task deduplication id should be the same `key` as above.
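
A sketch of that deduplication, assuming the same hypothetical package as above plus crypto/sha256, fmt, net/url, and time in the imports; the queue name and handler path are invented, while taskqueue.ErrTaskAlreadyAdded is the sentinel the App Engine taskqueue package returns for a duplicate named task:

func enqueueRefresh(ctx context.Context, repo string) error {
    // Task names may only contain [a-zA-Z0-9_-], so hash the repo URL and
    // append the timestamp truncated to a minute.
    minute := time.Now().UTC().Truncate(time.Minute).Unix()
    name := fmt.Sprintf("refresh-%x-%d", sha256.Sum256([]byte(repo)), minute)

    t := taskqueue.NewPOSTTask("/internal/refresh-history", url.Values{"repo": {repo}})
    t.Name = name
    if _, err := taskqueue.Add(ctx, t, "refresh-history"); err != nil && err != taskqueue.ErrTaskAlreadyAdded {
        return err
    }
    return nil
}

The named task doubles as the rate cap: at most one refresh per repo per minute can be enqueued, no matter how many consoles share the repo.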
Dec 19 2017
I'm just going to add a 30s cache in front of gitiles log requests that specify a ref instead of a hash. I'd like to eventually do something where Milo does ls-remote in the background so that it always has a fresh copy of the refs for every console, and then does sha-based log queries, but that's a larger change.
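
Roughly what a 30s read-through cache keyed on (repo, ref) could look like, reusing the hypothetical helpers from the sketches above; the actual change (milo/git/history.go, referenced later in this thread) may differ:

func cachedLog(ctx context.Context, repo, ref string) ([]byte, error) {
    key := "log|" + repo + "|" + ref
    if item, err := memcache.Get(ctx, key); err == nil {
        return item.Value, nil // served from cache, at most 30s stale
    }
    data, err := fetchHistory(ctx, repo, ref) // direct gitiles log RPC
    if err != nil {
        return nil, err
    }
    // Only ref-based queries need the short TTL; a sha never changes, so those
    // responses could be cached much longer.
    memcache.Set(ctx, &memcache.Item{Key: key, Value: data, Expiration: 30 * time.Second})
    return data, nil
}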
Dec 19 2017
> I'd like to eventually do something where Milo does ls-remote in the background so that it always has a fresh copy of the refs for every console, and then does sha-based log queries, but that's a larger change.

See c#8.
Dec 19 2017
ah yeah... I mean, really, the right thing to do is to have gitiles/gerrit push to pubsub...
Dec 19 2017
Adding tracking labels.
Dec 19 2017
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/301bd9c67b285282a631cd4d190bf0754a321490

commit 301bd9c67b285282a631cd4d190bf0754a321490
Author: Robert Iannucci <iannucci@chromium.org>
Date: Tue Dec 19 20:58:37 2017

[milo] Start caching git.GetHistory responses.

Also add a metric for GetHistory so we can have more information to diagnose.

R=nodir@chromium.org, vadimsh@chromium.org
Bug: 793494
Change-Id: Id0072e5cdc834ca2faf52fc10305822adf14d270
Reviewed-on: https://chromium-review.googlesource.com/834797
Commit-Queue: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/301bd9c67b285282a631cd4d190bf0754a321490/milo/git/history.go
Dec 20 2017
The caching solution has been deployed to prod and we're seeing a lot of cache hits. Going to close this one as fixed.
Dec 20 2017
Issue 795882 has been merged into this issue.