New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 793494 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 793514



Sign in to add a comment

Frequent 500 errors on https://ci.chromium.org/p/chromium/g/chromium.clang/console

Project Member Reported by h...@chromium.org, Dec 8 2017

Issue description

We have a dashboard set up to display
https://ci.chromium.org/p/chromium/g/chromium.clang/console?reload=60

However, it frequently, as in multiple times per hour, fails with "GetHistory: failed to fetch ... status code 500"

See the attached screenshot.
 
IMG_2061.JPG
4.3 MB View Download
Cc: tandrii@chromium.org
This happens when gitiles is sad :(

Gitiles must be pretty unreliable if a user is reported to hit this many times an hour.  We should augment Milo in such a way to insulate ourselves from these type of failures.  I think a simple way at first would be to set a timeout of 2s (or something reasonable), and serve cached data if a response isn't met by then.
timeout of what?
  if of gitiles request -- not good, that'll make our console slower :(

SGTM for cached data though.

Comment 4 by no...@chromium.org, Dec 8 2017

how about we always try to contact gitiles, but if does not respond in a minute we use the potentially stale memcached data. And if it does reply, we refresh memcache
^ I think that's exactly what I'm thinking

Anyways, quick and dirty metric:
https://app.google.stackdriver.com/metrics-explorer?project=luci-milo&metric=logging.googleapis.com%2Fuser%2Fconsole-500
ah, so it's not 500s for the most part, it's a timeout.
I wish we can tell to users that it's not MILO, but Gitiles which is slow.

+1 for metrics.

What would it take to serve cached request immediately (if available), but also
task_queue.enque(gitiles.query_with_looooong_timeout_and_update_cache(ref))
?

Comment 7 by no...@chromium.org, Dec 8 2017

yeah, tandrii, that's better. serving code path always reads memcache and falls back to direct gitiles rpc only if data is not in memcache. a cron refreshes memcache periodically
> cron refresh

I don't think adding a minute-cron is justified here for most consoles.

That's why I was specifically recommending adding a new task to task queue (with de-duplication) instead. This will scale gerrit load proportional to actual usage % deduplication, while cron will be O(consoles).

Comment 9 by no...@chromium.org, Dec 9 2017

sg, but note that it requires caping task frequency, i.e. if we have 10 consoles with the same repo and they are accessed 10/s, we'd create 100 tasks per sec unless we cap tasks. We could use memcache again for caping, per repo. E.g. 

  key = (repo, truncate(now(), minute))
  err = memcache.add(key)
  if err != already exists:
     if err = enqueue_task(); err != nil 
        memcache.remove(key)        


exactly, that's what i meant by deduping + to ensure this also works without memcache, tq task deduplication id should be the same `key` as above.
Status: Available (was: Untriaged)
I'm just going to add a 30s cache in front of gitiles log requests which specify a ref instead of a hash.

I'd like to eventually do something where milo does ls-remote in the background so that it always has a fresh copy of the refs for every console, and then does sha-based log queries, but that's a larger change.

Comment 13 by no...@chromium.org, Dec 19 2017

> I'd like to eventually do something where milo does ls-remote in the background so that it always has a fresh copy of the refs for every console, and then does sha-based log queries, but that's a larger change.

see c#8

Comment 14 by no...@chromium.org, Dec 19 2017

Labels: -Pri-3 Pri-1
Owner: iannucci@chromium.org
Status: Started (was: Available)
ah yeah... I mean, really, the right thing to do is to have gitiles/gerrit push to pubsub...

Comment 16 by no...@chromium.org, Dec 19 2017

Blocking: 793514

Comment 17 by efoo@chromium.org, Dec 19 2017

Labels: LUCI-M4-S14 LUCI-M4-Migration
Adding tracking labels. 
Project Member

Comment 18 by bugdroid1@chromium.org, Dec 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/301bd9c67b285282a631cd4d190bf0754a321490

commit 301bd9c67b285282a631cd4d190bf0754a321490
Author: Robert Iannucci <iannucci@chromium.org>
Date: Tue Dec 19 20:58:37 2017

[milo] Start caching git.GetHistory responses.

Also add a metric for GetHistory so we can have more information to
diagnose.

R=nodir@chromium.org, vadimsh@chromium.org

Bug:  793494 
Change-Id: Id0072e5cdc834ca2faf52fc10305822adf14d270
Reviewed-on: https://chromium-review.googlesource.com/834797
Commit-Queue: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/301bd9c67b285282a631cd4d190bf0754a321490/milo/git/history.go

Status: Fixed (was: Started)
The caching solution has been deployed to prod and we're seeing a lot of cache hits. Going to close this one as fixed.

Comment 20 by no...@chromium.org, Dec 20 2017

Issue 795882 has been merged into this issue.

Comment 21 by efoo@chromium.org, Jan 31 2018

Labels: -LUCI-M4-Migration LUCI-Migration

Comment 22 by efoo@chromium.org, Feb 13 2018

Labels: -luci-migration LUCI-Chromium-CQSets

Comment 23 by efoo@chromium.org, Feb 13 2018

Labels: -LUCI-Chromium-CQSets LUCI-Chromium

Comment 24 by efoo@chromium.org, Feb 13 2018

Labels: LUCI-Chromium-CQSets

Sign in to add a comment