Milo panics |
||||||
Issue descriptionI have a NodeJS AppEngine application that (as a temporary solution until ChromeOS moves fully to swarming and we can use BuildBucket V2) polls data from Buildbucket backwards in time and for each build queries Milo BuildInfo. Occasionally, when starting the app, I get 500s back from Milo which correspond to the given log entries: https://pantheon.corp.google.com/logs/viewer?project=luci-milo&minLogLevel=0&expandAll=false×tamp=2018-08-14T19:23:12.983000000Z&customFacets=&limitCustomFacetWidth=true&interval=PT1H&resource=gae_app%2Fmodule_id%2Fdefault&scrollTimestamp=2018-08-14T19:22:46.002846000Z&filters=path:prpc&filters=status:500&filters=useragent:the-em-drive&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&dateRangeStart=2018-08-14T18:23:12.983Z&dateRangeEnd=2018-08-14T19:23:12.983Z It seems (to a layman) like it might be a race condition with Buildbucket because it only happens when I first start the app, and only sometimes. That means it's trying to fetch BuildInfo for a very recently created build in Buildbucket. If I can successfully load the first (few?) builds, then it does not appear to error ever, even after thousands of builds.
,
Aug 14
I take back my last, these panics seem to happen at other times as well. Apparently I just had a lucky run at one point. I am seeing a LOT of 500s come back in the last few runs: Milo error in buildbucketID: 8938177490901835840 with status code: 500 Milo error in buildbucketID: 8938177490901835840 with status code: 500 Milo error in buildbucketID: 8938177489872362176 with status code: 500 Milo error in buildbucketID: 8938177489872362176 with status code: 500 Milo error in buildbucketID: 8938178052623895472 with status code: 500 Milo error in buildbucketID: 8938177782485887184 with status code: 500 Milo error in buildbucketID: 8938178052623895472 with status code: 500 Milo error in buildbucketID: 8938177782485887184 with status code: 500 Loaded chunk. Builds: 100. Stages: 2274. Loaded chunk. Builds: 200. Stages: 4955. Milo error in buildbucketID: 8938183888018869568 with status code: 500 Milo error in buildbucketID: 8938183888018869568 with status code: 500 All pending builds up to date
,
Aug 14
,
Aug 15
Look like the error here isn't being checked? I'm not a Go expert though :p https://chromium.googlesource.com/infra/luci/luci-go/+/master/milo/buildsource/swarming/buildinfo.go#89
,
Aug 15
Sorry for so many posts, it looks like it's specific builds causing this. A simple repro case is buildbucket ID 8938095596697800400: https://luci-milo.appspot.com/rpcexplorer/services/milo.BuildInfo/Get?request={%20%20%20%20%22buildbucket%22:%20{%20%20%20%20%20%20%20%20%22id%22:%20%228938095596697800400%22%20%20%20%20}}
,
Aug 15
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-go.git/+/2c62790ac1d94297cf4fc61127ade194c0965b7f commit 2c62790ac1d94297cf4fc61127ade194c0965b7f Author: Nodir Turakulov <nodir@google.com> Date: Wed Aug 15 21:50:03 2018 [milo] Check error returned by ReadAnnotations GetBuildInfo in swarming/buildinfo.go didn't check error returned by ReadAnnotations Bug: 874196 Change-Id: I248c25451c5b5b442055c5541a03cbd3946d626c Reviewed-on: https://chromium-review.googlesource.com/1176383 Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> Commit-Queue: Nodir Turakulov <nodir@chromium.org> [modify] https://crrev.com/2c62790ac1d94297cf4fc61127ade194c0965b7f/milo/buildsource/swarming/buildinfo.go
,
Aug 15
it does not panic anymore, but now it returns 404...
,
Aug 15
ah, i wasn't signed in. Works now
,
Aug 15
and thanks for finding the bug!
,
Aug 15
Woo hoo! Thank you Nodir, much appreciated!
,
Aug 17
I am still seeing a lot of panics :( https://pantheon.corp.google.com/logs/viewer?project=luci-milo&minLogLevel=0&expandAll=false×tamp=2018-08-16T23:59:33.245000000Z&customFacets=&limitCustomFacetWidth=true&interval=PT1H&resource=gae_app%2Fmodule_id%2Fdefault&scrollTimestamp=2018-08-16T23:59:23.707466000Z&filters=path:prpc&filters=status:500&filters=useragent:the-em-drive&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&dateRangeStart=2018-08-16T22:59:33.245Z&dateRangeEnd=2018-08-16T23:59:33.245Z
,
Aug 17
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-go.git/+/a3757991f11030392006c198123574fe0165b29b commit a3757991f11030392006c198123574fe0165b29b Author: Nodir Turakulov <nodir@google.com> Date: Fri Aug 17 22:13:21 2018 [milo] Fix panics Also simplify code. Bug: 874196 Change-Id: Ic02fe3564f6953c7bde5b6d9583615e315c45fb5 Reviewed-on: https://chromium-review.googlesource.com/1180164 Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org> Commit-Queue: Nodir Turakulov <nodir@chromium.org> [modify] https://crrev.com/a3757991f11030392006c198123574fe0165b29b/milo/buildsource/swarming/build.go [modify] https://crrev.com/a3757991f11030392006c198123574fe0165b29b/milo/buildsource/swarming/expectations/build-running-logdog-no-annotation-stream.json [modify] https://crrev.com/a3757991f11030392006c198123574fe0165b29b/milo/frontend/expectations/swarming.build-build-running-logdog-no-annotation-stream.html
,
Aug 17
,
Aug 17
Just ran the GAE app again, of 5,000 BuildInfos loaded I only had 3 requests return 500s, and they don't look to be panics: https://pantheon.corp.google.com/logs/viewer?project=luci-milo&minLogLevel=0&expandAll=false×tamp=2018-08-17T23:01:49.482000000Z&customFacets=&limitCustomFacetWidth=true&interval=PT1H&resource=gae_app%2Fmodule_id%2Fdefault&scrollTimestamp=2018-08-17T23:00:04.060992000Z&filters=path:prpc&filters=status:500&filters=useragent:the-em-drive&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&dateRangeStart=2018-08-17T22:01:49.482Z&dateRangeEnd=2018-08-17T23:01:49.482Z Looks like something in Logdog maybe, but no big deal at 0.06% failure rate. I think we can call this fixed, thank you so much for the work on this Nodir! :D |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by no...@chromium.org
, Aug 14