Logdog silently fails on some perf bots |
||||||
Issue descriptionI'm perf sheriffing today. I'm following the link memory.top_10_mobile memory.top_10_mobile memory.top_10_mobile (Same thing if I just click on stdout) From https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5X%20WebView%20Perf%20%282%29/builds/4889 Which links to https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_WebView_Perf__2_%2F4889%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile%2F0%2Flogs%2Fjson.output%2F0 Logdog is stuck on "loading streams..." Devtools shows: Resolved log stream: StreamPath https://luci-logdog.appspot.com/prpc/logdog.Logs/Get Failed to load resource: the server responded with a status of 401 () https://luci-logdog.appspot.com/prpc/logdog.Logs/Get Failed to load resource: the server responded with a status of 404 () elements.html:16857 StreamPath ": Is not found:" GrpcError "; retrying after 5000ms..." (anonymous) @ elements.html:16857 elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 () send @ elements.html:12862 send @ elements.html:13182 call @ elements.html:13422 (anonymous) @ elements.html:16169 RetryIterator.do @ elements.html:16384 Client.callOp @ elements.html:16156 Client.call @ elements.html:16149 Client.get @ elements.html:16415 (anonymous) @ elements.html:16810 (anonymous) @ elements.html:16382 elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 10000ms..." (anonymous) @ elements.html:16857 onErr @ elements.html:16380 elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 () send @ elements.html:12862 send @ elements.html:13182 call @ elements.html:13422 (anonymous) @ elements.html:16169 RetryIterator.do @ elements.html:16384 Client.callOp @ elements.html:16156 Client.call @ elements.html:16149 Client.get @ elements.html:16415 (anonymous) @ elements.html:16810 (anonymous) @ elements.html:16382 elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..." (anonymous) @ elements.html:16857 onErr @ elements.html:16380 elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 () send @ elements.html:12862 send @ elements.html:13182 call @ elements.html:13422 (anonymous) @ elements.html:16169 RetryIterator.do @ elements.html:16384 Client.callOp @ elements.html:16156 Client.call @ elements.html:16149 Client.get @ elements.html:16415 (anonymous) @ elements.html:16810 (anonymous) @ elements.html:16382 elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..." (anonymous) @ elements.html:16857 onErr @ elements.html:16380 elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 () send @ elements.html:12862 send @ elements.html:13182 call @ elements.html:13422 (anonymous) @ elements.html:16169 RetryIterator.do @ elements.html:16384 Client.callOp @ elements.html:16156 Client.call @ elements.html:16149 Client.get @ elements.html:16415 (anonymous) @ elements.html:16810 (anonymous) @ elements.html:16382 elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..." (anonymous) @ elements.html:16857 onErr @ elements.html:16380 elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 () send @ elements.html:12862 send @ elements.html:13182 call @ elements.html:13422 (anonymous) @ elements.html:16169 RetryIterator.do @ elements.html:16384 Client.callOp @ elements.html:16156 Client.call @ elements.html:16149 Client.get @ elements.html:16415 (anonymous) @ elements.html:16810 (anonymous) @ elements.html:16382 elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..." (anonymous) @ elements.html:16857 onErr @ elements.html:16380
,
Apr 13 2017
It looks like the logs aren't being uploaded on this bot. The reason can be seen in the overall BuildBot build logs:
[E2017-04-10T17:57:01.231228-07:00 3493 0 service.go:62] Failed to load private key JSON - open /creds/service_accounts/service-account-luci-logdog-publisher.json: no such file or directory
[E2017-04-10T17:57:01.231391-07:00 3493 0 pubsubOutput.go:224] Failed to send PubSub message. {"error":"rpc error: code = 3 desc = transport: open /creds/service_accounts/service-account-luci-logdog-publisher.json: no such file or directory", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
Every bot should have these credentials, so +Labs to look into why this one doesn't.
LogDog should also not allow the build to proceed if the credential file is missing, though, so there is also a LogDog bug. I'll look into why.
,
Apr 13 2017
+vhang@ from labs b/c this is really important to handle and I want to jump the triage queue :)
,
Apr 13 2017
OK something is actually going seriously wrong here. LogDog HAD the credential at some point, as is evidenced by (a) the log registration process succeeding, and (b) the initial steps having LogDog logs. Neither of those would have been possible if the credential was missing at the time, and I have confirmed that LogDog will have displayed a lot of error messages and failed very fast if that was the case. This means that the credential file disappeared sometime during the build. LogDog's auth library caches tokens, and my guess is that when it moved to refresh the token, the file was gone, hence the errors. The dominant problem here is: WHY is the credential file disappearing!? It's a Puppet-managed file. Maybe Puppet is messing with it? I think this is actually very much a labs problem. I'm going to look into caching whether or not it's appropriate to cache the JSON file once it's initially loaded. +vadimsh@ designed the authentication system, and may have had a specific reason in mind for not doing this. Otherwise, I'll implement that so that LogDog functions consistently throughout the build regardless of what system junk happens after launch.
,
Apr 13 2017
Looks like a case of bad timing? https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5X%20WebView%20Perf%20%282%29/builds/4889 was running on build76-b1 which was removed by https://chromium-review.googlesource.com/c/474048 while the build was in flight. Puppet picked up the fact that it was no longer a buildbot slave and removed that service account mid build?
,
Apr 13 2017
OK this is hilarious, good find pschmidt@ :) Still looking into caching...
,
Apr 13 2017
,
Apr 13 2017
As expected, caching is not desirable b/c we might actually want to update the JSON.
,
Apr 17 2017
This is rare as it is, but will go away completely once we're running builds on swarming and we implement the service mutex to make puppet run only when tasks aren't. |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by primiano@chromium.org
, Apr 13 2017