New issue
Advanced search Search tips

Issue 711244 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

Logdog silently fails on some perf bots

Project Member Reported by primiano@chromium.org, Apr 13 2017

Issue description

I'm perf sheriffing today. I'm following the link

memory.top_10_mobile memory.top_10_mobile
memory.top_10_mobile (Same thing if I just click on stdout)

From

https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5X%20WebView%20Perf%20%282%29/builds/4889

Which links to 

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_WebView_Perf__2_%2F4889%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile%2F0%2Flogs%2Fjson.output%2F0

Logdog is stuck on "loading streams..."

Devtools shows:
Resolved log stream: StreamPath
https://luci-logdog.appspot.com/prpc/logdog.Logs/Get Failed to load resource: the server responded with a status of 401 ()
https://luci-logdog.appspot.com/prpc/logdog.Logs/Get Failed to load resource: the server responded with a status of 404 ()
elements.html:16857 StreamPath ": Is not found:" GrpcError "; retrying after 5000ms..."
(anonymous) @ elements.html:16857
elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 ()
send @ elements.html:12862
send @ elements.html:13182
call @ elements.html:13422
(anonymous) @ elements.html:16169
RetryIterator.do @ elements.html:16384
Client.callOp @ elements.html:16156
Client.call @ elements.html:16149
Client.get @ elements.html:16415
(anonymous) @ elements.html:16810
(anonymous) @ elements.html:16382
elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 10000ms..."
(anonymous) @ elements.html:16857
onErr @ elements.html:16380
elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 ()
send @ elements.html:12862
send @ elements.html:13182
call @ elements.html:13422
(anonymous) @ elements.html:16169
RetryIterator.do @ elements.html:16384
Client.callOp @ elements.html:16156
Client.call @ elements.html:16149
Client.get @ elements.html:16415
(anonymous) @ elements.html:16810
(anonymous) @ elements.html:16382
elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..."
(anonymous) @ elements.html:16857
onErr @ elements.html:16380
elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 ()
send @ elements.html:12862
send @ elements.html:13182
call @ elements.html:13422
(anonymous) @ elements.html:16169
RetryIterator.do @ elements.html:16384
Client.callOp @ elements.html:16156
Client.call @ elements.html:16149
Client.get @ elements.html:16415
(anonymous) @ elements.html:16810
(anonymous) @ elements.html:16382
elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..."
(anonymous) @ elements.html:16857
onErr @ elements.html:16380
elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 ()
send @ elements.html:12862
send @ elements.html:13182
call @ elements.html:13422
(anonymous) @ elements.html:16169
RetryIterator.do @ elements.html:16384
Client.callOp @ elements.html:16156
Client.call @ elements.html:16149
Client.get @ elements.html:16415
(anonymous) @ elements.html:16810
(anonymous) @ elements.html:16382
elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..."
(anonymous) @ elements.html:16857
onErr @ elements.html:16380
elements.html:12862 POST https://luci-logdog.appspot.com/prpc/logdog.Logs/Get 404 ()
send @ elements.html:12862
send @ elements.html:13182
call @ elements.html:13422
(anonymous) @ elements.html:16169
RetryIterator.do @ elements.html:16384
Client.callOp @ elements.html:16156
Client.call @ elements.html:16149
Client.get @ elements.html:16415
(anonymous) @ elements.html:16810
(anonymous) @ elements.html:16382
elements.html:16857 StreamPath {project: "chrome", path: "bb/chromium.perf/Android_Nexus5X_WebView_Perf__2_/…s/steps/memory.top_10_mobile/0/logs/json.output/0"} ": Is not found:" GrpcError {name: "GrpcError", code: 5, description: "path not found↵", message: "code = 5 desc = path not found↵"} "; retrying after 15000ms..."
(anonymous) @ elements.html:16857
onErr @ elements.html:16380
 
(I am signed in)

Comment 2 by d...@chromium.org, Apr 13 2017

Components: Infra>Labs
Owner: d...@chromium.org
Status: Assigned (was: Untriaged)
It looks like the logs aren't being uploaded on this bot. The reason can be seen in the overall BuildBot build logs:

[E2017-04-10T17:57:01.231228-07:00 3493 0 service.go:62] Failed to load private key JSON - open /creds/service_accounts/service-account-luci-logdog-publisher.json: no such file or directory
[E2017-04-10T17:57:01.231391-07:00 3493 0 pubsubOutput.go:224] Failed to send PubSub message.              {"error":"rpc error: code = 3 desc = transport: open /creds/service_accounts/service-account-luci-logdog-publisher.json: no such file or directory", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}


Every bot should have these credentials, so +Labs to look into why this one doesn't.

LogDog should also not allow the build to proceed if the credential file is missing, though, so there is also a LogDog bug. I'll look into why.

Comment 3 by d...@chromium.org, Apr 13 2017

Cc: vhang@chromium.org
Labels: -Pri-2 Pri-1
+vhang@ from labs b/c this is really important to handle and I want to jump the triage queue :)

Comment 4 by d...@chromium.org, Apr 13 2017

Cc: vadimsh@chromium.org
OK something is actually going seriously wrong here. LogDog HAD the credential at some point, as is evidenced by (a) the log registration process succeeding, and (b) the initial steps having LogDog logs. Neither of those would have been possible if the credential was missing at the time, and I have confirmed that LogDog will have displayed a lot of error messages and failed very fast if that was the case.

This means that the credential file disappeared sometime during the build. LogDog's auth library caches tokens, and my guess is that when it moved to refresh the token, the file was gone, hence the errors.

The dominant problem here is: WHY is the credential file disappearing!? It's a Puppet-managed file. Maybe Puppet is messing with it? I think this is actually very much a labs problem.

I'm going to look into caching whether or not it's appropriate to cache the JSON file once it's initially loaded. +vadimsh@ designed the authentication system, and may have had a specific reason in mind for not doing this. Otherwise, I'll implement that so that LogDog functions consistently throughout the build regardless of what system junk happens after launch.

Comment 5 by pschm...@google.com, Apr 13 2017

Looks like a case of bad timing?
  https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5X%20WebView%20Perf%20%282%29/builds/4889 was running on build76-b1 which was removed by https://chromium-review.googlesource.com/c/474048 while the build was in flight.

Puppet picked up the fact that it was no longer a buildbot slave and removed that service account mid build?


Comment 6 by d...@chromium.org, Apr 13 2017

OK this is hilarious, good find pschmidt@ :)

Still looking into caching...

Comment 7 by d...@chromium.org, Apr 13 2017

Cc: -vhang@chromium.org
Components: -Infra>Labs

Comment 8 by d...@chromium.org, Apr 13 2017

Status: WontFix (was: Assigned)
As expected, caching is not desirable b/c we might actually want to update the JSON.

Comment 9 by estaab@chromium.org, Apr 17 2017

This is rare as it is, but will go away completely once we're running builds on swarming and we implement the service mutex to make puppet run only when tasks aren't.

Sign in to add a comment