
Issue 859995

Starred by 3 users

Issue metadata

Status: Fixed
Owner: ----
Closed: Jul 27
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




step presentation getting cut off in milo

Project Member Reported by mknyszek@google.com, Jul 3

Issue description

Since this morning we've (Fuchsia) been getting builds in Milo which look like https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-release-qemu_kvm/b8942001274139108480. Looking at the swarming task output, there seems to be a failure in publishing the final annotations proto, but the rendered proto that's dumped seems to be totally correct (https://chromium-swarm.appspot.com/task?id=3e7a137ddcbd4c10&refresh=10&show_raw=1&wide_logs=true).

Only a subset of our builders have actually been manufacturing these weird faulty protos, one of which is https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug-qemu_kvm.
 
Project Member

Comment 1 by bugdroid1@chromium.org, Jul 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/95177b6a2eaebbd1319d072b02ba83e041b404df

commit 95177b6a2eaebbd1319d072b02ba83e041b404df
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Tue Jul 03 19:14:00 2018

logdog: fix typo in butler's error reporting.

R=hinoka, iannucci

Bug:  859995 
Change-Id: I3cf8a0f92126275725217acf0e126f09b57133fc
Reviewed-on: https://chromium-review.googlesource.com/1125095
Commit-Queue: Andrii Shyshkalov <tandrii@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>

[modify] https://crrev.com/95177b6a2eaebbd1319d072b02ba83e041b404df/logdog/client/butlerproto/proto.go

Project Member

Comment 2 by bugdroid1@chromium.org, Jul 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/2b38bf7165235a9e5fe966024eea29904fee8bce

commit 2b38bf7165235a9e5fe966024eea29904fee8bce
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Tue Jul 03 21:47:25 2018

logdog butler: temporary verbosity++ proto marshaling error.

Will be reverted once we find actual bug.

R=hinoka, iannucci

Bug:  859995 
Change-Id: I183a7d3b719f18f693f3100632654e8857d2b926
Reviewed-on: https://chromium-review.googlesource.com/1125262
Reviewed-by: Ryan Tseng <hinoka@chromium.org>
Commit-Queue: Andrii Shyshkalov <tandrii@chromium.org>

[modify] https://crrev.com/2b38bf7165235a9e5fe966024eea29904fee8bce/logdog/client/butlerproto/proto.go

Any update on this? We're still experiencing this issue, for most of our builds in garnet :( https://luci-milo.appspot.com/p/fuchsia/g/garnet/console
Status: Available (was: Untriaged)
I can't tell what's wrong with the presentation by looking at the build in #0. It sounds like tandrii either saw the presentation issue or just increased the error reporting regardless. But he's out for a while, so could you state the exact presentation issue?

I see that now your collect step on garnet-x64-debug-qemu_kvm is hanging for 18 hours, but I don't see that in the build in #0.
In the original comment I pointed to the swarming task; it seems that there's a failure to construct a pubsub message containing the annotations proto (or something). The annotations proto is dumped as a result of the failure, but it looks sane overall, so the presentation in Milo is incorrect. That is, there isn't a step that's actually hanging for 18 hours; the build has passed successfully, as it says at the top. At some point, the annotations proto emission is failing.

Newer example: https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug-qemu_kvm/b8941164804309458496
Swarming task output: https://chromium-swarm.appspot.com/task?id=3ea99fbb48689d10&refresh=10&show_raw=1&wide_logs=true

After Andrii added the logging, I don't really see what changed, but I don't really know what kind of results we're expecting.

This is causing problems for a number of our builders, and I'm not really sure how to proceed.
The version with the logging hasn't been deployed, so we still don't know why Kitchen can't serialize a proto, even though it is able to render it as text...
Also, the display issues seem to be totally on the Milo side. If I query with buildbucket v2 steps, I actually get the full thing no problem (which means buildbucket has no issues ingesting this proto).
The problem is in sending steps to Logdog. Milo fetches them from Logdog, but Buildbucket v2 doesn't. IIRC, Kitchen serializes steps to JSON before uploading them to Isolate (from where BB grabs them).

So it appears text and JSON serialization work, but the binary one doesn't.
Err.. what I said is wrong. Steps serialize just fine. Some log chunk fails to be serialized to proto, and it probably clogs/breaks logdog butler.
I am still not clear on what's unexpected in https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug-qemu_kvm/b8941164804309458496
Could you please state the obvious here (what's expected, what's actual) so there is no room for guessing?
That's very odd actually, that link was showing something different when I posted it... I think I understand why there has been so much confusion over this bug.

Let me try and find a few current examples to increase the probability that it doesn't fix itself:

https://luci-milo.appspot.com/p/fuchsia/builders/luci.fuchsia.ci/garnet-arm64-debug-qemu_kvm/b8940642753963801792

https://luci-milo.appspot.com/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug-qemu_kvm/b8940642591772987088

https://luci-milo.appspot.com/p/fuchsia/builders/luci.fuchsia.ci/garnet-arm64-release-qemu_kvm/b8940642518839583264

https://luci-milo.appspot.com/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-release-qemu_kvm/b8940642390168927376

These are the most recent builds for those builders at the time of writing. You'll notice that the "collect" step at the bottom _looks_ purple and is timing out, but when you look at the corresponding Swarming task you see the errors mentioned multiple times earlier in this ticket.

Also, the status of the overall task (success/failure) is correct, so it's really just getting cut off here.
We need to deploy a new kitchen with fixed logging to proceed with identifying why protobuf serialization fails. I'll be doing this today (for an unrelated reason...).
Project Member

Comment 13 by bugdroid1@chromium.org, Jul 18

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/3014d29145cad52b96dded8e1cffa7ab41ddddc0

commit 3014d29145cad52b96dded8e1cffa7ab41ddddc0
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jul 18 20:43:56 2018

Project Member

Comment 14 by bugdroid1@chromium.org, Jul 18

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/6d851ad604a64afe293b646c0b4f1f875a182f69

commit 6d851ad604a64afe293b646c0b4f1f875a182f69
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jul 18 22:06:06 2018

Next time this happens, we should have some useful logs.
For posterity, here's the diff of protobuf lib that was deployed with the new kitchen: https://chromium.googlesource.com/external/github.com/golang/protobuf.git/+log/3a3da3a4e2..9eb2c01ac278a5

It is possible it "fixed" the problem (in particular, the "proto: revert UTF-8 validation for proto2" commit seems relevant, though we do not use proto2...).

We'll see.
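
For reference, if the marshaling failure really was tied to UTF-8 validation of string fields (this thread never confirms that), a log line could be checked and cleaned before it is put into a proto. This is only a minimal sketch using the Go standard library: `sanitizeLogLine` is a hypothetical helper, not something that exists in logdog or kitchen, and the sample lines are made up.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeLogLine is a hypothetical helper (not part of logdog or kitchen).
// It returns the line unchanged when it is already valid UTF-8. Otherwise it
// replaces each run of invalid bytes with U+FFFD and reports the change.
func sanitizeLogLine(line string) (clean string, changed bool) {
	if utf8.ValidString(line) {
		return line, false
	}
	return strings.ToValidUTF8(line, "\uFFFD"), true
}

func main() {
	// Made-up sample lines; the second one contains raw non-UTF-8 bytes.
	lines := []string{
		"collect: all tasks finished",
		"qemu output: \xff\xfe garbage",
	}
	for _, l := range lines {
		clean, changed := sanitizeLogLine(l)
		fmt.Printf("changed=%v %q\n", changed, clean)
	}
}
```

Logging the `changed` flag next to the stream name would show which log chunk carried the bad bytes, which is the kind of information the extra butler logging was meant to surface.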
Status: Fixed (was: Available)
I haven't seen the issue or gotten notified about it in some time, so I'm going to go ahead and say this is OK.

I will reopen if I see this again.
