
Issue 761077

Starred by 3 users

Issue metadata

Status: Assigned
OS: Android
Pri: 2
Type: Bug

Blocking:
issue 748145




Certain containers experience consistent network flakiness, always fixed by host reboot

Project Member Reported by bpastene@chromium.org, Aug 31 2017

Issue description

For example:
https://chromium-swarm.appspot.com/bot?id=build272-m1--device7
https://chromium-swarm.appspot.com/bot?id=build939-m4--device1
https://chromium-swarm.appspot.com/bot?id=build929-m4--device7

The quarantining logic doesn't see anything amiss, but something is clearly wrong with these bots/devices. Their most common failure mode is a timeout: they pass the tests just fine, but run them a bit slower than a normal, healthy bot.

Maybe the devices' CPUs aren't fully unthrottled before a test? Filing this to track how to prevent and detect this kind of failure.
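
One possible way to detect such bots, sketched here under assumptions: count how often each bot's tasks end in a timeout and flag the bots whose timeout rate is well above what a healthy bot shows. The `(bot_id, outcome)` record shape, the outcome strings, and both thresholds are hypothetical, not taken from swarming or the existing quarantining logic.

```python
from collections import Counter


def flag_slow_bots(task_results, min_tasks=5, max_timeout_rate=0.2):
    """Return the sorted ids of bots whose timeout rate exceeds max_timeout_rate.

    task_results: iterable of (bot_id, outcome) pairs. The outcome strings
    ("success", "failure", "timeout") are hypothetical labels.
    Bots with fewer than min_tasks tasks are skipped as too noisy to judge.
    """
    totals = Counter()
    timeouts = Counter()
    for bot_id, outcome in task_results:
        totals[bot_id] += 1
        if outcome == "timeout":
            timeouts[bot_id] += 1

    flagged = []
    for bot_id, total in totals.items():
        if total >= min_tasks and timeouts[bot_id] / total > max_timeout_rate:
            flagged.append(bot_id)
    return sorted(flagged)
```

A bot that times out on half of its recent tasks would be flagged, while a bot with only a couple of tasks on record would be left alone until more data is available.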
 
Blocking: 748145
Summary: Certain android swarming bots experience elevated rate of failures (was: Certain android swarming bos experience elevated rate of failures)
Re https://chromium-swarm.appspot.com/bot?id=build272-m1--device7 :
It seems that a container can get into a very strange state in which it fails to upload its logdog logcat stream and continuously receives errors from pubsub until the test times out entirely (see https://chromium-swarm.appspot.com/task?id=3851c3cb2c634810). This state lasts until the host reboots, which it does every 24 hours. We appear to get into this state occasionally around the fleet (expand the list of past tasks):
https://chromium-swarm.appspot.com/bot?id=build272-m1--device6
https://chromium-swarm.appspot.com/bot?id=build932-m4--device6
I think I'll have to catch a bot in this state before it reboots to debug what's going on.
Re https://chromium-swarm.appspot.com/bot?id=build939-m4--device1:
From taking video recordings of the screen while it was running tests, it looks like this device was getting randomly screen-tapped by *something*. After filing https://gutsv3.corp.google.com/#ticket/28461143, it looks like the device is back to normal. So... *shrug*
https://chromium-swarm.appspot.com/bot?id=build270-m1--device1
failed for me here:
https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/380354

Looks like it times out more often than it succeeds.
I've clicked on "Shut Down Gracefully".
Project Member

Comment 7 by bugdroid1@chromium.org, Sep 16 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/71636de5c745f9a295f64585b013e5d3b70c771c

commit 71636de5c745f9a295f64585b013e5d3b70c771c
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Sat Sep 16 00:31:34 2017

Project Member

Comment 9 by bugdroid1@chromium.org, Sep 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/b22be092cc49c5838f55f9b3de0ede807a7e8a1a

commit b22be092cc49c5838f55f9b3de0ede807a7e8a1a
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Mon Sep 18 16:40:24 2017

Revert "swarming: Roll py-adb to e3ec66"

This reverts commit 7a7e279b2aca09b1db05bd9fc57773ce5665c5b6.

Reason for revert: broken

Original change's description:
> swarming: Roll py-adb to e3ec66
> 
> TBR=maruel@chromium.org
> Bug: 761077
> Change-Id: I36d2ccc4e8009a018069321d1c5d4689c54ce2e6
> Reviewed-on: https://chromium-review.googlesource.com/671117
> Commit-Queue: Benjamin Pastene <bpastene@chromium.org>
> Reviewed-by: Benjamin Pastene <bpastene@chromium.org>

TBR=bpastene@chromium.org

Change-Id: I67904c85a12aeb0e2498dc1e98dc0afe0ddc1f12
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 761077
Reviewed-on: https://chromium-review.googlesource.com/671204
Reviewed-by: Benjamin Pastene <bpastene@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>

[modify] https://crrev.com/b22be092cc49c5838f55f9b3de0ede807a7e8a1a/appengine/third_party/python-adb/README.swarming
[modify] https://crrev.com/b22be092cc49c5838f55f9b3de0ede807a7e8a1a/appengine/third_party/python-adb/adb/contrib/adb_commands_safe.py

Project Member

Comment 10 by bugdroid1@chromium.org, Sep 18 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/8a1250baa1d6dddbda119c84060088838e74f5f0

commit 8a1250baa1d6dddbda119c84060088838e74f5f0
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Mon Sep 18 17:23:55 2017

Project Member

Comment 11 by bugdroid1@chromium.org, Sep 18 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/816d74e36d10198508dc6722217f9a039bf5da40

commit 816d74e36d10198508dc6722217f9a039bf5da40
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Mon Sep 18 19:12:52 2017

Project Member

Comment 12 by bugdroid1@chromium.org, Sep 18 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/b8116430303d1a9b0d3d72c70a092699c25df16a

commit b8116430303d1a9b0d3d72c70a092699c25df16a
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Mon Sep 18 20:00:22 2017

Project Member

Comment 13 by bugdroid1@chromium.org, Sep 20 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/a453020e388707ceae29d3c7547c436203b76739

commit a453020e388707ceae29d3c7547c436203b76739
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Wed Sep 20 22:57:05 2017

Random and disruptive screen taps shouldn't be a problem any more. After the change in #12, we now detect and quarantine any devices that get tapped. This led to the detection of a particularly tap-happy device, which was reported in t/28767486. It seems the USB cables we use in the racks have just the right conductivity to register as screen taps.

The network problem is still unsolved, however. About once a day, one or more containers on a particular bot will start having network troubles. This manifests as frequent timeouts when either fetching files from isolate or uploading task logs to logdog. It persists across container restarts but not across host reboots, which indicates that the cause is something on the host itself. Maybe something goes awry when the Docker engine initializes.
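
Since the problem appears to live on the host rather than in any one container, it may help to group the affected containers by host when looking for a pattern. The sketch below assumes bot ids follow the `<host>--<device>` shape seen in the links in this issue (e.g. `build272-m1--device7`); that naming convention is inferred from those ids, and `group_by_host` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_host(bot_ids):
    """Group container bot ids by host.

    Assumes ids look like '<host>--<device>'. Ids without the '--'
    separator are kept under their full name with an empty device.
    """
    hosts = defaultdict(list)
    for bot_id in bot_ids:
        # partition() returns (whole_id, '', '') when '--' is absent.
        host, _, device = bot_id.partition("--")
        hosts[host].append(device)
    return {host: sorted(devices) for host, devices in hosts.items()}
```

Several flaky containers landing under the same host key would be consistent with a host-level cause.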
Project Member

Comment 15 by bugdroid1@chromium.org, Sep 22 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/0073f69b56ad33b22500ab4c2432060565b84ec8

commit 0073f69b56ad33b22500ab4c2432060565b84ec8
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Fri Sep 22 17:44:08 2017

I've been looking around for the network failures mentioned in c#14, but I haven't seen any in over a week. It's nice that it's not slowing down our tests any longer, but it worries me that it came and went and I still have no clue what caused it. Oh well, as long as it never shows up again...

Found a bot with some wonky font settings, which were screwing up its tests. Potential fix in https://chrome-internal-review.googlesource.com/472794

Also found a bot that really likes to fail tests with window-focus related errors. Will try to debug:
https://chromium-swarm.appspot.com/bot?id=build274-m1--device4
Summary: Certain containers experience consistent network flakiness, always fixed by host reboot (was: Certain android swarming bots experience elevated rate of failures)
Found another occurrence of the network problem:
https://chromium-swarm.appspot.com/task?id=3a4100f5ff856f10
https://chromium-swarm.appspot.com/task?id=3a418ca748c82410
https://chromium-swarm.appspot.com/task?id=3a404b98aafed610
(all on the same bot, all within 24 hours)
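
A rough way to spot this pattern automatically, assuming the timestamps of a bot's network-related task failures are available: check whether enough of them fall inside a single 24-hour window. The function name, the default of three failures, and the input shape are all hypothetical.

```python
from datetime import datetime, timedelta


def has_failure_cluster(failure_times, min_failures=3, window=timedelta(hours=24)):
    """Return True if at least min_failures timestamps fall within one window."""
    times = sorted(failure_times)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= min_failures:
            return True
    return False
```

Three failures on one bot spread over 20 hours would count as a cluster, while three failures spread over several days would not.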
Issue 792438 has been merged into this issue.
