Certain containers experience consistent network flakiness, always fixed by host reboot
Issue description

Like:
https://chromium-swarm.appspot.com/bot?id=build272-m1--device7
https://chromium-swarm.appspot.com/bot?id=build939-m4--device1
https://chromium-swarm.appspot.com/bot?id=build929-m4--device7

The quarantining logic doesn't see anything amiss, but there's clearly something wrong with these bots/devices. Their most common failure mode is a timeout. They pass the tests just fine, but they run them a bit slower than a normal healthy bot. Maybe the device's CPUs aren't fully unthrottled before a test? Filing this to track how to prevent/detect this kind of failure.
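One way to detect the "passes, but a bit slower than a healthy bot" pattern is to compare each bot's typical task duration against the fleet's. The following is only a sketch under stated assumptions: the input dictionary, the function name `find_slow_bots`, and the 1.25x threshold are all hypothetical and are not part of swarming or the existing quarantining logic.

```python
from statistics import median


def find_slow_bots(durations_by_bot, slowdown_threshold=1.25):
    """Return {bot_id: slowdown_ratio} for bots that look slower than the fleet.

    durations_by_bot: hypothetical dict mapping a bot ID to a list of task
    durations in seconds for one comparable test suite.
    slowdown_threshold: assumed cutoff; a bot is flagged when its median
    duration exceeds the fleet median by more than this factor.
    """
    # Median duration per bot, skipping bots with no recorded tasks.
    per_bot = {bot: median(d) for bot, d in durations_by_bot.items() if d}
    if not per_bot:
        return {}

    # Median of the per-bot medians, so one slow bot barely shifts the baseline.
    fleet = median(per_bot.values())
    if fleet <= 0:
        return {}

    return {
        bot: m / fleet
        for bot, m in per_bot.items()
        if m > fleet * slowdown_threshold
    }
```

Medians are used rather than means so that a single timed-out task does not dominate a bot's score. The durations would need to come from the same test suite to be comparable.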
,
Sep 1 2017
Re https://chromium-swarm.appspot.com/bot?id=build272-m1--device7 : It seems that a container can get in a very strange state where it fails to upload its logdog logcat stream and receives errors continuously from pubsub until the test times out entirely. (See https://chromium-swarm.appspot.com/task?id=3851c3cb2c634810) This state lasts until the host reboots, which it does every 24 hours. It appears we're getting into this state occasionally around the fleet: (expand list of past tasks) https://chromium-swarm.appspot.com/bot?id=build272-m1--device6 https://chromium-swarm.appspot.com/bot?id=build932-m4--device6 I think I'll have to catch a bot in this state before it reboots to debug what's going on.
,
Sep 1 2017
Re https://chromium-swarm.appspot.com/bot?id=build939-m4--device1: From taking video recordings of the screen while it was running tests, it looks like this device was getting randomly screen-tapped by *something*. After filing https://gutsv3.corp.google.com/#ticket/28461143, it looks like the device is back to normal. So... *shrug*
,
Sep 5 2017
https://chromium-swarm.appspot.com/bot?id=build289-m4--device1&sort_stats=total%3Adesc It flaked for me here: https://build.chromium.org/p/tryserver.chromium.android/builders/android_optional_gpu_tests_rel/builds/9836 Looking at recent history, it timed out installing Chrome twice: INFO:devil.utils.timeout_retry:Still working on Install(008d7edd5140b46a, /b/swarming/w/ir/out/Release/apks/ChromePublic.apk, retries=3, timeout=120) https://chromium-swarm.appspot.com/task?id=386c4c05fec59110&refresh=10&show_raw=1 https://chromium-swarm.appspot.com/task?id=386c0ed9d1590910&refresh=10&show_raw=1 Before that it had some problems with logcats: https://chromium-swarm.appspot.com/task?id=3865fe3312bc6b10&refresh=10&show_raw=1 https://chromium-swarm.appspot.com/task?id=38651c999c5e7c10&refresh=10&show_raw=1
,
Sep 7 2017
https://chromium-swarm.appspot.com/bot?id=build270-m1--device1 failed for me here: https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/380354 Looks like it times out more often than it succeeds. I've clicked on "Shut Down Gracefully".
,
Sep 7 2017
https://chromium-swarm.appspot.com/bot?id=build348-m4--device5 failed here: https://build.chromium.org/p/tryserver.chromium.angle/builders/android_angle_deqp_rel_ng/builds/250 Lots of BOT_DIED tasks lately. Turning it off.
,
Sep 16 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/71636de5c745f9a295f64585b013e5d3b70c771c commit 71636de5c745f9a295f64585b013e5d3b70c771c Author: Benjamin Pastene <bpastene@chromium.org> Date: Sat Sep 16 00:31:34 2017
,
Sep 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/7a7e279b2aca09b1db05bd9fc57773ce5665c5b6 commit 7a7e279b2aca09b1db05bd9fc57773ce5665c5b6 Author: Benjamin Pastene <bpastene@chromium.org> Date: Mon Sep 18 16:33:12 2017 swarming: Roll py-adb to e3ec66 TBR=maruel@chromium.org Bug: 761077 Change-Id: I36d2ccc4e8009a018069321d1c5d4689c54ce2e6 Reviewed-on: https://chromium-review.googlesource.com/671117 Commit-Queue: Benjamin Pastene <bpastene@chromium.org> Reviewed-by: Benjamin Pastene <bpastene@chromium.org> [modify] https://crrev.com/7a7e279b2aca09b1db05bd9fc57773ce5665c5b6/appengine/third_party/python-adb/README.swarming [modify] https://crrev.com/7a7e279b2aca09b1db05bd9fc57773ce5665c5b6/appengine/third_party/python-adb/adb/contrib/adb_commands_safe.py
,
Sep 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/b22be092cc49c5838f55f9b3de0ede807a7e8a1a commit b22be092cc49c5838f55f9b3de0ede807a7e8a1a Author: Benjamin Pastene <bpastene@chromium.org> Date: Mon Sep 18 16:40:24 2017 Revert "swarming: Roll py-adb to e3ec66" This reverts commit 7a7e279b2aca09b1db05bd9fc57773ce5665c5b6. Reason for revert: broken Original change's description: > swarming: Roll py-adb to e3ec66 > > TBR=maruel@chromium.org > Bug: 761077 > Change-Id: I36d2ccc4e8009a018069321d1c5d4689c54ce2e6 > Reviewed-on: https://chromium-review.googlesource.com/671117 > Commit-Queue: Benjamin Pastene <bpastene@chromium.org> > Reviewed-by: Benjamin Pastene <bpastene@chromium.org> TBR=bpastene@chromium.org Change-Id: I67904c85a12aeb0e2498dc1e98dc0afe0ddc1f12 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 761077 Reviewed-on: https://chromium-review.googlesource.com/671204 Reviewed-by: Benjamin Pastene <bpastene@chromium.org> Commit-Queue: Benjamin Pastene <bpastene@chromium.org> [modify] https://crrev.com/b22be092cc49c5838f55f9b3de0ede807a7e8a1a/appengine/third_party/python-adb/README.swarming [modify] https://crrev.com/b22be092cc49c5838f55f9b3de0ede807a7e8a1a/appengine/third_party/python-adb/adb/contrib/adb_commands_safe.py
,
Sep 18 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/8a1250baa1d6dddbda119c84060088838e74f5f0 commit 8a1250baa1d6dddbda119c84060088838e74f5f0 Author: Benjamin Pastene <bpastene@chromium.org> Date: Mon Sep 18 17:23:55 2017
,
Sep 18 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/816d74e36d10198508dc6722217f9a039bf5da40 commit 816d74e36d10198508dc6722217f9a039bf5da40 Author: Benjamin Pastene <bpastene@chromium.org> Date: Mon Sep 18 19:12:52 2017
,
Sep 18 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/b8116430303d1a9b0d3d72c70a092699c25df16a commit b8116430303d1a9b0d3d72c70a092699c25df16a Author: Benjamin Pastene <bpastene@chromium.org> Date: Mon Sep 18 20:00:22 2017
,
Sep 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/a453020e388707ceae29d3c7547c436203b76739 commit a453020e388707ceae29d3c7547c436203b76739 Author: Benjamin Pastene <bpastene@chromium.org> Date: Wed Sep 20 22:57:05 2017
,
Sep 22 2017
Random and disruptive screen taps shouldn't be a problem any more. After the change in #12, we're now detecting and quarantining any devices that get tapped. This led to the detection of a particularly tap-happy device, which was reported in t/28767486. It seems the USB cables we use in the racks have just the right conductivity to register as screen taps.

The network problem is still unsolved, however. About once a day, one or more containers on a particular bot will start having network troubles. This manifests as frequent timeouts when either fetching files from isolate or uploading task logs to logdog. It persists across container restarts but not across host reboots, which indicates that the cause is something on the host itself. Maybe something goes awry when the docker engine initializes.
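Since the evidence points at the host rather than any single container, it can help to group the affected bot IDs by host and see which containers share one. This is a minimal sketch, assuming the bot IDs seen in this thread follow a `<host>--<device>` pattern (e.g. `build272-m1--device7`); that pattern is inferred from the IDs above, and the helper names are hypothetical.

```python
from collections import defaultdict


def split_bot_id(bot_id):
    """Split 'build272-m1--device7' into ('build272-m1', 'device7').

    Assumes a '<host>--<device>' naming scheme; IDs without '--' are
    returned as (bot_id, '').
    """
    host, sep, device = bot_id.partition("--")
    if not sep:
        return bot_id, ""
    return host, device


def group_failures_by_host(failing_bot_ids):
    """Map each host to the sorted list of its containers seen failing."""
    hosts = defaultdict(set)
    for bot_id in failing_bot_ids:
        host, device = split_bot_id(bot_id)
        hosts[host].add(device)
    return {host: sorted(devices) for host, devices in hosts.items()}
```

For example, feeding in `build272-m1--device7` and `build272-m1--device6` from the earlier comments would put both containers under the same host, `build272-m1`.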
,
Sep 22 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/puppet/+/0073f69b56ad33b22500ab4c2432060565b84ec8 commit 0073f69b56ad33b22500ab4c2432060565b84ec8 Author: Benjamin Pastene <bpastene@chromium.org> Date: Fri Sep 22 17:44:08 2017
,
Oct 6 2017
I've been looking around for the network failures mentioned in c#14, but I haven't seen any in over a week. It's nice that it's not slowing down our tests any longer, but it worries me that it came and went and I still have no clue what caused it. Oh well, as long as it never shows up again...

Found a bot with some wonky font settings, which were screwing up its tests. Potential fix in https://chrome-internal-review.googlesource.com/472794

Also found a bot that really likes to fail tests with window-focus related errors. Will try to debug: https://chromium-swarm.appspot.com/bot?id=build274-m1--device4
,
Dec 6 2017
Found another occurrence of the network problem: https://chromium-swarm.appspot.com/task?id=3a4100f5ff856f10 https://chromium-swarm.appspot.com/task?id=3a418ca748c82410 https://chromium-swarm.appspot.com/task?id=3a404b98aafed610 (all on the same bot, all within 24 hours)
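Occurrences like this one (several failures on the same bot within 24 hours) could be found mechanically from a list of failed tasks. The sketch below assumes the failures are already available as hypothetical `(bot_id, timestamp)` pairs; how to obtain them from swarming is not covered here, and the default of three failures per window is an arbitrary choice.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bots_with_clustered_failures(failures, min_failures=3,
                                 window=timedelta(hours=24)):
    """Return sorted bot IDs with at least min_failures inside one window.

    failures: iterable of (bot_id, datetime) pairs (hypothetical input).
    """
    by_bot = defaultdict(list)
    for bot_id, when in failures:
        by_bot[bot_id].append(when)

    flagged = []
    for bot_id, times in by_bot.items():
        times.sort()
        # Compare each failure with the one min_failures - 1 positions later;
        # if they fall within the window, so does everything between them.
        for i in range(len(times) - min_failures + 1):
            if times[i + min_failures - 1] - times[i] <= window:
                flagged.append(bot_id)
                break
    return sorted(flagged)
```

Because the hosts reboot every 24 hours, a 24-hour window roughly matches the longest time a container could stay in the bad state.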
,
Dec 6 2017
Issue 792438 has been merged into this issue.
Comment 1 by bpastene@chromium.org
, Aug 31 2017
Summary: Certain android swarming bots experience elevated rate of failures (was: Certain android swarming bos experience elevated rate of failures)