New issue
Advanced search Search tips

Issue 753046 link

Starred by 0 users

Issue metadata

Status: Duplicate
Merged: issue 748145
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug



Sign in to add a comment

Android swarming tasks occasionally run on slaves without live devices

Project Member Reported by joedow@chromium.org, Aug 7 2017

Issue description

I've seen a few intermittent failures on the Android buildbots where all tests appear to have been run (and passed / skipped as appropriate) but the run fails.  In each case the failure seemed to be related to an exit code returned from collect_cmd.  Here is an example from this run:
https://build.chromium.org/p/chromium.linux/builders/Android%20Tests/builds/44513

Last N lines of stdio for the failing webview_instrumentation_test_apk tests:
I  511.628s Main  All tests completed.
C  511.629s Main  ********************************************************************************
C  511.629s Main  Summary
C  511.629s Main  ********************************************************************************
C  511.629s Main  [==========] 123 tests ran.
C  511.629s Main  [  PASSED  ] 121 tests.
C  511.629s Main  [  SKIPPED ] Skipped 2 tests, listed below:
C  511.629s Main  [  SKIPPED ] org.chromium.android_webview.test.AcceptLanguageTest#testAcceptLanguagesWithenUS
C  511.629s Main  [  SKIPPED ] org.chromium.android_webview.test.AcceptLanguageTest#testAcceptLanguagesWithenUS with --webview-sandboxed-renderer
C  511.629s Main  ********************************************************************************
I  511.630s TimeoutThread-1-for-individual_device_tear_down(06b9c48513c86b6d)  [host]> /b/swarming/w/ir/third_party/android_tools/sdk/platform-tools/adb -s 06b9c48513c86b6d shell '( rm -f /data/local/tmp/android-webview-command-line );echo %$?'
I  511.731s TimeoutThread-1-for-individual_device_tear_down(06b9c48513c86b6d)  [host]> /b/swarming/w/ir/third_party/android_tools/sdk/platform-tools/adb -s 06b9c48513c86b6d shell '( test -e /data/local/tmp/android-webview-command-line );echo %$?'
I  511.794s individual_device_tear_down(06b9c48513c86b6d)  Flags now set on the device: []
I  511.795s TimeoutThread-1-for-individual_device_tear_down(06b9c48513c86b6d)  [host]> /b/swarming/w/ir/third_party/android_tools/sdk/platform-tools/adb -s 06b9c48513c86b6d shell '( am clear-debug-app );echo %$?'
I  512.427s TimeoutThread-1-for-individual_device_tear_down(06b9c48513c86b6d)  [host]> /b/swarming/w/ir/third_party/android_tools/sdk/platform-tools/adb -s 06b9c48513c86b6d shell '( rm -f /data/local/tmp/chrome_timeout_scale );echo %$?'
I  512.495s tear_down_device(06b9c48513c86b6d)  Wrote device cache: /b/swarming/w/ir/out/Release/device_cache_06b9c48513c86b6d.json
I  512.760s Main  Opening text logdog stream, unified_logcats
C  512.812s Main  Logcat: https://luci-logdog.appspot.com/v/?s=chromium%2Fandroid%2Fswarming%2Flogcats%2F37d6a9e2ff918511%2F%2B%2Funified_logcats
[I2017-08-07T18:06:28.091800Z 18786 0 main.go:349] Terminating.                                {"returnCode":0}
+------------------------------------------------------------------------------------+
| End of shard 0  Pending: 0.4s  Duration: 514.2s  Bot: build85-b4--device3  Exit: 0 |
+------------------------------------------------------------------------------------+
Total duration: 4016.5s
WARNING:root:collect_cmd had non-zero return code: 87
step returned non-zero exit code: 87
 

Comment 1 by joedow@google.com, Aug 7 2017

Here is another example from this run:
https://uberchromegw.corp.google.com/i/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/44574

Excerpt from stdio for unit_tests run:
I  788.613s Main  All tests completed.
C  788.619s Main  ********************************************************************************
C  788.619s Main  Summary
C  788.619s Main  ********************************************************************************
C  788.622s Main  [==========] 3033 tests ran.
C  788.622s Main  [  PASSED  ] 3033 tests.
C  788.622s Main  ********************************************************************************
I  788.700s individual_device_tear_down(06a1f1b4f0e687d1)  [host]> /b/swarming/w/ir/out/Debug/host_forwarder --adb=/b/swarming/w/ir/third_party/android_tools/sdk/platform-tools/adb --serial-id=06a1f1b4f0e687d1 --unmap 10203
I  788.761s tear_down_device(06a1f1b4f0e687d1)  Wrote device cache: /b/swarming/w/ir/out/Debug/device_cache_06a1f1b4f0e687d1.json
I  788.836s Main  Opening text logdog stream, unified_logcats
C  788.844s Main  Logcat: https://luci-logdog.appspot.com/v/?s=chromium%2Fandroid%2Fswarming%2Flogcats%2F37d6b4aa7268bf11%2F%2B%2Funified_logcats
[W2017-08-07T18:23:38.863266Z 2705 0 pubsubOutput.go:217] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"500ms", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W2017-08-07T18:23:45.738974Z 2705 0 pubsubOutput.go:217] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"16s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W2017-08-07T18:24:09.363503Z 2705 0 pubsubOutput.go:217] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"1s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W2017-08-07T18:24:31.739218Z 2705 0 pubsubOutput.go:217] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"30s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
[W2017-08-07T18:24:40.363745Z 2705 0 pubsubOutput.go:217] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"2s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
ERROR:root:Forwarding signal 15 to test process
[W2017-08-07T18:24:58.999061Z 2705 0 butler.go:240] Butler Context was cancelled. Initiating shutdown. {"error":"context canceled"}
[W2017-08-07T18:25:12.364187Z 2705 0 pubsubOutput.go:217] TRANSIENT error publishing messages; retrying... {"error":"context deadline exceeded", "delay":"4s", "pubsub":"pubsub(projects/luci-logdog/topics/logs)"}
+--------------------------------------------------------------------------------------+
| End of shard 0  Pending: 0.6s  Duration: 930.0s  Bot: build934-m4--device3  Exit: -9 |
+--------------------------------------------------------------------------------------+
Total duration: 930.0s
WARNING:root:collect_cmd had non-zero return code: 247
step returned non-zero exit code: 247
Components: -Tests>Flaky Infra>Platform>Swarming
Components: -Infra>Platform>Swarming Infra>Client>Android
Cc: jbudorick@chromium.org
+John specifically for triage
Cc: bpastene@chromium.org
Summary: Android swarming tasks occasionally run on slaves without live devices (was: Intermittent Android test failures reported, possibly related to collect_cmd exit code)
The examples you linked are two separate issues, neither of which is related to the exit code.

#0 has one of the shards (https://chromium-swarm.appspot.com/task?id=37d6a9ece09a5010) failing because it got assigned a bot without a live device. +bpastene and I were talking about this last week -- something about the device dying in on_before_task? Repurposing this bug for that.

#1 is symptomatic of a problem I noticed looking at the bots this morning -- runtimes for unit_tests on Android Tests (dbg) have been rising conspicuously since July 27. I'm not yet sure why. Filed as https://bugs.chromium.org/p/chromium/issues/detail?id=753059
Owner: bpastene@chromium.org
Status: Assigned (was: Untriaged)
I'll take this for the no devices problem.
Project Member

Comment 7 by bugdroid1@chromium.org, Aug 8 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/d2aefb6d321d8f24eba47bbaf185214d1f164c63

commit d2aefb6d321d8f24eba47bbaf185214d1f164c63
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Tue Aug 08 18:43:55 2017

Project Member

Comment 8 by bugdroid1@chromium.org, Aug 8 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/182ff74d29ba7f082d8311fc74821ee9faa2a080

commit 182ff74d29ba7f082d8311fc74821ee9faa2a080
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Tue Aug 08 21:34:46 2017

Do you have an update on this issue. We appear to still be running into "all devices are blacklisted" errors:

https://uberchromegw.corp.google.com/i/chromium.linux/builders/Android%20Tests/builds/44770
Labels: -Pri-3 Pri-2
I've got another change coming down that should fix that particular problem. It looks like we're rebooting the device but not waiting for it to fully complete the boot-up process before scheduling a new test. Hence why we can see the device and can run some simple commands, but can't do anything more complex like touch the sd card or talk with the package manager.

Raising pri since this looks to be responsible for quite a bit of flake on https://build.chromium.org/p/chromium.linux/builders/Android%20Tests
Project Member

Comment 11 by bugdroid1@chromium.org, Aug 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/f57737b20a79c7b9672f08de0f58c98c647c0cb6

commit f57737b20a79c7b9672f08de0f58c98c647c0cb6
Author: Benjamin Pastene <bpastene@chromium.org>
Date: Tue Aug 15 19:00:21 2017

swarming: Roll python-adb to 55aea2...

Also quarantine a device if IsFullyBooted is false.

Bug:  753046 
Change-Id: I81c2d6d40a99d48f4f47c311f2df2c315891c155
Reviewed-on: https://chromium-review.googlesource.com/614622
Reviewed-by: Kevin Lubick <kjlubick@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Benjamin Pastene <bpastene@chromium.org>

[modify] https://crrev.com/f57737b20a79c7b9672f08de0f58c98c647c0cb6/appengine/swarming/swarming_bot/api/os_utilities.py
[modify] https://crrev.com/f57737b20a79c7b9672f08de0f58c98c647c0cb6/appengine/third_party/python-adb/README.swarming
[modify] https://crrev.com/f57737b20a79c7b9672f08de0f58c98c647c0cb6/appengine/third_party/python-adb/adb/contrib/high.py

Comment 12 by awdf@chromium.org, Aug 16 2017

Just had the same issue (?) when trying to submit a change to the CQ:

https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/363734 

chrome_public_test_apk step is red, but all tests appear to have passed.

Yeah, same issue. That test ran on a bot that should have been quarantined:
https://chromium-swarm.appspot.com/bot?id=build65-b4--device2

There was a string of test failures that lasted a few hours when the device was unsuitable for tasks. Our bot-health checks should have caught that the device was wonkers and quarantined it. It was eventually quarantined but not soon enough: http://shortn/_sCDaHUj0sP

I need to look into why we didn't catch that sooner.
Mergedinto: 748145
Status: Duplicate (was: Assigned)
Duping this into bug 748145 since it's the same root cause.

Sign in to add a comment