cheets autotests: adb timed out and aborted
Reported by
jrbarnette@chromium.org,
Mar 19 2018
Issue description
A recent CQ run failed. This is the failed slave:
https://luci-milo.appspot.com/buildbot/chromeos/quawks-paladin/2447
The bvt-arc suite for the builder timed out and aborted.
Logs for the suite are here:
http://cautotest-prod/afe/#tab_id=view_job&object_id=184240176
The suite failed because this test job ran over the time
limit:
http://cautotest-prod/afe/#tab_id=view_job&object_id=184240530
That job was for the cheets_AndroidToChromeIntents test. Looking
at debug/client.0.DEBUG for the job, you see these lines are the
last reported:
03/16 22:30:59.300 DEBUG| utils:0214| Running 'adb get-state'
03/16 22:30:59.301 DEBUG| global_hooks:0056| 'adb get-state'
03/16 22:30:59.315 DEBUG| arc:0073| adb get-state: device
03/16 22:30:59.316 DEBUG| utils:0214| Running 'adb shell 'am start -a "android.intent.action.SET_WALLPAPER"''
03/16 22:30:59.317 DEBUG| global_hooks:0056| 'adb shell \'am start -a "android.intent.action.SET_WALLPAPER"\''
After those lines, the test sat and did nothing for some 1h25m, until the
system aborted the job at 23:54:33.
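For illustration only, here is a minimal sketch of how a per-command timeout would have bounded a hang like this one instead of leaving it to the suite-level abort. The helper name run_with_timeout and the CommandResult type are hypothetical and are not autotest's utils API; the sleeping child process stands in for a hung 'adb shell' call.

```python
import subprocess
import sys
from collections import namedtuple

# Hypothetical result type for this sketch; not part of autotest.
CommandResult = namedtuple("CommandResult", ["timed_out", "returncode", "output"])


def run_with_timeout(argv, timeout_sec):
    """Run argv, killing it if it runs longer than timeout_sec seconds."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before re-raising.
        return CommandResult(True, None, "")
    return CommandResult(False, proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Stand-in for a hung command: a child that sleeps far past the limit.
    hung = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], 1)
    print("timed out:", hung.timed_out)
```

With a bound like this, the caller learns which command hung within seconds rather than after the whole job is aborted.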
,
Mar 26 2018
,
Mar 27 2018
nya@, are you the owner of this test? Can you please check why it failed? We might need to fix it if it's flaky.
,
Mar 29 2018
I think this is general flakiness in ARC, not specific to this test. The problem is that the timeout is too long: this test usually takes 1-3 minutes, so it should have been aborted after 10 minutes. I'll tune the timeout.
,
Mar 29 2018
I gathered data about run times of passed tests: https://goto.google.com/zbiov
So we just need to tune the control files to set timeouts. However, I'm not sure I can work on this soon, as I have other priority tasks. If anyone is interested, please feel free to take it.
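As a rough sketch of how a timeout could be derived from the run times of passed tests: take the longest passing run and multiply by a safety factor. The function name, the safety factor, the floor, and the sample durations below are all assumptions made up for illustration; they are not taken from the linked data or from any control file.

```python
import math


def suggest_timeout_sec(passed_durations_sec, safety_factor=3.0, floor_sec=60):
    """Suggest a timeout from the longest passing run time.

    All parameters are illustrative assumptions, not project policy.
    """
    if not passed_durations_sec:
        raise ValueError("need at least one passing run")
    longest = max(passed_durations_sec)
    return max(floor_sec, int(math.ceil(longest * safety_factor)))


if __name__ == "__main__":
    # Made-up sample: passing runs between 1 and 3 minutes.
    sample = [60, 95, 180]
    print("suggested timeout (sec):", suggest_timeout_sec(sample))
```

For a test whose passing runs take 1-3 minutes, a factor of 3 lands in the same range as the 10 minutes mentioned above.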
,
Mar 29 2018
,
Mar 29 2018
"a general flakiness of ARC" - I don't know what that is. If the test hung, there must be a reason. Could it be that we need to collect more logs? I don't really understand how tuning timeouts would have fixed this test. IIUC the test will still be flaky, but will just fail earlier. Is this right?
,
Mar 30 2018
Hmm, makes sense. Let's keep this bug to track the hang issue. I'll file another bug for setting timeouts.
---
I searched the logs of cheets_AndroidToChromeIntents for the last 28 days, and found 3 instances of similar timeouts out of 8036 runs. So the probability is 0.04%, pretty rare.
---
The test hung while running "am start" via "adb shell". Here are the relevant logs:
https://storage.cloud.google.com/chromeos-autotest-results/184240530-chromeos-test/chromeos6-row1-rack12-host5/cheets_AndroidToChromeIntents/debug/cheets_AndroidToChromeIntents.DEBUG
https://storage.cloud.google.com/chromeos-autotest-results/184240530-chromeos-test/chromeos6-row1-rack12-host5/crashinfo.chromeos6-row1-rack12-host5/var/log/logcat
logcat says the intent was successfully delivered, so the problem is one of:
- "am" command did not return
- "adb" command did not return
---
I also found a similar failure in cheets_DownloadsFilesystem.
http://cautotest-prod/afe/#tab_id=view_job&object_id=182054796
In this case the test hung at "adb pull":
03/08 19:51:24.162 DEBUG| global_hooks:0056| 'adb pull /storage/emulated/0/Download/kittens.jpg /tmp/tmpYw69wB'
Assuming they are the same problem, I guess the problem is that the adb command hangs.
---
Possible reasons for adb to hang:
- bugs in adb command
- bugs in adbd
- bugs in sslh (adb connections go through sslh)
- network connection failure
We need more data to investigate this. I'll try reproducing this issue in a local VM over this weekend...
,
Mar 30 2018
> So the probability is 0.04%, pretty rare.
.04% _is_ pretty rare, but the individual test failure rate isn't quite
the right measure.
The bvt-arc suite (which contains the problem test) runs on 8 different
models of hardware in every CQ run. So, the chance that a CQ run will
be affected is
(1 - (1 - .04%)^8) ~ .3%
That may still sound "pretty rare", but we manage at least 65 CQ runs
every week. So a .04% test failure rate means that, in any given week,
our chance of seeing a failure is at least
(1 - (1 - .04%)^(8*65)) ~ 19%
That's _not_ "pretty rare", and that's before we account for what might
happen if more than one test had a .04% failure rate.
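The arithmetic above can be checked with a short sketch. It assumes, as the estimate does, that each run fails independently with the same probability; the function name is just for illustration.

```python
def chance_of_any_failure(per_run_rate, runs):
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - per_run_rate) ** runs


if __name__ == "__main__":
    rate = 0.0004          # .04% per-test failure rate
    models_per_cq = 8      # bvt-arc runs on 8 models per CQ run
    cq_runs_per_week = 65  # at least 65 CQ runs per week

    per_cq = chance_of_any_failure(rate, models_per_cq)
    per_week = chance_of_any_failure(rate, models_per_cq * cq_runs_per_week)
    print("per CQ run: {:.2%}".format(per_cq))
    print("per week:   {:.1%}".format(per_week))
```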
TTBOMK, in practice we're not seeing a failure rate anywhere near that
high in the CQ. If we were, I'd recommend yanking the test from the CQ
until it were made more reliable.
Key conclusions (just to be clear):
* In the CQ, the actual failure rate for this test doesn't seem high
enough to require urgent action. _Yet_.
* The CQ's reliability requirements are stringent enough that for a
single test, even a .04% false failure rate is unacceptably high.
,
Mar 30 2018
+@lhchavez for any inputs on adb hang
,
Mar 30 2018
#c8 is a good summary. We should also add Chrome crashes and some ARC crashes to that list. #c7 also touches on a good point: we urgently need better logs to understand what happened. Most of the time we just see "adb timed out", and it could mean a lot of different things.
,
May 11 2018
Issue 841996 has been merged into this issue.
,
Nov 9
Constable here. Can we close this now?
,
Nov 16
The adb connectivity issue still happens sometimes, but I have not had time to investigate it.
,
Jan 7
Comment 1 by uekawa@google.com
, Mar 20 2018