Orphan autotest job interferes with Tast test
Issue description

arc.* test failures in this run were interesting:
https://stainless.corp.google.com/browse/chromeos-autotest-results/226611031-chromeos-test/

Error messages were:

2018/08/12 10:55:17 [10:55:16.522] Command: 'adb' 'wait-for-device'
2018/08/12 10:55:17 [10:55:16.522] Uncaptured output: error: protocol fault (no status)
2018/08/12 10:55:17 [10:55:16.523] Error at downloads.go:47: Failed to start ARC: failed connecting to ADB: exit status 1
2018/08/12 10:56:39 [10:56:39.382] Command: 'adb' 'wait-for-device'
2018/08/12 10:56:39 [10:56:39.382] Uncaptured output: error: more than one device and emulator
2018/08/12 10:56:39 [10:56:39.384] Error at intent_forward.go:52: Failed to start ARC: failed connecting to ADB: exit status 1

Actually, ps.txt says that there was another autotest job running in parallel (!):

root 26218 0.0 0.1 18604 7044 ? S 09:47 0:00 /usr/bin/python /usr/local/autotest/bin/autotestd /tmp/autoserv-5JTjli -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root 26219 0.0 0.4 31316 19172 ? S 09:47 0:00 \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root 26226 0.0 0.3 31316 15128 ? S 09:47 0:00 \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root 26227 0.0 0.3 31316 15128 ? S 09:47 0:00 \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root 26253 0.1 1.2 85216 49308 ? S 09:47 0:06 \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root 26283 0.0 0.8 75968 33300 ? S 09:47 0:00 \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root 26284 0.0 0.8 75968 33300 ? S 09:47 0:00 \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root 26285 0.0 0.0 8496 744 ? S 09:47 0:00 \_ evemu-device /usr/local/autotest/cros/input_playback/keyboard.prop
root 29104 0.0 0.0 0 0 ? Z 09:48 0:00 \_ [android-sh] <defunct>
root 7198 0.0 0.0 0 0 ? Zs 10:54 0:00 \_ [adb] <defunct>
root 8372 0.0 0.0 10696 1336 ? Ss 10:55 0:00 \_ adb connect localhost:22

Since it had been running for a long time (>1 hour), I guess it's a timed-out job.
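The "running for more than an hour" observation can be checked mechanically against a ps dump. Below is a minimal sketch, assuming `ps aux`-style columns (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) with START printed as HH:MM, as in the listing above. The function name `find_long_running`, its parameters, and the choice of "now" are hypothetical and not part of Autotest or Tast.

```python
from datetime import datetime, timedelta

# Assumption: this path prefix identifies autotest client processes,
# as seen in the ps.txt listing from this report.
AUTOTEST_NEEDLE = "/usr/local/autotest/bin/autotest"


def find_long_running(ps_lines, now_hhmm, needle=AUTOTEST_NEEDLE,
                      min_age=timedelta(hours=1)):
    """Return [(pid, age)] for processes whose command contains `needle`
    and whose START time is at least `min_age` before `now_hhmm`."""
    now = datetime.strptime(now_hhmm, "%H:%M")
    hits = []
    for line in ps_lines:
        # 10 leading columns, then the full command as the 11th field.
        fields = line.split(None, 10)
        if len(fields) < 11 or needle not in fields[10]:
            continue
        try:
            start = datetime.strptime(fields[8], "%H:%M")
        except ValueError:
            continue  # header line or a date-style START; skipped in this sketch
        age = now - start
        if age < timedelta(0):
            age += timedelta(days=1)  # process started before midnight
        if age >= min_age:
            hits.append((int(fields[1]), age))
    return hits
```

The caller supplies `now_hhmm` (for example, the time the Tast failure was logged), since the ps dump itself does not record when it was taken.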
Aug 15
Here's the corresponding GE report: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=226610626

According to the timeline, cheets_BackupTest timed out before tast on the same DUT: https://stainless.corp.google.com/browse/chromeos-autotest-results/226611048-chromeos-test/

> Are those the process start times? It's interesting that there's an android-sh from 09:48 and adb processes from 10:54 and 10:55, all under the same autotest process.

Yes, that's the start time. I guess autotest was still retrying something with adb at that time.

> Was PID 8372 (the non-defunct "adb connect localhost:22") the reason why the Tast test's adb command failed? If existing processes will cause problems, maybe Tast's ARC code should kill any existing adb processes first.

Tast's ARC code kills the ADB local server first. However, I believe the autotest job tried to issue an adb connect command in parallel, which made Tast's adb command fail.
Aug 15
Odd. I'd be surprised if Autotest doesn't already contain logic to avoid running two tests (e.g. cheets_BackupTest and tast.py in this case) simultaneously on the same DUT.
Sep 10
I don't think there's enough here to chase down unless it's common or reproducible. Maybe we could use cgroups in the future to better prevent orphans.
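For illustration only, here is a lighter-weight relative of the cgroups idea: run the job in its own session and, on timeout, signal the whole process group instead of only the top-level process. This is a sketch using POSIX process groups, not cgroups, and it is not how Autotest currently launches jobs; `run_with_group_timeout` is a hypothetical name. A descendant that calls setsid() itself would escape the group, which is the gap cgroups would close.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout_sec):
    """Run `cmd` in a new session; on timeout, SIGKILL its process group.

    Returns (returncode, timed_out).
    """
    # start_new_session=True makes the child call setsid(), so it leads a
    # new process group whose ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_sec), False
    except subprocess.TimeoutExpired:
        try:
            # Kill the leader and every descendant still in the group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        return proc.wait(), True
```

With this shape, helpers such as the evemu-device and adb children seen in the ps listing would be killed together with the timed-out job, provided they stayed in the job's process group.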
Sep 10
adb needs to be inside of a container. Were you trying to run tast + adb without a container?
Sep 11
Do you mean adbd in Android? IIUC the adb local server runs outside of containers.
Sep 11
I misread this. I thought adb was from the shard (= server tests). I see those two were client tests.

I think what should be done here is to add some Android checks (like adb, but possibly running container) to the autotest reset verifiers.
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row6-rack11-host15/927755-reset/

That will force a reboot/repair when we have runaway state.
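One possible shape for such an adb check, as a rough sketch: parse the text printed by `adb devices` and flag the host when more entries are attached than expected, which is the condition behind the "more than one device and emulator" error in the original report. The function names and the `max_expected` parameter are hypothetical, this is not an existing Autotest verifier, and whether adb is usable at reset time is an assumption.

```python
def count_adb_devices(adb_devices_output):
    """Count the entries listed by `adb devices`, whatever their state."""
    count = 0
    for line in adb_devices_output.splitlines():
        line = line.strip()
        # Skip blanks, the "List of devices attached" header, and
        # "* daemon ..." status lines printed when the server starts.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        # Device lines look like "<serial>\t<state>".
        if len(line.split()) >= 2:
            count += 1
    return count


def needs_repair(adb_devices_output, max_expected=1):
    """Return True when more devices are attached than the verifier expects."""
    return count_adb_devices(adb_devices_output) > max_expected
```

A verifier could capture the command's stdout, pass it to `needs_repair`, and trigger the reboot/repair path when it returns True. Setting `max_expected=0` would treat any leftover connection as runaway state.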
Comment 1 by derat@chromium.org, Aug 15
Labels: OS-Chrome