Issue 874333

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Orphan autotest job interferes with Tast test

Project Member Reported by nya@chromium.org, Aug 15

Issue description

The arc.* test failures in this run were interesting:
https://stainless.corp.google.com/browse/chromeos-autotest-results/226611031-chromeos-test/

Error messages were:

2018/08/12 10:55:17 [10:55:16.522] Command: 'adb' 'wait-for-device'
2018/08/12 10:55:17 [10:55:16.522] Uncaptured output:
error: protocol fault (no status)
2018/08/12 10:55:17 [10:55:16.523] Error at downloads.go:47: Failed to start ARC: failed connecting to ADB: exit status 1

2018/08/12 10:56:39 [10:56:39.382] Command: 'adb' 'wait-for-device'
2018/08/12 10:56:39 [10:56:39.382] Uncaptured output:
error: more than one device and emulator
2018/08/12 10:56:39 [10:56:39.384] Error at intent_forward.go:52: Failed to start ARC: failed connecting to ADB: exit status 1


Actually, ps.txt says that there was another autotest job running in parallel (!):

root     26218  0.0  0.1  18604  7044 ?        S    09:47   0:00 /usr/bin/python /usr/local/autotest/bin/autotestd /tmp/autoserv-5JTjli -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root     26219  0.0  0.4  31316 19172 ?        S    09:47   0:00  \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root     26226  0.0  0.3  31316 15128 ?        S    09:47   0:00      \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root     26227  0.0  0.3  31316 15128 ?        S    09:47   0:00      \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root     26253  0.1  1.2  85216 49308 ?        S    09:47   0:06      \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root     26283  0.0  0.8  75968 33300 ?        S    09:47   0:00          \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root     26284  0.0  0.8  75968 33300 ?        S    09:47   0:00          \_ /usr/bin/python -u /usr/local/autotest/bin/autotest -H autoserv --verbose --hostname=chromeos2-row6-rack11-host15 --user=chromeos-test /usr/local/autotest/control.autoserv
root     26285  0.0  0.0   8496   744 ?        S    09:47   0:00          \_ evemu-device /usr/local/autotest/cros/input_playback/keyboard.prop
root     29104  0.0  0.0      0     0 ?        Z    09:48   0:00          \_ [android-sh] <defunct>
root      7198  0.0  0.0      0     0 ?        Zs   10:54   0:00          \_ [adb] <defunct>
root      8372  0.0  0.0  10696  1336 ?        Ss   10:55   0:00          \_ adb connect localhost:22

Since it had been running for a long time (>1 hour), I guess it's a timed-out job.
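The reading of ps.txt above can be sketched as a small script. This is only an illustration, assuming the listing uses the `ps aux --forest` column layout shown in the excerpt (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) and that START is in HH:MM form, which ps only prints for recently started processes. The names `PsEntry`, `parse_ps_line` and `find_suspects` are hypothetical and are not part of Tast or autotest.

```python
import re
from typing import List, NamedTuple


class PsEntry(NamedTuple):
    pid: int
    stat: str
    start: str
    command: str


def parse_ps_line(line: str) -> PsEntry:
    """Parse one line of `ps aux --forest` style output."""
    # The first ten columns are whitespace separated; the rest is the command.
    fields = line.split(None, 10)
    if len(fields) < 11:
        raise ValueError("unexpected ps line: %r" % line)
    # Drop the tree-drawing prefix ("\_ " or "|") that --forest adds.
    command = re.sub(r"^(?:[|\\_]\s*)+", "", fields[10])
    return PsEntry(int(fields[1]), fields[7], fields[8], command)


def find_suspects(lines: List[str], test_start: str) -> List[PsEntry]:
    """Return zombies and autotest processes started before test_start.

    test_start is an HH:MM string. Assumption: all START values are HH:MM
    too, so comparing them as text gives the right order.
    """
    suspects = []
    for line in lines:
        entry = parse_ps_line(line)
        is_zombie = entry.stat.startswith("Z")
        is_old_autotest = ("autotest" in entry.command
                           and entry.start < test_start)
        if is_zombie or is_old_autotest:
            suspects.append(entry)
    return suspects
```

Run against the lines quoted above with a test start time of 10:55, this flags the autotest processes from 09:47 and the two defunct entries, but not the live `adb connect localhost:22` process.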

 
Cc: akes...@chromium.org ayatane@chromium.org pprabhu@chromium.org ihf@chromium.org
Labels: OS-Chrome
Huh, that's strange. The defunct processes make me wonder if 26253 was hanging. :-/

Are those the process start times? It's interesting that there's an android-sh from 09:48 and adb processes from 10:54 and 10:55, all under the same autotest process.

Was PID 8372 (the non-defunct "adb connect localhost:22") the reason why the Tast test's adb command failed? If existing processes will cause problems, maybe Tast's ARC code should kill any existing adb processes first.
Here's the corresponding GE report:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=226610626

According to the timeline, cheets_BackupTest timed out before tast on the same DUT:
https://stainless.corp.google.com/browse/chromeos-autotest-results/226611048-chromeos-test/


> Are those the process start times? It's interesting that there's an android-sh from 09:48 and adb processes from 10:54 and 10:55, all under the same autotest process.

Yes, those are the start times. I guess autotest was still retrying something with adb at that time.


> Was PID 8372 (the non-defunct "adb connect localhost:22") the reason why the Tast test's adb command failed? If existing processes will cause problems, maybe Tast's ARC code should kill any existing adb processes first.

Tast's ARC code kills the ADB local server first. However, I believe the autotest job tried to issue an adb connect command in parallel, which made Tast's adb command fail.

Odd. I'd be surprised if Autotest doesn't already contain logic to try to avoid running two tests (e.g. cheets_BackupTest and tast.py in this case) simultaneously on the same DUT.
Status: Available (was: Untriaged)
I don't think there's enough here to chase down unless it's common or reproducible. Maybe we could use cgroups in the future to better prevent orphans.
adb needs to be inside of a container. Were you trying to run tast + adb without a container?
Do you mean adbd in Android? IIUC the adb local server runs outside of containers.
I misread this. I thought this was adb from the shard (= server tests). I see those two were client tests.

I think what should be done here is to add some Android checks (for adb, and possibly for a running container) to the autotest reset verifiers.
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row6-rack11-host15/927755-reset/

That will force a reboot/repair when we have runaway state.
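A rough sketch of what such a check might look like, as plain Python that scans /proc for leftover adb client processes. It does not use the real autotest verifier API; `read_cmdlines`, `stray_adb_pids` and `needs_repair` are hypothetical names, and the idea that a leftover adb process alone should trigger repair is an assumption made for illustration.

```python
import os
from typing import Dict, List


def read_cmdlines(proc_root: str = "/proc") -> Dict[int, bytes]:
    """Map each PID to the raw contents of /proc/<pid>/cmdline."""
    result = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "cmdline"), "rb") as f:
                result[int(name)] = f.read()
        except OSError:
            # The process exited between listdir() and open().
            continue
    return result


def stray_adb_pids(cmdlines: Dict[int, bytes]) -> List[int]:
    """Return PIDs whose argv[0] is an adb binary.

    Zombies have an empty cmdline, so a defunct [adb] entry is not
    caught here; that would need /proc/<pid>/stat instead.
    """
    pids = []
    for pid, raw in cmdlines.items():
        argv = raw.split(b"\0")  # arguments are NUL separated
        if os.path.basename(argv[0]) == b"adb":
            pids.append(pid)
    return sorted(pids)


def needs_repair(cmdlines: Dict[int, bytes]) -> bool:
    """Hypothetical policy: any leftover adb process means runaway state."""
    return bool(stray_adb_pids(cmdlines))
```

On a DUT this would be called as `needs_repair(read_cmdlines())` between jobs. Keeping the scan and the decision separate makes the policy easy to exercise with canned data.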
