
Issue 862408


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




[USB_Detect - Peach-pit] platform_ExternalUsbPeripherals.detect.reboot_login test fails with 'Client job was aborted'

Project Member Reported by pgangishetty@chromium.org, Jul 10

Issue description

DUT: Peach-pit
Host: chromeos15-row13a-rack1-host11	
Failure reason: client job was aborted

Sometimes other platform_ExternalUsbPeripherals.* tests fail with no failure reason.

https://stainless.corp.google.com/search?view=list&first_date=2018-06-27&last_date=2018-07-10&suite=usb_detect&test=platform_ExternalUsbPeripherals*&build=%5ER69*&board=%5Epeach_pit%24&status=FAIL&status=ERROR&status=ABORT&exclude_cts=false&exclude_not_run=true&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true  

Failures screenshot: https://screenshot.googleplex.com/CsieW21aPBe

From debug logs:
==========================================
07/10 08:03:49.631 DEBUG|        server_job:1370| Client state file /usr/local/autotest/results/215794987-chromeos-test/platform_ExternalUsbPeripherals.detect.reboot_login/control.autoserv.state not found
07/10 08:03:49.666 DEBUG|          base_job:0399| Persistent state client.* deleted
07/10 08:03:49.677 DEBUG|          autotest:1122| Autotest job finishes.
07/10 08:03:49.677 DEBUG|              test:0410| Test failed due to client job was aborted. Exception log follows the after_iteration_hooks.
07/10 08:03:49.677 DEBUG|              test:0415| Starting after_iteration_hooks for platform_ExternalUsbPeripherals.detect.reboot_login
07/10 08:03:49.678 DEBUG|              test:0420| after_iteration_hooks completed
07/10 08:03:49.678 WARNI|              test:0637| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 631, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 831, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 495, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 362, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 400, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/platform_ExternalUsbPeripherals/platform_ExternalUsbPeripherals.py", line 353, in run_once
    self.action_login()
  File "/usr/local/autotest/server/site_tests/platform_ExternalUsbPeripherals/platform_ExternalUsbPeripherals.py", line 64, in action_login
    exit_without_logout=True)
  File "/usr/local/autotest/server/autotest.py", line 638, in run_test
    *args, **dargs)
  File "/usr/local/autotest/server/autotest.py", line 626, in run_timed_test
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 479, in run
    client_disconnect_timeout, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 562, in _do_run
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 1054, in execute_control
    logger, client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 999, in execute_section
    raise err
AutotestRunError: client job was aborted

==========================================


 
Owner: pgangishetty@chromium.org
Reason: File "/usr/local/autotest/server/autotest.py", line 626, in run_timed_test client_disconnect_timeout=client_disconnect_timeout)

The test fails consistently now. It started as a flake and then began failing more frequently - https://stainless.corp.google.com/search?view=matrix&row=build&col=test&first_date=2018-05-27&last_date=2018-07-10&suite=usb_detect&test=platform_ExternalUsbPeripherals.detect.reboot_login&build=%5ER69*&board=%5Epeach_pit%24&exclude_cts=false&exclude_not_run=true&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true


So it seems the LOGIN step is broken: the device never gets to execute the login. Instead, the log shows there is no space left on the device:

05/31 17:25:45.974 DEBUG|platform_ExternalU:0087| Succeeded in :0 sec
05/31 17:25:45.974 INFO |platform_ExternalU:0335| STEP 1.2. LOGIN
05/31 17:25:45.990 DEBUG|          ssh_host:0301| Running (ssh) 'true' from '_install|wait_up|is_up|ssh_ping|run|run_very_slowly'
05/31 17:25:46.346 DEBUG|      abstract_ssh:0670| Host chromeos1-row1-rack4-host5 is now up
05/31 17:25:46.346 INFO |          autotest:0334| Installing autotest on chromeos1-row1-rack4-host5
...
05/31 17:26:06.198 DEBUG|          ssh_host:0301| Running (ssh) 'nohup /usr/local/autotest/bin/autotestd /tmp/autoserv-ldbLKB -H autoserv --verbose --hostname=chromeos1-row1-rack4-host5 --user=chromeos-test /usr/local/autotest/control.autoserv >/dev/null 2>/dev/null &' from '_do_run|execute_control|execute_section|_execute_daemon|run|run_very_slowly'
05/31 17:26:06.643 DEBUG|          ssh_host:0301| Running (ssh) '/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-ldbLKB 0 0' from '_do_run|execute_control|execute_section|_execute_daemon|run|run_very_slowly'
05/31 17:26:07.437 DEBUG|          autotest:1281| Traceback (most recent call last):
05/31 17:26:07.437 INFO |          autotest:1340| Traceback (most recent call last):
05/31 17:26:07.438 DEBUG|          autotest:1281|   File "/usr/local/autotest/bin/autotestd_monitor", line 12, in <module>
05/31 17:26:07.438 INFO |          autotest:1340|   File "/usr/local/autotest/bin/autotestd_monitor", line 12, in <module>
05/31 17:26:07.438 DEBUG|          autotest:1281|     print >> stderr, 'Entered autotestd_monitor.'
05/31 17:26:07.438 INFO |          autotest:1340|     print >> stderr, 'Entered autotestd_monitor.'
05/31 17:26:07.461 DEBUG|          autotest:1281| IOError: [Errno 28] No space left on device
05/31 17:26:07.461 INFO |          autotest:1340| IOError: [Errno 28] No space left on device
05/31 17:26:07.472 DEBUG|          autotest:0956| Result exit status is 1.
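
For anyone triaging similar runs, here is a minimal sketch of how a debug log could be scanned for the out-of-space error shown above. It assumes the log is available as plain text lines. The name `find_enospc_lines` and the `sample` list are illustrative only and are not part of autotest.

```python
import re

# Matches the message Python prints for errno 28 (ENOSPC).
ENOSPC_PATTERN = re.compile(r"\[Errno 28\] No space left on device")


def find_enospc_lines(lines):
    """Return (line_number, text) pairs for lines that report ENOSPC."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if ENOSPC_PATTERN.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits


# Three lines copied from the excerpt above, used as sample input.
sample = [
    "05/31 17:26:07.438 DEBUG|          autotest:1281|     print >> stderr, 'Entered autotestd_monitor.'",
    "05/31 17:26:07.461 DEBUG|          autotest:1281| IOError: [Errno 28] No space left on device",
    "05/31 17:26:07.472 DEBUG|          autotest:0956| Result exit status is 1.",
]

for number, text in find_enospc_lines(sample):
    print(number, text)
```

The same function could be fed the lines of a real log file opened with `open(path)`; the path would depend on where the results were downloaded.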


I also see memd (debugd in older results) crash when the test fails, for example: https://storage.cloud.google.com/chromeos-autotest-results/215794987-chromeos-test/chromeos15-row13a-rack1-host11/crashinfo.chromeos15-row13a-rack1-host11/memd.20180710.080213.2606.dmp.txt

The log size is also huge.


AI: re-image the device. Try the command
$ cros flash chromeos1-row1-rack4-host5 xbuddy://remote/peach_pit/latest-dev --clobber-stateful

If that does not work, re-image with a USB stick.


Actually, the hostname is chromeos15-row13a-rack1-host11. Use this one in the command above.

I am puzzled that no other tests are failing like that, and the device space looks OK now:

$ ssh root@chromeos15-row13a-rack1-host11
localhost ~ # df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/root                1.2G  939M  283M  77% /
devtmpfs                1000M     0 1000M   0% /dev
tmp                     1001M  136K 1001M   1% /tmp
run                     1001M  432K 1000M   1% /run
shmfs                   1001M  2.3M  998M   1% /dev/shm
/dev/mmcblk0p1            11G  1.3G  8.4G  14% /mnt/stateful_partition
/dev/mmcblk0p8            12M   28K   12M   1% /usr/share/oem
/dev/mapper/encstateful  3.1G   30M  3.0G   1% /mnt/stateful_partition/encrypted
media                   1001M     0 1001M   0% /media
none                    1001M     0 1001M   0% /sys/fs/cgroup
imageloader             1001M     0 1001M   0% /run/imageloader
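
As a rough aid for comparing snapshots like the one above, the following sketch parses `df -h` text and lists mount points at or above a usage threshold. It only reflects the state at the time `df` was run, not the state when the test failed. The names `parse_df`, `mounts_above`, and `DF_SAMPLE` are hypothetical helpers, not part of autotest.

```python
def parse_df(output):
    """Parse `df -h` text into a list of dicts, one per filesystem row."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Expect: filesystem, size, used, avail, use%, mount point.
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        rows.append({
            "filesystem": fields[0],
            "size": fields[1],
            "used": fields[2],
            "avail": fields[3],
            "use_percent": int(fields[4].rstrip("%")),
            "mount": " ".join(fields[5:]),
        })
    return rows


def mounts_above(rows, threshold):
    """Return mount points whose usage is at or above threshold percent."""
    return [row["mount"] for row in rows if row["use_percent"] >= threshold]


# A few rows copied from the output above.
DF_SAMPLE = """\
Filesystem               Size  Used Avail Use% Mounted on
/dev/root                1.2G  939M  283M  77% /
tmp                     1001M  136K 1001M   1% /tmp
/dev/mmcblk0p1            11G  1.3G  8.4G  14% /mnt/stateful_partition
"""

print(mounts_above(parse_df(DF_SAMPLE), 75))
```

With a threshold of 75, only the root filesystem is reported for this sample, which matches the observation that the stateful partition and `/tmp` have plenty of room.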



Still, re-imaging should help.
If not, we have to get another peach_pit DUT.
Re-imaged the DUT with the command from #1. Will wait for the tests to run and update the status later.
The test is still failing, and the latest file system info is about the same as above.

localhost ~ # df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/root                1.2G  962M  259M  79% /
devtmpfs                1000M     0 1000M   0% /dev
tmp                     1001M  120K 1001M   1% /tmp
run                     1001M  428K 1000M   1% /run
shmfs                   1001M  4.3M  996M   1% /dev/shm
/dev/mmcblk0p1            11G  1.4G  8.4G  14% /mnt/stateful_partition
/dev/mmcblk0p8            12M   28K   12M   1% /usr/share/oem
/dev/mapper/encstateful  3.0G   47M  3.0G   2% /mnt/stateful_partition/encrypted
media                   1001M     0 1001M   0% /media
none                    1001M     0 1001M   0% /sys/fs/cgroup
imageloader             1001M     0 1001M   0% /run/imageloader

@kalin, should I send a request for another peach_pit DUT, as mentioned?
Not yet.

Let's figure out why ONLY this test is affected.
If needed, let's remove this test from the suite.
Oh, I see: this is the ONLY test that REBOOTs as its first step. For all the other tests, the first step is LOGIN.

Yes, let's replace the DUT.
Recovered the DUT with a USB stick on Friday 07/13, but forgot to unlock the device. Will update the status once we have results.
Test is still failing.  Screenshot: https://screenshot.googleplex.com/ZGMUKknZogQ
Rebooted Servo and ran test twice locally on the same host and the test is passing now.  Will update autotest results later.  
hah, that's a new 'good-to-know-about-servo'. Hopefully no further action will be needed.
No luck, test is still failing.  
Screenshot: https://screenshot.googleplex.com/NSgxUqLKJY7


Will try replacing the Servo SD card.  
This issue is still going on (though the test passes sometimes).

Did you replace the SD card?
Not yet.  Will do it this afternoon.
Replaced the SD card on the Servo. Now the Servo is not pingable at all. Tried changing network cables, rebooted a few times, etc.
Test still failing, filed report @ go/acs-device-failure
OK, then let's replace the DUT. Be sure we do not send the old device away before we put the new one in the lab cell. Thanks.
Status: Assigned (was: Untriaged)
This bug has an owner, thus, it's been triaged. Changing status to "assigned".
