
Issue 678009

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




veyron_minnie: dut provision failed and then blamed on video_ChromeHWDecodeUsed.vp8.arc timeout

Reported by semenzato@chromium.org (Project Member), Jan 3 2017

Issue description

I can't find much info, other than that the test timed out. Assigning to infra for the moment because of the similarity with issue 663926, but it could be test flake.

https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-paladin/builds/1278

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=94176377

Xixuan, can you tell (or guess) if this is the same as bug 663926?

Comment 1 by nxia@chromium.org, Jan 3 2017

Host chromeos4-row9-rack11-host22 failed to provision; not sure whether that affects the video_ChromeHWDecodeUsed.vp8.arc test.
Waiting on xixuan@ for more input.
Looking at the DUT for that job: provision failed, then repair succeeded, then the job failed without any logs.

provision: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row9-rack11-host22/546048-provision
repair: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row9-rack11-host22/546170-repair/
job (no logs): https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/94176679-chromeos-test/chromeos4-row9-rack11-host22/


So really, this is a case of a provision failure resulting in us blaming the job, with the failure looking like a timeout.
KI: issue 632886

Comment 4 by nxia@chromium.org, Jan 3 2017

Cc: ayatane@chromium.org
+ ayatane@ to confirm that 'blaming the wrong job' is what happened here.
But what is the actual provision error?  Both the first and second provision attempts produce unhelpful error messages.

01/03 08:44:49.802 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/get_au_status?build_name=veyron_minnie-paladin/R57-9147.0.0-rc3&force_update=True&pid=805&host_name=chromeos4-row9-rack11-host22&full_update=False"''
01/03 08:44:50.986 DEBUG|        dev_server:1723| Current CrOS auto-update status: perform stateful update
01/03 08:45:01.034 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/get_au_status?build_name=veyron_minnie-paladin/R57-9147.0.0-rc3&force_update=True&pid=805&host_name=chromeos4-row9-rack11-host22&full_update=False"''
01/03 08:45:02.231 DEBUG|        dev_server:1785| Failed to trigger auto-update process on devserver
01/03 08:45:02.232 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/handler_cleanup?pid=805&host_name=chromeos4-row9-rack11-host22"''
01/03 08:45:03.448 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/collect_cros_au_log?pid=805&host_name=chromeos4-row9-rack11-host22"''
01/03 08:45:06.766 DEBUG|        dev_server:1617| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row9-rack11-host22/546048-provision/20170301083048/autoupdate_logs/CrOS_update_chromeos4-row9-rack11-host22_805.log
01/03 08:45:06.774 DEBUG|        dev_server:1884| Exception raised on auto_update attempt #1:
 Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 223, in TriggerAU
    self._StatefulUpdate(chromeos_AU)
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 136, in _StatefulUpdate
    cros_updater.UpdateStateful()
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 512, in UpdateStateful
    raise StatefulUpdateError(error_msg % e)
StatefulUpdateError: Failed to perform stateful partition update: RunCommandError("cwd=None, extra env={'LC_MESSAGES': 'C'}", <chromite.lib.cros_build_lib.CommandResult object at 0x7fe7253ad5d0>, None)
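
(For reference: the unhelpful part of the message above is that the RunCommandError appears to be formatted via its repr, so the CommandResult's captured output never reaches the provision log. Below is a minimal sketch, not chromite code, of the wrapping pattern and of how the failed command's stderr could be surfaced instead; class names mirror the traceback, everything else is a simplified stand-in.)

class CommandResult(object):
    def __init__(self, cmd, returncode, error):
        self.cmd, self.returncode, self.error = cmd, returncode, error

class RunCommandError(Exception):
    def __init__(self, msg, result):
        super().__init__(msg)
        self.result = result

class StatefulUpdateError(Exception):
    pass

def update_stateful(run_stateful_update_cmd):
    # run_stateful_update_cmd is a hypothetical callable standing in for the
    # ssh/stateful_update invocation shown later in this thread.
    try:
        run_stateful_update_cmd()
    except RunCommandError as e:
        # Surfacing e.result.error (the captured stderr) here would have put
        # tar's "No space left on device" directly into the provision log.
        raise StatefulUpdateError(
            "Failed to perform stateful partition update: %s\nstderr: %s"
            % (e, e.result.error))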

Comment 6 by nxia@chromium.org, Jan 7 2017

Cc: dgarr...@chromium.org
Summary: veyron_minnie: dut provision failed and then blamed on video_ChromeHWDecodeUsed.vp8.arc timeout (was: video_ChromeHWDecodeUsed.vp8.arc timeout on veyron_minnie)
Also saw 'Servo verification failed', as shown below.

xixuan@ would know more about the real provision failure reason; waiting for xixuan@'s input.

ayatane@, can you comment on #4?


01/03 08:31:42.543 ERROR|        servo_host:0860| Servo verification failed for chromeos4-row9-rack11-host22-servo
Traceback (most recent call last):
  File "/usr/local/autotest/server/hosts/servo_host.py", line 856, in create_servo_host
    newhost.verify()
  File "/usr/local/autotest/server/hosts/servo_host.py", line 610, in verify
    self._repair_strategy.verify(self, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 685, in verify
    self._verify_root._verify_host(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 326, in _verify_host
    self._verify_dependencies(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 199, in _verify_dependencies
    self._verify_list(host, self._dependency_list, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 188, in _verify_list
    raise AutoservVerifyDependencyError(self, failures)
AutoservVerifyDependencyError: Servo initialize timed out.
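
(The traceback above walks autotest's repair framework: a root verifier recursively verifies its dependency verifiers, and any failing dependency, here servo initialization timing out, aborts the whole tree with AutoservVerifyDependencyError. A minimal sketch of that pattern follows; class and method names mirror the traceback, but the bodies are simplified stand-ins, and the real code also threads a 'silent' flag through.)

class AutoservVerifyDependencyError(Exception):
    def __init__(self, verifier, failures):
        super().__init__("; ".join(str(f) for _, f in failures))
        self.verifier, self.failures = verifier, failures

class Verifier(object):
    def __init__(self, tag, dependencies=()):
        self.tag = tag
        self._dependency_list = list(dependencies)

    def verify(self, host):
        """Subclasses override this with the actual check; raise on failure."""

    def _verify_host(self, host):
        # Every dependency must pass before this verifier itself runs.
        self._verify_dependencies(host)
        self.verify(host)

    def _verify_dependencies(self, host):
        failures = []
        for dep in self._dependency_list:
            try:
                dep._verify_host(host)
            except Exception as e:
                failures.append((dep.tag, e))
        if failures:
            # The point of failure in the traceback above: one dependency
            # ("Servo initialize") timed out, so verification of the whole
            # tree aborts here.
            raise AutoservVerifyDependencyError(self, failures)

# e.g. a root verifier whose servo-initialize dependency times out:
class InitVerifier(Verifier):
    def verify(self, host):
        raise TimeoutError("Servo initialize timed out.")

root = Verifier("servod", dependencies=[InitVerifier("init")])
# root._verify_host("chromeos4-row9-rack11-host22-servo")
#   -> AutoservVerifyDependencyError: Servo initialize timed out.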
"Blaming the job" sounds like the case here.
I feel it's different from  bug 663926 . In that bug, the job has its logs and an incomplete parse.log, which at least means the job is executed and finished.

In this bug, the autoupdate_logs.tgz of the failed provision job (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row9-rack11-host22/546048-provision) shows:

2017/01/03 08:44:01.125 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpn4zwJ8/testing_rsa root@chromeos4-row9-rack11-host22 -- sh /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.bu9kUsfdGB/stateful_update /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.bu9kUsfdGB/stateful.tgz '--stateful_change=clean'
Warning: Permanently added 'chromeos4-row9-rack11-host22,100.115.200.155' (RSA) to the list of known hosts.
Reading local payload /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.bu9kUsfdGB/stateful.tgz
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/battor_chrome.json: Wrote only 7168 of 10240 bytes
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/lthi_cats.json.gz: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/flow_big.json: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/counter_tracks.html: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/quads.json: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/big_trace.json: Cannot write: No space left on device
...

It seems something is wrong with this DUT. Two problems need investigation:

1. Normally, can a DUT run so low on space that, after stateful.tgz and update.gz are transferred, there is no room left to unpack them? Since this error doesn't seem to happen widely, the answer should be no. I'm not sure whether this DUT's lack of space is due to the test or to the DUT itself being bad (a hypothetical pre-flight check is sketched below).

2. If a DUT is full for some reason, how do we make it work again? Will a repair job free up space on the DUT?
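
(Regarding question 1: one mitigation would be a pre-flight space check on the DUT before pushing stateful.tgz and update.gz. A hypothetical sketch, assuming plain ssh access and df's POSIX output format; the function names, threshold, and placement in the flow are all illustrative, not part of the actual provision code.)

import subprocess

def stateful_free_kib(host):
    # Query free space on the DUT's stateful partition over ssh.  -P forces
    # POSIX single-line output so field 4 is reliably "Available".
    out = subprocess.check_output(
        ["ssh", "root@%s" % host,
         "df -Pk /mnt/stateful_partition | tail -1"]).decode()
    return int(out.split()[3])

def check_space_for_payload(host, payload_kib, slack=1.5):
    # Hypothetical guard: fail fast with a clear message instead of letting
    # tar hit "No space left on device" halfway through unpacking.
    free = stateful_free_kib(host)
    if free < payload_kib * slack:
        raise RuntimeError(
            "only %d KiB free on %s; need ~%d KiB for the stateful payload"
            % (free, host, int(payload_kib * slack)))

# e.g. check_space_for_payload("chromeos4-row9-rack11-host22", 2 * 1024 * 1024)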


Status: Archived (was: Untriaged)
This bug has not been touched in over a year.  It is probably no longer relevant.
