veyron_minnie: dut provision failed and then blamed on video_ChromeHWDecodeUsed.vp8.arc timeout
Issue description
I can't find much info, other than the test timed out. Assigning to infra for the moment because of the similarity with issue 663926, but it could be test flake.
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-paladin/builds/1278
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=94176377
Xixuan, can you tell (or guess) if this is the same as bug 663926?
Comment 1 by nxia@chromium.org, Jan 3 2017
Looking at the DUT for that job, provision failed, then repair succeeded, then the job failed without any logs.
provision: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row9-rack11-host22/546048-provision
repair: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row9-rack11-host22/546170-repair/
job (no logs): https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/94176679-chromeos-test/chromeos4-row9-rack11-host22/
So really, this is a case of a provision failure resulting in us blaming the job, with the failure looking like a timeout.
Jan 3 2017
KI: issue 632886
Jan 3 2017
+ ayatane@ to confirm 'blaming on the wrong job' is the case here.
Jan 3 2017
But what is the actual provision error? Both the first and second provision attempts produce unhelpful error messages.
01/03 08:44:49.802 DEBUG| base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/get_au_status?build_name=veyron_minnie-paladin/R57-9147.0.0-rc3&force_update=True&pid=805&host_name=chromeos4-row9-rack11-host22&full_update=False"''
01/03 08:44:50.986 DEBUG| dev_server:1723| Current CrOS auto-update status: perform stateful update
01/03 08:45:01.034 DEBUG| base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/get_au_status?build_name=veyron_minnie-paladin/R57-9147.0.0-rc3&force_update=True&pid=805&host_name=chromeos4-row9-rack11-host22&full_update=False"''
01/03 08:45:02.231 DEBUG| dev_server:1785| Failed to trigger auto-update process on devserver
01/03 08:45:02.232 DEBUG| base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/handler_cleanup?pid=805&host_name=chromeos4-row9-rack11-host22"''
01/03 08:45:03.448 DEBUG| base_utils:0185| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/collect_cros_au_log?pid=805&host_name=chromeos4-row9-rack11-host22"''
01/03 08:45:06.766 DEBUG| dev_server:1617| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row9-rack11-host22/546048-provision/20170301083048/autoupdate_logs/CrOS_update_chromeos4-row9-rack11-host22_805.log
01/03 08:45:06.774 DEBUG| dev_server:1884| Exception raised on auto_update attempt #1:
Traceback (most recent call last):
File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 223, in TriggerAU
self._StatefulUpdate(chromeos_AU)
File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 136, in _StatefulUpdate
cros_updater.UpdateStateful()
File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 512, in UpdateStateful
raise StatefulUpdateError(error_msg % e)
StatefulUpdateError: Failed to perform stateful partition update: RunCommandError("cwd=None, extra env={'LC_MESSAGES': 'C'}", <chromite.lib.cros_build_lib.CommandResult object at 0x7fe7253ad5d0>, None)
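For context on how that status is reported: the provision log above drives the update through the devserver's HTTP endpoints (get_au_status, handler_cleanup, collect_cros_au_log), polling over ssh + curl. Below is a rough sketch of that polling loop, under the assumption that it simply loops until a terminal status string appears; the helper names, status strings, and poll interval are illustrative, not the actual dev_server.py implementation.

# Sketch of the devserver auto-update polling seen in the log above.
# Only the endpoints and query parameters come from the log; everything
# else (helper names, status strings, timing) is an assumption.
import subprocess
import time

DEVSERVER_IP = '100.115.219.135'
DEVSERVER_URL = 'http://%s:8082' % DEVSERVER_IP

def devserver_call(endpoint, params):
    # The log issues these calls as `ssh <devserver> 'curl "<url>"'`,
    # so this helper does the same.
    query = '&'.join('%s=%s' % (k, v) for k, v in params.items())
    url = '%s/%s?%s' % (DEVSERVER_URL, endpoint, query)
    return subprocess.check_output(['ssh', DEVSERVER_IP, 'curl "%s"' % url])

def wait_for_au(build_name, pid, host_name, poll_secs=10):
    params = dict(build_name=build_name, force_update=True, pid=pid,
                  host_name=host_name, full_update=False)
    for _ in range(360):  # cap polling at roughly an hour
        status = devserver_call('get_au_status', params)
        print('Current CrOS auto-update status: %s' % status)
        # Terminal-state check is an assumed convention, not the real protocol.
        if 'Completed' in status or 'Failed' in status:
            break
        time.sleep(poll_secs)
    # On exit, provision cleans up the handler and collects the per-attempt log.
    cleanup = dict(pid=pid, host_name=host_name)
    devserver_call('handler_cleanup', cleanup)
    return devserver_call('collect_cros_au_log', cleanup)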
Jan 7 2017
Also saw 'Servo verification failed' as below.
xixuan@ would know more about the real provision failure reason; waiting for xixuan@'s input.
ayatane@, can you comment on #4?
01/03 08:31:42.543 ERROR| servo_host:0860| Servo verification failed for chromeos4-row9-rack11-host22-servo
Traceback (most recent call last):
File "/usr/local/autotest/server/hosts/servo_host.py", line 856, in create_servo_host
newhost.verify()
File "/usr/local/autotest/server/hosts/servo_host.py", line 610, in verify
self._repair_strategy.verify(self, silent)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 685, in verify
self._verify_root._verify_host(host, silent)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 326, in _verify_host
self._verify_dependencies(host, silent)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 199, in _verify_dependencies
self._verify_list(host, self._dependency_list, silent)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 188, in _verify_list
raise AutoservVerifyDependencyError(self, failures)
AutoservVerifyDependencyError: Servo initialize timed out.
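For what it's worth, the shape of that traceback reflects the verifier dependency chain in client/common_lib/hosts/repair.py: a verifier runs its dependency verifiers first, and any dependency failure is re-raised wrapped in AutoservVerifyDependencyError before the verifier's own check runs, which is why a plain servo initialization timeout surfaces under that exception type. A simplified, hypothetical sketch of that control flow follows; only the method names and exception type come from the traceback, the class shapes and bodies are assumptions.

# Rough sketch of the verifier dependency flow implied by the traceback above.
class AutoservVerifyDependencyError(Exception):
    """Raised when one or more dependencies of a verifier fail."""
    def __init__(self, verifier, failures):
        super(AutoservVerifyDependencyError, self).__init__(
            '; '.join(str(f) for f in failures))
        self.verifier = verifier
        self.failures = failures

class Verifier(object):
    def __init__(self, dependencies=()):
        self._dependency_list = list(dependencies)

    def verify(self, host):
        """The verifier's own check; subclasses override and raise on failure."""
        raise NotImplementedError

    def _verify_list(self, host, verifiers):
        failures = []
        for dep in verifiers:
            try:
                dep._verify_host(host)
            except Exception as e:
                failures.append(e)
        if failures:
            # A single failing dependency (here, servo initialization timing
            # out) is re-raised under the wrapper type seen in the traceback.
            raise AutoservVerifyDependencyError(self, failures)

    def _verify_dependencies(self, host):
        self._verify_list(host, self._dependency_list)

    def _verify_host(self, host):
        # Dependencies run first; the verifier's own check only runs if they pass.
        self._verify_dependencies(host)
        self.verify(host)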
Jan 7 2017
"Blaming the job" sounds like the case here.
Jan 7 2017
I feel this is different from bug 663926. In that bug, the job has its logs and an incomplete parse.log, which at least means the job was executed and finished. In this bug, checking the autoupdate_logs.tgz of the failed provision job:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row9-rack11-host22/546048-provision
the log shows:
2017/01/03 08:44:01.125 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpn4zwJ8/testing_rsa root@chromeos4-row9-rack11-host22 -- sh /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.bu9kUsfdGB/stateful_update /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.bu9kUsfdGB/stateful.tgz '--stateful_change=clean'
Warning: Permanently added 'chromeos4-row9-rack11-host22,100.115.200.155' (RSA) to the list of known hosts.
Reading local payload /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.bu9kUsfdGB/stateful.tgz
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/battor_chrome.json: Wrote only 7168 of 10240 bytes
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/lthi_cats.json.gz: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/flow_big.json: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/counter_tracks.html: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/quads.json: Cannot write: No space left on device
tar: dev_image_new/telemetry/src/third_party/catapult/tracing/test_data/big_trace.json: Cannot write: No space left on device
...
It seems something is wrong with this DUT. Two problems need investigation:
1. Normally, can a DUT's free space be so small that after stateful.tgz and update.gz are transferred there is no room left to unpack them? Since this error doesn't happen widely, the answer should be no. I'm not sure whether this DUT's lack of space is caused by the test or whether it's simply a bad DUT.
2. If a DUT is full for some reason, how do we make it work again? Will a repair job free some space on the DUT?
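On question 1, this could be checked directly on the DUT: the stateful payload is unpacked under /mnt/stateful_partition, so comparing the available space there against the size of the copied stateful.tgz would tell whether this was a one-off full disk or a DUT that is systematically too small. A hypothetical helper for that check is sketched below; the host name and mount point come from this bug, while the function names, df/du invocations, and slack factor are assumptions, not part of autotest.

# Hypothetical check: does the DUT have room to unpack the stateful payload?
import subprocess

DUT = 'root@chromeos4-row9-rack11-host22'
STATEFUL_MOUNT = '/mnt/stateful_partition'

def dut_free_kib(host, mount_point=STATEFUL_MOUNT):
    # Available KiB on the stateful partition, via df over ssh.
    out = subprocess.check_output(
        ['ssh', host, "df -k %s | tail -1 | awk '{print $4}'" % mount_point])
    return int(out.strip())

def payload_kib(host, payload_path):
    # Size of the already-copied stateful.tgz on the DUT, in KiB.
    out = subprocess.check_output(['ssh', host, 'du -k %s' % payload_path])
    return int(out.split()[0])

def has_room_for_stateful(host, payload_path, slack=3.0):
    # The tarball expands when unpacked, so require several times its size free.
    return dut_free_kib(host) > slack * payload_kib(host, payload_path)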
Mar 10 2018
This bug has not been touched in over a year. It is probably no longer relevant.