Nearly all canaries failed: paygen and AU tests fail to install device image.
Issue description

Starting from 8919.0, there are a lot of failures in paygen and the autoupdate autotest, e.g.:
https://chromegw.corp.google.com/i/chromeos/builders/auron_paine-release/builds/501

[Auto-Bug]: autoupdate_EndToEndTest.paygen_au_canary_full: retry_count: 1, ABORT: Failed to install device image using payload at http://100.115.219.132:42576/update on chromeos4-row8-rack4-host4. Update failed. Returned update_engine error code: ERROR_CODE=0, ERROR_MESSAGE=ErrorCode::kSuccess. Reported error: AutoservRunError, 12 reports

Auto bug filed at https://bugs.chromium.org/p/chromium/issues/detail?id=640978
Log: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/81830916-chromeos-test/chromeos4-row8-rack4-host4/

  File "/usr/local/autotest/server/server_job.py", line 843, in group_func
    test.runtest(self, url, tag, args, dargs)
  File "/usr/local/autotest/server/test.py", line 289, in runtest
    *logging_args)
  File "/usr/local/autotest/client/common_lib/test.py", line 888, in runtest
    mytest._exec(args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 1815, in run_once
    test_platform.prep_device_for_update(test_conf['source_release'])
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 1142, in prep_device_for_update
    self._staged_urls.source_stateful_url)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 965, in _install_source_version
    stateful_url, True)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 940, in _update_via_test_payloads
    perform_update(payload_url, False)
  File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 926, in perform_update
    updater.update_image()
  File "/usr/local/autotest/client/common_lib/cros/autoupdater.py", line 241, in update_image
    raise to_raise
RootFSUpdateError: Failed to install device image using payload at http://100.115.219.132:42576/update on chromeos4-row8-rack4-host4. Update failed. Returned update_engine error code: ERROR_CODE=0, ERROR_MESSAGE=ErrorCode::kSuccess. Reported error: AutoservRunError

8917.0: green.
8918.0: there were a lot of failures in the paygen and hwtest steps. dshi@ pointed out that this was related to the lab load issue, and that suite jobs should be OK once the capacity was increased.
8919.0: there are a lot of failures in paygen, some with AU tests too. Not sure if this is related to the lab load issue.

The blamelist between 8917 (green) and 8918: https://crosland.corp.google.com/log/8917.0.0..8918.0.0
Some changes in chromite, some in autotest, but none related to autoupdate.

Between 8918.0 and 8919.0: https://crosland.corp.google.com/log/8918.0.0..8919.0.0
Some changes in chromeos-admin for the lab load issue; maybe they introduced this issue?

I can't see other suspicious CLs between 8917.0 and 8919.0.

Hi Dan, could you please check whether this is related to the changes introduced in 8919 in chromeos-admin?
Set to P0 since it fails across nearly all the boards. Thanks!
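One odd detail in the failure above is that the update is reported as failed while update_engine's own error code is 0 (ErrorCode::kSuccess). A minimal sketch for pulling that pair out of a failure message when triaging many reports follows. The helper name and the regular expression are assumptions made for illustration; they are not part of autotest.

```python
import re

# Hypothetical pattern matching the "ERROR_CODE=..., ERROR_MESSAGE=..." pair
# as it appears in the failure text quoted in this bug.
ERROR_RE = re.compile(r"ERROR_CODE=(\d+), ERROR_MESSAGE=(ErrorCode::\w+)")


def parse_update_engine_error(message):
    """Return (code, name) from a failure message, or None if absent."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


msg = ("Update failed. Returned update_engine error code: ERROR_CODE=0, "
       "ERROR_MESSAGE=ErrorCode::kSuccess. Reported error: AutoservRunError")
code, name = parse_update_engine_error(msg)

# A failed install with code 0 suggests the failure happened outside
# update_engine's last recorded attempt (this is an interpretation, not
# something the log states).
failed_with_success_code = (code == 0)
print(code, name, failed_with_success_code)
```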
Oct 21 2016
Oct 21 2016
There is also something I haven't seen before, for instance in monroe-release build 1122 there are many of these:

82145678 monroe-release/R55-8872.19.0/paygen_au_canary/autoupdate_EndToEndTest_paygen_au_canary_full_5500.100.0 started on: 2016-10-21 01:55:51 status Completed
  236889 Reset started on: 2016-10-21 01:55:35 status PASS
  236886 Repair started on: 2016-10-21 01:55:03 status PASS
  236885 Reset started on: 2016-10-21 01:54:52 status FAIL
82145352 monroe-release/R55-8872.19.0/paygen_au_dev/autoupdate_EndToEndTest_paygen_au_dev_full_8872.15.0 started on: 2016-10-21 01:41:12 status Completed
  236839 Reset started on: 2016-10-21 01:40:55 status PASS
82144792 monroe-release/R56-8919.0.0/paygen_au_dev/autoupdate_EndToEndTest_paygen_au_dev_full_5500.100.0 started on: 2016-10-21 01:23:30 status Failed
  236812 Reset started on: 2016-10-1 01:23:12 status PASS
82131405 monroe-release/R56-8919.0.0/paygen_au_dev/autoupdate_EndToEndTest_paygen_au_dev_full_8919.0.0 started on: 2016-10-21 01:04:49 status Completed
  236752 Reset started on: 2016-10-21 01:04:30 status PASS

Why is the Reset step failing?
Oct 21 2016
Re #3: which job is that paygen test from? From the name, I can only find http://cautotest/afe/#tab_id=view_job&object_id=81813778, which passed. Job id 82145678 doesn't seem to exist.
Oct 21 2016
> Why is the Reset step failing?

Run `dut-status -f` for the DUT in question, in the time period in question. It'll have a pointer to the reset job logs that will answer the question.
Oct 21 2016
I'm looking at failures in the vicinity of reset failures
like above, and I see this error message:
10/21 04:21:06.427 WARNI| cros_host:1327| cros-version label "cros-version:auron_paine-release/R55-8872.19.0" does not match release version 8743.69.1. Removing the label.
The error came after running an AU test that updated to 8743.69.1,
so the presence of the label suggests provisioning isn't installing
new images. That would be bad.
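The check behind that warning can be sketched roughly as follows: take the build number out of the cros-version label and compare it with the release version the DUT reports. This is only an illustration; the function names and the assumed label layout (`<builder>/R<milestone>-<build>`) are inferred from the log line above and are not the actual cros_host code.

```python
import re

CROS_VERSION_PREFIX = "cros-version:"


def label_build_version(label):
    """Return the build number (e.g. '8872.19.0') from a cros-version label.

    Returns None if the label does not look like
    'cros-version:<builder>/R<milestone>-<build>' (assumed layout).
    """
    if not label.startswith(CROS_VERSION_PREFIX):
        return None
    value = label[len(CROS_VERSION_PREFIX):]
    match = re.fullmatch(r"[^/]+/R\d+-(\S+)", value)
    return match.group(1) if match else None


def label_matches_release(label, release_version):
    """True if the label's build number equals the DUT's release version."""
    return label_build_version(label) == release_version


label = "cros-version:auron_paine-release/R55-8872.19.0"
# The DUT reported 8743.69.1 after the AU test, so the label is stale.
print(label_matches_release(label, "8743.69.1"))
print(label_matches_release(label, "8872.19.0"))
```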
Oct 21 2016
For this provision failure:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/81843396-chromeos-test/chromeos4-row9-rack8-host1/debug/

The DUT was provisioned with the right image:
10/21 04:06:09.274 DEBUG| ssh_host:0180| Running (ssh) 'cat "/etc/lsb-release"'
...
10/21 04:06:09.499 DEBUG| base_utils:0280| [stdout] CHROMEOS_RELEASE_DESCRIPTION=8919.0.0-rc2 (Continuous Builder - Builder: N/A) daisy_skate
10/21 04:06:09.499 DEBUG| base_utils:0280| [stdout] CHROMEOS_RELEASE_NAME=Chromium OS

But it failed later in the version label check:
10/21 04:06:09.921 ERROR| control:0071| The host has wrong cros-version label.
10/21 04:06:09.503 WARNI| cros_host:1327| cros-version label "cros-version:daisy_skate-release/R54-8743.65.0" does not match release version 8919.0.0-rc2. Removing the label.

It almost looks like the provision job failed to update the host's version label. Maybe an RPC issue.
Oct 21 2016
I looked at history for one DUT. The short summary appears
to be that AU tests install a new build, but never change
the cros-version: label. We're seeing AU tests for one
version interspersed with tests for some other version.
The end result is a sequence like this:
* Provision and test version X.
* Run an AU test that installs version Y.
* Reset discovers that the cros-version: label says "version X"
but the actual version is Y. That deletes the bad label, and
triggers repair, which installs a repair image.
* The scheduler re-provisions version X for more testing.
NOTE NOTE NOTE
It's not clear whether this sequence is contributing to the original
problem, or if it's unrelated. DO NOT ASSUME THIS IS RELATED.
DO NOT ASSUME IT'S NOT RELATED.
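The sequence described above can be reproduced with a toy model. This is a sketch for reasoning about the label state only: the class, its methods, and the version strings are all hypothetical, and it assumes that repair leaves the DUT without a cros-version label until the next provision.

```python
class SimulatedDut:
    """Toy model of a DUT's installed build and its cros-version label."""

    def __init__(self):
        self.installed = None   # build actually on the device
        self.label = None       # build recorded in the cros-version label
        self.log = []

    def provision(self, version):
        # Provisioning installs the build and records it in the label.
        self.installed = version
        self.label = version
        self.log.append("provision " + version)

    def run_au_test(self, target_version):
        # The AU test installs a new build but never touches the label.
        self.installed = target_version
        self.log.append("au_test -> " + target_version)

    def repair(self, repair_version):
        # Assumption: repair installs a repair image and sets no label.
        self.installed = repair_version
        self.log.append("repair -> " + repair_version)

    def reset(self, repair_version):
        # Reset removes a stale label and triggers repair.
        if self.label is not None and self.label != self.installed:
            self.label = None
            self.log.append("reset FAIL (label removed)")
            self.repair(repair_version)
            return False
        self.log.append("reset PASS")
        return True


dut = SimulatedDut()
dut.provision("X")
dut.run_au_test("Y")
dut.reset("repair-image")   # label says X, device has Y
dut.provision("X")          # scheduler re-provisions for more testing
print("\n".join(dut.log))
```

In this model each AU test costs one failed reset, one repair, and one extra provision, which would add lab load even if it is unrelated to the original failure.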
Oct 21 2016
About #4, sorry I didn't add the job link. And now I cannot find that log any more! I thought it was from the first autotest failure in the Paygen stdio from build 1122, in debug/autotest.DEBUG. The link for that is http://cautotest/tko/retrieve_logs.cgi?job=/results/81812892-chromeos-test/, but it doesn't match the snippet I posted :P Will be more careful next time.
Oct 22 2016
The provision issue went away; it was likely some load flake in the lab.
Comment 1 by cychiang@chromium.org, Oct 21 2016