New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 658214 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug



Sign in to add a comment

Nearly all canary failed: paygen and AUtest fail to install device image.

Project Member Reported by cychiang@chromium.org, Oct 21 2016

Issue description

Starting from 8919.0, there are a lot of failures in paygen and autoupdate autotest:

e.g. https://chromegw.corp.google.com/i/chromeos/builders/auron_paine-release/builds/501


[Auto-Bug]: autoupdate_EndToEndTest.paygen_au_canary_full: retry_count: 1, ABORT: Failed to install device image using payload at http://100.115.219.132:42576/update on chromeos4-row8-rack4-host4. Update failed. Returned update_engine error code: ERROR_CODE=0, ERROR_MESSAGE=ErrorCode::kSuccess. Reported error: AutoservRunError, 12 reports

Auto bug filed at https://bugs.chromium.org/p/chromium/issues/detail?id=640978

Log: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/81830916-chromeos-test/chromeos4-row8-rack4-host4/



    File "/usr/local/autotest/server/server_job.py", line 843, in group_func
      test.runtest(self, url, tag, args, dargs)
    File "/usr/local/autotest/server/test.py", line 289, in runtest
      *logging_args)
    File "/usr/local/autotest/client/common_lib/test.py", line 888, in runtest
      mytest._exec(args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
      _call_test_function(self.execute, *p_args, **p_dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 1815, in run_once
      test_platform.prep_device_for_update(test_conf['source_release'])
    File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 1142, in prep_device_for_update
      self._staged_urls.source_stateful_url)
    File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 965, in _install_source_version
      stateful_url, True)
    File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 940, in _update_via_test_payloads
      perform_update(payload_url, False)
    File "/usr/local/autotest/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py", line 926, in perform_update
      updater.update_image()
    File "/usr/local/autotest/client/common_lib/cros/autoupdater.py", line 241, in update_image
      raise to_raise
  RootFSUpdateError: Failed to install device image using payload at http://100.115.219.132:42576/update on chromeos4-row8-rack4-host4. Update failed. Returned update_engine error code: ERROR_CODE=0, ERROR_MESSAGE=ErrorCode::kSuccess. Reported error: AutoservRunError


8917.0: green.
8918.0: there were a lot of failures in paygen and hwtest steps.
dshi@ has pointed that it was related to lab load issue, and after increasing the capacity, suite job should be ok after that.

8919.0: there are a lot of failures in paygen, some with AU tests too. Not sure if this is related to lab load issue.

The blamelist between 8917 (green) and 8918:
https://crosland.corp.google.com/log/8917.0.0..8918.0.0

Some changes in chromite, some in autotest, but not related to autoupdate.

between 8918.0 and 8919.0:
https://crosland.corp.google.com/log/8918.0.0..8919.0.0

Some changes in chromeos-admin for lab load issue, maybe they introduce this issue ?

I can't see other suspicious CLs between 8917.0 and 8919.0.


Hi Dan, could you please check if this is related to the changes introduced in 8919 in chromeos-admin ?
Set to p0 since it fails across nearly all the boards.
Thanks!
 
I am looking at update_engine.log in   autoupdate_EndToEndTest.paygen_au_canary_full.tgz downloaded from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/81830916-chromeos-test/chromeos4-row8-rack4-host4/

There were two update_engine logs in
autoupdate_EndToEndTest.paygen_au_canary_full/sysinfo/var/log/update_engine
The first one update_engine.20161021-021540 shows success:

[1021/022728:INFO:main.cc(128)] Chrome OS Update Engine terminating with exit code 0

The second one update_engine.20161021-022735 shows failure:


[1021/022800:INFO:omaha_request_action.cc(654)] Posting an Omaha request to http://100.115.219.132:42576/update
[1021/022800:INFO:omaha_request_action.cc(655)] Request: <?xml version="1.0" encoding="UTF-8"?>
<request protocol="3.0" version="ChromeOSUpdateEngine-0.1.0.0" updaterversion="ChromeOSUpdateEngine-0.1.0.0" installsource="ondemandupdate" ismachine="1">
    <os version="Indy" platform="Chrome OS" sp="ForcedUpdate_x86_64"></os>
    <app appid="{0AA3F9E1-A088-9252-50B8-5A1345D09AEB}" version="ForcedUpdate" track="stable-channel" lang="en-US" board="auron_paine" hardware_class="PAINE D25-A2A-C6A-A6P" delta_okay="true" fw_version="" ec_version="" installdate="3577" >
        <updatecheck targetversionprefix=""></updatecheck>
        <event eventtype="54" eventresult="1" previousversion="0.0.0.0"></event>
    </app>
</request>

[1021/022800:INFO:libcurl_http_fetcher.cc(94)] Starting/Resuming transfer
[1021/022800:INFO:libcurl_http_fetcher.cc(106)] Using proxy: no
[1021/022800:INFO:libcurl_http_fetcher.cc(227)] Not setting http(s) curl options because we are running a dev/test image
[1021/022820:INFO:update_attempter.cc(1197)] Already updated boot flags. Skipping.
[1021/023105:ERROR:libcurl_http_fetcher.cc(378)] Unable to get http response code.

^^^^^^^^^^^^^^^^^ An error here.



[1021/023105:INFO:libcurl_http_fetcher.cc(397)] No HTTP response, retry 1
[1021/023115:INFO:libcurl_http_fetcher.cc(94)] Starting/Resuming transfer
[1021/023115:INFO:libcurl_http_fetcher.cc(106)] Using proxy: no
[1021/023115:INFO:libcurl_http_fetcher.cc(227)] Not setting http(s) curl options because we are running a dev/test image
[1021/023416:INFO:libcurl_http_fetcher.cc(375)] HTTP response code: 200
[1021/023416:INFO:libcurl_http_fetcher.cc(451)] Transfer completed (200), 264 bytes downloaded
[1021/023416:INFO:omaha_request_action.cc(914)] Omaha request response: <?xml version="1.0" encoding="UTF-8"?>
  <response protocol="3.0">
    <daystart elapsed_seconds="9264"/>
    <app appid="{87efface-864d-49a5-9bb3-4b050a7c227a}" status="ok">
      <ping status="ok"/>
      <updatecheck status="noupdate"/>
    </app>
  </response>
[1021/023416:INFO:omaha_request_action.cc(787)] No update.
[1021/023416:INFO:metrics.cc(142)] Sending 1 for metric UpdateEngine.Check.Result (enum)
[1021/023416:INFO:prefs.cc(41)] metrics-check-last-reporting-time not present in /var/lib/update_engine/prefs
[1021/023416:INFO:action_processor.cc(116)] ActionProcessor: finished OmahaRequestAction with code ErrorCode::kSuccess

^^^^^^^^^^^^^^^^^ A success here.


[1021/023416:INFO:action_processor.cc(143)] ActionProcessor: starting OmahaResponseHandlerAction
[1021/023416:INFO:omaha_response_handler_action.cc(56)] There are no updates. Aborting.

^^^^^^^^^^^^^^^^^ Aborting here.



[1021/023416:INFO:action_processor.cc(116)] ActionProcessor: finished OmahaResponseHandlerAction with code ErrorCode::kError
[1021/023416:INFO:action_processor.cc(121)] ActionProcessor: Aborting processing due to failure.
[1021/023416:INFO:update_attempter.cc(909)] Processing Done.
[1021/023416:INFO:update_attempter.cc(983)] No update.
[1021/023416:INFO:update_manager-inl.h(52)] ChromeOSPolicy::UpdateCheckAllowed: START
[1021/023416:WARNING:evaluation_context-inl.h(43)] Error reading Variable update_disabled: "No value set for update_disabled"
[1021/023416:WARNING:evaluation_context-inl.h(43)] Error reading Variable target_version_prefix: "No value set for target_version_prefix"
[1021/023416:WARNING:evaluation_context-inl.h(43)] Error reading Variable release_channel_delegated: "No value set for release_channel_delegated"
[1021/023416:INFO:chromeos_policy.cc(291)] Unofficial build, blocking periodic update checks.
[1021/023416:INFO:update_manager-inl.h(74)] ChromeOSPolicy::UpdateCheckAllowed: END

Not usre if these logs are useful.
81830916-chromeos-test%2Fchromeos4-row8-rack4-host4%2Fautoupdate_EndToEndTest.paygen_au_canary_full.tgz
1.8 MB Download
Components: Infra>Client>ChromeOS
There is also something I haven't seen before, for instance in monroe-release build 1122 there are many of these:

82145678 monroe-release/R55-8872.19.0/paygen_au_canary/autoupdate_EndToEndTest_paygen_au_canary_full_5500.100.0 started on: 2016-10-21 01:55:51 status Completed
  236889 Reset started on: 2016-10-21 01:55:35 status PASS
  236886 Repair started on: 2016-10-21 01:55:03 status PASS
  236885 Reset started on: 2016-10-21 01:54:52 status FAIL
  82145352 monroe-release/R55-8872.19.0/paygen_au_dev/autoupdate_EndToEndTest_paygen_au_dev_full_8872.15.0 started on: 2016-10-21 01:41:12 status Completed
  236839 Reset started on: 2016-10-21 01:40:55 status PASS
  82144792 monroe-release/R56-8919.0.0/paygen_au_dev/autoupdate_EndToEndTest_paygen_au_dev_full_5500.100.0 started on: 2016-10-21 01:23:30 status Failed
  236812 Reset started on: 2016-10-1 01:23:12 status PASS
  82131405 monroe-release/R56-8919.0.0/paygen_au_dev/autoupdate_EndToEndTest_paygen_au_dev_full_8919.0.0 started on: 2016-10-21 01:04:49 status Completed
  236752 Reset started on: 2016-10-21 01:04:30 status PASS

Why is the Reset step failing?


Comment 4 by dshi@chromium.org, Oct 21 2016

Re #3

which job is that paygen test?
From the name, I only find
http://cautotest/afe/#tab_id=view_job&object_id=81813778
which passed.

job id 82145678 doesn't exist somehow.
> Why is the Reset step failing?

Run `dut-status -f` for the DUT in question, in the time period
in question.  It'll have a pointer to the reset job logs that
will answer the question.

I'm looking at failures in the vicinity of reset failures
like above, and I see this error message:
    10/21 04:21:06.427 WARNI|         cros_host:1327| cros-version label "cros-version:auron_paine-release/R55-8872.19.0" does not match release version 8743.69.1. Removing the label.

The error came after running an AU test that updated to 8743.69.1,
so the presence of the label suggests provisioning isn't installing
new images.  That would be bad.

Comment 7 by dshi@chromium.org, Oct 21 2016

For this provision failure:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/81843396-chromeos-test/chromeos4-row9-rack8-host1/debug/

The dut was provisioned with the right image:

10/21 04:06:09.274 DEBUG|          ssh_host:0180| Running (ssh) 'cat "/etc/lsb-release"'
...
10/21 04:06:09.499 DEBUG|        base_utils:0280| [stdout] CHROMEOS_RELEASE_DESCRIPTION=8919.0.0-rc2 (Continuous Builder - Builder: N/A) daisy_skate
10/21 04:06:09.499 DEBUG|        base_utils:0280| [stdout] CHROMEOS_RELEASE_NAME=Chromium OS

But failed later in version label check:
10/21 04:06:09.921 ERROR|           control:0071| The host has wrong cros-version label.

10/21 04:06:09.503 WARNI|         cros_host:1327| cros-version label "cros-version:daisy_skate-release/R54-8743.65.0" does not match release version 8919.0.0-rc2. Removing the label.

It almost looks like the provision job failed to update host's version label. Maybe an RPC issue.
I looked at history for one DUT.  The short summary appears
to be that AU tests install a new build, but never change
the cros-version: label.  We're seeing AU tests for one
version interspersed with tests for some other version.

The end result is a sequence like this:
  * Provision and test version X.
  * Run an AU test that installs version Y.
  * Reset discovers that the cros-version: label says "version X"
    but the actual version is Y.  That deletes the bad label, and
    triggers repair, which installs a repair image.
  * The scheduler re-provisions version X for more testing.

NOTE NOTE NOTE
It's not clear whether this sequence is contributing to the original
problem, or if it's unrelated.  DO NOT ASSUME THIS IS RELATED.
DO NOT ASSUME IT'S NOT RELATED.

About #4, sorry I didn't add the job link.  And now I cannot find that log any more!  I thought it was from the first autotest failure in the Paygen stdio from build 1122, in debug/autotest.DEBUG.  The link for that is http://cautotest/tko/retrieve_logs.cgi?job=/results/81812892-chromeos-test/, but it doesn't match the snippet I posted :P  Will be more careful next time.


Comment 10 by dshi@chromium.org, Oct 22 2016

Status: WontFix (was: Assigned)
the provision issue went away, likely some load flake in lab.

Sign in to add a comment