
Issue 618727

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Provision succeeds despite SSH errors, but is still marked as failed

Project Member Reported by gwendal@chromium.org, Jun 9 2016

Issue description


Looking at https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_mighty-paladin/builds/2249

The build is marked as failed. Several errors are reported, all related to SSH timeouts. One type is:

[Test-Logs]: provision: FAIL: Failed to install device image using payload at http://100.107.160.7:8082/update/veyron_mighty-paladin/R53-8431.0.0-rc2 on chromeos4-row6-rack11-host9. 
Update failed. Returned update_engine error code: ERROR_CODE=37, ERROR_MESSAGE=ErrorCode::kOmahaErrorInHTTPResponse. Reported error: AutoservRunError

However, looking into the update engine log file (enclosed), it looks like a transient error.

In https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/66232641-chromeos-test/chromeos4-row6-rack11-host9/debug/
We fail when the update engine reports:

[0609/061007:INFO:payload_state.cc(247)] Updating payload state for error code: 37 (ErrorCode::kOmahaErrorInHTTPResponse)
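
To spot this kind of entry when scanning an update_engine log, a regular expression over the line quoted above is enough. This is a minimal sketch: the pattern is inferred from that single log line, and `find_error_codes` is a hypothetical helper, not part of autotest or update_engine.

```python
import re

# Pattern inferred from the one payload_state.cc line quoted in this issue;
# other update_engine versions may format the line differently.
ERROR_LINE = re.compile(
    r"Updating payload state for error code: "
    r"(?P<code>\d+) \((?P<name>ErrorCode::\w+)\)"
)


def find_error_codes(log_text):
    """Return a (code, name) tuple for every payload-state error line."""
    return [
        (int(match.group("code")), match.group("name"))
        for match in ERROR_LINE.finditer(log_text)
    ]


sample = (
    "[0609/061007:INFO:payload_state.cc(247)] Updating payload state "
    "for error code: 37 (ErrorCode::kOmahaErrorInHTTPResponse)"
)
print(find_error_codes(sample))
```

Counting how many such lines appear, and whether a later attempt succeeds, would help tell a transient failure from a persistent one.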

But we retry later and succeed.

Should the test give the updater a chance to retry?
 
66232641-chromeos-test%2Fchromeos4-row6-rack11-host9%2Fsysinfo%2Fupdate_engine%2Fupdate_engine.20160609-060342.txt
14.0 KB
Cc: xixuan@chromium.org
Looking at the debug log and talking with Xixuan, the retry mechanism for autoupdate is not kicking in.

The warning "Autoupdate did not complete." only appears once. 

In the log, the updater is launched at 06:03:45.983.
At 06:10:08.227 it gives up.
We collect logs until 06:11:05.263 and only then check the devserver.
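
The intervals implied by these three timestamps can be checked with a few lines of arithmetic. This is only an illustration; the variable names are mine and nothing here comes from the autotest code.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"


def seconds_between(start, end):
    """Return the elapsed seconds between two same-day HH:MM:SS.mmm stamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


# Timestamps copied from the debug log discussed in this comment.
launched = "06:03:45.983"
gave_up = "06:10:08.227"
logs_collected = "06:11:05.263"

update_duration = seconds_between(launched, gave_up)
gap_before_check = seconds_between(gave_up, logs_collected)

print(f"update ran for {update_duration:.3f} s")            # 382.244 s
print(f"devserver checked {gap_before_check:.3f} s later")  # 57.036 s
```

So the update attempt ran for a little over six minutes, and roughly a minute passed between the failure and the devserver check.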

We retry once, and only if the devserver is deemed unhealthy.
But:

1. We check the health of the devserver from the autotest worker, not from the DUT. If there is network congestion between the devserver and the DUTs while the path between the autotest worker and the devserver is clear, the devserver looks healthy and we do not retry.

2. The devserver had a full minute to recover before the health check, so after a transient network error it also looks healthy and we do not retry either.

Should we relax the test and retry updater.run_update() unconditionally in machine_install?
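
To make the proposal concrete, here is a minimal sketch of an unconditional retry. It is not the autotest implementation: `run_update_with_retry` and `UpdateError` are hypothetical names, and the real signature and exception type of `updater.run_update()` are not shown in this issue.

```python
import time


class UpdateError(Exception):
    """Hypothetical stand-in for whatever a failed update attempt raises."""


def run_update_with_retry(run_update, attempts=2, delay_seconds=0.0, sleep=time.sleep):
    """Call run_update() up to `attempts` times, whatever the devserver health.

    The last failure is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_update()
        except UpdateError as error:
            last_error = error
            if attempt < attempts:
                # Give a transient network problem some time to clear.
                sleep(delay_seconds)
    raise last_error
```

In `machine_install` this would wrap the existing call, for example `run_update_with_retry(updater.run_update)`, assuming the failure can be caught as a single exception type. The trade-off is that a genuinely broken payload costs one more full update attempt before the provision is reported as failed.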

Cc: d...@chromium.org
Components: -Infra Infra>Client>ChromeOS
Cc: levarum@chromium.org
Is this being fixed? Looks like it happened again at:
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-paladin/builds/1258
Labels: -Pri-2 Pri-1
Owner: xixuan@chromium.org
Status: Assigned (was: Untriaged)
Possibly happened again: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-paladin/builds/1290
Status: WontFix (was: Assigned)
This won't happen with the new provision code flow, so closing it for now.
