Issue 759133

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: ----
Type: ----




ninja-release:1437 failed: ssh timed out: connect to host

Project Member Reported by briannorris@chromium.org, Aug 25 2017

Issue description

ninja-release:1437 failed

Builders failed on: 
- ninja-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/ninja-release/1437



ssh: connect to host chromeos4-row3-rack9-host2 port 22: Connection timed out


It looks like a couple of DUTs couldn't be reached (neither ping nor SSH). If I'm reading this right, they were also unreachable for the previous test runs (on a different release branch). It's not clear to me that we even attempted repair properly, though; viceroy tells me [1] that one of the DUTs was repaired, but the other didn't even try before the canary was aborted with "infrastructure issues".

Am I reading this wrong, or is this strange behavior?

[1] https://viceroy.corp.google.com/chromeos/suite_details?job_id=137190644
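
For anyone reproducing the "neither ping nor SSH" observation by hand, here's a minimal sketch of that kind of reachability probe. It only approximates what a verify/repair step would do (the real autotest code is different), and the timeout values are assumptions:

#!/usr/bin/env python3
# Rough DUT reachability probe -- illustrative only, not the autotest verify code.
import subprocess

SSH_TIMEOUT_SECS = 10  # assumed value, not taken from the lab config

def pingable(host):
    # True if the host answers a single ICMP echo within 5 seconds.
    return subprocess.call(
        ['ping', '-c', '1', '-W', '5', host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def sshable(host):
    # True if an SSH connection can be established and run a no-op command.
    return subprocess.call(
        ['ssh', '-o', 'BatchMode=yes',
         '-o', 'ConnectTimeout=%d' % SSH_TIMEOUT_SECS,
         host, 'true'],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

if __name__ == '__main__':
    dut = 'chromeos4-row3-rack9-host2'
    print('%s: ping=%s ssh=%s' % (dut, pingable(dut), sshable(dut)))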
 
The other DUT was repaired:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row3-rack9-host2/1020202-repair/20172508041837

viceroy won't show you repairs that fall outside the window in which HWTest for that builder was running. The lab takes care of repairing DUTs that fail critical tasks, but that happens out-of-band of the builder's request to run the test.
The history of that DUT shows that it was doing just fine before R62-9877 was installed on it, and then it started flaking.
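
To make that visibility point concrete, here's a toy sketch (not viceroy's actual implementation) of the filtering behaviour: only repair events whose timestamps fall inside the builder's HWTest window show up, so the out-of-band lab repair never appears. The timestamps and field names below are invented for illustration:

# Toy illustration of time-window filtering -- not viceroy code.
from datetime import datetime

def visible_repairs(repairs, hwtest_start, hwtest_end):
    # Keep only repair events whose timestamp falls inside the HWTest window.
    return [r for r in repairs if hwtest_start <= r['when'] <= hwtest_end]

repairs = [
    # Out-of-band lab repair, before the HWTest run started (hypothetical times).
    {'host': 'chromeos4-row3-rack9-host2', 'when': datetime(2017, 8, 25, 4, 18)},
    # A repair that happened during the HWTest run (hypothetical DUT name).
    {'host': 'some-other-dut', 'when': datetime(2017, 8, 25, 6, 30)},
]
window = (datetime(2017, 8, 25, 6, 0), datetime(2017, 8, 25, 8, 0))
print(visible_repairs(repairs, *window))  # only the in-window repair is listed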

I'd say the build is bad: http://chromeos-server56.hot.corp.google.com/afe/#tab_id=view_host&object_id=1561
Actually #2 is wrong -- the provision that tried to get R62-9877 on the DUT died before ever installing the new build. So the DUT was unreachable with the old image, where it had already run a bunch of jobs.

The only job that has succeeded since 2:00 AM this morning,
http://chromeos-server56.hot.corp.google.com/afe/#tab_id=view_job&object_id=137204771
is an autoupdate job, so it didn't need provision. Of course, this job itself runs autoupdate, which is similar to the provision flow.

Funky.

Cc: jrbarnette@chromium.org
What happens later is also expected.

The one provision failed because we couldn't SSH into the DUT at the time (which can happen as a side effect of network congestion).

The next autoupdate test did not need provision and ran properly.

We see a bunch of SSH timeouts there as well, but autoupdate succeeded.
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/137204771-chromeos-test/chromeos4-row3-rack9-host2/autoupdate_logs/

This bolsters the theory of network congestion.
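
As a rough way to sanity-check the congestion theory, something like the sketch below probes the DUT over SSH a few times and distinguishes intermittent failures (what congestion looks like) from a host that never answers. The retry count, delay, and timeout are assumptions, not values from any builder or lab config:

# Illustrative SSH retry probe -- attempts/timeouts are assumed, not from config.
import subprocess
import time

def ssh_ok(host, timeout=10):
    # Single SSH attempt; True if 'true' runs successfully on the host.
    return subprocess.call(
        ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=%d' % timeout,
         host, 'true'],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def probe(host, attempts=5, delay=30):
    results = []
    for i in range(attempts):
        ok = ssh_ok(host)
        results.append(ok)
        print('attempt %d: %s' % (i + 1, 'ok' if ok else 'timed out'))
        if i + 1 < attempts:
            time.sleep(delay)
    if any(results) and not all(results):
        print('intermittent -- consistent with network congestion')
    elif not any(results):
        print('never reachable -- looks like a dead or wedged DUT')
    else:
        print('fully reachable')

probe('chromeos4-row3-rack9-host2')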

What I can't explain is why the immediately following reset job failed, claiming that the last provision failed: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row3-rack9-host2/1020287-reset/20172508050255/debug/

The autoupdate test had succeeded in the stateful update, so that local "dirty" file should have been cleared.
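
For reference, here's a hypothetical sketch of the "dirty"-marker pattern being described: the provision step drops a marker on the DUT before touching it and removes it only after the update completes, so a later reset job that still finds the marker concludes the last provision failed. The path and helper names are invented for illustration; they are not the actual autotest ones:

# Hypothetical marker-file pattern -- path and names are invented, not autotest's.
import subprocess

MARKER = '/var/tmp/provision_in_progress'  # made-up path

def run_on_dut(host, cmd):
    # Run a shell command on the DUT over SSH; return the exit status.
    return subprocess.call(['ssh', '-o', 'BatchMode=yes', host, cmd])

def provision(host, do_update):
    # Mark the DUT dirty, run the update, and clear the marker only on success.
    run_on_dut(host, 'touch %s' % MARKER)
    do_update(host)                       # if this raises, the marker stays behind
    run_on_dut(host, 'rm -f %s' % MARKER)

def reset_precheck(host):
    # What a reset-style job might do: bail out if the marker is still present.
    if run_on_dut(host, 'test -e %s' % MARKER) == 0:
        raise RuntimeError('last provision appears to have failed (marker present)')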
Status: WontFix (was: Assigned)
In any case, from my analysis the canary failure itself is truly a network flake.
