ninja-release:1437 failed: ssh timed out: connect to host
Issue description

ninja-release:1437 failed

Builders failed on:
- ninja-release: https://luci-milo.appspot.com/buildbot/chromeos/ninja-release/1437

ssh: connect to host chromeos4-row3-rack9-host2 port 22: Connection timed out

It looks like a couple of DUTs couldn't be reached (neither ping nor SSH). If I'm reading this right, they were also unreachable for the previous test runs (on a different release branch). It's not clear to me that we even attempted repair properly, though; viceroy tells me [1] one of the DUTs repaired, but the other didn't even try before aborting the canary with "infrastructure issues". Am I reading this wrong, or is this strange behavior?

[1] https://viceroy.corp.google.com/chromeos/suite_details?job_id=137190644
,
Aug 25 2017
The history of that DUT shows that it was doing just fine before R62-9877 was installed on it, and then started flaking. I'd say the build is bad: http://chromeos-server56.hot.corp.google.com/afe/#tab_id=view_host&object_id=1561
,
Aug 25 2017
Actually #2 is wrong -- the provision that tried to get R62-9877 on the DUT died before ever installing the new build. So the DUT was unreachable with the old image, where it had already run a bunch of jobs. The only job that has succeeded since 2:00 AM this morning: http://chromeos-server56.hot.corp.google.com/afe/#tab_id=view_job&object_id=137204771 is an autoupdate job, so it didn't need provision. Of course this job itself runs autoupdate, which is similar to the provision flow. Funky.
,
Aug 25 2017
What happens later is also expected. The one provision failed because we couldn't SSH into the DUT at the time (this can happen as a side effect of network congestion). The next autoupdate test did not need provision, and ran properly. We see a bunch of SSH timeouts there as well, but autoupdate succeeded: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/137204771-chromeos-test/chromeos4-row3-rack9-host2/autoupdate_logs/ This bolsters the theory of network congestion.

What I can't explain is why the immediately following reset job failed, claiming that the last provision failed: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row3-rack9-host2/1020287-reset/20172508050255/debug/ The autoupdate test had succeeded in stateful update, so that local "dirty" file should have been cleared.
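
For anyone who wants to tell a congested network apart from a DUT that is really down, here is a minimal sketch that probes the SSH port several times and labels the result. It is not part of autotest: `probe_ssh_port` and `classify_reachability` are hypothetical helper names, and it only checks that a TCP connection to port 22 can be opened, not that an SSH login works.

```python
import socket
import time


def probe_ssh_port(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Hypothetical helper: this checks TCP reachability only, not SSH auth.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, DNS failures
        return False


def classify_reachability(host, attempts=5, delay=0.0, probe=probe_ssh_port):
    """Probe the host repeatedly and return 'up', 'flaky', or 'down'.

    'flaky' (some probes pass, some fail) is the pattern one would expect
    from network congestion; 'down' means every probe failed.
    """
    successes = 0
    for i in range(attempts):
        if probe(host):
            successes += 1
        if delay and i < attempts - 1:
            time.sleep(delay)  # space the probes out between attempts
    if successes == attempts:
        return "up"
    if successes == 0:
        return "down"
    return "flaky"
```

Running `classify_reachability("chromeos4-row3-rack9-host2", attempts=10, delay=2.0)` from a host inside the lab network would give a rough signal; the `probe` parameter is injectable so the classification logic can be exercised without touching the network.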
,
Aug 25 2017
In any case, the canary failure itself is truly a network flake from my analysis.
Comment 1 by pprabhu@chromium.org
, Aug 25 2017