Bad devserver push causing provision failures |
||||||||||
Issue descriptionSee, for example, https://luci-milo.appspot.com/buildbot/chromeos/stumpy-paladin/29009 (from master-paladin https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15423) The error seems to be a servo initialization timeout? 07/17 11:17:37.718 ERROR| repair:0332| Failed: servod service is taking calls Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host self.verify(host) File "/usr/local/autotest/server/hosts/servo_repair.py", line 211, in verify host.connect_servo() File "/usr/local/autotest/server/hosts/servo_host.py", line 138, in connect_servo 'Servo initialize timed out.') AutoservVerifyError: Servo initialize timed out.
,
Jul 17 2017
,
Jul 17 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/bff69708eeca1bcbae16c7c5bc80765bfbd28e16 commit bff69708eeca1bcbae16c7c5bc80765bfbd28e16 Author: Richard Barnette <jrbarnette@chromium.org> Date: Mon Jul 17 21:41:07 2017 Make the default `payload_filename` be None. This is a speculative fix to address ongoing failures in the lab. It appears that the standard provisioning flow is passing a "payload_filename" parameter that's an empty string instead of None. That's causing the AU EndToEnd test flow to trigger, causing failures. TEST=None BUG= chromium:744959 Change-Id: I142aa0c24b1a2305c51fae0a78762e0c92546792 Reviewed-on: https://chromium-review.googlesource.com/575429 Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/bff69708eeca1bcbae16c7c5bc80765bfbd28e16/cros_update.py
,
Jul 17 2017
oops, beat me, https://chromium-review.googlesource.com/c/575431/, I just leave it there since any fix is ok.
,
Jul 17 2017
The fix change has been pushed to the devservers. A consequence of the bug was that any DUT without a reliable servo failed repair: Provision failure requires a successful AU or servo/USB reinstall, and AU would fail in this case. After the devserver update, failed devices have been coming back online (meaning, they've been successfully running AU) which suggests that the problem is now fixed. I'm holding this open until we see a green CQ run, but I think that the outage is over.
,
Jul 17 2017
The fix also keeps au_endtoend happy. woot
,
Jul 18 2017
The latest CQ run is green. I'm closing this. https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15425
,
Jul 18 2017
Can I open the tree now that this is fixed?
,
Jul 18 2017
15426 failed. It's still provision failure. But the error was different from #0. https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15426 https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/2314 https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/128886661-chromeos-test/ 07/17 19:47:26.090 DEBUG| dev_server:1873| Current CrOS auto-update status: transfer rootfs update package 07/17 19:47:36.139 DEBUG| utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/get_au_status?full_update=False&force_update=True&pid=22819&build_name=winky-paladin/R61-9754.0.0-rc1&host_name=chromeos4-row3-rack12-host11&clobber_stateful=True"'' 07/17 19:47:37.818 DEBUG| dev_server:1938| Failed to trigger auto-update process on devserver 07/17 19:47:37.819 DEBUG| utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/collect_cros_au_log?pid=22819&host_name=chromeos4-row3-rack12-host11"'' 07/17 19:47:39.417 DEBUG| dev_server:1753| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row3-rack12-host11/654818-provision/20171707194011/autoupdate_logs/CrOS_update_chromeos4-row3-rack12-host11_22819.log 07/17 19:47:39.424 DEBUG| utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/handler_cleanup?pid=22819&host_name=chromeos4-row3-rack12-host11"'' 07/17 19:47:40.963 DEBUG| utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/kill_au_proc?pid=22819&host_name=chromeos4-row3-rack12-host11"'' 07/17 19:47:42.563 DEBUG| dev_server:2149| Exception raised on auto_update attempt #1: Traceback (most recent call last): File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 259, in TriggerAU self._RootfsUpdate(chromeos_AU) File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 166, in _RootfsUpdate cros_updater.TransferRootfsUpdate() File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 1138, in TransferRootfsUpdate delay_sec=DELAY_SEC_FOR_RETRY) File "/home/chromeos-test/chromiumos/chromite/lib/retry_util.py", line 271, in RetryException return _run() File "/home/chromeos-test/chromiumos/chromite/lib/retry_util.py", line 176, in _Wrapper ret = func(*args, **kwargs) File "/home/chromeos-test/chromiumos/chromite/lib/retry_util.py", line 270, in _run return functor(*args, **kwargs) File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 545, in TransferRootfsUpdate self.device.RunCommand(['mv', current_path, expected_path]) File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 1098, in RunCommand return super(ChromiumOSDevice, self).RunCommand(cmd, **kwargs) File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 942, in RunCommand return self.BaseRunCommand(cmd, **kwargs) File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 889, in BaseRunCommand return self.GetAgent().RemoteSh(cmd, **kwargs) File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 337, in RemoteSh return cros_build_lib.RunCommand(ssh_cmd, **kwargs) File "/home/chromeos-test/chromiumos/chromite/lib/cros_build_lib.py", line 645, in RunCommand raise RunCommandError(msg, cmd_result) RunCommandError: return code: 1; command: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- mv /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts. Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts. mv: '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' and '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' are the same file cmd=['ssh', '-p', '22', '-oConnectionAttempts=4', '-oUserKnownHostsFile=/dev/null', '-oProtocol=2', '-oConnectTimeout=30', '-oServerAliveCountMax=3', '-oStrictHostKeyChecking=no', '-oServerAliveInterval=10', '-oNumberOfPasswordPrompts=0', '-oIdentitiesOnly=yes', '-i', '/tmp/ssh-tmp6P4Ire/testing_rsa', 'root@chromeos4-row3-rack12-host11', '--', 'mv', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz'], extra env={'LC_MESSAGES': 'C'} 07/17 19:47:42.565 DEBUG| dev_server:2155| Please see error details in log /usr/local/autotest/results/hosts/chromeos4-row3-rack12-host11/654818-provision/20171707194011/autoupdate_logs/CrOS_update_chromeos4-row3-rack12-host11_22819.log
,
Jul 18 2017
I don't see errors in CrOS_update_chromeos4-row3-rack12-host11_22819.log.
2017/07/17 19:47:35.440 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- mv /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz
2017/07/17 19:47:35.772 DEBUG| retry_util:0203| <class 'chromite.lib.cros_build_lib.RunCommandError'>(return code: 1; command: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- mv /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz
Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts.
mv: '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' and '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' are the same file
cmd=['ssh', '-p', '22', '-oConnectionAttempts=4', '-oUserKnownHostsFile=/dev/null', '-oProtocol=2', '-oConnectTimeout=30', '-oServerAliveCountMax=3', '-oStrictHostKeyChecking=no', '-oServerAliveInterval=10', '-oNumberOfPasswordPrompts=0', '-oIdentitiesOnly=yes', '-i', '/tmp/ssh-tmp6P4Ire/testing_rsa', 'root@chromeos4-row3-rack12-host11', '--', 'mv', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz'], extra env={'LC_MESSAGES': 'C'})
2017/07/17 19:47:35.773 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- rm -rf /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs
2017/07/17 19:47:36.848 DEBUG| cros_update:0266| Error happens in CrOS auto-update: RunCommandError("cmd=['ssh', '-p', '22', '-oConnectionAttempts=4', '-oUserKnownHostsFile=/dev/null', '-oProtocol=2', '-oConnectTimeout=30', '-oServerAliveCountMax=3', '-oStrictHostKeyChecking=no', '-oServerAliveInterval=10', '-oNumberOfPasswordPrompts=0', '-oIdentitiesOnly=yes', '-i', '/tmp/ssh-tmp6P4Ire/testing_rsa', 'root@chromeos4-row3-rack12-host11', '--', 'mv', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz'], extra env={'LC_MESSAGES': 'C'}", <chromite.lib.cros_build_lib.CommandResult object at 0x7f710dbc9810>, None)
,
Jul 18 2017
Some paladins passed and some failed in 15426. The error is mv: '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' and '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' are the same file No idea how to debug further.
,
Jul 18 2017
The devserver used during that winky run is 100.115.219.135 (chromeos-test@chromeos2-devservertest) It does not have the fix that Richard chumped earlier. I just ssh'd into it and checked. The devserver has a different naming scheme *-devservertest so I think it is used for the CQ or used for testing devserver pushes before doing a full devserver push or something like that. Either way he must have forgot to do a devserver push to it.
,
Jul 18 2017
OK. I've changed the tree status to throttled. I don't know how to do a devserver push. This needs to wait for Richard.
,
Jul 18 2017
> The devserver has a different naming scheme *-devservertest so I think it is used for the CQ or used for testing devserver pushes before doing a full devserver push or something like that. It's not the naming scheme. The push is supposed to use the same source code list of servers as is used for selecting servers for provisioning. Moreover, the list uses IP addresses, not host names. <sigh> So, I can't explain why the server didn't update. My best guess would be that the list used on the server driving the update isn't the same as the list currently in prod... I don't know how that happened.
,
Jul 18 2017
Is this something easy to fix?
,
Jul 18 2017
> Is this something easy to fix? In terms of eliminating the current symptoms, I expect so. In terms of changing code and documentation so that it never happens again, not so much.
,
Jul 18 2017
When I pushed yesterday, I observed that some devservers failed to
update. I did a spot-check and concluded it was merely because some
entries in the devserver list were obsolete. However, there seems
to have been more to it, because the devserver in question now was
in the list, and it's also (obviously) very much in service.
The list of failures:
"description": "Failed to push 100.107.225.251",
"description": "Failed to push 100.107.225.252",
"description": "Failed to push 100.107.227.251",
"description": "Failed to push 100.107.227.252",
"description": "Failed to push 100.115.219.133",
"description": "Failed to push 100.115.219.135",
"description": "Failed to push 100.115.245.197",
"description": "Failed to push 172.22.39.161",
"description": "Failed to push 172.22.39.162",
"description": "Failed to push 172.22.39.163",
"description": "Failed to push 172.22.39.164",
I'm re-running the push, to see if it will behave better this
time.
I smell a post-mortem in my future.
,
Jul 18 2017
> > The devserver has a different naming scheme *-devservertest [ ... ]
Oh. For the record, the failing devserver has a standard name:
$ ssh 100.115.219.135 hostname
chromeos4-devserver7
,
Jul 18 2017
My logs from the push that failed don't say much about the cause for
failure. I think this is the trouble:
2017-07-17 14:48:31,247 INFO| Starting task DevserverPushOneTask (100.115.219.135) at 1500328111.25
2017-07-17 14:48:31,308 INFO| Unable to handle key of type ssh-ed25519
2017-07-17 14:48:31,311 ERRO| Exception: Error reading SSH protocol banner
I'm able to ssh in to the host now without trouble. But, I think
the client above was 'fabric'. That would suggest that there's a
server/client version problem, except that the same client (in the
form of the push script) passed on that same host earlier in the
day.
I'm waiting to see what happens the third time around.
,
Jul 18 2017
The problem child has been updated. However, some servers failed update
again:
"description": "Failed to push 100.107.225.251",
"description": "Failed to push 100.107.225.252",
"description": "Failed to push 100.107.227.251",
"description": "Failed to push 100.107.227.252",
"description": "Failed to push 172.22.39.161",
"description": "Failed to push 172.22.39.162",
"description": "Failed to push 172.22.39.163",
"description": "Failed to push 172.22.39.164",
I've confirmed that all eight of these are offline; they
don't answer 'ping'.
So, at this point all of the devservers _that matter_ should be
up-to-date.
I'm going to downgrade priority here on the theory that the
outage is over. I'm holding this open until I can prove my
assertion more satisfactorily.
,
Jul 18 2017
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15432 That run is green. Declaring victory. Again.
,
Jul 18 2017
Holding open so that we remember to file Chase-Pending bugs on mitigations to this outage.
,
Jul 18 2017
Pasting for posterity http://shortn/_8JRvAb2fDl we spiked in provision failure rate, possibly in a clean enough way to have made this an alertable signal.
,
Jul 19 2017
Re #23, my WIP omens dashboard would have told us about this. See http://shortn/_QyeMwrGrCs Also FYI, the latest devserver push failed on the same servers jrbarnette@ mentioned.
,
Jul 31 2017
Really fixed, and follow-up actions have been assigned.
,
Aug 1 2017
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by xixuan@chromium.org
, Jul 17 2017