New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 744959 link

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Bad devserver push causing provision failures

Project Member Reported by pho...@chromium.org, Jul 17 2017

Issue description

See, for example, https://luci-milo.appspot.com/buildbot/chromeos/stumpy-paladin/29009 (from master-paladin https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15423)

The error seems to be a servo initialization timeout?

07/17 11:17:37.718 ERROR|            repair:0332| Failed: servod service is taking calls
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 211, in verify
    host.connect_servo()
  File "/usr/local/autotest/server/hosts/servo_host.py", line 138, in connect_servo
    'Servo initialize timed out.')
AutoservVerifyError: Servo initialize timed out.


 

Comment 1 by xixuan@chromium.org, Jul 17 2017

Cc: xixuan@chromium.org dhadd...@chromium.org
It's a bad change in chromite, is_au_endtoendtest in lib/auto_updater.py shouldn't be true at this case.

Comment 2 by xixuan@chromium.org, Jul 17 2017

Owner: xixuan@chromium.org
Status: Started (was: Untriaged)
Project Member

Comment 3 by bugdroid1@chromium.org, Jul 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/bff69708eeca1bcbae16c7c5bc80765bfbd28e16

commit bff69708eeca1bcbae16c7c5bc80765bfbd28e16
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Mon Jul 17 21:41:07 2017

Make the default `payload_filename` be None.

This is a speculative fix to address ongoing failures in the lab.
It appears that the standard provisioning flow is passing a
"payload_filename" parameter that's an empty string instead of None.
That's causing the AU EndToEnd test flow to trigger, causing failures.

TEST=None
BUG= chromium:744959 

Change-Id: I142aa0c24b1a2305c51fae0a78762e0c92546792
Reviewed-on: https://chromium-review.googlesource.com/575429
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/bff69708eeca1bcbae16c7c5bc80765bfbd28e16/cros_update.py

Comment 4 by xixuan@chromium.org, Jul 17 2017

Owner: jrbarnette@chromium.org
oops, beat me, https://chromium-review.googlesource.com/c/575431/, I just leave it there since any fix is ok. 
The fix change has been pushed to the devservers.

A consequence of the bug was that any DUT without a reliable
servo failed repair:  Provision failure requires a successful
AU or servo/USB reinstall, and AU would fail in this case.

After the devserver update, failed devices have been coming
back online (meaning, they've been successfully running AU)
which suggests that the problem is now fixed.

I'm holding this open until we see a green CQ run, but I think
that the outage is over.

The fix also keeps au_endtoend happy. woot 
Status: Fixed (was: Started)
The latest CQ run is green. I'm closing this.
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15425
Can I open the tree now that this is fixed?
Status: Assigned (was: Fixed)
15426 failed. It's still provision failure. But the error was different from #0.

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15426
https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/2314
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/128886661-chromeos-test/

07/17 19:47:26.090 DEBUG|        dev_server:1873| Current CrOS auto-update status: transfer rootfs update package
07/17 19:47:36.139 DEBUG|             utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/get_au_status?full_update=False&force_update=True&pid=22819&build_name=winky-paladin/R61-9754.0.0-rc1&host_name=chromeos4-row3-rack12-host11&clobber_stateful=True"''
07/17 19:47:37.818 DEBUG|        dev_server:1938| Failed to trigger auto-update process on devserver
07/17 19:47:37.819 DEBUG|             utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/collect_cros_au_log?pid=22819&host_name=chromeos4-row3-rack12-host11"''
07/17 19:47:39.417 DEBUG|        dev_server:1753| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row3-rack12-host11/654818-provision/20171707194011/autoupdate_logs/CrOS_update_chromeos4-row3-rack12-host11_22819.log
07/17 19:47:39.424 DEBUG|             utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/handler_cleanup?pid=22819&host_name=chromeos4-row3-rack12-host11"''
07/17 19:47:40.963 DEBUG|             utils:0212| Running 'ssh 100.115.219.135 'curl "http://100.115.219.135:8082/kill_au_proc?pid=22819&host_name=chromeos4-row3-rack12-host11"''
07/17 19:47:42.563 DEBUG|        dev_server:2149| Exception raised on auto_update attempt #1:
 Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 259, in TriggerAU
    self._RootfsUpdate(chromeos_AU)
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 166, in _RootfsUpdate
    cros_updater.TransferRootfsUpdate()
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 1138, in TransferRootfsUpdate
    delay_sec=DELAY_SEC_FOR_RETRY)
  File "/home/chromeos-test/chromiumos/chromite/lib/retry_util.py", line 271, in RetryException
    return _run()
  File "/home/chromeos-test/chromiumos/chromite/lib/retry_util.py", line 176, in _Wrapper
    ret = func(*args, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/retry_util.py", line 270, in _run
    return functor(*args, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 545, in TransferRootfsUpdate
    self.device.RunCommand(['mv', current_path, expected_path])
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 1098, in RunCommand
    return super(ChromiumOSDevice, self).RunCommand(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 942, in RunCommand
    return self.BaseRunCommand(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 889, in BaseRunCommand
    return self.GetAgent().RemoteSh(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 337, in RemoteSh
    return cros_build_lib.RunCommand(ssh_cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/cros_build_lib.py", line 645, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- mv /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz
Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts.
mv: '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' and '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' are the same file

cmd=['ssh', '-p', '22', '-oConnectionAttempts=4', '-oUserKnownHostsFile=/dev/null', '-oProtocol=2', '-oConnectTimeout=30', '-oServerAliveCountMax=3', '-oStrictHostKeyChecking=no', '-oServerAliveInterval=10', '-oNumberOfPasswordPrompts=0', '-oIdentitiesOnly=yes', '-i', '/tmp/ssh-tmp6P4Ire/testing_rsa', 'root@chromeos4-row3-rack12-host11', '--', 'mv', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz'], extra env={'LC_MESSAGES': 'C'}
07/17 19:47:42.565 DEBUG|        dev_server:2155| Please see error details in log /usr/local/autotest/results/hosts/chromeos4-row3-rack12-host11/654818-provision/20171707194011/autoupdate_logs/CrOS_update_chromeos4-row3-rack12-host11_22819.log
I don't see errors in CrOS_update_chromeos4-row3-rack12-host11_22819.log.

2017/07/17 19:47:35.440 DEBUG|    cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- mv /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz
2017/07/17 19:47:35.772 DEBUG|        retry_util:0203| <class 'chromite.lib.cros_build_lib.RunCommandError'>(return code: 1; command: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- mv /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz
Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row3-rack12-host11,100.115.194.194' (ED25519) to the list of known hosts.
mv: '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' and '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' are the same file

cmd=['ssh', '-p', '22', '-oConnectionAttempts=4', '-oUserKnownHostsFile=/dev/null', '-oProtocol=2', '-oConnectTimeout=30', '-oServerAliveCountMax=3', '-oStrictHostKeyChecking=no', '-oServerAliveInterval=10', '-oNumberOfPasswordPrompts=0', '-oIdentitiesOnly=yes', '-i', '/tmp/ssh-tmp6P4Ire/testing_rsa', 'root@chromeos4-row3-rack12-host11', '--', 'mv', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz'], extra env={'LC_MESSAGES': 'C'})
2017/07/17 19:47:35.773 DEBUG|    cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp6P4Ire/testing_rsa root@chromeos4-row3-rack12-host11 -- rm -rf /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs
2017/07/17 19:47:36.848 DEBUG|       cros_update:0266| Error happens in CrOS auto-update: RunCommandError("cmd=['ssh', '-p', '22', '-oConnectionAttempts=4', '-oUserKnownHostsFile=/dev/null', '-oProtocol=2', '-oConnectTimeout=30', '-oServerAliveCountMax=3', '-oStrictHostKeyChecking=no', '-oServerAliveInterval=10', '-oNumberOfPasswordPrompts=0', '-oIdentitiesOnly=yes', '-i', '/tmp/ssh-tmp6P4Ire/testing_rsa', 'root@chromeos4-row3-rack12-host11', '--', 'mv', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz', '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz'], extra env={'LC_MESSAGES': 'C'}", <chromite.lib.cros_build_lib.CommandResult object at 0x7f710dbc9810>, None)
Some paladins passed and some failed in 15426. The error is 
mv: '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' and '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.FYAsPYGTqs/static/pregenerated/update.gz' are the same file

No idea how to debug further.
The devserver used during that winky run is 100.115.219.135 (chromeos-test@chromeos2-devservertest)

It does not have the fix that Richard chumped earlier.
I just ssh'd into it and checked.

The devserver has a different naming scheme *-devservertest so I think it is used for the CQ or used for testing devserver pushes before doing a full devserver push or something like that. 

Either way he must have forgot to do a devserver push to it. 
OK. I've changed the tree status to throttled.

I don't know how to do a devserver push. This needs to wait for Richard.
> The devserver has a different naming scheme *-devservertest so I think it is used for the CQ or used for testing devserver pushes before doing a full devserver push or something like that.

It's not the naming scheme.  The push is supposed to use the same source code
list of servers as is used for selecting servers for provisioning.  Moreover,
the list uses IP addresses, not host names. <sigh>

So, I can't explain why the server didn't update.  My best guess would
be that the list used on the server driving the update isn't the same
as the list currently in prod...  I don't know how that happened.

Is this something easy to fix?
> Is this something easy to fix?

In terms of eliminating the current symptoms, I expect so.

In terms of changing code and documentation so that it never
happens again, not so much.
When I pushed yesterday, I observed that some devservers failed to
update.  I did a spot-check and concluded it was merely because some
entries in the devserver list were obsolete.  However, there seems
to have been more to it, because the devserver in question now was
in the list, and it's also (obviously) very much in service.

The list of failures:
      "description": "Failed to push 100.107.225.251",
      "description": "Failed to push 100.107.225.252",
      "description": "Failed to push 100.107.227.251",
      "description": "Failed to push 100.107.227.252",
      "description": "Failed to push 100.115.219.133",
      "description": "Failed to push 100.115.219.135",
      "description": "Failed to push 100.115.245.197",
      "description": "Failed to push 172.22.39.161",
      "description": "Failed to push 172.22.39.162",
      "description": "Failed to push 172.22.39.163",
      "description": "Failed to push 172.22.39.164",

I'm re-running the push, to see if it will behave better this
time.

I smell a post-mortem in my future.

> > The devserver has a different naming scheme *-devservertest [ ... ]

Oh.  For the record, the failing devserver has a standard name:
    $ ssh 100.115.219.135 hostname
    chromeos4-devserver7

My logs from the push that failed don't say much about the cause for
failure.  I think this is the trouble:
    2017-07-17 14:48:31,247 INFO| Starting task DevserverPushOneTask (100.115.219.135) at 1500328111.25
    2017-07-17 14:48:31,308 INFO| Unable to handle key of type ssh-ed25519
    2017-07-17 14:48:31,311 ERRO| Exception: Error reading SSH protocol banner

I'm able to ssh in to the host now without trouble.  But, I think
the client above was 'fabric'.  That would suggest that there's a
server/client version problem, except that the same client (in the
form of the push script) passed on that same host earlier in the
day.

I'm waiting to see what happens the third time around.

Labels: -Pri-0 Pri-1
The problem child has been updated.  However, some servers failed update
again:
      "description": "Failed to push 100.107.225.251",
      "description": "Failed to push 100.107.225.252",
      "description": "Failed to push 100.107.227.251",
      "description": "Failed to push 100.107.227.252",
      "description": "Failed to push 172.22.39.161",
      "description": "Failed to push 172.22.39.162",
      "description": "Failed to push 172.22.39.163",
      "description": "Failed to push 172.22.39.164",

I've confirmed that all eight of these are offline; they
don't answer 'ping'.

So, at this point all of the devservers _that matter_ should be
up-to-date.

I'm going to downgrade priority here on the theory that the
outage is over.  I'm holding this open until I can prove my
assertion more satisfactorily.

Status: Fixed (was: Assigned)
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15432

That run is green.  Declaring victory.  Again.

Status: Available (was: Fixed)
Holding open so that we remember to file Chase-Pending bugs on mitigations to this outage.
Pasting for posterity http://shortn/_8JRvAb2fDl we spiked in provision failure rate, possibly in a clean enough way to have made this an alertable signal.
Re #23, my WIP omens dashboard would have told us about this. See http://shortn/_QyeMwrGrCs

Also FYI, the latest devserver push failed on the same servers jrbarnette@ mentioned.
Status: Fixed (was: Available)
Really fixed, and follow-up actions have been assigned.

Status: Verified (was: Fixed)

Sign in to add a comment