Unexplained provisioning failures on the CQ on multiple boards
Reported by jrbarnette@chromium.org, Jun 19 2017
Issue description
This CQ run failed to provision on three separate
boards:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15076
These are the boards that failed:
https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/1993
https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-paladin/16096
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1995
Looking at autoupdate_logs from the failures, the basic symptom was
that after copying the payload to the DUT and starting the devserver on
the DUT, update_engine failed to install the payload. The exact cause
isn't obvious from the logs.
Jun 19 2017
But the update_engine has not been uprev'ed yet. How could it cause the provisioning to fail?
Jun 19 2017
> But the update_engine has not been uprev'ed yet. How could it
> cause the provisioning to fail?
This was a CQ failure. Here's the sequence of events:
* Master paladin CQ run 15076 builds CL:538063, and installs it
on all CQ test devices.
* CQ run 15076 fails due to a separate, unrelated CL.
* Master paladin CQ run 15077 builds an image without the
update_engine change.
* Multiple devices running the previous build fail in
update_engine, causing the next CQ run to fail as well.
The 15076 CQ build looks like it was bad. We don't know exactly
why. However, given that the failure was in update_engine, one
of the things we have to consider is the update_engine CL.
Jun 19 2017
Assigned to the current deputy to keep an eye on it. We currently expect this to have gone poof!
Jun 20 2017
The same basic symptoms struck this CQ run:
https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15106
The update_engine CL was included in that run, and in the run
before it. Proving that the update_engine CL is still a viable
suspect requires further digging in the logs, but there's a
good chance that's what we'll find.
The CL submitter has marked the CL "-1 Verified", so we should
be safe, assuming that's really the CL at fault.
Jun 20 2017
I've dug through the logs, and confirmed that the build from
master-paladin run 15105 was installed on multiple boards,
and that the CQ run that followed it failed just like in the
original description. The good news is that it looks like
_this_ time, the lab recovered completely.
To recap the theory: The builds from CQ runs 15075 and 15105
were unable to update properly for the CQ runs that followed them.
Moreover, the failure was specific enough both times to suspect
that the root cause was a bad CL (not infrastructure). The single
CL in common between the two runs was this:
https://chromium-review.googlesource.com/#/c/538063/
That CL is a plausible suspect on its own, since the failure
involved update_engine, and that CL changes update_engine.
ahassani@ - Assuming you agree with the diagnosis, you're free
to close this bug. The bug isn't in the Chrome OS tree, so the
only action required is to fix the bugs, test(!), and then
re-submit to the CQ.
Jun 20 2017
This is good to know. From this bug discussion I'm assuming it's the provisioning of a device with a faulty update_engine (which included that CL) that is failing, not the actual AU test after the image has been installed. What is the backup plan for provisioning when update_engine fails? Something like cros flash? Because this can happen anytime. Last night the paladin builds started to fail due to ssh timeouts, and I thought that was due to the lab shutdown.

Anyway, can you point me to logs that show more specific details about this problem? Assuming that CL is the cause, what is the best way to test it so I can reproduce the error and potentially fix it ASAP?
Jun 20 2017
> [ ... ] From this bug discussion I'm assuming the provisioning of a device
> with faulty update engine (which included that CL) is failing, not the
> actual AUtest after the image has been installed. What is the backup plan
> for provisioning when the update_engine fails? like cros flash or
> something? Because this can happen anytime.
The failure occurred in provisioning, that is, trying to install a new
build on a device that already had the suspect update_engine change.
AU testing is mostly not involved. Note, however, that the test
"provision_AutoUpdate.double" did time out on at least two devices
that ran it. That test merely re-installs the current build.
Regarding recovery: There's a standardized automated repair flow
designed to help recover from these kinds of bugs. In the worst cases,
the system can use Servo to re-install from USB. However, for the
cases I looked at, the actual repair that was used was to powerwash
the failed device, and then use AU to install a known good software
image. So, whatever was wrong with update_engine, it wasn't as simple
as "it would never, ever work."
For initial debug, you can look at logs starting from either of the
CQ failures that followed the run with the suspect CL:
https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15106
https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15076
Multiple individual paladins showed failures:
https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/1993
https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-paladin/16096
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1995
https://luci-milo.appspot.com/buildbot/chromeos/kevin-paladin/1481
https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-paladin/16121
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/2943
https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/2018
You can click on individual failure links that start with
"[Test-Logs]: provision:".
Here's an example from the "peach_pit-paladin/16121" run:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124171384-chromeos-test/chromeos6-row2-rack11-host19/
In that results folder, you can find important logs here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124171384-chromeos-test/chromeos6-row2-rack11-host19/crashinfo.chromeos6-row2-rack11-host19/var/log/
Jun 20 2017
update_engine.20170619-193436 ends with the following:

[0619/193542:ERROR:utils.cc(88)] AddDBusError(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.NetworkProxyService was not provided by any .service files
[0619/193542:WARNING:chrome_browser_proxy_resolver.cc(121)] Failed to resolve proxy: The name org.chromium.NetworkProxyService was not provided by any .service files
[0619/193542:INFO:libcurl_http_fetcher.cc(146)] Starting/Resuming transfer
[0619/193542:INFO:libcurl_http_fetcher.cc(165)] Using proxy: no
[0619/193542:INFO:libcurl_http_fetcher.cc(286)] Not setting http(s) curl options because we are running a dev/test image
[0619/193621:INFO:delta_performer.cc(201)] Completed 65/822 operations (7%), 50620366/985585915 bytes downloaded (5%), overall progress 5%
[0619/193651:INFO:delta_performer.cc(201)] Completed 99/822 operations (12%), 79259598/985585915 bytes downloaded (8%), overall progress 10%
[0619/193705:INFO:libcurl_http_fetcher.cc(433)] HTTP response code: 206
[0619/193705:INFO:libcurl_http_fetcher.cc(490)] Transfer interrupted after downloading 93530335 of 985585915 bytes. 892055580 bytes remaining after 1 attempt(s)
[0619/193705:INFO:libcurl_http_fetcher.cc(502)] Restarting transfer to download the remaining bytes
[0619/193725:INFO:libcurl_http_fetcher.cc(146)] Starting/Resuming transfer
[0619/193725:INFO:libcurl_http_fetcher.cc(165)] Using proxy: no
[0619/193725:INFO:libcurl_http_fetcher.cc(286)] Not setting http(s) curl options because we are running a dev/test image
[0619/193725:ERROR:libcurl_http_fetcher.cc(436)] Unable to get http response code.
[0619/193725:INFO:libcurl_http_fetcher.cc(490)] Transfer interrupted after downloading 93530335 of 985585915 bytes. 892055580 bytes remaining after 2 attempt(s)
[0619/193725:INFO:libcurl_http_fetcher.cc(502)] Restarting transfer to download the remaining bytes

----

I suspect that the D-Bus error is unrelated to the later errors. The error suggests that the org.chromium.NetworkProxyService D-Bus service name, usually owned by Chrome, wasn't owned at the time that update_engine called one of its methods. This is weird and may indicate a problem like Chrome not running for some reason, but given that we fell back to not using a proxy and managed to download ~89 MB of the 940 MB payload, I don't think it caused the failure.

I don't know much about update_engine, but the two "Transfer interrupted" messages suggest to me that there was some sort of networking issue and/or a problem with the devserver (or whatever's serving the update).
Jun 20 2017
@jrbarnette: These failures appear to have happened around the same time that the test labs were closed yesterday, and all the errors indicate a timeout. Can you confirm whether or not that's the case?
Jun 21 2017
> [ ... ] These failures look to happen around the same time
> that the test labs were closed yesterday and all errors
> indicate a timeout. Can you clarify if that is not the case?
Timeout as such is unlikely. In any event, both times that
this happened, the lab was healthy. The problem is almost
certainly caused by software on the device, not external
conditions.
Regarding the update_engine messages, it'll help to understand
how provisioning is slightly different from regular updates.
Provisioning works similarly to `cros flash` (a rough sketch follows the list below):
* The full payload is copied into the stateful partition on
the device.
* A copy of the devserver is copied onto the device.
* The devserver is started on the device, configured to serve
the locally stored payload.
* update_engine_client is invoked with a command to update
using the locally running devserver.
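
For concreteness, here's a minimal sketch of that flow. This is not the real implementation (that lives in chromite's auto_updater code); the payload path and the start-local-devserver helper are made-up placeholders, and the DUT hostname, ssh key, and port are simply the ones that appear in the logs quoted later in this thread.

# Hypothetical sketch of the provisioning flow described above; not the
# real chromite code.  Paths, the devserver launch command, and the port
# are illustrative only.
import subprocess

DUT = 'root@chromeos6-row2-rack11-host19'     # example DUT from the logs below
SSH = ['ssh', '-i', '/tmp/ssh-tmpcwxBm5/testing_rsa', DUT, '--']
DEVSERVER_PORT = 46467                        # port seen in the failing run

def run_on_dut(*cmd):
    """Run a command on the DUT over ssh and return its stdout."""
    return subprocess.run(SSH + list(cmd), check=True,
                          capture_output=True, text=True).stdout

# 1. Copy the full payload into the stateful partition on the device.
subprocess.run(['scp', 'update.gz', DUT + ':/mnt/stateful_partition/'],
               check=True)

# 2./3. Copy a devserver onto the device and start it there, configured to
#       serve the locally stored payload (placeholder command).
run_on_dut('start-local-devserver', '--port=%d' % DEVSERVER_PORT)

# 4. Ask update_engine to install from the locally running devserver.
run_on_dut('update_engine_client', '-check_for_update',
           '-omaha_url=http://127.0.0.1:%d/update/pregenerated' % DEVSERVER_PORT)

# The caller then polls `update_engine_client --status` until the update
# finishes (UPDATE_STATUS_UPDATED_NEED_REBOOT) or fails; that polling is
# what produces the "Waiting for update..." lines in the logs below.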
I'm not sure what happens to chrome during the update, but it's
entirely possible that the provisioning process stops chrome.
A bit more about the symptoms of this bug: one of
the observed problems is that on nyan_kitty, something causes the
stateful file system to go into read-only mode. See bugs 734769
and 734764 for more details on what's known about that symptom.
Jun 21 2017
Regarding the overall flow of provisioning described in c#12,
the logs can be found here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124171384-chromeos-test/chromeos6-row2-rack11-host19/autoupdate_logs/
There were two failed attempts; the first attempt is in the
file named CrOS_update_chromeos6-row2-rack11-host19_13577.log.
The most important part of the log seems to be this:
2017/06/19 19:15:33.065 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpcwxBm5/testing_rsa root@chromeos6-row2-rack11-host19 -- update_engine_client -check_for_update '-omaha_url=http://127.0.0.1:46467/update/pregenerated'
Warning: Permanently added 'chromeos6-row2-rack11-host19,100.115.129.115' (RSA) to the list of known hosts.
[0619/191532:INFO:update_engine_client.cc(473)] Initiating update check and install.
2017/06/19 19:15:33.307 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpcwxBm5/testing_rsa root@chromeos6-row2-rack11-host19 -- update_engine_client --status
2017/06/19 19:15:33.571 DEBUG| cros_build_lib:0633| (stdout):
LAST_CHECKED_TIME=1497924932
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_CHECKING_FOR_UPDATE
NEW_VERSION=0.0.0.0
NEW_SIZE=0
2017/06/19 19:15:33.572 DEBUG| cros_build_lib:0635| (stderr):
Warning: Permanently added 'chromeos6-row2-rack11-host19,100.115.129.115' (RSA) to the list of known hosts.
[0619/191533:INFO:update_engine_client.cc(493)] Querying Update Engine status...
2017/06/19 19:15:33.572 INFO | auto_updater:0645| Waiting for update...status: UPDATE_STATUS_CHECKING_FOR_UPDATE at progress 0.000000
2017/06/19 19:15:43.583 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpcwxBm5/testing_rsa root@chromeos6-row2-rack11-host19 -- update_engine_client --status
2017/06/19 19:15:43.835 DEBUG| cros_build_lib:0633| (stdout):
LAST_CHECKED_TIME=1497924932
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_CHECKING_FOR_UPDATE
NEW_VERSION=0.0.0.0
NEW_SIZE=0
2017/06/19 19:15:43.835 DEBUG| cros_build_lib:0635| (stderr):
Warning: Permanently added 'chromeos6-row2-rack11-host19,100.115.129.115' (RSA) to the list of known hosts.
[0619/191543:INFO:update_engine_client.cc(493)] Querying Update Engine status...
2017/06/19 19:15:43.835 INFO | auto_updater:0645| Waiting for update...status: UPDATE_STATUS_CHECKING_FOR_UPDATE at progress 0.000000
2017/06/19 19:15:53.847 DEBUG| cros_build_lib:0584| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpcwxBm5/testing_rsa root@chromeos6-row2-rack11-host19 -- update_engine_client --status
2017/06/19 19:16:19.117 DEBUG| cros_build_lib:0635| (stderr):
Warning: Permanently added 'chromeos6-row2-rack11-host19,100.115.129.115' (RSA) to the list of known hosts.
[0619/191553:INFO:update_engine_client.cc(493)] Querying Update Engine status...
[0619/191618:ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.UpdateEngineInterface.GetStatus: object_path= /org/chromium/UpdateEngine: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0619/191618:ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.NoReply, Message=Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0619/191618:ERROR:update_engine_client.cc(495)] Failed to query status
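
For context, the "Waiting for update...status" lines above come from a polling loop in the provisioning code (chromite's auto_updater). A rough, simplified sketch of that loop, using the host and key paths from the log as illustrative values, might look like this:

# Minimal sketch (not the real chromite auto_updater code) of the status
# polling that produces the "Waiting for update...status" lines above.
import subprocess
import time

SSH = ['ssh', '-i', '/tmp/ssh-tmpcwxBm5/testing_rsa',
       'root@chromeos6-row2-rack11-host19', '--']

def query_status():
    """Run `update_engine_client --status` on the DUT and parse its key=value output."""
    out = subprocess.run(SSH + ['update_engine_client', '--status'],
                         capture_output=True, text=True)
    fields = dict(line.split('=', 1) for line in out.stdout.splitlines()
                  if '=' in line)
    return fields.get('CURRENT_OP', 'UNKNOWN'), fields.get('PROGRESS', '0')

# Poll roughly every 10 seconds, as the timestamps in the log above show.
while True:
    op, progress = query_status()
    print('Waiting for update...status: %s at progress %s' % (op, progress))
    if op == 'UPDATE_STATUS_UPDATED_NEED_REBOOT':
        break
    time.sleep(10)

Note also the timing of the failure above: the status query issued at 19:15:53 got its D-Bus NoReply error at 19:16:18, about 25 seconds later, which matches the usual default D-Bus method-call reply timeout. That detail becomes relevant in the diagnosis further down.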
Jun 21 2017
OK. I tried everything I could to reproduce this error, but it just won't give up! :/

Looking at all the available logs, the only explanation I can come up with is that the local devserver on the DUT somehow failed or stopped serving the payload to update_engine. That would explain why update_engine is getting HTTP 206 (partial content). Unfortunately there is no log for that local devserver to see what happened. As far as I can tell, update_engine itself did not fail; it just failed to download the content from the local devserver.

I tried both trybot builds and the paladin build that failed (R61-9666.0.0-rc4), with both test lab devservers and a locally run devserver (as described in provision_AutoUpdate.control), but I cannot reproduce the error. I tried installing multiple times and into both partitions to make sure the presumably faulty update_engine is the one used during provisioning, but I could not make it fail. I was not able to run provision_AutoUpdate.double because it is not supposed to be run using test_that; does anyone know the correct way to run it? Is there any way I can replicate more precisely what happens during provisioning in the test labs?

If not, and if the infra folks are OK with it, I would like to send this CL to the CQ one more time to see if it fails again. I have run out of options to reproduce it. Are there any other suggestions or methods to solve this problem?

Regarding Chrome: it seems provisioning does 'stop ui', which I guess shuts down Chrome too. But that is not the cause of our problem.
Jun 22 2017
> [ ... ] If not and if Infra guys are ok, I would like to send
> this CL one more time to see if it fails again. [ ... ]
Ugh. No, the costs of seeing this fail in the CQ are too high,
and there's plenty of evidence that we should expect it to fail.
We need to find a way to reproduce it more cheaply.
> [ ... ] the only explanation that I can come up with is that the
> local dev server in the DUT somehow failed/stopped to serve the
> update engine. This is the reason update_engine is getting http
> 206 code (partial content). [ ... ]
But the devserver code hadn't changed, and the update_engine code had.
The devserver isn't a good candidate for the cause of the problem;
there's no other CL that was present for both CQ failures, and the
problem hasn't recurred under any other circumstances. This has to
have been caused by the change to update_engine:
https://www.brainyquote.com/quotes/quotes/a/arthurcona134512.html
Jun 22 2017
OK, thanks jrbarnette@ for not giving in to my request :P. I think I might have found the problem, but I have not proven it yet.

It seems everything goes back to this evil CL: https://chromium-review.googlesource.com/c/502369/ :-) which increased the provisioning time. The problem as I see it now is that 'update_engine_client --status' fails to get the status from update_engine with this error:

[0622/123317:INFO:update_engine_client.cc(493)] Querying Update Engine status...
[0622/123342:ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.UpdateEngineInterface.GetStatus: object_path= /org/chromium/UpdateEngine: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0622/123342:ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.NoReply, Message=Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0622/123342:ERROR:update_engine_client.cc(495)] Failed to query status

This error is in the devserver logs, which are not actually included in the paladin hwtest provisioning logs (for some reason? intended? bug?). When this failure happens, the devserver assumes the update has failed, starts to revert the update, kills the local devserver, etc. This causes the update to halt. In practice the devserver tries to do everything again by rebooting the device, etc., which will eventually work because update_engine checkpoints how much of the update it has done. But I'm assuming that when the test labs were off, it failed to restart the provisioning. That's a problem for another day, though.

So why did that error happen? senj@ and I speculate that since update_engine is single threaded and the writes are taking longer to finish, the D-Bus call from update_engine_client to update_engine times out. This error does not happen all the time, but when it happens everything gets messed up. I'm still about to prove this right or wrong by building trybots and trying them. But if this is the problem, we should either not do the synchronous writes when provisioning (cc'd @grundler), or I have to figure out a way to increase the D-Bus timeout (which doesn't seem trivial at first look).
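
To illustrate the speculated mechanism, here is a toy model in plain Python (not real D-Bus or update_engine code; the 30-second and 25-second figures are stand-ins): a daemon whose single message-loop thread is stuck in a long synchronous call cannot answer a status request, so the client gives up with a NoReply-style timeout.

# Toy model (not real D-Bus / update_engine code) of a single-threaded
# daemon whose message loop is blocked by a long synchronous operation.
import queue
import threading
import time

requests = queue.Queue()

def daemon_main_loop():
    """Stand-in for update_engine's single-threaded message loop."""
    while True:
        # Simulate a multi-second synchronous call (e.g. a slow discard
        # ioctl); while it's in flight, no incoming request gets serviced.
        time.sleep(30)
        while not requests.empty():
            requests.get().set()      # finally answer the pending GetStatus

def get_status(timeout=25):
    """Stand-in for update_engine_client's blocking GetStatus D-Bus call."""
    reply = threading.Event()
    requests.put(reply)
    if not reply.wait(timeout):       # ~25 s is the usual D-Bus reply timeout
        raise TimeoutError('org.freedesktop.DBus.Error.NoReply (simulated)')
    return 'UPDATE_STATUS_DOWNLOADING'

threading.Thread(target=daemon_main_loop, daemon=True).start()
get_status()   # times out after 25 s, much like the failures in the logs

Under that model, either making the blocking call shorter (as the fix discussed below does by preferring the fast discard) or raising the client-side timeout would make the NoReply go away.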
Jun 22 2017
But again, the synchronous writes may not be the problem, since they are merged now and working fine. Something else might be causing the D-Bus failures.
Jun 23 2017
I think we found the problem. I sent out this CL for review:
https://chromium-review.googlesource.com/c/547015/
I will do more testing and try to merge it Monday, since I don't want to break anything over the weekend.
Jun 23 2017
I wish I had been included in this bug a bit sooner, given that the "O_DSYNC" change was part of the discussion:
https://chromium-review.googlesource.com/502369
1) I'm adding gwendal since I had the impression that DISCARD should be really fast: the device only needs to mark the blocks as "unused" and drop the LBA mapping for that range. It doesn't need to actually touch flash or start erasing.
The bug is that DiscardPartitionTail() is using BLKSECDISCARD and not BLKDISCARD.
148 const vector<blkioctl_request> blkioctl_requests = {
149 {BLKSECDISCARD, "BLKSECDISCARD"},
150 {BLKDISCARD, "BLKDISCARD"},
151 #ifdef BLKZEROOUT
152 {BLKZEROOUT, "BLKZEROOUT"},
153 #endif
154 };
Swap lines 149 and 150 and this will go away.
2) I'd like to use DISCARD on the entire partition before we start writing to it. We could repurpose DiscardPartitionTail() into "DiscardPartition", but it seems like the update protocol includes a DISCARD op:
726 case InstallOperation::ZERO:
727 case InstallOperation::DISCARD:
728 op_result = PerformZeroOrDiscardOperation(op);
729 break;
But I can't see where that is actually used in the payload generator. I'll find Amin to get a better understanding of this.
Jun 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/aosp/platform/system/update_engine/+/8f08cfc84ed0e2dd3652fc03ac989056ebacbebf

commit 8f08cfc84ed0e2dd3652fc03ac989056ebacbebf
Author: Amin Hassani <ahassani@google.com>
Date: Mon Jun 26 22:28:35 2017

    Fix the long time ioctl problem.

    On some boards ioctl() takes a long time (several seconds) to perform discarding blocks for a few hundred MB (which might be expected). This causes the UE (which is single threaded) to block for ioctl for several seconds and prevents the message loop from responding to dbus requests. 'update_engine_client --status' relies on dbus signals to get the status of the update engine. This can cause 'cros flash' or provisionning of devices to fail due error returned from ssh invokation of update_engine_client.

    This CL fixes the issue by doing BLKDISCARD before BLKSECDISCARD. Discarding the end of partition were originally added for b/28744609.

    BUG=chromium:734731
    TEST=cros flash; test_that provision_AutoUpdate;

    Change-Id: I1f9b572f5155284b3aa399285630daee560573a0
    Reviewed-on: https://chromium-review.googlesource.com/547015
    Commit-Ready: Amin Hassani <ahassani@chromium.org>
    Tested-by: Amin Hassani <ahassani@chromium.org>
    Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
    Reviewed-by: Grant Grundler <grundler@chromium.org>
    Reviewed-by: Alex Deymo <deymo@google.com>

[modify] https://crrev.com/8f08cfc84ed0e2dd3652fc03ac989056ebacbebf/payload_consumer/delta_performer.cc
Jun 29 2017
Is this bug fixed?
Jun 29 2017
The code that caused the failures was never committed to the tree. So in that sense, "fixed". At some point, the code in question will have to be resubmitted (with fixes), so in that sense, "not fixed", but we should only hold bugs open for stuff that's in the tree.
Comment 1 by jrbarnette@chromium.org, Jun 19 2017
Components: Infra>Client>ChromeOS

It should be noted that because provisioning failed before applying the update, if there was a Chrome OS bug causing the problem, that bug was in the CQ run *before* the failure. That means this CQ run may have a bad CL:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15075
That run included one change to update_engine:
https://chromium-review.googlesource.com/#/c/538063/
That CL should be viewed with some suspicion.