update_engine crashes at boot
Reported by jrbarnette@chromium.org, Jun 23 2018
Issue description
I've just discovered a labstation running in the test lab for which 'update_engine' isn't running. System state indicated that the server started at boot time, and then shut down. Here's the tail end of the update_engine log file:

[0613/202506:INFO:common_service.cc(95)] Attempt update: app_version="" omaha_url="http://100.115.219.139:8082/update/guado_labstation-release/R69-10738.0.0" flags=0x0 interactive=yes RestrictDownload=no
[0613/202506:INFO:update_attempter.cc(778)] Refusing to do an interactive update with an update already in progress
[0613/202506:FATAL:utils.cc(27)] Check failed: error. Error object must be specified
/usr/lib64/libbase-core-395517.so(_ZN4base5debug10StackTraceC1Ev+0x13) [0x7f305764aaf3]

The full log file is attached.
This is a guado_labstation build, so it's possible that the problem doesn't affect production Chrome hardware. However, I think it's prudent to assess update_engine as guilty until proven innocent in this case.
In any event, this isn't the only time something like this has been seen. Bug 845620 describes a symptom that matches this failure.
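For anyone triaging other machines, here is a minimal sketch of how the signature in the log tail above could be detected: a "Refusing to do an interactive update" line followed directly by the FATAL check failure. The function name and the sample list are hypothetical, and the line format is inferred only from the lines quoted in this report.

```python
import re

# Matches prefixed update_engine log lines such as
#   [0613/202506:FATAL:utils.cc(27)] Check failed: error. ...
# The format is inferred from the log tail quoted in this report.
LOG_LINE = re.compile(
    r"^\[(?P<stamp>\d{4}/\d{6}):(?P<level>[A-Z]+):(?P<src>[^\]]+)\]\s*(?P<msg>.*)$"
)

REFUSED = "Refusing to do an interactive update with an update already in progress"
CHECK_FAILED = "Check failed: error. Error object must be specified"


def find_crash_signature(lines):
    """Return the timestamp of a FATAL that directly follows a refusal, else None."""
    refused_seen = False
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # stack-trace frames and other unprefixed lines
        if match["level"] == "INFO" and REFUSED in match["msg"]:
            refused_seen = True
        elif (match["level"] == "FATAL" and CHECK_FAILED in match["msg"]
              and refused_seen):
            return match["stamp"]
        else:
            refused_seen = False  # some other log line came in between
    return None


# Hypothetical sample built from the lines quoted in the report.
SAMPLE = [
    "[0613/202506:INFO:update_attempter.cc(778)] " + REFUSED,
    "[0613/202506:FATAL:utils.cc(27)] " + CHECK_FAILED,
    "/usr/lib64/libbase-core-395517.so(_ZN4base5debug10StackTraceC1Ev+0x13) [0x7f305764aaf3]",
]

if __name__ == "__main__":
    print(find_crash_signature(SAMPLE))
```

A match only says the log looks like this failure; it does not prove the build lacks the fix.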
,
Jun 23 2018
I think the error you see is fixed in: https://chromium-review.googlesource.com/c/aosp/platform/system/update_engine/+/1082979 and that fix was cherry-picked all the way back to stable. It was a while ago. If the version you are updating from is one that doesn't have this patch (like an older M67 stable), then you will hit this problem if you request an update check twice. Please let me know if that is not the case here.
,
Jun 23 2018
The failure is in this build:
CHROMEOS_RELEASE_BUILDER_PATH=guado_labstation-release/R68-10658.0.0
It sounds like maybe that build would be expected to have the bug...
,
Jun 23 2018
OK. I started update engine on the problem labstation. The service
stayed up, and after a little while, I was able to see this:
$ ssh chromeos6-row4-rack13-labstation2 update_engine_client -status
[0622/174233:INFO:update_engine_client.cc(508)] Querying Update Engine status...
LAST_CHECKED_TIME=1529714525
PROGRESS=0.290565
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=999999.0.0
NEW_SIZE=503693885
I'm feeling like maybe that would be expected, and that it points the way
to our workaround (namely, just restart update-engine if it's down).
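As a rough illustration of reading the status output above, the following sketch parses the KEY=VALUE lines into a dict and estimates how many bytes have been downloaded. The helper names are hypothetical; the field names are the ones shown in the output, and multiplying PROGRESS by NEW_SIZE is an assumption about what those two fields mean.

```python
def parse_status(output):
    """Parse KEY=VALUE lines, skipping log-prefixed lines that start with '['."""
    status = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:
            status[key] = value
    return status


def describe(status):
    """Return (current operation, estimated bytes downloaded so far)."""
    progress = float(status.get("PROGRESS", "0"))
    size = int(status.get("NEW_SIZE", "0"))
    # Assumption: PROGRESS is a fraction of NEW_SIZE.
    return status.get("CURRENT_OP", "UNKNOWN"), round(progress * size)


# The output quoted in the comment above.
SAMPLE_OUTPUT = """\
[0622/174233:INFO:update_engine_client.cc(508)] Querying Update Engine status...
LAST_CHECKED_TIME=1529714525
PROGRESS=0.290565
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=999999.0.0
NEW_SIZE=503693885
"""

if __name__ == "__main__":
    print(describe(parse_status(SAMPLE_OUTPUT)))
```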
,
Jun 25 2018
It only fails if you check for an update (as opposed to checking the update status) while an update is running. I'd expect R68-10658.0.0 to be one of the builds with that problem.
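Since a status query is safe while an update check is not, a caller could in principle ask for the status first and only request a check when the engine is idle. The sketch below is a hypothetical caller-side guard, not something taken from the lab code. It assumes `update_engine_client` is on PATH, that `-check_for_update` is accepted in the same single-dash form as the `-status` flag used in this thread, and that an idle engine reports `UPDATE_STATUS_IDLE`. The guard is also racy, since another process can start an update between the two calls.

```python
import subprocess


def run_client(flag):
    """Run update_engine_client with one flag and return its stdout."""
    result = subprocess.run(
        ["update_engine_client", flag],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def current_op(status_output):
    """Extract the CURRENT_OP value from status output, or None if absent."""
    for line in status_output.splitlines():
        if line.startswith("CURRENT_OP="):
            return line.split("=", 1)[1].strip()
    return None


def check_for_update_if_idle(run=run_client):
    """Request an update check only when the engine reports idle.

    Returns True if a check was requested, False if it was skipped.
    """
    # Assumption: UPDATE_STATUS_IDLE is what an idle engine reports.
    if current_op(run("-status")) != "UPDATE_STATUS_IDLE":
        return False
    # Assumption: flag spelling mirrors the -status form used in this thread.
    run("-check_for_update")
    return True
```

The `run` parameter exists only so the logic can be exercised without a real device.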
,
Jun 26 2018
I've filed a ticket (bug 856740) to update the lab to the latest build, noting the workaround above. If we see the problem go away after the update, we can declare victory.
,
Jun 26 2018
I think another aspect to look into is which code is sending a check for update while another update is in progress. I believe that normally should not happen. I know it normally should not cause any issues, but I'd rather not start calling 'start update_engine' on all lab devices for everything, because that could eventually hide a real problem where the UE is not running at some given time: we would just be restarting it without knowing why it wasn't running. At the very least, if we want to go with that approach, we should log a successful 'start update-engine' (meaning it was not running) as a warning so it can be looked into.
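To make the logging suggestion concrete, here is a minimal sketch of a repair step that starts the service only when it is down and records that fact as a warning. Everything here is hypothetical: the logger name, the function names, the job name `update-engine` (one of the two spellings used in this thread), and the assumption that the `status` command prints `start/running` for a live job.

```python
import logging
import subprocess

log = logging.getLogger("labstation_repair")  # hypothetical logger name


def run_job_command(verb, job="update-engine"):
    """Run e.g. 'status update-engine' and return (exit code, stdout)."""
    result = subprocess.run([verb, job], capture_output=True, text=True, check=False)
    return result.returncode, result.stdout


def ensure_update_engine(run=run_job_command):
    """Start update-engine only if it is down.

    Returns True when a start was attempted, so callers can count how
    often the service was found dead instead of silently hiding it.
    """
    _, out = run("status")
    # Assumption: a running job reports "start/running" in its status line.
    if "start/running" in out:
        return False
    code, _ = run("start")
    if code == 0:
        log.warning("update-engine was not running; 'start update-engine' succeeded")
    else:
        log.error("update-engine was not running and could not be started (exit %d)", code)
    return True
```

Returning a flag and logging at warning level keeps the restart visible, which is the concern raised above about masking an unexplained shutdown.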
,
Jun 26 2018
> I think another aspect of it to look into is in which code we are
> sending a check for update when another update is in progress. That
> normally should not happen I believe.
Normally, that shouldn't happen. In the case of labstations, independent processes can ask the labstation to perform the check at arbitrary times. Changing that behavior is hard, so I'd rather have an update_engine that won't fail when it happens.
> [ ... ] but I rather not to start calling 'start update_engine' on all
> lab devices [ ... ]
I'd rather not do that too. We need at least to do it manually on affected systems in order to get past the current problem. I'll update the ticket to reflect that we don't want a (permanent) code change.
,
Jul 16
Issue 853284 has been merged into this issue.
,
Jul 18
OK. I've gone and checked status for all the various labstations; the "fails at boot" symptom seems gone. I'm satisfied we know the root cause of this, and most (though alas, not quite all) labstations are updated to a safe version. So, let's declare victory.
Comment 1 by jrbarnette@chromium.org, Jun 23 2018
Status: Assigned (was: Untriaged)