
Issue 855792

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 18
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 856740




update_engine crashes at boot

Reported by jrbarnette@chromium.org, Jun 23 2018

Issue description

I've just discovered a labstation running in the test lab for which
'update_engine' isn't running.  System state indicated that the
server started at boot time, and then shut down.

Here's the tail end of the update_engine log file:

[0613/202506:INFO:common_service.cc(95)] Attempt update: app_version="" omaha_url="http://100.115.219.139:8082/update/guado_labstation-release/R69-10738.0.0" flags=0x0 interactive=yes RestrictDownload=no 
[0613/202506:INFO:update_attempter.cc(778)] Refusing to do an interactive update with an update already in progress
[0613/202506:FATAL:utils.cc(27)] Check failed: error. Error object must be specified
/usr/lib64/libbase-core-395517.so(_ZN4base5debug10StackTraceC1Ev+0x13) [0x7f305764aaf3]

The full log file is attached.

This is a guado_labstation build, so it's possible that the problem doesn't
affect production Chrome hardware.  However, I think it's prudent to assess
update_engine as guilty until proven innocent in this case.

In any event, this isn't the only time something like this has been seen.
Bug 845620 describes a symptom that matches this failure.
 
update_engine.20180613-202421 (7.4 KB)
Owner: ahass...@chromium.org
Status: Assigned (was: Untriaged)
ahassani@ - can you provide an initial assessment of what might have
happened here?  Is there an obvious explanation?

Cc: gu...@chromium.org
I think the error you see is fixed in:
https://chromium-review.googlesource.com/c/aosp/platform/system/update_engine/+/1082979

That fix was cherry-picked all the way back to stable a while ago. If the version you are updating from doesn't have this patch (like an older M67 stable), then you will hit this problem if you request an update check twice.

Please, let me know if that is not the case here.
The failure is in this build:
    CHROMEOS_RELEASE_BUILDER_PATH=guado_labstation-release/R68-10658.0.0

It sounds like maybe that build would be expected to have the bug...

OK.  I started update engine on the problem labstation.  The service
stayed up, and after a little while, I was able to see this:
    $ ssh chromeos6-row4-rack13-labstation2 update_engine_client -status
    [0622/174233:INFO:update_engine_client.cc(508)] Querying Update Engine status...
    LAST_CHECKED_TIME=1529714525
    PROGRESS=0.290565
    CURRENT_OP=UPDATE_STATUS_DOWNLOADING
    NEW_VERSION=999999.0.0
    NEW_SIZE=503693885

I'm feeling like maybe that would be expected, and that it points the way
to our workaround (namely, just restart update-engine if it's down).
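
For anyone scripting this check, the `-status` output above is a log line followed by plain KEY=VALUE lines, so it is easy to parse. The following is only a sketch: `parse_status` is a hypothetical helper name, and it assumes the output layout is exactly what the paste above shows.

```python
def parse_status(output):
    """Parse `update_engine_client -status` output into a dict.

    Assumes the KEY=VALUE layout shown in the paste above. Lines that
    are empty or start with "[" (the logging prefix) are skipped.
    """
    status = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:
            status[key] = value
    return status


# Sample copied from the status output quoted in this comment.
SAMPLE = """\
[0622/174233:INFO:update_engine_client.cc(508)] Querying Update Engine status...
LAST_CHECKED_TIME=1529714525
PROGRESS=0.290565
CURRENT_OP=UPDATE_STATUS_DOWNLOADING
NEW_VERSION=999999.0.0
NEW_SIZE=503693885
"""

status = parse_status(SAMPLE)
print(status["CURRENT_OP"])
```

All values are kept as strings; a caller that needs `PROGRESS` as a number would convert it explicitly.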

It only fails if you request an update check (not a status check) while an update is running. I guess R68-10658.0.0 is one of the builds with that problem.
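
A caller stuck on an affected build could reduce its exposure by querying status first and requesting an update check only when the engine reports idle. A minimal sketch, assuming the KEY=VALUE status output shown earlier in this thread: `safe_to_request_check` is a hypothetical helper, and treating `UPDATE_STATUS_IDLE` as the only safe state is an assumption rather than something confirmed here.

```python
IDLE = "UPDATE_STATUS_IDLE"  # assumed name of the idle state


def safe_to_request_check(status_output):
    """Return True only if CURRENT_OP reports the idle state."""
    for line in status_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "CURRENT_OP":
            return value == IDLE
    # No CURRENT_OP line found: be conservative and do not request a check.
    return False


print(safe_to_request_check("CURRENT_OP=UPDATE_STATUS_DOWNLOADING"))
```

This is inherently racy: another process can start an update between the status query and the check request, so it narrows the window rather than closing it.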


Blockedon: 856740
I've filed a ticket (as bug 856740) to update the lab to the latest,
noting the workaround above.  If we see the problem go away after the
update, we can declare victory.

I think another aspect to look into is which code is sending a check for update while another update is in progress. I believe that normally should not happen. I know it normally should not cause any issues, but I'd rather not start calling 'start update_engine' on all lab devices for everything, because that can eventually hide a potential problem where UE is not running at a given time. We would just be restarting it without knowing why it wasn't running.
At the very least, if we want to go with that approach, we should log a successful 'start update-engine' (meaning it was not running) as a warning so we can look into it.
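
The logging idea could look roughly like this. It is a sketch only: `ensure_update_engine_running` and its injectable `run` parameter are hypothetical, and it assumes that `start` exits non-zero when the job is already running, so a zero exit code means the service had been down.

```python
import logging
import subprocess


def ensure_update_engine_running(run=subprocess.call):
    """Try to start update-engine and warn if the start succeeds.

    Returns True if the service had to be started (it was down),
    False otherwise.
    """
    # Assumption: `start` returns non-zero if the job is already running.
    rc = run(["start", "update-engine"])
    if rc == 0:
        logging.warning(
            "update-engine was not running and had to be started")
        return True
    return False
```

Passing `run` as a parameter keeps the function testable without touching a real device; for example, `ensure_update_engine_running(run=lambda cmd: 0)` exercises the warning path.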


> I think another aspect to look into is which code is sending a check
> for update while another update is in progress. I believe that
> normally should not happen.

Normally, that shouldn't happen.  In the case of labstations, independent
processes can ask the labstation to perform the check at arbitrary times.
Changing that behavior is hard, so I'd rather have an update_engine that
won't fail when it happens.

> [ ... ] but I'd rather not start calling 'start update_engine' on all
> lab devices [ ... ]

I'd rather not do that either.  We need at least to do it manually on affected
systems in order to get past the current problem.

I'll update the ticket to reflect that we don't want a (permanent) code
change.

Issue 853284 has been merged into this issue.
Cc: englab-sys-cros@google.com
Status: Fixed (was: Assigned)
OK.  I've gone and checked status for all the various labstations;
the "fails at boot" symptom seems gone.  I'm satisfied we know the
root cause of this, and most (though alas, not quite all) labstations
are updated to a safe version.

So, let's declare victory.
