update_engine slowness |
||||||
Issue descriptionChrome Version: 60.0.3329.0/10359.0.0 OS: Chrome What steps will reproduce the problem? (1) Pass through OOBE while connected to ethernet What is the expected result? "Checking for updates" should be fast. What happens instead? "Checking for updates" takes ~20 seconds. Full logs attached.
,
Feb 1 2018
cat ~/tmp/debug-logs_20180201-125534 | tar tzvf - It seems to be tar.gz
,
Feb 1 2018
No idea why the suffix got lost. Sorry.
,
Feb 1 2018
Chrome waited for 22 seconds for update_engine to respond. $ egrep 'EULA_ACCEPTED|UPDATE|setting timezone' chrome/chrome_20180201-035351 [1545:1545:0201/035420.422121:VERBOSE1:wizard_controller.cc(1225)] Wizard screen exit code: EULA_ACCEPTED [1545:1545:0201/035423.551684:VERBOSE1:wizard_controller.cc(1584)] Resolve TimeZone: setting timezone to 'Europe/Berlin' [1545:1545:0201/125442.213673:VERBOSE1:wizard_controller.cc(1225)] Wizard screen exit code: UPDATE_NOUPDATE $ egrep 'Forced update|Processing Done' update_engine/update_engine.20180201-035400 [0201/035420:INFO:update_attempter.cc(768)] Forced update check requested. [0201/035420:INFO:chromeos_policy.cc(280)] Forced update signaled (interactive), allowing update check. [0201/035442:INFO:update_attempter.cc(873)] Processing Done. $ Assigning to update_engine owners.
,
Feb 1 2018
,
Feb 1 2018
So I look at the log, the 22 second delay is coming from these two things: - 10 seconds from running set-good-firmware. Remember that update_engine is single threaded, and we run this script synchronously. I'm not exactly sure the security consequences of continuing with the update while the firmware is not set as good is, but that is how it works at this point. The fact is that the update check is just coincided with running this script. - ~10 seconds from setting vpd and sending some UMA metrics. Setting vpd should not take more than 2-3 seconds, and same true for sending UMA metrics (even less since it goes through chrome). Probably there would be some delay on replying back to chrome to set 'no update'. But I don't think this is a bug, this WAI, but we may be able to move stuff around to reduce the delay. Let me know if other owners have any opinion.
,
Feb 1 2018
Ketaki, could you dig into this some more and work on how we can drive improvements?
,
Feb 1 2018
Why would it take 2-3 seconds to send UMA metrics? Also, that can 100% be made parallel.
,
Feb 2 2018
Re #8: I have not fully tested it figure out where exactly those delays for the second ~10 seconds coming from. It was based on my observation of the code. I don't think we want to change parallelism in update engine internals at this point. UE should be very lightweight and should've consume too much resources. If we wanted to spawn a new thread for each UMA update, that is kind of a not good. I think we have a message loop that gets checked out once in a while to reply to requests, but there are times that UE's thread is busy and cannot respond. We should probably prioritize doing operations better, but I don't think parallelizing them for this instance makes much sense. Although this is just my opinion :P
,
Feb 2 2018
Sure - whatever gets this down to a more sane level. 22 seconds is a bit excessive.
,
Feb 2 2018
Why do we need to block the update check on set-good-firmware?
,
Feb 2 2018
Re #11: You can't apply an update to the other slot if the current one is not mark as good. Otherwise you will be writing to the good slot. The set-good-firmware step could be delayed until after the update check (omaha request/response) but before start applying the payload... if that helps.
,
Feb 2 2018
> The set-good-firmware step could be delayed until after the update check (omaha request/response) but before start applying the payload... if that helps. Based on comment #6 that would save us 10 seconds -- which would help a lot.
,
Feb 22 2018
I just stumbled over UMA [1] which shows that the update check takes over 3 minutes in the 95% percentile! [1] https://uma.googleplex.com/p/chrome/histograms/?endDate=latest&dayCount=7&histograms=OOBE.StepCompletionTime.Update&fixupData=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
,
Feb 22 2018
Does those times include force updates too (which actually does the update)?
,
Feb 23 2018
I've been wondering about that as well. Forced updates might cause the bump around the 64s-79s bucket. I doubt that the other bump around 10s-12s is caused by forced updates because that would just be too fast. And the overflow bin is anyone's guess -- I'm trying to extend the range of that histogram right now.
,
Feb 23 2018
Thus my conclusion in comment #14 is probably not substantiated, however it seems fair to say that the update check takes 20 seconds in the 90% percentile.
,
Feb 23 2018
Then, this is consistent (20 seconds) with the intent of the origin of this bug. I have this in my TODO list, but there are few higher priority things going on. So this might take a while to get fixed. But feel free to send a patch if you feel like it :)
,
Feb 23 2018
Comment #17 is still wrong -- I need to subtract the second bump of (assumed) forced updates. Third attempt: 19 seconds in the 95th percentile.
,
Mar 30 2018
,
Jul 25
The following revision refers to this bug: https://chromium.googlesource.com/aosp/platform/system/update_engine/+/d3f4bea600867b4ee85074869d723ba070db0ae7 commit d3f4bea600867b4ee85074869d723ba070db0ae7 Author: Amin Hassani <ahassani@chromium.org> Date: Wed Jul 25 07:15:01 2018 update_engine: Pass Action ownership to ActionProcessor Currently, an object that uses an ActionProcessor for processing one or more actions has to own the Actions. This is problematic, because if we want to create an action on the fly and use an ActionProcessor to perform it, we have to own the Action until it is finished. Furthermore, if someone forget to own the action, there will be memory leaks because ActionProcessor does not delete the Action. This patch passes the ownership of the Actions to the ActionProcessor through unique pointers. If an object wants to have access to the Action, it can get it when ActionComplete() is called. BUG= chromium:807976 TEST=unittests TEST=cros flash TEST=precq Change-Id: I28f7e9fd3425f17cc51b4db4a4abc130a7d6ef8f Reviewed-on: https://chromium-review.googlesource.com/1065113 Commit-Ready: Amin Hassani <ahassani@chromium.org> Tested-by: Amin Hassani <ahassani@chromium.org> Reviewed-by: Xiaochu Liu <xiaochu@chromium.org> [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/common/action_processor.h [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/payload_consumer/filesystem_verifier_action_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/omaha_response_handler_action.h [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/common/action_processor.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/payload_consumer/download_action_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/update_attempter_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/omaha_request_action_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/omaha_response_handler_action.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/update_attempter.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/common/action_processor_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/common/mock_http_fetcher.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/common/mock_action_processor.h [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/omaha_response_handler_action_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/payload_consumer/postinstall_runner_action_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/omaha_request_action.h [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/update_attempter.h [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/common/action_unittest.cc [modify] https://crrev.com/d3f4bea600867b4ee85074869d723ba070db0ae7/common/mock_http_fetcher.h
,
Jul 26
The following revision refers to this bug: https://chromium.googlesource.com/aosp/platform/system/update_engine/+/0882a5198b58173044e8b4b28eb4d30b1bcfe230 commit 0882a5198b58173044e8b4b28eb4d30b1bcfe230 Author: Amin Hassani <ahassani@chromium.org> Date: Thu Jul 26 02:45:32 2018 update_engine: Pospone setgoodkernel after update check Currently, if an update is forced, we first run setgoodkernel and then continue with the update check. However, this delays the update check screen in the OOBE. This patch adds a new Action, UpdateBootFlagsAction which is configured to run AFTER an update check and response. However, now if the update check failed due for example the response was invalid, then setgoodkernel will not happen. But we already have a mechanism in SystemState where setgoodkernel is always called (currently) 45 seconds after the boot. So there would no worries. BUG= chromium:807976 TEST=unittests TEST=powerwashed the device, and the check for update passed faster on OOBE. Change-Id: I5845d7a4c1393aec568b6279dfbb7d3dfa31cdd8 Reviewed-on: https://chromium-review.googlesource.com/1011244 Commit-Ready: Amin Hassani <ahassani@chromium.org> Tested-by: Amin Hassani <ahassani@chromium.org> Reviewed-by: Sen Jiang <senj@chromium.org> [modify] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/real_system_state.h [modify] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/real_system_state.cc [modify] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/update_attempter_unittest.cc [add] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/update_boot_flags_action.h [add] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/update_boot_flags_action.cc [modify] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/Android.mk [add] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/update_boot_flags_action_unittest.cc [modify] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/update_engine.gyp [modify] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/update_attempter.h [modify] https://crrev.com/0882a5198b58173044e8b4b28eb4d30b1bcfe230/update_attempter.cc
,
Jul 31
The following revision refers to this bug: https://chromium.googlesource.com/aosp/platform/system/update_engine/+/667cf7b5b3d02cb38b786760c6c1764dfc758b85 commit 667cf7b5b3d02cb38b786760c6c1764dfc758b85 Author: Amin Hassani <ahassani@chromium.org> Date: Tue Jul 31 23:37:15 2018 update_engine: Reflect ActionProcessor and UpdateBootFlags changes in AOSP We changed the ActionProcessor interface and how we update boot flags in CL:1065113 and CL:1011244. This patch reflects those changes in update_attempter_android.* BUG= chromium:807976 TEST=unittests TEST=build and run in android Change-Id: Ib06b6c1560e58771e66288208a525130a5811360 Reviewed-on: https://chromium-review.googlesource.com/1150737 Commit-Ready: Amin Hassani <ahassani@chromium.org> Tested-by: Amin Hassani <ahassani@chromium.org> Reviewed-by: Sen Jiang <senj@chromium.org> [modify] https://crrev.com/667cf7b5b3d02cb38b786760c6c1764dfc758b85/update_attempter_android.h [modify] https://crrev.com/667cf7b5b3d02cb38b786760c6c1764dfc758b85/common/action_processor_unittest.cc [modify] https://crrev.com/667cf7b5b3d02cb38b786760c6c1764dfc758b85/Android.mk [modify] https://crrev.com/667cf7b5b3d02cb38b786760c6c1764dfc758b85/update_attempter_android.cc
,
Sep 14
I just checked it again, and it seems to just take only 2-3 seconds for an update check, so the previous trick of postponing update boot flags have worked :) |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by ahass...@chromium.org
, Feb 1 2018