Kevin: First time observing error while installing linux after powerwash or chromeos installation |
|||||||||||||||||
Issue descriptionChrome Version: 69.0.3486.0 Chrome OS Version: 10866.1.0 Chrome OS Platform: Kevin Network info: Wifi Please specify Cr-* of the system to which this bug/feature applies (add the label below). Steps To Reproduce: (1) Install fresh ChromeOS or powerwash device. (2) Login device. (3) Turn on Linux. (4) Observe behavior. Expected Result: -- Able to install without any issue Actual Result: -- Observing error as "Error installing Linux". -- By retry it is successfully installing How frequently does this problem reproduce? (Always, sometimes, hard to reproduce?) Almost every time. Feedback report:85548052165 This can be duplicated if the root cause as crbug.com/863190 .
,
Jul 17
From the logs it looks like we couldn't start vm_concierge: [1614:1614:0713/155659.477992:ERROR:object_proxy.cc(616)] Failed to call method: org.chromium.VmConcierge.StartVm: object_path= /org/chromium/VmConcierge: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
,
Jul 17
[1614:1614:0713/155800.329931:ERROR:object_proxy.cc(616)] Failed to call method: org.chromium.debugd.StartVmConcierge: object_path= /org/chromium/debugd: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
,
Jul 17
,
Jul 17
Not sure why debugd didn't reply in time for #3 but concierge is running just fine. Lxd took too long to start: 2018-07-13T22:56:29.747149+00:00 ERR vm_concierge[7371]: Failed to launch lxd: Deadline Exceeded 2018-07-13T22:56:29.747346+00:00 ERR vm_concierge[7371]: lxd waitready failed I wonder if we just need to make the deadlines longer. Everything seems to be much slower on kevin. For example, concierge receives the StartVM here: 2018-07-13T22:55:35.399950+00:00 INFO vm_concierge[7371]: Received StartVm request It takes almost 3 seconds for the VM to boot: 2018-07-13T22:55:38.153566+00:00 INFO VMBOOT(3)[7384]: server_builder.cc:247] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2147483647, CQ timeout (msec): 1000#015 And another 1.5 seconds to mount the btrfs disk: 2018-07-13T22:55:39.686709+00:00 INFO VMBOOT(3)[7384]: [ 3.531081] BTRFS info (device vdb): creating UUID tree#015 Then we start lxd: 2018-07-13T22:55:39.732681+00:00 INFO vm_concierge[7371]: Launching lxd inside VM 3 And lxd waitready times out after 50 seconds: 2018-07-13T22:56:29.747149+00:00 ERR vm_concierge[7371]: Failed to launch lxd: Deadline Exceeded 2018-07-13T22:56:29.747346+00:00 ERR vm_concierge[7371]: lxd waitready failed For the second successful attempt, the StartVM request comes in here: 2018-07-13T22:56:34.475634+00:00 INFO vm_concierge[7371]: Received StartVm request And the VM finishes all of its setup after almost a minute: 2018-07-13T22:57:26.518628+00:00 INFO vm_concierge[7371]: Started VM with pid 28 Again, I'm not sure why the reply isn't making its way back to chrome but that seems like a separate problem. I think the fix for this particular issue is to just make the lxd waitready timeout longer.
,
Jul 17
sgtm, we should work on that VM boot time on ARM for sure. But the LXD step might be hard to speed up as it's very IO heavy and some systems aren't great at IO.
,
Jul 17
Actually, it looks like the chrome code that calls concierge uses dbus::ObjectProxy::TIMEOUT_USE_DEFAULT (https://cs.chromium.org/chromium/src/chromeos/dbus/concierge_client.cc?q=StartVmRequest&sq=package:chromium&g=0&l=115) and the default dbus timeout is 25 seconds so I'm not surprised that the call timed out in #2.
,
Jul 17
yeah, we already made the timeouts longer on kevin for concierge but the UI timeout of 25 seconds is not sufficient. The other timeouts are much longer -- like 45 or 60 seconds.
,
Jul 17
considering the "thundering herd" prob we have on Chromebooks when you first log in, i would say go with "insanely" high timeouts in general (e.g. 5 minutes). if the system eventually works, then the timeouts wouldn't have saved us just broke us, and if the system is buggy and hits the timeout, the system is already buggy and the timeout is simply noticing that. i.e. the timeout imo should be considered an extreme upper bound to keep broken systems from hanging forever, not as a latency check.
,
Jul 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/c89eb42de241254609cf0b26c316a198c7d99ffc commit c89eb42de241254609cf0b26c316a198c7d99ffc Author: Chirantan Ekbote <chirantan@chromium.org> Date: Wed Jul 18 08:40:01 2018 vm_tools: Increase various timeouts On devices with slow I/O some operations can end up taking longer than our current timeouts allow. A timeout should represent a worst-case scenario rather than a tight bound on RPC latency. Increase timeouts across the board to deal with this. BUG= chromium:863630 TEST=start a new VM, container, do some operations, shut down the VM, and see that none of the operations fail due to a time out Change-Id: I25eaf598548489cbca15dc3b9137895599b61344 Signed-off-by: Chirantan Ekbote <chirantan@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1141338 Reviewed-by: Stephen Barber <smbarber@chromium.org> Reviewed-by: Dylan Reid <dgreid@chromium.org> [modify] https://crrev.com/c89eb42de241254609cf0b26c316a198c7d99ffc/vm_tools/concierge/virtual_machine.cc [modify] https://crrev.com/c89eb42de241254609cf0b26c316a198c7d99ffc/vm_tools/concierge/service.cc [modify] https://crrev.com/c89eb42de241254609cf0b26c316a198c7d99ffc/vm_tools/maitred/init.cc
,
Jul 18
,
Jul 18
This issue also observed in Dru device on 10888.0.0, 69.0.3494.0.
,
Jul 31
Observed this issue on 10895.10.0, 69.0.3497.21 on kevin device while uninstall and install. Feedback report: 85577977135
,
Jul 31
The fix for crbug.com/863630 which increases the UI timeout, needs to be backported to 69 still
,
Jul 31
Opening bug for merging into M69.
,
Jul 31
oops sorry I put the wrong bug into #14 I meant the fix for crbug.com/863190 still needs to be merged
,
Aug 2
Adding merge request label for M69, to approve merge.
,
Aug 3
This landed before branch on the platform side. The chrome side has now been merged to 69 as well.
,
Aug 6
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 9
The CL in this bug landed before branch, the CL in the associated bug has been merged into M69. Updating the Merge-Approved label to Merged. If this is not the correct status, please update accordingly.
,
Aug 30
Observed this issue on 10895.40.0, 69.0.3497.73 on kevin and bob devices.
,
Aug 30
So are the symptoms still that things fail the first time but work the second time? Can you provide logs / link to a feedback report?
,
Aug 30
Feedback report id: 85630150200 This issue is observed after auto update.
,
Aug 30
This is a different issue than the timeouts. `lxd waitready` is returning a non-zero exit status: 2018-08-30T14:52:44.569465-07:00 INFO VM(3)[5925]: lxd[171]: lvl=info msg="LXD 3.0.0 is starting in normal mode" path=/mnt/stateful/lxd t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569481-07:00 INFO VM(3)[5925]: lxd[171]: lvl=info msg="Kernel uid/gid map:" t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569487-07:00 INFO VM(3)[5925]: lxd[171]: lvl=info msg=" - u 0 0 4294967295" t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569492-07:00 INFO VM(3)[5925]: lxd[171]: lvl=info msg=" - g 0 0 4294967295" t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569496-07:00 INFO VM(3)[5925]: lxd[171]: lvl=info msg="Configured LXD uid/gid map:" t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569502-07:00 INFO VM(3)[5925]: lxd[171]: lvl=info msg=" - u 0 1000000 1000000000" t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569508-07:00 INFO VM(3)[5925]: lxd[171]: lvl=info msg=" - g 0 1000000 1000000000" t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569513-07:00 WARNING VM(3)[5925]: lxd[171]: lvl=warn msg="AppArmor support has been disabled because of lack of kernel support" t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569517-07:00 WARNING VM(3)[5925]: lxd[171]: lvl=warn msg="Couldn't find the CGroup network class controller, network limits will be ignored." t=2018-08-30T21:52:41+0000#012 2018-08-30T14:52:44.569522-07:00 WARNING VM(3)[5925]: lxd[171]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-30T21:52:41+0000#012 2018-08-30T14:53:30.898443-07:00 ERR vm_concierge[5927]: Failed to start Termina: lxd waitready returned non-zero smbarber@: do we need to merge https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1194603 back to 69 as well?
,
Aug 30
Just talked to smbarber and dgreid in person. We think this is likely still the RSA key generation step on first start. For 69 we're going to make the timeout 300 seconds and long term we will try to get rid of the key generation step since we never actually use the key.
,
Aug 30
Requesting merge for https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1194603 to the 69 branch.
,
Aug 30
,
Aug 30
This bug requires manual review: We are only 4 days from stable. Please contact the milestone owner if you have questions. Owners: amineer@(Android), kariahda@(iOS), cindyb@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 31
,
Aug 31
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/94a2cb76a67617b4e6e41a2723279cc8303f979b commit 94a2cb76a67617b4e6e41a2723279cc8303f979b Author: Stephen Barber <smbarber@chromium.org> Date: Fri Aug 31 21:09:38 2018 vm_tools: maitred: increase lxd waitready timeout Increase the lxd waitready timeout to 2 minutes. lxd's first run is slow due to RSA key generation, so we should set this value to be conservative for slower devices. BUG= chromium:863630 TEST=vm.StartTerminaVM no longer times out on lxd waitready for strago/gru Change-Id: Ic2d2becbba96b08ba7b646ec66ee633f4a8b73ba Reviewed-on: https://chromium-review.googlesource.com/1194603 Commit-Ready: Stephen Barber <smbarber@chromium.org> Tested-by: Stephen Barber <smbarber@chromium.org> Reviewed-by: Chirantan Ekbote <chirantan@chromium.org> Reviewed-by: Dylan Reid <dgreid@chromium.org> (cherry picked from commit 391a80e192f8a393d24697cd39e7f3b3222236ff) Reviewed-on: https://chromium-review.googlesource.com/1198064 Reviewed-by: Stephen Barber <smbarber@chromium.org> Commit-Queue: Chirantan Ekbote <chirantan@chromium.org> Tested-by: Chirantan Ekbote <chirantan@chromium.org> [modify] https://crrev.com/94a2cb76a67617b4e6e41a2723279cc8303f979b/vm_tools/maitred/service_impl.cc
,
Aug 31
,
Sep 3
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 8
Verified on 69.0.3497.120/10895.78.0 and 11137.0.0, 71.0.3567.0 |
|||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||
Comment 1 by tbuck...@chromium.org
, Jul 16Status: Assigned (was: Untriaged)