New issue
Advanced search Search tips

Issue 863630 link

Starred by 4 users

Issue metadata

Status: Verified
Owner:
Closed: Aug 31
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Kevin: First time observing error while installing linux after powerwash or chromeos installation

Project Member Reported by avkodipelli@chromium.org, Jul 13

Issue description

Chrome Version: 69.0.3486.0
Chrome OS Version: 10866.1.0
Chrome OS Platform: Kevin
Network info: Wifi

Please specify Cr-* of the system to which this bug/feature applies (add
the label below).

Steps To Reproduce:
(1) Install fresh ChromeOS or powerwash device.
(2) Login device.
(3) Turn on Linux.
(4) Observe behavior.

Expected Result:
-- Able to install without any issue

Actual Result:
-- Observing error as "Error installing Linux".
-- By retry it is successfully installing

How frequently does this problem reproduce? (Always, sometimes, hard to
reproduce?)
Almost every time. 

Feedback report:85548052165

This can be duplicated if the root cause as  crbug.com/863190 . 
 
Owner: nverne@chromium.org
Status: Assigned (was: Untriaged)
@nverne, anything useful in the logs?
https://listnr.corp.google.com/report/85548052165
From the logs it looks like we couldn't start vm_concierge:

[1614:1614:0713/155659.477992:ERROR:object_proxy.cc(616)] Failed to call method: org.chromium.VmConcierge.StartVm: object_path= /org/chromium/VmConcierge: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[1614:1614:0713/155800.329931:ERROR:object_proxy.cc(616)] Failed to call method: org.chromium.debugd.StartVmConcierge: object_path= /org/chromium/debugd: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Owner: chirantan@chromium.org
Not sure why debugd didn't reply in time for #3 but concierge is running just fine.  Lxd took too long to start: 

2018-07-13T22:56:29.747149+00:00 ERR vm_concierge[7371]: Failed to launch lxd: Deadline Exceeded
2018-07-13T22:56:29.747346+00:00 ERR vm_concierge[7371]: lxd waitready failed


I wonder if we just need to make the deadlines longer.  Everything seems to be much slower on kevin.  For example, concierge receives the StartVM here:

2018-07-13T22:55:35.399950+00:00 INFO vm_concierge[7371]: Received StartVm request


It takes almost 3 seconds for the VM to boot:

2018-07-13T22:55:38.153566+00:00 INFO VMBOOT(3)[7384]:  server_builder.cc:247]      Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2147483647, CQ timeout (msec): 1000#015


And another 1.5 seconds to mount the btrfs disk:

2018-07-13T22:55:39.686709+00:00 INFO VMBOOT(3)[7384]: [    3.531081] BTRFS info (device vdb): creating UUID tree#015


Then we start lxd:

2018-07-13T22:55:39.732681+00:00 INFO vm_concierge[7371]: Launching lxd inside VM 3


And lxd waitready times out after 50 seconds:

2018-07-13T22:56:29.747149+00:00 ERR vm_concierge[7371]: Failed to launch lxd: Deadline Exceeded
2018-07-13T22:56:29.747346+00:00 ERR vm_concierge[7371]: lxd waitready failed


For the second successful attempt, the StartVM request comes in here:

2018-07-13T22:56:34.475634+00:00 INFO vm_concierge[7371]: Received StartVm request


And the VM finishes all of its setup after almost a minute:

2018-07-13T22:57:26.518628+00:00 INFO vm_concierge[7371]: Started VM with pid 28



Again, I'm not sure why the reply isn't making its way back to chrome but that seems like a separate problem.  I think the fix for this particular issue is to just make the lxd waitready timeout longer.
sgtm, we should work on that VM boot time on ARM for sure. But the LXD step might be hard to speed up as it's very IO heavy and some systems aren't great at IO.
Actually, it looks like the chrome code that calls concierge uses dbus::ObjectProxy::TIMEOUT_USE_DEFAULT (https://cs.chromium.org/chromium/src/chromeos/dbus/concierge_client.cc?q=StartVmRequest&sq=package:chromium&g=0&l=115) and the default dbus timeout is 25 seconds so I'm not surprised that the call timed out in #2.
yeah, we already made the timeouts longer on kevin for concierge but the UI timeout of 25 seconds is not sufficient.  The other timeouts are much longer -- like 45 or 60 seconds.
considering the "thundering herd" prob we have on Chromebooks when you first log in, i would say go with "insanely" high timeouts in general (e.g. 5 minutes).  if the system eventually works, then the timeouts wouldn't have saved us just broke us, and if the system is buggy and hits the timeout, the system is already buggy and the timeout is simply noticing that.

i.e. the timeout imo should be considered an extreme upper bound to keep broken systems from hanging forever, not as a latency check.
Project Member

Comment 10 by bugdroid1@chromium.org, Jul 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/c89eb42de241254609cf0b26c316a198c7d99ffc

commit c89eb42de241254609cf0b26c316a198c7d99ffc
Author: Chirantan Ekbote <chirantan@chromium.org>
Date: Wed Jul 18 08:40:01 2018

vm_tools: Increase various timeouts

On devices with slow I/O some operations can end up taking longer than
our current timeouts allow.  A timeout should represent a worst-case
scenario rather than a tight bound on RPC latency.

Increase timeouts across the board to deal with this.

BUG= chromium:863630 
TEST=start a new VM, container, do some operations, shut down the VM,
     and see that none of the operations fail due to a time out

Change-Id: I25eaf598548489cbca15dc3b9137895599b61344
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1141338
Reviewed-by: Stephen Barber <smbarber@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/c89eb42de241254609cf0b26c316a198c7d99ffc/vm_tools/concierge/virtual_machine.cc
[modify] https://crrev.com/c89eb42de241254609cf0b26c316a198c7d99ffc/vm_tools/concierge/service.cc
[modify] https://crrev.com/c89eb42de241254609cf0b26c316a198c7d99ffc/vm_tools/maitred/init.cc

Status: Fixed (was: Assigned)
This issue also observed in Dru device on 10888.0.0, 69.0.3494.0.
Observed this issue on 10895.10.0, 69.0.3497.21 on kevin device while uninstall and install.
Feedback report: 85577977135
The fix for  crbug.com/863630  which increases the UI timeout, needs to be backported to 69 still
Status: Assigned (was: Fixed)
Opening bug for merging into M69.
oops sorry I put the wrong bug into #14 

I meant the fix for  crbug.com/863190  still needs to be merged
Labels: Merge-Approved-69
Adding merge request label for M69, to approve merge.
Status: Fixed (was: Assigned)
This landed before branch on the platform side. The chrome side has now been merged to 69 as well.
Project Member

Comment 19 by sheriffbot@chromium.org, Aug 6

Cc: cindyb@chromium.org
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Approved-69 merge-merged-3497
The CL in this bug landed before branch, the CL in the associated bug has been merged into M69. Updating the Merge-Approved label to Merged. If this is not the correct status, please update accordingly. 

Status: Assigned (was: Fixed)
Observed this issue on 10895.40.0, 69.0.3497.73 on kevin and bob devices.
So are the symptoms still that things fail the first time but work the second time?

Can you provide logs / link to a feedback report?
Feedback report id: 85630150200

This issue is observed after auto update.
Cc: smbar...@chromium.org
This is a different issue than the timeouts.  `lxd waitready` is returning a non-zero exit status:

2018-08-30T14:52:44.569465-07:00 INFO VM(3)[5925]:  lxd[171]: lvl=info msg="LXD 3.0.0 is starting in normal mode" path=/mnt/stateful/lxd t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569481-07:00 INFO VM(3)[5925]:  lxd[171]: lvl=info msg="Kernel uid/gid map:" t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569487-07:00 INFO VM(3)[5925]:  lxd[171]: lvl=info msg=" - u 0 0 4294967295" t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569492-07:00 INFO VM(3)[5925]:  lxd[171]: lvl=info msg=" - g 0 0 4294967295" t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569496-07:00 INFO VM(3)[5925]:  lxd[171]: lvl=info msg="Configured LXD uid/gid map:" t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569502-07:00 INFO VM(3)[5925]:  lxd[171]: lvl=info msg=" - u 0 1000000 1000000000" t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569508-07:00 INFO VM(3)[5925]:  lxd[171]: lvl=info msg=" - g 0 1000000 1000000000" t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569513-07:00 WARNING VM(3)[5925]:  lxd[171]: lvl=warn msg="AppArmor support has been disabled because of lack of kernel support" t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569517-07:00 WARNING VM(3)[5925]:  lxd[171]: lvl=warn msg="Couldn't find the CGroup network class controller, network limits will be ignored." t=2018-08-30T21:52:41+0000#012
2018-08-30T14:52:44.569522-07:00 WARNING VM(3)[5925]:  lxd[171]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-30T21:52:41+0000#012
2018-08-30T14:53:30.898443-07:00 ERR vm_concierge[5927]: Failed to start Termina: lxd waitready returned non-zero


smbarber@: do we need to merge https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1194603 back to 69 as well?
Just talked to smbarber and dgreid in person.  We think this is likely still the RSA key generation step on first start.  For 69 we're going to make the timeout 300 seconds and long term we will try to get rid of the key generation step since we never actually use the key.


Labels: -merge-merged-3497 Merge-Request-68
Requesting merge for https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1194603 to the 69 branch. 
Labels: -Merge-Request-68 Merge-Request-69
Project Member

Comment 28 by sheriffbot@chromium.org, Aug 30

Labels: -Merge-Request-69 Merge-Review-69 Hotlist-Merge-Review
This bug requires manual review: We are only 4 days from stable.
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), kariahda@(iOS), cindyb@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-69 Merge-Approved-69
Project Member

Comment 30 by bugdroid1@chromium.org, Aug 31

Labels: merge-merged-release-R69-10895.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/94a2cb76a67617b4e6e41a2723279cc8303f979b

commit 94a2cb76a67617b4e6e41a2723279cc8303f979b
Author: Stephen Barber <smbarber@chromium.org>
Date: Fri Aug 31 21:09:38 2018

vm_tools: maitred: increase lxd waitready timeout

Increase the lxd waitready timeout to 2 minutes. lxd's first run
is slow due to RSA key generation, so we should set this value to
be conservative for slower devices.

BUG= chromium:863630 
TEST=vm.StartTerminaVM no longer times out on lxd waitready for strago/gru

Change-Id: Ic2d2becbba96b08ba7b646ec66ee633f4a8b73ba
Reviewed-on: https://chromium-review.googlesource.com/1194603
Commit-Ready: Stephen Barber <smbarber@chromium.org>
Tested-by: Stephen Barber <smbarber@chromium.org>
Reviewed-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit 391a80e192f8a393d24697cd39e7f3b3222236ff)
Reviewed-on: https://chromium-review.googlesource.com/1198064
Reviewed-by: Stephen Barber <smbarber@chromium.org>
Commit-Queue: Chirantan Ekbote <chirantan@chromium.org>
Tested-by: Chirantan Ekbote <chirantan@chromium.org>

[modify] https://crrev.com/94a2cb76a67617b4e6e41a2723279cc8303f979b/vm_tools/maitred/service_impl.cc

Status: Fixed (was: Assigned)
Project Member

Comment 32 by sheriffbot@chromium.org, Sep 3

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Verified (was: Fixed)
Verified on 69.0.3497.120/10895.78.0 and 11137.0.0, 71.0.3567.0 

Sign in to add a comment