
Issue 884244


Issue metadata

Status: Verified
Owner:
Closed: Sep 19
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




SetUpUser RPC fails with "failed to run useradd: EOF" sometimes

Project Member Reported by jkardatzke@chromium.org, Sep 14

Issue description

In the tast-tests I've seen a few failures with the log lines below. This has been observed on kevin, eve, grunt, and wizpig, so it's not a slow-board problem or an architecture-specific issue.

Specific instances were observed on R71-11065.0.0, 11064, 11061, 11060, 11056 (and likely earlier). 

2018-09-14T08:36:00.832872-07:00 ERR vm_cicerone[26468]: Failed to set up user: failed to run useradd: EOF
2018-09-14T08:36:03.327256-07:00 INFO VM(4)[26435]:  lxd[172]: action=start created=2018-09-14T15:35:10+0000 ephemeral=false lvl=info msg="Starting container" name=penguin stateful=false t=2018-09-14T15:35:59+0000 used=1970-01-01T00:00:00+0000#012
2018-09-14T08:36:03.327266-07:00 INFO VM(4)[26435]:  lxd[172]: action=start created=2018-09-14T15:35:10+0000 ephemeral=false lvl=info msg="Started container" name=penguin stateful=false t=2018-09-14T15:36:00+0000 used=1970-01-01T00:00:00+0000#012
2018-09-14T08:36:03.327269-07:00 INFO VM(4)[26435]:  tremplin[202]: 2018/09/14 15:36:00 Received SetUpUser RPC: penguin (username testuser)#012
2018-09-14T08:36:03.327272-07:00 ERR VM(4)[26435]:  lxd[172]: lvl=eror msg="Failed to retrieve PID of executing child process: EOF" t=2018-09-14T15:36:01+0000#012
 
Labels: -Pri-2 M-71 Pri-1
Components: OS>Systems>Containers
Status: Started (was: Assigned)
The cause is the child process in lxc_attach calling shutdown() on the socketpair that the intermediate process needs to use. There's a race between the shutdown() in the intermediate process and the shutdown() in the child (target) process.
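
For anyone unfamiliar with why a shutdown() in one process can break another process, here is a minimal sketch of the underlying behaviour. It is not lxc's code: the function name and the `shared`/`peer` variables are made up for illustration, and the waitpid() call forces an ordering that in the real bug is left to the race. The point it shows is that shutdown() acts on the socket itself, so every process holding a descriptor for that socket is affected, unlike close(), which only drops the caller's own descriptor.

```python
import os
import socket


def shutdown_in_child_breaks_shared_socket():
    """Return (parent_send_failed, peer_saw_eof) after a forked child calls shutdown()."""
    # Hypothetical stand-in for an IPC socketpair; not lxc's actual code.
    shared, peer = socket.socketpair()

    pid = os.fork()
    if pid == 0:
        # Child: shutdown() applies to the socket, not just to this
        # process's descriptor, so the parent's copy is affected too.
        shared.shutdown(socket.SHUT_RDWR)
        os._exit(0)

    # Wait for the child so the ordering is deterministic in this demo.
    os.waitpid(pid, 0)

    # The parent still has `shared` open, but sending on it now fails.
    try:
        shared.send(b"1234")
        parent_send_failed = False
    except OSError:
        parent_send_failed = True

    # The other end reads end-of-file instead of a message.
    peer_saw_eof = peer.recv(16) == b""

    shared.close()
    peer.close()
    return parent_send_failed, peer_saw_eof


if __name__ == "__main__":
    print(shutdown_in_child_breaks_shared_socket())
```

On Linux I would expect both values to be True. Without the waitpid(), the result would depend on which process runs first, and that timing dependence is the kind of race described above.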

Basic repro case: x=0; while lxc exec penguin -- id -n 1000 -u; do x=$(( $x+1 )); echo $x; done

This usually fails in fewer than 300 iterations on my system. With the shutdown() calls in attach_child_main() removed, it has run for more than 20k iterations without failing: https://github.com/lxc/lxc/blob/lxc-3.0.2/src/lxc/attach.c#L933
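
If a bounded run is more convenient than an open-ended shell loop (for example when checking a fix against a fixed iteration count), the same repro can be sketched as a small helper. The function name and the `limit` parameter are my own choices; only the command itself comes from the repro case above.

```python
import subprocess


def iterations_until_failure(cmd, limit):
    """Run cmd up to `limit` times.

    Returns the number of successful runs before the first non-zero exit,
    or `limit` if the command never failed.
    """
    for done in range(limit):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return done
    return limit


# Command taken verbatim from the repro case in this bug.
REPRO_CMD = ["lxc", "exec", "penguin", "--", "id", "-n", "1000", "-u"]
```

Calling iterations_until_failure(REPRO_CMD, 20000) wherever the one-liner is normally run returns the iteration at which the first failure occurred, or 20000 if there was none.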

Attaching an strace of the failure.
bad_strace.txt (2.3 KB)
Project Member

Comment 4 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/4f6515fbe53897a78e6e0e2d70166215b360d1d7

commit 4f6515fbe53897a78e6e0e2d70166215b360d1d7
Author: Stephen Barber <smbarber@chromium.org>
Date: Wed Sep 19 04:12:07 2018

app-emulation/lxc: add shutdown fix for lxc-attach

Merged upstream in https://github.com/lxc/lxc/pull/2619

BUG= chromium:884244 
TEST=run repro case in bug; no failure after 10k+ iterations

Change-Id: Ia93db140e10ba840c50826fbfbfff77969113530
Reviewed-on: https://chromium-review.googlesource.com/1231854
Commit-Ready: Stephen Barber <smbarber@chromium.org>
Tested-by: Stephen Barber <smbarber@chromium.org>
Reviewed-by: Chirantan Ekbote <chirantan@chromium.org>

[modify] https://crrev.com/4f6515fbe53897a78e6e0e2d70166215b360d1d7/app-emulation/lxc/lxc-3.0.1.ebuild
[rename] https://crrev.com/4f6515fbe53897a78e6e0e2d70166215b360d1d7/app-emulation/lxc/lxc-3.0.1-r2.ebuild
[add] https://crrev.com/4f6515fbe53897a78e6e0e2d70166215b360d1d7/app-emulation/lxc/files/lxc-3.0.1-attach-shutdown.patch

Status: Fixed (was: Started)
Status: Verified (was: Fixed)
