Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----



Fix moblab to work with the version of lxc provided by portage-stable

Project Member Reported by xixuan@chromium.org, Nov 28

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/7989

11/27 21:27:52.766 DEBUG|          base_job:0357| Persistent state global_properties.fast now set to False
11/27 21:27:52.766 DEBUG|          base_job:0357| Persistent state global_properties.max_result_size_KB now set to 20000
11/27 21:27:52.785 DEBUG|          autotemp:0116| Clean was not called for /tmp/_autotmp_aeCTpYssh-master
11/27 21:27:52.809 INFO |    connectionpool:0207| Starting new HTTP connection (1): metadata.google.internal
11/27 21:27:53.064 INFO |            config:0024| Configuration file does not exist, ignoring: /etc/chrome-infra/ts-mon.json
11/27 21:27:53.065 ERROR|            config:0244| ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
11/27 21:27:53.066 NOTIC|      cros_logging:0038| ts_mon was set up.
11/27 21:27:53.066 DEBUG|          autoserv:0264| Trying to start servod.
11/27 21:27:53.166 WARNI|          autoserv:0272| Starting servod is aborted. The dut's servo_host attribute is not set to localhost.
11/27 21:27:53.166 DEBUG|             utils:0212| Running 'sudo test -e "/mnt/moblab/containers/base_05/container_id.p"'
11/27 21:27:53.180 DEBUG|             utils:0212| Running 'sudo lxc-ls --active'
11/27 21:27:53.197 DEBUG|             utils:0212| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
11/27 21:27:53.212 DEBUG|             utils:0212| Running 'cp /usr/local/autotest/results/drone_tmp/attach.7 /usr/local/autotest/results/2-moblab/192.168.231.101/attach.7'
11/27 21:27:53.220 DEBUG|             utils:0212| Running 'sudo test -e "/mnt/moblab/containers/test_2_1511846872_15065"'
11/27 21:27:53.242 DEBUG|             utils:0212| Running 'sudo -n virt-what'
11/27 21:27:53.259 WARNI|             utils:2300| Package virt-what is not installed, default to assume it is not a virtual machine.
11/27 21:27:53.260 DEBUG|             utils:0212| Running 'sudo lxc-clone --lxcpath /mnt/moblab/containers --newpath /mnt/moblab/containers --orig base_05 --new test_2_1511846872_15065  '
11/27 21:27:53.276 DEBUG| container_factory:0102| Creating snapshot clone failed. Attempting without snapshot...
11/27 21:27:53.278 DEBUG|             utils:0212| Running 'sudo lxc-ls --active'
11/27 21:27:53.306 DEBUG|             utils:0212| Running 'sudo test -e "/mnt/moblab/containers/base_05/rootfs"'
11/27 21:27:53.326 DEBUG|             utils:0212| Running 'sudo test -e "/mnt/moblab/containers/base_05/container_id.p"'
11/27 21:27:53.342 INFO |        server_job:0218| FAIL  ----    ----    timestamp=1511846873    localtime=Nov 27 21:27:53       Failed to setup container for test: Command <sudo lxc-clone --lxcpath /mnt/moblab/containers --newpath /mnt/moblab/containers --orig base_05 --new test_2_1511846872_15065  > failed, rc=1, Command returned non-zero exit status
  * Command:
      sudo lxc-clone --lxcpath /mnt/moblab/containers --newpath
      /mnt/moblab/containers --orig base_05 --new test_2_1511846872_15065
  Exit status: 1
  Duration: 0.00917220115662

  stderr:
  sudo: lxc-clone: command not found. Check logs in ssp_logs folder for more details.
11/27 21:27:53.343 DEBUG|             utils:0212| Running 'sudo -n chown -R 246 "/usr/local/autotest/results/2-moblab/192.168.231.101"'
11/27 21:27:53.353 DEBUG|             utils:0212| Running 'sudo -n chgrp -R 246 "/usr/local/autotest/results/2-moblab/192.168.231.101"'
11/27 21:27:53.362 ERROR|         traceback:0013| Traceback (most recent call last):
11/27 21:27:53.362 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 507, in run_autoserv
11/27 21:27:53.363 ERROR|         traceback:0013|     machines)
11/27 21:27:53.363 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 168, in _run_with_ssp
11/27 21:27:53.363 ERROR|         traceback:0013|     dut_name=dut_name)
11/27 21:27:53.363 ERROR|         traceback:0013|   File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
11/27 21:27:53.364 ERROR|         traceback:0013|     return fn(*args, **kwargs)
11/27 21:27:53.364 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc/cleanup_if_fail.py", line 40, in func_cleanup_if_fail
11/27 21:27:53.364 ERROR|         traceback:0013|     return func(*args, **kwargs)
11/27 21:27:53.364 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc/container_bucket.py", line 153, in setup_test
11/27 21:27:53.364 ERROR|         traceback:0013|     self.container_path)
11/27 21:27:53.365 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc/container_factory.py", line 67, in create_container
11/27 21:27:53.365 ERROR|         traceback:0013|     lxc_path=lxc_path)
11/27 21:27:53.365 ERROR|         traceback:0013|   File "/usr/lib64/python2.7/site-packages/chromite/lib/metrics.py", line 483, in wrapper
11/27 21:27:53.366 ERROR|         traceback:0013|     return fn(*args, **kwargs)
11/27 21:27:53.366 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc/container_factory.py", line 100, in _create_from_base
11/27 21:27:53.366 ERROR|         traceback:0013|     cleanup=self._force_cleanup)
11/27 21:27:53.366 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc/container.py", line 223, in clone
11/27 21:27:53.367 ERROR|         traceback:0013|     new_container = cls(new_path, new_name, {}, src, snapshot)
11/27 21:27:53.367 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc/container.py", line 135, in __init__
11/27 21:27:53.367 ERROR|         traceback:0013|     self.name, snapshot)
11/27 21:27:53.367 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc/utils.py", line 88, in 
11/27 21:27:53.368 ERROR|         traceback:0013|     utils.run(cmd)
11/27 21:27:53.368 ERROR|         traceback:0013|   File "/usr/local/autotest/client/common_lib/utils.py", line 738, in run
11/27 21:27:53.369 ERROR|         traceback:0013|     "Command returned non-zero exit status")
11/27 21:27:53.369 ERROR|         traceback:0013| CmdError: Command <sudo lxc-clone --lxcpath /mnt/moblab/containers --newpath /mnt/moblab/containers --orig base_05 --new test_2_1511846872_15065  > failed, rc=1, Command returned non-zero exit status
11/27 21:27:53.369 ERROR|         traceback:0013| * Command:
11/27 21:27:53.370 ERROR|         traceback:0013|     sudo lxc-clone --lxcpath /mnt/moblab/containers --newpath
11/27 21:27:53.370 ERROR|         traceback:0013|     /mnt/moblab/containers --orig base_05 --new test_2_1511846872_15065
11/27 21:27:53.370 ERROR|         traceback:0013| Exit status: 1
11/27 21:27:53.370 ERROR|         traceback:0013| Duration: 0.00917220115662
11/27 21:27:53.371 ERROR|         traceback:0013|
11/27 21:27:53.371 ERROR|         traceback:0013| stderr:
11/27 21:27:53.371 ERROR|         traceback:0013| sudo: lxc-clone: command not found
11/27 21:27:53.378 ERROR|          autoserv:0759| Uncaught SystemExit with code 1
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 755, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 562, in run_autoserv
    sys.exit(exit_code)
SystemExit: 1
11/27 21:27:53.434 DEBUG|   logging_manager:0627| Logging subprocess finished


Suspecting there's a bad CL.

 
Cc: dshi@chromium.org
Can't find related CL except for this one: https://chromium-review.googlesource.com/c/chromiumos/overlays/portage-stable/+/784271

@dshi could you verify whether this is caused by a bad CL or by a guado_moblab flake?
Cc: haddowk@chromium.org
Could be. lxc-clone is an old script that was replaced by lxc-copy in newer lxc releases, and the lxc upgrade might have removed the command completely. The lab is still on lxc 2, so we need to do some testing to see whether lxc-copy works on the lab servers as well.

For moblab, it's possible we can replace lxc-clone with lxc-copy if autotest finds it's running in moblab.

+haddowk
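
A minimal sketch of the lxc-clone to lxc-copy switch suggested above. It is not the autotest implementation: `build_clone_command` and its `which` parameter are hypothetical names, and instead of detecting moblab it simply probes PATH for whichever tool is installed. The lxc-clone options are the ones shown in the log; the lxc-copy long options (`--name`, `--newname`, `--snapshot`) are taken from its man page and should be checked against the installed version.

```python
import shutil


def build_clone_command(orig, new, lxc_path, snapshot=False, which=shutil.which):
    """Return an argv list that clones container `orig` into `new`.

    Prefers lxc-copy when it is on PATH and falls back to lxc-clone.
    `which` is injectable so the choice can be exercised without lxc installed.
    """
    if which("lxc-copy"):
        # Assumed lxc-copy spelling of the same operation.
        cmd = ["sudo", "lxc-copy",
               "--lxcpath", lxc_path, "--newpath", lxc_path,
               "--name", orig, "--newname", new]
    elif which("lxc-clone"):
        # Same options as the failing command in the log above.
        cmd = ["sudo", "lxc-clone",
               "--lxcpath", lxc_path, "--newpath", lxc_path,
               "--orig", orig, "--new", new]
    else:
        raise RuntimeError("neither lxc-copy nor lxc-clone was found on PATH")
    if snapshot:
        cmd.append("--snapshot")
    return cmd
```

Returning an argv list rather than a shell string avoids quoting problems with container names and keeps the decision testable on machines that have neither tool.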
Owner: chirantan@chromium.org
Assign to CL's owner.
Where do I find the logs from comment #1?  I uploaded a CL to replace lxc-clone with lxc-copy: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/794876

The guado_moblab-paladin-tryjob with that CL failed: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/4559

But I can't find any logs that mention anything about lxc-clone or lxc-copy like in comment #1.  The best I've been able to find is:

Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 631, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 837, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x   -o Protocol=2 -o StrictHostKeyChecking=no -o
    UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
    ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
    -l root -p 22 chromeos2-row2-rack8-host11 "export LIBC_FATAL_STDERR_=1; if
    type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::run_once|run_as_moblab|run] -> ssh_run(su - moblab -c
    '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan
    --build=cyan-release/R62-9901.66.0 --suite_name=dummy_server --retry=True
    --max_retries=1')\";fi; su - moblab -c
    '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan
    --build=cyan-release/R62-9901.66.0 --suite_name=dummy_server --retry=True
    --max_retries=1'"
Exit status: 1
Duration: 489.806571007


This looks to me like the ssh command failed, but it says nothing about why the underlying call to run_suite.py failed.  What's the magic location for the log from comment #1?
https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/7989
=> [Test-Logs]: moblab_RunSuite: FAIL: Unhandled AutoservRunError: command execution error
=> https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/159006017-chromeos-test/chromeos2-row1-rack8-host1/
=> download moblab_RunSuite.tgz, extract it
=> moblab_RunSuite/sysinfo/reboot_current/mnt/moblab/results/4-moblab/192.168.231.101/ssp_logs/debug/autoserv.DEBUG
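
Once moblab_RunSuite.tgz has been downloaded, the last two steps can be scripted. This is a rough sketch using only the Python standard library; the function names are hypothetical, and it assumes the SSP log always ends in `ssp_logs/debug/autoserv.DEBUG` as in the path above (the job directory, `4-moblab/192.168.231.101` here, will differ per run).

```python
import tarfile

# Path suffix of the server-side-packaging autoserv log inside the tarball.
SSP_LOG_SUFFIX = "ssp_logs/debug/autoserv.DEBUG"


def find_ssp_autoserv_logs(tgz_path):
    """Return the archive member names of every SSP autoserv debug log."""
    with tarfile.open(tgz_path, "r:gz") as archive:
        return [m.name for m in archive.getmembers()
                if m.isfile() and m.name.endswith(SSP_LOG_SUFFIX)]


def read_member(tgz_path, member_name):
    """Return the text of one archive member without unpacking the tarball."""
    with tarfile.open(tgz_path, "r:gz") as archive:
        extracted = archive.extractfile(member_name)
        if extracted is None:
            raise ValueError("%s is not a regular file" % member_name)
        return extracted.read().decode("utf-8", errors="replace")
```

For example, `for name in find_ssp_autoserv_logs("moblab_RunSuite.tgz"): print(read_member("moblab_RunSuite.tgz", name))` dumps each matching log, which can then be searched for lxc-clone or lxc-copy.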
Project Member

Comment 7 by bugdroid1@chromium.org, Dec 2

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d

commit 7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d
Author: Chirantan Ekbote <chirantan@chromium.org>
Date: Sat Dec 02 06:45:28 2017

project-moblab: Copy app-emulation/lxc and mask newer versions

Copy app-emulation/lxc from portage-stable into the project-moblab
directory and mask newer versions in the moblab overlay because they
break moblab.  This allows us to update the version of lxc in
portage-stable.

BUG=chromium:789062
TEST='cros tryjob --hwtest guado_moblab-paladin-tryjob'

Change-Id: I7cbf4dc445db9e7e3b38b11615b1d2bd8292094f
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/804814
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/profiles/base/package.mask
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/app-emulation/lxc/files/lxc.initd.2
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/app-emulation/lxc/files/lxc.initd.3
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/app-emulation/lxc/metadata.xml
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/app-emulation/lxc/lxc-1.0.7.ebuild
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/app-emulation/lxc/files/lxc_at.service
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/profiles/base/eapi
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/app-emulation/lxc/files/lxc-1.0.6-bash-completion.patch
[add] https://crrev.com/7aa8d5e7c02525f0544e9ed35e444f34ba0f2c9d/project-moblab/app-emulation/lxc/Manifest

Cc: chirantan@chromium.org
Owner: ----
Status: Available
Summary: Fix moblab to work with the version of lxc provided by portage-stable (was: guado_moblab-paladin failed due to "lxc-clone: command not found")
I've landed a temporary workaround to pin the version used by moblab to 1.0.7.

Changing this bug to be about fixing moblab to work with the new version.
Cc: ihf@chromium.org jkop@chromium.org
Labels: -Pri-2 M-67 Pri-1
Owner: snanda@chromium.org
Status: Assigned
Sameer, could you please find somebody to fix Chirantan's technical debt? This is breaking moblab and causing problems running server tests via lxc down the road.
How exactly is this my technical debt?
I believe you landed this TODO
https://chromium-review.googlesource.com/#/c/chromiumos/overlays/board-overlays/+/804814/3/project-moblab/profiles/base/package.mask

# Mask newer versions of lxc because they break moblab.
# TODO(crbug.com/789062): Fix moblab to work with newer versions of lxc
# and drop the old version in this overlay.

This TODO has *your* name on it, even if you avoided typing it.
crbug.com/789062 is quite a novel way to spell "chirantan".
As someone with experience being an ass: Stop being an ass. Both of you.

A quick fix for a bug you were owner of, which still needs to be fixed eventually, is technical debt someone in the moblab team needs to deal with.
I'm not arguing that it's not technical debt.  My point is that this is something that has always existed:

* moblab depends on lxc
* newer versions of lxc depend on criu
* criu's build system is set up in a way that makes cross-compiling fail
* upgrading lxc to a newer version requires fixing criu

Even if I hadn't landed a workaround to unblock a separate project, _moblab would still have this exact problem_.  The only difference is that I wouldn't be involved in the discussion in any way.

I'm only objecting to the idea that it's somehow my fault that we're in this situation.  I'm more than happy to help with getting criu fixed so that we drop this workaround.


As for being an ass, I 100% agree that I was being an ass but I _really_ don't like people randomly CC'ing my manager and throwing me under the bus for stuff that is only tangentially related to me.


My sincere apologies to Chirantan for misreading the situation! I was following the revision history and did not read this issue carefully enough.
That said, lxc is kernel code and owned by the kernel team. I don't see the infra team resolving problems with it.

lxc is required to run moblab, a board which is mission critical. It was agreed that moblab is used by CrOS partners to qualify builds. If moblab lxc cannot keep up with the lxc in the ChromeOS lab then moblab will sooner or later diverge and fail and partners will not deal with it anymore. This is why I feel so strongly about it.

Chirantan, I am very grateful that you were able to see past my abrasiveness and offered your help with upreving lxc. I do understand though if Sameer should look for somebody else.
Cc: stephenlin@chromium.org snanda@chromium.org
Owner: chirantan@chromium.org
Thanks all for bringing this discussion back on rails. Much appreciated.

Chirantan will be chatting with Keith from the moblab team to figure out the next steps. Keith is out today, so the earliest this will happen is Monday.

Assigning to Chirantan for now.
Owner: haddowk@chromium.org
Status: Started
