New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 829871 link

Starred by 4 users

Issue metadata

Status: Duplicate
Merged: issue 855664
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

moblab-generic-vm-paladin failed (setting up lxc?)

Project Member Reported by pprabhu@chromium.org, Apr 6 2018

Issue description

There is a mystery paladin failures overnight: https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1268


Looking at the scheduler logs in the moblab vm, I see:

04/06 03:19:01.298 DEBUG|             utils:0214| Running 'which lxc-start'
04/06 03:19:01.340 DEBUG|             utils:0282| [stdout] /usr/sbin/lxc-start
04/06 03:19:01.341 DEBUG|             utils:0214| Running 'sudo -n ls "/usr/local/autotest/containers/base_05"'
04/06 03:19:01.376 ERROR|             utils:0282| [stderr] ls: cannot access '/usr/local/autotest/containers/base_05': No such file or directory
04/06 03:19:01.378 ERROR|            drones:0188| Drone localhost does not support server-side packaging.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/drones.py", line 182, in support_ssp
    self._host.run('sudo -n ls "%s"' %  base_container)
  File "/usr/local/autotest/client/bin/local_host.py", line 57, in run
    stderr_tee=stderr_tee, stdin=stdin, args=args)
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
AutotestHostRunCmdError: sudo -n ls "/usr/local/autotest/containers/base_05" (Command returned non-zero exit status)
* Command: 
    sudo -n ls "/usr/local/autotest/containers/base_05"
Exit status: 2
Duration: 0.0236210823059

stderr:
ls: cannot access '/usr/local/autotest/containers/base_05': No such file or directory
04/06 03:19:01.382 ERROR|        monitor_db:0201| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 179, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 269, in initialize
    _drone_manager.initialize(RESULTS_DIR, drone_list, results_host)
  File "/usr/local/autotest/scheduler/drone_manager.py", line 212, in initialize
    self.refresh_drone_configs()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 326, in refresh_drone_configs
    drone.support_ssp)
  File "/usr/local/autotest/scheduler/drones.py", line 182, in support_ssp
    self._host.run('sudo -n ls "%s"' %  base_container)
  File "/usr/local/autotest/client/bin/local_host.py", line 57, in run
    stderr_tee=stderr_tee, stdin=stdin, args=args)
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
AutotestHostRunCmdError: sudo -n ls "/usr/local/autotest/containers/base_05" (Command returned non-zero exit status)
* Command: 
    sudo -n ls "/usr/local/autotest/containers/base_05"
Exit status: 2
Duration: 0.0236210823059

stderr:
ls: cannot access '/usr/local/autotest/containers/base_05': No such file or directory
04/06 03:19:01.424 INFO |     ts_mon_config:0207| Waiting for ts_mon flushing process to finish...
04/06 03:20:01.006 INFO |     ts_mon_config:0213| Finished waiting for ts_mon process.



-----------

But moblab-vm doesn't support SSP yet. It can't run the lxc container. See issue 793016

I'm guessing something changed in the lxc container setup that the scheduler started unconditionally starting lxc pool in a way that it fails if the pool fails to startup.

Assigning to akeshet@ because this smells like a tree closer to me.
 

I added some logic to the scheduler a long time ago to exit if the base container is not downloaded.  That is part of bootup not when your run a lxc container

I will check - but I suspect that the USB device not mounting is the root cause again.

I need to keep the monitor db logic the way it was - it was to fix an issue with partners where slow connections resulted in the monitor_db starting in the "not lxc" only mode - we never want that on a real moblab.
Issue 829886 has been merged into this issue.
I found the CL - if you want to drop drones being SSP enabled on VM there is a setting to change in moblab_config.ini

https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/719241
However again - that change to raise errors went in last Oct - so whatever is happening now is new.
Labels: Chase-Pending
Chase pending justification: This was top-of-list of CL killer last week. Chase for at least figuring out what was up / how often this is happening + initial mitigation.
Labels: -Chase-Pending Chase
Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
-> pprabhu to root cause
Status: Started (was: Assigned)
Last CQ run as well: https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1303
The real failure is in moblab-base-container-init:

DEBUG:root:Running 'sudo -n true'
DEBUG:root:Running 'sudo lxc-ls --active'
DEBUG:root:Running 'sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp'
WARNING:root:<class 'autotest_lib.client.common_lib.error.CmdError'>(Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp> failed, rc=6, Command returned non-zero exit status
* Command: 
    sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-
    containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp
Exit status: 6
Duration: 0.0614428520203
)
WARNING:root:Retrying in 10.070890 seconds...
DEBUG:root:Running 'sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp'
WARNING:root:run process timeout (180) fired on: sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp
DEBUG:root:Running 'sudo test -e "/usr/local/autotest/containers/base_05"'
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/lxc.py", line 68, in <module>
    main()
  File "/usr/local/autotest/site_utils/lxc.py", line 62, in main
    image.setup(name=options.name, force_delete=options.force_delete)
  File "/usr/local/autotest/site_utils/lxc/base_image.py", line 93, in setup
    self._download_and_install_base_container()
  File "/usr/local/autotest/site_utils/lxc/base_image.py", line 158, in _download_and_install_base_container
    lxc.download_extract(container_url, tar_path, self.container_path)
  File "/usr/local/autotest/site_utils/lxc/lxc.py", line 128, in download_extract
    _download_via_curl(url, tmp_file.name)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 218, in func_retry
    remaining_time)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 123, in timeout
    default_result = func(*args, **kwargs)
  File "/usr/local/autotest/site_utils/lxc/lxc.py", line 142, in _download_via_curl
    stderr_tee=common_utils.TEE_TO_LOGS, timeout=3*60)
  File "/usr/local/autotest/client/common_lib/utils.py", line 740, in run
    timeout)[0]
  File "/usr/local/autotest/client/common_lib/utils.py", line 820, in join_bg_jobs
    "Command(s) did not complete within %d seconds" % timeout)
autotest_lib.client.common_lib.error.CmdTimeoutError: Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp> failed, rc=-9, Command(s) did not complete within 180 seconds
* Command: 
    sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-
    containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp
Exit status: -9
Duration: 186.543534994
I uploaded https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1003275 to make the scheduler not start in this condition -- so that this failure occurs closer to the source.

But that doesn't actually address why the vmtest is failing to download the ssp container so often. Nor does it tell us how often this is happening.

There is also the point that the vmtest doesn't (yet) need the container in the first place.
I checked how often this has happened on the paladin on R67


$ gsutil ls gs://chromeos-image-archive/moblab-generic-vm-paladin/R67*/moblab_vm_test_results/results-1-moblab_DummyServerNoSspSuite/moblab_RunSuite/sysinfo/var/log/bootup/moblab-base-container-init.log | tee base_container_logs
$ for i in $(cat base_container_logs); do gsutil cp $i moblab-base-container-init.log.${i:54:17}; done
$ grep "did not complete within 180 seconds" *
moblab-base-container-init.log.R67-10550.0.0-rc4:autotest_lib.client.common_lib.error.CmdTimeoutError: Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_R1GrDF> failed, rc=-9, Command(s) did not complete within 180 seconds
moblab-base-container-init.log.R67-10560.0.0-rc4:autotest_lib.client.common_lib.error.CmdTimeoutError: Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp> failed, rc=-9, Command(s) did not complete within 180 seconds


Turns out there are exactly two occurrences, both already recorded on this bug.
In light of the frequency of this occurrence, and the fact that there is already a 180 second timeout on that single curl download, my reccommendation on this bug is wait and watch. The failure should be more obvious (TM) once my CL to block scheduler on base container setup failure lands.
Project Member

Comment 12 by bugdroid1@chromium.org, Apr 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/6b23fbead51968cad6c48f138384ed6c796c33af

commit 6b23fbead51968cad6c48f138384ed6c796c33af
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 13 22:47:50 2018

project-moblab: Only start scheduler if base container setup succeeds.

BUG= chromium:829871 
TEST=moblab-generic-vm-pre-cq

Change-Id: I3d85a341a46b54b1c69a4f8cb99ca289a53429df
Reviewed-on: https://chromium-review.googlesource.com/1003275
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/6b23fbead51968cad6c48f138384ed6c796c33af/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-scheduler-init.conf

Labels: -Chase
Upon investigation, not frequent enough for Chase bucket.
Labels: -Pri-1 Pri-2
Owner: ----
Status: Available (was: Started)
Owner: pprabhu@chromium.org
Cc: ihf@chromium.org pprabhu@chromium.org dgarr...@chromium.org jclinton@chromium.org mortonm@chromium.org
 Issue 855664  has been merged into this issue.
Labels: -Pri-2 Pri-1
Owner: jkop@chromium.org
Status: Assigned (was: Available)
This killed a few CQ runs today.

I didn't get too far root causing last I looked at this.
Labels: Hotlist-Deputy
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test

Comment 21 by jkop@chromium.org, Jun 25 2018

Cc: -jkop@chromium.org gu...@chromium.org
Owner: xixuan@chromium.org
Mergedinto: 855664
Status: Duplicate (was: Assigned)

Sign in to add a comment