Issue metadata
moblab-generic-vm-paladin failed (setting up lxc?)
Issue description

There was a mystery paladin failure overnight: https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1268

Looking at the scheduler logs in the moblab vm, I see:

04/06 03:19:01.298 DEBUG| utils:0214| Running 'which lxc-start'
04/06 03:19:01.340 DEBUG| utils:0282| [stdout] /usr/sbin/lxc-start
04/06 03:19:01.341 DEBUG| utils:0214| Running 'sudo -n ls "/usr/local/autotest/containers/base_05"'
04/06 03:19:01.376 ERROR| utils:0282| [stderr] ls: cannot access '/usr/local/autotest/containers/base_05': No such file or directory
04/06 03:19:01.378 ERROR| drones:0188| Drone localhost does not support server-side packaging.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/drones.py", line 182, in support_ssp
    self._host.run('sudo -n ls "%s"' % base_container)
  File "/usr/local/autotest/client/bin/local_host.py", line 57, in run
    stderr_tee=stderr_tee, stdin=stdin, args=args)
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
AutotestHostRunCmdError: sudo -n ls "/usr/local/autotest/containers/base_05" (Command returned non-zero exit status)
* Command: sudo -n ls "/usr/local/autotest/containers/base_05"
Exit status: 2
Duration: 0.0236210823059
stderr:
ls: cannot access '/usr/local/autotest/containers/base_05': No such file or directory
04/06 03:19:01.382 ERROR| monitor_db:0201| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 179, in main_without_exception_handling
    initialize()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 269, in initialize
    _drone_manager.initialize(RESULTS_DIR, drone_list, results_host)
  File "/usr/local/autotest/scheduler/drone_manager.py", line 212, in initialize
    self.refresh_drone_configs()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 326, in refresh_drone_configs
    drone.support_ssp)
  File "/usr/local/autotest/scheduler/drones.py", line 182, in support_ssp
    self._host.run('sudo -n ls "%s"' % base_container)
  File "/usr/local/autotest/client/bin/local_host.py", line 57, in run
    stderr_tee=stderr_tee, stdin=stdin, args=args)
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
AutotestHostRunCmdError: sudo -n ls "/usr/local/autotest/containers/base_05" (Command returned non-zero exit status)
* Command: sudo -n ls "/usr/local/autotest/containers/base_05"
Exit status: 2
Duration: 0.0236210823059
stderr:
ls: cannot access '/usr/local/autotest/containers/base_05': No such file or directory
04/06 03:19:01.424 INFO | ts_mon_config:0207| Waiting for ts_mon flushing process to finish...
04/06 03:20:01.006 INFO | ts_mon_config:0213| Finished waiting for ts_mon process.
-----------
But moblab-vm doesn't support SSP yet; it can't run the lxc container. See issue 793016.

I'm guessing something changed in the lxc container setup so that the scheduler started unconditionally bringing up the lxc pool, in a way that fails if the pool fails to start up.

Assigning to akeshet@ because this smells like a tree closer to me.
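An illustrative sketch only (not the actual autotest source) of the failure mode visible in the traceback above: support_ssp() probes for the base container with an `ls`, and the resulting AutotestHostRunCmdError is not caught anywhere on monitor_db's initialize() path, so a missing container directory kills the whole scheduler.

```python
import subprocess


class AutotestHostRunCmdError(Exception):
    """Stand-in for autotest's error on non-zero exit status."""


def run(cmd):
    # Like local_host.run(): raise if the command exits non-zero.
    result = subprocess.run(cmd, shell=True, capture_output=True)
    if result.returncode != 0:
        raise AutotestHostRunCmdError(
            "%s (Command returned non-zero exit status)" % cmd)


def support_ssp(base_container="/usr/local/autotest/containers/base_05"):
    # drones.py-style probe: the failure is logged but still propagates
    # to the caller instead of being treated as "SSP unsupported".
    run('ls "%s"' % base_container)
    return True


def initialize():
    # monitor_db.initialize() -> drone_manager.refresh_drone_configs():
    # nothing on this path catches the probe failure, so the scheduler
    # process terminates.
    return support_ssp()
```

The point of the sketch is that a probe whose negative answer is expected ("this drone has no SSP") is implemented as an exception that escapes all the way to process exit.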
,
Apr 6 2018
Issue 829886 has been merged into this issue.
,
Apr 6 2018
I found the CL. If you want to stop treating drones as SSP-enabled on the VM, there is a setting to change in moblab_config.ini: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/719241
,
Apr 6 2018
However, that change to raise errors landed last October, so whatever is happening now is new.
,
Apr 9 2018
Chase pending justification: this was the top CL killer last week. Chase at least to figure out what was going wrong and how often it is happening, plus an initial mitigation.
,
Apr 9 2018
-> pprabhu to root cause
,
Apr 9 2018
Last CQ run as well: https://luci-milo.appspot.com/buildbot/chromiumos/moblab-generic-vm-paladin/1303
,
Apr 9 2018
The real failure is in moblab-base-container-init:

DEBUG:root:Running 'sudo -n true'
DEBUG:root:Running 'sudo lxc-ls --active'
DEBUG:root:Running 'sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp'
WARNING:root:<class 'autotest_lib.client.common_lib.error.CmdError'>(Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp> failed, rc=6, Command returned non-zero exit status
* Command: sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp
Exit status: 6
Duration: 0.0614428520203
)
WARNING:root:Retrying in 10.070890 seconds...
DEBUG:root:Running 'sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp'
WARNING:root:run process timeout (180) fired on: sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp
DEBUG:root:Running 'sudo test -e "/usr/local/autotest/containers/base_05"'
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/lxc.py", line 68, in <module>
    main()
  File "/usr/local/autotest/site_utils/lxc.py", line 62, in main
    image.setup(name=options.name, force_delete=options.force_delete)
  File "/usr/local/autotest/site_utils/lxc/base_image.py", line 93, in setup
    self._download_and_install_base_container()
  File "/usr/local/autotest/site_utils/lxc/base_image.py", line 158, in _download_and_install_base_container
    lxc.download_extract(container_url, tar_path, self.container_path)
  File "/usr/local/autotest/site_utils/lxc/lxc.py", line 128, in download_extract
    _download_via_curl(url, tmp_file.name)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 218, in func_retry
    remaining_time)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 123, in timeout
    default_result = func(*args, **kwargs)
  File "/usr/local/autotest/site_utils/lxc/lxc.py", line 142, in _download_via_curl
    stderr_tee=common_utils.TEE_TO_LOGS, timeout=3*60)
  File "/usr/local/autotest/client/common_lib/utils.py", line 740, in run
    timeout)[0]
  File "/usr/local/autotest/client/common_lib/utils.py", line 820, in join_bg_jobs
    "Command(s) did not complete within %d seconds" % timeout)
autotest_lib.client.common_lib.error.CmdTimeoutError: Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp> failed, rc=-9, Command(s) did not complete within 180 seconds
* Command: sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp
Exit status: -9
Duration: 186.543534994
,
Apr 10 2018
I uploaded https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1003275 to make the scheduler not start in this condition, so that this failure occurs closer to the source. But that doesn't actually address why the vmtest fails to download the SSP container so often, nor does it tell us how often this is happening. There is also the point that the vmtest doesn't (yet) need the container in the first place.
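The real change lives in the upstart conf (moblab-scheduler-init.conf); purely as a hypothetical sketch of the gating idea, the scheduler start would be conditioned on the base container directory actually existing, so a setup failure surfaces at container init instead of as a scheduler crash:

```python
import os


def should_start_scheduler(
        base_container="/usr/local/autotest/containers/base_05"):
    # Gate: only start the scheduler if base container setup succeeded,
    # i.e. the container directory is present on disk.
    return os.path.isdir(base_container)
```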
,
Apr 10 2018
I checked how often this has happened on the paladin for R67:
$ gsutil ls gs://chromeos-image-archive/moblab-generic-vm-paladin/R67*/moblab_vm_test_results/results-1-moblab_DummyServerNoSspSuite/moblab_RunSuite/sysinfo/var/log/bootup/moblab-base-container-init.log | tee base_container_logs
$ for i in $(cat base_container_logs); do gsutil cp $i moblab-base-container-init.log.${i:54:17}; done
$ grep "did not complete within 180 seconds" *
moblab-base-container-init.log.R67-10550.0.0-rc4:autotest_lib.client.common_lib.error.CmdTimeoutError: Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_R1GrDF> failed, rc=-9, Command(s) did not complete within 180 seconds
moblab-base-container-init.log.R67-10560.0.0-rc4:autotest_lib.client.common_lib.error.CmdTimeoutError: Command <sudo curl -s http://storage.googleapis.com/abci-ssp/autotest-containers/base_05.tar.xz -o /tmp/base_05.tar.xz_z2oCsp> failed, rc=-9, Command(s) did not complete within 180 seconds
Turns out there are exactly two occurrences, both already recorded on this bug.
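For anyone puzzling over the loop above (my reading, not from the bug): `${i:54:17}` is a bash substring expansion, taking 17 characters starting at offset 54 of each gs:// path, which is exactly where the release tag sits given the fixed-length bucket prefix. The equivalent in Python:

```python
# Offset 54 is the length of the fixed prefix
# "gs://chromeos-image-archive/moblab-generic-vm-paladin/", and 17 is
# the length of a tag like "R67-10550.0.0-rc4".
path = ("gs://chromeos-image-archive/moblab-generic-vm-paladin/"
        "R67-10550.0.0-rc4/moblab_vm_test_results/...")
tag = path[54:54 + 17]   # "R67-10550.0.0-rc4"
```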
,
Apr 10 2018
In light of the frequency of this occurrence, and the fact that there is already a 180 second timeout on that single curl download, my recommendation on this bug is to wait and watch. The failure should be more obvious (TM) once my CL to block scheduler startup on base container setup failure lands.
,
Apr 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/6b23fbead51968cad6c48f138384ed6c796c33af

commit 6b23fbead51968cad6c48f138384ed6c796c33af
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Apr 13 22:47:50 2018

    project-moblab: Only start scheduler if base container setup succeeds.

    BUG=chromium:829871
    TEST=moblab-generic-vm-pre-cq

    Change-Id: I3d85a341a46b54b1c69a4f8cb99ca289a53429df
    Reviewed-on: https://chromium-review.googlesource.com/1003275
    Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
    Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
    Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
    Reviewed-by: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/6b23fbead51968cad6c48f138384ed6c796c33af/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-scheduler-init.conf
,
Apr 16 2018
Upon investigation, not frequent enough for Chase bucket.
,
Jun 23 2018
Issue 855664 has been merged into this issue.
,
Jun 23 2018
This killed a few CQ runs today. I didn't get very far with root-causing the last time I looked at this.
Comment 1 by haddowk@chromium.org, Apr 6 2018