
Issue 920855

Starred by 6 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




moblab-generic-vm-paladin: Upstart service moblab-scheduler-init not in running state.

Project Member Reported by drinkcat@chromium.org, Jan 11

Issue description

https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8924679387556216816

https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8924679387556216816/+/steps/MoblabVMTest/0/stdout

  -----------------------------------------------------------------------------------------------
  /tmp/cbuildbotpejVqZ/results/results-1-moblab_DummyServerNoSspSuite                 [  FAILED  ]
  /tmp/cbuildbotpejVqZ/results/results-1-moblab_DummyServerNoSspSuite                   ERROR: Unhandled UpstartServiceNotRunning: Upstart service moblab-scheduler-init not in running state.
  /tmp/cbuildbotpejVqZ/results/results-1-moblab_DummyServerNoSspSuite/moblab_RunSuite [  FAILED  ]
  /tmp/cbuildbotpejVqZ/results/results-1-moblab_DummyServerNoSspSuite/moblab_RunSuite   ERROR: Unhandled UpstartServiceNotRunning: Upstart service moblab-scheduler-init not in running state.
  -----------------------------------------------------------------------------------------------
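For anyone reproducing this locally: the failure boils down to the job's Upstart status not reporting `start/running`. Below is a minimal sketch of such a check, assuming the usual `initctl status <job>` output format (`<job> start/running, process <pid>` or `<job> stop/waiting`). `is_upstart_job_running` is a hypothetical helper for illustration, not the autotest implementation that raised `UpstartServiceNotRunning`.

```python
import re

# Assumed status line shape: "<job> <goal>/<state>[, process <pid>]".
_STATUS_RE = re.compile(r"^(?P<job>\S+)\s+(?P<goal>\w+)/(?P<state>[\w-]+)")


def is_upstart_job_running(status_line, job):
    """Return True only if status_line reports `job` as start/running."""
    match = _STATUS_RE.match(status_line.strip())
    if match is None or match.group("job") != job:
        return False
    return match.group("goal") == "start" and match.group("state") == "running"
```

The status line would typically come from running `initctl status moblab-scheduler-init` on the moblab VM.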
 
Components: -Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
Results are here:

https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/moblab-generic-vm-paladin/R73-11562.0.0-rc1/moblab_vm_test_results

I don't see anything interesting in the logs ,-( I don't even see where that moblab-scheduler-init service should be started from...

There are some other errors in mobmonitor.log, not sure if any of them matter:
2019-01-10 22:34:checkfile.manager:ERROR    Failed to execute health check ServoExists: Command '['sudo', 'lsusb']' returned non-zero exit status 1
Traceback (most recent call last):
  File "/etc/moblab/mobmonitor/checkfile/manager.py", line 130, in DetermineHealthcheckStatus
    result = healthcheck.Check()
  File "/etc/moblab/mobmonitor/checkfiles/moblab/servo_check.py", line 26, in Check
    usbs = osutils.sudo_run_command(cmd).strip()
  File "/etc/moblab/mobmonitor/util/osutils.py", line 64, in sudo_run_command
    shell=shell)
  File "/etc/moblab/mobmonitor/util/osutils.py", line 46, in run_command
    raise RunCommandError(e.returncode, e.cmd)
RunCommandError: Command '['sudo', 'lsusb']' returned non-zero exit status 1
2019-01-10 22:34:moblab.heartbeat_check:INFO     Start to check heartbeat
2019-01-10 22:34:moblab.heartbeat_check:INFO     Try to import autotest.
2019-01-10 22:34:moblab.heartbeat_check:WARNING  Autotest is not ready.
2019-01-10 22:34:checkfile.manager:ERROR    Failed to execute health check Heartbeat: too many values to unpack
Traceback (most recent call last):
  File "/etc/moblab/mobmonitor/checkfile/manager.py", line 139, in DetermineHealthcheckStatus
    description, actions = healthcheck.Diagnose(result)
ValueError: too many values to unpack
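The second traceback is what Python raises when a call returns more items than the names on the left-hand side. A small sketch of that failure mode follows; the two `diagnose_*` functions are hypothetical stand-ins, since the log does not show what the real `Diagnose` returned.

```python
def diagnose_pair(result):
    # Hypothetical health check returning the expected (description, actions) pair.
    return ("heartbeat missing", [])


def diagnose_triple(result):
    # Hypothetical health check returning one element too many.
    return ("heartbeat missing", [], "extra")


def safe_diagnose(diagnose, result):
    """Unpack a diagnosis as manager.py does, but report a bad shape instead of crashing."""
    try:
        description, actions = diagnose(result)
    except ValueError as err:
        return None, str(err)
    return (description, actions), None
```

With `diagnose_triple`, the unpacking fails with a "too many values to unpack" message like the one in the log above.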

Labels: -Pri-2 Pri-1
Cc: mattmallett@chromium.org
Owner: haddowk@chromium.org
+haddowk to triage.
#6, precq error is different: crbug.com/921324
The original issue is caused by a lack of access to the cloud storage that moblab needs during boot:

  File "/usr/local/autotest/client/common_lib/utils.py", line 834, in join_bg_jobs
    "Command(s) did not complete within %d seconds" % timeout)
autotest_lib.client.common_lib.error.CmdTimeoutError: Command <sudo curl -s https://storage.googleapis.com/abci-ssp/autotest-containers/moblab_base_07.tar.xz -o /tmp/moblab_base_07.tar.xz_ZVClmb> failed, rc=-9, Command(s) did not complete within 180 seconds
* Command: 
    sudo curl -s https://storage.googleapis.com/abci-ssp/autotest-
    containers/moblab_base_07.tar.xz -o /tmp/moblab_base_07.tar.xz_ZVClmb

So this is an infra issue.
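The `rc=-9` in the log is presumably the download being killed once the 180 second limit expired. Here is a rough sketch of the same pattern using only the standard library. `run_with_timeout` is a hypothetical helper and not autotest's `join_bg_jobs`, and the slow command is a stand-in for the `curl` download.

```python
import subprocess
import sys

# Stand-in for a download that hangs; not the real curl invocation.
SLOW_CMD = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed on timeout."""
    try:
        # subprocess.run kills the child and waits for it before raising.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

Running the `curl` command from the log by hand on the VM, with a short timeout, is a quick way to tell a storage access problem apart from a moblab one.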
Owner: ----

Comment 10 by evanhernandez@google.com, Jan 16 (6 days ago)

Cc: semenzato@chromium.org evanhernandez@chromium.org
I believe this error appeared on the latest CQ run.

Build:
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8924169113349556416

Logs:
https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8924167895517798912/+/steps/MoblabVMTest/0/stdout

Can we get this triaged?

Comment 11 by evanhernandez@google.com, Jan 16 (6 days ago)

Cc: jclinton@google.com jclinton@chromium.org

Comment 12 by evanhernandez@google.com, Jan 16 (6 days ago)

Cc: -jclinton@google.com

Comment 13 by jclinton@chromium.org, Jan 16 (6 days ago)

Components: -Infra>ChromeOS>Test Infra>ChromeOS>Test>Platform

Comment 14 by tomhughes@chromium.org, Jan 16 (6 days ago)

This bug or crbug.com/921324 seems to be blocking my CL from going through: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1410305

Comment 15 by lgoo...@chromium.org, Jan 17 (5 days ago)

Blocking: 920548

Comment 16 by lgoo...@chromium.org, Jan 17 (5 days ago)

Cc: lgoo...@chromium.org

Comment 17 by jclinton@chromium.org, Jan 17 (5 days ago)

Owner: akes...@chromium.org
Status: Assigned (was: Available)

Comment 18 by akes...@chromium.org, Jan 18 (5 days ago)

I think we should mark this builder as experimental, permanently disable it, or hand off its ownership.

The test infra team has its hands full migrating to the Skylab test infrastructure; feature development on the autotest infrastructure is effectively frozen. So, the value this builder provides to us is minimal.

Comment 19 by evanhernandez@google.com, Jan 18 (4 days ago)

In that case, I vote to mark it experimental to prevent further CQ failures.

Comment 20 by jclinton@chromium.org, Jan 18 (4 days ago)

Agreed, experimental or remove; either is fine.

Comment 21 by mattmallett@google.com, Jan 18 (4 days ago)

We want this test to be in the CQ, and we're debugging the issue right now. I think it's fine to keep it experimental until we resolve this issue.

Comment 22 by akes...@chromium.org, Jan 18 (4 days ago)

Owner: mattmallett@google.com

Comment 23 by norvez@chromium.org, Jan 18 (4 days ago)

Cc: xiaochu@chromium.org ahass...@chromium.org

Comment 24 by xiaochu@chromium.org, Jan 18 (4 days ago)

The USE flag 'dlc' is enabled for amd64-generic and arm-generic so that the dlcservice package's unit tests run in the CQ. Let's disable it for the overlays that are impacted by this. Which overlays should we target?

Comment 25 by norvez@chromium.org, Jan 18 (4 days ago)

overlay-moblab-generic-vm, overlay-variant-amd64-generic-embedded and overlay-variant-amd64-generic-mobbuild all inherit from amd64-generic.

Comment 26 by xiaochu@chromium.org, Jan 18 (4 days ago)

Owner: xiaochu@chromium.org
Status: Started (was: Assigned)
thanks!

Comment 27 by norvez@chromium.org, Jan 18 (4 days ago)

Cc: haddowk@chromium.org
For reference, this is following an email to chatty
"
Also when the moblab VM fails to provision we see dlcservice and update_engine_client crashes :
"

Not sure if dlcservice is causing the provision failure itself, but disabling it will at least remove some noise

Comment 28 by lgoo...@chromium.org, Jan 18 (4 days ago)

Blocking: -920548

Comment 30 by norvez@chromium.org, Jan 19 (3 days ago)

Owner: mattmallett@chromium.org
Disabling dlcservice doesn't fix the issue. CL https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1422555 that removes it from the build is also failing in moblab-generic-vm-pre-cq because upstart is not running. Log of the pre-cq run: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923930265940279312

Not sure it's worth chumping the CL since it doesn't appear to help, seems better to wait until the breakage has been resolved.

Re-assigning to original owner for further diagnosis.

Comment 31 by wtlee@chromium.org, Jan 21 (2 days ago)

Does anybody know why "moblab-generic-vm-paladin" is not marked as experimental? 

I saw in "http://chromiumos-status.appspot.com/" there is a message "Tree is open (EXPERIMENTAL=moblab-generic-vm-paladin crbug.com/920855 )". But in the latest build ("https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923760139434038576"), it does not have experimental label on it.

Comment 32 by ljusten@chromium.org, Yesterday (47 hours ago)

Cc: ljusten@chromium.org

Comment 33 by wtlee@chromium.org, Yesterday (44 hours ago)

Since moblab-generic-vm-paladin keeps blocking CQ, we have a CL (https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1424718) to move it to experimental.

Comment 34 by bugdroid1@chromium.org, Yesterday (43 hours ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/caadb4497559b686a231e2b5029e5abc0ef1090f

commit caadb4497559b686a231e2b5029e5abc0ef1090f
Author: paulhsia <paulhsia@google.com>
Date: Mon Jan 21 10:34:35 2019

moblab-generic-vm: mark as experimental

BUG=chromium:920855
TEST=Run local unit tests by command
     $ ./chromeos_config_unittest --update

Change-Id: I41fa7b43f8898222b509dc85915b62a26c4ff314
Reviewed-on: https://chromium-review.googlesource.com/1424718
Commit-Ready: Wei Lee <wtlee@chromium.org>
Tested-by: Wei Lee <wtlee@chromium.org>
Reviewed-by: Wei Lee <wtlee@chromium.org>

[modify] https://crrev.com/caadb4497559b686a231e2b5029e5abc0ef1090f/config/chromeos_config.py
[modify] https://crrev.com/caadb4497559b686a231e2b5029e5abc0ef1090f/config/config_dump.json

Comment 35 by ljusten@chromium.org, Yesterday (39 hours ago)

I'm still seeing failures on moblab-generic-vm-pre-cq [1]. Does the change in #34 take time to propagate or is this a different builder?

[1] https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8923724957450825408

Comment 36 by jclinton@chromium.org, Yesterday (39 hours ago)

Re: #31, Legoland only renders the experimental '*' on builders if they are set that way in the configuration. It does not render the status set in Tree Status. However, the master-paladin still considers the Tree Status setting when deciding whether to pass or fail a run, as you can see from the run that you linked to: https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8923760139434038576 . Therefore, that run (and all of them after) failed because the hardware lab is having an outage, tracked in issue 923737.
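To illustrate the Tree Status side of this, here is a small sketch that pulls builder names out of a status message like the one quoted in #31. The `EXPERIMENTAL=<builder>[,<builder>...]` shape is an assumption based on that single message, and `experimental_builders` is a hypothetical helper, not the chromite parser.

```python
import re


def experimental_builders(tree_status_message):
    """Return builder names listed after EXPERIMENTAL= in a tree status message."""
    builders = []
    for chunk in re.findall(r"EXPERIMENTAL=(\S+)", tree_status_message):
        # Assumed: several builders may be comma-separated in one chunk.
        for name in chunk.split(","):
            name = name.strip("()")
            if name:
                builders.append(name)
    return builders
```

For the message in #31, this yields only `moblab-generic-vm-paladin`. The bug reference after the space is not part of the match.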

Re: #35: moblab is not configured to block CL's from passing PreCQ so the failures are irrelevant: http://cs/chromeos_public/chromite/lib/constants.py?l=634&rcl=caadb4497559b686a231e2b5029e5abc0ef1090f

Please read the Sheriff FAQ <https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriff-details-chromium-os> to prepare for your shift; it's been updated with your responsibilities. In particular, you need to be annotating these failed builds: https://chromiumos-build-annotator.googleplex.com/build_annotations/builds_list/master-paladin/ ; that's your #1 priority.

Comment 37 by ljusten@chromium.org, Yesterday (37 hours ago)

@jclinton: Not sure who you are referring to, but please note that neither Wei (#31) nor I (#35) are build sheriffs.

If moblab does not block CLs from passing PreCQ, why does my commit queue flag get reset [1]? This has been happening repeatedly since Jan 14 and it seems like I can't submit my CL.

[1] https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1404615

Comment 38 by jclinton@chromium.org, Yesterday (26 hours ago)

Wei (wtlee@) is the non-PST build sheriff for this week.

> If moblab does not block CLs from passing PreCQ, why does my commit queue flag get reset [1]? This has been happening repeatedly since Jan 14 and it seems like I can't submit my CL.
> 
> [1] https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1404615

Someone explicitly added it as a special requirement for autotest CL's. Maybe you can contact them to see if it can be removed: http://cs/chromeos_public/src/third_party/autotest/files/COMMIT-QUEUE.ini?l=11&rcl=1eb8fe70531942bd81aca9c7f634c3562fe9e617

Comment 39 by ljusten@chromium.org, Today (23 hours ago)

Prathmesh, since you added it originally and disabled it temporarily once, should moblab-generic-vm-pre-cq be removed from third_party/autotest/files/COMMIT-QUEUE.ini?



Comment 40 by paulhsia@chromium.org, Today (21 hours ago)

Cc: wtlee@chromium.org

Comment 41 by tomhughes@chromium.org, Today (4 hours ago)

FWIW, I'm seeing the same thing as ljusten@ in #37 (PreCQ failure resetting my commit queue flag, so I can't submit: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1387752/10#message-93ddcb9164f9a84152a3600bdf1cd3516bb728ec)

Comment 42 by briannorris@chromium.org, Today (4 hours ago)

I'm not sure why no one has thrown anything like this up yet:

https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1429219

It might not be exactly the right thing, but hopefully that will move the conversation...

In the meantime, just do what I do and chump ;)

Note: I definitely did not recommend chumping. You did *not* hear it here.

Comment 43 by briannorris@chromium.org, Today (4 hours ago)

Also, I noticed there were a couple of passing runs somewhere around Jan 17, but otherwise, this pre-cq has been red for over a week:

https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=moblab-generic-vm-pre-cq&buildBranch=master

Not sure what that's all about.

Comment 44 by haddowk@chromium.org, Today (3 hours ago)

Owner: haddowk@chromium.org
I spent the whole day trying to debug this. It is not really a moblab issue: the moblab VM comes up and does what it is supposed to do.

When it runs the provision_AutoUpdate "job" on the sub-DUT (also a VM), it reboots the device and the VM comes back with no networking, so the provision fails and therefore the test fails (WAI).

There have been no recent moblab changes that could cause this break. I am no VM expert, so getting the logs of a VM that has no networking is proving challenging. I can get to the VM UI, but not to anything useful regarding logs. I am also not an AU expert, so I don't know why calling

 /usr/bin/update_engine_client --update --omaha_url=http://192.168.231.1:8080/update/moblab-generic-vm-pre-cq/R73-11629.0.0-b3386716

would stop a VM from booting. Again, I cannot get onto the broken VM to see what is going on.
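One way to pin down when the sub-DUT loses networking is to poll a TCP port on it after the reboot. The following is only a sketch: `wait_for_port` is a hypothetical helper, and which host and port to probe (for example the DUT's SSH port) is an assumption that depends on the VM setup.

```python
import socket
import time


def wait_for_port(host, port, timeout_s, interval_s=0.5):
    """Poll until a TCP connection to (host, port) succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means the host's network stack is up.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Logging a timestamp on each failed attempt would show whether the VM never comes back or comes back and then drops off.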

