Issue 672749

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



dummy_PassServer_SERVER_JOB is broken and blocks CrOS CQ

Project Member Reported by phoglund@chromium.org, Dec 9 2016

Issue description

At least two of my recent patches have been held up for days by this failure:

Output below this line is for buildbot consumption:
  @@@STEP_LINK@[Test-Logs]: dummy_PassServer_SERVER_JOB: FAIL@http://localhost/tko/retrieve_logs.cgi?job=/results/3-moblab/@@@
  @@@STEP_LINK@[Flake-Dashboard]: dummy_PassServer_SERVER_JOB@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=dummy_PassServer_SERVER_JOB@@@
  Will return from run_suite with status: ERROR
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/moblab_RunSuite/moblab_RunSuite.py", line 62, in run_once
      raise e
  AutoservRunError: command execution error
  * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_kde4PIssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos2-row1-rack8-host1 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::_call_run_once|run_once|run_as_moblab] -> ssh_run(su -
      moblab -c '/usr/local/autotest/site_utils/run_suite.py --pool=''
      --board=samus --build=samus-release/R53-8480.0.0
      --suite_name=dummy_server')\";fi; su - moblab -c
      '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=samus
      --build=samus-release/R53-8480.0.0 --suite_name=dummy_server'"
  Exit status: 1
  Duration: 498.820991039

Log 1:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/89570370-chromeos-test/chromeos2-row1-rack8-host1/

Log 2:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/89597035-chromeos-test/chromeos2-row1-rack8-host1/
 
I copied the CC list from https://bugs.chromium.org/p/chromium/issues/detail?id=672576 since you appear to be involved with CrOS sheriffing.

I've been unable to submit patches for days, and I get this same error for two quite unrelated patches. Should this be a P0? I suspect I'm not the only one completely stuck in the CQ.
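
For anyone trying to reproduce this by hand: judging from the command in the log above, the test sshes into the moblab as root and runs run_suite.py as the moblab user. A rough sketch of that invocation in Python follows (the helper is hypothetical; hostname, board, build, and suite name are copied verbatim from the log, and the ssh options and quoting are simplified relative to what autoserv actually emits):

    import subprocess

    # Hypothetical reproduction of the run_suite invocation visible in the log
    # above.  Hostname, board, build, and suite name come from the log; the
    # ssh options and quoting are simplified.
    MOBLAB_HOST = 'chromeos2-row1-rack8-host1'
    RUN_SUITE = ('su - moblab -c "/usr/local/autotest/site_utils/run_suite.py '
                 '--pool= --board=samus --build=samus-release/R53-8480.0.0 '
                 '--suite_name=dummy_server"')

    def run_dummy_server_suite():
        """Kick off the dummy_server suite on the moblab over ssh, as root."""
        cmd = ['ssh', '-o', 'StrictHostKeyChecking=no',
               '-o', 'ConnectTimeout=30', '-l', 'root', MOBLAB_HOST, RUN_SUITE]
        status = subprocess.call(cmd)
        if status != 0:
            # autoserv wraps a non-zero exit in AutoservRunError; here we just report it.
            print('run_suite exited with status %d' % status)
        return status

    if __name__ == '__main__':
        run_dummy_server_suite()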

Comment 2 by mka@chromium.org, Dec 9 2016

Owner: jrbarnette@chromium.org
I'm not on sheriff duty this week.

The problem seems to be that no samus was available:

No hosts found for board:samus in pool

Reassigning to deputy
Components: Infra>Client>ChromeOS
Status: Assigned (was: Untriaged)
I've been attributing the various guado_moblab-paladin
failures in the CQ to bug 662625.  I'll re-check whether
I still believe that diagnosis.  I'll also take another
look at whether there's any other mitigation we can apply
while we wait for the permanent fix (which is replacing
samus in our moblab instances).

I'll note that as a broad matter, failures like this one
(whatever their cause) are far from the biggest cause of CQ
failure in the past week or so.  We've had _lots_ of other reasons
for failure; close to half of the failures had one or more CLs
with bugs that also blocked the run.

The guado_moblab-paladin builder has seen this failure 6 times
in the past week, and passed 14 times.  Every time the paladin
scheduled this test (pass or fail), it selected the same moblab
instance, chromeos2-row1-rack8-host1, so all 6 failures were
on that single instance.

I've moved that instance to be a spare.  It's possible that the
problem will be less prominent with a different moblab instance.
It's also possible that the change will have no impact.

Cc: sbasi@chromium.org
This symptom occurred 6 times this week.  I've looked at some of the logs,
but I don't know how to determine whether they're bug 662625 or some other
(unknown) problem.  I've been unable even to match the error reported at
the top level to the underlying moblab logs:  I'm utterly clueless as to
the possible cause.

For reference, here are the 6 failed CQ runs from this particular symptom:
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4446
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4445
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4444
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4441
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4432
    https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4430

Here are the job pages for the moblab_DummyServerSuite runs in autotest:
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89597035
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89570370
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89552927
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89521496
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89295961
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89212673

I've attached the moblab_RunSuite.tgz from two of those test runs.
89570370-chromeos-test/chromeos2-row1-rack8-host1/moblab_RunSuite.tgz (1.6 MB)
89597035-chromeos-test/chromeos2-row1-rack8-host1/moblab_RunSuite.tgz (1.4 MB)

Comment 6 by sbasi@chromium.org, Dec 12 2016

At a quick glance, this looks more like the CQ catching a bad CL than like bug 662625.
The last four failures (4441, 4444, 4445, 4446) were because of one of my CLs in the CQ: https://chromium-review.googlesource.com/#/c/414446/

This CL was CQ:+1'ed just once, but it kept getting retried because other sanity-testing builders were failing (in 4445 it was not blamed because another CL was manually given Verified:-1).
Also, it was not clear to me that my CL was to blame, because the "summary" on crosreview.com first showed daisy_skate as the failing build and I didn't check what else had failed.

After it was CQ:-1'ed, I took a look at the failed build and fixed the problem. (See patch set 4.) I'm now waiting for local testing before sending it through the CQ again.

To see the actual failure for, say,
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4446

download the moblab logs (moblab_RunSuite.tgz at http://shortn/_wJLsWQRuKa).
Inside that archive, look at the ssp logs for 3-moblab:
results/moblab_logs/results/3-moblab/192.168.231.101/ssp_logs/debug/autoserv.DEBUG
(these steps are scripted in a sketch further below)


12/08 14:50:58.062 ERROR|         traceback:0013| Traceback (most recent call last):
12/08 14:50:58.064 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 547, in run_autoserv
12/08 14:50:58.064 ERROR|         traceback:0013|     machines)
12/08 14:50:58.065 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 243, in _run_with_ssp
12/08 14:50:58.065 ERROR|         traceback:0013|     metrics.Counter(
12/08 14:50:58.065 ERROR|         traceback:0013| NameError: global name 'metrics' is not defined
12/08 14:50:58.065 ERROR|          autoserv:0773| 1
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 769, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 582, in run_autoserv
    sys.exit(exit_code)
SystemExit: 1


This is very confusing because none of the other logs (not even the failing moblab suite / test logs) show any error message at all.
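
If the root cause was simply that `metrics` is referenced in autoserv's SSP path without ever being imported, the failure mode is easy to reproduce in isolation: the module imports cleanly and the NameError only fires when that code path actually runs. A minimal, purely illustrative sketch (not the real autoserv code; the counter name is a placeholder):

    # Illustrative only -- not the real autoserv code.  A module with a missing
    # import parses and imports cleanly; the NameError only fires when the
    # unimported name is first evaluated, i.e. when this code path actually
    # runs on the builder.

    def _run_with_ssp(machines):
        """Toy stand-in for autoserv's SSP runner."""
        # 'metrics' was never imported in this module, so the next line raises
        # NameError: global name 'metrics' is not defined.  (The counter name
        # is a placeholder; the fix would presumably be adding the proper
        # metrics import at module scope.)
        metrics.Counter('some/ssp/counter').increment()

    if __name__ == '__main__':
        try:
            _run_with_ssp(['192.168.231.101'])
        except NameError as e:
            print('reproduced: %s' % e)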

The first failure: https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4430

One of the DUTs failed provision because it didn't return from reboot:

view results/moblab_logs/results/2-moblab/192.168.231.100/status.log

 Traceback (most recent call last):
   File "/usr/local/autotest/server/server_job.py", line 917, in run_op
     op_func()
   File "/usr/local/autotest/server/hosts/remote.py", line 150, in reboot
     **dargs)
   File "/usr/local/autotest/server/hosts/remote.py", line 219, in wait_for_restart
     self.log_op(self.OP_REBOOT, op_func)
   File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 548, in log_op
     op_func()
   File "/usr/local/autotest/server/hosts/remote.py", line 218, in op_func
     super(RemoteHost, self).wait_for_restart(timeout=timeout, **dargs)
   File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 309, in wait_for_restart
     raise error.AutoservRebootError("Host did not return from reboot")

This is a legitimate provision failure because the DUT never rebooted with a new boot_id after the install finished successfully:

12/05 22:54:45.927 DEBUG|      abstract_ssh:0643| Host 192.168.231.100 pre-shutdown boot_id is ec2aee68-d745-4f6a-8f8d-340500da7909
12/05 22:54:45.928 DEBUG|          ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/05 22:54:45.956 DEBUG|        base_utils:0280| [stdout] ec2aee68-d745-4f6a-8f8d-340500da7909
12/05 22:54:46.963 DEBUG|          ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/05 22:54:46.991 DEBUG|        base_utils:0280| [stdout] ec2aee68-d745-4f6a-8f8d-340500da7909
12/05 22:54:47.997 DEBUG|          ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/05 22:56:46.557 WARNI|        base_utils:0912| run process timeout (118) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_kYlNSqssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.100 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_for_restart|wait_down|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
12/05 22:56:47.570 DEBUG|      abstract_ssh:0664| Host 192.168.231.100 is now unreachable over ssh, is down
12/05 22:56:47.570 DEBUG|          ssh_host:0177| Running (ssh) 'true'
12/05 23:04:47.719 WARNI|        base_utils:0912| run process timeout (480) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_kYlNSqssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.100 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
12/05 23:04:49.743 DEBUG|      abstract_ssh:0589| Host 192.168.231.100 is still down after waiting 482 seconds
12/05 23:04:49.744 INFO |        server_job:0153|                       ABORT   ----    reboot.verify   timestamp=1481007889    localtime=Dec 05 23:04:49       Host did not return from reboot
12/05 23:04:49.747 INFO |        server_job:0153|               END FAIL        ----    reboot  timestamp=1481007889    localtime=Dec 05 23:04:49       Host did not return from reboot
Same for https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4432:

also a provision failure because the DUT didn't return from reboot.

I'm tempted to blame these two failures on faulty DUTs, but I'm not certain that's right.

view results/moblab_logs/results/3-moblab/192.168.231.102/debug/autoserv.DEBUG

12/06 06:44:14.287 DEBUG|      abstract_ssh:0643| Host 192.168.231.102 pre-shutdown boot_id is d31579b8-515c-4a3d-b5d3-e5ea3566089f
12/06 06:44:14.287 DEBUG|          ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/06 06:44:14.315 DEBUG|        base_utils:0280| [stdout] d31579b8-515c-4a3d-b5d3-e5ea3566089f
12/06 06:44:15.322 DEBUG|          ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/06 06:44:15.353 DEBUG|        base_utils:0280| [stdout] d31579b8-515c-4a3d-b5d3-e5ea3566089f
12/06 06:44:16.355 DEBUG|          ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/06 06:46:14.905 WARNI|        base_utils:0912| run process timeout (118) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_PZzv9kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_for_restart|wait_down|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
12/06 06:46:15.914 DEBUG|      abstract_ssh:0664| Host 192.168.231.102 is now unreachable over ssh, is down
12/06 06:46:15.915 DEBUG|          ssh_host:0177| Running (ssh) 'true'
12/06 06:54:16.113 WARNI|        base_utils:0912| run process timeout (480) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_PZzv9kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
12/06 06:54:18.127 DEBUG|      abstract_ssh:0589| Host 192.168.231.102 is still down after waiting 483 seconds
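
For context, both excerpts show autotest's reboot verification: it records /proc/sys/kernel/random/boot_id before asking the DUT to reboot, then polls until the value changes (a fresh boot) or the timeout expires, at which point wait_for_restart raises AutoservRebootError, exactly as in the status.log traceback above. A simplified sketch of that loop (hypothetical helper names; the real logic lives in hosts/base_classes.py and hosts/remote.py):

    import time

    BOOT_ID_CMD = ("if [ -f '/proc/sys/kernel/random/boot_id' ]; then "
                   "cat '/proc/sys/kernel/random/boot_id'; "
                   "else echo 'no boot_id available'; fi")

    class AutoservRebootError(Exception):
        """Raised when a host does not come back from a reboot."""

    def wait_for_restart(host, old_boot_id, timeout=480, poll_interval=1):
        """Poll a DUT until it reports a boot_id different from old_boot_id.

        'host' is assumed to expose a run(cmd) method that executes cmd over
        ssh and returns its stdout (a hypothetical interface; the real code
        goes through ssh_host.run with the options visible in the logs above).
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                new_boot_id = host.run(BOOT_ID_CMD).strip()
            except Exception:
                # Unreachable mid-reboot is expected; keep polling.
                new_boot_id = old_boot_id
            if new_boot_id != old_boot_id:
                return new_boot_id  # The DUT came back with a fresh boot.
            time.sleep(poll_interval)
        raise AutoservRebootError('Host did not return from reboot')

Comparing boot_ids rather than just pinging ssh is what distinguishes "never went down" from "went down and never came back"; in both excerpts the boot_id stayed the same and ssh then went dark, which is why these read as DUTs that never completed their reboot.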
There is definitely a bug here about making moblab failures easier to debug. It took me over an hour on Friday to figure out why my CL failed in the (final) CQ run. Also, it was not at all obvious to me earlier that my CL was to blame (and I didn't really have a reason to take any action since my CL was being automatically retried).
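
The log-digging steps described above (download moblab_RunSuite.tgz, then read the ssp autoserv.DEBUG) can be scripted. A rough sketch, assuming the autoserv.DEBUG is packed inside the tarball under a path ending in ssp_logs/debug/autoserv.DEBUG, as in the 3-moblab example; other runs will use different moblab/DUT directories, so the sketch matches on the suffix rather than the full path:

    import sys
    import tarfile

    # Rough helper for the manual steps described above: open a downloaded
    # moblab_RunSuite.tgz and print the ERROR/Traceback lines from the ssp
    # autoserv.DEBUG.  The suffix matches the 3-moblab example in this bug.
    DEBUG_LOG_SUFFIX = 'ssp_logs/debug/autoserv.DEBUG'

    def dump_ssp_errors(tgz_path):
        with tarfile.open(tgz_path, 'r:gz') as tgz:
            for member in tgz.getmembers():
                if not member.name.endswith(DEBUG_LOG_SUFFIX):
                    continue
                log = tgz.extractfile(member).read().decode('utf-8', 'replace')
                print('==== %s ====' % member.name)
                for line in log.splitlines():
                    if 'ERROR' in line or 'Traceback' in line:
                        print(line)

    if __name__ == '__main__':
        dump_ssp_errors(sys.argv[1])  # e.g. a downloaded moblab_RunSuite.tgz

Usage would be something like: python dump_ssp_errors.py moblab_RunSuite.tgz; if the tarball layout differs, the same suffix match works with a plain grep over the extracted tree.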
Status: WontFix (was: Assigned)
I'm going to call this a bad CL (mine) and close.
