dummy_PassServer_SERVER_JOB is broken and blocks CrOS CQ
Issue description:

At least two of my recent patches have been held up for days by this failure:

Output below this line is for buildbot consumption:
@@@STEP_LINK@[Test-Logs]: dummy_PassServer_SERVER_JOB: FAIL@http://localhost/tko/retrieve_logs.cgi?job=/results/3-moblab/@@@
@@@STEP_LINK@[Flake-Dashboard]: dummy_PassServer_SERVER_JOB@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=dummy_PassServer_SERVER_JOB@@@
Will return from run_suite with status: ERROR

Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/moblab_RunSuite/moblab_RunSuite.py", line 62, in run_once
    raise e
AutoservRunError: command execution error
* Command:
    /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_kde4PIssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row1-rack8-host1 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::_call_run_once|run_once|run_as_moblab] -> ssh_run(su - moblab -c '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=samus --build=samus-release/R53-8480.0.0 --suite_name=dummy_server')\";fi; su - moblab -c '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=samus --build=samus-release/R53-8480.0.0 --suite_name=dummy_server'"
Exit status: 1
Duration: 498.820991039

Log 1: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/89570370-chromeos-test/chromeos2-row1-rack8-host1/
Log 2: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/89597035-chromeos-test/chromeos2-row1-rack8-host1/
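For context, the failing command above is just moblab_RunSuite telling the moblab to run run_suite.py as the moblab user. Below is a minimal sketch (not part of the test) of how that invocation could be reproduced by hand, assuming root ssh access to the moblab instance; host, board, build and suite name are copied from the logged command, and the --pool='' in the log collapses to an empty pool value because it sits inside the outer single-quoted su command.

#!/usr/bin/env python
# Sketch only: re-run the run_suite.py invocation from the log above by hand.
# Everything here is taken from the failing command; no new API is assumed.
import subprocess

MOBLAB_HOST = 'chromeos2-row1-rack8-host1'

# --pool='' in the logged command is effectively an empty pool value.
RUN_SUITE = ('/usr/local/autotest/site_utils/run_suite.py '
             '--pool= --board=samus '
             '--build=samus-release/R53-8480.0.0 '
             '--suite_name=dummy_server')

# run_suite.py has to run as the moblab user, hence `su - moblab -c ...`.
cmd = ['ssh', '-o', 'StrictHostKeyChecking=no', '-o', 'ConnectTimeout=30',
       '-l', 'root', MOBLAB_HOST, "su - moblab -c '%s'" % RUN_SUITE]

ret = subprocess.call(cmd)
print('run_suite exit status: %d' % ret)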
Comment 1 by phoglund@chromium.org, Dec 9 2016
I'm not on sheriff duty this week. The problem seems to be that no samus was available ("No hosts found for board:samus in pool"). Reassigning to the deputy.
Dec 9 2016
I've been attributing the various guado_moblab-paladin failures in the CQ to bug 662625. I'll go re-check if I still believe that diagnosis. I'll also take another look at whether there's any other mitigation we can make while we wait for the permanent fix (which is replacement of samus in our moblab instances). I'll note that as a broad matter, failures like this one (whatever their cause) are far from the biggest cause of CQ failure in the past week or so. We've had _lots_ of other reasons for failure; close to half of the failures had one or more CLs with bugs that also blocked the run.
Dec 9 2016
The guado_moblab-paladin builder has seen this failure 6 times in the past week, and passed 14 times. Every time the paladin scheduled this test (pass or fail), it selected the same moblab instance, chromeos2-row1-rack8-host1, so all 6 failures were on that single instance. I've moved that instance to be a spare. It's possible that the problem will be less prominent with a different moblab instance. It's also possible that the change will have no impact.
Dec 10 2016
This symptom occurred 6 times this week. I've looked at some of the logs,
but I don't know how to determine whether they're due to bug 662625 or some other
(unknown) problem. I've been unable even to match the error reported at
the top level to the underlying moblab logs: I'm utterly clueless as to
the possible cause.
For reference, here are the 6 failed CQ runs from this particular symptom:
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4446
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4445
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4444
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4441
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4432
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4430
Here are the job pages for the moblab_DummyServerSuite runs in autotest:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89597035
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89570370
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89552927
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89521496
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89295961
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=89212673
I've attached the moblab_RunSuite.tgz from two of those test runs.
Dec 12 2016
At a quick glance, this looks more like the CQ catching a bad CL than like bug 662625.
Dec 12 2016
The last four failures (4441, 4444, 4445, 4446) were caused by one of my CLs in the CQ: https://chromium-review.googlesource.com/#/c/414446/

It was CQ:+1'ed just once, but kept getting retried because other sanity-testing builders were also failing (in 4445 it was not blamed because another CL had been manually marked Verified: -1). It also wasn't obvious that my CL was to blame, because the summary on crosreview.com first showed daisy_skate as the failing build and I didn't check what else had failed. After it was CQ:-1'ed, I looked at the failed build and fixed the problem (see patch set 4). I'm now waiting for local testing before sending it through the CQ again.

To see the actual failure for, say, https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4446, download the moblab logs (moblab_RunSuite.tgz at http://shortn/_wJLsWQRuKa) and look at the ssp logs for 3-moblab:

results/moblab_logs/results/3-moblab/192.168.231.101/ssp_logs/debug/autoserv.DEBUG

12/08 14:50:58.062 ERROR| traceback:0013| Traceback (most recent call last):
12/08 14:50:58.064 ERROR| traceback:0013|   File "/usr/local/autotest/server/autoserv", line 547, in run_autoserv
12/08 14:50:58.064 ERROR| traceback:0013|     machines)
12/08 14:50:58.065 ERROR| traceback:0013|   File "/usr/local/autotest/server/autoserv", line 243, in _run_with_ssp
12/08 14:50:58.065 ERROR| traceback:0013|     metrics.Counter(
12/08 14:50:58.065 ERROR| traceback:0013| NameError: global name 'metrics' is not defined
12/08 14:50:58.065 ERROR| autoserv:0773| 1
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 769, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 582, in run_autoserv
    sys.exit(exit_code)
SystemExit: 1

This is very confusing, because none of the other logs (not even the failing moblab suite/test logs) show any error message at all.
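For anyone hitting this later: the NameError above is the whole story. A name used only inside autoserv's server-side-packaging branch was never imported, so nothing fails at import time and the process only dies when that branch actually runs. A minimal, hypothetical illustration (not the real autoserv code):

# Illustration only: a missing import in a rarely-taken branch is not caught
# at import time; it surfaces as a NameError only when the branch executes.
def _run_with_ssp(machines):
    # 'metrics' was never imported in this module. The line parses fine, but
    # executing it raises (Python 2 message, as in the log above):
    #   NameError: global name 'metrics' is not defined
    metrics.Counter('autoserv/ssp_runs').increment()

try:
    _run_with_ssp(['192.168.231.101'])
except NameError as e:
    print('autoserv would die here: %s' % e)

# Fix: import whichever module actually provides metrics.Counter at the top
# of autoserv (exact import path depends on the autotest/chromite checkout).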
Dec 12 2016
The first failure: https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4430

One of the DUTs failed provision because it didn't return from reboot. See results/moblab_logs/results/2-moblab/192.168.231.100/status.log:

Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 917, in run_op
    op_func()
  File "/usr/local/autotest/server/hosts/remote.py", line 150, in reboot
    **dargs)
  File "/usr/local/autotest/server/hosts/remote.py", line 219, in wait_for_restart
    self.log_op(self.OP_REBOOT, op_func)
  File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 548, in log_op
    op_func()
  File "/usr/local/autotest/server/hosts/remote.py", line 218, in op_func
    super(RemoteHost, self).wait_for_restart(timeout=timeout, **dargs)
  File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 309, in wait_for_restart
    raise error.AutoservRebootError("Host did not return from reboot")

This is a legitimate provision failure: the DUT never came back with a new boot_id after the install finished successfully.

12/05 22:54:45.927 DEBUG| abstract_ssh:0643| Host 192.168.231.100 pre-shutdown boot_id is ec2aee68-d745-4f6a-8f8d-340500da7909
12/05 22:54:45.928 DEBUG| ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/05 22:54:45.956 DEBUG| base_utils:0280| [stdout] ec2aee68-d745-4f6a-8f8d-340500da7909
12/05 22:54:46.963 DEBUG| ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/05 22:54:46.991 DEBUG| base_utils:0280| [stdout] ec2aee68-d745-4f6a-8f8d-340500da7909
12/05 22:54:47.997 DEBUG| ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/05 22:56:46.557 WARNI| base_utils:0912| run process timeout (118) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_kYlNSqssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.100 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_for_restart|wait_down|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
12/05 22:56:47.570 DEBUG| abstract_ssh:0664| Host 192.168.231.100 is now unreachable over ssh, is down
12/05 22:56:47.570 DEBUG| ssh_host:0177| Running (ssh) 'true'
12/05 23:04:47.719 WARNI| base_utils:0912| run process timeout (480) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_kYlNSqssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.100 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
12/05 23:04:49.743 DEBUG| abstract_ssh:0589| Host 192.168.231.100 is still down after waiting 482 seconds
12/05 23:04:49.744 INFO | server_job:0153| ABORT ---- reboot.verify timestamp=1481007889 localtime=Dec 05 23:04:49 Host did not return from reboot
12/05 23:04:49.747 INFO | server_job:0153| END FAIL ---- reboot timestamp=1481007889 localtime=Dec 05 23:04:49 Host did not return from reboot
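For reference, the check failing above is autotest's boot_id-based reboot verification: record /proc/sys/kernel/random/boot_id before shutdown, then poll over ssh until the DUT reports a different boot_id. Here is a simplified sketch of that logic, not the real wait_for_restart implementation; the 480-second timeout is just the value visible in the log.

import subprocess
import time

BOOT_ID_CMD = 'cat /proc/sys/kernel/random/boot_id'

def get_boot_id(host):
    """Return the DUT's current boot_id over ssh, or None if unreachable."""
    try:
        return subprocess.check_output(
            ['ssh', '-o', 'ConnectTimeout=30', 'root@%s' % host,
             BOOT_ID_CMD]).strip()
    except subprocess.CalledProcessError:
        return None

def wait_for_restart(host, old_boot_id, timeout=480):
    """Poll until the DUT reports a new boot_id; raise if it never does."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        new_id = get_boot_id(host)
        if new_id and new_id != old_boot_id:
            return new_id
        time.sleep(5)
    # autotest raises error.AutoservRebootError here.
    raise RuntimeError('Host did not return from reboot')

In the two failures above, the DUT went down for the install/reboot but never came back up over ssh, so the poll ran out and the provision was marked as failed.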
Dec 12 2016
Same for https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/4432: also a provision failure because the DUT didn't return from reboot. I'm inclined to blame these two failures on faulty DUTs. See results/moblab_logs/results/3-moblab/192.168.231.102/debug/autoserv.DEBUG:

12/06 06:44:14.287 DEBUG| abstract_ssh:0643| Host 192.168.231.102 pre-shutdown boot_id is d31579b8-515c-4a3d-b5d3-e5ea3566089f
12/06 06:44:14.287 DEBUG| ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/06 06:44:14.315 DEBUG| base_utils:0280| [stdout] d31579b8-515c-4a3d-b5d3-e5ea3566089f
12/06 06:44:15.322 DEBUG| ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/06 06:44:15.353 DEBUG| base_utils:0280| [stdout] d31579b8-515c-4a3d-b5d3-e5ea3566089f
12/06 06:44:16.355 DEBUG| ssh_host:0177| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
12/06 06:46:14.905 WARNI| base_utils:0912| run process timeout (118) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_PZzv9kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_for_restart|wait_down|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
12/06 06:46:15.914 DEBUG| abstract_ssh:0664| Host 192.168.231.102 is now unreachable over ssh, is down
12/06 06:46:15.915 DEBUG| ssh_host:0177| Running (ssh) 'true'
12/06 06:54:16.113 WARNI| base_utils:0912| run process timeout (480) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_PZzv9kssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
12/06 06:54:18.127 DEBUG| abstract_ssh:0589| Host 192.168.231.102 is still down after waiting 483 seconds
Dec 12 2016
There is definitely a bug to file here about making moblab failures easier to debug. It took me over an hour on Friday to figure out why my CL failed in the (final) CQ run. It was also not at all obvious earlier that my CL was to blame, and I didn't really have a reason to take any action since my CL was being automatically retried.
Dec 19 2016
I'm going to call this a bad CL (mine) and close.