hana DUTs in cq are all dead
Issue description

$ dut-status -b hana -p cq
hostname                     S   last checked         URL
chromeos6-row4-rack2-host1   NO  2018-03-29 08:04:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/656312-repair/
chromeos6-row4-rack2-host6   NO  2018-03-29 08:03:40  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/656299-repair/
chromeos6-row4-rack3-host17  NO  2018-03-29 08:03:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/656305-repair/
chromeos6-row4-rack3-host10  NO  2018-03-29 08:03:41  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/656300-repair/
chromeos6-row3-rack3-host15  NO  2018-03-29 08:03:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/656303-repair/
chromeos6-row3-rack3-host17  NO  2018-03-29 08:03:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/656304-repair/
chromeos6-row3-rack3-host21  NO  2018-03-29 08:04:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/656311-repair/

This is causing multiple CQ failures, e.g. https://luci-milo.appspot.com/buildbot/chromeos/hana-paladin/2687

Marking hana as experimental in tree status. We should not rebalance this without a bit of investigation; it could be a CL that is killing these devices.
Mar 29 2018
From the debug log:
/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp.txt"'
03/28 22:27:13.727 INFO | site_crashcollect:0107| Failed to generate stack trace locally for dump /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp (rc=127):
CmdError('minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp" "/usr/local/autotest/server/../../../lib/debug" > "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp.txt"', * Command:
minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/c
hromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setu
id_helper.20180328.215407.1302.dmp"
"/usr/local/autotest/server/../../../lib/debug" >
"/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-
host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.2
15407.1302.dmp.txt"
Exit status: 127
Duration: 0.00155591964722
stderr:
/bin/bash: minidump_stackwalk: command not found, 'Command returned non-zero exit status')
03/28 22:27:13.728 INFO | site_crashcollect:0111| Generating stack trace using devserver for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp
03/28 22:27:13.728 DEBUG| dev_server:2810| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.CrashServer'>
03/28 22:27:13.729 DEBUG| dev_server:0770| Pick one healthy devserver from ['http://100.109.169.115:8082', 'http://100.109.169.114:8082', 'http://100.108.133.195:8082', 'http://100.108.133.199:8082', 'http://100.108.133.196:8082', 'http://100.109.175.145:8082']
03/28 22:27:13.729 DEBUG| dev_server:0774| Check health for http://100.109.169.115:8082
03/28 22:27:14.046 DEBUG| dev_server:0779| Pick http://100.109.169.115:8082
03/28 22:27:14.416 INFO | connectionpool:0207| Starting new HTTP connection (1): 100.108.133.196
03/28 22:27:58.457 DEBUG| connectionpool:0387| "POST /symbolicate_dump?archive_url=gs://chromeos-image-archive/hana-paladin/R67-10526.0.0-rc1 HTTP/1.1" 500 1810
03/28 22:27:58.461 INFO | site_crashcollect:0128| Failed to generate stack trace on devserver for dump /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp:
DevServerException('\n\n\n \n 500 Internal Server Error\n \n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n \n\n \n 500 Internal Server Error\n The server encountered an unexpected condition which prevented it from fulfilling the request.\n Traceback (most recent call last):\n File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond\n response.body = self.handler()\n File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__\n self.body = self.oldhandler(*args, **kwargs)\n File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__\n return self.callable(*self.args, **self.kwargs)\n File "/usr/local/google/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 1299, in symbolicate_dump\n error_text, stackwalk.returncode))\nDevServerError: Can\'t generate stack trace: 2018-03-28 22:27:58: minidump.cc:4811: INFO: Minidump opened minidump /tmp/tmpemUgmC\n2018-03-28 22:27:58: minidump.cc:4918: ERROR: Minidump header signature mismatch: (0x0, 0x0) != 0x504d444d\n2018-03-28 22:27:58: minidump_stackwalk.cc:93: ERROR: Minidump /tmp/tmpemUgmC could not be read\n2018-03-28 22:27:58: minidump.cc:4783: INFO: Minidump closing minidump\n (rc=1)\n\n \n Powered by CherryPy 3.2.2\n \n \n\n',)
03/28 22:27:58.462 WARNI| site_crashcollect:0132| Failed to generate stack trace for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp (see info logs)
03/28 22:27:58.462 INFO | site_crashcollect:0100| Trying to generate stack trace locally for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp
03/28 22:27:58.462 INFO | site_crashcollect:0034| symbol_dir: /usr/local/autotest/server/../../../lib/debug
03/28 22:27:58.462 DEBUG| utils:0214| Running 'minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp" "/usr/local/autotest/server/../../../lib/debug" > "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp.txt"'
03/28 22:27:58.499 INFO | site_crashcollect:0107| Failed to generate stack trace locally for dump /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp (rc=127):
CmdError('minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp" "/usr/local/autotest/server/../../../lib/debug" > "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp.txt"', * Command:
minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/c
hromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setu
id_helper.20180328.221034.1345.dmp"
"/usr/local/autotest/server/../../../lib/debug" >
"/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-
host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.2
21034.1345.dmp.txt"
Exit status: 127
Duration: 0.00207209587097
stderr:
/bin/bash: minidump_stackwalk: command not found, 'Command returned non-zero exit status')
03/28 22:27:58.501 INFO | site_crashcollect:0111| Generating stack trace using devserver for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp
03/28 22:27:58.501 DEBUG| dev_server:2810| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.CrashServer'>
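(Aside: the devserver error above, "Minidump header signature mismatch: (0x0, 0x0) != 0x504d444d", means the dump does not even start with the 'MDMP' magic bytes, i.e. the .dmp collected from the DUT is empty or zero-filled. A quick sanity check along these lines, a hypothetical throwaway helper rather than anything in autotest, assuming plain breakpad minidumps:

import struct
import sys

MDMP_MAGIC = 0x504d444d  # 'MDMP' read as a little-endian uint32 (breakpad minidump signature)

def looks_like_minidump(path):
    # A valid minidump starts with the 4-byte 'MDMP' signature; an empty or
    # zero-filled file (signature 0x0, as in the error above) fails this check.
    with open(path, 'rb') as f:
        header = f.read(4)
    if len(header) < 4:
        return False
    return struct.unpack('<I', header)[0] == MDMP_MAGIC

if __name__ == '__main__':
    for dump in sys.argv[1:]:
        print('%s: %s' % (dump, 'looks ok' if looks_like_minidump(dump) else 'bad header'))

Run against the crashinfo .dmp files, a "bad header" result would confirm the dumps themselves are unusable, independent of the missing minidump_stackwalk binary.)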
From the powerd_setuid_helper.20180328.221034.1345.log:
===powerd.OUT (tail)===
[0328/221034:FATAL:powerd_setuid_helper.cc(51)] Check failed: execvp(command, argv) != -1. execv() failed: No such file or directory
EOF
It seems to be something wrong with the power manager; the DUT then fails to reboot after powering off.
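(For context on the FATAL above: powerd_setuid_helper CHECK-fails when execvp() returns -1, and the "No such file or directory" errno means the command it was asked to run is missing on the DUT. A rough Python analogue of that failure mode, not the actual C++ helper:

import errno
import os

def run_helper_command(command, argv):
    # Replace the current process with `command`, the way the setuid helper does.
    try:
        os.execvp(command, argv)  # does not return on success
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Same condition as "execv() failed: No such file or directory"
            # in the powerd log above: the target binary is not on the image.
            raise SystemExit('execvp(%s) failed: %s' % (command, e))
        raise

# Example: run_helper_command('some_missing_binary', ['some_missing_binary'])
# exits with "No such file or directory".

So the image on these DUTs appears to be missing a binary that powerd expects to run.)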
Here are some suspicious CLs in the first failed hana CQ run:
https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/959661
https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/959665
dlaurie@, can you take a look?
Mar 29 2018
> Marking hana as experimental in tree status.
>
> We should not rebalance this without a bit of investigation, could be a CL that is killing these devices.
If the problem was a bad CL, then marking the builder experimental
allowed the CL into the tree.
Here's how all of the devices failed:
chromeos6-row4-rack2-host1
2018-03-29 01:34:33 NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/653192-repair/
2018-03-29 01:17:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/653085-provision/
2018-03-28 23:22:23 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/652169-reset/
chromeos6-row4-rack2-host6
2018-03-29 01:36:24 NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/653207-repair/
2018-03-29 01:17:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/653083-provision/
2018-03-28 23:19:33 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/652147-reset/
chromeos6-row4-rack3-host17
2018-03-29 01:28:51 NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/653161-repair/
2018-03-29 01:17:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/653084-provision/
2018-03-28 23:19:35 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/652151-reset/
chromeos6-row4-rack3-host10
2018-03-29 01:36:23 NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/653206-repair/
2018-03-29 01:17:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/653082-provision/
2018-03-28 23:19:28 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/652149-reset/
chromeos6-row3-rack3-host15
2018-03-29 01:28:39 NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/653159-repair/
2018-03-29 01:17:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/653088-provision/
2018-03-28 23:19:52 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/652155-reset/
chromeos6-row3-rack3-host17
2018-03-29 01:35:40 NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/653201-repair/
2018-03-29 01:17:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/653086-provision/
2018-03-28 22:56:59 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/652103-reset/
chromeos6-row3-rack3-host21
2018-03-29 01:35:45 NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/653203-repair/
2018-03-29 01:17:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/653087-provision/
2018-03-28 23:18:44 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/652138-reset/
In short, every DUT failed provisioning at the same time.
So, it probably was a bad CL, and we need to figure out if it's
in the tree now.
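As a quick cross-check of the "same time" claim, the provision timestamps can be pulled straight out of the dut-status history above. This is a hypothetical throwaway script, assuming the whitespace-separated "<date> <time> <status> <log URL>" entry format shown:

import collections

def provision_failure_times(history_text):
    # Count how many hosts show a '--' provision entry at each timestamp.
    times = collections.Counter()
    for line in history_text.splitlines():
        fields = line.split()
        # History entries look like: <date> <time> <status> <log URL>
        if len(fields) == 4 and fields[2] == '--' and '-provision/' in fields[3]:
            times[' '.join(fields[:2])] += 1
    return times

# Fed the history pasted above, this returns Counter({'2018-03-29 01:17:33': 7}),
# i.e. all seven DUTs hit the failed provision at the same second.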
Mar 29 2018
I found this in the second hana CQ run:
03/29 01:18:02.434 ERROR| repair:0354| Failed: servo host software is up-to-date
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 351, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/servo_repair.py", line 28, in verify
host.update_image(wait_for_update=False)
File "/usr/local/autotest/server/hosts/servo_host.py", line 552, in update_image
status, current_build_number = self._check_for_reboot(updater)
File "/usr/local/autotest/server/hosts/servo_host.py", line 464, in _check_for_reboot
self.schedule_synchronized_reboot(dut_list, afe)
File "/usr/local/autotest/server/hosts/servo_host.py", line 423, in schedule_synchronized_reboot
control_file = getter.get_control_file_contents_by_name(test)
File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 152, in get_control_file_contents_by_name
path = self.get_control_file_path(test_name)
File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 136, in get_control_file_path
raise error.ControlFileNotFound(test_name + ' is not unique.')
ControlFileNotFound: servohost_Reboot is not unique.
03/29 01:18:02.435 INFO | server_job:0218| FAIL ---- verify.update timestamp=1522311482 localtime=Mar 29 01:18:02 servohost_Reboot is not unique.
03/29 01:18:02.435 INFO | repair:0349| Verifying this condition: servo BOARD setting is correct
03/29 01:18:02.443 DEBUG| ssh_host:0301| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release' from '_verify_host|verify|is_cros_host|run|run|run_very_slowly'
03/29 01:18:02.893 DEBUG| ssh_host:0301| Running (ssh) 'CONFIG=/var/lib/servod/config_9992 ; [ -f $CONFIG ] && . $CONFIG && echo $BOARD' from '_verify_host|verify|_get_config_val|run|run|run_very_slowly'
03/29 01:18:03.334 DEBUG| utils:0282| [stdout] hana
03/29 01:18:03.352 INFO | server_job:0218| GOOD ---- verify.brd_config timestamp=1522311483 localtime=Mar 29 01:18:03
03/29 01:18:03.355 INFO | repair:0349| Verifying this condition: servo SERIAL setting is correct
03/29 01:18:03.362 DEBUG| ssh_host:0301| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release' from '_verify_host|verify|is_cros_host|run|run|run_very_slowly'
03/29 01:18:03.815 DEBUG| ssh_host:0301| Running (ssh) 'CONFIG=/var/lib/servod/config_9992 ; [ -f $CONFIG ] && . $CONFIG && echo $SERIAL' from '_verify_host|verify|_get_config_val|run|run|run_very_slowly'
03/29 01:18:04.248 DEBUG| utils:0282| [stdout] C1612060692
03/29 01:18:04.250 INFO | server_job:0218| GOOD ---- verify.ser_config timestamp=1522311484 localtime=Mar 29 01:18:04
03/29 01:18:04.257 INFO | repair:0349| Verifying this condition: servod upstart job is running
03/29 01:18:04.267 DEBUG| ssh_host:0301| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release' from '_verify_host|verify|is_cros_host|run|run|run_very_slowly'
03/29 01:18:04.704 DEBUG| ssh_host:0301| Running (ssh) 'status servod PORT=9992' from '_verify_list|_verify_host|verify|run|run|run_very_slowly'
03/29 01:18:05.130 DEBUG| utils:0282| [stderr] status: Unknown instance: 9992
03/29 01:18:05.132 ERROR| repair:0354| Failed: servod upstart job is running
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 351, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/servo_repair.py", line 190, in verify
(host.hostname, host.servo_port))
AutoservVerifyError: servod not running on chromeos6-row3-rack3-labstation port 9992
Mar 29 2018
get_control_file_path returns the path to a control file for a given test_name; it failed here because there are two control files named servohost_Reboot.
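Roughly, the lookup behaves like the sketch below (a simplification of what the traceback implies, not the actual control_file_getter.py code): it refuses to pick a control file when the test name matches more than one path.

class ControlFileNotFound(Exception):
    pass

def get_control_file_path(control_file_paths, test_name):
    # Return the unique control file path that belongs to test_name.
    matches = [p for p in control_file_paths
               if '/%s/' % test_name in p or p.endswith('/%s' % test_name)]
    if not matches:
        raise ControlFileNotFound('%s not found.' % test_name)
    if len(matches) > 1:
        # This is the branch hit above: two control files answer to
        # 'servohost_Reboot', so scheduling the synchronized servo host
        # reboot fails and the verifier errors out.
        raise ControlFileNotFound(test_name + ' is not unique.')
    return matches[0]

With two candidate paths for servohost_Reboot, the len(matches) > 1 branch fires, which is exactly the "is not unique" error in the log.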
Mar 29 2018
This seems like the only related CL in the second failed CQ run: https://chromium-review.googlesource.com/c/chromiumos/third_party/hdctools/+/985432
Mar 29 2018
> AutoservVerifyError: servod not running on chromeos6-row3-rack3-labstation port 9992

This error isn't related to the provision failures. Please file a separate bug for this. The bad CL we're looking for is something that could cause hana DUTs to crash and burn at boot time. Probably, that means a kernel change, although other kinds of change are possible.
Mar 29 2018
Also, in case it's not sufficiently obvious, the bad CL we're
looking for is in this hana-paladin run:
https://luci-milo.appspot.com/buildbot/chromeos/hana-paladin/2685
No other hana-paladin failure is dispositive: before that run, the failures didn't show the provision failures; after that run, all the failures would have been because there were no working hana DUTs.
Mar 29 2018
The kernel stack from pmalani has already been implicated in other failures as well; I would suspect it here too. For some reason, only hana devices end up in Repair Failed as a result of this, though. Other devices managed to be repaired automatically.
Mar 29 2018
> For some reason, only hana devices end up in Repair Failed as a result of this, though. Other devices managed to be repaired automatically.

Ah. That would be because servo was working for the other failed DUTs. I believe that for hana, every DUT's servo had failed in some way or another.
Mar 29 2018
^ Is that a hana-wide problem, or just these particular DUTs?

OK, in that case I think it is safe to force-balance these pools and/or request manual repair on these devices from englabs (if necessary due to a working-DUT shortfall).
Mar 29 2018
> ^ Is that a hana-wide problem, or just these particular DUTs?

Most likely specific to each DUT. In general, when DUTs are first added to the lab, the deployment procedure (at least when it's used properly) guarantees that servo for each DUT is working. So, we can expect that each of these DUTs had a working servo at the start.
Mar 29 2018
The hana CQ pool has been re-balanced; I will send the broken DUTs to englabs.
Mar 29 2018
You can unmark hana as experimental once it is working.
Comment 1 by akes...@chromium.org, Mar 29 2018: Status: Assigned (was: Available)