
Issue 827190

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




hana DUTs in cq are all dead

Project Member Reported by akes...@chromium.org, Mar 29 2018

Issue description

$ dut-status -b hana -p cq
hostname                       S   last checked         URL
chromeos6-row4-rack2-host1     NO  2018-03-29 08:04:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/656312-repair/
chromeos6-row4-rack2-host6     NO  2018-03-29 08:03:40  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/656299-repair/
chromeos6-row4-rack3-host17    NO  2018-03-29 08:03:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/656305-repair/
chromeos6-row4-rack3-host10    NO  2018-03-29 08:03:41  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/656300-repair/
chromeos6-row3-rack3-host15    NO  2018-03-29 08:03:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/656303-repair/
chromeos6-row3-rack3-host17    NO  2018-03-29 08:03:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/656304-repair/
chromeos6-row3-rack3-host21    NO  2018-03-29 08:04:42  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/656311-repair/


This is causing multiple CQ failures.

e.g. https://luci-milo.appspot.com/buildbot/chromeos/hana-paladin/2687

Marking hana as experimental in tree status.

We should not rebalance this without a bit of investigation; it could be a CL that is killing these devices.


 
Owner: shuqianz@chromium.org
Status: Assigned (was: Available)
Cc: -bingxue@chromium.org bing...@chromium.org.dlaurie
Owner: dlaurie@chromium.org
From the debug log:
/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp.txt"'
03/28 22:27:13.727 INFO | site_crashcollect:0107| Failed to generate stack trace locally for dump /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp (rc=127):
CmdError('minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp" "/usr/local/autotest/server/../../../lib/debug" > "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp.txt"', * Command: 
    minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/c
    hromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setu
    id_helper.20180328.215407.1302.dmp"
    "/usr/local/autotest/server/../../../lib/debug" >
    "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-
    host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.2
    15407.1302.dmp.txt"
Exit status: 127
Duration: 0.00155591964722

stderr:
/bin/bash: minidump_stackwalk: command not found, 'Command returned non-zero exit status')
03/28 22:27:13.728 INFO | site_crashcollect:0111| Generating stack trace using devserver for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp
03/28 22:27:13.728 DEBUG|        dev_server:2810| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.CrashServer'>
03/28 22:27:13.729 DEBUG|        dev_server:0770| Pick one healthy devserver from ['http://100.109.169.115:8082', 'http://100.109.169.114:8082', 'http://100.108.133.195:8082', 'http://100.108.133.199:8082', 'http://100.108.133.196:8082', 'http://100.109.175.145:8082']
03/28 22:27:13.729 DEBUG|        dev_server:0774| Check health for http://100.109.169.115:8082
03/28 22:27:14.046 DEBUG|        dev_server:0779| Pick http://100.109.169.115:8082
03/28 22:27:14.416 INFO |    connectionpool:0207| Starting new HTTP connection (1): 100.108.133.196
03/28 22:27:58.457 DEBUG|    connectionpool:0387| "POST /symbolicate_dump?archive_url=gs://chromeos-image-archive/hana-paladin/R67-10526.0.0-rc1 HTTP/1.1" 500 1810
03/28 22:27:58.461 INFO | site_crashcollect:0128| Failed to generate stack trace on devserver for dump /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp:
DevServerException('\n\n\n    \n    500 Internal Server Error\n    \n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    \n\n    \n        500 Internal Server Error\n        The server encountered an unexpected condition which prevented it from fulfilling the request.\n        Traceback (most recent call last):\n  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond\n    response.body = self.handler()\n  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__\n    self.body = self.oldhandler(*args, **kwargs)\n  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__\n    return self.callable(*self.args, **self.kwargs)\n  File "/usr/local/google/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 1299, in symbolicate_dump\n    error_text, stackwalk.returncode))\nDevServerError: Can\'t generate stack trace: 2018-03-28 22:27:58: minidump.cc:4811: INFO: Minidump opened minidump /tmp/tmpemUgmC\n2018-03-28 22:27:58: minidump.cc:4918: ERROR: Minidump header signature mismatch: (0x0, 0x0) != 0x504d444d\n2018-03-28 22:27:58: minidump_stackwalk.cc:93: ERROR: Minidump /tmp/tmpemUgmC could not be read\n2018-03-28 22:27:58: minidump.cc:4783: INFO: Minidump closing minidump\n (rc=1)\n\n    \n    Powered by CherryPy 3.2.2\n    \n    \n\n',)
03/28 22:27:58.462 WARNI| site_crashcollect:0132| Failed to generate stack trace for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.215407.1302.dmp (see info logs)
03/28 22:27:58.462 INFO | site_crashcollect:0100| Trying to generate stack trace locally for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp
03/28 22:27:58.462 INFO | site_crashcollect:0034| symbol_dir: /usr/local/autotest/server/../../../lib/debug
03/28 22:27:58.462 DEBUG|             utils:0214| Running 'minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp" "/usr/local/autotest/server/../../../lib/debug" > "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp.txt"'
03/28 22:27:58.499 INFO | site_crashcollect:0107| Failed to generate stack trace locally for dump /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp (rc=127):
CmdError('minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp" "/usr/local/autotest/server/../../../lib/debug" > "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp.txt"', * Command: 
    minidump_stackwalk "/usr/local/autotest/results/187332735-chromeos-test/c
    hromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setu
    id_helper.20180328.221034.1345.dmp"
    "/usr/local/autotest/server/../../../lib/debug" >
    "/usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-
    host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.2
    21034.1345.dmp.txt"
Exit status: 127
Duration: 0.00207209587097

stderr:
/bin/bash: minidump_stackwalk: command not found, 'Command returned non-zero exit status')
03/28 22:27:58.501 INFO | site_crashcollect:0111| Generating stack trace using devserver for /usr/local/autotest/results/187332735-chromeos-test/chromeos6-row4-rack2-host6/crashinfo.chromeos6-row4-rack2-host6/powerd_setuid_helper.20180328.221034.1345.dmp
03/28 22:27:58.501 DEBUG|        dev_server:2810| Get the least loaded <class 'autotest_lib.client.common_lib.cros.dev_server.CrashServer'>
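
For context, the local-then-devserver fallback in site_crashcollect that produced the log above looks roughly like the sketch below. This is an illustration only, not the autotest code; the multipart field name 'minidump' and the timeout are assumptions, while the /symbolicate_dump endpoint and archive_url parameter come from the POST line in the log.

# Sketch of the stack-trace fallback seen above: try minidump_stackwalk
# locally, and if that fails (here rc=127 because the binary is missing on
# the drone), POST the dump to a devserver's /symbolicate_dump endpoint.
import subprocess
import requests

def symbolicate(dump_path, symbol_dir, devserver, archive_url):
    cmd = 'minidump_stackwalk "%s" "%s"' % (dump_path, symbol_dir)
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout                      # local symbolication worked
    # Fall back to the devserver, as in the "POST /symbolicate_dump?..." line above.
    with open(dump_path, 'rb') as f:
        resp = requests.post(
            '%s/symbolicate_dump' % devserver,
            params={'archive_url': archive_url},  # e.g. gs://chromeos-image-archive/hana-paladin/R67-10526.0.0-rc1
            files={'minidump': f},                # field name is an assumption
            timeout=120)                          # timeout is an assumption
    resp.raise_for_status()                       # the run above got HTTP 500 here
    return resp.text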

From the powerd_setuid_helper.20180328.221034.1345.log:
===powerd.OUT (tail)===
[0328/221034:FATAL:powerd_setuid_helper.cc(51)] Check failed: execvp(command, argv) != -1. execv() failed: No such file or directory

EOF

It seems to be something related to the power manager; the DUTs then fail to reboot after powering off.
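
For what it's worth, that FATAL line is the generic "target missing from the filesystem" exec failure: exec*() only comes back when it fails, and ENOENT means whatever powerd_setuid_helper tried to run is not on the rootfs, which fits a DUT left in a broken state. A minimal illustration (the path below is made up; the log does not name the missing command):

# exec*() replaces the process on success, so reaching the except branch
# means it failed; ENOENT / "No such file or directory" is exactly the
# error in the powerd_setuid_helper FATAL line above.
import errno
import os

try:
    os.execvp('/usr/bin/definitely_missing_tool', ['definitely_missing_tool'])
except OSError as e:
    assert e.errno == errno.ENOENT
    print('execvp failed: %s' % e.strerror)   # "No such file or directory"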

Here are some suspicious CLs in the first failed hana CQ run:
https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/959661
https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/959665

dlaurie@, can you take a look?
Cc: -bing...@chromium.org.dlaurie -blakeo@chromium.org bingxue@chromium.org jrbarnette@chromium.org
> Marking hana as experimental in tree status.
>
> We should not rebalance this without a bit of investigation, could be a CL that is killing these devices.

If the problem was a bad CL, then marking the builder experimental
allowed the CL into the tree.

Here's how all of the devices failed:
chromeos6-row4-rack2-host1
    2018-03-29 01:34:33  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/653192-repair/
    2018-03-29 01:17:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/653085-provision/
    2018-03-28 23:22:23  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host1/652169-reset/
chromeos6-row4-rack2-host6
    2018-03-29 01:36:24  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/653207-repair/
    2018-03-29 01:17:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/653083-provision/
    2018-03-28 23:19:33  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack2-host6/652147-reset/
chromeos6-row4-rack3-host17
    2018-03-29 01:28:51  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/653161-repair/
    2018-03-29 01:17:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/653084-provision/
    2018-03-28 23:19:35  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host17/652151-reset/
chromeos6-row4-rack3-host10
    2018-03-29 01:36:23  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/653206-repair/
    2018-03-29 01:17:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/653082-provision/
    2018-03-28 23:19:28  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack3-host10/652149-reset/
chromeos6-row3-rack3-host15
    2018-03-29 01:28:39  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/653159-repair/
    2018-03-29 01:17:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/653088-provision/
    2018-03-28 23:19:52  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host15/652155-reset/
chromeos6-row3-rack3-host17
    2018-03-29 01:35:40  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/653201-repair/
    2018-03-29 01:17:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/653086-provision/
    2018-03-28 22:56:59  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host17/652103-reset/
chromeos6-row3-rack3-host21
    2018-03-29 01:35:45  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/653203-repair/
    2018-03-29 01:17:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/653087-provision/
    2018-03-28 23:18:44  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack3-host21/652138-reset/

In short, every DUT failed provisioning at the same time.

So, it probably was a bad CL, and we need to figure out if it's
in the tree now.
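
A quick way to see that correlation is to pull the provision timestamps out of the history pasted above; a throwaway sketch (it parses the pasted text from a local file, assumed here to be saved as dut_history.txt):

# Parse the per-DUT history above and confirm the '--' (provision) entries
# all share one timestamp.  Purely illustrative; it reads the pasted text,
# not the lab database.
import re
from collections import defaultdict

history = open('dut_history.txt').read()   # the block pasted above, saved locally
provision_times = defaultdict(list)
host = None
for line in history.splitlines():
    if line.startswith('chromeos6-'):
        host = line.strip()
    m = re.match(r'\s+(\S+ \S+)\s+--\s+http.*-provision/', line)
    if m and host:
        provision_times[m.group(1)].append(host)

for ts, hosts in sorted(provision_times.items()):
    print('%s: %d DUTs failed provision' % (ts, len(hosts)))
# Expected output from the data above is a single line:
# 2018-03-29 01:17:33: 7 DUTs failed provision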

I found this in the second hana CQ run:
03/29 01:18:02.434 ERROR|            repair:0354| Failed: servo host software is up-to-date
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 351, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 28, in verify
    host.update_image(wait_for_update=False)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 552, in update_image
    status, current_build_number = self._check_for_reboot(updater)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 464, in _check_for_reboot
    self.schedule_synchronized_reboot(dut_list, afe)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 423, in schedule_synchronized_reboot
    control_file = getter.get_control_file_contents_by_name(test)
  File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 152, in get_control_file_contents_by_name
    path = self.get_control_file_path(test_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 136, in get_control_file_path
    raise error.ControlFileNotFound(test_name + ' is not unique.')
ControlFileNotFound: servohost_Reboot is not unique.
03/29 01:18:02.435 INFO |        server_job:0218| 	FAIL	----	verify.update	timestamp=1522311482	localtime=Mar 29 01:18:02	servohost_Reboot is not unique.
03/29 01:18:02.435 INFO |            repair:0349| Verifying this condition: servo BOARD setting is correct
03/29 01:18:02.443 DEBUG|          ssh_host:0301| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release' from '_verify_host|verify|is_cros_host|run|run|run_very_slowly'
03/29 01:18:02.893 DEBUG|          ssh_host:0301| Running (ssh) 'CONFIG=/var/lib/servod/config_9992 ; [ -f $CONFIG ] && . $CONFIG && echo $BOARD' from '_verify_host|verify|_get_config_val|run|run|run_very_slowly'
03/29 01:18:03.334 DEBUG|             utils:0282| [stdout] hana
03/29 01:18:03.352 INFO |        server_job:0218| 	GOOD	----	verify.brd_config	timestamp=1522311483	localtime=Mar 29 01:18:03	
03/29 01:18:03.355 INFO |            repair:0349| Verifying this condition: servo SERIAL setting is correct
03/29 01:18:03.362 DEBUG|          ssh_host:0301| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release' from '_verify_host|verify|is_cros_host|run|run|run_very_slowly'
03/29 01:18:03.815 DEBUG|          ssh_host:0301| Running (ssh) 'CONFIG=/var/lib/servod/config_9992 ; [ -f $CONFIG ] && . $CONFIG && echo $SERIAL' from '_verify_host|verify|_get_config_val|run|run|run_very_slowly'
03/29 01:18:04.248 DEBUG|             utils:0282| [stdout] C1612060692
03/29 01:18:04.250 INFO |        server_job:0218| 	GOOD	----	verify.ser_config	timestamp=1522311484	localtime=Mar 29 01:18:04	
03/29 01:18:04.257 INFO |            repair:0349| Verifying this condition: servod upstart job is running
03/29 01:18:04.267 DEBUG|          ssh_host:0301| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release' from '_verify_host|verify|is_cros_host|run|run|run_very_slowly'
03/29 01:18:04.704 DEBUG|          ssh_host:0301| Running (ssh) 'status servod PORT=9992' from '_verify_list|_verify_host|verify|run|run|run_very_slowly'
03/29 01:18:05.130 DEBUG|             utils:0282| [stderr] status: Unknown instance: 9992
03/29 01:18:05.132 ERROR|            repair:0354| Failed: servod upstart job is running
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 351, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 190, in verify
    (host.hostname, host.servo_port))
AutoservVerifyError: servod not running on chromeos6-row3-rack3-labstation port 9992
get_control_file_path is supposed to return the path to the single control file matching test_name; it failed here because there are two control files named servohost_Reboot.
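
Roughly what that lookup does (a paraphrase for illustration, not the actual control_file_getter.py code; the example paths are hypothetical):

# The getter maps a test name to exactly one control file path and raises
# ControlFileNotFound when the name matches none or, as in the traceback
# above, more than one checked-in control file.
class ControlFileNotFound(Exception):
    pass

def get_control_file_path(test_name, control_file_paths):
    # Assumes control files live at .../<test_name>/control.
    matches = [p for p in control_file_paths if test_name in p.split('/')]
    if not matches:
        raise ControlFileNotFound(test_name + ' not found.')
    if len(matches) > 1:
        raise ControlFileNotFound(test_name + ' is not unique.')
    return matches[0]

# With two control files claiming the same name, e.g. (hypothetical paths)
#   server/site_tests/servohost_Reboot/control
#   site_utils/servohost_Reboot/control
# the second branch fires, which is the "servohost_Reboot is not unique."
# error in the traceback above.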
This seems to be the only related CL in the second failed CQ run: https://chromium-review.googlesource.com/c/chromiumos/third_party/hdctools/+/985432
> AutoservVerifyError: servod not running on chromeos6-row3-rack3-labstation port 9992

This error isn't related to the provision failures.  Please file a
separate bug for this.

The bad CL we're looking for is something that could cause Hana DUTs
to crash and burn at boot time.  Probably, that means a kernel change,
although other kinds of change are possible.

Also, in case it's not sufficiently obvious, the bad CL we're
looking for is in this hana-paladin run:
    https://luci-milo.appspot.com/buildbot/chromeos/hana-paladin/2685

No other hana-paladin failure is dispositive: before that run, the
failures didn't show the provision failures; after that run, all the
failures would have been because there were no working hana DUTs.
The kernel stack from pmalani is already implicated in other failures as well; I would suspect it here too.

For some reason only hana devices entered the Repair Failed state as a result of this, though. Other devices managed to be automatically repaired.
Cc: pmalani@chromium.org
> For some reason only hana devices enter repair failed as a result of this though. Other devices managed to be automatically repaired.

Ah.  That would be because servo was working for the other failed
DUTs.  I believe that for hana, every DUT's servo had failed in
some way or another.

^ is that a hana-wide problem, or just these particular DUTs?

Ok, in that case I think it is safe to force-balance these pools and/or request manual repair on these devices from englabs (if necessary due to working DUT shortfall).
Owner: shuqianz@chromium.org
> ^ is that a hana-wide problem, or just these particular DUTs?

Most likely specific to each DUT.  In general, when DUTs are
first added to the lab, the deployment procedure (at least when it's
used properly) guarantees that servo for each DUT is working.  So, we
can expect that each of these DUTs had a working Servo at the start.

The hana CQ pool has been re-balanced; I will send the broken DUTs to englabs.
You can unmark hana as experimental once it is working.
Status: Verified (was: Assigned)
