AC power verification fails on old FSI versions
Issue description
HWTest failing on Daisy on ToT and Release branches
Log snippet:
09:31:26: WARNING: (stderr):
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No current process: you must name one.
The program is not being run.
09:31:56: WARNING: Killing 24449 (sig=15 SIGTERM)
09:31:26: ERROR: pre-kill notification (SIGXCPU); traceback:
File "chromite/bin/cbuildbot", line 164, in <module>
commandline.ScriptWrapperMain(FindTarget)
Full log:
https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy-release-group%20release-R49-7834.B/builds/51/steps/HWTest%20%5Bdaisy%5D%20%5Bsanity%5D/logs/stdio
This is also seen in ToT/M50
M50: https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy-release-group%20release-R50-7978.B/builds/32/steps/HWTest%20%5Bdaisy%5D%20%5Bbvt-inline%5D/logs/stdio
ToT: https://uberchromegw.corp.google.com/i/chromeos/builders/daisy-release-group/builds/4820/steps/HWTest%20%5Bdaisy%5D%20%5Bsanity%5D/logs/stdio
This seems like a problem with the pool, starting on 3/29.
Mar 30 2016
Hmm, actually I'm not sure about that diagnosis. From the timestamps it looks like something killed the swarming client after roughly 6 hours:
03:48:13: INFO: RunCommand: /b/cbuild/shared_internal/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpSJcT10/cbuildbot-tmp19Dnvt/tmpi0tOf_/temp_summary.json --raw-cmd --task-name daisy-release/R49-7834.66.0-sanity --dimension os Linux --print-status-updates --timeout 39600 --io-timeout 39600 --hard-timeout 39600 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build daisy-release/R49-7834.66.0 --board daisy --suite_name sanity --pool bvt --num 1 --file_bugs True --priority DEFAULT --timeout_mins 600 --retry True --max_retries 10 --suite_min_duts 1 --offload_failures_only False -m 58263974
09:31:26: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-5:7, started)>]
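As a rough check on that reading, here is a back-of-the-envelope sketch. The two timestamps are copied from the log above; everything else is illustrative and not taken from any build config.

# Back-of-the-envelope check of the "killed after ~6 hours" reading; the two
# timestamps come from the log above, everything else is illustrative.
from datetime import datetime

launched = datetime.strptime("03:48:13", "%H:%M:%S")  # swarming.py run issued
killed = datetime.strptime("09:31:26", "%H:%M:%S")    # background task killed

elapsed = killed - launched                 # 5:43:13
swarming_timeout_h = 39600 / 3600           # --timeout/--io-timeout/--hard-timeout = 11.0 h

print("elapsed:", elapsed)
print("swarming timeout: %.1f h" % swarming_timeout_h)
# The task died well inside every swarming timeout, so whatever killed it
# (apparently a ~6 hour buildbot-side deadline) came from outside swarming.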
Mar 30 2016
The suite job was actually created. http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=58263974 However, the debug log links for it are broken so I'm having a hard time figuring out why it failed. I do suspect a daisy pool health problem based on the DUT status email this morning.
Mar 30 2016
All 6 DUTs in the daisy BVT pool are reported failed. The
full diagnosis summary is below. I spot checked one of the
failures.
The DUT failed after Paygen FSI testing, e.g.:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=58230028
That test installs an old image that doesn't include the
'power_supply_info' command, and the AC power check fails
because of the missing command.
There's more going on than just that, but that problem is enough
to cause devices to bleed away.
For now, balance pools; a software fix to get the devices back
in service can follow.
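For readers unfamiliar with the check, here is a minimal sketch of the failure mode described above. It is hypothetical: the real logic lives in autotest's verify/repair code, and the function names, runner, and output parsing here are assumptions, not the actual implementation.

import subprocess

def run_on_dut(cmd):
    # Stand-in runner; in the lab the command is executed on the DUT over
    # ssh by the autotest host object, not locally via subprocess.
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout

def verify_ac_power(run_cmd=run_on_dut):
    """Raise unless the DUT reports an online AC line."""
    # On an old FSI image the 'power_supply_info' binary does not exist, so
    # this call fails before AC status is ever inspected -- and the DUT
    # flunks verify/repair even though its power supply is fine.
    output = run_cmd("power_supply_info")
    if "online: yes" not in output.lower():
        raise RuntimeError("AC power is not plugged in")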
Diagnosis summary:
$ dut-status -b daisy -p bvt -g
chromeos2-row3-rack5-host10
2016-03-29 00:27:57 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host10/352084-repair/
2016-03-29 00:22:33 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host10/352037-cleanup/
2016-03-28 23:51:59 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58228053-chromeos-test/
2016-03-28 23:51:34 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host10/351817-reset/
chromeos2-row3-rack5-host11
2016-03-29 00:35:42 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host11/352128-repair/
2016-03-29 00:25:38 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host11/352058-cleanup/
2016-03-28 23:15:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58231568-chromeos-test/
2016-03-28 23:14:45 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host11/351643-reset/
chromeos2-row3-rack5-host12
2016-03-29 02:43:58 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host12/352713-repair/
2016-03-29 02:43:28 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host12/352711-reset/
2016-03-29 02:28:15 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58254783-chromeos-test/
2016-03-29 02:27:55 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host12/352621-reset/
chromeos2-row3-rack5-host13
2016-03-28 22:59:11 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host13/351596-repair/
2016-03-28 22:58:46 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host13/351594-reset/
2016-03-28 22:37:32 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58228720-chromeos-test/
2016-03-28 22:37:14 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host13/351525-reset/
chromeos2-row3-rack5-host14
2016-03-29 01:55:52 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host14/352489-repair/
2016-03-29 01:55:30 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host14/352483-reset/
2016-03-29 01:44:23 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58250608-chromeos-test/
2016-03-29 01:43:59 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host14/352450-reset/
chromeos2-row3-rack5-host15
2016-03-29 00:50:28 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host15/352216-repair/
2016-03-29 00:37:48 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host15/352137-cleanup/
2016-03-28 23:12:34 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58230028-chromeos-test/
2016-03-28 23:12:12 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host15/351640-reset/
Mar 30 2016
FYI, this is the log from the swarming proxy side. It shows the same thing: the pool is unhealthy and the suite couldn't finish in time (as mentioned in #3 and #4). https://chromeos-proxy.appspot.com/user/task/2ddace7d3d33ee10
It is interesting that the timeout for the suite is 600 minutes (10 hours), while the buildbot timeout seems to be 6 hours. Maybe the suite timeout should be shortened to align with buildbot, so that we would get a clear message on the buildbot side.
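To make the mismatch concrete, here is a small illustrative comparison of the three deadlines quoted in this thread. The buildbot figure is the approximate 6-hour deadline inferred above, not a value read from any config file.

# Illustrative numbers only, taken from the flags quoted earlier in the thread.
BUILDBOT_STAGE_TIMEOUT_MIN = 6 * 60    # ~6 h, inferred from where the stage was killed
SUITE_TIMEOUT_MIN = 600                # run_suite.py --timeout_mins 600 (10 h)
SWARMING_TIMEOUT_MIN = 39600 // 60     # swarming.py --timeout 39600 s (11 h)

# The inner timeouts are the largest, so when the pool is unhealthy the suite
# never times out on its own; the buildbot stage dies first and we see the
# confusing ptrace/SIGTERM noise instead of a clean suite-timeout message.
assert BUILDBOT_STAGE_TIMEOUT_MIN < SUITE_TIMEOUT_MIN < SWARMING_TIMEOUT_MIN
# Aligning them (e.g. --timeout_mins somewhat below the buildbot deadline)
# would let run_suite.py report the timeout itself.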
Mar 30 2016
How did you find that log? Is that the log for http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=58263974 ? When I tried to examine that job on cautotest, it looked to me like it froze while trying to file a bug. See crbug.com/599194. But you seem to be seeing a lot more output?
Mar 30 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b525670496bb8defe892688cfed6b56c6869afff

commit b525670496bb8defe892688cfed6b56c6869afff
Author: J. Richard Barnette <jrbarnette@chromium.org>
Date: Wed Mar 30 19:05:32 2016

[autotest] Ignore 'power_supply_info' failures during verify.

Checking AC power status in verify and repair depends on the
'power_supply_info' command. In some cases, DUTs can wind up in verify
and/or repair with old FSI builds that don't have that command present.
That makes the DUTs fail verify and repair, with no way to fix them.

This changes the AC power verifier to ignore failures from the
'power_supply_info' command.

BUG=chromium:599158
TEST=None

Change-Id: Ibf395a73411a470d0446190d305b9a50933b8698
Reviewed-on: https://chromium-review.googlesource.com/336440
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Commit-Queue: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/b525670496bb8defe892688cfed6b56c6869afff/server/hosts/cros_repair.py
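Going by the commit message alone, the behavior change amounts to treating a failed 'power_supply_info' invocation as "unknown" rather than "broken". A minimal sketch of that idea follows; it is not the actual diff to server/hosts/cros_repair.py, and the names and output parsing are assumed as in the earlier sketch.

def verify_ac_power(run_cmd):
    """Raise only when power_supply_info runs and reports AC offline."""
    try:
        output = run_cmd("power_supply_info")
    except Exception:
        # Old FSI builds do not ship the command at all; treat that as
        # "cannot tell" instead of failing verify, so repair can proceed.
        return
    if "online: yes" not in output.lower():
        raise RuntimeError("AC power is not plugged in")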
Mar 30 2016
Does the above CL resolve this issue? Will it be possible to re-run the bvt/au tests after this change?
Apr 4 2016
The CL above will stop devices from failing repair merely because
they're running an old FSI image that doesn't have the
'power_supply_info' command.
However, the change isn't complete on two fronts:
* I believe the reason we find DUTs in the lab with these old
images is that those images fail update too often. If so,
we'll need to blacklist the old FSI versions that cause
trouble.
* If we don't blacklist the images outright, it may be in our
interest to force DUTs with old FSI versions to update before
we declare them working. I'm evaluating this question now.
So, for now, daisy, pit, and other older boards should be back to
normal. However, I'm holding this bug open until I'm convinced we
have a permanent solution.
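For the record, one possible shape for the two follow-ups described above. This is purely hypothetical: neither the blacklist nor these helper names exist in the tree, and the placeholder version string is not a real entry.

BLACKLISTED_FSI_VERSIONS = frozenset({
    "R00-0000.0.0",   # placeholder only; real entries would be the troublesome FSI builds
})

def handle_old_fsi_dut(dut_version, update_to_stable):
    """Reject a blacklisted FSI build, or update past it before declaring the DUT good."""
    if dut_version in BLACKLISTED_FSI_VERSIONS:
        # Option 1: blacklist the image outright and pull the DUT for manual attention.
        raise RuntimeError("FSI build %s is blacklisted" % dut_version)
    # Option 2: force an update to a current stable build before the DUT is
    # declared working, so verify/repair never runs against the old image.
    update_to_stable(dut_version)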
Apr 4 2016
I'm closing this in favor of the previously filed bug 597962. There's a separate issue of whether any of these FSI versions need to be blacklisted; I'll think about the best way to track that...
Apr 11 2016
Closing.
Comment 1 by akes...@chromium.org, Mar 30 2016
Labels: -Pri-0 Pri-1
Owner: fdeng@chromium.org
Summary: swarming client crashing on daisy-release-group (was: HW Test failing for Daisy)