
Issue 599158

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




AC power verification fails on old FSI versions

Project Member Reported by josa...@chromium.org, Mar 30 2016

Issue description

HWTest failing on Daisy on ToT and Release branches 

Log snippet: 

09:31:26: WARNING: (stderr):
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No current process: you must name one.
The program is not being run.

09:31:56: WARNING: Killing 24449 (sig=15 SIGTERM)
09:31:26: ERROR: pre-kill notification (SIGXCPU); traceback:
  File "chromite/bin/cbuildbot", line 164, in <module>
    commandline.ScriptWrapperMain(FindTarget)

Full log: 
https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy-release-group%20release-R49-7834.B/builds/51/steps/HWTest%20%5Bdaisy%5D%20%5Bsanity%5D/logs/stdio

This is also seen in ToT/M50
M50: https://uberchromegw.corp.google.com/i/chromeos_release/builders/daisy-release-group%20release-R50-7978.B/builds/32/steps/HWTest%20%5Bdaisy%5D%20%5Bbvt-inline%5D/logs/stdio
ToT: https://uberchromegw.corp.google.com/i/chromeos/builders/daisy-release-group/builds/4820/steps/HWTest%20%5Bdaisy%5D%20%5Bsanity%5D/logs/stdio

Seems like a problem with the pool, starting on 3/29.
 
Cc: akes...@chromium.org
Labels: -Pri-0 Pri-1
Owner: fdeng@chromium.org
Summary: swarming client crashing on daisy-release-group (was: HW Test failing for Daisy)
These look like problems with the swarming client on the builder (not related to anything in the lab; this is an error in how the hwtest request is sent to the lab).

fdeng@ can you take a look at this swarming error?
Hmm, actually I'm not sure about that diagnosis. From the timestamps it looks like something killed the swarming client after 6 hours.

03:48:13: INFO: RunCommand: /b/cbuild/shared_internal/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpSJcT10/cbuildbot-tmp19Dnvt/tmpi0tOf_/temp_summary.json --raw-cmd --task-name daisy-release/R49-7834.66.0-sanity --dimension os Linux --print-status-updates --timeout 39600 --io-timeout 39600 --hard-timeout 39600 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build daisy-release/R49-7834.66.0 --board daisy --suite_name sanity --pool bvt --num 1 --file_bugs True --priority DEFAULT --timeout_mins 600 --retry True --max_retries 10 --suite_min_duts 1 --offload_failures_only False -m 58263974
09:31:26: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-5:7, started)>]
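
For context, a quick back-of-the-envelope check of those timestamps (a sketch only; the 6-hour figure is inferred from when the kill happened, not read from any buildbot config):

    from datetime import datetime

    # Rough arithmetic on the log above: the swarming client started at
    # 03:48:13 and was killed at 09:31:26, i.e. roughly 5.7 hours later,
    # well short of the 39600 s (11 h) passed as --timeout/--hard-timeout,
    # which points at a kill from the buildbot side rather than swarming.
    start = datetime.strptime('03:48:13', '%H:%M:%S')
    kill = datetime.strptime('09:31:26', '%H:%M:%S')
    elapsed_hours = (kill - start).total_seconds() / 3600
    print(elapsed_hours)        # ~5.72
    print(39600 / 3600.0)       # 11.0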


Owner: akes...@chromium.org
Summary: daisy bvt pool health affecting canaries (was: swarming client crashing on daisy-release-group)
The suite job was actually created. http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=58263974

However, the debug log links for it are broken, so I'm having a hard time figuring out why it failed. I do suspect a daisy pool health problem based on the DUT status email this morning.
All 6 DUTs in the daisy BVT pool are reported failed.  The
full diagnosis summary is below.  I spot checked one of the
failures.

The DUT failed after Paygen FSI testing, e.g.:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=58230028

That test installs an old image that doesn't include the
'power_supply_info' command, and the AC power check fails
because of the missing command.
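
To make the failure mode concrete, here is a rough sketch of the kind of check involved. This is illustrative only, not the actual verifier in server/hosts/cros_repair.py; the host.run() call and error handling are assumptions.

    # Illustrative sketch (not the real autotest verifier): the AC power
    # check shells out to 'power_supply_info' on the DUT.  On an old FSI
    # image the binary simply isn't there, so the command fails and the
    # DUT flunks verify even though its power supply is fine.
    def verify_ac_power(host):
        result = host.run('power_supply_info', ignore_status=True)
        if result.exit_status != 0:
            # On old FSI builds this is "command not found", not a real
            # power problem, but verify/repair can't tell the difference.
            raise Exception('AC power check failed: %s' % result.stderr)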

There's more going on than just that, but that problem is enough
to cause devices to bleed away.

For now, balance pools; a software fix to get the devices back
in service can follow.

Diagnosis summary:
$ dut-status -b daisy -p bvt -g
chromeos2-row3-rack5-host10
    2016-03-29 00:27:57  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host10/352084-repair/
    2016-03-29 00:22:33  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host10/352037-cleanup/
    2016-03-28 23:51:59  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58228053-chromeos-test/
    2016-03-28 23:51:34  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host10/351817-reset/
chromeos2-row3-rack5-host11
    2016-03-29 00:35:42  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host11/352128-repair/
    2016-03-29 00:25:38  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host11/352058-cleanup/
    2016-03-28 23:15:03  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58231568-chromeos-test/
    2016-03-28 23:14:45  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host11/351643-reset/
chromeos2-row3-rack5-host12
    2016-03-29 02:43:58  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host12/352713-repair/
    2016-03-29 02:43:28  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host12/352711-reset/
    2016-03-29 02:28:15  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58254783-chromeos-test/
    2016-03-29 02:27:55  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host12/352621-reset/
chromeos2-row3-rack5-host13
    2016-03-28 22:59:11  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host13/351596-repair/
    2016-03-28 22:58:46  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host13/351594-reset/
    2016-03-28 22:37:32  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58228720-chromeos-test/
    2016-03-28 22:37:14  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host13/351525-reset/
chromeos2-row3-rack5-host14
    2016-03-29 01:55:52  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host14/352489-repair/
    2016-03-29 01:55:30  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host14/352483-reset/
    2016-03-29 01:44:23  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58250608-chromeos-test/
    2016-03-29 01:43:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host14/352450-reset/
chromeos2-row3-rack5-host15
    2016-03-29 00:50:28  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host15/352216-repair/
    2016-03-29 00:37:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host15/352137-cleanup/
    2016-03-28 23:12:34  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/58230028-chromeos-test/
    2016-03-28 23:12:12  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack5-host15/351640-reset/

Comment 5 by fdeng@chromium.org, Mar 30 2016

FYI, this is the log from the swarming proxy side. It shows the same thing: the pool is unhealthy and the suite couldn't finish in time (as mentioned in #3 and #4).
https://chromeos-proxy.appspot.com/user/task/2ddace7d3d33ee10

It is interesting that the timeout for the suite is 600 mins (10 hours), while the buildbot timeout seems to be 6 hours. Maybe the suite timeout should be shortened to align with buildbot, so that we would get a clear message on the buildbot side.
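
A hedged sketch of what that alignment might look like on the cbuildbot side; the constant names and safety margin below are assumptions, not existing chromite code:

    # Hypothetical sketch: derive --timeout_mins for run_suite.py from the
    # buildbot-side stage limit, so the suite gives up (and reports why)
    # before buildbot kills the swarming client from the outside.
    BUILDBOT_HWTEST_TIMEOUT_SEC = 6 * 60 * 60   # observed ~6 h kill above
    SAFETY_MARGIN_SEC = 10 * 60                 # leave time to report results

    suite_timeout_mins = (BUILDBOT_HWTEST_TIMEOUT_SEC - SAFETY_MARGIN_SEC) // 60
    print(suite_timeout_mins)   # 350, instead of the current 600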
How did you find that log? Is that the log for http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=58263974 ?

When I tried to examine that job on cautotest, it looked to me like it froze while trying to file a bug. See crbug.com/599194. But you seem to be seeing a lot more output?
Project Member Comment 7 by bugdroid1@chromium.org, Mar 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b525670496bb8defe892688cfed6b56c6869afff

commit b525670496bb8defe892688cfed6b56c6869afff
Author: J. Richard Barnette <jrbarnette@chromium.org>
Date: Wed Mar 30 19:05:32 2016

[autotest] Ignore 'power_supply_info' failures during verify.

Checking AC power status in verify and repair depends on the
'power_supply_info' command.  In some cases, DUTs can wind up in
verify and/or repair with old FSI builds that don't have that command
present.  That makes the DUTs fail verify and repair, with no way to
fix them.

This changes the AC power verifier to ignore failures from the
'power_supply_info' command.

BUG= chromium:599158 
TEST=None

Change-Id: Ibf395a73411a470d0446190d305b9a50933b8698
Reviewed-on: https://chromium-review.googlesource.com/336440
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Commit-Queue: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/b525670496bb8defe892688cfed6b56c6869afff/server/hosts/cros_repair.py
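
For readers without access to the CL, the change is roughly of this shape. This is a paraphrase of the commit description, not the literal diff to server/hosts/cros_repair.py, and the function name and output parsing are illustrative:

    # Sketch of the behavior the commit describes: treat a failure to run
    # 'power_supply_info' as "no information" rather than as a broken DUT,
    # so old FSI images no longer fail verify/repair on this check.
    def verify_ac_power(host):
        try:
            result = host.run('power_supply_info')
        except Exception:
            # Old FSI builds don't ship the command; ignore the failure
            # instead of condemning the DUT.
            return
        if 'online: yes' not in result.stdout.lower():
            raise Exception('DUT does not appear to be on AC power')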

Is the above CL resolving this issue?
Will it be possible to re-run the bvt/au tests after this change?
Labels: -Hardware-Lab Infra-ChromeOS
Owner: jrbarnette@chromium.org
Summary: AC power verification fails on old FSI versions (was: daisy bvt pool health affecting canaries)
The CL above will stop devices from failing repair merely because
they're running an old FSI image that doesn't have the
'power_supply_info' command.

However, the change isn't complete on two fronts:
  * I believe the reason we find DUTs in the lab with these old
    images is that those images fail update too often.  If so,
    we'll need to blacklist the old FSI versions that cause
    trouble.
  * If we don't blacklist the images outright, it may be in our
    interest to force DUTs with old FSI versions to update before
    we declare them working.  I'm evaluating this question now
    (a rough sketch of that option follows below).
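
Purely as illustration of the second option, a forced-update check might look something like the sketch below; none of these helper names or thresholds exist in autotest, they are made up here:

    # Hypothetical sketch: if the DUT reports an FSI build older than some
    # minimum milestone, reinstall a known-good build before declaring the
    # DUT working.  The threshold and install call are assumptions.
    MIN_SUPPORTED_MILESTONE = 40   # made-up cutoff for illustration

    def maybe_force_update(host, stable_build_url):
        lsb = host.run('cat /etc/lsb-release').stdout
        for line in lsb.splitlines():
            if line.startswith('CHROMEOS_RELEASE_CHROME_MILESTONE='):
                milestone = int(line.split('=', 1)[1])
                if milestone < MIN_SUPPORTED_MILESTONE:
                    host.machine_install(update_url=stable_build_url)
                return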

So, for now, daisy, pit, and other older boards should be back to
normal.  However, I'm holding this bug open until I'm convinced we
have a permanent solution.

I'm closing this in favor of the previously filed bug 597962.

There's a separate issue of whether any of these FSI versions
need to be blacklisted; I'll think about the best way to track
that...

Status: Fixed (was: Assigned)
Status: Verified (was: Fixed)
Closing.
Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS
