
Issue 615159

Starred by 1 user

Issue metadata

Status: Archived
Owner: ----
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




nyan_blaze builds are causing devices to go down

Reported by jrbarnette@chromium.org, May 26 2016

Issue description

Recent testing is causing nyan_blaze DUTs in the CrOS test
lab to go offline.  The devices are either crashing or shutting
down.

Below are test jobs with failures:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64552372
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64577375
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64552368
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64685739
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64685743
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64643395

The first 5 jobs are for R52-8350.?.0 release builds.  The last is
for the R53-8372.0.0 canary build.  There's no common theme in the
specific tests.

The most likely cause is a system crash, possibly in the kernel.  The
bug is (apparently) in ToT, and the offending change has presumably
been cherry-picked to the 8350 branch builder.
 
I should add:  The problem appears specific to nyan_blaze; other
nyan boards seem unaffected.

A cursory check says that the DUTs simply go offline without
warning during testing.

I've filed a ticket to repair the devices and get logs:
    http://b2/28983582
I'm currently checking the possibility that this is a long-standing
problem.  Most of the nyan_blaze servo inventory went offline in the
last few days because of bug 614047.  It's possible that this problem
caused many DUT failures which were papered over by servo repairs.

Here is the summary of the failed DUTs.

chromeos4-row7-rack10-host1
  Last test: security_ChromiumOSLSM
  Pass/Fail: Failed
  Reason: provision_AutoUpdate to R52-8350.6.0 failed
chromeos4-row7-rack10-host3
  Last test: login_OwnershipApi
  Pass/Fail: Failed
  Reason: provision_AutoUpdate to R52-8350.6.0 failed
chromeos4-row7-rack10-host5
  Last test: login_OwnershipRetaken
  Pass/Fail: Failed
  Reason: provision_AutoUpdate to R52-8350.6.0 failed
chromeos4-row7-rack7-host1
  Last test: touch_TapSettings
  Pass/Fail: Passed
chromeos4-row7-rack7-host3
  Last test: video_YouTubeFlash
  Pass/Fail: Passed
chromeos4-row7-rack7-host5
  Last test: video_YouTubeFlash
  Pass/Fail: Passed

Three DUTs failed to provision_AutoUpdate to R52-8350.6.0 and then changed to Repair status.  The other three DUTs look odd: the logs that would show what happened in between are missing.

Re comment #4:  The list of failures provided by the Autotest
web site is misleading.  Most especially:  the list isn't sorted
by timestamp; it's sorted by job number.

Most of the cited tests aren't where the failure actually
occurred.  In any event, it looks like the specific tests
in question are uninteresting:  The DUT simply goes down
without warning.

For an example of what really happened vs. the Autotest AFE
web site, here's the history of chromeos4-row7-rack10-host1:
chromeos4-row7-rack10-host1
    2016-05-25 17:57:23  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887260-repair/
    2016-05-25 17:42:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887203-provision/
    2016-05-25 07:50:03  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64552368-chromeos-test/
    2016-05-25 07:40:55  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/884464-provision/
    2016-05-25 07:39:08  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64565060-chromeos-test/
    2016-05-25 07:30:26  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/884347-provision/

The actual failure is between these two events:
    2016-05-25 17:42:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887203-provision/
    2016-05-25 07:50:03  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64552368-chromeos-test/

At the end of the test at 7:50, the DUT was working.  At the
beginning of provisioning at 17:42, the DUT was down.
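
For illustration, here's a minimal sketch (hypothetical, not the
actual Autotest tooling) of re-sorting that history by timestamp so
the failure window stands out:

    from datetime import datetime

    # Hypothetical sketch (not the real Autotest tooling): re-sort the
    # host history entries above by timestamp; the AFE view orders them
    # by job number instead.  Log URLs are abbreviated for readability.
    history = [
        ("2016-05-25 17:57:23", "NO", ".../887260-repair/"),
        ("2016-05-25 17:42:12", "--", ".../887203-provision/"),
        ("2016-05-25 07:50:03", "--", ".../64552368-chromeos-test/"),
        ("2016-05-25 07:40:55", "OK", ".../884464-provision/"),
        ("2016-05-25 07:39:08", "--", ".../64565060-chromeos-test/"),
        ("2016-05-25 07:30:26", "OK", ".../884347-provision/"),
    ]

    def timestamp(entry):
        return datetime.strptime(entry[0], "%Y-%m-%d %H:%M:%S")

    # Printing the events in the order they actually happened makes the
    # failure window obvious: the DUT was still up when the 07:50 test
    # finished and was already down when the 17:42 provision started.
    for when, verdict, url in sorted(history, key=timestamp):
        print(when, verdict, url)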

I have some minimal evidence that this problem may have been going
on for some time.  In the past week, I found three more BVT test
failures where a nyan_blaze DUT went offline unexpectedly.  In those
cases, servo repaired the DUTs and testing carried on.

These are the three jobs of interest:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64219435
    R53-8358.0.0 video_PowerConsumption.h264

    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64135942
    R53-8356.0.0 video_PowerConsumption.vp8

    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64208025
    R51-8172.42.0 autoupdate_CatchBadSignatures

The six failures cited in the original description were for these tests:
    touch_ScrollDirection
    autoupdate_CatchBadSignatures
    touch_TapSettings
    audio_PowerConsumption.mp3 (x2)
    video_PowerConsumption.h264

Given how far back this problem goes, I'm guessing that bisecting
to find the crash won't help.  We need to wait until the DUTs are
repaired and we have logs to examine.

Logs are available from three of the DUTs; see attached.

There's a report of a DHCP problem in the lab; that problem
affected the other three DUTs, so this bug may not be as
serious as it appears.

Attachments:
    chromeos4-row7-rack7-host1_logs.tar.gz (780 KB)
    chromeos4-row7-rack7-host3_logs.tar.gz (610 KB)
    chromeos4-row7-rack7-host5_logs.tar.gz (576 KB)

I looked through the logs:

chromeos4-row7-rack7-host1 - /var/log/messages shows
    corruption characteristic of a crash.  However, the
    log-gathering program didn't grab the ramoops file, so we have
    no visibility into the cause of the crash.

chromeos4-row7-rack7-host3
chromeos4-row7-rack7-host5 - On both of these hosts it looks like the
    stateful partition was wiped during the crash.  I don't know why
    that happened.  In any event, there's no history of the failure.
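
As a reference for the next round of log collection, here's a rough
sketch of pulling ramoops records off a DUT.  It assumes the standard
/sys/fs/pstore layout; the paths and names here are illustrative, not
the actual log-gathering program:

    import glob
    import os
    import shutil

    # Minimal sketch, assuming the standard pstore mount point on the
    # DUT; record names vary by kernel (console-ramoops,
    # console-ramoops-0, dmesg-ramoops-N, ...).  This is illustrative,
    # not the actual log-gathering program used in the lab.
    PSTORE_DIR = "/sys/fs/pstore"
    DEST_DIR = "/tmp/crash_logs"

    def collect_ramoops():
        """Copy any ramoops records out of pstore for later inspection."""
        os.makedirs(DEST_DIR, exist_ok=True)
        records = glob.glob(os.path.join(PSTORE_DIR, "*ramoops*"))
        for path in records:
            shutil.copy(path, os.path.join(DEST_DIR, os.path.basename(path)))
        return records

    if __name__ == "__main__":
        found = collect_ramoops()
        if found:
            print("collected: " + ", ".join(found))
        else:
            print("no ramoops records in pstore; nothing to examine")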

I've filed this ticket to request a fix for the down servos on
nyan_blaze:
    http://b2/28989245

Once that's done, I think it would be best to re-open the lab
to nyan_blaze, and downgrade this bug.

Components: OS>Kernel
Labels: -Pri-0 -ReleaseBlock-Dev Pri-1
Owner: ----
Servos for nyan_blaze are now mostly working.

I've re-opened the lab.

This is probably just a vanilla kernel crash, so I'm re-labeling
it as such.

Status: Archived (was: Untriaged)

Comment 12 by ketakid@google.com, Mar 18 2017

Labels: Pri-3
Status: Available (was: Archived)
Activating. Please assign to the right owner and the appropriate priority.

Comment 13 by sheriffbot@chromium.org, Apr 12 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Archived (was: Untriaged)
