nyan_blaze builds are causing devices to go down
Reported by
jrbarnette@chromium.org,
May 26 2016
Issue description
Recent testing is causing nyan_blaze DUTs in the CrOS test
lab to go offline. The devices are either crashing or shutting
down.
Below are test jobs with failures:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64552372
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64577375
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64552368
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64685739
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64685743
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64643395
The first 5 jobs are for R52-8350.?.0 release builds. The last is
for the R53-8372.0.0 canary build. There's no common theme in the
specific tests.
The most likely cause is a system crash, possibly in the kernel. The
bug is (apparently) in ToT, and has presumably been cherry-picked
to the 8350 branch.
,
May 26 2016
A cursory check says that the DUTs simply go offline without
warning during testing.
I've filed a ticket to repair the devices and get logs:
http://b2/28983582
,
May 26 2016
I'm currently checking the possibility that this is a long-standing problem. Most of the nyan_blaze servo inventory went offline in the last few days because of bug 614047. It's possible that this problem caused many DUT failures which were papered over by servo repairs.
,
May 26 2016
Here is the summary of the failed DUTs.
chromeos4-row7-rack10-host1
  Last test: security_ChromiumOSLSM
  Pass/Fail: Failed
  Reason: provision_AutoUpdate to R52-8350.6.0 failed
chromeos4-row7-rack10-host3
  Last test: login_OwnershipApi
  Pass/Fail: Failed
  Reason: provision_AutoUpdate to R52-8350.6.0 failed
chromeos4-row7-rack10-host5
  Last test: login_OwnershipRetaken
  Pass/Fail: Failed
  Reason: provision_AutoUpdate to R52-8350.6.0 failed
chromeos4-row7-rack7-host1
  Last test: touch_TapSettings
  Pass/Fail: Passed
chromeos4-row7-rack7-host3
  Last test: video_YouTubeFlash
  Pass/Fail: Passed
chromeos4-row7-rack7-host5
  Last test: video_YouTubeFlash
  Pass/Fail: Passed
Three DUTs failed provision_AutoUpdate to R52-8350.6.0 and then changed to Repair status. For the other three DUTs, the results look odd, or logs are missing, so it isn't clear what happened in between.
,
May 26 2016
Re comment #4: The list of failures provided by the Autotest
web site is misleading. Most especially: the list isn't sorted
by timestamp; it's sorted by job number.
Most of the cited tests aren't where the failure actually
occurred. In any event, it looks like the specific tests
in question are uninteresting: The DUT simply goes down
without warning.
For an example of what really happened vs. the Autotest AFE
web site, here's the history of chromeos4-row7-rack10-host1:
chromeos4-row7-rack10-host1
2016-05-25 17:57:23 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887260-repair/
2016-05-25 17:42:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887203-provision/
2016-05-25 07:50:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64552368-chromeos-test/
2016-05-25 07:40:55 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/884464-provision/
2016-05-25 07:39:08 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64565060-chromeos-test/
2016-05-25 07:30:26 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/884347-provision/
The actual failure is between these two events:
2016-05-25 17:42:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887203-provision/
2016-05-25 07:50:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64552368-chromeos-test/
At the end of the test at 7:50, the DUT was working. At the
beginning of provisioning at 17:42, the DUT was down.
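The timestamp-ordered reading above can be sketched in a few lines of Python. This is only an illustration, not a lab tool: the helper names `parse_history` and `largest_gap` are hypothetical, the input is the history pasted above, and treating the longest quiet period between consecutive events as the likely failure window is an assumed heuristic that happens to match this host.

```python
from datetime import datetime

# History lines for chromeos4-row7-rack10-host1, copied from the comment above.
HISTORY = """\
2016-05-25 17:57:23 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887260-repair/
2016-05-25 17:42:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/887203-provision/
2016-05-25 07:50:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64552368-chromeos-test/
2016-05-25 07:40:55 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/884464-provision/
2016-05-25 07:39:08 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/64565060-chromeos-test/
2016-05-25 07:30:26 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack10-host1/884347-provision/
"""


def parse_history(text):
    """Parse 'DATE TIME STATUS URL' lines and return events sorted by timestamp."""
    events = []
    for line in text.strip().splitlines():
        date, time, status, url = line.split(None, 3)
        when = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        events.append((when, status, url))
    # Sort by time, not by job number, which is the ordering the AFE list uses.
    return sorted(events, key=lambda event: event[0])


def largest_gap(events):
    """Return the pair of consecutive events with the longest time between them.

    Assumption: the longest quiet period is where the DUT went down.
    """
    pairs = zip(events, events[1:])
    return max(pairs, key=lambda pair: pair[1][0] - pair[0][0])


events = parse_history(HISTORY)
before, after = largest_gap(events)
print("last event before the gap:", before[0], before[2])
print("first event after the gap:", after[0], after[2])
```

For this host, the pair returned is the test job that ended at 07:50:03 and the provision task that started at 17:42:12, the same window identified above.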
,
May 26 2016
I have some minimal evidence that this problem may have been going
on for some time. In the past week, I found three more BVT test
failures where a nyan_blaze DUT went offline unexpectedly. In those
cases, servo repaired the DUTs and testing carried on.
These are the three jobs of interest:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64219435
R53-8358.0.0 video_PowerConsumption.h264
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64135942
R53-8356.0.0 video_PowerConsumption.vp8
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=64208025
R51-8172.42.0 autoupdate_CatchBadSignatures
The six failures cited in the original description were for these tests:
touch_ScrollDirection
autoupdate_CatchBadSignatures
touch_TapSettings
audio_PowerConsumption.mp3 (x2)
video_PowerConsumption.h264
Given how far back this problem goes, I'm guessing that bisecting
to find the crash won't help. We need to wait until the DUTs are
repaired, and we have logs for examination.
,
May 26 2016
Logs are available from three of the DUTs; see attached. There's a report of a DHCP problem in the lab; that problem affected the other three DUTs, so the problem may not be as serious as it appears.
,
May 27 2016
I looked through the logs:
chromeos4-row7-rack7-host1 - /var/log/messages shows
corruption characteristic of a crash. However, the
log-gathering program didn't grab the ramoops file, so we have
no visibility into the cause of the crash.
chromeos4-row7-rack7-host3
chromeos4-row7-rack7-host5 - Both of these hosts look like stateful
was wiped during the crash. I don't know why that happened.
In any event, there's no history of the failure.
,
May 27 2016
I've filed this ticket to request a fix for the down servos on
nyan_blaze:
http://b2/28989245
Once that's done, I think it would be best to re-open the lab
to nyan_blaze, and downgrade this bug.
,
May 27 2016
Servos for nyan_blaze are now mostly working. I've re-opened the lab. This is probably just a vanilla kernel crash, so I'm re-labeling it as such.
,
Feb 17 2017
,
Mar 18 2017
Activating. Please assign to the right owner and the appropriate priority.
,
Apr 12 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 12 2018
Comment 1 by jrbarnette@chromium.org
, May 26 2016