Issue 762315


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




hardware_RamFio fails intermittently on beltino hardware

Reported by jrbarnette@chromium.org, Sep 5 2017

Issue description

The hardware_RamFio test fails intermittently on several boards, mostly
(in fact, entirely) beltino-based:
    mccloud monroe tidus tricky zako

One way to see it is in test retries:
    https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=hardware_RamFio

You get the same set of failures looking back 90 days instead
of just 30 days:
    https://wmatrix.googleplex.com/retry_teststats/?days_back=90&tests=hardware_RamFio

So, the problem is probably new within the last month.

Note that this bug has caused at least one CQ failure:
    https://luci-milo.appspot.com/buildbot/chromeos/tidus-paladin/648

The tidus-paladin builder has also encountered the failure and then
passed after a successful retry.

 
Below are pointers to logs of four recent examples from tidus-paladin:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139756708-chromeos-test/chromeos4-row3-rack2-host10
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139651198-chromeos-test/chromeos4-row3-rack2-host12
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139660889-chromeos-test/chromeos4-row3-rack2-host1
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139664556-chromeos-test/chromeos4-row3-rack2-host1

In all cases, the logs indicate a kernel crash.  Looking in the
kcrash file under "crashinfo", you find failures like this:
Panic#1 Part1
<6>[ 1208.725968] chrome[3040]: segfault at fffffffd5039bcf1 ip 000055f8c88216aa sp 00007fff468f8490 error 5 in chrome[55f8c78e6000+7dd3000]
<4>[ 1209.842918] SELinux: Unable to set superblock options before the security server is initialized
<3>[ 1439.644136] INFO: task jbd2/dm-1-8:217 blocked for more than 120 seconds.
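
For anyone triaging similar runs, here is a minimal sketch (Python, assuming
the results are downloaded locally and that the kcrash/console logs live under
a "crashinfo" subdirectory as in the links above) that pulls out the relevant
panic and hung-task lines:

# Minimal sketch: scan logs under a downloaded autotest results directory
# for the signatures seen in this bug.  The "crashinfo" directory layout is
# an assumption based on the logs linked above.
import os
import re
import sys

SIGNATURES = re.compile(
    r'blocked for more than \d+ seconds'
    r'|Kernel panic - not syncing'
    r'|segfault at')

def scan(results_dir):
    for root, _, files in os.walk(results_dir):
        if 'crashinfo' not in root:
            continue
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, errors='replace') as f:
                    for lineno, line in enumerate(f, 1):
                        if SIGNATURES.search(line):
                            print('%s:%d: %s' % (path, lineno, line.rstrip()))
            except OSError:
                pass

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else '.')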

Labels: -Pri-3 Pri-1
Escalating to P1; this has caused one failure in the CQ; more are
possible.  If there are more than a few failures, action will be
mandatory.

Some fiddling with wmatrix turns up that the first canary failure is
around R62-9898.0.0 on monroe:
    http://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=138750194

That's the 2:00 AM canary on 8/31.

wmatrix also indicates failures on R61, starting at R61-9765.45.0.  So,
whatever change introduced the problem has been cherry-picked to the branch.
Cc: jintao@chromium.org
Owner: dlaurie@chromium.org
Status: Assigned (was: Untriaged)
Assigning to sheriff dlaurie@, because jintao@ was already assigned a
bug earlier.

#1: FWIW, the message suggests a chrome crash, not a kernel crash. It just happens to be logged in the kernel log.

The crash is because of

<3>[ 1439.644136] INFO: task jbd2/dm-1-8:217 blocked for more than 120 seconds.
<3>[ 1439.644158]       Not tainted 3.14.0 #1

since blocked tasks cause a reboot. It looks like several threads are waiting for a lock to be released.
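
The "blocked task leads to reboot" step depends on the kernel's hung-task
panic setting.  A quick way to confirm it on a DUT (a sketch; the standard
/proc/sys paths are used, and whether hung_task_panic is set to 1 on these
boards is an assumption to verify):

# Sketch: read the hung-task knobs on a DUT to confirm that a task blocked
# for more than hung_task_timeout_secs would actually panic (and reboot)
# the machine.  Whether hung_task_panic is 1 here is an assumption.
def read_sysctl(name):
    with open('/proc/sys/kernel/' + name) as f:
        return f.read().strip()

for knob in ('hung_task_timeout_secs', 'hung_task_panic'):
    try:
        print('%s = %s' % (knob, read_sysctl(knob)))
    except IOError:
        print('%s: not available on this kernel' % knob)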


> #1: FWIW, the message suggests a chrome crash, not a kernel crash. It just happens to be logged in the kernel log.

Regardless of what you call it, the logs show that the symptom is a DUT
reboot, not Chrome hanging.

Also, regardless of the root cause, initial triage needs to be done by
someone with kernel expertise, because the Chrome experts aren't likely
to be able to understand what the kernel logs are telling them, and that's
where the work has to start.

Cc: dlaurie@chromium.org
Owner: tbroch@chromium.org
Labels: OS-Chrome Pri-3
Status: Available (was: Assigned)
Cc: tbroch@chromium.org
Owner: ----

Comment 11 by derat@chromium.org, Jun 30 2018

Cc: puthik@chromium.org wzang@chromium.org
Labels: -Pri-3 Pri-1
This looks like it's still failing intermittently; see e.g. the tricky-tot-chrome-pfq-informational run at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8942355412083790784:

06/29 15:33:03.266 INFO |        server_job:0216| START	----	----	timestamp=1530311582	localtime=Jun 29 15:33:02	
06/29 15:33:03.350 INFO |        server_job:0216| 	START	hardware_RamFio	hardware_RamFio	timestamp=1530311582	localtime=Jun 29 15:33:02	
06/29 15:33:04.149 INFO |        server_job:0216| 		START	hardware_StorageFio	hardware_StorageFio	timestamp=1530311583	localtime=Jun 29 15:33:03	
06/29 15:48:04.666 INFO |     ssh_multiplex:0079| Master ssh connection to chromeos4-row2-rack3-host13 is down.
06/29 15:48:04.667 INFO |     ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_HQvcs7ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack3-host13'
06/29 15:48:06.562 INFO |        server_job:0216| 			FAIL	----	----	timestamp=1530312486	localtime=Jun 29 15:48:06	Autotest client terminated unexpectedly: DUT rebooted during the test run.
06/29 15:48:06.562 INFO |        server_job:0216| 		END FAIL	----	----	timestamp=1530312486	localtime=Jun 29 15:48:06	
06/29 15:48:06.562 INFO |        server_job:0216| 	END GOOD	----	----	timestamp=1530312486	localtime=Jun 29 15:48:06	

If nobody familiar with what this is testing cares enough to investigate, I'm happy to remove it from bvt-cq. Flaky tests don't belong in that suite.
> This looks like it's still failing intermittently; see e.g. the
> tricky-tot-chrome-pfq-informational run at
> http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8942355412083790784

Stainless seems to show rare, intermittent failures:  http://shortn/_WEXc420bTf

That dashboard doesn't prove that the test is flaky, but "flaky test" is more
likely to explain the rare red boxes than "bugs are being found and fixed quickly".

I'm guessing that the flake rate shown on that dashboard is in the ballpark of
1 per 1000.  That's pretty good, but the last time I ran numbers on CQ reliability
requirements, I concluded that even at half that rate, it would still be too high.
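
To make that reasoning concrete, a back-of-the-envelope calculation (the
~1/1000 flake rate is the dashboard estimate above; the number of boards
running the test per CQ run and the number of CQ runs per week are
illustrative assumptions, not measured values):

# Rough model: probability that at least one board flakes in a CQ run, and
# the expected number of flake-caused failures per week.  All counts other
# than the ~1/1000 flake rate are assumptions for illustration.
flake_rate = 1.0 / 1000      # estimated per-test, per-run failure rate
builders_per_cq_run = 5      # assumption: boards that run hardware_RamFio
cq_runs_per_week = 100       # assumption

p_flake_per_cq_run = 1 - (1 - flake_rate) ** builders_per_cq_run
expected_per_week = p_flake_per_cq_run * cq_runs_per_week
print('P(flake in one CQ run) = %.4f' % p_flake_per_cq_run)   # ~0.0050
print('Expected flakes/week   = %.2f' % expected_per_week)    # ~0.50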

Re #12

The red boxes in Stainless are from ABORT status: http://shortn/_mOvJmJGtiY

The test actually passes:

INFO	----	----	kernel=3.18.0-18035-g0a7c2593e03e	localtime=Jun 29 13:33:10	timestamp=1530304390	
START	----	----	timestamp=1530304407	localtime=Jun 29 13:33:27	
	START	hardware_RamFio	hardware_RamFio	timestamp=1530304407	localtime=Jun 29 13:33:27	
		START	hardware_StorageFio	hardware_StorageFio	timestamp=1530304408	localtime=Jun 29 13:33:28	
			GOOD	hardware_StorageFio	hardware_StorageFio	timestamp=1530304418	localtime=Jun 29 13:33:38	completed successfully
		END GOOD	hardware_StorageFio	hardware_StorageFio	timestamp=1530304418	localtime=Jun 29 13:33:38	
		GOOD	hardware_RamFio	hardware_RamFio	timestamp=1530304421	localtime=Jun 29 13:33:41	completed successfully
	END GOOD	hardware_RamFio	hardware_RamFio	timestamp=1530304421	localtime=Jun 29 13:33:41	
END GOOD	----	----	timestamp=1530304423	localtime=Jun 29 13:33:43	
INFO	----	----	Job aborted by autotest_system on 1530304437
This just failed again on tricky-tot-chrome-pfq-informational at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8942087158872948784 :

07/02 14:35:57.953 INFO |        server_job:0216| 	START	hardware_RamFio	hardware_RamFio	timestamp=1530567357	localtime=Jul 02 14:35:57	
07/02 14:35:58.683 INFO |        server_job:0216| 		START	hardware_StorageFio	hardware_StorageFio	timestamp=1530567358	localtime=Jul 02 14:35:58	
07/02 14:50:59.215 INFO |     ssh_multiplex:0079| Master ssh connection to chromeos4-row2-rack3-host13 is down.
07/02 14:50:59.244 INFO |     ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_HDh18bssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack3-host13'
07/02 14:51:00.970 INFO |        server_job:0216| 			FAIL	----	----	timestamp=1530568260	localtime=Jul 02 14:51:00	Autotest client terminated unexpectedly: DUT rebooted during the test run.
07/02 14:51:00.970 INFO |        server_job:0216| 		END FAIL	----	----	timestamp=1530568260	localtime=Jul 02 14:51:00	
07/02 14:51:00.970 INFO |        server_job:0216| 	END GOOD	----	----	timestamp=1530568260	localtime=Jul 02 14:51:00

This test is making it difficult for Chrome gardeners: I'm watching this builder to get early warnings about Chrome issues that will disrupt the Chrome PFQ.

I've uploaded https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1122902 to move it from bvt-cq to bvt-perbuild. Please feel free to move it back after the underlying issue has been fixed or the test has been modified to be skipped on boards where it doesn't pass reliably.
> The red in stainless are from ABORT status. http://shortn/_mOvJmJGtiY

Sure enough, AFAICT, all the failures are from aborts.  Basically, that's
the fault of the infrastructure, not the test.  So, that means the test's
false failure rate is approaching "too low to measure", which is the best
we can ask for.

> This test is making it difficult for Chrome gardeners: I'm watching
> this builder to get early warnings about Chrome issues that will disrupt
> the Chrome PFQ.

I did a spot check of the failure cited with this comment.  There's a
console-ramoops uploaded with the failure, and it's got this key message:
    [ 2918.012206] Kernel panic - not syncing: Out of memory and no killable processes...

That seems likely to be a consequence of the test, but that's not _quite_
the same as asserting that there's a test bug.  There's some possibility
that the size of Chrome processes has passed a threshold that makes the test
more vulnerable.  It's also possible that the test is exposing a memory leak
or wasteful usage.
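
One way to check the "Chrome got bigger" theory on an affected DUT is to
compare Chrome's total resident size against the memory that's left for the
test (a sketch; VmRSS and MemAvailable are the standard /proc field names,
and the idea that a larger Chrome footprint is what pushes the test into OOM
is a hypothesis, not a confirmed cause):

# Sketch: sum the RSS of all chrome processes and compare it against
# MemAvailable, to see how much headroom the test really has.
import os

def meminfo_kb(field):
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith(field + ':'):
                return int(line.split()[1])
    return 0

def chrome_rss_kb():
    total = 0
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/comm' % pid) as f:
                if f.read().strip() != 'chrome':
                    continue
            with open('/proc/%s/status' % pid) as f:
                for line in f:
                    if line.startswith('VmRSS:'):
                        total += int(line.split()[1])
        except IOError:
            continue
    return total

print('chrome RSS:   %d MiB' % (chrome_rss_kb() // 1024))
print('MemAvailable: %d MiB' % (meminfo_kb('MemAvailable') // 1024))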

A couple of observations regarding this bug:
  * On ToT, the test isn't flaky.
  * The failures on the ToT informational builder are _new_.

I'd recommend we close this bug:  The originally cited problem hasn't been seen
for 9 months.  If someone cares about what hardware_RamFio tests, then we should
also file a new bug for these new failures on the informational builder, and figure
out what's going on there.  As for disabling the test:  that's the obvious course
of action.  If someone cares enough about hardware_RamFio to file the new bug,
then they can also use that as a forum to propose an alternative.
https://crrev.com/c/1125270 should be a better solution for this.

It won't affect hardware AVL, and it makes the test more robust in the lab.
> It won't affect hardware AVL, and it makes the test more robust in the lab.

Actually, looking at this statement, and at what the test says it
does, I have to ask:
  * If this test fails, what kind of _software_ bug would be indicated?

The bvt-cq suite isn't the right place for AVL tests.  If we're using
it that way, we need to update the process.

 * If this test fails, what kind of _software_ bug would be indicated?
- FIO tool is flaky


> The bvt-cq suite isn't the right place for AVL tests.
Where is the right place then?
>  * If this test fails, what kind of _software_ bug would be indicated?
> - FIO tool is flaky

If that were so, would the bug be a release blocker?  Would such
a bug affect consumers, or would it impact our ability to find
other bugs that could affect consumers?


> > The bvt-cq suite isn't the right place for AVL tests.
> Where is the right place then?

I don't know, possibly in a dedicated suite, possibly in some other
existing suite that's already dedicated to AVL testing.

I understand your point now.

Looks like bvt-perbuild or kernel_per-build_regression is a better place for this.

Comment 21 by bugdroid1@chromium.org, Jul 4

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/73560c9d592b66bfce9f45667e9268e892f8494c

commit 73560c9d592b66bfce9f45667e9268e892f8494c
Author: Daniel Erat <derat@chromium.org>
Date: Wed Jul 04 23:16:31 2018

autotest: Move hardware_RamFio to bvt-perbuild.

hardware_RamFio triggers intermittent DUT reboots on
beltino-based hardware. Move it from bvt-cq to bvt-perbuild
so it doesn't cause unactionable CQ and PFQ failures.

BUG=chromium:762315
TEST=none

Change-Id: I2ecaa50673400c5460b1dc060c34dae4a3e8bbbc
Reviewed-on: https://chromium-review.googlesource.com/1122902
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Puthikorn Voravootivat <puthik@chromium.org>

[modify] https://crrev.com/73560c9d592b66bfce9f45667e9268e892f8494c/client/site_tests/hardware_RamFio/control
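
For reference, the control-file side of a suite move like this is normally a
one-line attribute change; the sketch below shows the expected shape (autotest
control files are plain Python) and is not copied from the actual CL:

# Hypothetical shape of the change in
# client/site_tests/hardware_RamFio/control; the exact attribute list in the
# real file may differ.
# Before:
#   ATTRIBUTES = "suite:bvt-cq"
# After:
ATTRIBUTES = "suite:bvt-perbuild"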


Comment 22 by bugdroid1@chromium.org, Jul 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f730685a3b1520adb8f31eb25638ab76628a685b

commit f730685a3b1520adb8f31eb25638ab76628a685b
Author: Puthikorn Voravootivat <puthik@chromium.org>
Date: Tue Jul 10 00:59:28 2018

hardware_RamFio: Reserved 200 MiB ram for OS/Chrome

hardware_RamFio can fail in the lab with non faulty hardware
if we used too much ram for this test.

This CL fixes that by
1. Additionally Reserved 200 MiB of memory for OS/Chrome
2. Raise TestNAError with not enough memory.

This change won't affect existing memory AVL workflow
because new hardware should have at least 4GB of RAM.

BUG=chromium:762315
TEST=manully. Test passed on Eve. TestNAError when use 20G reserved.

Change-Id: I5843c50917ba2a23d15cdd49f314a21d9703ce91
Signed-off-by: Puthikorn Voravootivat <puthik@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1125270

[modify] https://crrev.com/f730685a3b1520adb8f31eb25638ab76628a685b/client/site_tests/hardware_RamFio/hardware_RamFio.py
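
The shape of that fix, as described in the commit message, is roughly the
following (a sketch, not the actual hardware_RamFio.py diff; the helper and
the 256 MiB lower bound are assumptions for illustration, while
error.TestNAError is the standard autotest "not applicable" exception):

# Sketch of the approach: reserve a fixed 200 MiB of headroom for the
# OS/Chrome, and skip (TestNAError) rather than fail when what remains is
# too small to run fio.  Reading /proc/meminfo directly here may differ
# from what the real test does.
from autotest_lib.client.common_lib import error

RESERVED_FOR_OS_MIB = 200
MIN_TEST_SIZE_MIB = 256   # assumed lower bound, for illustration only

def _free_mib():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemFree:'):
                return int(line.split()[1]) // 1024
    raise error.TestError('MemFree not found in /proc/meminfo')

def pick_test_size_mib():
    usable = _free_mib() - RESERVED_FOR_OS_MIB
    if usable < MIN_TEST_SIZE_MIB:
        raise error.TestNAError(
            'Not enough free memory after reserving %d MiB for the OS'
            % RESERVED_FOR_OS_MIB)
    return usable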
