hardware_RamFio fails intermittently on beltino hardware
Reported by jrbarnette@chromium.org, Sep 5 2017
Issue description
The hardware_RamFio test fails intermittently on several boards, all of them
beltino-based:
mccloud, monroe, tidus, tricky, zako
One way to see it is in test retries:
https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=hardware_RamFio
You get the same set of failures looking back 90 days instead
of just 30 days:
https://wmatrix.googleplex.com/retry_teststats/?days_back=90&tests=hardware_RamFio
So, the problem is probably new within the last month.
Note that this bug has caused at least one CQ failure:
https://luci-milo.appspot.com/buildbot/chromeos/tidus-paladin/648
The tidus-paladin builder has also encountered the failure and then
passed after a successful retry.
Sep 5 2017
Escalating to P1; this has caused one failure in the CQ; more are possible. If there are more than a few failures, action will be mandatory.
Sep 5 2017
Some fiddling with wmatrix turns up that the first canary failure is
around R62-9898.0.0 on monroe:
http://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=138750194
That's the 2:00 AM canary on 8/31.
wmatrix also indicates failures on R61, starting at R61-9765.45.0. So,
whatever change introduced the problem has been cherry-picked to the branch.
Sep 5 2017
Assigning to sheriff dlaurie@, because jintao@ already got a bug earlier.
Sep 5 2017
#1: FWIW, the message suggests a chrome crash, not a kernel crash. It just happens to be logged in the kernel log.
Sep 5 2017
The crash is because of the following, since blocked tasks cause a reboot:

<3>[ 1439.644136] INFO: task jbd2/dm-1-8:217 blocked for more than 120 seconds.
<3>[ 1439.644158] Not tainted 3.14.0 #1

It looks like several threads are waiting for a lock to be released.
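(Side note for anyone triaging: the reboot follows from the kernel's hung-task settings. When hung-task panics are enabled, a task blocked longer than the timeout, 120 seconds in the message above, panics the kernel, and the lab sees that as a DUT reboot. A minimal sketch for checking those settings on a DUT, assuming the standard hung-task sysctls are exposed under /proc/sys/kernel/:

#!/usr/bin/env python3
# Sketch only: report whether this kernel will panic (and thus reboot)
# when it detects a hung task.

def read_sysctl(name):
    """Return the value of /proc/sys/kernel/<name>, or None if absent."""
    try:
        with open('/proc/sys/kernel/' + name) as f:
            return f.read().strip()
    except IOError:
        return None

panic = read_sysctl('hung_task_panic')
timeout = read_sysctl('hung_task_timeout_secs')
print('hung_task_panic=%s hung_task_timeout_secs=%s' % (panic, timeout))
if panic == '1':
    print('A task blocked for more than %s seconds will panic the kernel,'
          ' which the lab observes as a DUT reboot.' % timeout)
)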
Sep 6 2017
> #1: FWIW, the message suggests a chrome crash, not a kernel crash. It just
> happens to be logged in the kernel log.

Regardless of what you call it, the logs show that the symptom is a DUT reboot, not Chrome hanging. Also, regardless of the root cause, initial triage needs to be done by someone with kernel expertise, because the Chrome experts aren't likely to be able to understand what the kernel logs are telling them, and that's where the work has to start.
Jun 30 2018
This looks like it's still failing intermittently; see e.g. the tricky-tot-chrome-pfq-informational run at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8942355412083790784:

06/29 15:33:03.266 INFO | server_job:0216| START ---- ---- timestamp=1530311582 localtime=Jun 29 15:33:02
06/29 15:33:03.350 INFO | server_job:0216| START hardware_RamFio hardware_RamFio timestamp=1530311582 localtime=Jun 29 15:33:02
06/29 15:33:04.149 INFO | server_job:0216| START hardware_StorageFio hardware_StorageFio timestamp=1530311583 localtime=Jun 29 15:33:03
06/29 15:48:04.666 INFO | ssh_multiplex:0079| Master ssh connection to chromeos4-row2-rack3-host13 is down.
06/29 15:48:04.667 INFO | ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_HQvcs7ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack3-host13'
06/29 15:48:06.562 INFO | server_job:0216| FAIL ---- ---- timestamp=1530312486 localtime=Jun 29 15:48:06 Autotest client terminated unexpectedly: DUT rebooted during the test run.
06/29 15:48:06.562 INFO | server_job:0216| END FAIL ---- ---- timestamp=1530312486 localtime=Jun 29 15:48:06
06/29 15:48:06.562 INFO | server_job:0216| END GOOD ---- ---- timestamp=1530312486 localtime=Jun 29 15:48:06

If nobody familiar with what this is testing cares enough to investigate, I'm happy to remove it from bvt-cq. Flaky tests don't belong in that suite.
Jun 30 2018
> This looks like it's still failing intermittently; see e.g. the
> tricky-tot-chrome-pfq-informational run at
> http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8942355412083790784

Stainless seems to show rare, intermittent failures: http://shortn/_WEXc420bTf

That dashboard doesn't prove that the test is flaky, but "flaky test" is more likely to explain the rare red boxes than "bugs are being found and fixed quickly". I'm guessing that the flake rate shown on that dashboard is in the ballpark of 1 per 1000. That's pretty good, but the last time I ran numbers on CQ reliability requirements, I concluded that even at half that rate, it would still be too high.
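(To make that last point concrete, here is the rough arithmetic involved; the run counts below are illustrative assumptions, not measured CQ numbers.

# Back-of-the-envelope: why even half of 1-per-1000 is too high for the CQ.
per_test_flake = 1.0 / 2000        # assumed per-run false-failure rate
runs_per_cq_cycle = 50             # assumed runs of this test per CQ cycle
cq_cycles_per_week = 100           # assumed

p_false_cq_failure = 1 - (1 - per_test_flake) ** runs_per_cq_cycle
print('P(false CQ failure per cycle) ~ %.1f%%' % (100 * p_false_cq_failure))
print('Expected false CQ failures per week ~ %.1f'
      % (cq_cycles_per_week * p_false_cq_failure))
# With these assumptions: roughly 2.5% per cycle, i.e. a couple of false CQ
# failures per week from this one test alone.
)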
Jun 30 2018
Re #12: The red boxes in Stainless are from ABORT status: http://shortn/_mOvJmJGtiY. The test actually passed:

INFO ---- ---- kernel=3.18.0-18035-g0a7c2593e03e localtime=Jun 29 13:33:10 timestamp=1530304390
START ---- ---- timestamp=1530304407 localtime=Jun 29 13:33:27
START hardware_RamFio hardware_RamFio timestamp=1530304407 localtime=Jun 29 13:33:27
START hardware_StorageFio hardware_StorageFio timestamp=1530304408 localtime=Jun 29 13:33:28
GOOD hardware_StorageFio hardware_StorageFio timestamp=1530304418 localtime=Jun 29 13:33:38 completed successfully
END GOOD hardware_StorageFio hardware_StorageFio timestamp=1530304418 localtime=Jun 29 13:33:38
GOOD hardware_RamFio hardware_RamFio timestamp=1530304421 localtime=Jun 29 13:33:41 completed successfully
END GOOD hardware_RamFio hardware_RamFio timestamp=1530304421 localtime=Jun 29 13:33:41
END GOOD ---- ---- timestamp=1530304423 localtime=Jun 29 13:33:43
INFO ---- ---- Job aborted by autotest_system on 1530304437
Jul 3
This just failed again on tricky-tot-chrome-pfq-informational at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8942087158872948784:

07/02 14:35:57.953 INFO | server_job:0216| START hardware_RamFio hardware_RamFio timestamp=1530567357 localtime=Jul 02 14:35:57
07/02 14:35:58.683 INFO | server_job:0216| START hardware_StorageFio hardware_StorageFio timestamp=1530567358 localtime=Jul 02 14:35:58
07/02 14:50:59.215 INFO | ssh_multiplex:0079| Master ssh connection to chromeos4-row2-rack3-host13 is down.
07/02 14:50:59.244 INFO | ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_HDh18bssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack3-host13'
07/02 14:51:00.970 INFO | server_job:0216| FAIL ---- ---- timestamp=1530568260 localtime=Jul 02 14:51:00 Autotest client terminated unexpectedly: DUT rebooted during the test run.
07/02 14:51:00.970 INFO | server_job:0216| END FAIL ---- ---- timestamp=1530568260 localtime=Jul 02 14:51:00
07/02 14:51:00.970 INFO | server_job:0216| END GOOD ---- ---- timestamp=1530568260 localtime=Jul 02 14:51:00

This test is making it difficult for Chrome gardeners: I'm watching this builder to get early warnings about Chrome issues that will disrupt the Chrome PFQ. I've uploaded https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1122902 to move it from bvt-cq to bvt-perbuild. Please feel free to move it back after the underlying issue has been fixed or the test has been modified to be skipped on boards where it doesn't pass reliably.
Jul 3
> The red in stainless are from ABORT status. http://shortn/_mOvJmJGtiY

Sure enough, AFAICT, all the failures are from aborts. Basically, that's the fault of the infrastructure, not the test. So, that means the test's false failure rate is approaching "too low to measure", which is the best we can ask for.

> This test is making it difficult for Chrome gardeners: I'm watching
> this builder to get early warnings about Chrome issues that will disrupt
> the Chrome PFQ.

I did a spot check of the failure cited with this comment. There's a console-ramoops uploaded with the failure, and it's got this key message:

[ 2918.012206] Kernel panic - not syncing: Out of memory and no killable processes...

That seems likely to be a consequence of the test, but that's not _quite_ the same as asserting that there's a test bug. There's some possibility that the size of Chrome processes has passed a threshold that makes the test more vulnerable. It's also possible that the test is exposing a memory leak or wasteful usage.

A couple of observations regarding this bug:
* On ToT, the test isn't flaky.
* The failures on the ToT informational builder are _new_.

I'd recommend we close this bug: the originally cited problem hasn't been seen for 9 months. If someone cares about what hardware_RamFio tests, then we should also file a new bug for these new failures on the informational builder and figure out what's going on there.

As for disabling the test: that's the obvious course of action. If someone cares enough about hardware_RamFio to file the new bug, then they can also use that as a forum to propose an alternative.
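(For anyone repeating that spot check, a sketch of the approach, assuming the job results, including the crashinfo directories with console-ramoops, have been copied locally; the "results/" glob patterns are assumptions about that local layout.

#!/usr/bin/env python3
# Sketch of the console-ramoops spot check described above.
import glob

patterns = ['results/**/crashinfo*/console-ramoops*',
            'results/**/console-ramoops*']

paths = set()
for pattern in patterns:
    paths.update(glob.glob(pattern, recursive=True))

for path in sorted(paths):
    with open(path, errors='replace') as f:
        for line in f:
            if 'Kernel panic' in line or 'blocked for more than' in line:
                print('%s: %s' % (path, line.strip()))
)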
Jul 3
https://crrev.com/c/1125270 should be a better solution for this. It won't affect hardware AVL, and it makes the test more robust in the lab.
Jul 3
> It won't affect hardware AVL, and it makes the test more robust in the lab.

Actually, looking at this statement, and at what the test says it does, I have to ask:

* If this test fails, what kind of _software_ bug would be indicated?

The bvt-cq suite isn't the right place for AVL tests. If we're using it that way, we need to update the process.
Jul 3
> * If this test fails, what kind of _software_ bug would be indicated?

The FIO tool is flaky.

> The bvt-cq suite isn't the right place for AVL tests.

Where is the right place, then?
Jul 3
> > * If this test fails, what kind of _software_ bug would be indicated?
> The FIO tool is flaky.

If that were so, would the bug be a release blocker? Would such a bug affect consumers, or would it impact our ability to find other bugs that could affect consumers?

> > The bvt-cq suite isn't the right place for AVL tests.
> Where is the right place, then?

I don't know; possibly a dedicated suite, possibly some other existing suite that's already dedicated to AVL testing.
Jul 3
I understand your point now. It looks like bvt-perbuild or kernel_per-build_regression is a better place for this.
Jul 4
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/73560c9d592b66bfce9f45667e9268e892f8494c

commit 73560c9d592b66bfce9f45667e9268e892f8494c
Author: Daniel Erat <derat@chromium.org>
Date: Wed Jul 04 23:16:31 2018

autotest: Move hardware_RamFio to bvt-perbuild.

hardware_RamFio triggers intermittent DUT reboots on beltino-based hardware. Move it from bvt-cq to bvt-perbuild so it doesn't cause unactionable CQ and PFQ failures.

BUG=chromium:762315
TEST=none

Change-Id: I2ecaa50673400c5460b1dc060c34dae4a3e8bbbc
Reviewed-on: https://chromium-review.googlesource.com/1122902
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Puthikorn Voravootivat <puthik@chromium.org>

[modify] https://crrev.com/73560c9d592b66bfce9f45667e9268e892f8494c/client/site_tests/hardware_RamFio/control
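(For context: the CL edits the test's control file, since in autotest suite membership is declared via the ATTRIBUTES string there. The fragment below only illustrates the shape of that kind of change; the field values are assumptions, not copied from the real control file, and the `job` object is supplied by the autotest harness when the control file is executed.

# Illustrative control-file fragment (values assumed, not the real file).
AUTHOR = 'chromeos-kernel-test'
NAME = 'hardware_RamFio'
PURPOSE = 'Stress system RAM by running fio against a ramfs-backed file.'
# Before this CL:  ATTRIBUTES = "suite:bvt-cq"
ATTRIBUTES = "suite:bvt-perbuild"
TIME = 'SHORT'
TEST_CLASS = 'hardware'
TEST_TYPE = 'client'

job.run_test('hardware_RamFio')  # 'job' is injected by the autotest runner
)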
Jul 10
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f730685a3b1520adb8f31eb25638ab76628a685b

commit f730685a3b1520adb8f31eb25638ab76628a685b
Author: Puthikorn Voravootivat <puthik@chromium.org>
Date: Tue Jul 10 00:59:28 2018

hardware_RamFio: Reserved 200 MiB ram for OS/Chrome

hardware_RamFio can fail in the lab with non faulty hardware if we used too much ram for this test. This CL fixes that by
1. Additionally Reserved 200 MiB of memory for OS/Chrome
2. Raise TestNAError with not enough memory.
This change won't affect existing memory AVL workflow because new hardware should have at least 4GB of RAM.

BUG=chromium:762315
TEST=manully. Test passed on Eve. TestNAError when use 20G reserved.

Change-Id: I5843c50917ba2a23d15cdd49f314a21d9703ce91
Signed-off-by: Puthikorn Voravootivat <puthik@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1125270

[modify] https://crrev.com/f730685a3b1520adb8f31eb25638ab76628a685b/client/site_tests/hardware_RamFio/hardware_RamFio.py
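(A minimal sketch of the guard this CL describes, not the actual hardware_RamFio.py diff: keep head-room for the OS/Chrome and skip with TestNAError instead of failing when too little memory is left. The constants and helper names mirror common autotest idioms but should be treated as assumptions.

# Sketch only -- constants and helpers are illustrative, not the real change.
from autotest_lib.client.bin import utils
from autotest_lib.client.common_lib import error

_RESERVED_RAM_BYTES = 200 * 1024 * 1024   # head-room kept for OS/Chrome
_MIN_TEST_BYTES = 256 * 1024 * 1024       # smallest useful fio working set


def usable_ram_bytes():
    """Free memory minus the head-room reserved for the rest of the system."""
    free_kib = utils.freememtotal()        # reads MemFree from /proc/meminfo
    return free_kib * 1024 - _RESERVED_RAM_BYTES


def check_ram_or_skip():
    size = usable_ram_bytes()
    if size < _MIN_TEST_BYTES:
        raise error.TestNAError(
            'Only %d bytes left for fio after reserving head-room; '
            'not enough RAM to run hardware_RamFio.' % size)
    return size
)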
Comment 1 by jrbarnette@chromium.org, Sep 5 2017

Below are pointers to logs of four recent examples from tidus-paladin:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139756708-chromeos-test/chromeos4-row3-rack2-host10
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139651198-chromeos-test/chromeos4-row3-rack2-host12
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139660889-chromeos-test/chromeos4-row3-rack2-host1
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/139664556-chromeos-test/chromeos4-row3-rack2-host1

In all cases, the logs indicate a kernel crash. Looking in the kcrash file under "crashinfo", you find a failure like this:

Panic#1 Part1
<6>[ 1208.725968] chrome[3040]: segfault at fffffffd5039bcf1 ip 000055f8c88216aa sp 00007fff468f8490 error 5 in chrome[55f8c78e6000+7dd3000]
<4>[ 1209.842918] SELinux: Unable to set superblock options before the security server is initialized
<3>[ 1439.644136] INFO: task jbd2/dm-1-8:217 blocked for more than 120 seconds.