[Tricky] OOM-induced kernel panic when running hardware_RamFio
Reported by
vpalatin@chromium.org,
Jun 24 2016
Issue description
Several recent x86 builds (tricky, edgar, zako?) have failed in HWTest while executing the hardware_RamFio test: 'hardware_RamFio FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.' e.g. https://uberchromegw.corp.google.com/i/chromeos/builders/tricky-chrome-pfq/builds/2025/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio
The kernel triggers the OOM killer after hardware_RamFio has started, but then seems to time out waiting on page operations and dies in a hung-task panic: https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/67636120-chromeos-test/chromeos4-row2-rack3-host8/crashinfo.chromeos4-row2-rack3-host8/kernel.20160624.050808.0.kcrash
Luigi, are you interested in looking at this? Otherwise please re-assign to me.
,
Jun 24 2016
Very kind of you. I am taking a look now.
,
Jun 24 2016
,
Jun 24 2016
The test measures MemFree, multiplies it by 0.95, then by a further 0.80, and allocates files on a RAM disk using that number, presumably as the maximum size. In any case the system runs out of memory and kills a renderer, and unfortunately the memory shortage triggers a file system bug. The shill process hangs on a write to a regular file (ext4), as does the rs:main process (whatever that is, but I don't think it matters). Both are waiting for a page. Finally loop0 causes the 2-minute hang at ext4_sync_file() -> jbd2_log_wait_commit() -> schedule().
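For reference, the sizing arithmetic described in this comment can be sketched roughly as below. This is not the code from hardware_RamFio.py: the function names, the sample meminfo text and the use of integer percentages are all assumptions made for illustration. Only the 0.95 and 0.80 factors come from the comment.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(':')
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def ramdisk_budget_kb(mem_free_kb, headroom_pct=95, fill_pct=80):
    """Hypothetical helper: MemFree * 0.95 * 0.80, in integer arithmetic.

    headroom_pct and fill_pct mirror the two factors mentioned in the
    comment; their names are invented here.
    """
    return mem_free_kb * headroom_pct // 100 * fill_pct // 100


# Made-up sample; on a device this text would come from /proc/meminfo.
sample = "MemTotal:        2000000 kB\nMemFree:         1000000 kB\n"
mem_free_kb = parse_meminfo(sample)['MemFree']
budget_kb = ramdisk_budget_kb(mem_free_kb)
print(budget_kb)
```

Note that the budget is a snapshot: anything that allocates memory between the MemFree reading and the fio run (Chrome, for instance) shrinks the real margin.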
,
Jun 24 2016
The kernel issues should be fixed, but I really don't think this test is supposed to trigger OOM in the first place. To avoid this I am adding a 'stop ui' to it in https://chromium-review.googlesource.com/#/c/356143 (This test should not hold up revving Chrome in the PFQ.)
,
Jun 24 2016
Thank you Ilja.
,
Jun 24 2016
But there is still a kernel crash. Do you think something can be done about it?
,
Jun 28 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/aca6f56640d21fbce88053c7b17903c4193a2bcc

commit aca6f56640d21fbce88053c7b17903c4193a2bcc
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Fri Jun 24 18:50:31 2016

hardware_RamFio: refactor test.

Main change is to stop chrome before running the test to avoid its memory usage as a dependency in activating OOM killer.

BUG= chromium:623116
TEST=Ran on cyan, lars (pass), veyron_minnie (unrelated fail).

Change-Id: I71281b08b48e037dcb196a5f32d902a2ad454c18
Reviewed-on: https://chromium-review.googlesource.com/356143
Reviewed-by: Haixia Shi <hshi@chromium.org>
Reviewed-by: Vincent Palatin <vpalatin@chromium.org>
Reviewed-by: Puthikorn Voravootivat <puthik@chromium.org>
Commit-Queue: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/aca6f56640d21fbce88053c7b17903c4193a2bcc/client/site_tests/hardware_RamFio/hardware_RamFio.py
,
Jul 21 2016
This failed in the PFQ today on tricky. Should this be marked Started? https://uberchromegw.corp.google.com/i/chromeos/builders/tricky-chrome-pfq/builds/2127
,
Jul 21 2016
There was an OOM kill again:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/70348964-chromeos-test/chromeos4-row2-rack4-host3/crashinfo.chromeos4-row2-rack4-host3/

<12>[ 678.602651] init: debugd main process (23443) killed by TERM signal
<4>[ 687.555253] fio invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=-1000
<5>[ 687.555271] Pid: 24003, comm: fio Tainted: G WC 3.8.11 #1
<5>[ 687.555281] Call Trace:
<5>[ 687.555296] [<ffffffffba4bc6b8>] dump_header.isra.11+0x94/0x1d0
<5>[ 687.555311] [<ffffffffba8c9f28>] ? _raw_spin_unlock+0xe/0x10
<5>[ 687.555322] [<ffffffffba4bcfd6>] out_of_memory+0x1bd/0x27b
<5>[ 687.555335] [<ffffffffba4c085e>] __alloc_pages_nodemask+0x602/0x736
<5>[ 687.555349] [<ffffffffba4d97c6>] handle_pte_fault+0x330/0x547
<5>[ 687.555361] [<ffffffffba4da853>] handle_mm_fault+0x16a/0x193
<5>[ 687.555373] [<ffffffffba429835>] __do_page_fault+0x1d4/0x38c
<5>[ 687.555386] [<ffffffffba4638fa>] ? set_next_entity+0x44/0x9b
<5>[ 687.555398] [<ffffffffba400c4c>] ? __switch_to+0x138/0x3b0
<5>[ 687.555410] [<ffffffffba8c9f4f>] ? _raw_spin_unlock_irq+0xe/0x11
<5>[ 687.555423] [<ffffffffba45c2ed>] ? finish_task_switch+0x69/0xa5
<5>[ 687.555434] [<ffffffffba429a1f>] do_page_fault+0xe/0x10
<5>[ 687.555445] [<ffffffffba8ca532>] page_fault+0x22/0x30

But overall the test is super stable now:
https://wmatrix.googleplex.com/unfiltered?hide_missing=True&releases=tot&tests=hardware_RamFio&days_back=100

Notice that a bunch of other tests ran before this one. Likely one of them leaked the memory:
autotest runtest video_ChromeHWDecodeUsed
autotest runtest security_ASLR
autotest runtest build_RootFilesystemSize
autotest runtest video_VideoSanity
autotest runtest sound_infrastructure
autotest runtest platform_CheckCriticalProcesses
autotest runtest kernel_ProtocolCheck

My suggestion is to
a) add a check to the test that sufficient RAM is available before claiming it.
b) watch out for recent memory leaks (possibly in video).
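Suggestion a) could look roughly like the sketch below. It is only an illustration under stated assumptions: InsufficientMemoryError, read_mem_free_kb, check_enough_ram and the reserve_kb parameter are hypothetical names, not part of autotest or of hardware_RamFio.py, and the thresholds in the example are made up. It reads MemFree only, since that is the field the test is described as using.

```python
class InsufficientMemoryError(Exception):
    """Hypothetical error a test could raise instead of provoking the OOM killer."""


def read_mem_free_kb(meminfo_text):
    """Return the MemFree value (kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith('MemFree:'):
            return int(line.split()[1])
    raise ValueError('MemFree not found in meminfo text')


def check_enough_ram(meminfo_text, required_kb, reserve_kb=0):
    """Fail early if MemFree cannot cover required_kb plus a safety reserve.

    reserve_kb is an assumed margin for allocations made by other
    processes while the test runs.
    """
    mem_free_kb = read_mem_free_kb(meminfo_text)
    if mem_free_kb < required_kb + reserve_kb:
        raise InsufficientMemoryError(
            'need %d kB (+%d kB reserve), only %d kB free'
            % (required_kb, reserve_kb, mem_free_kb))
    return mem_free_kb


# Made-up sample; on a device this text would come from /proc/meminfo.
sample = "MemTotal:        2000000 kB\nMemFree:          300000 kB\n"
try:
    check_enough_ram(sample, required_kb=760000, reserve_kb=50000)
    outcome = 'ok'
except InsufficientMemoryError as e:
    outcome = 'skipped: %s' % e
print(outcome)
```

Failing (or skipping) before the RAM disk is filled would turn a leak left behind by an earlier test into a clear test error rather than an OOM kill during the fio run.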
,
Feb 17 2017
,
Mar 18 2017
Activating. Please assign to the right owner and the appropriate priority.
,
Feb 22 2018
This is no longer relevant. hardware_RamFio is all green.
Comment 1 by vpalatin@google.com
, Jun 24 2016