Crash likely caused by trouble freeing up memory in low-memory situations (must see an OOM to qualify for this bug)
Issue description

Report ID: b928b99500000000
Client ID: 5D773DF6ACD348358E66830F715F22B0
Device: Nyan

Seeing this crash on Chrome OS.

---
<4>[328793.767556] chrome invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=300
<5>[328793.767571] CPU: 2 PID: 23959 Comm: chrome Tainted: G C 3.10.18 #1
<5>[328793.767591] [<c020cf9c>] (unwind_backtrace+0x0/0x110) from [<c020a08c>] (show_stack+0x20/0x24)
<5>[328793.767604] [<c020a08c>] (show_stack+0x20/0x24) from [<c07688d0>] (dump_stack+0x20/0x28)
<5>[328793.767615] [<c07688d0>] (dump_stack+0x20/0x28) from [<c0768090>] (dump_header.isra.12+0x88/0x1b4)
<5>[328793.767625] [<c0768090>] (dump_header.isra.12+0x88/0x1b4) from [<c02c83e8>] (oom_kill_process+0x84/0x390)
<5>[328793.767636] [<c02c83e8>] (oom_kill_process+0x84/0x390) from [<c02c8b4c>] (out_of_memory+0x230/0x2d4)
<5>[328793.767645] [<c02c8b4c>] (out_of_memory+0x230/0x2d4) from [<c02cbef8>] (__alloc_pages_nodemask+0x798/0x818)
<5>[328793.767656] [<c02cbef8>] (__alloc_pages_nodemask+0x798/0x818) from [<c02f5f00>] (read_swap_cache_async+0x60/0x138)
<5>[328793.767666] [<c02f5f00>] (read_swap_cache_async+0x60/0x138) from [<c02f605c>] (swapin_readahead+0x84/0xe0)
<5>[328793.767676] [<c02f605c>] (swapin_readahead+0x84/0xe0) from [<c02e5840>] (handle_pte_fault+0x204/0x828)
<5>[328793.767685] [<c02e5840>] (handle_pte_fault+0x204/0x828) from [<c02e6dcc>] (handle_mm_fault+0x120/0x154)
<5>[328793.767694] [<c02e6dcc>] (handle_mm_fault+0x120/0x154) from [<c0213f28>] (do_page_fault+0x12c/0x398)
<5>[328793.767703] [<c0213f28>] (do_page_fault+0x12c/0x398) from [<c02001d0>] (do_DataAbort+0x48/0xc4)
<5>[328793.767712] [<c02001d0>] (do_DataAbort+0x48/0xc4) from [<c0205cb8>] (__dabt_usr+0x38/0x40)
...
<3>[328793.793830] Out of memory: Kill process 23732 (Compositor) score 574 or sacrifice child
<0>[328802.627654] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
---

dianders@/snanda@ - please recommend someone to whom this issue can be assigned.
Sep 22 2016
The low memory condition might have nothing to do with the kernel crash. Looks like cpu 3 locked up while in cpuidle state:

<2>[328802.628936] CPU3: stopping
<5>[328802.628959] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G C 3.10.18 #1
<5>[328802.628990] [<c020cf9c>] (unwind_backtrace+0x0/0x110) from [<c020a08c>] (show_stack+0x20/0x24)
<5>[328802.629014] [<c020a08c>] (show_stack+0x20/0x24) from [<c07688d0>] (dump_stack+0x20/0x28)
<5>[328802.629038] [<c07688d0>] (dump_stack+0x20/0x28) from [<c020b7d4>] (handle_IPI+0xcc/0x124)
<5>[328802.629059] [<c020b7d4>] (handle_IPI+0xcc/0x124) from [<c02003ac>] (gic_handle_irq+0x64/0x6c)
<5>[328802.629081] [<c02003ac>] (gic_handle_irq+0x64/0x6c) from [<c0205b80>] (__irq_svc+0x40/0x50)
<5>[328802.629095] Exception stack(0xef2fbf00 to 0xef2fbf48)
<5>[328802.629113] bf00: ef2fbf48 000011e9 e7135d03 000011e9 00000001 c1f211c8 38a149c0 000011e8
<5>[328802.629131] bf20: c0e10100 c0e10100 c0e60be8 ef2fbf7c 00000008 ef2fbf48 c026d514 c05f8684
<5>[328802.629144] bf40: 80000113 ffffffff
<5>[328802.629167] [<c0205b80>] (__irq_svc+0x40/0x50) from [<c05f8684>] (cpuidle_enter_state+0x60/0xe8)
<5>[328802.629192] [<c05f8684>] (cpuidle_enter_state+0x60/0xe8) from [<c05f883c>] (cpuidle_idle_call+0x130/0x224)
<5>[328802.629216] [<c05f883c>] (cpuidle_idle_call+0x130/0x224) from [<c0206d04>] (arch_cpu_idle+0x18/0x48)
<5>[328802.629239] [<c0206d04>] (arch_cpu_idle+0x18/0x48) from [<c026bfc0>] (cpu_startup_entry+0x110/0x1d0)
<5>[328802.629261] [<c026bfc0>] (cpu_startup_entry+0x110/0x1d0) from [<c07648d8>] (secondary_start_kernel+0x130/0x154)
<5>[328802.629295] [<c07648d8>] (secondary_start_kernel+0x130/0x154) from [<80763e24>] (0x80763e24)
<4>[328803.822976] SMP: failed to stop secondary CPUs
Sep 26 2016
Observed this issue on the Chrome device Pit with M54 54.0.2840.39 / 8743.41.0 beta during Hangouts calls. The call ended abruptly on the device, sometimes during screen sharing and sometimes while switching the external camera.

Crash IDs:
35a81e2d00000000
44dc3a5e00000000
346e6e2d00000000
Sep 27 2016
Observed this issue on the Chrome device Pit with M53 53.0.2785.144 / 8530.93.0 stable during Hangouts calls. The call ended abruptly on Pit while switching the external camera.

Crash IDs:
112aeb2d00000000
eb29eb2d00000000
c71cf95e00000000
Sep 27 2016
Sep 30 2016
Observed this issue on the Chrome device Pi with M55 55.0.2874.0 / 8848.0.0 dev.

Crash ID: 5476516d00000000
Oct 31 2016
Sameer: I haven't had any time to look into this. Could you see if you could find someone? This really looks like a Chrome memory leak: the system is so full that it gives up trying to find more memory and reboots. This probably needs someone from the Chrome team, but I don't know who. Alberto?

A few notes:
* It's possible that this memory leak is showing up elsewhere, too. See b/31401810.
* It's possible that we could make this a little better if we try to re-enable compaction (we tried in http://crosbug.com/p/45689 but that got reverted). That would just delay the inevitable, though. At least the machine Dan looked at was really, truly out of memory.
Nov 1 2016
How would compaction help with an OOM situation? It will only help you find contiguous areas, which user space doesn't use.
Nov 1 2016
+ a few Chrome folks who may be able to help/evaluate too. This is the top kernel crash in ChromeOS M54:

https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20product.version%3D%278743.76.0%27%20AND%20exec_name%3D%27kernel%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=
Nov 1 2016
@8: As per above, compaction probably wouldn't help too much in the crash analyzed in @2. ...but in general it will help avoid kernel issues in low-memory situations and should cause the kernel to report out of memory less often.

Actually, though, something is peculiar here. This is unlike previous OOM stuff I've seen. One thing that is weird is:

<6>[328793.777925] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
...
<6>[328793.778410] [22900] 1000 22900 134957 6316 259 30445 417 chrome
<6>[328793.778416] [23053] 1000 23053 163287 5521 323 31211 358 chrome
<6>[328793.778423] [23084] 1000 23084 126568 5541 248 26618 475 chrome
<6>[328793.778430] [23728] 1000 23728 154236 7045 286 35064 533 chrome

I don't remember seeing such weird oom_score_adj values before.

Also note that, when it dies, there are still plenty of Chrome processes to kill, like:

<6>[328793.793667] [10037] 1000 10037 118006 5125 217 22141 300 chrome
<6>[328793.793675] [10044] 1000 10044 150264 6464 308 43793 300 chrome
<6>[328793.793682] [10057] 1000 10057 115684 4670 187 16089 300 chrome
<6>[328793.793689] [10261] 1000 10261 70473 2701 103 3107 300 chrome
<6>[328793.793696] [10273] 1000 10273 226592 6982 500 67620 300 chrome
<6>[328793.793703] [10334] 1000 10334 68171 2990 102 2648 300 chrome

...that means it's not REALLY out of memory, because it should just be able to kill some tabs.

Also note that it dies trying to kill the Compositor:

<3>[328793.793830] Out of memory: Kill process 23732 (Compositor) score 574 or sacrifice child
<0>[328802.627654] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3

...which is one of the tasks with a really weird OOM score.
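For intuition about why those oom_score_adj values matter, here is a rough Python model of how the kernel turns the table columns above (rss, swapents, nr_ptes, oom_score_adj) into a kill score. This is a simplified sketch loosely based on the v3.10-era oom_badness() heuristic, not the real code; the totalpages value is illustrative and the real kernel also handles unkillable tasks, capability bonuses, and other details omitted here.

```python
# Simplified model of the v3.10-era oom_badness() heuristic (for intuition
# only): the score is roughly the task's resident pages plus swap entries
# plus page-table pages, biased by oom_score_adj in units of
# (total system pages) / 1000.  Higher score = killed first.

def oom_badness(rss, swapents, nr_ptes, oom_score_adj, totalpages):
    """Return an approximate OOM badness score for a task."""
    points = rss + swapents + nr_ptes
    points += oom_score_adj * totalpages // 1000
    return max(points, 0)

# Two processes with figures taken from the OOM table above (totalpages is
# an illustrative guess for a 2 GB machine with 4 KB pages): the process
# with the larger oom_score_adj (533 vs 300) scores far higher even though
# its memory footprint is similar.
totalpages = 512 * 1024
renderer = oom_badness(rss=6464, swapents=43793, nr_ptes=308,
                       oom_score_adj=300, totalpages=totalpages)
suspect = oom_badness(rss=7045, swapents=35064, nr_ptes=286,
                      oom_score_adj=533, totalpages=totalpages)
```

In this model each unit of oom_score_adj is worth totalpages/1000 points, so the unusual adjustment values in the table dominate the selection even when actual memory usage is comparable.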
Nov 2 2016
Sameer/Stephane, who should be able to evaluate this further?
Nov 3 2016
Albert, are you aware of recent spikes in Chrome memory leaks, which dianders was suspecting in c#7 as a possible theory?
Nov 3 2016
Possibly triggered by bug 624456 or b/32464369. bccheng@, any thoughts here?
Jan 6 2017
Reproducible in 9000.50.0, 56.0.2924.53 on Kevin when closing/opening the lid with an external monitor connected via an Apple Type-C adapter.

Crash IDs:
15b2ae9080000000
6ce16e9080000000
Jan 6 2017
@14: Why do you think your crashes have anything to do with the other crashes reported here?

15b2ae9080000000: <3>[ 600.115145] INFO: task rockchip_drm_at:151 blocked for more than 120 seconds.
6ce16e9080000000: <3>[ 960.108733] INFO: task rockchip_drm_at:161 blocked for more than 120 seconds.

Neither of those has anything to do with memory pressure that I'm aware of. Please file a new bug.
Feb 18 2017
Issue 679855 has been merged into this issue.
Mar 8 2017
Could this bug be causing crashes on the Asus C202SA Chromebook (2048 MB RAM)? We are receiving numerous crashes from this device.

https://drive.google.com/a/google.com/file/d/0B-g52zibXA02UHNNVWF1cXBnUVU/view?usp=sharing
Mar 10 2017
Issue 700403 has been merged into this issue.
Mar 24 2017
This is the top crasher for M57 Stable right now, with close to 1286 reports. See details below. bccheng@, can you please take a look?

https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20product.version%3D%279202.56.1%27%20AND%20stable_signature%3D%27UnspecifiedSignature%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
Mar 24 2017
I _might_ be debugging a similar issue in bug #702707. We'll see...
Mar 27 2017
Issue 700372 has been merged into this issue.
Mar 30 2017
This crash has gone up from 15.62% in Stable 9202.56.1 to 29.17% in 9202.60.0 (almost doubled).

https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20product.version%20in%20(%279202.56.1%27%2C%279202.60.0%27)&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=&compProp=product.Version&v1=9202.56.1&v2=9202.60.0

Ben, are you looking at this crash? If you are not the right owner, can you point to someone else who can take a look?
Mar 31 2017
This is probably the same as bug #702707. I've got patches to address that bug and am working on porting them to various kernels. Right now I've got it in 4.4 and 3.14.
Apr 1 2017
As per the analysis in bug #702707, the "Compositor" process probably often ends up blocked somewhere in the kernel. Then, when we choose it as an OOM victim, it doesn't die. Before the fixes in bug #702707, this refusal to die would wedge the whole system with a lockup, as seen here. After the fixes, this refusal to die will no longer wedge the whole system; we'll instead pick some other task to kill. That will probably fix this bug.

--

I was a bit curious about what this "Compositor" process was, since it doesn't show up normally in a "ps aux". If someone on this thread wants to comment more about it, that would be interesting. From what I can tell from quick poking, the "Compositor" process seems to be a short-lived process that's spawned sometimes, like when I create a new tab.

It's unclear (to me) if the OOM killer should really be killing the "Compositor". Possibly it's getting killed because:
a) This child inherits the oom score from the parent
b) This child might share a "->mm" with the parent. In this case the OOM killer tries to kill a child instead of the parent (in the hopes that it can save the parent, since the child cannot survive without a parent).

If this is the case, then likely my patch is the best we can do, and likely the parent will be killed shortly after the child fails to die.

---

NOTE that I put a delay during the spawning of the Compositor process (patching __set_task_comm) and then I straced it. It looks like it spends a bunch of time waiting on a futex.
Here's the trace:

# strace -p 4833
strace: Process 4833 attached
gettid() = 5
gettid() = 5
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=86353890}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=86986807}) = 0
futex(0xffcf5164, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xffcf5144, 2) = 1
futex(0xffcf5144, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=87659974}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=87810474}) = 0
futex(0xffcf5164, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xffcf5144, 2) = 1
futex(0xffcf5144, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb95f94e4, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=92023307}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=92366890}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=92625307}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=123, tv_nsec=92830932}) = 0
futex(0xeb7fece4, FUTEX_WAIT_PRIVATE, 1, NULL  < long delay here >  ) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=186056067}) = 0
futex(0xeb7fecc4, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=187452275}) = 0
futex(0xffcf459c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xffcf457c, 2) = 1
futex(0xffcf457c, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb96279c4, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=188572275}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=188773817}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=188905067}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=189098733}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=189319233}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=189448442}) = 0
futex(0xeb7fece4, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=191274567}) = 0
futex(0xeb7fecc4, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=191495358}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=191634192}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=143, tv_nsec=191736567}) = 0
futex(0xeb7fece4, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=662066784}) = 0
futex(0xeb7fecc4, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=662558534}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=662716909}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=662842909}) = 0
futex(0xffcf4b04, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xffcf4ae4, 2) = 1
futex(0xffcf4ae4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb96279c4, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=663396492}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=663540284}) = 0
futex(0xffcf4b04, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xffcf4ae4, 2) = 1
futex(0xffcf4ae4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb96279c4, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=664300367}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=161, tv_nsec=664496367}) = 0
futex(0xeb7fece4, FUTEX_WAIT_PRIVATE, 1, NULL) = ?
+++ exited with 0 +++

===

If it really is that we were trying to wait on a mutex, then we're all good. The kernel looks like it will properly unblock the child once the holder of the futex is killed, so we won't end up in permanent zombie mode.

===

Leaving this open for a little bit to see if anyone from graphics might have thoughts. ...but otherwise we can probably close it as fixed once the fixes for bug #702707 land.
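The "sacrifice child" wording in the OOM kill message can be sketched as follows. This is a simplified, hypothetical model of the v3.10-era oom_kill_process() child-selection step; the Task type and field names are illustrative stand-ins for the kernel's task_struct, not real kernel APIs. One detail worth noting: in that code, children sharing the parent's ->mm are skipped rather than sacrificed, which is reflected below.

```python
from collections import namedtuple

# Illustrative stand-in for task_struct: 'mm' is an opaque token standing
# in for the address space, 'badness' for the precomputed OOM score.
Task = namedtuple("Task", ["name", "mm", "badness", "children"])

def pick_oom_victim(chosen):
    """Simplified model of oom_kill_process(): before killing the chosen
    task, try to sacrifice its highest-scoring child that does NOT share
    the parent's mm, in the hope of saving the parent."""
    victim = chosen
    best = 0
    for child in chosen.children:
        if child.mm != chosen.mm and child.badness > best:
            best = child.badness
            victim = child
    return victim

# Hypothetical scenario mirroring the log: a "Compositor" child of a
# chrome process gets picked instead of its parent.
compositor = Task("Compositor", mm="mm_child", badness=574, children=[])
parent = Task("chrome", mm="mm_parent", badness=533, children=[compositor])
```

Under this model, any eligible child beats the parent, so the Compositor is chosen even before comparing its score to the parent's, consistent with "score 574 or sacrifice child" in the panic log.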
Apr 10 2017
Issue 709945 has been merged into this issue.
Apr 10 2017
We're still seeing the issue on minnie in 59.0.3065.0 / 9448.0.0.
Apr 10 2017
Doug, it looks like the bug merged in #25 has the same root cause. PTAL.
Apr 10 2017
@26: Grace: can you please give me some pointers to crashes that are on R59?
Apr 10 2017
Specifically note that I tried searching crash for minnie kernel crashes on 9448.0.0. I found 3.
1. f2f90cb610000000: Looks like some yet unknown CPU errata. CPU0 is wedged on PC 0x2d64ad24, which is userspace.
Since CPU0 is wedged we get various errors about interrupts not being serviced, too.
2. 77ee0a8c80000000: Looks like the same CPU errata. CPU3 wedged on PC 0x372c7d24.
3. 66fda20790000000: stuck_netdevice. b/35578769
So if you have examples of cases where you think we're having trouble freeing up memory, please point me at them.
Apr 10 2017
Forked #1 and #2 to bug #710131
Apr 10 2017
The first 2 crashes in #29 are the R59 ones. I saw the magic signature and thought it was the same root cause. Looks like it's not.
Apr 10 2017
Assigning this back to Ben. Please close if the original issue was fixed.
Apr 26 2017
Observed this crash on Jerry 59.0.3071.25/9460.11.0 dev while sending a feedback report.

Crash ID: d0fec22e80000000
Apr 27 2017
@33: Not all "watchdog" crashes are caused by low memory. In your case, I see:

<4>[ 698.956853] SMP: failed to stop secondary CPUs
<3>[ 698.956868] CPU1 PC: <38506080> 0x38506080

...so you should star bug #710131 and follow that.

---
Marking this bug as a dupe, which effectively closes this bug.
Sep 25 2017
Issue 755938 has been merged into this issue.
Comment 1 by djkurtz@chromium.org, Sep 22 2016
Components: OS>Kernel
Labels: Performance-Memory
Status: Available (was: Untriaged)
Attachment: 127 KB