base_unittests PartitionAllocTest.GenericAllocGetSize, PartitionAllocTest.GenericAllocSizes, PartitionAllocTest.RepeatedReturnNullDirect failing on chromecast |
|||||
Issue descriptionChromecast devices have limited memory (most have <= 512MB total, and the kernel reserves a fair amount of that). [ RUN ] PartitionAllocTest.GenericAllocGetSize Received signal 4 <unknown> 0000b65e529a #0 0x0000b6c290e4 <unknown> #1 0x0000b6c29440 <unknown> #2 0x0000b62f21a0 <unknown> #3 0x0000b65e529a <unknown> #4 0x0000b6c9ad62 <unknown> #5 0x0000b66a314c <unknown> #6 0x0000b6cdf75a <unknown> #7 0x0000b6cdf810 <unknown> #8 0x0000b6cdf988 <unknown> #9 0x0000b6ce2104 <unknown> #10 0x0000b6ce22a4 <unknown> #11 0x0000b6caff0c <unknown> #12 0x0000b6cb75d8 <unknown> #13 0x0000b6cb7a08 <unknown> #14 0x0000b6648c88 <unknown> #15 0x0000b62e1308 __libc_start_main [end of stack trace] [2248/2248] PartitionAllocTest.GenericAllocGetSize (CRASHED) [ RUN ] PartitionAllocTest.RepeatedReturnNullDirect ../../base/allocator/partition_allocator/partition_alloc_unittest.cc:1269: Failure Value of: ptrs[0] Actual: false Expected: true [ FAILED ] PartitionAllocTest.RepeatedReturnNullDirect (2 ms) Need to figure out a way to not allocate way too much memory on these low memory devices.
,
Dec 17 2016
,
Dec 17 2016
,
Dec 19 2016
mbjorge@ can you please provide a link to the failing bot step? gn args and actual stack traces would have been really helpful here. You just pasted a bunch of hex addresses here that, in general. are quite unactionable. moving to palmer who introduced PA in crrev.com/2518253002 . This is the same test that failed in CQ on N5. Looks like the test tries to reserve 6 GB in chunks of 256 M and assumes that the first chunk (but not the others) will always succeed. which doesn't seem to happen on Chromecast. Since palmer is OOO I will fire a CL to lower down this chunk size in the meanwhile to unblock this. > Chromecast devices have limited memory (most have <= 512MB total, and the kernel reserves a fair amount of that). Interestingly, I think this is just about virtual address space reservations, isn't really committing any memory. my reading of the situation is that chromecast address space is so fragmented that PA cannot find a virtually contiguous 128 M region. If this is true, regardless of this test, it means that chromecast might experience lot of slowdowns on mmap() due to this fragmentation and can cause crashes in subsystems like v8 that require large-ish virtually contiguous ranges. Unless, for some obscure reason, the issue is that chromecast BSP has disabled memory overcommit. Again, this has nothing to do with the physical amount of memory available. You should file an internal bug to follow up on this.
,
Dec 19 2016
https://codereview.chromium.org/2589813002/ should fix this
,
Dec 19 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3ac3dbf4347193119fa3f31db087fd87db2e99dd commit 3ac3dbf4347193119fa3f31db087fd87db2e99dd Author: primiano <primiano@chromium.org> Date: Mon Dec 19 16:02:55 2016 base/allocator/partition_alloc: make unittests lowmem friendly Partition Alloc, which has been recently moved to base/, has tests that rely on the ability to successfully make large (512 MB) allocations. Turns out this is not possible on small devices like ChromeCast, that seem to either have little overcommit margins or a heavily fragmented address space. This CL skips those tests on devices that have < 2 GiB of physical memory. For the same reason, it lowers down the chunk size of the RepeatedReturnNullDirect test. The rationale of the test is to check that repeated failures are handled consistently when performing directly mapped allocations. The chunk size is irrelevant as long as it is big enough to hit the directly-mmap codepaths (current threshold = 1 MB). BUG= 675255 Review-Url: https://codereview.chromium.org/2589813002 Cr-Commit-Position: refs/heads/master@{#439484} [modify] https://crrev.com/3ac3dbf4347193119fa3f31db087fd87db2e99dd/base/allocator/partition_allocator/partition_alloc_unittest.cc
,
Dec 19 2016
+Restrict-View-Google Apologies for the crappy initial bug description. The test is not failing on a chrome/chromium infra bot, but on the devices in the Chromecast test infra. Here's a sponge link to the data: https://sponge.corp.google.com/target?id=19addf74-7ee3-4e84-b29f-2bc670d2eb64&target=chromecast%2FChromecastUnitTests_chorizo_user&searchFor= (The Test Log tab links to a folder in GCS that has the actual test logs. I added you as a reader on test_detail_base_unittests.txt which is the relevant log file). And here are the gn args (-vendor specific flags): is_chromecast=true is_component_ffmpeg=true target_os="linux" symbol_level=1 arm_float_abi="hard" target_cpu="arm" target_sysroot="path/to/chromecast/build_sysroot" use_sysroot=false v8_snapshot_toolchain="//build/toolchain/linux:clang_x86_v8_arm" is_clang=false custom_toolchain="//chromecast/internal/build/toolchain:arm" is_debug=false dcheck_always_on=true cast_is_debug=true use_goma=true optimize_for_size=true enable_widevine=true openssl_config="target" openssl_target="//third_party/boringssl" protobuf_config="target" protoc_target="//third_party/protobuf:protoc" protobuf_target="//third_party/protobuf:protobuf_lite" enable_hls_sample_aes=true use_playready=true cast_product_type=1 chromecast_branding="google" For some reason when I try to reproduce the crash on my device, gtest is not printing out the stacktrace for me. I'm working with the test team to see if we can get more useful stacktraces from the test infra in the future.
,
Dec 19 2016
more importantly, does https://codereview.chromium.org/2589813002 fix this?
,
Dec 20 2016
It appears https://codereview.chromium.org/2589813002 fixed some of the issues but not all of the issues. 5 test runs have come back from various devices. In all of these, PartitionAllocTest.GenericAllocGetSize and PartitionAllocTest.GenericAllocSizes are passing now (whereas they were failing 100% before), so these I would say are fixed (Woo! Thanks!). On 2/5 runs, RepeatedReturnNull failed with CRASHED. On 1/5 runs, RepeatedReturnNullDirect failed with CRASHED. So I would say that RepeatedReturnNullDirect and RepeatedReturnNull were not fixed, though perhaps this matches the expectations given the discussion on the device overcommit_memory setting on the devices? -- Going back through some previous test runs across various devices, I found that depending on which device and whether it was user or eng build, that RepeatedReturnNull, RepeatedReturnNullDirect, both, or neither would fail each time. (Apologies, I did not happen to look at one of the variants where RepeatedReturnNull was failing in the original bug description, so it wasn't in the original list.) On one of the platform/variants that RepeatedReturnNullDirect was failing on, it did not fail this time. But on another platform that it used to fail on, it is still failing on.
,
Dec 20 2016
(And internal b/33746305 was filed for #4)
,
Dec 20 2016
This smells like what was happening to android: those two tests hit the bar where they don't fail the mmap reservation but the process is killed before by the OOM killer (do you see anything related in logs/dmesg?). Is there a way to detect chromecast at build time (so we can disable those tests there)? I really don't want to disable these tests on Linux because being able to handle properly out of virtual address space situations is something i'd really like to keep coverage on. An alternative option would be fiddling around with /proc/self/oom_score & co, but that might require root. Alternatively, I think this could be solved if the test harness that runs the test could create a memory cgroup where the oom killer is disabled (/cgroup/memory/XXX/memory. oom_kill_disable). See https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
,
Dec 20 2016
It does appear to be getting killed the OOM killer [0]. The logs from gtest don't show anything useful (they just say the test crashed). In GN there is a is_chromecast flag (https://cs.chromium.org/chromium/src/build/config/chromecast_build.gni?q=is_chromecast&l=11) available, but I do not think there is a define available. We can just disable it by passing in a gtest exlude filter, which is a thing we have to do for other tests as well. (https://cs.chromium.org/chromium/src/chromecast/BUILD.gn?q=chromecast/B&sq=package:chromium&l=133) [0] Running ./base_unittests --gtest_filter=PartitionAllocTest.* on a device results in the following in dmesg: <4>[ 8222.207751] base_unittests invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0 <4>[ 8222.207764] Backtrace: <4>[ 8222.207798] [<c0011a8c>] (dump_backtrace+0x0/0x10c) from [<c03f1788>] (dump_stack+0x18/0x1c) <4>[ 8222.207804] r6:c2926010 r5:000200da r4:c2926000 r3:00000000 <4>[ 8222.207828] [<c03f1770>] (dump_stack+0x0/0x1c) from [<c013bb94>] (dump_header.isra.16+0x70/0x194) <4>[ 8222.207842] [<c013bb24>] (dump_header.isra.16+0x0/0x194) from [<c013bfe0>] (oom_kill_process+0x244/0x420) <4>[ 8222.207853] [<c013bd9c>] (oom_kill_process+0x0/0x420) from [<c013c6e8>] (out_of_memory+0x2c4/0x2f0) <4>[ 8222.207868] [<c013c424>] (out_of_memory+0x0/0x2f0) from [<c0140730>] (__alloc_pages_nodemask+0x698/0x6b0) <4>[ 8222.207886] [<c0140098>] (__alloc_pages_nodemask+0x0/0x6b0) from [<c0156aec>] (handle_pte_fault+0x304/0x42c) <4>[ 8222.207899] [<c01567e8>] (handle_pte_fault+0x0/0x42c) from [<c0156cb8>] (handle_mm_fault+0xa4/0xd4) <4>[ 8222.207914] [<c0156c14>] (handle_mm_fault+0x0/0xd4) from [<c0016ecc>] (do_page_fault+0x1a0/0x3e0) <4>[ 8222.207927] [<c0016d2c>] (do_page_fault+0x0/0x3e0) from [<c0008480>] (do_DataAbort+0x40/0xa0) <4>[ 8222.207975] [<c0008440>] (do_DataAbort+0x0/0xa0) from [<c000e0f8>] (__dabt_usr+0x38/0x40) <4>[ 8222.207984] Exception stack(0xc2927fb0 to 0xc2927ff8) <4>[ 8222.207993] 7fa0: b1804010 abababab 00490008 b3374000 <4>[ 8222.208004] 7fc0: b1804000 02000020 02000020 bec43540 bec43594 b1804010 b1801060 00000340 <4>[ 8222.208014] 7fe0: abababab bec43538 b689c4bd b653cf90 20030010 ffffffff <4>[ 8222.208019] r8:bec43594 r7:bec43540 r6:ffffffff r5:20030010 r4:b653cf90 <4>[ 8222.208033] Mem-info: <4>[ 8222.208039] Normal per-cpu: <4>[ 8222.208046] CPU 0: hi: 90, btch: 15 usd: 28 <4>[ 8222.208062] active_anon:75189 inactive_anon:4 isolated_anon:0 <4>[ 8222.208062] active_file:0 inactive_file:3 isolated_file:0 <4>[ 8222.208062] unevictable:4545 dirty:0 writeback:0 unstable:0 <4>[ 8222.208062] free:1015 slab_reclaimable:262 slab_unreclaimable:955 <4>[ 8222.208062] mapped:1862 shmem:35 pagetables:441 bounce:0 <4>[ 8222.208062] free_cma:0 <4>[ 8222.208098] Normal free:4060kB min:4096kB low:5120kB high:6144kB active_anon:300756kB inactive_anon:16kB active_file:0kB inactive_file:12kB unevictable:18180kB isolated(anon):0kB isolated(file):0kB present:340360kB managed:333004kB mlocked:16000kB dirty:0kB writeback:0kB mapped:7448kB shmem:140kB slab_reclaimable:1048kB slab_unreclaimable:3820kB kernel_stack:904kB pagetables:1764kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1315 all_unreclaimable? yes <4>[ 8222.208104] lowmem_reserve[]: 0 0 <4>[ 8222.208112] Normal: 1*4kB (E) 3*8kB (EM) 2*16kB (UE) 1*32kB (M) 6*64kB (UM) 2*128kB (U) 3*256kB (U) 3*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 4060kB <4>[ 8222.208150] 2372 total pagecache pages <4>[ 8222.214393] 85760 pages of RAM <4>[ 8222.214404] 1248 free pages <4>[ 8222.214409] 2194 reserved pages <4>[ 8222.214414] 772 slab pages <4>[ 8222.214419] 268673 pages shared <4>[ 8222.214424] 0 pages swap cached <6>[ 8222.214429] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name <6>[ 8222.214450] [ 476] 0 476 579 176 3 0 0 ueventd <6>[ 8222.214464] [ 733] 1005 733 581 178 4 0 0 logwrapper <6>[ 8222.214474] [ 736] 0 736 126516 4390 154 0 0 PE_Single_CPU <6>[ 8222.214485] [ 750] 1008 750 1167 257 6 0 0 wpa_supplicant <6>[ 8222.214495] [ 751] 1012 751 7007 180 7 0 0 panel_service <6>[ 8222.214505] [ 786] 0 786 7818 292 9 0 0 net_mgr <6>[ 8222.214516] [ 796] 1013 796 603 185 4 0 0 dhcpcd <6>[ 8222.214525] [ 798] 1003 798 668 219 4 0 0 sntpd <6>[ 8222.214535] [ 857] 0 857 642 196 4 0 0 sysmond <6>[ 8222.214544] [ 858] 0 858 566 136 4 0 0 sh <6>[ 8222.214553] [ 859] 0 859 2614 169 4 0 0 adbd <6>[ 8222.214563] [ 861] 0 861 660 182 4 0 0 sshd <6>[ 8222.214573] [ 1069] 0 1069 2927 197 5 0 -1000 watchdog <6>[ 8222.214583] [ 1070] 1005 1070 581 177 4 0 0 logwrapper <6>[ 8222.214593] [ 1071] 1005 1071 581 177 4 0 0 logwrapper <6>[ 8222.214602] [ 1073] 0 1073 566 138 4 0 0 mount_usb_drive <6>[ 8222.214612] [ 1075] 0 1075 21662 380 16 0 0 collectd <6>[ 8222.214622] [ 1079] 0 1079 4920 186 6 0 0 iperf <6>[ 8222.214631] [ 1080] 0 1080 4920 177 6 0 0 iperf <6>[ 8222.214641] [ 2699] 0 2699 698 200 4 0 0 sshd <6>[ 8222.214650] [ 2706] 0 2706 566 152 4 0 0 sh <6>[ 8222.214660] [ 2817] 0 2817 5912 801 12 0 0 base_unittests <6>[ 8222.214671] [ 2829] 0 2829 612 136 4 0 0 sleep <6>[ 8222.214681] [ 2830] 0 2830 80403 73197 162 0 0 base_unittests <3>[ 8222.214689] Out of memory: Kill process 2830 (base_unittests) score 847 or sacrifice child <3>[ 8222.223619] Killed process 2830 (base_unittests) total-vm:321612kB, anon-rss:292052kB, file-rss:736kB
,
Dec 27 2016
Yes, we should just disable these tests on Chromecast, for the same reason we did for Android. What's the relevant compile-time constant? I don't seem to see an OS_CHROMECAST or similar #define.
,
Dec 29 2016
There explicitly isn't one, and this is kinda playing whackamole with configurations this test does not work on. Is there something better we can do?
,
Dec 29 2016
Well, we could take the opposite approach: Explicitly enable it only for large, powerful platforms (desktop, basically). But I don't think (?) we have explicit #defines for that; OS_MAC and OS_WIN would work, but "defined(OS_LINUX) && !defined(OS_ANDROID)" would still include Chromecast, right? So we'd be back at square 1 for Linux-based things, at least. Another argument is that these tests are fundamentally flaky, and maybe we should disable them. I personally hesitate to make that argument, though.
,
Jan 3 2017
Is there anything the tests could check at runtime so they could just no-op on platforms/devices they wouldn't work on? We could also add a new define (ENABLE_PARTITION_ALLOC_OOM_TESTS or something) just for base_unittests that is gated on (is_mac || is_win || (is_linux && !is_android && !is_chromecast))
,
Jan 3 2017
Well, again, there is no is_chromecast intentionally. I am feeling of the opinion that these tests are flaky, but some other ideas: 1) Can they be changed to be less flaky.. any ideas? 2) Runtime checks for the tests to early out on - I hate early out tests tho tbf. 3) Split these into a separate test binary so that we can choose to run them on which bots.
,
Jan 3 2017
There is an is_chromecast, it's just not one of the OS definitions set in the BUILDCONFIG : https://cs.chromium.org/chromium/src/build/config/chromecast_build.gni?type=cs&q=%22is_chromecast+%3D%22&sq=package:chromium&l=11 2) I agree that runtime checks are not great. 3) I'd be wary that more tests might get added to the separate binary if they are logically similar, and then we could potentially get the same problem, where some tests don't work and some tests do, or where tests that would work on a platform consistently are not getting tested because the binary is not getting run.
,
Jan 3 2017
Do we want to use partitionalloc on platforms where these tests can't run?
,
Jan 3 2017
#19: I'd say yes, we potentially might (even if we don't now). I think the problem is entirely local to these specific tests, which are by design OOMy and hence flaky. It's only now that have noticed; it's not a flaw inherent to PA.
,
Jan 4 2017
Re #19: yes. Those test cover safety mechanisms (proper handling of running out of virtual address space) that must be triggered on platforms that don't have aggressive system-level OOM handling. IMHO this is quite important for security purposes and I'd be a bit worries about just disabling those tests. This is the kind of regressions that would go unnoticed in production code, as those code paths are triggered very rarely, but could become an attack vector if they regress. The key of the problem here is: 1. those test will fail on platforms that do system-level handling because the system kills the process before we manage to see the unavailability of address space in our process. 2. it is not easy to detect, at least at compile time, which platforms do system-level handling. With regards to 2, previously (in blink) we were relying on OS_ANDROID for this purpose. But now that we moved this to base this is not sufficient anymore because of chromecast My suggestions here are: A as a band-aid use the runtime heuristic I introduced in #6 for those other tests I missed. In other words, make those tests a noop if !IsLargeMemoryDevice(). B In the long term, would be nice if we could make this less heuristic by disabling OOM. This can be achieved either: 1. Disabling (maybe temporarily) OOM kill system wide via sysctl "vm.oom-kill = 0" 2. Associating the test process to a memory cgroup that has memory.oom_control Both A and B concretely would be changes to the test launching harness.
,
Jan 4 2017
Thanks for thinking more about this. Using IsLargeMemoryDevice() with a TODO pointing to a bug for disabling the OOM killer sounds great. Disabling the OOM killer sounds like a good idea since we're trying to test that, and even on platforms we can be sure that we do something good even if the OOM killer malfunctions then.
,
Jan 5 2017
CL for tests: https://codereview.chromium.org/2618853002 OOM-killer-disable bug: https://bugs.chromium.org/p/chromium/issues/detail?id=678782
,
Jan 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e46baf31c8607e58385f55a79b31e941d29e204f commit e46baf31c8607e58385f55a79b31e941d29e204f Author: palmer <palmer@chromium.org> Date: Fri Jan 06 01:45:04 2017 Don't run memory-intensive tests on low-memory devices. It makes the tests flaky. Add a TODO to disable OOM killing for these tests, where possible. BUG= 675255 Review-Url: https://codereview.chromium.org/2618853002 Cr-Commit-Position: refs/heads/master@{#441802} [modify] https://crrev.com/e46baf31c8607e58385f55a79b31e941d29e204f/base/allocator/partition_allocator/partition_alloc_unittest.cc
,
Jan 6 2017
,
Jan 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7223079dda8d234c24aed43d249bf27ece1e30ba commit 7223079dda8d234c24aed43d249bf27ece1e30ba Author: primiano <primiano@chromium.org> Date: Fri Jan 06 22:37:03 2017 PartitionAllocator: relax condition that skips test on lowmem devices Some of the exclusions introduced by crrev.com/2618853002 to deal with OOM flakiness on lowmem devices are too aggressive and were already dealt with by crrev.com/2589813002 . Effectively this is a partial revert of crrev.com/2618853002 . BUG= 675255 Review-Url: https://codereview.chromium.org/2616063002 Cr-Commit-Position: refs/heads/master@{#442075} [modify] https://crrev.com/7223079dda8d234c24aed43d249bf27ece1e30ba/base/allocator/partition_allocator/partition_alloc_unittest.cc |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by mbjorge@google.com
, Dec 17 2016