New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 675255 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

base_unittests PartitionAllocTest.GenericAllocGetSize, PartitionAllocTest.GenericAllocSizes, PartitionAllocTest.RepeatedReturnNullDirect failing on chromecast

Project Member Reported by mbjorge@google.com, Dec 16 2016

Issue description

Chromecast devices have limited memory (most have <= 512MB total, and the kernel reserves a fair amount of that).

[ RUN ] PartitionAllocTest.GenericAllocGetSize Received signal 4 <unknown> 0000b65e529a #0 0x0000b6c290e4 <unknown> #1 0x0000b6c29440 <unknown> #2 0x0000b62f21a0 <unknown> #3 0x0000b65e529a <unknown> #4 0x0000b6c9ad62 <unknown> #5 0x0000b66a314c <unknown> #6 0x0000b6cdf75a <unknown> #7 0x0000b6cdf810 <unknown> #8 0x0000b6cdf988 <unknown> #9 0x0000b6ce2104 <unknown> #10 0x0000b6ce22a4 <unknown> #11 0x0000b6caff0c <unknown> #12 0x0000b6cb75d8 <unknown> #13 0x0000b6cb7a08 <unknown> #14 0x0000b6648c88 <unknown> #15 0x0000b62e1308 __libc_start_main [end of stack trace] [2248/2248] PartitionAllocTest.GenericAllocGetSize (CRASHED)

	
 [ RUN      ] PartitionAllocTest.RepeatedReturnNullDirect
../../base/allocator/partition_allocator/partition_alloc_unittest.cc:1269: Failure
Value of: ptrs[0]
  Actual: false
Expected: true
[  FAILED  ] PartitionAllocTest.RepeatedReturnNullDirect (2 ms)

Need to figure out a way to not allocate way too much memory on these low memory devices.
 

Comment 1 by mbjorge@google.com, Dec 17 2016

Perhaps the test could utilize base::SysInfo::IsLowEndDevice?https://cs.chromium.org/chromium/src/base/sys_info.cc?sq=package:chromium&dr=CSs&rcl=1481920154&l=39

Comment 2 by danakj@chromium.org, Dec 17 2016

Owner: primiano@chromium.org
Status: Assigned (was: Untriaged)

Comment 3 by danakj@chromium.org, Dec 17 2016

Cc: gab@chromium.org
Owner: palmer@chromium.org
mbjorge@ can you please provide a link to the failing bot step?
gn args and actual stack traces would have been really helpful here. You just pasted a bunch of hex addresses here that, in general. are quite unactionable.

moving to palmer who introduced PA in crrev.com/2518253002 . 
This is the same test that failed in CQ on N5.
Looks like the test tries to reserve 6 GB in chunks of 256 M and assumes that the first chunk (but not the others) will always succeed. which doesn't seem to happen on Chromecast.


Since palmer is OOO I will fire a CL to lower down this chunk size in the meanwhile to unblock this.

> Chromecast devices have limited memory (most have <= 512MB total, and the kernel reserves a fair amount of that).
Interestingly, I think this is just about virtual address space reservations, isn't really committing any memory. 
my reading of the situation is that chromecast address space is so fragmented that PA cannot find a virtually contiguous 128 M region.
If this is true, regardless of this test, it means that chromecast might experience lot of slowdowns on mmap() due to this fragmentation and can cause crashes in subsystems like v8 that require large-ish virtually contiguous ranges. Unless, for some obscure reason, the issue is that chromecast BSP has disabled memory overcommit.
Again, this has nothing to do with the physical amount of memory available. You should file an internal bug to follow up on this.

Project Member

Comment 6 by bugdroid1@chromium.org, Dec 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3ac3dbf4347193119fa3f31db087fd87db2e99dd

commit 3ac3dbf4347193119fa3f31db087fd87db2e99dd
Author: primiano <primiano@chromium.org>
Date: Mon Dec 19 16:02:55 2016

base/allocator/partition_alloc: make unittests lowmem friendly

Partition Alloc, which has been recently moved to base/, has tests
that rely on the ability to successfully make large (512 MB)
allocations. Turns out this is not possible on small devices like
ChromeCast, that seem to either have little overcommit margins or
a heavily fragmented address space.

This CL skips those tests on devices that have < 2 GiB of physical
memory. For the same reason, it lowers down the chunk size of the
RepeatedReturnNullDirect test. The rationale of the test is to check
that repeated failures are handled consistently when performing
directly mapped allocations. The chunk size is irrelevant as long as
it is big enough to hit the directly-mmap codepaths (current
threshold = 1 MB).

BUG= 675255 

Review-Url: https://codereview.chromium.org/2589813002
Cr-Commit-Position: refs/heads/master@{#439484}

[modify] https://crrev.com/3ac3dbf4347193119fa3f31db087fd87db2e99dd/base/allocator/partition_allocator/partition_alloc_unittest.cc

Comment 7 by mbjorge@google.com, Dec 19 2016

+Restrict-View-Google 

Apologies for the crappy initial bug description. The test is not failing on a chrome/chromium infra bot, but on the devices in the Chromecast test infra. 

Here's a sponge link to the data: https://sponge.corp.google.com/target?id=19addf74-7ee3-4e84-b29f-2bc670d2eb64&target=chromecast%2FChromecastUnitTests_chorizo_user&searchFor= (The Test Log tab links to a folder in GCS that has the actual test logs. I added you as a reader on test_detail_base_unittests.txt which is the relevant log file).

And here are the gn args (-vendor specific flags):

is_chromecast=true
is_component_ffmpeg=true
target_os="linux"
symbol_level=1
arm_float_abi="hard"
target_cpu="arm"
target_sysroot="path/to/chromecast/build_sysroot"
use_sysroot=false
v8_snapshot_toolchain="//build/toolchain/linux:clang_x86_v8_arm"
is_clang=false
custom_toolchain="//chromecast/internal/build/toolchain:arm"
is_debug=false
dcheck_always_on=true
cast_is_debug=true
use_goma=true
optimize_for_size=true
enable_widevine=true
openssl_config="target"
openssl_target="//third_party/boringssl"
protobuf_config="target"
protoc_target="//third_party/protobuf:protoc"
protobuf_target="//third_party/protobuf:protobuf_lite"
enable_hls_sample_aes=true
use_playready=true
cast_product_type=1
chromecast_branding="google"


For some reason when I try to reproduce the crash on my device, gtest is not printing out the stacktrace for me. I'm working with the test team to see if we can get more useful stacktraces from the test infra in the future.

more importantly, does https://codereview.chromium.org/2589813002 fix this? 

Comment 9 by mbjorge@google.com, Dec 20 2016

It appears https://codereview.chromium.org/2589813002 fixed some of the issues but not all of the issues.

5 test runs have come back from various devices. In all of these, PartitionAllocTest.GenericAllocGetSize and PartitionAllocTest.GenericAllocSizes are passing now (whereas they were failing 100% before), so these I would say are fixed (Woo! Thanks!).

On 2/5 runs, RepeatedReturnNull failed with CRASHED.
On 1/5 runs, RepeatedReturnNullDirect failed with CRASHED.

So I would say that RepeatedReturnNullDirect and RepeatedReturnNull were not fixed, though perhaps this matches the expectations given the discussion on the device overcommit_memory setting on the devices?

--

Going back through some previous test runs across various devices, I found that depending on which device and whether it was user or eng build, that RepeatedReturnNull, RepeatedReturnNullDirect, both, or neither would fail each time. (Apologies, I did not happen to look at one of the variants where RepeatedReturnNull was failing in the original bug description, so it wasn't in the original list.)

On one of the platform/variants that RepeatedReturnNullDirect was failing on, it did not fail this time. But on another platform that it used to fail on, it is still failing on.


Comment 10 by mbjorge@google.com, Dec 20 2016

(And internal b/33746305 was filed for #4)
This smells like what was happening to android: those two tests hit the bar where they don't fail the mmap reservation but the process is killed before by the OOM killer (do you see anything related in logs/dmesg?).
Is there a way to detect chromecast at build time (so we can disable those tests there)? I really don't want to disable these tests on Linux because being able to handle properly out of virtual address space situations is something i'd really like to keep coverage on.
An alternative option would be fiddling around with /proc/self/oom_score & co, but that might require root.
Alternatively, I think this could be solved if the test harness that runs the test could create a memory cgroup where the oom killer is disabled (/cgroup/memory/XXX/memory. oom_kill_disable). See https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

Comment 12 by mbjorge@google.com, Dec 20 2016

It does appear to be getting killed the OOM killer [0]. The logs from gtest don't show anything useful (they just say the test crashed).

In GN there is a is_chromecast flag (https://cs.chromium.org/chromium/src/build/config/chromecast_build.gni?q=is_chromecast&l=11) available, but I do not think there is a define available. We can just disable it by passing in a gtest exlude filter, which is a thing we have to do for other tests as well. (https://cs.chromium.org/chromium/src/chromecast/BUILD.gn?q=chromecast/B&sq=package:chromium&l=133)

[0] Running ./base_unittests --gtest_filter=PartitionAllocTest.* on a device results in the following in dmesg:

<4>[ 8222.207751] base_unittests invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
<4>[ 8222.207764] Backtrace: 
<4>[ 8222.207798] [<c0011a8c>] (dump_backtrace+0x0/0x10c) from [<c03f1788>] (dump_stack+0x18/0x1c)
<4>[ 8222.207804]  r6:c2926010 r5:000200da r4:c2926000 r3:00000000
<4>[ 8222.207828] [<c03f1770>] (dump_stack+0x0/0x1c) from [<c013bb94>] (dump_header.isra.16+0x70/0x194)
<4>[ 8222.207842] [<c013bb24>] (dump_header.isra.16+0x0/0x194) from [<c013bfe0>] (oom_kill_process+0x244/0x420)
<4>[ 8222.207853] [<c013bd9c>] (oom_kill_process+0x0/0x420) from [<c013c6e8>] (out_of_memory+0x2c4/0x2f0)
<4>[ 8222.207868] [<c013c424>] (out_of_memory+0x0/0x2f0) from [<c0140730>] (__alloc_pages_nodemask+0x698/0x6b0)
<4>[ 8222.207886] [<c0140098>] (__alloc_pages_nodemask+0x0/0x6b0) from [<c0156aec>] (handle_pte_fault+0x304/0x42c)
<4>[ 8222.207899] [<c01567e8>] (handle_pte_fault+0x0/0x42c) from [<c0156cb8>] (handle_mm_fault+0xa4/0xd4)
<4>[ 8222.207914] [<c0156c14>] (handle_mm_fault+0x0/0xd4) from [<c0016ecc>] (do_page_fault+0x1a0/0x3e0)
<4>[ 8222.207927] [<c0016d2c>] (do_page_fault+0x0/0x3e0) from [<c0008480>] (do_DataAbort+0x40/0xa0)
<4>[ 8222.207975] [<c0008440>] (do_DataAbort+0x0/0xa0) from [<c000e0f8>] (__dabt_usr+0x38/0x40)
<4>[ 8222.207984] Exception stack(0xc2927fb0 to 0xc2927ff8)
<4>[ 8222.207993] 7fa0:                                     b1804010 abababab 00490008 b3374000
<4>[ 8222.208004] 7fc0: b1804000 02000020 02000020 bec43540 bec43594 b1804010 b1801060 00000340
<4>[ 8222.208014] 7fe0: abababab bec43538 b689c4bd b653cf90 20030010 ffffffff
<4>[ 8222.208019]  r8:bec43594 r7:bec43540 r6:ffffffff r5:20030010 r4:b653cf90
<4>[ 8222.208033] Mem-info:
<4>[ 8222.208039] Normal per-cpu:
<4>[ 8222.208046] CPU    0: hi:   90, btch:  15 usd:  28
<4>[ 8222.208062] active_anon:75189 inactive_anon:4 isolated_anon:0
<4>[ 8222.208062]  active_file:0 inactive_file:3 isolated_file:0
<4>[ 8222.208062]  unevictable:4545 dirty:0 writeback:0 unstable:0
<4>[ 8222.208062]  free:1015 slab_reclaimable:262 slab_unreclaimable:955
<4>[ 8222.208062]  mapped:1862 shmem:35 pagetables:441 bounce:0
<4>[ 8222.208062]  free_cma:0
<4>[ 8222.208098] Normal free:4060kB min:4096kB low:5120kB high:6144kB active_anon:300756kB inactive_anon:16kB active_file:0kB inactive_file:12kB unevictable:18180kB isolated(anon):0kB isolated(file):0kB present:340360kB managed:333004kB mlocked:16000kB dirty:0kB writeback:0kB mapped:7448kB shmem:140kB slab_reclaimable:1048kB slab_unreclaimable:3820kB kernel_stack:904kB pagetables:1764kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1315 all_unreclaimable? yes
<4>[ 8222.208104] lowmem_reserve[]: 0 0
<4>[ 8222.208112] Normal: 1*4kB (E) 3*8kB (EM) 2*16kB (UE) 1*32kB (M) 6*64kB (UM) 2*128kB (U) 3*256kB (U) 3*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 4060kB
<4>[ 8222.208150] 2372 total pagecache pages
<4>[ 8222.214393] 85760 pages of RAM
<4>[ 8222.214404] 1248 free pages
<4>[ 8222.214409] 2194 reserved pages
<4>[ 8222.214414] 772 slab pages
<4>[ 8222.214419] 268673 pages shared
<4>[ 8222.214424] 0 pages swap cached
<6>[ 8222.214429] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
<6>[ 8222.214450] [  476]     0   476      579      176       3        0             0 ueventd
<6>[ 8222.214464] [  733]  1005   733      581      178       4        0             0 logwrapper
<6>[ 8222.214474] [  736]     0   736   126516     4390     154        0             0 PE_Single_CPU
<6>[ 8222.214485] [  750]  1008   750     1167      257       6        0             0 wpa_supplicant
<6>[ 8222.214495] [  751]  1012   751     7007      180       7        0             0 panel_service
<6>[ 8222.214505] [  786]     0   786     7818      292       9        0             0 net_mgr
<6>[ 8222.214516] [  796]  1013   796      603      185       4        0             0 dhcpcd
<6>[ 8222.214525] [  798]  1003   798      668      219       4        0             0 sntpd
<6>[ 8222.214535] [  857]     0   857      642      196       4        0             0 sysmond
<6>[ 8222.214544] [  858]     0   858      566      136       4        0             0 sh
<6>[ 8222.214553] [  859]     0   859     2614      169       4        0             0 adbd
<6>[ 8222.214563] [  861]     0   861      660      182       4        0             0 sshd
<6>[ 8222.214573] [ 1069]     0  1069     2927      197       5        0         -1000 watchdog
<6>[ 8222.214583] [ 1070]  1005  1070      581      177       4        0             0 logwrapper
<6>[ 8222.214593] [ 1071]  1005  1071      581      177       4        0             0 logwrapper
<6>[ 8222.214602] [ 1073]     0  1073      566      138       4        0             0 mount_usb_drive
<6>[ 8222.214612] [ 1075]     0  1075    21662      380      16        0             0 collectd
<6>[ 8222.214622] [ 1079]     0  1079     4920      186       6        0             0 iperf
<6>[ 8222.214631] [ 1080]     0  1080     4920      177       6        0             0 iperf
<6>[ 8222.214641] [ 2699]     0  2699      698      200       4        0             0 sshd
<6>[ 8222.214650] [ 2706]     0  2706      566      152       4        0             0 sh
<6>[ 8222.214660] [ 2817]     0  2817     5912      801      12        0             0 base_unittests
<6>[ 8222.214671] [ 2829]     0  2829      612      136       4        0             0 sleep
<6>[ 8222.214681] [ 2830]     0  2830    80403    73197     162        0             0 base_unittests
<3>[ 8222.214689] Out of memory: Kill process 2830 (base_unittests) score 847 or sacrifice child
<3>[ 8222.223619] Killed process 2830 (base_unittests) total-vm:321612kB, anon-rss:292052kB, file-rss:736kB



Yes, we should just disable these tests on Chromecast, for the same reason we did for Android. What's the relevant compile-time constant? I don't seem to see an OS_CHROMECAST or similar #define.
There explicitly isn't one, and this is kinda playing whackamole with configurations this test does not work on. Is there something better we can do?
Well, we could take the opposite approach: Explicitly enable it only for large, powerful platforms (desktop, basically). But I don't think (?) we have explicit #defines for that; OS_MAC and OS_WIN would work, but "defined(OS_LINUX) && !defined(OS_ANDROID)" would still include Chromecast, right? So we'd be back at square 1 for Linux-based things, at least.

Another argument is that these tests are fundamentally flaky, and maybe we should disable them. I personally hesitate to make that argument, though.
Is there anything the tests could check at runtime so they could just no-op on platforms/devices they wouldn't work on?

We could also add a new define (ENABLE_PARTITION_ALLOC_OOM_TESTS or something) just for base_unittests that is gated on (is_mac || is_win || (is_linux && !is_android && !is_chromecast))
Well, again, there is no is_chromecast intentionally.

I am feeling of the opinion that these tests are flaky, but some other ideas:

1) Can they be changed to be less flaky.. any ideas?
2) Runtime checks for the tests to early out on - I hate early out tests tho tbf.
3) Split these into a separate test binary so that we can choose to run them on which bots.
There is an is_chromecast, it's just not one of the OS definitions set in the BUILDCONFIG : https://cs.chromium.org/chromium/src/build/config/chromecast_build.gni?type=cs&q=%22is_chromecast+%3D%22&sq=package:chromium&l=11

2) I agree that runtime checks are not great.
3) I'd be wary that more tests might get added to the separate binary if they are logically similar, and then we could potentially get the same problem, where some tests don't work and some tests do, or where tests that would work on a platform consistently are not getting tested because the binary is not getting run.
Do we want to use partitionalloc on platforms where these tests can't run?
#19: I'd say yes, we potentially might (even if we don't now). I think the problem is entirely local to these specific tests, which are by design OOMy and hence flaky. It's only now that have noticed; it's not a flaw inherent to PA.
Re #19: yes. Those test cover safety mechanisms (proper handling of running out of virtual address space) that must be triggered on platforms that don't have aggressive system-level OOM handling. IMHO this is quite important for security purposes and I'd be a bit worries about just disabling those tests. This is the kind of regressions that would go unnoticed in production code, as those code paths are triggered very rarely, but could become an attack vector if they regress.

The key of the problem here is:
1. those test will fail on platforms that do system-level handling because the system kills the process before we manage to see the unavailability of address space in our process.
2. it is not easy to detect, at least at compile time, which platforms do system-level handling.

With regards to 2, previously (in blink) we were relying on OS_ANDROID for this purpose. But now that we moved this to base this is not sufficient anymore because of chromecast

My suggestions here are:
A as a band-aid use the runtime heuristic I introduced in #6 for those other tests I missed. In other words, make those tests a noop if !IsLargeMemoryDevice().
B In the long term, would be nice if we could make this less heuristic by disabling OOM. This can be achieved either:
 1. Disabling (maybe temporarily) OOM kill system wide via sysctl "vm.oom-kill = 0"
 2. Associating the test process to a memory cgroup that has memory.oom_control
Both A and B concretely would be changes to the test launching harness.
Thanks for thinking more about this. Using IsLargeMemoryDevice() with a TODO pointing to a bug for disabling the OOM killer sounds great. Disabling the OOM killer sounds like a good idea since we're trying to test that, and even on platforms we can be sure that we do something good even if the OOM killer malfunctions then.
Project Member

Comment 24 by bugdroid1@chromium.org, Jan 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e46baf31c8607e58385f55a79b31e941d29e204f

commit e46baf31c8607e58385f55a79b31e941d29e204f
Author: palmer <palmer@chromium.org>
Date: Fri Jan 06 01:45:04 2017

Don't run memory-intensive tests on low-memory devices.

It makes the tests flaky. Add a TODO to disable OOM killing for these tests,
where possible.

BUG= 675255 

Review-Url: https://codereview.chromium.org/2618853002
Cr-Commit-Position: refs/heads/master@{#441802}

[modify] https://crrev.com/e46baf31c8607e58385f55a79b31e941d29e204f/base/allocator/partition_allocator/partition_alloc_unittest.cc

Status: Fixed (was: Assigned)
Project Member

Comment 26 by bugdroid1@chromium.org, Jan 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7223079dda8d234c24aed43d249bf27ece1e30ba

commit 7223079dda8d234c24aed43d249bf27ece1e30ba
Author: primiano <primiano@chromium.org>
Date: Fri Jan 06 22:37:03 2017

PartitionAllocator: relax condition that skips test on lowmem devices

Some of the exclusions introduced by crrev.com/2618853002 to deal
with OOM flakiness on lowmem devices are too aggressive and were
already dealt with by crrev.com/2589813002 .
Effectively this is a partial revert of crrev.com/2618853002 .

BUG= 675255 

Review-Url: https://codereview.chromium.org/2616063002
Cr-Commit-Position: refs/heads/master@{#442075}

[modify] https://crrev.com/7223079dda8d234c24aed43d249bf27ece1e30ba/base/allocator/partition_allocator/partition_alloc_unittest.cc

Sign in to add a comment