platform_MemoryPressure unexpected behavior with kernel 4.19
Issue description:

When I run platform_MemoryPressure on a fizz with image R72-11220.0.0, I get this output:

MemTotal 16304304
Phase1DiscardCount 1
Phase1MaxPageFaultRate 6973.84278423
Phase1MemFree 264664
Phase1PageFaultRate 838.782935848
Phase1SwapFree 2871268
Phase1TabCount 71
Phase1Time 2382.15241718
Phase2DiscardCount 1
Phase2MaxPageFaultRate 10765.4426211
Phase2MemFree 285316
Phase2PageFaultRate 2636.84288747
Phase2SwapFree 2934436
Phase2TabCount 71
Phase2Time 70.0634291172
SwapTotal 23883256

but when I upgrade the kernel to 4.19, I get this output:

MemTotal 16298004
Phase1DiscardCount 1
Phase1MaxPageFaultRate 62.2182671714
Phase1MemFree 10348272
Phase1PageFaultRate 0.0
Phase1SwapFree 23874028
Phase1TabCount 20
Phase1Time 280.45522809
Phase2DiscardCount 1
Phase2MaxPageFaultRate 0.0
Phase2MemFree 10012700
Phase2PageFaultRate 0.0
Phase2SwapFree 23874028
Phase2TabCount 20
Phase2Time 70.0700008869
SwapTotal 23874028

That shows a tab discard while MemFree is still about 10 GB and no swap is being used. /sys/kernel/mm/chromeos-low_mem/available seems to be decreasing correctly and only gets as low as about 12000, so the issue doesn't seem to be on that side. I am attaching some of the test_that_results logs.
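For reference, a rough sketch of how that sysfs value can be watched on the DUT while the test runs; only the sysfs path comes from the observation above, and the poll interval and duration are arbitrary:

import time

AVAILABLE = "/sys/kernel/mm/chromeos-low_mem/available"

def watch_available(duration_s=600, interval_s=1.0):
    """Poll the chromeos-low_mem 'available' value and track the minimum seen."""
    lowest = None
    deadline = time.time() + duration_s
    while time.time() < deadline:
        with open(AVAILABLE) as f:
            value = int(f.read().split()[0])
        if lowest is None or value < lowest:
            lowest = value
        time.sleep(interval_s)
    return lowest

# Run on the DUT while the test is opening tabs, e.g.:
# print("lowest available seen:", watch_available())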
Nov 6
Nov 14
We only send a discard request to Chrome when available crosses the margin. We may compute available incorrectly, and we've had such bugs, but the condition for the low-memory notifier to fire is a simple one (available < margin).

We've seen tab discards happening without the low-memory threshold being crossed. I thought we had a bug open for this but cannot find it---if there isn't one, let's use this one, since we have a repro case.

One theory is that Chrome proactively discards tabs (possibly based on some ML algorithm) even when memory pressure is low. It's possible that some of the signals used by such an algorithm have changed with kernel 4.19. This of course assumes that the problem is only reproducible on 4.19 and there are no other differences.

If the theory is correct, we may want to revise it for Chrome OS. On other platforms, other apps can benefit from the tab discard. On Chrome OS this is less clear---even if ARC++ and/or VMs are running, we can coordinate memory usage across all components, which we cannot do on Windows/macOS.
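That condition can be checked directly on a DUT; a minimal sketch (the file names are the standard chromeos-low_mem sysfs entries, everything else is illustrative):

LOW_MEM_DIR = "/sys/kernel/mm/chromeos-low_mem"

def read_low_mem(name):
    """Read one value from a chromeos-low_mem sysfs file (first field if there are several)."""
    with open("%s/%s" % (LOW_MEM_DIR, name)) as f:
        return int(f.read().split()[0])

def below_margin():
    """True when the kernel-side low-memory condition (available < margin) holds."""
    return read_low_mem("available") < read_low_mem("margin")

# If tabs get discarded while below_margin() never returns True, the discards
# are not coming from the low-memory notifier path.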
Nov 15
re #3 -- we should be able to tell if that happened from the memd logs (like we were looking at on my other system). We'd see a clip where there's a discard but no associated low-memory condition if the image has this CL: https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1312383 Otherwise, we wouldn't see a discard in any of the clips.

If we think 4.19 is broken with respect to low-memory detection, there are lower-level tests we can run, like kernel_LowMemNotify, to see whether it's working or not.
Nov 15
If the tab discarder is triggered, there should be a string "Target memory to free:" in the log. I didn't find such a log line in memPress_4_19_results_part.tar. As discussed in crbug.com/896031, Phase1DiscardCount=1 doesn't necessarily mean there was a real tab discard.
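That check can be scripted over an extracted results tarball; a minimal sketch, where the directory name is just a placeholder for wherever the tarball was unpacked:

import os

MARKER = "Target memory to free:"

def find_marker(results_root):
    """Walk an extracted results directory and list files containing MARKER."""
    hits = []
    for dirpath, _dirs, files in os.walk(results_root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as f:
                    if MARKER in f.read():
                        hits.append(path)
            except OSError:
                pass  # skip unreadable files
    return hits

# Example (placeholder path):
# print(find_marker("memPress_4_19_results_part"))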
Nov 20
I ran kernel_LowMemNotify and it passes, so it doesn't seem to be a problem with low-memory detection. I also ran the simple version of the test and I don't see the same problem there. The output is obviously less detailed, but monitoring memory use shows that it is using the swap space. Looking at the logs, I do see "Target memory to free:" as described in #5 when I run the simple test, but not when I run the realistic version.
Dec 5
I think I have narrowed down the problem. Running the test with kernel 4.19 on R70-11018.0.0, the test runs as expected, but with R70-11019.0.0 it shows the behavior described above. With kernel 4.4 and 4.14, R70-11019.0.0 runs as expected, so I only see the behavior with kernel 4.19.

I am attaching the keyval and /var/log/messages for R70-11018.0.0 with 4.19, and for R70-11019.0.0 with 4.14 and 4.19, for comparison. The keyval for R70-11019 with 4.19 shows that we see a discard before we get to low memory. The messages for R70-11018, and for R70-11019 with 4.14, show "entering low_mem", which I don't see in R70-11019 with 4.19; that one has "Received crash notification for chrome[5492]" instead.
Dec 5
Huh, so I looked at the diff between 11018.0.0 and 11019.0.0 and don't see anything obvious, but here it is: https://crosland.corp.google.com/log/11018.0.0..11019.0.0

Does that crash happen consistently? It looks like signal 5 (SIGTRAP) -- so Chrome is most likely hitting a DCHECK.
Dec 5
Could it be a chrome change?
Dec 5
re #9 -- chrome didn't change between those two versions
Dec 5
chromeos-4.19 didn't change either, and neither did it change in any of the surrounding versions. Confused.
Dec 21
In memPress_4_19_results_part.tar/test_that_results_rbIqi2/results-1-platform_MemoryPressure/platform_MemoryPressure/debug/platform_MemoryPressure.INFO, the log shows that the test terminated early because of a devtools crash exception:

11/02 17:34:36.324 WARNI|platform_MemoryPre:0210| network wait exception Devtools target crashed (/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py:539 _AddDebuggingInformation) Received a socket error in the browser connection and the tab no longer exists. The tab probably crashed.

asavery@, please help check whether devtools always crashes in this test with kernel 4.19.
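One way to check that across several runs; a rough sketch, assuming repeated result directories are unpacked side by side (the glob pattern mirrors the path above but is otherwise illustrative):

import glob

CRASH_MARKER = "Devtools target crashed"

def devtools_crash_tally(pattern="test_that_results_*/results-*/*/debug/*.INFO"):
    """Count lines mentioning the devtools crash in each run's INFO log."""
    tally = {}
    for path in glob.glob(pattern):
        with open(path, errors="ignore") as f:
            tally[path] = sum(1 for line in f if CRASH_MARKER in line)
    return tally

# A run with a zero count finished without the devtools crash, so a mix of zero
# and non-zero counts across runs would point at a flaky failure.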
Dec 21
In the bug description, the issue with the kernel 4.19 test instance is that it terminated with only 20 tabs open because of the devtools crash. I ran platform_MemoryPressure once on kench (a fizz variant) with the R73-11437.0.0 image, 16 GiB RAM, and kernel 4.19, and the test passed with 160+ tabs created, so I think the issue in the bug description is flaky.

It takes hours to run platform_MemoryPressure with 16 GiB RAM. Is it OK to run the test on a 4 GiB machine for zswap testing? Or we could modify platform_MemoryPressure so it takes less time on a 16 GiB machine.
Dec 21
I have also proposed modifying the test to allocate and lock a bunch of RAM at startup. There's some concern that the test would not be as realistic. I think it's OK to do that for performance measurements, as the locked memory would behave like a number of unused tabs. Other tests can check functionality without the need for realism.
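A rough sketch of what that could look like, assuming the usual mmap + mlock route; the amount to lock and the helper name are illustrative, not part of an actual change:

import ctypes
import mmap

def allocate_locked(nbytes):
    """Allocate an anonymous mapping, fault every page in, and mlock it."""
    buf = mmap.mmap(-1, nbytes)      # anonymous mapping
    for offset in range(0, nbytes, mmap.PAGESIZE):
        buf[offset] = 0              # touch each page so frames are actually assigned
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    # mlock needs root or a large RLIMIT_MEMLOCK, which the autotest
    # environment on the DUT should already have.
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(nbytes)) != 0:
        raise OSError(ctypes.get_errno(), "mlock failed")
    return buf  # keep this reference alive for the duration of the test

# Example: lock 12 GiB on a 16 GiB machine so the test only has ~4 GiB to fill.
# locked = allocate_locked(12 * 1024 ** 3)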