Sampling Profiler may affect browser main thread |
||||
Issue descriptionWe are observing through slow-reports cases where the Sampling thread is affecting the main thread performance. Reports: 300252a236358ad4 d6e34aee91f9b693 3ca27823e9a6e02c That means the performance of the task on Main Thread should not be that slow.
,
Sep 11
,
Sep 11
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/329b2bbdaddcfa952b09c9507ecd9cce0127a331 commit 329b2bbdaddcfa952b09c9507ecd9cce0127a331 Author: Etienne Bergeron <etienneb@chromium.org> Date: Tue Sep 11 23:06:04 2018 Add trace event for sampling profiler record stack This CL is adding events to help investigate crbug 882982. We are observing strange interaction with the SamplerThread and the Browser main thread. We suspect the thread to be suspended for a longer period than expected. This code is temporary to investigate sampler profiling issues. Should be reverted in a few weeks. R=wittman@chromium.org, ssid@chromium.org Bug: 882982 Change-Id: I4aa1d733ea06100c64e884e014222b04d97b3492 Reviewed-on: https://chromium-review.googlesource.com/1220174 Reviewed-by: Siddhartha S <ssid@chromium.org> Reviewed-by: Mike Wittman <wittman@chromium.org> Commit-Queue: Etienne Bergeron <etienneb@chromium.org> Cr-Commit-Position: refs/heads/master@{#590518} [modify] https://crrev.com/329b2bbdaddcfa952b09c9507ecd9cce0127a331/base/profiler/native_stack_sampler_win.cc
,
Sep 14
From this slow-report: 279b74dad293c2337b RecordStack event seems to take most of the time.
,
Sep 14
From this slow-report: 7a8a3a06f01fa324 Same thing.
,
Sep 14
The first report doesn't exist and the second one never had a very long record event like we observed for 5 second in earlier cases. This could also just be thread getting descheduled sometimes? Maybe we should wait for more traces to debug this?
,
Sep 14
RecordStack happens after the thread is resumed, so what's happening on the main thread is not due to being suspended for a longer period than expected. Interaction between the threads in RecordStack could be due to I/O contention (due to paging in unwind information) or lock contention, possibly on the loader lock. But I would not expect the sampling profiler alone to be capable of the extended runtimes we're seeing on its thread unless it itself was contending with other execution on the system.
,
Sep 15
From this report: 9b74dad293c2337b Same pattern but not with Omnibox.
,
Sep 15
I do not thing we need more traces. We have many of them and a lot is showing this pattern. I agree with Wittman. The main thread performance is not affected by thread suspension. The I/O is low and the CPU is low. Thus, I really suspect an issue with locks or scheduling priority.
,
Sep 15
From this report: 9b74dad293c2337b Most of the cases I'm seeing have that pattern.
,
Oct 23
From this slow-reports: 19153909ac4e4167 e091d0177654fa6e No sampling results is emitted.
,
Oct 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e7a1ddfb0a74efeeabf8dcc68b4fcd79fa281e08 commit e7a1ddfb0a74efeeabf8dcc68b4fcd79fa281e08 Author: Etienne Bergeron <etienneb@chromium.org> Date: Wed Oct 24 18:06:03 2018 Add error code for sampling profiler failures This CL is adding a trace events to debug some failures happening in the field. An instant event will be present with an error code. R=wittman@chromium.org BUG=882982 Change-Id: I0a55ca82c4312bb212a5b7d1f3e391b42d3d38c9 Reviewed-on: https://chromium-review.googlesource.com/c/1297065 Commit-Queue: Etienne Bergeron <etienneb@chromium.org> Reviewed-by: Mike Wittman <wittman@chromium.org> Cr-Commit-Position: refs/heads/master@{#602397} [modify] https://crrev.com/e7a1ddfb0a74efeeabf8dcc68b4fcd79fa281e08/base/profiler/native_stack_sampler_win.cc
,
Oct 26
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/727b87443d6757fbb1950c85ad2629e9e028345b commit 727b87443d6757fbb1950c85ad2629e9e028345b Author: Etienne Bergeron <etienneb@chromium.org> Date: Fri Oct 26 22:52:52 2018 Add trace events to debug sampling profiler issue with slow-reports There are strange behaviors observed in the field with the sampling profiler. It seems there are two running sampling profiler but only one is reporting events. This CL is adding trace events to help investigate this problem. R=wittman@chromium.org Bug: 882982 Change-Id: I70925e841b8a1f0e47564350d142f056d8b8ae99 Reviewed-on: https://chromium-review.googlesource.com/c/1302318 Reviewed-by: Mike Wittman <wittman@chromium.org> Commit-Queue: Etienne Bergeron <etienneb@chromium.org> Cr-Commit-Position: refs/heads/master@{#603235} [modify] https://crrev.com/727b87443d6757fbb1950c85ad2629e9e028345b/base/profiler/stack_sampling_profiler.cc
,
Dec 12
From this slow report: 8d04eb480c5cf9d1 The sampling profiler is having a really strange pattern. It's a sequence of 2 seconds tasks. The sampling (maybe unwinding) is failing. The main thread and the IO/Thread are janky with the same period (e.g. 2 seconds). There is definitively something bad going on.
,
Dec 12
Same thing here: fe01c05f89836da7 The sampling profiler seems to slow-down the code by a huge factor.
,
Dec 12
Similar things here: 92d9553829b509d8
,
Dec 12
Same thing here: 7fc08883b3a6699f
,
Dec 12
Every tasks is 2 seconds long which is pointing to a timeout. We need to determine is this is somehow related to 1) the unwinding is failing 2) the suspend thread failing (maybe an antivirus) 3) the loader lock is used by an other thread Do we have other idea of what can cause that timeout? Also, I believe the sampling profiler should detect that bad cases, and turn off and log an error when this pattern is happening. We are currently slowing down terribly some users when this case is happening.
,
Dec 12
The traces in comments 14-17 all have tmmon64.pdb, from Trend Micro, on the browser main thread stack. Example top of stack from 7fc08883b3a6699f: off:0xa53a4 - ntdll.pdb [7E3F314E81DE628C71B6CCBCB28921681] off:0x3732f - kernelbase.pdb [11CB9F243230447840B3E1ADD1265F721] off:0xa91b8 - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1] off:0xde0a9 - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1] off:0xdd888 - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1] off:0x3098d - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1] off:0x735f5 - kernelbase.pdb [11CB9F243230447840B3E1ADD1265F721] off:0x134165a - chrome.dll.pdb [D2BE03D5420D0A9E4C4C44205044422E1] off:0x372141 - chrome.dll.pdb [D2BE03D5420D0A9E4C4C44205044422E1] This is almost certainly due to injecting that DLL in-process and hooking our system DLLs. I suspect that whatever they're doing in their hook is cause of the two second pauses. Note that Trend Micro is a known bad actor within the Chrome process. See also issue 882450 where they are causing a ton of crashes in the stack unwinder.
,
Dec 12
More cases: 3a5e0c5038a09e8f 3f9dfe729252ef84 fd7102285bbcc281
,
Dec 12
Good finding wittman. 1) We should not run the sampling profiling when that DLL is detected 2) We should stop the running sampling when too many long sampling event are detected. WDYT ?
,
Dec 12
It's probably worth temporarily disabling the sampling-profiling-while-tracing with that DLL present to test if it's something about execution on that thread that is triggering the issue. It looks like other threads are also having a bad time though, with delays of multiples of 500ms. So it could just be that the sampling profiler is tickling this because it's executing the same affected code on a regular basis. |
||||
►
Sign in to add a comment |
||||
Comment 1 by etienneb@chromium.org
, Sep 1119.5 KB
19.5 KB View Download