New issue
Advanced search Search tips

Issue 882982 link

Starred by 1 user

Issue metadata

Status: Untriaged
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

Sampling Profiler may affect browser main thread

Project Member Reported by etienneb@chromium.org, Sep 11

Issue description

We are observing through slow-reports cases where the Sampling thread is affecting the main thread performance.

Reports:
  300252a236358ad4
  d6e34aee91f9b693
  3ca27823e9a6e02c

That means the performance of the task on Main Thread should not be that slow.
 
sampling1.png
14.8 KB View Download
sampling2.png
12.9 KB View Download
sampling3.png
16.1 KB View Download
report: 5ed2e2a89b1c4398
sampling4.png
19.5 KB View Download
Summary: Sampling Profiler may affect browser main thread (was: Samping Profiler may affect browser main thread)
Project Member

Comment 3 by bugdroid1@chromium.org, Sep 11

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/329b2bbdaddcfa952b09c9507ecd9cce0127a331

commit 329b2bbdaddcfa952b09c9507ecd9cce0127a331
Author: Etienne Bergeron <etienneb@chromium.org>
Date: Tue Sep 11 23:06:04 2018

Add trace event for sampling profiler record stack

This CL is adding events to help investigate crbug 882982.
We are observing strange interaction with the SamplerThread and the
Browser main thread.

We suspect the thread to be suspended for a longer period than expected.

This code is temporary to investigate sampler profiling issues.
Should be reverted in a few weeks.

R=wittman@chromium.org, ssid@chromium.org

Bug: 882982
Change-Id: I4aa1d733ea06100c64e884e014222b04d97b3492
Reviewed-on: https://chromium-review.googlesource.com/1220174
Reviewed-by: Siddhartha S <ssid@chromium.org>
Reviewed-by: Mike Wittman <wittman@chromium.org>
Commit-Queue: Etienne Bergeron <etienneb@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590518}
[modify] https://crrev.com/329b2bbdaddcfa952b09c9507ecd9cce0127a331/base/profiler/native_stack_sampler_win.cc

From this slow-report: 279b74dad293c2337b

RecordStack event seems to take most of the time.
sampling_bug1.png
30.2 KB View Download
From this slow-report: 7a8a3a06f01fa324
Same thing.
sampling_bug2.png
49.3 KB View Download
The first report doesn't exist and the second one never had a very long record event like we observed for 5 second in earlier cases. This could also just be thread getting descheduled sometimes? Maybe we should wait for more traces to debug this?
RecordStack happens after the thread is resumed, so what's happening on the main thread is not due to being suspended for a longer period than expected.

Interaction between the threads in RecordStack could be due to I/O contention (due to paging in unwind information) or lock contention, possibly on the loader lock. But I would not expect the sampling profiler alone to be capable of the extended runtimes we're seeing on its thread unless it itself was contending with other execution on the system.
From this report: 9b74dad293c2337b

Same pattern but not with Omnibox.
sampling_bug3.png
123 KB View Download
Labels: Hotlist-Jank
I do not thing we need more traces. We have many of them and a lot is showing this pattern.

I agree with Wittman. The main thread performance is not affected by thread suspension.
The I/O is low and the CPU is low. Thus, I really suspect an issue with locks or scheduling priority.
From this report: 9b74dad293c2337b

Most of the cases I'm seeing have that pattern.
sampling_bug4.png
115 KB View Download
From this slow-reports:
  19153909ac4e4167
  e091d0177654fa6e

No sampling results is emitted.
nosampling1.png
37.9 KB View Download
nosampling2.png
43.5 KB View Download
Project Member

Comment 12 by bugdroid1@chromium.org, Oct 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e7a1ddfb0a74efeeabf8dcc68b4fcd79fa281e08

commit e7a1ddfb0a74efeeabf8dcc68b4fcd79fa281e08
Author: Etienne Bergeron <etienneb@chromium.org>
Date: Wed Oct 24 18:06:03 2018

Add error code for sampling profiler failures

This CL is adding a trace events to debug some failures happening in the
field. An instant event will be present with an error code.

R=wittman@chromium.org
BUG=882982

Change-Id: I0a55ca82c4312bb212a5b7d1f3e391b42d3d38c9
Reviewed-on: https://chromium-review.googlesource.com/c/1297065
Commit-Queue: Etienne Bergeron <etienneb@chromium.org>
Reviewed-by: Mike Wittman <wittman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602397}
[modify] https://crrev.com/e7a1ddfb0a74efeeabf8dcc68b4fcd79fa281e08/base/profiler/native_stack_sampler_win.cc

Project Member

Comment 13 by bugdroid1@chromium.org, Oct 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/727b87443d6757fbb1950c85ad2629e9e028345b

commit 727b87443d6757fbb1950c85ad2629e9e028345b
Author: Etienne Bergeron <etienneb@chromium.org>
Date: Fri Oct 26 22:52:52 2018

Add trace events to debug sampling profiler issue with slow-reports

There are strange behaviors observed in the field with the sampling
profiler. It seems there are two running sampling profiler but only
one is reporting events.

This CL is adding trace events to help investigate this problem.

R=wittman@chromium.org

Bug: 882982
Change-Id: I70925e841b8a1f0e47564350d142f056d8b8ae99
Reviewed-on: https://chromium-review.googlesource.com/c/1302318
Reviewed-by: Mike Wittman <wittman@chromium.org>
Commit-Queue: Etienne Bergeron <etienneb@chromium.org>
Cr-Commit-Position: refs/heads/master@{#603235}
[modify] https://crrev.com/727b87443d6757fbb1950c85ad2629e9e028345b/base/profiler/stack_sampling_profiler.cc

From this slow report: 8d04eb480c5cf9d1

The sampling profiler is having a really strange pattern.
It's a sequence of 2 seconds tasks. The sampling (maybe unwinding) is failing.
The main thread and the IO/Thread are janky with the same period (e.g. 2 seconds).

There is definitively something bad going on.

sampling.png
240 KB View Download
Same thing here: fe01c05f89836da7

The sampling profiler seems to slow-down the code by a huge factor.
sampling.png
111 KB View Download
Similar things here: 92d9553829b509d8
sampling.png
147 KB View Download
Same thing here: 7fc08883b3a6699f
sampling.png
218 KB View Download
Labels: Performance-Browser Hotlist-SamplingProfilerInField
Every tasks is 2 seconds long which is pointing to a timeout.

We need to determine is this is somehow related to 
  1) the unwinding is failing
  2) the suspend thread failing (maybe an antivirus)
  3) the loader lock is used by an other thread

Do we have other idea of what can cause that timeout?

Also, I believe the sampling profiler should detect that bad cases, and turn off and log an error when this pattern is happening. We are currently slowing down terribly some users when this case is happening.
timeout.png
13.0 KB View Download
The traces in comments 14-17 all have tmmon64.pdb, from Trend Micro, on the browser main thread stack. Example top of stack from 7fc08883b3a6699f:

 off:0xa53a4 - ntdll.pdb [7E3F314E81DE628C71B6CCBCB28921681]
 off:0x3732f - kernelbase.pdb [11CB9F243230447840B3E1ADD1265F721]
 off:0xa91b8 - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1]
 off:0xde0a9 - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1]
 off:0xdd888 - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1]
 off:0x3098d - tmmon64.pdb [45D8C6189BB04E5BACF7D99FA3590BED1]
 off:0x735f5 - kernelbase.pdb [11CB9F243230447840B3E1ADD1265F721]
 off:0x134165a - chrome.dll.pdb [D2BE03D5420D0A9E4C4C44205044422E1]
 off:0x372141 - chrome.dll.pdb [D2BE03D5420D0A9E4C4C44205044422E1]

This is almost certainly due to injecting that DLL in-process and hooking our system DLLs. I suspect that whatever they're doing in their hook is cause of the two second pauses.

Note that Trend Micro is a known bad actor within the Chrome process. See also issue 882450 where they are causing a ton of crashes in the stack unwinder.
More cases:
3a5e0c5038a09e8f
3f9dfe729252ef84
fd7102285bbcc281
Good finding wittman.

  1) We should not run the sampling profiling when that DLL is detected
  2) We should stop the running sampling when too many long sampling event are detected.

WDYT ?
It's probably worth temporarily disabling the sampling-profiling-while-tracing with that DLL present to test if it's something about execution on that thread that is triggering the issue.

It looks like other threads are also having a bad time though, with delays of multiples of 500ms. So it could just be that the sampling profiler is tickling this because it's executing the same affected code on a regular basis.

Sign in to add a comment