
Issue 597643

Starred by 3 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Nov 2017
Cc:
Components: Gfx
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug




ryu: system becomes sluggish while connected to DP

Project Member Reported by seanpaul@chromium.org, Mar 24 2016

Issue description

From crosbug.com/p/41682:

==============================================================================
#28 karyj@nvidia.com
Hi, Addison

I found that the system may become sluggish with a DP monitor connected in some cases. After some digging, I think this issue is caused by the "Composite" step in the hwcomposer.
To be honest, I am not familiar with the hwcomposer, but I assume it should not need to run "Composite" twice on the same frame for the different displays. So could you please have a Google expert take a look? Please correct me if my assumption is wrong.

Also, I attached the results from systrace w/o DP connected.

The reproducing steps are:
1. Open a website in the chrome browser
2. Press the "Recently Opened Applications" button at the bottom-right corner, then press the Chrome browser to bring it to the front again. Repeat this step.

Thanks


==============================================================================
#40 karyj@nvidia.com
I am investigating the issue where the system sometimes becomes sluggish, as described in #28, and I found that the hwcomposer may not handle the layers correctly when hardware overlays are enabled.

I tried dumping the DC_WIN_SIZE at both dc.0 and dc.1, and they are:
dc.0: all the 3 windows are outputting 2560x1800
dc.1: all the 3 windows are outputting 3840x2160

This looks excessive for layers like the following:
dc.0 
    type   |  handle  | hint | flag | tr | blnd |   format    |     source crop (l,t,r,b)      |          frame         | name
-----------+----------+------+------+----+------+-------------+--------------------------------+------------------------+------
      GLES | 7138df1d50 | 0000 | 0001 | 00 | 0100 | RGBA_8888   |    0.0,    0.0, 2560.0, 1640.0 |    0,   48, 2560, 1688 | SurfaceView
      GLES | 7138def690 | 0000 | 0001 | 00 | 0105 | RGBA_8888   |    0.0,  128.0, 2560.0,  244.0 |    0,  128, 2560,  244 | com.android.chrome/org.chromium.chrome.browser.ChromeTabbedActivity
      GLES | 7138d80c70 | 0000 | 0001 | 00 | 0105 | RGBA_8888   |    0.0,    0.0, 2560.0,   48.0 |    0,    0, 2560,   48 | StatusBar
      GLES | 7138d38de0 | 0000 | 0001 | 00 | 0105 | RGBA_8888   |    0.0,    0.0, 2560.0,  112.0 |    0, 1688, 2560, 1800 | NavigationBar
 FB TARGET | 7138abe340 | 0000 | 0000 | 00 | 0105 | RGBA_8888   |    0.0,    0.0, 2560.0, 1800.0 |    0,    0, 2560, 1800 | HWC_FRAMEBUFFER_TARGET

dc.1
    type   |  handle  | hint | flag | tr | blnd |   format    |     source crop (l,t,r,b)      |          frame         | name
-----------+----------+------+------+----+------+-------------+--------------------------------+------------------------+------
      GLES | 7138df1d50 | 0000 | 0001 | 00 | 0100 | RGBA_8888   |    0.0,    0.0, 2560.0, 1640.0 |  384,   58, 3456, 2026 | SurfaceView
      GLES | 7138def690 | 0000 | 0001 | 00 | 0105 | RGBA_8888   |    0.0,  128.0, 2560.0,  244.0 |  384,  154, 3456,  293 | com.android.chrome/org.chromium.chrome.browser.ChromeTabbedActivity
      GLES | 7138d80c70 | 0000 | 0001 | 00 | 0105 | RGBA_8888   |    0.0,    0.0, 2560.0,   48.0 |  384,    0, 3456,   58 | StatusBar
      GLES | 7138d38de0 | 0000 | 0001 | 00 | 0105 | RGBA_8888   |    0.0,    0.0, 2560.0,  112.0 |  384, 2026, 3456, 2160 | NavigationBar
 FB TARGET | 7138c67b80 | 0000 | 0000 | 00 | 0105 | RGBA_8888   |    0.0,    0.0, 3840.0, 2160.0 |    0,    0, 3840, 2160 | HWC_FRAMEBUFFER_TARGET


If I disable hw overlays through "Settings->Developer Options->Disable HW overlay", I cannot see the sluggishness. And for the same set of layers, I get different results when dumping DC_WIN_SIZE.
dc.0: window A  (2560x1800), window B (2560x48), window C (2560x112)
dc.1: window A  (3840x2160), window B (3072x58), window C (3072x134)

I am not familiar with hwcomposer theory, but this result looks more reasonable: the SurfaceView layer and the Chrome browser are blended into window A, and windows B/C are used for the StatusBar and NavigationBar layers respectively.

Sean, do you have any idea about this? 

Considering this bug was used to bring up DP, would you mind filing a new Google issue to track other DP-related issues, such as this sluggishness?

Thanks

 
systrace_for_sluggishness_with_DP.tgz
3.5 MB Download
Cc: -kary...@nvidia.com ka...@nvidia.com
Cc: za...@chromium.org
A few questions:

1- Does this only happen with 4k display?
2- When you disabled HW overlays via settings, were all dc windows active? I'd only expect 1 to be active in that scenario

Labels: drm_hwcomposer
Components: OS>Kernel>Graphics

Comment 5 by ka...@nvidia.com, Mar 25 2016

Re #2

>>1- Does this only happen with 4k display?
No. It also happens with a 1920x1080 display.

>>2- When you disabled HW overlays via settings, were all dc windows active? I'd only expect 1 to be active in that scenario
You are correct. I missed that windows B/C would not be enabled and only window A is used in that scenario.

As I said before, disabling HW overlays via settings avoids the sluggishness, but if I directly disable windows B/C like this

static const struct tegra_dc_window_soc_info tegra210_dc_window_soc_info[] = {
	[0] = {
		.supports_v_filter = true,
		.supports_h_filter = true,
		.supports_planar_rotation = true
	},
	/*[1] = {
		.supports_v_filter = true,
		.supports_h_filter = true,
		.supports_planar_rotation = false
	},
	[2] = {
		.supports_v_filter = true,
		.supports_h_filter = true,
		.supports_planar_rotation = false
	},*/
};

the sluggishness can still be reproduced, and I can see that only window A is used in this case.
Sean, from the hwcomposer's point of view, is there any difference between disabling HW overlays via settings and directly disabling windows B/C in the kernel?


The difference between those 2 scenarios is that drm_hwcomposer is doing the GL composition instead of surfaceflinger.

Zach, do you have any idea why things might slow down in the dual-monitor case?

Comment 7 by za...@chromium.org, Mar 25 2016

I'm having a bit of trouble understanding the scenario that makes it sluggish. Is it the case that disabling windows B/C causes sluggishness, while "Developer Options->Disable HW overlay" works without a problem?
That's my understanding, yes. It seems like the drm_hwc GLC is sluggish compared to SF's compositor.

Comment 9 by za...@chromium.org, Mar 25 2016

Hmmm, the first thing that comes to mind is that when SurfaceFlinger is mirroring displays and hw overlays are disabled, it composites everything once and gives each display a copy, whereas when our HWC receives displays with identical contents, it composites both displays as if they had independent contents. This would double our rendering requirements.

Comment 10 by ka...@nvidia.com, Mar 28 2016

I agree with Zach. From the systrace results, compositing sometimes takes a long time, which causes the delay.

Zach/Sean, would it be possible for the hwcomposer to do only one composition for displays with the same contents?
AFAICT, there's no way for hwc to know whether SF wants the contents mirrored. This means we can't take the same shortcut that SF does (since there are cases where the contents are not mirrored).

I would be curious to know whether the sluggishness is caused by complex compositions taking a long time, or hwc getting "behind" and causing a backlog of work.

Comment 12 by ka...@nvidia.com, Mar 30 2016

@c11:

Is there any way for SF to tell hwc whether the contents should be mirrored or processed some other way? I remember ChromeOS has an option to select mirror mode or extended mode in display settings; is that possible for Smaug?

I measured the time taken by the following line in the DrmDisplayCompositor::ApplyPreComposite() function:

  ret = pre_compositor_->Composite(display_comp->layers().data(),
                                   regions.data(), regions.size(), fb.buffer());

Then I get the following results:

Without DP connected, "Composite" finishes in under 5ms most of the time and never takes longer than 20ms.
With DP connected, "Composite" finishes in under 10ms most of the time, but sometimes takes longer than 20ms and occasionally even over 100ms.

I guess it may be hard for the hwc to process complex compositions twice in a short time, and then the sluggishness appears. However, I am not sure about this, as I do not know much about the hwc. Do you have any idea?

Comment 13 by ka...@nvidia.com, Mar 30 2016

As comment #1 said, I observed that all 3 windows on both dc0 and dc1 were outputting the maximum size (2560x1800 and 3840x2160) when the sluggishness appeared. Sean/Zach, is this expected behavior?



Comment 14 by ka...@nvidia.com, Apr 1 2016

Hi, Sean&Zach

I tried reading more of the hwcomposer code, but sadly I have not found my way into it yet. Could you shed some light on debugging this issue?
Thanks


@#13, that would be expected, depending on the scene. We always use all the windows we can before compositing the rest and shoving that into one of the windows.

@#14, planning for a hwc_set spans quite a few places. Follow the call chain to learn how things are decided:
hwc_set -> DrmCompositor::QueueComposition -> DrmComposition::Plan -> DrmDisplayComposition::Plan. Through those calls, you'll see how we convert the input display contents into nicer C++ classes with automatic management of resources like FDs and fences. Then we queue up the frame which will also trigger a split of the frame into multiple displays. The various display compositors/compositions will decide among themselves how they will split up the available windows and if they will perform any layer squashing.

Comment 16 by ka...@nvidia.com, Apr 5 2016

@#15:

I agree that "using all windows" is normal, but it looks abnormal that all the windows output the maximum size at the same time. Shouldn't we account for the overlap?
If the overlapping area is output more than once, it wastes EMC bandwidth, which may cause the display to take bandwidth from the GPU and slow down GL compositing. That's why I am concerned that all the windows output the maximum size.

Zach/Sean, could you please cc davidu@nvidia.com? He knows more about the hwcomposer than I do and may be able to help here.

Thanks

Comment 17 by ka...@nvidia.com, Apr 13 2016

Hi, Zach/Sean

After some debugging, I found that the nouveau_fence_sync(pt, chan, true) call in the nouveau_gem_pushbuf_queue_kthread_fn thread causes a 500ms delay if some fence is not signaled at that time. These are "sw_sync" fences, and I think they are created by surfaceflinger, the hwc, or something else in userspace. So could you help check whether there is a potential issue in these modules when driving dual displays?

Thanks

Comment 18 by ka...@nvidia.com, Apr 15 2016

This fence timeout issue may be caused by the different processing speeds of the threads in the hwcomposer.

In the hwcomposer there are 2 threads per Compositor per display, and there is no sync mechanism to keep them in step. So it can be observed that when display 0 has finished frame N, display 1 is still processing frame N-2. In this case surfaceflinger assumes it can set the fence for frame N because display 0 has finished it, but display 1 is still processing frame N-2, and the GL composition for that frame N-2 blocks waiting on the fence for its frame N. Obviously, it times out.

Zach/Sean, could you please help with this? I think you are the right people to handle it.

Thanks

Comment 19 by ka...@nvidia.com, Apr 21 2016

Hi, Sean/Zach

May I know why the changes supporting DP in the hwcomposer are not included in the GMS build? It would be convenient for us to test DP without manually replacing hwcomposer.drm.so.

In addition, could you please cc Larry/Hari/Mark?

Thanks




> May I know why the changes supporting DP in the hwcomposer are not included in the GMS build?

They are in as of NRD13.


> In addition, could you please cc Larry/Hari/Mark?

Will do tomorrow.

Comment 21 by ka...@nvidia.com, Apr 21 2016

Update:

There may be at least 2 fence timeout issues that cause the sluggishness:
1. When the boot animation is running, a short hang or sluggishness may be observed.
2. When maximizing/minimizing the Chrome browser window by pressing the "Recently Used Apps" button.

In the hwcomposer, ignoring the squash operation, I think fences are signaled in 2 places:

1. After pre-compositing (called GL compositing earlier) of a frame is finished, the fences of the layers involved in that step are signaled. This step invokes nouveau's "do pushbuf" ioctl through OpenGL calls. In nouveau_gem_pushbuf_queue_kthread_fn, all fences set as the input_fence of a pushbuf must be signaled before the pushbuf is sent to the GPU.

2. After a frame is committed, meaning it is about to be scanned out by dc, the fences of its previous frame (except those already signaled during pre-compositing) are signaled. So frame N's fences are signaled after frame N+1 is committed. If there is no frame N+1, the hwcomposer calls "SquashAll" to signal these fences after waiting 500ms.

For the first sluggishness case, I observed that a fence for a layer needed by pre-compositing was set as the input_fence of a pushbuf while that pre-compositing was executing. This is a circular wait: the fence will not be signaled until pre-compositing finishes, but the "do pushbuf" ioctl is waiting for the fence to be signaled.

For the second sluggishness case, I saw the fence for a non-pre-composited layer of frame N set as the input_fence of a pushbuf while frame N+1 was doing pre-compositing. This is also a circular wait: the fence for the non-pre-composited layer of frame N will not be signaled until frame N+1 is committed, but the pre-compositing of frame N+1 is blocked waiting on this fence.
In both cases, the fence timeout happens after about 500ms.

Sean/Zach, do you have any idea about this? Please correct me if my understanding is wrong. 

Thanks

Comment 22 by ka...@nvidia.com, Apr 27 2016

Update:

I ran some experiments to try to avoid the fence timeout:

1. Sync pre-compositing between the 2 displays, enforcing the order: pre-compositing of frame N for display 0(1) --> pre-compositing of frame N for display 1(0) --> pre-compositing of frame N+1 for display 0(1) --> pre-compositing of frame N+1 for display 1(0)

2. Sync committing between the 2 displays, enforcing the order: committing of frame N for display 0(1) --> committing of frame N for display 1(0) --> committing of frame N+1 for display 0(1) --> committing of frame N+1 for display 1(0)

3. Tune the sizes of the compositing queue and frame queue.

But the fence timeout still exists.

Based on these experiments, I do not think it is a good idea to split the compositing/committing of frames across separate threads per display. Pre-compositing the same layers once per display increases the system load, and it is hard to keep frames in sync between the displays.

Sean/Zach, would you consider modifying the hwcomposer to composite these shared layers only once for all displays? I think you are the right people to do this.

Comment 23 by ka...@nvidia.com, Apr 27 2016

Hi, Sean

Would you help cc Larry/Hari/Mark? I think it may help to get them involved.

Thanks
Project Member

Comment 24 by sheriffbot@chromium.org, Apr 27 2017

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been available for more than 365 days, and should be re-evaluated. Please re-triage this issue.
The Hotlist-Recharge-Cold label is applied for tracking purposes, and should not be removed after re-triaging the issue.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: WontFix (was: Untriaged)
