
Issue 794017

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Jan 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: Bug




Linux Chrome startup is flaky on ChromeDriver waterfall without --disable-gpu

Project Member Reported by johnchen@chromium.org, Dec 12 2017

Issue description

Chrome Version: 65.0.*
OS: Linux only

On the ChromeDriver waterfall tests, we encountered numerous instances of the Chrome browser becoming unresponsive soon after startup (e.g., see [1]). The ChromeDriver log always shows GPU-related error messages before this happens (e.g., [2] between timestamps 21.051 and 50.906), so we speculatively added the --disable-gpu flag to the tests, and this flag did stop the failures from occurring.
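For anyone wanting to apply the same workaround outside the waterfall, here is a minimal sketch of passing --disable-gpu through a ChromeDriver session request. The helper name chrome_capabilities is hypothetical, and the W3C-style payload using the goog:chromeOptions key is an assumption; the exact payload shape may differ for the ChromeDriver version in use, and this is not how the waterfall tests themselves set the flag.

```python
import json


def chrome_capabilities(extra_args=(), disable_gpu=True):
    """Build a new-session payload (hypothetical helper, not from the bug).

    Assumes a W3C-style request body with Chrome switches listed under
    "goog:chromeOptions" -> "args".
    """
    args = list(extra_args)
    # Append the workaround switch once, unless the caller opted out.
    if disable_gpu and "--disable-gpu" not in args:
        args.append("--disable-gpu")
    return {
        "capabilities": {
            "alwaysMatch": {
                "browserName": "chrome",
                "goog:chromeOptions": {"args": args},
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that a client would POST to ChromeDriver's /session.
    print(json.dumps(chrome_capabilities(), indent=2))
```

Keeping the switch behind a disable_gpu parameter makes it easy to turn the workaround off again for tests that need GPU acceleration.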

So far we haven't been able to repro this issue on any machine other than the ChromeDriver waterfall. This has prevented us from bisecting the issue, as the ChromeDriver waterfall isn't configured for bisecting. We're not sure whether this is due to GPU differences, VM configuration, or something else. Waterfall history indicates that this issue likely started occurring in the commit range https://crrev.com/520710..520747

[1] https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.chromedriver%2FLinux%2F32499%2F%2B%2Frecipes%2Fsteps%2Fpython_tests_v522596_%2F0%2Fstdout

[2] http://chromedriver-data.storage.googleapis.com/server_logs/chromedriver_log_OhVe3r
 
Cc: kbr@chromium.org
The error from [2] is:

[15543:15543:1207/154821.419916:ERROR:gpu_process_transport_factory.cc(1017)] Lost UI shared context.
[15543:15616:1207/154821.426554:ERROR:service_manager_context.cc(219)] Attempting to run unsupported native service: /tmp/chromedriver_dDhgeZ/chrome-linux/content_gpu.service
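To check other runs for the same symptom, a small script can pull GPU-related ERROR lines out of a saved browser log. This is a rough sketch: the regular expression is inferred only from the two lines quoted above, and the GPU_HINTS keywords and the find_gpu_errors name are assumptions made for illustration, not part of any Chrome or ChromeDriver tooling.

```python
import re

# Line shape inferred from the quoted log lines:
# [pid:tid:MMDD/HHMMSS.micros:LEVEL:file(line)] message
LOG_LINE = re.compile(
    r"^\[(?P<pid>\d+):(?P<tid>\d+):(?P<date>\d{4})/(?P<time>\d{6}\.\d+):"
    r"(?P<level>[A-Z]+):(?P<file>[^()\]]+)\((?P<line>\d+)\)\]\s*(?P<message>.*)$"
)

# Assumed keywords; adjust to whatever counts as GPU-related for you.
GPU_HINTS = ("gpu", "shared context")


def find_gpu_errors(log_text):
    """Return parsed fields for ERROR lines that mention a GPU hint."""
    hits = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match is None or match.group("level") != "ERROR":
            continue
        # Search both the source file name and the message text.
        haystack = (match.group("file") + " " + match.group("message")).lower()
        if any(hint in haystack for hint in GPU_HINTS):
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_gpu_errors.py < saved_log.txt
    for hit in find_gpu_errors(sys.stdin.read()):
        print(hit["time"], hit["file"], hit["message"])
```

Lines that do not match the assumed shape are skipped rather than treated as errors, so the script tolerates ChromeDriver's own log entries mixed into the same file.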

+kbr@ from gpu team in case he knows something. Revision range is pretty small so maybe a dev can make a guess.
[SEVERE]: Timed out receiving message from renderer 

Also to note, this appears all over a run_py_tests.py run but isn't fatal and doesn't affect test results. The message is also generally coupled with a time that is much shorter than the ones in the above logs.

Comment 3 by kbr@chromium.org, Dec 14 2017

Cc: zmo@chromium.org
Possibly related to:
https://chromium.googlesource.com/chromium/src/+/e36bfd5b9a11989786d4a40afe4d2f21a941b979
? Mo, what do you think?

Is it expected that the GPU process will work on the machines on the ChromeDriver waterfall? Most of Chrome's testing machines are VMs and GPU functionality doesn't work there. But most probably the browser shouldn't fail in this way in this case, especially since ChromeDriver is probably used in many web companies' continuous integration systems, run on VMs.

Not sure if the GPU process ever really worked on the ChromeDriver waterfall, but at least it didn't cause any issues before. I think Chrome should handle the case when the GPU isn't available, without forcing users to add the --disable-gpu switch.

Comment 5 by kbr@chromium.org, Dec 14 2017

Is this the bot to look at?
https://luci-milo.appspot.com/buildbot/chromium.chromedriver/Linux/?limit=200

It looks like run_all_tests.py started passing again recently; is this issue still happening?

We want to see whether the warning about losing the UI shared context was happening before Mo's patch landed.

Comment 6 by kbr@chromium.org, Dec 14 2017

Ah, cool, luci-milo offers more history than the old buildbot view.

https://luci-milo.appspot.com/buildbot/chromium.chromedriver/Linux/?limit=400

This should show the history before the failures started.

Comment 7 by zmo@chromium.org, Dec 14 2017

Owner: zmo@chromium.org
Status: Assigned (was: Untriaged)
It seems the bot turned green lately. Is it because --disable-gpu is explicitly passed in?

By looking at the log, I don't think there is enough info to tell if my CL is the culprit, or if it is, then why.

If ChromeDriver folks still want this issue to be figured out, then I need some help setting up an environment to reproduce locally.

If you guys think there is no need to do anything further, please close the bug. 

Comment 8 by zmo@chromium.org, Dec 14 2017

Cc: piman@chromium.org
Re comments 5 and 7, we have explicitly added --disable-gpu to the tests on waterfall as a workaround to this issue.

Comment 10 by zmo@chromium.org, Dec 14 2017

So if GPU acceleration is not desired on ChromeDriver, we can just add logic to automatically insert that switch in Chrome. How do we reliably detect that it's ChromeDriver?
Many Chromedriver tests rely on gpu support. For example, we have an entire framework for video playback performance that would be useless if GPU acceleration were turned off. Additionally, Chromedriver is the default way to benchmark Chrome against other browsers in an automated way. With GPU acceleration turned off, we would start looking pretty bad.

Comment 12 by zmo@chromium.org, Dec 14 2017

Then someone please help me set up a repro environment so I can repro and debug why the flaky crash happens.
Cc: vhang@chromium.org
The problem is that Chromedriver devs haven't been able to get a local repro.

+vhang@/chrome-labs:

1. Does chrome-labs maintain the VM that runs the chromedriver waterfall tests? It looks like they are all run by https://build.chromium.org/deprecated/chromium.chromedriver/buildslaves/slave108-c1

2. If we wanted one, how hard would it be to get another VM checked in to the chromedriver linux pool (https://luci-milo.appspot.com/buildbot/chromium.chromedriver/Linux/)? (Then we can take slave108-c1 off of the continuous builds temporarily and get the flakiness reproing.) It seems that if another slave were available, we could just add it to https://cs.chromium.org/chromium/build/masters/master.chromium.chromedriver/slaves.cfg
Owner: ----
Status: Available (was: Assigned)
Since buildbot is deprecated anyway, I figured I would try to repro by running in swarming. I wrote this CL: https://chromium-review.googlesource.com/c/chromium/src/+/831052 (it should help with  issue 793370  anyway)

It seems to fail on GCE VMs: 
 (a) https://chromium-swarm.appspot.com/task?id=3a8198059326da10&refresh=10&show_raw=1 
 (b) https://chromium-swarm.appspot.com/task?id=3a83deb272e92310&refresh=10&show_raw=1

And it fails on physical machines: https://chromium-swarm.appspot.com/task?id=3a83de8622938d10&refresh=10&show_raw=1

But it works on chrome labs VMs: https://chromium-swarm.appspot.com/task?id=3a771e9b42cf0e10&refresh=10&show_raw=1 https://chromium-swarm.appspot.com/task?id=3a83e76f3dbeaf10&refresh=10&show_raw=1

Also, unassigning this until we figure out how to repro it.
Project Member

Comment 15 by sheriffbot@chromium.org, Dec 19

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Owner: crouleau@chromium.org
Status: Assigned (was: Untriaged)
GPU Triage: crouleau@, is this bug still applicable?
Owner: johnchen@chromium.org
John can triage. Maybe just archive this?
Status: Archived (was: Assigned)
