
Issue 811766

Starred by 6 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: ----

Blocked on:
issue 763498
issue 825793

Blocking:
issue 860358




chromium.perf/Linux Perf build31-a9 frequently drops offline, causing tests to fail

Project Member Reported by sheriff-...@appspot.gserviceaccount.com, Feb 13 2018

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of ashleymarie@google.com

blink_perf.css on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf and 28 other alerts

blink_perf.css on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.css.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.events on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.events.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.shadow_dom on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.shadow_dom.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

kraken on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

kraken.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

media.desktop on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

media.desktop.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

memory.long_running_idle_gmail_tbmv2 on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

memory.long_running_idle_gmail_tbmv2.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

rasterize_and_record_micro.partial_invalidation on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

rasterize_and_record_micro.partial_invalidation.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_path_rendering_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_path_rendering_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_scrolling_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_scrolling_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.maps on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.maps.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_animation_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_animation_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_texture_upload_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_texture_upload_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_webgl_ad_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_webgl_ad_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

system_health.common_desktop on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

system_health.common_desktop.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

views_perftests on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf

Builders failed on: 
- Linux Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf


 

Comment 1 by aga...@chromium.org, Feb 15 2018

Owner: aga...@chromium.org
Status: Fixed (was: Available)
Failing because there's only one bot in the swarming fleet capable of running some of the tests this wants, and it's been dead for two days:
https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=cpu%3Ax86-64&f=gpu%3A10de%3A1cb3&f=id%3Abuild31-a9&f=os%3AUbuntu-14.04&f=pool%3AChrome-perf&l=100&s=id%3Aasc
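For readers unfamiliar with the bot list URL above, here is a minimal sketch that decodes its query string. It assumes the repeated `f=` parameters are the dimension filters applied to the list, which is how the URL reads but is not confirmed in this thread. The helper name `botlist_filters` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# The bot list URL quoted in the comment above.
BOTLIST_URL = (
    "https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status"
    "&f=cpu%3Ax86-64&f=gpu%3A10de%3A1cb3&f=id%3Abuild31-a9"
    "&f=os%3AUbuntu-14.04&f=pool%3AChrome-perf&l=100&s=id%3Aasc"
)


def botlist_filters(url):
    """Return the f= entries of a bot list URL as a {dimension: value} dict.

    Assumption: each f= entry has the form "dimension:value".
    """
    query = parse_qs(urlsplit(url).query)
    filters = {}
    for item in query.get("f", []):
        # Split on the first colon only; GPU values such as "10de:1cb3"
        # contain a colon themselves.
        key, _, value = item.partition(":")
        filters[key] = value
    return filters


for dimension, value in botlist_filters(BOTLIST_URL).items():
    print(dimension, "=", value)
```

Note that the query includes `id:build31-a9`, so this particular link is already narrowed to the single bot discussed here.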

Bot shut down two days ago and never came back:
https://chromium-swarm.appspot.com/bot?id=build31-a9&selected=1&sort_stats=total%3Adesc
Signal was received	bot_shutdown	2/12/2018, 5:01:45 PM (PST)		4fa20b76

Manually rebooted that bot and it's back now.
Components: Infra>Labs
Owner: ----
Status: Assigned (was: Fixed)
Infra > Labs, it seems that build31-a9 is down again. Could you take a look to see what's going on?

Comment 3 by pschm...@google.com, Feb 21 2018

build31-a9 is up.  It's swarming that is down.

From the swarming log file on that host:

2135 2018-02-16 06:23:16.013 D: "POST /swarming/api/v1/bot/event HTTP/1.1" 200 2
2135 2018-02-16 06:23:16.013 D: Request https://chromium-swarm.appspot.com/swarming/api/v1/bot/event succeeded
2135 2018-02-16 06:23:16.014 I: on_bot_shutdown(): 0.001s
2135 2018-02-16 06:23:16.014 I: main() returning
2135 2018-02-16 06:23:16.014 I: bot_main exit code: 0

This coincides with the last event recorded: https://chromium-swarm.appspot.com/bot?id=build31-a9&selected=1&sort_stats=total%3Adesc

No idea what initiated the shutdown.

Rebooted the host and now it's back up and connected.

...and the bot is down again after running this task: https://chromium-swarm.appspot.com/task?id=3bd0ef2a25d26e10

From the last 2 events it appears the bot shuts down specifically after NVIDIA GPU tests.
Status: Untriaged (was: Assigned)
Marking it untriaged again for Labs to see it. Maybe something's wrong with the GPU?

Comment 6 by pschm...@google.com, Feb 22 2018

Owner: pschmidt@chromium.org
Status: Assigned (was: Untriaged)

Comment 7 by eyaich@chromium.org, Feb 23 2018

Cc: eyaich@google.com
 Issue 815164  has been merged into this issue.

Comment 8 by eyaich@chromium.org, Feb 23 2018

Please restart build31-a9 again.  It doesn't look like it ever came back up.

NVIDIA GPU is just the name of the bot, not the test it was failing on.  It doesn't look like this bot has been up since the 20th:

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%207%20x64%20Perf?numbuilds=200

Comment 9 by b...@chromium.org, Feb 23 2018

I rebooted build31-a9. 
Should we look at replacing build31-a9?  This specific builder continues to fail after being brought back up.
Cc: kbr@chromium.org
It looks like swarming on build31-a9 dies whenever it tries to run views_perftests.  Seems to function fine on the other tests. 

Here's the last views_perftests run:

https://chromium-swarm.appspot.com/task?id=3bddac4a1d204010&refresh=10&show_raw=1

And below is the output of swarming_log on build31-a9.  Note the signal 15.   

2138 2018-02-24 03:59:54.925 D: Running command: ['/usr/bin/python', '/b/s/swarming_bot.1.zip', 'task_runner', '--swarming-server', u'https://3340-42bd5c6-dot-chromium-swarm.appspot.com', '--in-file', '/b/s/w/task_runner_in.json', '--out-file', '/b/s/w/task_runner_out.json', '--cost-usd-hour', '0.43909828559', '--start', '1519444794.03', '--bot-file', '/b/s/w/bot_file6dfQeY.json', '--auth-params-file', '/b/s/w/bot_auth_params.json', '--', '--cache', '/b/s/isolated_cache', '--min-free-space', '54752089997', '--named-cache-root', '/b/s/c', '--max-cache-size', '53687091200', '--max-items', '51200']
2138 2018-02-24 04:00:02.351 I: Got signal 15
3598 2018-02-24 04:00:02.767 I: importing bot_main: /b/s/swarming_bot.1.zip, 0c82688798abffb7da74cab18bff90adacc2d8009cf79e05fd924b0f623b45d7
3598 2018-02-24 04:00:02.768 E: Singleton held by
3598 2018-02-24 04:00:02.768 I: bot_main exit code: 1
2138 2018-02-24 04:00:03.553 I: task_runner exit: 0
2138 2018-02-24 04:00:03.616 I: [mmutex] on_after_task releasing mutex
2138 2018-02-24 04:00:03.616 I: [mmutex] Releasing maintenance mutex
2138 2018-02-24 04:00:03.617 I: ts_mon hook_name='on_after_task' pool=u'cores:8|cpu:x86-64-E3-1230_v5|cpu:x86-64-avx2|gpu:10de:1cb3-384.69|inside_docker:0|kvm:1|locale:en_US.UTF-8|machine_type:n1-standard-8|os:Linux|os:Ubuntu-14.04|pool:Chrome-perf|python:2.7.6|server_version:3340-42bd5c6|ssd:1'
2138 2018-02-24 04:00:03.617 I: on_after_task(): 0.001s
2138 2018-02-24 04:00:03.617 I: rmtree(/b/s/w)
2138 2018-02-24 04:00:03.617 D: make_tree_deleteable(/b/s/w)
2138 2018-02-24 04:00:03.618 I: ts_mon hook_name='get_settings' pool=u'cores:8|cpu:x86-64-E3-1230_v5|cpu:x86-64-avx2|gpu:10de:1cb3-384.69|inside_docker:0|kvm:1|locale:en_US.UTF-8|machine_type:n1-standard-8|os:Linux|os:Ubuntu-14.04|pool:Chrome-perf|python:2.7.6|server_version:3340-42bd5c6|ssd:1'
2138 2018-02-24 04:00:03.619 I: get_settings(): 0s
2138 2018-02-24 04:00:03.619 I: Running: ['/usr/bin/python', '/b/s/swarming_bot.1.zip', 'run_isolated', '--clean', '--log-file', '/b/s/logs/run_isolated.log', '--cache', '/b/s/isolated_cache', '--min-free-space', '54752089997', '--named-cache-root', '/b/s/c', '--max-cache-size', '53687091200', '--max-items', '51200']
2138 2018-02-24 04:00:03.845 I: Result:

2138 2018-02-24 04:00:04.244 D: "POST /swarming/api/v1/bot/event HTTP/1.1" 200 2
2138 2018-02-24 04:00:04.244 D: Request https://chromium-swarm.appspot.com/swarming/api/v1/bot/event succeeded
2138 2018-02-24 04:00:04.245 I: on_bot_shutdown(): 0s
2138 2018-02-24 04:00:04.245 I: main() returning
2138 2018-02-24 04:00:04.245 I: bot_main exit code: 0
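
A log like the one above can be scanned for the shutdown-related lines with a short script. This is only a sketch: it assumes every line has the shape `<number> <date> <time> <level>: <message>` seen in the excerpt, and it treats the leading number as a process id, which is a guess based on the two distinct values (2138 and 3598). The function name `shutdown_events` is hypothetical. Signal 15 is SIGTERM on Linux, meaning something asked the bot process to terminate.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   <pid?> <YYYY-MM-DD> <HH:MM:SS.mmm> <level letter>: <message>
LOG_LINE = re.compile(
    r"^(?P<pid>\d+) (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]): (?P<msg>.*)$"
)

# Messages from the excerpt that relate to the bot going down.
INTERESTING = re.compile(
    r"Got signal \d+|bot_main exit code: \d+|Singleton held by"
)


def shutdown_events(lines):
    """Return (pid, timestamp, message) for each shutdown-related log line."""
    events = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # Skip blank or continuation lines.
        if INTERESTING.search(match["msg"]):
            events.append((int(match["pid"]), match["ts"], match["msg"]))
    return events


# A few lines copied from the log excerpt above.
SAMPLE = """\
2138 2018-02-24 04:00:02.351 I: Got signal 15
3598 2018-02-24 04:00:02.768 E: Singleton held by
3598 2018-02-24 04:00:02.768 I: bot_main exit code: 1
2138 2018-02-24 04:00:03.553 I: task_runner exit: 0
2138 2018-02-24 04:00:04.245 I: bot_main exit code: 0
""".splitlines()

for pid, ts, msg in shutdown_events(SAMPLE):
    print(pid, ts, msg)
```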

Looking at the isolate inputs:  https://isolateserver.appspot.com/browse?namespace=default-gzip&hash=df2fe553b9b98782f2e6f47994258b23f9967bd1

It's trying to run the test using xvfb?   That seems wrong?

Cc: nednguyen@chromium.org
Ned, do you have any idea if xvfb might be a problem here?

build31-a9 is failing *again* right now - the bot appears to be going purple - and I found this bug searching back through time.
Summary: chromium.perf/Linux Perf build31-a9 frequently drops offline, causing tests to fail (was: Many tests failing with exception on NVIDIA GPU on Linux on chromium.perf/Linux Perf)
Cc: vhang@chromium.org
If xvfb is a problem for this bot, it should also be a problem for other Linux bots as well. 

Can Labs look into what's wrong with this machine and consider replacing it? Simply rebooting the machine hasn't worked well so far.
Labels: -Pri-2 Pri-1
This is making us lose a lot of perf data on Linux, so I'm bumping up the priority.

Comment 16 by bugdroid1@chromium.org, Mar 14 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome-golo/chrome-golo/+/93b52afcd3182becce92a4b516f7bc3bc82cfda8

commit 93b52afcd3182becce92a4b516f7bc3bc82cfda8
Author: Peter Schmidt <pschmidt@google.com>
Date: Wed Mar 14 18:38:13 2018

Status: Started (was: Assigned)
build31-a9 has been replaced.  Both host and video card.
Ugh. This machine is still failing. I'm going to SSH into the machine and look at the swarming logs to see if anything there stands out as bad.
Issue 821005 has been merged into this issue.

Comment 20 by jo...@google.com, Mar 16 2018

 Issue 822924  has been merged into this issue.
Looks like this machine is still failing
Charlie did you have any luck with your SSHing last week?
Cc: tapted@chromium.org
OK, we have identified that views_perftests is what is killing the bot. We are going to disable it and reboot the bot until we can resolve the xvfb issue.

crrev.com/828226 landed on February 8th, here is the build: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf/builds/2420 and the bot went down in the next build: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf/builds/2421

There was a recent xvfb-related bug on Linux, and we decided that we shouldn't need to run our tests using xvfb anymore.  See the discussion on crbug.com/822479.  I would just remove the flag from views_perftests, but Trent, you indicated that views_perftests should run with xvfb (and did some refactoring to make sure it could).  See crrev.com/c/848466.

Trent can you speak to why this flag is necessary?  The linux configuration on the perf waterfall is running on bare metal machines so I don't think you need this flag to run the test.  

Comment 23 by bugdroid1@chromium.org, Mar 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/04830e0c2322b02b9d3a6bc4cfc7d2988043eae9

commit 04830e0c2322b02b9d3a6bc4cfc7d2988043eae9
Author: Emily Hanley <eyaich@google.com>
Date: Fri Mar 23 18:44:46 2018

Disabling views_perftests on the chromium.perf linux bot

Bug: 811766
Change-Id: Ia95dc3a6639ed4334243c4476930017aef2d560e
Reviewed-on: https://chromium-review.googlesource.com/978447
Reviewed-by: Ashley Enstad <ashleymarie@chromium.org>
Commit-Queue: Emily Hanley <eyaich@chromium.org>
Cr-Commit-Position: refs/heads/master@{#545534}
[modify] https://crrev.com/04830e0c2322b02b9d3a6bc4cfc7d2988043eae9/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/04830e0c2322b02b9d3a6bc4cfc7d2988043eae9/tools/perf/core/perf_data_generator.py

Blockedon: 825793
Cc: sullivan@chromium.org
 Issue 825981  has been merged into this issue.
> Trent can you speak to why this flag is necessary?  The linux configuration on the perf waterfall is running on bare metal machines so I don't think you need this flag to run the te

Hey! Sorry about this - I had no idea stuff was breaking. (And thank you for troubleshooting!)

xvfb was required for the test to pass through the trybots -- see failures on https://chromium-review.googlesource.com/c/chromium/src/+/848466/4

Without xvfb the test just emits

[13387:13387:0109/225313.160530:12529997041:ERROR:gl_surface_osmesa_x11.cc(38)] XOpenDisplay failed.
[13387:13387:0109/225313.160564:12529997074:ERROR:gl_initializer_x11.cc(163)] GLSurfaceOSMesaX11::InitializeOneOff failed.
[13387:13387:0109/225313.160587:12529997096:FATAL:gl_surface_test_support.cc(80)] Check failed: init::InitializeGLOneOffImplementation( impl, fallback_to_software_gl, gpu_service_logging, disable_gl_drawing, init_extensions). 

and dies. - https://chromium-swarm.appspot.com/task?id=3af7a8d485b6af10&refresh=10&show_raw=1

Passing --xvfb fixed the issue, and all the regular swarming trybots seemed fine with it.


Should (can?) the xvfb wrapper script just detect "bare metal" and ignore the xvfb flag? That seems like a more robust way to fix this issue versus additional args juggling in places like chromium.perf.json
I agree that a more robust solution would be to detect this on the bots themselves but I am not sure if it is possible to know that when the test is running.

For now I suggest we take that flag out of the isolate map so that not every run of views_perftests has to execute with it.
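
To make the "detect it in the wrapper" idea concrete, here is a purely hypothetical sketch of such a check. It is not the actual xvfb wrapper script. The function name `should_start_xvfb` is invented, and the rule it uses (a non-empty `DISPLAY` means a usable X server already exists) is an assumption that may not hold on every bot configuration.

```python
import os


def should_start_xvfb(xvfb_requested, environ=None):
    """Decide whether a wrapper should launch Xvfb before running a test.

    Hypothetical policy, not the real wrapper's behavior:
      * never start Xvfb if the caller did not ask for it;
      * if it was requested, start it only when DISPLAY is unset or empty.
    """
    env = os.environ if environ is None else environ
    if not xvfb_requested:
        return False
    # Assumption: a non-empty DISPLAY points at an X server that works.
    return not env.get("DISPLAY")


# A bot with a real display would skip Xvfb; a headless one would not.
print(should_start_xvfb(True, {"DISPLAY": ":0"}))
print(should_start_xvfb(True, {}))
```

A heuristic like this hides the decision inside the wrapper, so what a given bot actually did would only be visible in its logs.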

Comment 28 by kbr@chromium.org, Mar 27 2018

Blockedon: 763498
Owner: eyaich@chromium.org
It could be difficult to auto-detect whether the --xvfb flag should be ignored on all of the Linux-based bot types the Chromium project supports. There are a lot of different GPU and bot configurations on the waterfall. Our team has also seen situations on the bare metal bots where other test suites were causing the Swarming server to be killed and the bot to go offline. See for example Issue 763498, and the fact that even gpu_unittests was recently found to kill our Linux bots:

https://cs.chromium.org/chromium/src/content/test/gpu/generate_buildbot_json.py?q=generate_buildbot_&sq=package:chromium&l=1598

We haven't gotten to the bottom of all of these yet.

In this situation I think the less magic the better so that we know exactly what's going on on the bots.

Emily, let me assign this back to you.

Blocking: 860358
