chromium.perf/Linux Perf build31-a9 frequently drops offline, causing tests to fail
Issue description
Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of ashleymarie@google.com

blink_perf.css on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf and 28 other alerts

blink_perf.css on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.css.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.events on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.events.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.shadow_dom on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

blink_perf.shadow_dom.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

kraken on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

kraken.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

media.desktop on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

media.desktop.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

memory.long_running_idle_gmail_tbmv2 on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

memory.long_running_idle_gmail_tbmv2.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

rasterize_and_record_micro.partial_invalidation on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

rasterize_and_record_micro.partial_invalidation.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_path_rendering_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_path_rendering_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_scrolling_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.gpu_rasterization.tough_scrolling_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.maps on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.maps.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_animation_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_animation_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_texture_upload_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_texture_upload_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_webgl_ad_cases on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

smoothness.tough_webgl_ad_cases.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

system_health.common_desktop on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

system_health.common_desktop.reference on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf

views_perftests on NVIDIA GPU on Linux failing on chromium.perf/Linux Perf
Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf
,
Feb 21 2018
Infra > Labs, it seems that build31-a9 is down again. Could you take a look to see what's going on?
,
Feb 21 2018
build31-a9 is up. It's swarming that is down. From the swarming log file on that host:

2135 2018-02-16 06:23:16.013 D: "POST /swarming/api/v1/bot/event HTTP/1.1" 200 2
2135 2018-02-16 06:23:16.013 D: Request https://chromium-swarm.appspot.com/swarming/api/v1/bot/event succeeded
2135 2018-02-16 06:23:16.014 I: on_bot_shutdown(): 0.001s
2135 2018-02-16 06:23:16.014 I: main() returning
2135 2018-02-16 06:23:16.014 I: bot_main exit code: 0

This coincides with the last event recorded: https://chromium-swarm.appspot.com/bot?id=build31-a9&selected=1&sort_stats=total%3Adesc

No idea what initiated the shutdown. Rebooted the host and now it's back up and connected.
,
Feb 21 2018
...and the bot is down again after running this task: https://chromium-swarm.appspot.com/task?id=3bd0ef2a25d26e10 From the last 2 events it appears the bot shuts down specifically after NVIDIA GPU tests.
,
Feb 21 2018
Marking it untriaged again for Labs to see it. Maybe something's wrong with the GPU?
,
Feb 22 2018
,
Feb 23 2018
,
Feb 23 2018
Please restart build31-a9 again. It doesn't look like it ever came back up. NVIDIA GPU is just the name of the bot, not the test it was failing on. It doesn't look like this bot has been up since the 20th: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%207%20x64%20Perf?numbuilds=200
,
Feb 23 2018
I rebooted build31-a9.
,
Feb 26 2018
Should we look at replacing build31-a9? This specific builder continues to fail after being brought back up.
,
Feb 26 2018
It looks like swarming on build31-a9 dies whenever it tries to run views_perftests. Seems to function fine on the other tests. Here's the last views_perftests run: https://chromium-swarm.appspot.com/task?id=3bddac4a1d204010&refresh=10&show_raw=1

And below is the output of swarming_log on build31-a9. Note the signal 15.

2138 2018-02-24 03:59:54.925 D: Running command: ['/usr/bin/python', '/b/s/swarming_bot.1.zip', 'task_runner', '--swarming-server', u'https://3340-42bd5c6-dot-chromium-swarm.appspot.com', '--in-file', '/b/s/w/task_runner_in.json', '--out-file', '/b/s/w/task_runner_out.json', '--cost-usd-hour', '0.43909828559', '--start', '1519444794.03', '--bot-file', '/b/s/w/bot_file6dfQeY.json', '--auth-params-file', '/b/s/w/bot_auth_params.json', '--', '--cache', '/b/s/isolated_cache', '--min-free-space', '54752089997', '--named-cache-root', '/b/s/c', '--max-cache-size', '53687091200', '--max-items', '51200']
2138 2018-02-24 04:00:02.351 I: Got signal 15
3598 2018-02-24 04:00:02.767 I: importing bot_main: /b/s/swarming_bot.1.zip, 0c82688798abffb7da74cab18bff90adacc2d8009cf79e05fd924b0f623b45d7
3598 2018-02-24 04:00:02.768 E: Singleton held by
3598 2018-02-24 04:00:02.768 I: bot_main exit code: 1
2138 2018-02-24 04:00:03.553 I: task_runner exit: 0
2138 2018-02-24 04:00:03.616 I: [mmutex] on_after_task releasing mutex
2138 2018-02-24 04:00:03.616 I: [mmutex] Releasing maintenance mutex
2138 2018-02-24 04:00:03.617 I: ts_mon hook_name='on_after_task' pool=u'cores:8|cpu:x86-64-E3-1230_v5|cpu:x86-64-avx2|gpu:10de:1cb3-384.69|inside_docker:0|kvm:1|locale:en_US.UTF-8|machine_type:n1-standard-8|os:Linux|os:Ubuntu-14.04|pool:Chrome-perf|python:2.7.6|server_version:3340-42bd5c6|ssd:1'
2138 2018-02-24 04:00:03.617 I: on_after_task(): 0.001s
2138 2018-02-24 04:00:03.617 I: rmtree(/b/s/w)
2138 2018-02-24 04:00:03.617 D: make_tree_deleteable(/b/s/w)
2138 2018-02-24 04:00:03.618 I: ts_mon hook_name='get_settings' pool=u'cores:8|cpu:x86-64-E3-1230_v5|cpu:x86-64-avx2|gpu:10de:1cb3-384.69|inside_docker:0|kvm:1|locale:en_US.UTF-8|machine_type:n1-standard-8|os:Linux|os:Ubuntu-14.04|pool:Chrome-perf|python:2.7.6|server_version:3340-42bd5c6|ssd:1'
2138 2018-02-24 04:00:03.619 I: get_settings(): 0s
2138 2018-02-24 04:00:03.619 I: Running: ['/usr/bin/python', '/b/s/swarming_bot.1.zip', 'run_isolated', '--clean', '--log-file', '/b/s/logs/run_isolated.log', '--cache', '/b/s/isolated_cache', '--min-free-space', '54752089997', '--named-cache-root', '/b/s/c', '--max-cache-size', '53687091200', '--max-items', '51200']
2138 2018-02-24 04:00:03.845 I: Result:
2138 2018-02-24 04:00:04.244 D: "POST /swarming/api/v1/bot/event HTTP/1.1" 200 2
2138 2018-02-24 04:00:04.244 D: Request https://chromium-swarm.appspot.com/swarming/api/v1/bot/event succeeded
2138 2018-02-24 04:00:04.245 I: on_bot_shutdown(): 0s
2138 2018-02-24 04:00:04.245 I: main() returning
2138 2018-02-24 04:00:04.245 I: bot_main exit code: 0

Looking at the isolate inputs: https://isolateserver.appspot.com/browse?namespace=default-gzip&hash=df2fe553b9b98782f2e6f47994258b23f9967bd1

It's trying to run the test using xvfb? That seems wrong?
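For anyone scanning a longer swarming log for the same symptom, here is a minimal sketch that picks out the "Got signal N" entries and names the signal (15 is SIGTERM on Linux, i.e. something asked the bot process to terminate). The line layout is only inferred from the excerpt in this comment, and `find_signals` is a hypothetical helper, not part of the swarming bot code.

```python
import re
import signal

# Assumed layout, inferred from the log excerpt in this comment:
# "<pid> <YYYY-MM-DD> <HH:MM:SS.mmm> <level>: <message>"
LOG_LINE = re.compile(
    r"^(?P<pid>\d+) (?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) (?P<level>[A-Z]): (?P<message>.*)$"
)
SIGNAL_MSG = re.compile(r"^Got signal (\d+)$")


def find_signals(lines):
    """Return (pid, timestamp, signal_name) for each 'Got signal N' entry."""
    hits = []
    for line in lines:
        entry = LOG_LINE.match(line.strip())
        if entry is None:
            continue  # not a log line in the assumed layout
        sig = SIGNAL_MSG.match(entry.group("message"))
        if sig is None:
            continue
        number = int(sig.group(1))
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "signal %d" % number  # number unknown on this platform
        timestamp = entry.group("date") + " " + entry.group("time")
        hits.append((int(entry.group("pid")), timestamp, name))
    return hits


# Three lines copied from the log above.
sample = [
    "2138 2018-02-24 04:00:02.351 I: Got signal 15",
    "3598 2018-02-24 04:00:02.768 I: bot_main exit code: 1",
    "2138 2018-02-24 04:00:04.245 I: bot_main exit code: 0",
]
print(find_signals(sample))
```

This only tells you that the bot received a termination signal and when, not who sent it.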
,
Mar 13 2018
Ned, do you have any idea if xvfb might be a problem here? build31-a9 is failing *again* right now - the bot appears to be going purple - and I found this bug searching back through time.
,
Mar 13 2018
,
Mar 14 2018
If xvfb is a problem for this bot, it should also be a problem for the other Linux bots. Can Labs look into what's wrong with this machine and consider replacing it? Just rebooting the machine hasn't worked well so far.
,
Mar 14 2018
This is making us lose a lot of perf data on Linux, so I'm bumping up the priority.
,
Mar 14 2018
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome-golo/chrome-golo/+/93b52afcd3182becce92a4b516f7bc3bc82cfda8

commit 93b52afcd3182becce92a4b516f7bc3bc82cfda8
Author: Peter Schmidt <pschmidt@google.com>
Date: Wed Mar 14 18:38:13 2018
,
Mar 14 2018
build31-a9 has been replaced. Both host and video card.
,
Mar 16 2018
Ugh. This machine is still failing. I'm going to SSH into the machine and look at the swarming logs to see if anything there stands out as bad.
,
Mar 16 2018
Issue 821005 has been merged into this issue.
,
Mar 16 2018
Issue 822924 has been merged into this issue.
,
Mar 19 2018
Looks like this machine is still failing. Charlie, did you have any luck with your SSHing last week?
,
Mar 23 2018
OK, so we have identified that it is views_perftests that is killing the bot. We are going to disable it and reboot until we can resolve the xvfb issue.

crrev.com/828226 landed on February 8th. Here is the build: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf/builds/2420 and the bot went down in the next build: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf/builds/2421

There was a recent bug on Linux to do with xvfb, and we decided that we shouldn't need to run our tests using xvfb anymore. See the discussion on crbug.com/822479.

I would just remove the flag from views_perftests, but Trent, you indicated (and did some refactoring to make sure) that views_perftests could run with xvfb. See crrev.com/c/848466. Trent, can you speak to why this flag is necessary? The Linux configuration on the perf waterfall is running on bare metal machines, so I don't think you need this flag to run the test.
,
Mar 23 2018
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/04830e0c2322b02b9d3a6bc4cfc7d2988043eae9

commit 04830e0c2322b02b9d3a6bc4cfc7d2988043eae9
Author: Emily Hanley <eyaich@google.com>
Date: Fri Mar 23 18:44:46 2018

Disabling views_perftests on the chromium.perf linux bot

Bug: 811766
Change-Id: Ia95dc3a6639ed4334243c4476930017aef2d560e
Reviewed-on: https://chromium-review.googlesource.com/978447
Reviewed-by: Ashley Enstad <ashleymarie@chromium.org>
Commit-Queue: Emily Hanley <eyaich@chromium.org>
Cr-Commit-Position: refs/heads/master@{#545534}

[modify] https://crrev.com/04830e0c2322b02b9d3a6bc4cfc7d2988043eae9/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/04830e0c2322b02b9d3a6bc4cfc7d2988043eae9/tools/perf/core/perf_data_generator.py
,
Mar 26 2018
,
Mar 26 2018
,
Mar 26 2018
> Trent can you speak to why this flag is necessary? The linux configuration on the perf waterfall is running on bare metal machines so I don't think you need this flag to run the te

Hey! Sorry about this - I had no idea stuff was breaking. (And thank you for troubleshooting!)

xvfb was required for the test to pass through the trybots -- see failures on https://chromium-review.googlesource.com/c/chromium/src/+/848466/4

Without xvfb the test just emits

[13387:13387:0109/225313.160530:12529997041:ERROR:gl_surface_osmesa_x11.cc(38)] XOpenDisplay failed.
[13387:13387:0109/225313.160564:12529997074:ERROR:gl_initializer_x11.cc(163)] GLSurfaceOSMesaX11::InitializeOneOff failed.
[13387:13387:0109/225313.160587:12529997096:FATAL:gl_surface_test_support.cc(80)] Check failed: init::InitializeGLOneOffImplementation( impl, fallback_to_software_gl, gpu_service_logging, disable_gl_drawing, init_extensions).

and dies. - https://chromium-swarm.appspot.com/task?id=3af7a8d485b6af10&refresh=10&show_raw=1

Passing --xvfb fixed the issue, and all the regular swarming trybots seemed fine with it.

Should (can?) the xvfb wrapper script just detect "bare metal" and ignore the xvfb flag? That seems like a more robust way to fix this issue versus additional args juggling in places like chromium.perf.json
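To make the "detect and ignore" idea concrete, here is a rough sketch of what such a check might look like. It is purely illustrative: `should_start_xvfb` is a hypothetical helper, not the actual xvfb wrapper script, and treating a non-empty DISPLAY as "a real X server is already available" is an assumption that may not hold on every bot configuration.

```python
import os


def should_start_xvfb(xvfb_requested, env=None):
    """Decide whether a wrapper should launch Xvfb (hypothetical helper).

    xvfb_requested: True if the caller passed the --xvfb flag.
    env: environment mapping to inspect; defaults to os.environ.
    """
    if env is None:
        env = os.environ
    if not xvfb_requested:
        return False
    # Assumption: a non-empty DISPLAY means an X server is already running
    # (e.g. on a bare metal bot), so the flag would be ignored.
    return not env.get("DISPLAY")


# Trybot-like case: flag passed, no display available.
print(should_start_xvfb(True, {}))
# Bare-metal-like case: flag passed, but a display already exists.
print(should_start_xvfb(True, {"DISPLAY": ":0"}))
```

A stale or misconfigured DISPLAY would defeat this heuristic, so it would need checking against each Linux bot type before anything relied on it.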
,
Mar 27 2018
I agree that a more robust solution would be to detect this on the bots themselves, but I am not sure it is possible to know that when the test is running. For now I suggest we take that flag out of the isolate map so that not every run of views_perftests has to execute with it.
,
Mar 27 2018
It could be difficult to auto-detect whether the --xvfb flag should be ignored on all of the Linux-based bot types the Chromium project supports. There are a lot of different GPU and bot configurations on the waterfall. Our team has also seen situations on the bare metal bots where other test suites were causing the Swarming server to be killed and the bot to go offline. See for example Issue 763498, and the fact that even gpu_unittests was recently found to kill our Linux bots: https://cs.chromium.org/chromium/src/content/test/gpu/generate_buildbot_json.py?q=generate_buildbot_&sq=package:chromium&l=1598 We didn't get to the bottom of all of these yet. In this situation I think the less magic the better so that we know exactly what's going on on the bots. Emily, let me assign this back to you.
,
Jul 5
Comment 1 by aga...@chromium.org, Feb 15 2018
Status: Fixed (was: Available)