Many tests failing on Linux Perf bot
Issue description

Many tests are failing on chromium.perf/Linux Perf at the same time. This problem started as purple bots, then became red.

Failing benchmarks:
loading.desktop
memory.long_running_idle_gmail_background_tbmv2
memory.long_running_idle_gmail_tbmv2
page_cycler_v2.basic_oopif
page_cycler_v2_site_isolation.basic_oopif
rasterize_and_record_micro.top_25
smoothness.gpu_rasterization.tough_filters_cases
smoothness.gpu_rasterization.tough_path_rendering_cases
smoothness.key_desktop_move_cases
smoothness.top_25_smooth
smoothness.tough_filters_cases
smoothness.tough_path_rendering_cases
speedometer
system_health.common_desktop
system_health.memory_desktop
tab_switching.typical_25
tracing.tracing_with_background_memory_infra
tracing.tracing_with_debug_overhead
v8.browsing_desktop
v8.detached_context_age_in_gc
v8.infinite_scroll_tbmv2
v8.runtime_stats.top_25
v8.runtimestats.browsing_desktop

Builders failed on:
- Linux Perf: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf
,
Jun 13 2017
,
Jun 13 2017
=== BISECT JOB RESULTS ===
No test failure found.

Bisect Details
Configuration: linux_perf_bisect
Benchmark: loading.desktop
Metric: benchmark_duration/benchmark_duration

Revision          Exit Code   N
chromium@477411   0 +- N/A    5   good
chromium@477510   0 +- N/A    5   bad

To Run This Test:
src/tools/perf/run_benchmark -v --browser=release --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests loading.desktop

Debug Info: https://chromeperf.appspot.com/buildbucket_job_status/8976961542213415296
Is this bisect wrong? https://chromeperf.appspot.com/bad_bisect?try_job_id=6343705130172416

Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq for more information on addressing perf regression bugs. For feedback, file a bug with component Speed>Bisection. Thank you!
,
Jun 13 2017
Stephen, bisect says there is no test failure in the range. Can you check if it is something related to swarming or infra? I will assign this to you for now. Thanks! I also just saw that this bot had a build interrupted by an exception: https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf/builds/775
,
Jun 13 2017
The interrupted failures were because of network issues, which seem to have been resolved. See bug 732859 for more details. I'll look into these test failures. I ran them on my linux box, and wasn't able to reproduce, so something might have happened with the bots.
,
Jun 13 2017
I'll try to run the tests on the bot itself today.
,
Jun 14 2017
Ok, I was able to remote desktop into the bot. It looks like there's some Ubuntu keyring thing that's causing the tests to fail. I manually ran https://chromium-swarm.appspot.com/task?id=36bcbc364067b510&refresh=10&show_raw=1 while I was remote desktop-ed into the bot. I saw chrome pop up, and then a prompt which said something like "You need to set up your default keyring. Please enter a password." I waited until the first story in the benchmark (there are 2 stories) failed, and then entered a password. I didn't see what exactly happened, but the second story succeeded, and then the bot rebooted (swarming reboots bots on test failure) and I lost my remote desktop connection. I retried the task (https://chromium-swarm.appspot.com/task?id=36bcc167780acd10&refresh=10&show_raw=1). This time, it asked for a password for the keyring. I entered the password, and both stories passed. So I think it has something to do with the keyring. I'm not sure exactly how to disable the keyring... any ideas?
,
Jun 14 2017
+nednguyen Ned, I vaguely remember that we encountered similar issues before. Do you have an idea?
,
Jun 14 2017
,
Jun 14 2017
Dave: I remember we have a script to disable keyring. Do you recall it?
,
Jun 14 2017
How about trying https://askubuntu.com/a/31801?
,
Jun 14 2017
An option is to delete ~/.local/share/keyrings/ as part of bot_config.py. FWIU it'd fix the problem permanently.
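For reference, a minimal sketch of that cleanup (the idea that it runs as a pre-task step in bot_config.py is an assumption; the path is the stock per-user keyring location):

    # Remove any saved GNOME keyrings so Chrome can't find a locked default keyring.
    rm -rf "$HOME/.local/share/keyrings/"

The assumption is that the prompt comes from a stale, password-protected default keyring; with the directory gone, the next session would start from a clean slate.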
,
Jun 14 2017
Deleting ~/.local/share/keyrings/ didn't seem to work.
,
Jun 14 2017
I want something that doesn't require any manual intervention on our behalf. I think I could go set up the keyrings on all the machines manually, but that's a lot of effort, and very manual. As an aside, these bots seem to be in a bad state. They appear to have not been set up correctly for swarming; missing packages, etc. This is probably because we converted the bots from buildbot to swarming. We re-set up build151-m1 yesterday, before I was able to fix it. We'd have to re-set up all the other bots in order to apply the same fix. We could do that, but it's a fair bit of manual work for the labs people.
,
Jun 14 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/c4bad68e7d561a462a5f2bc7f723f4bc2fd472dd commit c4bad68e7d561a462a5f2bc7f723f4bc2fd472dd Author: Stephen Martinis <martiniss@google.com> Date: Wed Jun 14 19:16:44 2017
,
Jun 16 2017
Ok, I landed the change to delete the keyring environment variables from the bots. It didn't seem to help. (It took a while to deploy successfully, but it is deployed now, and the bots are still failing)
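For context, these are roughly the session variables gnome-keyring-daemon exports; a hedged sketch of what clearing them before launching a task might look like (the exact variable set on the bots is an assumption):

    # Stop Chrome from finding a running keyring daemon via the environment.
    unset GNOME_KEYRING_CONTROL GNOME_KEYRING_PID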
,
Jun 19 2017
I think the next step is to ask for a new bare metal Linux machine, and see if the issue still happens on that bot. These bots are somewhat misconfigured (see #14), so that could be contributing. Although someone on labs re-set up build151-m1, and it's still having issues. I'll also email chromium-dev@ at some point and see if anyone has ideas. We might also be able to disable the gnome keyring agent with a command line flag, I believe. I'm a bit unclear on exactly how to change that, but I remember seeing a way to do it somewhere.
,
Jun 20 2017
+Ken: do you know how the gpu linux bots avoid the keyring prompt?
,
Jun 20 2017
At this point I don't know. The Labs team might. Not sure how the GPU bots auto-login, either. See Issue 371600 for a related problem that happened a long time ago on the GPU bots, but I don't think that step is running any more on the GPU bots. There seems to be a use_gnome_keyring build variable. Should we be setting it to false? https://cs.chromium.org/chromium/src/components/os_crypt/features.gni?type=cs&q=use_gnome_keyring&l=9
,
Jun 20 2017
Oh, interesting. I can try setting that build variable, and see what happens.
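A minimal sketch of what trying that locally might look like (the output directory name is arbitrary; the arg itself is the use_gnome_keyring flag from components/os_crypt/features.gni):

    # Generate a release build with the GNOME keyring integration compiled out.
    gn gen out/Release --args='is_debug=false use_gnome_keyring=false'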
,
Jun 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b81f02670b27ed301c21df6da6b0a1753bcbd219 commit b81f02670b27ed301c21df6da6b0a1753bcbd219 Author: Stephen Martinis <martiniss@chromium.org> Date: Wed Jun 21 08:23:20 2017 Add new gn arg for Linux builds on chromium.perf It appears there's a way to disable using the gnome keyring service on linux, which has been causing our bots issues. Try setting this gn arg to see what happens. Bug: 732463 Change-Id: Iaa1ff400e506ae6245346741d4bcdeaaa695fd4f Reviewed-on: https://chromium-review.googlesource.com/542015 Reviewed-by: Kenneth Russell <kbr@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Commit-Queue: Stephen Martinis <martiniss@chromium.org> Cr-Commit-Position: refs/heads/master@{#481143} [modify] https://crrev.com/b81f02670b27ed301c21df6da6b0a1753bcbd219/tools/mb/mb_config.pyl
,
Jun 21 2017
Issue 735441 has been merged into this issue.
,
Jun 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/31c39d8d4b952996ae119d2663f05d3d5150e592 commit 31c39d8d4b952996ae119d2663f05d3d5150e592 Author: Stephen Martinis <martiniss@chromium.org> Date: Wed Jun 21 15:40:33 2017 Revert "Add new gn arg for Linux builds on chromium.perf" This reverts commit b81f02670b27ed301c21df6da6b0a1753bcbd219. Reason for revert: Might have totally broken the builder. Original change's description: > Add new gn arg for Linux builds on chromium.perf > > It appears there's a way to disable using the gnome keyring > service on linux, which has been causing our bots issues. Try setting > this gn arg to see what happens. > > Bug: 732463 > Change-Id: Iaa1ff400e506ae6245346741d4bcdeaaa695fd4f > Reviewed-on: https://chromium-review.googlesource.com/542015 > Reviewed-by: Kenneth Russell <kbr@chromium.org> > Reviewed-by: Dirk Pranke <dpranke@chromium.org> > Commit-Queue: Stephen Martinis <martiniss@chromium.org> > Cr-Commit-Position: refs/heads/master@{#481143} TBR=dpranke@chromium.org,kbr@chromium.org,nednguyen@google.com,martiniss@chromium.org Change-Id: I96533c8eb48da571ff289834a478f9b896dc414a No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 732463 Reviewed-on: https://chromium-review.googlesource.com/543537 Reviewed-by: Stephen Martinis <martiniss@chromium.org> Commit-Queue: Stephen Martinis <martiniss@chromium.org> Cr-Commit-Position: refs/heads/master@{#481213} [modify] https://crrev.com/31c39d8d4b952996ae119d2663f05d3d5150e592/tools/mb/mb_config.pyl
,
Jun 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/14072aff84d5ba7298a8e61be8dba95fa209c112 commit 14072aff84d5ba7298a8e61be8dba95fa209c112 Author: Stephen Martinis <martiniss@chromium.org> Date: Thu Jun 22 21:14:48 2017 Revert "Revert "Add new gn arg for Linux builds on chromium.perf"" This reverts commit 31c39d8d4b952996ae119d2663f05d3d5150e592. Reason for revert: Trying again. Original change's description: > Revert "Add new gn arg for Linux builds on chromium.perf" > > This reverts commit b81f02670b27ed301c21df6da6b0a1753bcbd219. > > Reason for revert: Might have totally broken the builder. > > Original change's description: > > Add new gn arg for Linux builds on chromium.perf > > > > It appears there's a way to disable using the gnome keyring > > service on linux, which has been causing our bots issues. Try setting > > this gn arg to see what happens. > > > > Bug: 732463 > > Change-Id: Iaa1ff400e506ae6245346741d4bcdeaaa695fd4f > > Reviewed-on: https://chromium-review.googlesource.com/542015 > > Reviewed-by: Kenneth Russell <kbr@chromium.org> > > Reviewed-by: Dirk Pranke <dpranke@chromium.org> > > Commit-Queue: Stephen Martinis <martiniss@chromium.org> > > Cr-Commit-Position: refs/heads/master@{#481143} > > TBR=dpranke@chromium.org,kbr@chromium.org,nednguyen@google.com,martiniss@chromium.org > > Change-Id: I96533c8eb48da571ff289834a478f9b896dc414a > No-Presubmit: true > No-Tree-Checks: true > No-Try: true > Bug: 732463 > Reviewed-on: https://chromium-review.googlesource.com/543537 > Reviewed-by: Stephen Martinis <martiniss@chromium.org> > Commit-Queue: Stephen Martinis <martiniss@chromium.org> > Cr-Commit-Position: refs/heads/master@{#481213} TBR=dpranke@chromium.org,kbr@chromium.org,nednguyen@google.com,martiniss@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 732463 Change-Id: I0cf5ee9662adaffcc629832d19c8fcbab0b0c834 Reviewed-on: https://chromium-review.googlesource.com/544686 Reviewed-by: Stephen Martinis <martiniss@chromium.org> Commit-Queue: Stephen Martinis <martiniss@chromium.org> Cr-Commit-Position: refs/heads/master@{#481661} [modify] https://crrev.com/14072aff84d5ba7298a8e61be8dba95fa209c112/tools/mb/mb_config.pyl
,
Jun 23 2017
Ok, the gn arg didn't solve anything. I manually ran a build on the bot, and it still seems to fail. :( :( :(
,
Jun 26 2017
Can we ask the Labs team how they've configured the Linux NVIDIA GPU bots? These don't have this keyring problem.
,
Jun 27 2017
Re: #26. Looks like these were converted from std. server to desktop in https://bugs.chromium.org/p/chromium/issues/detail?id=667773 whereas the gpu ones were kickstarted directly to the desktop config. But I doubt this is the problem. Is one of the tests tickling this? Maybe starting chrome with --password-store=basic? Attached is a screenshot of build151-m1.
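If we want to test the --password-store=basic theory without rebuilding, something like this should do it for a single benchmark run (assuming --extra-browser-args gets plumbed through to the browser as usual):

    # Run one failing benchmark while telling Chrome to use the plain-text password
    # store instead of the GNOME keyring.
    src/tools/perf/run_benchmark -v --browser=release \
        --extra-browser-args="--password-store=basic" loading.desktop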
,
Jul 2 2017
https://codereview.chromium.org/2968923002/ disables all the tests on this builder. Can the lab reconfigure the whole Linux Perf builder the same way we configure the GPU bots? I think we can try with 1 machine first and see whether it fixes the keyring problem.
,
Jul 3 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f771ff3d4d3cf2a2d495a692a268c14d0654a02a commit f771ff3d4d3cf2a2d495a692a268c14d0654a02a Author: nednguyen <nednguyen@google.com> Date: Mon Jul 03 01:35:31 2017 Deschedule all benchmarks on Linux Perf builder The Linux Perf waterfall is currently in a very broken states: _ 41 tests fail due to a broken machine. _ Many other telemetry tests are failing due to keyring pop up during test. _ The bots are suspected to be misconfigured to begin with (https://bugs.chromium.org/p/chromium/issues/detail?id=732463#c17) To avoid unnecessary burden on bot health sheriff on this broken builder, I decide to deschedule all the tests in this builder until lab properly reconfigure all the machines. (Note: as with this change, I also need to schedule net_perftest on a Mac builder to avoid missing coverage of this test) BUG= chromium:732463 NOTRY=true # Flake TBR=benhenry@chromium.org, martiniss@chromium.org Review-Url: https://codereview.chromium.org/2968923002 Cr-Commit-Position: refs/heads/master@{#483924} [modify] https://crrev.com/f771ff3d4d3cf2a2d495a692a268c14d0654a02a/testing/buildbot/chromium.perf.json [modify] https://crrev.com/f771ff3d4d3cf2a2d495a692a268c14d0654a02a/tools/perf/core/perf_data_generator.py
,
Jul 6 2017
Re #28. What swarming dimensions should I be using when looking for a candidate to set up?
,
Jul 6 2017
Re #30: From https://codereview.chromium.org/2968923002/, a typical Linux perf bot swarming dimension set:

    "dimension_sets": [
      {
        "gpu": "102b:0534",
        "id": "build150-m1",  // for perf device affinity, you may not need this
        "os": "Ubuntu-14.04",
        "pool": "Chrome-perf"
      }
    ]
,
Jul 6 2017
The full list of bots to change is here: https://chrome-internal.googlesource.com/infradata/config.git/+/master/configs/chromium-swarm/bots.cfg#318 (internal link). These bots used to be on buildbot, and then were migrated to swarming. They seem to be weirdly set up; IIRC, friedman manually re-set up build151-m1, which also ran into the same keyring issues. If you could re-set these up similarly to the GPU bots, that would be great. Thanks Peter!
,
Jul 11 2017
,
Jul 11 2017
build{148..152}-m1 have been redeployed.
,
Jul 11 2017
I retried a task on a bot (https://chromium-swarm.appspot.com/task?id=374c4ac24fd7d110&refresh=10&show_raw=1), and the keyring prompt still showed up. I can make this work for the short term, by logging onto the bots and manually setting the password to nothing. That should work for subsequent runs. I'll double check this. This isn't a very robust solution to the problem though.
,
Jul 11 2017
Giving the bots an empty keyring password does mean they can run successful jobs.
,
Jul 11 2017
What about chmod o-r /usr/bin/gnome-keyring-daemon?
,
Jul 11 2017
I meant chmod o-x. That should stop it from running when chrome-bot logs in.
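Roughly this on each bot, then, though it needs root and is another manual per-machine step (the path assumes the stock Ubuntu package layout):

    # Drop the execute bit for "other" so chrome-bot's session can no longer start the daemon.
    sudo chmod o-x /usr/bin/gnome-keyring-daemon
    # Verify: the mode should now read -rwxr-xr--
    ls -l /usr/bin/gnome-keyring-daemon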
,
Jul 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b0c9bd0c21dd368849f1855e6888449591ceb76e commit b0c9bd0c21dd368849f1855e6888449591ceb76e Author: Stephen Martinis <martiniss@chromium.org> Date: Tue Jul 11 22:36:58 2017 //tools/perf: Reschedule linux tests The bots have been reimaged, and semi fixed. See bug for details TBR=nednguyen@google.com NOTRY=True Bug: 732463 Change-Id: Ibea38565ae4e231bf91803b5feec7a7b59cc68bb Reviewed-on: https://chromium-review.googlesource.com/567560 Reviewed-by: Stephen Martinis <martiniss@chromium.org> Commit-Queue: Stephen Martinis <martiniss@chromium.org> Cr-Commit-Position: refs/heads/master@{#485714} [modify] https://crrev.com/b0c9bd0c21dd368849f1855e6888449591ceb76e/testing/buildbot/chromium.perf.json [modify] https://crrev.com/b0c9bd0c21dd368849f1855e6888449591ceb76e/tools/perf/core/perf_data_generator.py
,
Jul 12 2017
Peter, it looks like bot 'build152-m1' is still dying. Can you check?
,
Jul 12 2017
Restarted it and now it's happy.
,
Jul 12 2017
,
Jul 12 2017
build148-m1 seems to have also died :/
,
Jul 12 2017
build148-m1 did not reboot when asked to pick up the new swarming version. https://chromium-swarm.appspot.com/bot?id=build148-m1&selected=1&sort_stats=total%3Adesc Manually rebooting it.
,
Jul 19 2017
build152-m1 seems to be broken as well :-(
,
Jul 19 2017
build152-m1 is up and it still thinks it is running https://chromium-swarm.appspot.com/task?id=3760b7954ae2f110&refresh=10&show_raw=1. I can reboot it, but before doing so, did you want to take a look at it first?
,
Jul 21 2017
These bots keep on dying, it seems. I manually rebooted build152-m1, and it's running jobs again. I rebooted build149-m1, and it doesn't seem to have come back up. :/
,
Jul 25 2017
,
Jul 27 2017
ping pschmidt@: build148-m1 is dead in the swarming pool (https://chromium-swarm.appspot.com/bot?id=build148-m1&sort_stats=total%3Adesc). Please bring it back up and see if that helps resolve these issues.
,
Jul 27 2017
build148-m1 was wedged. It's now back up.
,
Jul 28 2017
The Linux perf builder looks great now, thanks Peter!
,
Jul 29 2017
And... 'build152-m1' is now dying again... This is about the 5th time it has misbehaved (issue 736593, issue 523301, issue 677972, comment #40). Can we just replace this machine?
,
Jul 31 2017
These bots are failing because of swarming, not because of the machines.
I ssh-ed into build15{2,1}-m1, and found this in both of their task_runner.log.
6502 2017-07-29 03:14:43.382 I: Fetched auth headers (['X-Luci-Machine-Token']), they expire in 17 sec. Next check in 0 sec.
6502 2017-07-29 03:14:43.763 D: "POST /swarming/api/v1/bot/task_update/37a436857a3ebe11 HTTP/1.1" 200 32
6502 2017-07-29 03:14:43.764 D: Request https://2998-3737c64-dot-chromium-swarm.appspot.com/swarming/api/v1/bot/task_update/37a436857a3ebe11 succeeded
6502 2017-07-29 03:14:43.765 D: post_task_update() = {u'ok': True, u'must_stop': False}
6502 2017-07-29 03:14:43.765 D: calc_yield_wait() = 30
6502 2017-07-29 03:15:13.795 D: calc_yield_wait() = 30
6502 2017-07-29 03:15:13.797 I: Fetched auth headers (['X-Luci-Machine-Token']), they expire in -12 sec. Next check in 0 sec.
6502 2017-07-29 03:15:13.862 D: "POST /swarming/api/v1/bot/task_update/37a436857a3ebe11 HTTP/1.1" 401 28
6502 2017-07-29 03:15:13.864 W: Authentication is required for https://2998-3737c64-dot-chromium-swarm.appspot.com/swarming/api/v1/bot/task_update/37a436857a3ebe11 on attempt 0.
401 Client Error: Unauthorized for url: https://2998-3737c64-dot-chromium-swarm.appspot.com/swarming/api/v1/bot/task_update/37a436857a3ebe11
6502 2017-07-29 03:15:13.864 E: Unable to authenticate to https://2998-3737c64-dot-chromium-swarm.appspot.com (401 Client Error: Unauthorized for url: https://2998-3737c64-dot-chromium-swarm.appspot.com/swarming/api/v1/bot/task_update/37a436857a3ebe11).
6502 2017-07-29 03:15:13.864 D: post_task_update() = None
6502 2017-07-29 03:15:13.865 W: SIGTERM finally due to Failed to contact server
(above is from build151-m1).
maruel@ or vadimsh@, do you know what's happening here?
,
Jul 31 2017
maruel@ is OOO today. Re-assigning.
,
Jul 31 2017
I'm looking at build152-m1, but so far I'm puzzled. The process that is supposed to update the X-Luci-Machine-Token is getting stuck while trying to open HTTP connection to the backend server (even before it actually opens it). I'm trying to figure out with strace where it blocks exactly.
,
Jul 31 2017
tl;dr: Labs (or anyone), do you know what's special about build151-m1 and build152-m1 compared to other similar bots? Maybe BIOS version?

So it appears processes on build152-m1 just randomly "freeze". Golang processes are more susceptible than others. A simple program that calls "Sleep(...)" in a loop eventually freezes (pretty fast, in fact). Syslog is full of this stuff:

Jul 31 16:14:21 build152-m1 rtkit-daemon[2023]: The canary thread is apparently starving. Taking action.
Jul 31 16:14:21 build152-m1 rtkit-daemon[2023]: Demoting known real-time threads.
Jul 31 16:14:21 build152-m1 rtkit-daemon[2023]: Successfully demoted thread 2021 of process 2021 (n/a).
Jul 31 16:14:21 build152-m1 rtkit-daemon[2023]: Demoted 1 threads.

kern.log has this as the last entry:

Jul 31 12:30:07 build152-m1 kernel: [299628.194665] WARNING: CPU: 4 PID: 22124 at /build/linux-oR3NJd/linux-3.13.0/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
Jul 31 12:30:07 build152-m1 kernel: [299628.194668] Watchdog detected hard LOCKUP on cpu 4
Jul 31 12:30:07 build152-m1 kernel: [299628.194671] Modules linked in: bnep rfcomm bluetooth nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache joydev x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm dcdbas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich shpchp video ipmi_si mac_hid parport_pc ppdev lp parport hid_generic usbhid hid tg3 ptp ahci libahci pps_core
Jul 31 12:30:07 build152-m1 kernel: [299628.194719] CPU: 4 PID: 22124 Comm: strace Not tainted 3.13.0-123-generic #172-Ubuntu
Jul 31 12:30:07 build152-m1 kernel: [299628.194722] Hardware name: Dell Inc. PowerEdge R220/081N4V, BIOS 1.1.4 05/06/2014
Jul 31 12:30:07 build152-m1 kernel: [299628.194725] 0000000000000000 ffff88065fd05c18 ffffffff8172d219 ffff88065fd05c60
Jul 31 12:30:07 build152-m1 kernel: [299628.194731] 0000000000000009 ffff88065fd05c50 ffffffff8106a76d ffff88063dd58000
Jul 31 12:30:07 build152-m1 kernel: [299628.194736] 0000000000000000 ffff88065fd05d80 0000000000000000 ffff88065fd05ef8
Jul 31 12:30:07 build152-m1 kernel: [299628.194742] Call Trace:
Jul 31 12:30:07 build152-m1 kernel: [299628.194744] <NMI> [<ffffffff8172d219>] dump_stack+0x64/0x82
Jul 31 12:30:07 build152-m1 kernel: [299628.194761] [<ffffffff8106a76d>] warn_slowpath_common+0x7d/0xa0
Jul 31 12:30:07 build152-m1 kernel: [299628.194766] [<ffffffff8106a7dc>] warn_slowpath_fmt+0x4c/0x50
Jul 31 12:30:07 build152-m1 kernel: [299628.194772] [<ffffffff811120d0>] ? restart_watchdog_hrtimer+0x50/0x50
Jul 31 12:30:07 build152-m1 kernel: [299628.194778] [<ffffffff8111216c>] watchdog_overflow_callback+0x9c/0xd0
Jul 31 12:30:07 build152-m1 kernel: [299628.194785] [<ffffffff81149d6e>] __perf_event_overflow+0x8e/0x250
Jul 31 12:30:07 build152-m1 kernel: [299628.194793] [<ffffffff8102a458>] ? x86_perf_event_set_period+0xe8/0x150
Jul 31 12:30:07 build152-m1 kernel: [299628.194798] [<ffffffff8114a864>] perf_event_overflow+0x14/0x20
Jul 31 12:30:07 build152-m1 kernel: [299628.194803] [<ffffffff810318fd>] intel_pmu_handle_irq+0x1ed/0x400
Jul 31 12:30:07 build152-m1 kernel: [299628.194810] [<ffffffff8118ca41>] ? unmap_kernel_range_noflush+0x11/0x20
Jul 31 12:30:07 build152-m1 kernel: [299628.194815] [<ffffffff817372cb>] perf_event_nmi_handler+0x2b/0x50
Jul 31 12:30:07 build152-m1 kernel: [299628.194820] [<ffffffff81736a1a>] nmi_handle.isra.2+0x8a/0x1b0
Jul 31 12:30:07 build152-m1 kernel: [299628.194824] [<ffffffff81736c70>] do_nmi+0x130/0x3e0
Jul 31 12:30:07 build152-m1 kernel: [299628.194831] [<ffffffff81735e8f>] end_repeat_nmi+0x1a/0x1e
Jul 31 12:30:07 build152-m1 kernel: [299628.194839] [<ffffffff8109b007>] ? resched_task+0x17/0x60
Jul 31 12:30:07 build152-m1 kernel: [299628.194844] [<ffffffff8109b007>] ? resched_task+0x17/0x60
Jul 31 12:30:07 build152-m1 kernel: [299628.194849] [<ffffffff8109b007>] ? resched_task+0x17/0x60
Jul 31 12:30:07 build152-m1 kernel: [299628.194851] <<EOE>> [<ffffffff810a79fa>] ? check_preempt_wakeup+0x19a/0x270
Jul 31 12:30:07 build152-m1 kernel: [299628.194863] [<ffffffff8109b9c5>] check_preempt_curr+0x85/0xa0
Jul 31 12:30:07 build152-m1 kernel: [299628.194869] [<ffffffff8109b9f9>] ttwu_do_wakeup+0x19/0xf0
Jul 31 12:30:07 build152-m1 kernel: [299628.194875] [<ffffffff8109bb7d>] ttwu_do_activate.constprop.75+0x5d/0x70
Jul 31 12:30:07 build152-m1 kernel: [299628.194880] [<ffffffff8109e112>] try_to_wake_up+0x1d2/0x2c0
Jul 31 12:30:07 build152-m1 kernel: [299628.194884] [<ffffffff8109e215>] wake_up_process+0x15/0x20
Jul 31 12:30:07 build152-m1 kernel: [299628.194891] [<ffffffff810843d4>] wake_up_worker+0x24/0x30
Jul 31 12:30:07 build152-m1 kernel: [299628.194897] [<ffffffff81084e9b>] insert_work+0x6b/0xb0
Jul 31 12:30:07 build152-m1 kernel: [299628.194904] [<ffffffff810135db>] ? __switch_to+0x16b/0x4f0
Jul 31 12:30:07 build152-m1 kernel: [299628.194909] [<ffffffff8108500e>] __queue_work+0x12e/0x360
Jul 31 12:30:07 build152-m1 kernel: [299628.194915] [<ffffffff81085477>] queue_work_on+0x27/0x50
Jul 31 12:30:07 build152-m1 kernel: [299628.194922] [<ffffffff8145be5b>] tty_schedule_flip+0x2b/0x30
Jul 31 12:30:07 build152-m1 kernel: [299628.194927] [<ffffffff8145be6e>] tty_flip_buffer_push+0xe/0x10
Jul 31 12:30:07 build152-m1 kernel: [299628.194933] [<ffffffff8145db34>] pty_write+0x54/0x60
Jul 31 12:30:07 build152-m1 kernel: [299628.194937] [<ffffffff81455714>] do_output_char+0x194/0x220
Jul 31 12:30:07 build152-m1 kernel: [299628.194941] [<ffffffff814561ac>] n_tty_write+0x22c/0x4f0
Jul 31 12:30:07 build152-m1 kernel: [299628.194946] [<ffffffff8109e240>] ? wake_up_state+0x20/0x20
Jul 31 12:30:07 build152-m1 kernel: [299628.194950] [<ffffffff81452e28>] tty_write+0x148/0x2b0
Jul 31 12:30:07 build152-m1 kernel: [299628.194955] [<ffffffff81455f80>] ? process_echoes+0x70/0x70
Jul 31 12:30:07 build152-m1 kernel: [299628.194962] [<ffffffff811c24b4>] vfs_write+0xb4/0x1f0
Jul 31 12:30:07 build152-m1 kernel: [299628.194968] [<ffffffff811c2ee9>] SyS_write+0x49/0xa0
Jul 31 12:30:07 build152-m1 kernel: [299628.194973] [<ffffffff81075e02>] ? SyS_ptrace+0x112/0x120
Jul 31 12:30:07 build152-m1 kernel: [299628.194980] [<ffffffff8173dd9d>] system_call_fastpath+0x1a/0x1f
Jul 31 12:30:07 build152-m1 kernel: [299628.194983] ---[ end trace 20eaf33f4f31a0a9 ]---

build151-m1 appears to be in a similar state (processes hang). It has the same "The canary thread is apparently starving" spam in syslog, but no "hard LOCKUP" in kern.log :-/ I suspect the hanging processes and "The canary thread is apparently starving" have the same root cause, but I can't find it :( Googling shows that "The canary thread is apparently starving" is also associated with visible system-wide freezes. Some reports also associate it with ACPI power saving options.

I'm curious why these two bots only? What's so special about them? I've compared build151-m1 to build148-m1 (which is similar, but appears to be healthier). Same kernel version, same list of loaded kernel modules :( Could they have a different BIOS version perhaps?

Anyway, rebooting the machines will most likely bring them back online, but they will eventually break again until we figure out what component is causing this. I'm not sure what else I can do here, so marking this as Available.
,
Aug 1 2017
,
Aug 3 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome-golo/chrome-golo/+/c2673450333156a55b8b22baaaff2621bc31c063 commit c2673450333156a55b8b22baaaff2621bc31c063 Author: Peter Schmidt <pschmidt@google.com> Date: Thu Aug 03 19:00:31 2017
,
Aug 3 2017
Agreed that something is very weird about build152-m1. Just migrated it onto another r220.
,
Aug 11 2017
I'm seeing lots of problems with build150-m1 and build151-m1. 152 looks fine after the migration onto another r220 in #59. Should we do the same for 150 and 151?
,
Aug 11 2017
That sounds like a good idea. Peter, can you do the same for 150 and 151?
,
Aug 14 2017
build149-m1 also seems to be having similar issues. Can you migrate that one as well?
,
Aug 21 2017
Just got back from vacation. Will work on this this week.
,
Sep 1 2017
Have you made progress on this, Peter? I'm still seeing build150-m1 offline.
,
Sep 5 2017
The problem on build150-m1 is similar to what was seen on build152-m1 before it was swapped out (kern.log excerpt below). Will migrate build15{0,1}-m1 to replacement hardware.
Aug 27 08:21:52 build150-m1 kernel: [26456.404712] WARNING: CPU: 6 PID: 1678 at /build/linux-oR3NJd/linux-3.13.0/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
Aug 27 08:21:52 build150-m1 kernel: [26456.404713] Watchdog detected hard LOCKUP on cpu 6
Aug 27 08:21:52 build150-m1 kernel: [26456.404714] Modules linked in: bnep rfcomm bluetooth nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache joydev x86_pkg_temp_thermal intel_powerclamp coretemp dcdbas kvm_intel kvm crct10dif_pclmul crc32_pclmul parport_pc ghash_clmulni_intel ppdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich shpchp ipmi_si video mac_hid lp parport hid_generic usbhid hid tg3 ptp ahci libahci pps_core
Aug 27 08:21:52 build150-m1 kernel: [26456.404739] CPU: 6 PID: 1678 Comm: Xorg Not tainted 3.13.0-123-generic #172-Ubuntu
Aug 27 08:21:52 build150-m1 kernel: [26456.404740] Hardware name: Dell Inc. PowerEdge R220/081N4V, BIOS 1.1.4 05/06/2014
Aug 27 08:21:52 build150-m1 kernel: [26456.404741] 0000000000000000 ffff880639b17c78 ffffffff8172d219 ffff880639b17cc0
Aug 27 08:21:52 build150-m1 kernel: [26456.404745] 0000000000000009 ffff880639b17cb0 ffffffff8106a76d ffff88063ddf0000
Aug 27 08:21:52 build150-m1 kernel: [26456.404748] 0000000000000000 ffff880639b17de0 0000000000000000 ffff880639b17f58
Aug 27 08:21:52 build150-m1 kernel: [26456.404751] Call Trace:
Aug 27 08:21:52 build150-m1 kernel: [26456.404756] [<ffffffff8172d219>] dump_stack+0x64/0x82
Aug 27 08:21:52 build150-m1 kernel: [26456.404759] [<ffffffff8106a76d>] warn_slowpath_common+0x7d/0xa0
Aug 27 08:21:52 build150-m1 kernel: [26456.404762] [<ffffffff8106a7dc>] warn_slowpath_fmt+0x4c/0x50
Aug 27 08:21:52 build150-m1 kernel: [26456.404764] [<ffffffff811120d0>] ? restart_watchdog_hrtimer+0x50/0x50
Aug 27 08:21:52 build150-m1 kernel: [26456.404767] [<ffffffff8111216c>] watchdog_overflow_callback+0x9c/0xd0
Aug 27 08:21:52 build150-m1 kernel: [26456.404770] [<ffffffff81149d6e>] __perf_event_overflow+0x8e/0x250
Aug 27 08:21:52 build150-m1 kernel: [26456.404774] [<ffffffff8102a458>] ? x86_perf_event_set_period+0xe8/0x150
Aug 27 08:21:52 build150-m1 kernel: [26456.404776] [<ffffffff8114a864>] perf_event_overflow+0x14/0x20
Aug 27 08:21:52 build150-m1 kernel: [26456.404779] [<ffffffff810318fd>] intel_pmu_handle_irq+0x1ed/0x400
Aug 27 08:21:52 build150-m1 kernel: [26456.404783] [<ffffffff81373ec0>] ? timerqueue_add+0x60/0xb0
Aug 27 08:21:52 build150-m1 kernel: [26456.404785] [<ffffffff817372cb>] perf_event_nmi_handler+0x2b/0x50
Aug 27 08:21:52 build150-m1 kernel: [26456.404787] [<ffffffff81736a1a>] nmi_handle.isra.2+0x8a/0x1b0
Aug 27 08:21:52 build150-m1 kernel: [26456.404789] [<ffffffff81736c70>] do_nmi+0x130/0x3e0
Aug 27 08:21:52 build150-m1 kernel: [26456.404792] [<ffffffff81735daa>] nmi+0x5a/0xbf
Aug 27 08:21:52 build150-m1 kernel: [26456.404794] ---[ end trace 6f56a77eda4cb95a ]---
,
Sep 5 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome-golo/chrome-golo/+/aa9f0e0a2064572c684f99a0487d5d66cd21689c commit aa9f0e0a2064572c684f99a0487d5d66cd21689c Author: Peter Schmidt <pschmidt@google.com> Date: Tue Sep 05 21:53:59 2017
,
Sep 6 2017
build15{0,1}-m1 have been migrated to new hardware.
,
Sep 6 2017
Thanks Peter! build148-m1 is also busted, can you migrate it too?
,
Sep 6 2017
I'm going to re-image that one in place first.
,
Sep 6 2017
Verified that build148-m1 is exhibiting the same hard cpu lockup issue, and build149-m1 as well. I'm having a hard time believing this is a system hardware issue.
Granted build15{0..2} have been migrated to different r220 servers in the same rack and they have been stable since, but it's early days. There is a minor kernel bump that came with the re-image.
build148-m1 has been re-imaged in place. The latest lts-xenial (4.4) kernel has been applied to build149-m1.
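For the record, on a 14.04 host that kernel bump is normally just the hardware-enablement stack; a hedged sketch (exact package name assumed, verify against the lab's imaging scripts):

    # Install the Xenial HWE (4.4) kernel on a Trusty machine, then reboot into it.
    sudo apt-get install linux-generic-lts-xenial
    sudo reboot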
,
Sep 19 2017
Linux, our historically stable bot, has been failing for months (of the last 200 builds, since July 19th, only 19 green runs). Since it is our only linux configuration, I feel it is right to pay some attention to it now. I am filing a sub bug today to try and address a specific problem; please speak up if you are actively looking at this in labs and have a different angle. Peter, I am assigning that sub bug to you as well since you have looked at most of the bots to see if it is a problem with the cpu lockup issue, but I have a hunch that isn't the problem.
,
Sep 19 2017
,
Sep 19 2017
So we are already at P1 here, but I am almost tempted to bump to a P0. This is the last day of my shift, but I am going to share some of my digging in case I don't get back to this, as I think 3 months of no linux regression data is pretty important.

I had a hunch this morning that this wasn't a labs issue, but a swarming or something else issue. Given we swarmed linux a while back (2016 sometime...) and it was pretty stable for a while (I think I can only see back 200 builds), I don't think it has something to do with the initial swarming setup. Therefore I started looking at the bugs filed and started this spreadsheet: https://docs.google.com/spreadsheets/d/1z_H3GeXxJEvIM-bDYy8iTzpptCDnMT89b9LxSp9rQSQ/edit#gid=426110493

From it I was trying to deduce what was outstanding since June 2nd (roughly when this mother bug was filed) and whether the bugs were correlated at all, since this doesn't appear to be due to specific tests, nor to the hardware, but to some setup or software on the machines somewhere. I think vadimsh@ was on to something with the freeze/sleep stuff. I am including Pawel only because we are pretty stumped and we chatted about sleeps this morning and how they aren't good. Anyone else who has any thoughts on what road to try next would be greatly appreciated!
,
Sep 19 2017
To recap, these hosts were initially converted from buildbot to swarming in 11/2016 per crbug.com/667773. CPU lockups started exhibiting on these around 06/2017? Hardware has been replaced but symptoms still exist. Note that these are running headless. The GPU folks are running their linux swarming pool on Nvidia-based R230s (real monitor connection). Would it be worthwhile to spin up a set of swarming hosts equivalent to what the GPU folks are using?
,
Sep 19 2017
#74: Using machines similar to what the GPU team is using SGTM, unless Ken has any concern that these machines are not good for perf.
,
Sep 19 2017
No concerns here. I strongly support standardizing on a known good hardware configuration, and the setup the Labs team has done for the Chrome-GPU Swarmed bots is working well.
,
Sep 20 2017
Although I don't think it hurts to be aligned to what the GPU team is using and to standardize the linux configuration, it does seem unlikely to be a hardware issue at this point. Peter, what is the process for spinning up these hosts? Will they be the same device ids as we currently have so our current swarming triggers will work seamlessly?
,
Sep 20 2017
I had another thought, and maybe MA has insight on this. Was there a change in June that might be forcing the swarming bots to restart more often? Doesn't linux require you to re-enter the default keyring password on every startup? At least it did for me when I returned from leave and was restarting my machine multiple times; I had to keep generating a new one until I went in and reset it to something I remembered. Could this be happening, since right now that entry is so manual? There was a brief green period in July when these were re-imaged and Stephen manually reset these passwords, but did it fail again when these bots were restarted and required that manual intervention again?
,
Sep 20 2017
+ marc-antoine in case he has any insights on the swarming side and anything that might have changed in early summer that might have caused more restarts?
,
Sep 20 2017
Bots restart at the rate specified by "periodic_reboot_secs" in their state. They also restart on task failure (for most configurations).
,
Sep 20 2017
I think we are chasing down two separate issues.
1) Figuring out the random cpu lockups on this headless linux r220 platform (windows-based hosts on the same hardware don't have this issue). The linux gpu swarming slaves also don't exhibit this, so my suggestion is to migrate these perf linux swarming slaves to the same platform as the gpu ones.
2) The gnome keyring password prompt. Btw, I see this on the GPU swarming slaves as well, but they do not see it as a problem?
,
Sep 20 2017
Peter, in your opinion, do you think we are seeing a combination of these two issues contributing to our massive linux failures since June? Maybe I am focusing on the wrong thing, but I thought linux was one of our most stable bots (before I went on leave in January, anyway), and all of a sudden we see massive failures starting in early June. I can't see past the last 200 builds in buildbot, so maybe we had this previously. I think the right move is to align with GPU and see if it helps at all. Ken commented in #19 about the gnome keyring and didn't have any further insights. Maybe we punt on that until after we are aligned with their bot configurations?
,
Sep 20 2017
Currently there are 5 slaves in this pool. Would more help with cycle time?
,
Sep 20 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome-golo/chrome-golo/+/fdb518706cc87b5b8859a12a2b1a7f48b91879af commit fdb518706cc87b5b8859a12a2b1a7f48b91879af Author: Peter Schmidt <pschmidt@google.com> Date: Wed Sep 20 17:41:48 2017
,
Sep 20 2017
We have Q4 and/or 2018 plans to potentially scale this number but for now we are sticking with 5.
,
Sep 21 2017
build{27..31}-a9 are the replacement slaves. They are R230 servers with Nvidia P400 cards and drivers that match https://uberchromegw.corp.google.com/i/chromium.gpu.fyi/builders/Linux%20Release%20(NVIDIA%20Quadro%20P400)
Currently gpu swarming is using Nvidia GT610 cards but they are in the pipeline to get them replaced with P400's ( crbug.com/712469)
They are currently in the Chrome pool. They need to be moved to the Chrome-perf pool.
https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&c=cores&c=cpu&c=gpu&c=kvm&c=locale&c=machine_type&c=pool&c=python&c=ssd&f=cores%3A8&f=cpu%3Ax86&f=cpu%3Ax86-64&f=cpu%3Ax86-64-E3-1230_v5&f=cpu%3Ax86-64-avx2&f=gpu%3ANVIDIA%20(10de)&f=gpu%3A10de%3A1cb3&f=gpu%3A10de%3A1cb3-384.69&f=kvm%3A1&f=locale%3Aen_US.UTF-8&f=machine_type%3An1-standard-8&f=os%3ALinux&f=os%3AUbuntu&f=os%3AUbuntu-14.04&f=pool%3AChrome&f=python%3A2.7.6&f=ssd%3A1&l=100&s=id%3Aasc
,
Sep 21 2017
Stephen, we need to remove the old linux bots and add the new ones in this file: https://chrome-internal.googlesource.com/infradata/config.git/+/master/configs/chromium-swarm/bots.cfg#384 Then we need to restart the perf waterfall for this to take effect. We should also notify the current bot health sheriff that this is happening. Is there anything else we need to consider here in terms of timing this swap? If we check in this change and don't restart right away, what happens to current swarming requests that are targeted at specific linux devices (i.e. build{148..152}-m1) that are being removed? Will these just all fail now? Marc-Antoine, can you speak to the swarming side and how we should time this rollout?
,
Sep 21 2017
We will also need to update our src side json since that is currently hard coded on ids. Sidenote: This brings to light crbug.com/719631 around documentation for how to do this in perf buildbot/swarming land. This knowledge shouldn't rely on one or two people. I realize this will hopefully be changing soon with all efforts in one buildbot step, per story sharding and soft device affinity, but we might wrap up the perf waterfall revamp (crbug.com/739876) first and therefore documenting the current process might be very important.
,
Sep 21 2017
Chatted with maruel@ offline, he noted that we don't have to remove old bots at the same time that we add the new ones. So I see the rollout as follows:
1) add build{27..31}-a9 to https://chrome-internal.googlesource.com/infradata/config.git/+/master/configs/chromium-swarm/bots.cfg#384
2) Update src side config to point to these new bots for linux: https://cs.chromium.org/chromium/src/testing/buildbot/chromium.perf.json?q=chromium.perf.json&sq=package:chromium&l=1
3) Restart the waterfall and make sure it picks up the new linux bots and starts running on them
4) Check in CL to remove the old bots from bots.cfg: build{148..152}-m1
,
Sep 21 2017
I'll replace the bots.
,
Sep 21 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/091e7f65fa90b1c6d8b77e16fb17d91cc1af5a78 commit 091e7f65fa90b1c6d8b77e16fb17d91cc1af5a78 Author: Stephen Martinis <martiniss@google.com> Date: Thu Sep 21 22:28:47 2017
,
Sep 22 2017
Status:
1) Stephen is moving forward with getting the new linux configurations up on the waterfall.
2) Dave is filing a dependent bug so that bisect aligns with the perf waterfall bots.
3) As of yesterday, the existing linux bots seem to have somehow started working, or at least working far better than they have for the last 3 months, though not green. See most recent build: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf/builds/1796
Stephen noted this could be a random occurrence, so we are still planning on proceeding with swapping out the linux bots. It is good to be aligned with the gpu bots as we move forward to improve the waterfall.
,
Sep 22 2017
,
Sep 25 2017
I'm landing https://chromium-review.googlesource.com/c/chromium/src/+/676725, which switches the linux perf builder over to the new bots. I'll monitor it during this week to see if it's more stable.
,
Sep 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8847e91cd91862d25596bb47532bd6a3af4540ed commit 8847e91cd91862d25596bb47532bd6a3af4540ed Author: Stephen Martinis <martiniss@chromium.org> Date: Tue Sep 26 01:44:55 2017 //tools/perf: Switch to new linux perf bots Switches to new bots for the linux perf builder. Bug: 732463 Change-Id: Ied8d923448f40b8295ddfb2fde1adde481666075 Reviewed-on: https://chromium-review.googlesource.com/676725 Commit-Queue: Stephen Martinis <martiniss@chromium.org> Reviewed-by: Ned Nguyen <nednguyen@google.com> Cr-Commit-Position: refs/heads/master@{#504256} [modify] https://crrev.com/8847e91cd91862d25596bb47532bd6a3af4540ed/testing/buildbot/chromium.perf.json [modify] https://crrev.com/8847e91cd91862d25596bb47532bd6a3af4540ed/tools/perf/core/benchmark_sharding_map.json [modify] https://crrev.com/8847e91cd91862d25596bb47532bd6a3af4540ed/tools/perf/core/perf_data_generator.py
,
Sep 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/332db48423f8fdb50be5888b31304a92955a682e commit 332db48423f8fdb50be5888b31304a92955a682e Author: Stephen Martinis <martiniss@chromium.org> Date: Tue Sep 26 19:21:05 2017 //tools/perf: Fix Linux Perf gpu We switched over to a new bot, which has a different gpu. This CL updates the swarming dimensions to have the correct GPU. NOTRY=true Bug: 732463 Change-Id: Ibb4f034ad78e4ec294be679e70bb9f3b0ef5f61c Reviewed-on: https://chromium-review.googlesource.com/685475 Commit-Queue: Stephen Martinis <martiniss@chromium.org> Reviewed-by: Ned Nguyen <nednguyen@google.com> Cr-Commit-Position: refs/heads/master@{#504454} [modify] https://crrev.com/332db48423f8fdb50be5888b31304a92955a682e/testing/buildbot/chromium.perf.json [modify] https://crrev.com/332db48423f8fdb50be5888b31304a92955a682e/tools/perf/core/perf_data_generator.py
,
Sep 26 2017
Tasks are running on the new bots! Will check in on the tasks in a few hours.
,
Sep 27 2017
Not holding my breath, but... https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf Nice work Stephen! Thanks for seeing this through.
,
Oct 2 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome-golo/chrome-golo/+/1da2d59de0f4ceccd9f10d341754ca7fb9ec5a01 commit 1da2d59de0f4ceccd9f10d341754ca7fb9ec5a01 Author: Peter Schmidt <pschmidt@google.com> Date: Mon Oct 02 20:54:29 2017
,
Dec 28 2017