Issue description
We've been repeatedly losing a significant portion of our mac swarming capacity to WindowServer crashes since ~2018-03-22. It looks like, around that time, someone landed something in src that started causing these crashes in browser_tests. We need to identify the culprit here and address it. This is separate from the trooper toil task of powercycling the VMs, so I'm filing it separately. +ellyjones,pinkerton,lindsayw to help w/ triage -- I'm not sure where this should go or who else to loop in. I'd appreciate any insights y'all have.
Showing comments 111 - 210
of 210
Jun 17 2018, Project MemberThe following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/32c0d7bff4c4fca418de9dc2aefbb646f2b563ed commit 32c0d7bff4c4fca418de9dc2aefbb646f2b563ed Author: John Budorick <jbudorick@google.com> Date: Sun Jun 17 23:36:48 2018 Jun 18 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b879ee4c2ab27cb48a4ccb3f335d78142d42a297 commit b879ee4c2ab27cb48a4ccb3f335d78142d42a297 Author: Dirk Pranke <dpranke@chromium.org> Date: Mon Jun 18 02:18:41 2018 Shift Mac 10.12 traffic onto mac minis. With the creation of the new 'Chrome-quarantine' pool we accidentally left the 10.12 'Chrome' pool with no 'gpu: none' machines (i.e., no VMs) to run tasks. This CL shifts all of the tests to physical machines where we should have capacity. TBR=jbudorick@chromium.org BUG= 828031 NOTRY=true Change-Id: I1624d86887e65913330bc11b0cb1b05ddc05e1f9 Reviewed-on: https://chromium-review.googlesource.com/1103846 Commit-Queue: Dirk Pranke <dpranke@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Cr-Commit-Position: refs/heads/master@{#567921} [modify] https://crrev.com/b879ee4c2ab27cb48a4ccb3f335d78142d42a297/testing/buildbot/chromium.mac.json [modify] https://crrev.com/b879ee4c2ab27cb48a4ccb3f335d78142d42a297/testing/buildbot/waterfalls.pyl Jun 18 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9e20e8d054094d786d19602c3f2ff031264c0801 commit 9e20e8d054094d786d19602c3f2ff031264c0801 Author: Dirk Pranke <dpranke@chromium.org> Date: Mon Jun 18 02:46:54 2018 Reenable browser_tests on chromium.mac in quarantine pool. This re-enables just the browser_tests suite on the chromium.mac bots and the matching trybots. The tests on Mac10.13 Tests are still marked as experimental, so failures there won't block the CQ. 
The re-enabled tests will run in the 'Chrome-quarantine' pool, which means that if the tests kill the machines, it should only affect the machines in that pool, and not the machines in the main 'Chrome' pool. The tests are still disabled on chromium.clang, chromium.fyi, and chromium.memory, and viz_browser_tests and surface_sync_browser_tests are still disabled everywhere. If it looks like this approach is working, we can reenable those as well as long as they go into the quarantine pool. R=jbudorick@chromium.org BUG= 828031 NOTRY=true Change-Id: I2e914b68584fd4d718aba246906f38b0724a08f1 Reviewed-on: https://chromium-review.googlesource.com/1103796 Commit-Queue: Dirk Pranke <dpranke@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#567922} [modify] https://crrev.com/9e20e8d054094d786d19602c3f2ff031264c0801/testing/buildbot/chromium.mac.json [modify] https://crrev.com/9e20e8d054094d786d19602c3f2ff031264c0801/testing/buildbot/test_suite_exceptions.pyl Jun 18 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9e8bb711222aaf2af8c77934b0c78532d85cf5d6 commit 9e8bb711222aaf2af8c77934b0c78532d85cf5d6 Author: Dirk Pranke <dpranke@chromium.org> Date: Mon Jun 18 05:02:38 2018 Force gpu=none on quarantined mac browser_tests, remove from 10.10. This sets the gpu=none flag explicitly to get the browser_tests to run on the VMs in the Chrome-quarantine pool on Mac10.12, to avoid any confusion with the default gpu values for the other test suites. This also temporarily turns off the browser_tests on 10.10, since it looks like we're probably too low on capacity there to handle this easily. We'll add more tomorrow and try again. 
TBR=jbudorick@chromium.org BUG= 828031 NOTRY=true Change-Id: Ib948ba17c2bb3b4fb56912a9c612e72fa5e08f11 Reviewed-on: https://chromium-review.googlesource.com/1103938 Commit-Queue: Dirk Pranke <dpranke@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Cr-Commit-Position: refs/heads/master@{#567933} [modify] https://crrev.com/9e8bb711222aaf2af8c77934b0c78532d85cf5d6/testing/buildbot/chromium.mac.json [modify] https://crrev.com/9e8bb711222aaf2af8c77934b0c78532d85cf5d6/testing/buildbot/test_suite_exceptions.pyl Jun 18 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/53e45eae2dce04db714e85d63c3e481329b56a8a commit 53e45eae2dce04db714e85d63c3e481329b56a8a Author: Patrik Höglund <phoglund@chromium.org> Date: Mon Jun 18 07:04:55 2018 Move WebRTC mac tester to the new non-quarantined pool. Chromium has split their macs into a quarantine and non-quarantine pool, so let's move our tests into the non-quarantine pool for now (we're not an active suspect in this investigation). Bug: 828031 TBR=sergiyb@chromium.org Change-Id: Iab8638d4da3bb88738b5817d65003bc996324583 Reviewed-on: https://chromium-review.googlesource.com/1103565 Reviewed-by: Patrik Höglund <phoglund@chromium.org> Commit-Queue: Patrik Höglund <phoglund@chromium.org> [modify] https://crrev.com/53e45eae2dce04db714e85d63c3e481329b56a8a/scripts/slave/recipe_modules/chromium_tests/chromium_webrtc_fyi.py Jun 18 2018,
Putting into the sheriff queue mostly for awareness, since it doesn't look like this is something a regular sheriff could handle. Jun 18 2018,Issue 853460 has been merged into this issue. Issue 853496 has been merged into this issue. Issue 853513 has been merged into this issue. Jun 18 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/81afd133363c0864bc4d3a56b7303a0fe98d225b commit 81afd133363c0864bc4d3a56b7303a0fe98d225b Author: John Budorick <jbudorick@chromium.org> Date: Mon Jun 18 16:52:20 2018 Disable more mac WindowServer killer suspects. Tests in these classes have appeared frequently and near the end of the test log in mac browser_tests tasks that either resulted in BOT_DIED or exhibited a large number of failures due to crashpad (which itself appears to be symptomatic of WindowServer death). Bug: 828031 Change-Id: I37aeefe83a4ade9f69f5cb18ee5e13179fa24b89 Reviewed-on: https://chromium-review.googlesource.com/1104308 Reviewed-by: Dirk Pranke <dpranke@chromium.org> Commit-Queue: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#568040} [modify] https://crrev.com/81afd133363c0864bc4d3a56b7303a0fe98d225b/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter Jun 18 2018,
Downgrading to P1, since it looks like the changes made yesterday have greatly reduced the number of dead VMs, and they're all dying in the quarantine pool. Also, reassigning to jbudorick@ as current trooper. The plan is to try and get us to a point where the VMs are no longer dying when only running the browser_tests (i.e., we will have stopped running any potentially VM-killing tests) and then re-evaluate. At that point, we can potentially shift the browser_tests that don't kill the VMs back into the main pool, just run the browser_tests that do kill things on Minis, and hand this back to devs to figure out what to do w/ the VMs. We can also then re-enable the browser_tests on the other configs (ToT clang, ASan, etc.) and re-enable viz_browser_tests if so desired. Jun 18 2018,I'd say if we re-enable the browser_tests on other configs, and they seem stable, then it would be desired to get viz_browser_tests back up and running. At least on the FYI bots to start. Jun 18 2018, Project MemberThe following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/fe66ac905033f07fc6e75a3f0b338d03a521be23 commit fe66ac905033f07fc6e75a3f0b338d03a521be23 Author: Dirk Pranke <dpranke@chromium.org> Date: Mon Jun 18 20:51:28 2018 Jun 18 2018,Is it possible to specify strict timeout for browser_tests, 20mins or less? Current mac_chromium_rel_ng builder looks to have long running time and large number of pending builds due to lack of capacity in browser_test step. Jun 18 2018,#122: using experimental+low expiration times might work but would be somewhat abusive of the systems in question. dpranke and I are both working on changes to the backing capacity (by both rebalancing capacity into the quarantine pool and recovering dead bots in the quarantine pool). if we're considering that as an escape hatch, we're probably better off just removing the suite. 
Jun 18 2018,
(or dropping the suite below 100% experimental, as we do still want to collect data on what's killing the bots)
Jun 18 2018,
#123: I'm worried that bot capacity will run low because of this issue outside MTV hours. If bots have to be revived manually, people outside MTV will have to wait until MTV hours for that to happen. But if the increased capacity in the quarantine pool is sufficient, it won't be a problem.
Jun 18 2018,
Jun 18 2018,Issue 845723 has been merged into this issue. Jun 18 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/149266a477fa258d019aa59f98a5d95d079934ea commit 149266a477fa258d019aa59f98a5d95d079934ea Author: John Budorick <jbudorick@chromium.org> Date: Mon Jun 18 23:00:10 2018 Mac WindowServer suspects, pt 2. Follow up to crrev.com/c/1104308 Bug: 828031 Change-Id: I6ebb02a2c1d0aae83c8db9d7c3e13e2c154d965a Reviewed-on: https://chromium-review.googlesource.com/1105121 Reviewed-by: Dirk Pranke <dpranke@chromium.org> Commit-Queue: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#568212} [modify] https://crrev.com/149266a477fa258d019aa59f98a5d95d079934ea/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter Jun 18 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/96bff731f1338e6c8971dcc52198b81df63dc760 commit 96bff731f1338e6c8971dcc52198b81df63dc760 Author: John Budorick <jbudorick@chromium.org> Date: Mon Jun 18 23:24:10 2018 Drop browser_tests on mac_chromium_rel_ng to 50% experiment. TBR=dpranke@chromium.org No-Try: true Bug: 828031 Change-Id: Ica36364071464f05fe96b52cf699672617b3bbfa Reviewed-on: https://chromium-review.googlesource.com/1105414 Commit-Queue: John Budorick <jbudorick@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#568228} [modify] https://crrev.com/96bff731f1338e6c8971dcc52198b81df63dc760/testing/buildbot/chromium.mac.json [modify] https://crrev.com/96bff731f1338e6c8971dcc52198b81df63dc760/testing/buildbot/test_suite_exceptions.pyl Jun 19 2018,Reviving 28 VMs in Chrome-quarantine this morning. Will be expanding the suspect list shortly. 
Jun 19 2018, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1a6a0af64d66d42935dad0fdb6afe49d854f540b commit 1a6a0af64d66d42935dad0fdb6afe49d854f540b Author: John Budorick <jbudorick@chromium.org> Date: Tue Jun 19 15:07:48 2018 Mac WindowServer suspects, pt 3. TBR=dpranke@chromium.org Bug: 828031 Change-Id: I0c26116e0222a1d47ad5537a0b171f2fcfa022ef Reviewed-on: https://chromium-review.googlesource.com/1106049 Reviewed-by: John Budorick <jbudorick@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Commit-Queue: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#568452} [modify] https://crrev.com/1a6a0af64d66d42935dad0fdb6afe49d854f540b/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter Jun 21 2018, Project MemberThe following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/5b28a5f82940c11f921a1f7fc9762db59167795f commit 5b28a5f82940c11f921a1f7fc9762db59167795f Author: Dirk Pranke <dpranke@chromium.org> Date: Thu Jun 21 01:45:03 2018 Jun 28 2018,Are there any updates about this issue? This still shows up in the top of SoM as a P0 issue. Jun 28 2018,Not yet. There's no action for sheriffs here; it's in the queue mostly as a PSA for sheriffs. Jul 4 2018,
Jul 6 2018,
Issue 860694 has been merged into this issue. Jul 11 2018,
Issue 862620 has been merged into this issue. Jul 12 2018,just rebooted vm1052-m4 vm1037-m4 vm1043-m4 vm147-m1 vm143-m1 vm1346-m4 vm1277-m4 vm1273-m4 vm1051-m4 vm1056-m4 vm1065-m4 vm1118-m4 vm152-m1 vm1344-m4 vm144-m1 vm1504-m4 vm1333-m4 vm1341-m4 vm1480-m4 vm1334-m4 vm1340-m4 vm1343-m4 vm1024-m4 vm1279-m4 vm1276-m4 vm1347-m4 vm1508-m4 vm150-m1 vm151-m1 vm1331-m4 vm1064-m4 vm1507-m4 vm1348-m4 vm146-m1 vm1066-m4 vm1484-m4 vm1275-m4 vm1124-m4 vm1505-m4 vm1503-m4 vm1278-m4 vm1332-m4 vm149-m1 vm1042-m4 vm145-m1 vm1336-m4 vm1345-m4 vm148-m1 Jul 13 2018,
Issue 863388 has been merged into this issue.
Jul 13 2018,
Issue 863310 has been merged into this issue.
Jul 24 2018,
Issue 866990 has been merged into this issue.
Jul 25 2018,
In r577343, I widened the AreModalAnimationsEnabled() check I mentioned above ( https://bugs.chromium.org/p/chromium/issues/detail?id=828031#c103 ). However, I think only print preview and passwords dialogs are using --disable-modal-animations. Should we try passing this on the command line for browser_tests for all Mac VMs? I don't think bare-metal machines are affected, which is consistent with the observations in Issue 515627. (We should pass it for views_unittests too, except I think it will cause some tests to fail.)
Aug 1,
John - is this still a P0?
Aug 1,
If it's P0, it should block a release. If you do not think this should block a release, change the label and the priority. Aug 2, Project Member
This issue is marked as a release blocker with no milestone associated. Please add an appropriate milestone. All release blocking issues should have milestones associated with them, so that the issue can be tracked and the fixes can be pushed promptly. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Aug 2,
#143, 144: it's probably a P1 at this point, though it's borderline. I think it's clear that this isn't going to block us from shipping stable, so I'm removing RBS. Aug 6,
I'm going to take on investigating this - we have to nail down what is actually wrong here. In general fixing the state of Mac Infra stability will be a big priority for the team in Q3/Q4 so we'll be looking hard at this kind of bot-killing bug.
Aug 8,
Here's where we are on this bug from the Mac team perspective. We need to know:
* What state the WindowServer is in when it crashes (which windows are on the screen, etc)
* Which specific incoming IPC is inducing the crash and where
* What Chrome is doing to cause that IPC to be sent
To find those things out, I propose that we:
1) Figure out a way to reproduce the bot deaths - I imagine that this will involve working with Infra to figure out how to target "vulnerable" bots (which I guess are 10.12 VM bots, for example) with the swarming client, then running the guilty tests from above on those bots.
2) Once we have a reliable way to cause a swarming bot to crash using those tests, log into a specific bot in advance, attach a debugger to WindowServer, and then run the crashing suite. That should cause the WindowServer to crash in controlled circumstances, which will let us gather both the crash stack and the state of the world.
3) Once the WindowServer is stopped at (2), we may also be able to inspect the state of the test Chrome to figure out what it was doing when it sent that IPC.
I will begin experimenting with ways to achieve (1).
Aug 9,
Aug 24,
re #148: status from the ops perspective:
1) We saw failures and window server deaths both on VMs (this bug + issue 825215) as well as on actual hardware (issue 828003), though the VMs certainly seemed to be more frequently affected.
2) To protect capacity for running other tests, we moved upstream browser_tests into a separate swarming pool (Chrome-quarantine) w/ 10.11, 10.12, and 10.13 VMs. (http://shortn/_mNkYarhEYs) We dropped it to 100% experimental at first, then to 50% experimental to keep from blocking the CQ. If we want to experiment, we can drop the suite entirely and launch arbitrary tasks into Chrome-quarantine.
3) I got partway through trying to identify a culprit. Perhaps foolishly, I tried a different strategy for this round than you had tried in round 1 -- rather than strictly bisecting through the suite, I tried to collect tests and suites that most frequently preceded the window server crash. When one pass failed to eliminate the crashes, I dropped the frequency threshold and disabled those tests. A few passes at this failed to catch the culprit and I wound up getting pulled into other things; the current list is here: https://codesearch.chromium.org/chromium/src/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter?rcl=aa6aaad888ee4619f709de684992cac9b5d69075&l=6
4) I've been talking to a couple of folks on my team about https://bugs.chromium.org/p/chromium/issues/detail?id=827420 over the last week or two; we are going to try to implement it sooner rather than later s.t. we can at least get some degree of self-healing for situations like this.
Your way forward sounds reasonable. Targeting the bots in Chrome-quarantine with locally built binaries is a solved problem, though the process is still a bit finicky. I am concerned about getting a reliable repro case. Of course, it's possible that even having a frequent but unreliable repro case -- maybe something in the 10-50% range -- would be sufficient to catch the WindowServer crash in a debugger without too much tedium. Is there anything we could add inside the task (either inside the browser_tests binary or adjacent to it) that would aid in diagnosis?
Aug 24,
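The suspect-collection strategy from point 3 of the Aug 24 status update (count which tests most often show up near the end of logs from tasks that killed a bot, then lower the threshold on each pass) can be sketched roughly as follows. This is only an illustration: the function name, the `tail_length` and `min_fraction` parameters, and the input shape (one list of test names in execution order per dead task) are assumptions, not the tooling that was actually used.

```python
from collections import Counter


def suspect_tests(dead_task_logs, tail_length=20, min_fraction=0.5):
    """Return tests seen in the tail of at least min_fraction of the logs.

    dead_task_logs is a hypothetical input: one list of test names, in
    execution order, for each task that ended in bot death.
    """
    if not dead_task_logs:
        return []
    counts = Counter()
    for tests in dead_task_logs:
        # Count each test at most once per task, looking only at the
        # last tail_length tests that ran before the bot died.
        counts.update(set(tests[-tail_length:]))
    threshold = min_fraction * len(dead_task_logs)
    return sorted(name for name, seen in counts.items() if seen >= threshold)
```

Lowering `min_fraction` between passes corresponds to "dropping the frequency threshold" in the comment; the returned names would still have to be turned into filter-file entries by hand.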
Aug 27,
The NextAction date has arrived: 2018-08-27
Sep 7,
Issue 880072 has been merged into this issue.
Sep 18,
I understand that browser tests currently don't run on the CQ on Mac [1]. I have a browser test that enforces that no TaskPriority::BEST_EFFORT task is on the critical path of startup. I would really like it to run on the CQ on Mac to prevent regressions. Since webui_polymer2_browser_tests do run on the CQ, I wrote a CL that adds my test to webui_polymer2_browser_tests.filter. Is this an acceptable solution? Are there other ways to make a browser test run on the CQ for Mac?
[1] Do they run 50% of the time? https://cs.chromium.org/chromium/src/testing/buildbot/test_suite_exceptions.pyl?l=141&rcl=249c24b1bdbc4ca55ee7176f5638942c76edb2f1
Oct 9,
In a sysdiagnose that Elly captured (https://drive.google.com/file/d/1MMuxInfdwQk822F1Xy2krBUVg5MdKKqo/view?usp=sharing), I notice a few things:
It seems like the ScreenSaverEngine is running(!) at very high CPU:
powermetrics.txt:
Name ID CPU ms/s User% Deadlines (<2 ms, 2-5 ms) Wakeups (Intr, Pkg idle)
com.apple.ScreenSaver.Engine 622 387.29 16.11 15.75
ScreenSaverEngine 6265 392.12 99.29 0.00 0.00 16.30 15.94
com.apple.WindowServer 177 48.77 35.24 34.39
WindowServer 3645 48.95 95.64 21.17 3.15 35.48 34.65
ps.txt:
USER UID PID PPID %CPU %MEM PRI NI VSZ RSS WCHAN TT STAT STARTED TIME COMMAND
_screensaver 203 6265 1 31.3 0.2 46 0 4465260 29592 - ?? Rs Fri12AM 1951:03.36 /System/Library/CoreServices/ScreenSaverEngine.app/Contents/MacOS/ScreenSaverEngine -loginWindow
The ScreenSaverEngine also shows up in spindump.txt:
Thread 0x1ab1d DispatchQueue 1 1001 samples (1-1001) priority 46 (base 46) cpu time 4.140s
1001 start + 1 (libdyld.dylib + 4117) [0x7fff643db015]
1001 main + 348 (ScreenSaverEngine + 4400) [0x1099be130]
1001 -[NSApplication run] + 764 (AppKit + 223365) [0x7fff39ab7885]
1001 -[ScreenSaverApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 135 (ScreenSaverEngine + 5730) [0x1099be662]
1001 -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 (AppKit + 8224308) [0x7fff3a258e34]
1001 _DPSNextEvent + 2085 (AppKit + 268915) [0x7fff39ac2a73]
1001 _BlockUntilNextEventMatchingListInModeWithFilter + 64 (HIToolbox + 194692) [0x7fff3b810884]
1001 ReceiveNextEventCommon + 613 (HIToolbox + 195334) [0x7fff3b810b06]
1001 RunCurrentEventLoopInMode + 286 (HIToolbox + 195990) [0x7fff3b810d96]
1001 CFRunLoopRunSpecific + 487 (CoreFoundation + 530951) [0x7fff3c534a07]
632 __CFRunLoopRun + 2427 (CoreFoundation + 534043) [0x7fff3c53561b]
632 __CFRunLoopDoTimers + 346 (CoreFoundation + 568954) [0x7fff3c53de7a]
629 __CFRunLoopDoTimer + 1108 (CoreFoundation + 570244) [0x7fff3c53e384]
629 __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 20 (CoreFoundation + 571140) [0x7fff3c53e704]
628 __NSFireTimer + 83 (Foundation + 624441) [0x7fff3e6c3739]
621 -[ScreenSaverView _oneStep:] + 153 (ScreenSaver + 30797) [0x7fff482f484d]
621 -[AppleFlurryView animateOneFrame] + 55 (Flurry + 4986) [0x10b5e337a]
322 -[AppleFlurryView gl_display] + 410 (Flurry + 6618) [0x10b5e39da]
322 GLRenderScene + 270 (Flurry + 9384) [0x10b5e44a8]
316 glDrawArrays_IMM_Exec + 719 (GLEngine + 806652) [0x7fff46b19efc]
316 gleFlushAtomicFunc + 15 (GLEngine + 158227) [0x7fff46a7ba13]
163 ??? [0x10c89071f]
83 gldRenderFillQuads + 143 (GLRendererFloat + 58029) [0x7fff46bfb2ad]
79 ??? [0x10c891481]
76 gldMergeScanlines2x2 + 762 (GLRendererFloat + 77434) [0x7fff46bffe7a]
I also see in launchctl-print-system.txt that WindowServer most recently exited with signal 11 (SEGV):
3645 -11 com.apple.WindowServer
Unfortunately nothing in the logs references the death of the WindowServer. They're filled with failures to start com.apple.postfix.master. We should definitely be sure that the screensaver is disabled, though.
Oct 9,
Process: WindowServer [166]
Path: /System/Library/PrivateFrameworks/SkyLight.framework/Versions/A/Resources/WindowServer
Identifier: WindowServer
Version: 600.00 (15)
Code Type: X86-64 (Native)
Parent Process: launchd [1]
Responsible: WindowServer [166]
User ID: 88
Date/Time: 2018-10-08 14:42:22.502 -0700
OS Version: Mac OS X 10.12.6 (16G29)
Report Version: 12
Anonymous UUID: 301AF3C5-2F6D-AFF4-12A3-6D33F7ABF650
Time Awake Since Boot: 20000 seconds
System Integrity Protection: disabled
Crashed Thread: 10 Dispatch queue: com.apple.root.default-qos
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000139f6b800
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [0]
VM Regions Near 0x139f6b800:
    CoreAnimation 0000000139eff000-0000000139f30000 [ 196K] rw-/rwx SM=PRV
-->
    MALLOC_LARGE 0000000139f7f000-0000000139ff9000 [ 488K] rw-/rwx SM=PRV
Thread 0:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x00007fffd2c0c386 semaphore_wait_trap + 10
1 libdispatch.dylib 0x00007fffd2ac7a77 _os_semaphore_wait + 16
2 libdispatch.dylib 0x00007fffd2ab57e9 _dispatch_group_wait_slow + 154
3 com.apple.SkyLight 0x00007fffcef7073f CGXSoftwareComposite + 595
4 com.apple.SkyLight 0x00007fffcef6f5a8 CompositorSW::CompositeLayersToDestination(WSCompositeSourceLayer*, WSCompositeDestination*, unsigned long long) + 212
5 com.apple.SkyLight 0x00007fffcef6fec4 CompositorSW::CreateShadowFromLayers(WSShadowDescription, WSCompositeSourceLayer*, CGXRedrawState*) + 738
6 com.apple.SkyLight 0x00007fffcee5ce7c insertWindowShadowLayers(CGXRedrawState*, CGXWindow*, CGSRegionObject*) + 497
7 com.apple.SkyLight 0x00007fffcee5a359 generate_layers_for_window(CGXRedrawState*, CGXWindow*) + 8414
8 com.apple.SkyLight 0x00007fffcee55584 CGXUpdateDisplay + 15282
9 com.apple.SkyLight 0x00007fffcee51787 update_display_callback(void*, double) + 216
10 com.apple.SkyLight 0x00007fffceeb93cd run_timer_pass + 706
11 com.apple.SkyLight 0x00007fffcef2bc7e CGXRunOneServicesPass + 152
12 com.apple.SkyLight 0x00007fffcef2d444 SLXServer + 3853
13 WindowServer 0x0000000102441dde 0x102441000 + 3550
14 libdyld.dylib 0x00007fffd2ae5235 start + 1
Thread 10 Crashed:: Dispatch queue: com.apple.root.default-qos
0 com.apple.SkyLight 0x00007fffced8588d layer_blit_byte_ANY + 2300
1 com.apple.SkyLight 0x00007fffcef71961 layer_blit_entry + 1520
2 com.apple.SkyLight 0x00007fffcef71368 layer_blit_entry_block + 9
3 libdispatch.dylib 0x00007fffd2aaf8fc _dispatch_client_callout + 8
4 libdispatch.dylib 0x00007fffd2ab18b7 _dispatch_root_queue_drain + 990
5 libdispatch.dylib 0x00007fffd2ab148c _dispatch_worker_thread3 + 99
6 libsystem_pthread.dylib 0x00007fffd2cfe5a2 _pthread_wqthread + 1299
7 libsystem_pthread.dylib 0x00007fffd2cfe07d start_wqthread + 13
Oct 9,
Oh weird. It looks like it's using its own SW fallback. I wonder if disabling the CA renderer (so using our compositor instead) would fix this. I'll give that a try.
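For scanning many sysdiagnose captures, the reading of the `3645 -11 com.apple.WindowServer` entry from launchctl-print-system.txt quoted earlier could be automated along these lines. This is a minimal sketch that assumes, as that comment does, that the three fields are PID, last exit status and service label, and that a negative status is the number of the signal that killed the service. The function name is hypothetical.

```python
import signal


def describe_launchd_status(line):
    """Split a 'PID STATUS LABEL' entry and describe the last exit status."""
    _pid, status, label = line.split(None, 2)
    code = int(status)
    if code < 0:
        # Assumed convention: a negative status is the terminating signal.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "signal %d" % -code
        return label.strip(), "killed by %s" % name
    return label.strip(), "exit code %d" % code
```

Applied to the quoted entry, this reports WindowServer as killed by SIGSEGV, which matches the crash report's "Segmentation fault: 11".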
Oct 9,
(doing that over at https://chromium-review.googlesource.com/c/chromium/src/+/1271356)
Oct 9, Project Member
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0e7afef059c96e617c203c6199220d248e7bef49 commit 0e7afef059c96e617c203c6199220d248e7bef49 Author: Christopher Cameron <ccameron@chromium.org> Date: Tue Oct 09 21:02:15 2018 Blacklist CoreAnimation renderer under VMWare Bug: 828031 Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel Change-Id: Iadff704bd8c0cce0f0267f463aaac8e558968850 Reviewed-on: https://chromium-review.googlesource.com/c/1271356 Reviewed-by: Zhenyao Mo <zmo@chromium.org> Commit-Queue: ccameron <ccameron@chromium.org> Cr-Commit-Position: refs/heads/master@{#598077} [modify] https://crrev.com/0e7afef059c96e617c203c6199220d248e7bef49/gpu/config/gpu_driver_bug_list.json
Oct 10,
Unfortunately we see a bot death with a WindowServer crash at 2018-10-10 021248 (i.e. considerably after 0e7afef059c96e617c203c6199220d248e7bef49 landed) so that change did not fix it. The symptoms are exactly the same as in #157: the main WindowServer thread is doing CGXSoftwareComposite via insertWindowShadowLayers and one of its worker threads crashes in layer_blit_byte_ANY() with an access outside any memory region. On that same bot is another crash from 2018-09-14 with the same symptoms. A random thing, possibly of interest: in all three crash dumps, exactly 391M is allocated to "CG Framebuffers" and around 975M is allocated to the WindowServer. I don't know if this is an odd coincidence or what.
Oct 10,
Our theory of this crash remains the same, then: something that browser_tests is doing crashes WindowServer, which leaves the previous session that chrome-bot is using unable to talk to WindowServer and causes all subsequent runs to fail.
With remote access to the bots I should be able to run browser_tests myself, so I'll try that to see if I can get the crash to happen while I'm looking at it, but the bots do not have debug tools installed so I don't think I'll be able to attach to WindowServer. However, *if* I can get the crash to happen by copying browser_tests to the bot and running it, that will let me iterate on potential fixes quickly. I am currently using vm1504-m4 for this. Oct 10,Googling for "com.apple.SkyLight" finds a few threads where people complain that their window server keeps crashing and that some apps (including chrome) seem to trigger this (example: https://discussions.apple.com/thread/7917358). So it seems this isn't test-only, but something that happens in the wild as well. Oct 10,Running browser_tests over ssh appears not to work - the session does not have a "display" in some sense, so none of the UI code works. Every test crashes trying to create a 0x0 window, and if I pass --window-size, every test crashes trying to fetch the device scale factor. If I hack around *that* by passing --force-device-scale-factor, every test hangs trying to show a warning message box about being unable to load extensions. So, next steps here: 1) Figure out (with jbudorick@) how to run a browser_tests binary I've built on the bot, or otherwise target a specific bot with a browser_tests binary; 2) Figure out (again with jbudorick@) how to install debug tools on a bot I'm targeting; 3) Figure out how to use dtrace (?) to listen to what IPC messages WindowServer is receiving from us, since this might give a clue as to what's going on Oct 10,From chat: I think trying to debug this from the perspective of windowserver may be tricky (we probably send hundreds of IPCs back and forth, and the data are going to be in structs that we'll have no reference for). I'm still very suspicious of https://cs.chromium.org/chromium/src/ui/base/cocoa/constrained_window/constrained_window_animation.mm. 
We did put a bunch of constrained window tests on the test filter, but we may have missed one or two. What if we disable the animations if we're in tests? https://cs.chromium.org/chromium/src/chrome/browser/ui/cocoa/constrained_window/constrained_window_custom_sheet.mm?l=46&rcl=8920325919167a6d832e8bbc377a4d43ea4751c5 already exists, so we could probably do that if getenv("CHROME_HEADLESS") exists too. I'll put up a CL for that.
Oct 10,
It looks like ConstrainedWindowAnimationHide is used outside of just the constrained windows code now: https://cs.chromium.org/chromium/src/ui/views_bridge_mac/bridged_native_widget_impl.mm?l=61&rcl=5b046432f18e81cc58d32191c8a8ad150c8b74ad
Oct 10,
so, #164.1:
1) ./tools/mb/mb.py isolate your/output-directory browser_tests
This will build browser_tests (if necessary) and create .isolate and .isolated files in your output directory.
2) ./tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -s your/output-directory/browser_tests.isolated
This will output a hash for browser_tests. Keep it.
3) ./tools/swarming_client/swarming.py trigger -I isolateserver.appspot.com -S chromium-swarm.appspot.com \
    --cipd-package '.swarming_module:infra/python/cpython/${platform}:version:2.7.14.chromium14' \
    --cipd-package '.swarming_module:infra/tools/luci/logdog/butler/${platform}:git_revision:e1abc57be62d198b5c2f487bfb2fa2d2eb0e867c' \
    --cipd-package '.swarming_module:infra/tools/luci/vpython-native/${platform}:git_revision:b6cdec8586c9f8d3d728b1bc0bd4331330ba66fc' \
    --cipd-package '.swarming_module:infra/tools/luci/vpython/${platform}:git_revision:b6cdec8586c9f8d3d728b1bc0bd4331330ba66fc' \
    -s $HASH_FROM_STEP_2 \
    -d pool Chrome-quarantine \
    [-d id $SPECIFIC_MACHINE] [-d os $SPECIFIC_OS]
This should give you something like:
    To collect results, use:
      swarming.py collect -S https://chromium-swarm.appspot.com 40776f45e73bbf10
    Or visit:
      https://chromium-swarm.appspot.com/user/task/40776f45e73bbf10
Note that:
3.a) this will run the *entire* suite -- i.e., no sharding. If you want to run a specific shard, add
    -e GTEST_SHARD_INDEX $SHARD_NUMBER -e GTEST_TOTAL_SHARDS $TOTAL_SHARDS
to the swarming.py command above.
3.b) this will not run with any other arguments (e.g. the filter file or --gtest_shuffle). If you want to run with those, add
    -- --gtest_shuffle --test-launcher-filter-file=../../testing/buildbot/filters/mac_window_server_killers.browser_tests.filter
Oct 10,
#164.2: I'm not sure we have formal support for that; your best bet might be scp. What tools do you have in mind?
Oct 10,
oh, and #167 will only work with live machines -- dead VMs will not pick up swarming tasks. If you're interested in running a suite on a dead VM, I can try to write up instructions for calling run_isolated directly on the VM, though that's *definitely* not a formally supported thing.
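When triggering many repro attempts against Chrome-quarantine, the step 3 command line could be assembled programmatically. The sketch below only builds the argument list and does not run anything. It uses only the flags quoted in step 3 and notes 3.a/3.b above; the function and parameter names are hypothetical, and the `--cipd-package` values from step 3 would be passed in unchanged through `cipd_packages`. The shard count variable is spelled `GTEST_TOTAL_SHARDS`, which is the name gtest reads.

```python
def build_trigger_command(isolated_hash, pool="Chrome-quarantine", bot_id=None,
                          os_name=None, shard=None, cipd_packages=(),
                          test_args=()):
    """Build the argv for the `swarming.py trigger` call described in step 3."""
    cmd = ["./tools/swarming_client/swarming.py", "trigger",
           "-I", "isolateserver.appspot.com",
           "-S", "chromium-swarm.appspot.com"]
    for package in cipd_packages:
        cmd += ["--cipd-package", package]
    cmd += ["-s", isolated_hash, "-d", "pool", pool]
    if bot_id:
        cmd += ["-d", "id", bot_id]        # target one specific machine
    if os_name:
        cmd += ["-d", "os", os_name]       # or any machine with this OS
    if shard is not None:
        index, total = shard               # note 3.a: run a single shard
        cmd += ["-e", "GTEST_SHARD_INDEX", str(index),
                "-e", "GTEST_TOTAL_SHARDS", str(total)]
    if test_args:
        cmd += ["--"] + list(test_args)    # note 3.b: extra test arguments
    return cmd
```

For example, `build_trigger_command(hash_from_step_2, bot_id="vm1504-m4", shard=(0, 4), test_args=["--gtest_shuffle"])` yields a list that can be handed to `subprocess.run`, assuming the working directory is the src checkout.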
Oct 10, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6f098c8faaa897f4ade74352c40e00b3993d3d8f commit 6f098c8faaa897f4ade74352c40e00b3993d3d8f Author: Robert Sesek <rsesek@chromium.org> Date: Wed Oct 10 16:11:52 2018 mac: Disable ConstrainedWindowAnimations when running CHROME_HEADLESS. We suspect that the private CoreGraphics APIs used by these animations may be leading to WindowServer crashes. This uses simple window alpha animations instead of those when running with CHROME_HEADLESS. Bug: 828031 Change-Id: Ica1470315d7be946773f496647d65afc8759437c Reviewed-on: https://chromium-review.googlesource.com/c/1273560 Reviewed-by: Elly Fong-Jones <ellyjones@chromium.org> Commit-Queue: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#598353} [modify] https://crrev.com/6f098c8faaa897f4ade74352c40e00b3993d3d8f/ui/base/cocoa/constrained_window/constrained_window_animation.mm Oct 10, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/043b30e131d41857a48378031d3a099c228d19d1 commit 043b30e131d41857a48378031d3a099c228d19d1 Author: John Budorick <jbudorick@chromium.org> Date: Wed Oct 10 17:51:25 2018 Stop using browser_tests WindowServer killer filter on mac. Should increase the rate of bot death if the underlying issue hasn't been fixed. 
Bug: 828031
Change-Id: I046ab67d4fd43cfb43236b570039d145ccc86c08
Reviewed-on: https://chromium-review.googlesource.com/c/1273617
Reviewed-by: Robert Sesek <rsesek@chromium.org>
Commit-Queue: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#598392}
[modify] https://crrev.com/043b30e131d41857a48378031d3a099c228d19d1/testing/buildbot/chromium.mac.json
[modify] https://crrev.com/043b30e131d41857a48378031d3a099c228d19d1/testing/buildbot/test_suite_exceptions.pyl

Oct 10,
After #171 and #172, vm1066-m4 died, with 6f098c8faaa897f4ade74352c40e00b3993d3d8f :( WindowServer crashed thus:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   com.apple.SkyLight  0x00007fff66dbc15a layer_blit_byte_ANY + 1142
1   com.apple.SkyLight  0x00007fff66df566c layer_blit_entry(void*) + 1411
2   com.apple.SkyLight  0x00007fff66df45c4 CGXSoftwareComposite + 607
3   com.apple.SkyLight  0x00007fff66c6def6 CompositorSW::CompositeLayersToDestination(WSCompositeSourceLayer*, WSCompositeDestination*, unsigned long long) + 232
4   com.apple.SkyLight  0x00007fff66d633ee WS::DisplaySurface::Composite(CGXRedrawState*, Compositor*, unsigned int, WSCompositeSourceLayer*) + 1016
5   com.apple.SkyLight  0x00007fff66dac57a CGXUpdateDisplay + 13678
6   com.apple.SkyLight  0x00007fff66da8d8e update_display_callback(void*, double) + 257
7   com.apple.SkyLight  0x00007fff66def962 run_timer_pass + 495
8   com.apple.SkyLight  0x00007fff66e1d561 CGXRunOneServicesPass + 247
9   com.apple.SkyLight  0x00007fff66e1e11c SLXServer + 832
10  WindowServer        0x00000001060afdde 0x1060af000 + 3550
11  libdyld.dylib       0x00007fff6ce25015 start + 1

This is interesting - the crash stack is different than before.
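Checking whether a new WindowServer crash has the same stack as an earlier one is easier with the symbol names pulled out of the report. Here is a minimal sketch that assumes one frame per line in the layout shown above (frame index, image name, address, symbol, "+ offset"). The regular expression and the helper names `parse_frames` and `top_symbols` are my own, not part of any Chromium or Apple tooling, and image names containing spaces are not handled.

```python
import re

# One crash-report frame: index, image, address, symbol, "+ offset".
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<image>\S+)\s+(?P<address>0x[0-9a-fA-F]+)\s+"
    r"(?P<symbol>.+?)\s\+\s(?P<offset>\d+)\s*$"
)


def parse_frames(report_text):
    """Return (index, image, symbol) tuples for every frame line found."""
    frames = []
    for line in report_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group("index")),
                           match.group("image"),
                           match.group("symbol")))
    return frames


def top_symbols(frames, depth=3):
    """Return the symbols of the first `depth` frames as a crash signature."""
    return [symbol for _, _, symbol in frames[:depth]]


if __name__ == "__main__":
    # First three frames of the vm1066-m4 crash quoted above.
    sample = """\
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   com.apple.SkyLight  0x00007fff66dbc15a layer_blit_byte_ANY + 1142
1   com.apple.SkyLight  0x00007fff66df566c layer_blit_entry(void*) + 1411
2   com.apple.SkyLight  0x00007fff66df45c4 CGXSoftwareComposite + 607
"""
    print(top_symbols(parse_frames(sample)))
```

Two reports can then be compared by checking whether their `top_symbols` lists are equal. The depth of three frames is an arbitrary choice.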
Oct 10, Project Member
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0187ab8b4c1a229f9b8ffc04ee92095506174ceb

commit 0187ab8b4c1a229f9b8ffc04ee92095506174ceb
Author: John Budorick <jbudorick@chromium.org>
Date: Wed Oct 10 19:25:06 2018

Remove browser_tests from Mac10.13 Tests & mac_chromium_rel_ng.

We're experimenting with fixes in ways that may have adverse consequences on the performance of the suite. Removing it from the 10.13 bot should insulate the CQ from those consequences.

TBR=rsesek@chromium.org,ellyjones@chromium.org
No-Try: true
Bug: 828031
Change-Id: I3d203688dbe9b037db9fb52614c942ee84ed111f
Reviewed-on: https://chromium-review.googlesource.com/c/1274059
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#598448}
[modify] https://crrev.com/0187ab8b4c1a229f9b8ffc04ee92095506174ceb/testing/buildbot/chromium.mac.json
[modify] https://crrev.com/0187ab8b4c1a229f9b8ffc04ee92095506174ceb/testing/buildbot/test_suite_exceptions.pyl

Oct 10,
On vm1066-m4, the WindowServer crashed at 2018-10-10-112636. This is from /var/log/system.log:

Oct 10 11:26:27 vm1066-m4 com.apple.xpc.launchd[1] (com.openssh.sshd.BEB20FD0-C852-4E1E-8E26-AA3A2BC7F837[1654]): Service exited with abnormal code: 255
Oct 10 11:26:27 vm1066-m4 systemstats[52]: assertion failed: 17F77: systemstats + 914800 [D1E75C38-62CE-3D77-9ED3-5F6D38EF0676]: 0x40
Oct 10 11:26:32 vm1066-m4 com.apple.xpc.launchd[1] (com.apple.WindowServer[160]): Service exited due to signal: Segmentation fault: 11 sent by exc handler[0]

But I can't say that the two messages before the last are relevant.
And from `log show --start '2018-10-10 11:26:00' --end '2018-10-10 11:30:00'`:

2018-10-10 11:26:31.560969-0700 0x536b Default 0x0 1707 0 browser_tests: (CoreVideo) CVCGDisplayLink::setCurrentDisplay didn't find a valid display - falling back to 60Hz
2018-10-10 11:26:32.308017-0700 0x872 Default 0x0 160 0 WindowServer: (QuartzCore) [com.apple.coreanimation:Render] CoreAnimation: Context is a zombie!

... but both of those also appear throughout the logs. And if you google [layer_blit_byte_ANY] there are a few threads about virtual machine crashes on Mac Minis :(

Oct 10, Project Member
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7b7a0f0fea42b1340516a7d7f8f98281bf0db55b

commit 7b7a0f0fea42b1340516a7d7f8f98281bf0db55b
Author: John Budorick <jbudorick@chromium.org>
Date: Wed Oct 10 22:51:12 2018

Resume using browser_tests WindowServer killer filter on mac.

Bug: 828031
Change-Id: Iaef9c4172d0182c05cc2ae8b677d60a9d6290df5
Reviewed-on: https://chromium-review.googlesource.com/c/1274453
Reviewed-by: Robert Sesek <rsesek@chromium.org>
Commit-Queue: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#598556}
[modify] https://crrev.com/7b7a0f0fea42b1340516a7d7f8f98281bf0db55b/testing/buildbot/chromium.mac.json
[modify] https://crrev.com/7b7a0f0fea42b1340516a7d7f8f98281bf0db55b/testing/buildbot/test_suite_exceptions.pyl

Oct 10,
Oct 11,
Issue 894382 has been merged into this issue.

Oct 11, Project Member
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/efcb830916aa5c200c9c8fef9ca7a68954778e0d

commit efcb830916aa5c200c9c8fef9ca7a68954778e0d
Author: Robert Sesek <rsesek@chromium.org>
Date: Thu Oct 11 21:05:28 2018

Add SSLClientCertificateSelectorCocoaTest.* to mac_window_server_killers.browser_tests.filter

Tbr: ellyjones@chromium.org
Bug: 828031
Change-Id: Idf9b3f7140b468acdfa5b7694caf3ac55e5cd245
Reviewed-on: https://chromium-review.googlesource.com/c/1277551
Reviewed-by: Robert Sesek <rsesek@chromium.org>
Commit-Queue: Robert Sesek <rsesek@chromium.org>
Cr-Commit-Position: refs/heads/master@{#598933}
[modify] https://crrev.com/efcb830916aa5c200c9c8fef9ca7a68954778e0d/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter

Oct 11,
I ran 90 invocations of browser_tests today, split roughly evenly between 10.12 and 10.13. I think with #179, the 10.12 fleet should be stable. There was a clear pattern of 10.12 bots running an SSLClientCertificateSelectorCocoaTest test and then the WindowServer dying. The 10.13 failures are a little more mysterious. But of all the 10.13 failures I've accumulated so far, all occur in browser_tests shard 4. I can try to reduce that tomorrow.

Oct 11,
I just rebooted all the dead 10.12 bots. If no more start accumulating, we can look at increasing the load on 10.12 (which will be limited because most of the fleet is on 10.13).

Oct 11,
What do you have in mind re "increasing the load" -- pulling suites out of the filter file?

Oct 12,
Ah, I thought we were running browser_tests on 10.12 experimentally too, but we aren't. So I suppose we just let 10.12 ride. We have not accumulated any new dead 10.12 bots yet.

Oct 12,
Current overall status: I think everything <10.13 should be stable.
We have pretty concrete evidence that ConstrainedWebDialogBrowserTest and SSLClientCertificateSelectorCocoaTest were directly causing a lot of the deaths. We haven't seen a dead bot on 10.12 since #179 landed.

The issues on 10.13 are still elusive. I've launched 200 iterations of shard 4 on 10.13 (which is where the failures always seem to happen), but only have a few bot deaths to show for it. But on six out of seven runs that resulted in a dead bot, these tests ran:

6 UpdateServiceTest.UpdateCheckError
6 NativeBindingsApiTest.APICreationFromNewContext
6 ExtensionTabUtilBrowserTest.OpenExtensionsOptionsPage
6 ExtensionFetchTest.ExtensionCannotFetchHostedResourceWithoutHostPermissions
6 ExtensionActionRunnerBrowserTest.ActiveScriptsAreDisplayedAndDelayExecution_ContentScripts_ExplicitHosts
6 DisableExtensionsExceptBrowserTest.DisableExtensionsExceptFlag

... so I'll add those to the filter. We can let things sit over the weekend, but then I think we should re-add browser_tests to the 10.13 waterfall bot (maybe not the CQ at first). Assuming that we can bring the bots back online to the waterfall and CQ without causing more dead bots (i.e. we've identified the culprits somewhere in mac_window_server_killers.browser_tests.filter), we can start adding tests back in small chunks.

Also in issue 827420, I posted a small program that we could use as the basis of a WindowServer watchdog, which would help with the self-healing.

Oct 12, Project Member
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9c0d646e8bc09e34f4e7a6a8ca74e51794f00c06

commit 9c0d646e8bc09e34f4e7a6a8ca74e51794f00c06
Author: Robert Sesek <rsesek@chromium.org>
Date: Fri Oct 12 18:10:36 2018

Add some more test suites to mac_window_server_killers.browser_tests.filter

These may be causing the WindowServer to die on 10.13.
Tbr: ellyjones@chromium.org Bug: 828031 Change-Id: I5ed668eb5fbb9c70d06ea607bdabe74964d71498 Reviewed-on: https://chromium-review.googlesource.com/c/1277576 Reviewed-by: Robert Sesek <rsesek@chromium.org> Commit-Queue: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#599289} [modify] https://crrev.com/9c0d646e8bc09e34f4e7a6a8ca74e51794f00c06/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter Oct 12, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/08481ae5b5ee9340de1a78a11a0d894415805062 commit 08481ae5b5ee9340de1a78a11a0d894415805062 Author: Robert Sesek <rsesek@chromium.org> Date: Fri Oct 12 21:06:43 2018 Add PictureInPictureWindowControllerBrowserTest to mac_window_server_killers.browser_tests.filter Tbr: ellyjones@chromium.org Bug: 828031 Change-Id: I3f492d5ed53aa30f3ce518464a5fcc96109c1e13 Reviewed-on: https://chromium-review.googlesource.com/c/1279205 Commit-Queue: Robert Sesek <rsesek@chromium.org> Reviewed-by: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#599346} [modify] https://crrev.com/08481ae5b5ee9340de1a78a11a0d894415805062/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter Oct 12,Running shard #4 200 times resulted in 3 dead bots, so I added PictureInPictureWindowControllerBrowserTest to the filter, which appears in all the 10.13 bot deaths that I have results for. I'll spin up another 200 runs. Oct 12, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b20ba21cddf5bd37aeaf36e4056cb019ea912635 commit b20ba21cddf5bd37aeaf36e4056cb019ea912635 Author: John Budorick <jbudorick@chromium.org> Date: Fri Oct 12 21:33:39 2018 Run browser_tests as a 25% experiment on Mac10.13 Tests & mac_chromium_rel_ng. 
Bug: 828031 Change-Id: I1d4bdb235ad9d93df0653215c7c5928c471b6c16 Reviewed-on: https://chromium-review.googlesource.com/c/1277693 Reviewed-by: Robert Sesek <rsesek@chromium.org> Commit-Queue: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#599364} [modify] https://crrev.com/b20ba21cddf5bd37aeaf36e4056cb019ea912635/testing/buildbot/chromium.mac.json [modify] https://crrev.com/b20ba21cddf5bd37aeaf36e4056cb019ea912635/testing/buildbot/test_suite_exceptions.pyl Oct 15, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/650bee3b0c512c7d0a666cf40841a215147dafdb commit 650bee3b0c512c7d0a666cf40841a215147dafdb Author: Lindsay Pasricha <lindsayw@google.com> Date: Mon Oct 15 18:19:30 2018 Remove browser_tests from mac-osxbeta-rel until 828031 is resolved. Bug: 828031 ,850125 Change-Id: If24779bb6387cdf06fa2936f3deb2a86ddbe7f3e Reviewed-on: https://chromium-review.googlesource.com/c/1280788 Commit-Queue: Lindsay Pasricha <lindsayw@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#599684} [modify] https://crrev.com/650bee3b0c512c7d0a666cf40841a215147dafdb/testing/buildbot/chromium.fyi.json [modify] https://crrev.com/650bee3b0c512c7d0a666cf40841a215147dafdb/testing/buildbot/test_suite_exceptions.pyl Oct 16, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b41baee6f6fc2f2c68c50d0b222f77d4ca20df7e commit b41baee6f6fc2f2c68c50d0b222f77d4ca20df7e Author: Robert Sesek <rsesek@chromium.org> Date: Tue Oct 16 05:39:46 2018 Revert "Run browser_tests as a 25% experiment on Mac10.13 Tests & mac_chromium_rel_ng." This reverts commit b20ba21cddf5bd37aeaf36e4056cb019ea912635. Reason for revert: Added capacity isn't finding bot deaths as fast as manual, targeted swarming invocations. Will reland this after having some confidence that the 10.13 bot-killer tests are identified. 
Original change's description: > Run browser_tests as a 25% experiment on Mac10.13 Tests & mac_chromium_rel_ng. > > Bug: 828031 > Change-Id: I1d4bdb235ad9d93df0653215c7c5928c471b6c16 > Reviewed-on: https://chromium-review.googlesource.com/c/1277693 > Reviewed-by: Robert Sesek <rsesek@chromium.org> > Commit-Queue: John Budorick <jbudorick@chromium.org> > Cr-Commit-Position: refs/heads/master@{#599364} TBR=rsesek@chromium.org,jbudorick@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 828031 Change-Id: I38c447f53d533eeff40abfd462c3bb1f09eac75f Reviewed-on: https://chromium-review.googlesource.com/c/1282283 Reviewed-by: Robert Sesek <rsesek@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#599869} [modify] https://crrev.com/b41baee6f6fc2f2c68c50d0b222f77d4ca20df7e/testing/buildbot/chromium.mac.json [modify] https://crrev.com/b41baee6f6fc2f2c68c50d0b222f77d4ca20df7e/testing/buildbot/test_suite_exceptions.pyl Oct 17,After many, many runs I think I've stabilized shard 4 of browser_tests on 10.13! (Useful tip: you can use --gtest_filter in addition to --test-launcher-filter-file). I'll add the suspect tests to the filter file and reland the 25% experiment. If dead bots don't start piling up, we can then look at increasing the experiment percent. Oct 17,From the tests that I'm adding to the filter file, the common theme seems to be using a views::WebView/WebContents in a child window/sheet. Oct 17, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b34f2c1efd47cf2632714883fa0bdac6c3c0d163 commit b34f2c1efd47cf2632714883fa0bdac6c3c0d163 Author: Robert Sesek <rsesek@chromium.org> Date: Wed Oct 17 23:09:28 2018 Add more tests to mac_window_server_killers.browser_tests.filter This should hopefully make browser_tests stable on Mac. 
Tbr: ellyjones@chromium.org Bug: 828031 Change-Id: If72e7e6692f48971e88acfb5bd069e1c83c88aaf Reviewed-on: https://chromium-review.googlesource.com/c/1287310 Reviewed-by: Robert Sesek <rsesek@chromium.org> Commit-Queue: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#600589} [modify] https://crrev.com/b34f2c1efd47cf2632714883fa0bdac6c3c0d163/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter Oct 18, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2ce2c70ed270f200fb16a87fd78347b7e853a8b0 commit 2ce2c70ed270f200fb16a87fd78347b7e853a8b0 Author: Robert Sesek <rsesek@chromium.org> Date: Thu Oct 18 18:46:29 2018 Reland "Run browser_tests as a 25% experiment on Mac10.13 Tests & mac_chromium_rel_ng." This reverts commit b41baee6f6fc2f2c68c50d0b222f77d4ca20df7e. Reason for revert: we believe browser_tests are now stable Original change's description: > Revert "Run browser_tests as a 25% experiment on Mac10.13 Tests & mac_chromium_rel_ng." > > This reverts commit b20ba21cddf5bd37aeaf36e4056cb019ea912635. > > Reason for revert: Added capacity isn't finding bot deaths as fast as manual, targeted swarming invocations. Will reland this after having some confidence that the 10.13 bot-killer tests are identified. > > Original change's description: > > Run browser_tests as a 25% experiment on Mac10.13 Tests & mac_chromium_rel_ng. > > > > Bug: 828031 > > Change-Id: I1d4bdb235ad9d93df0653215c7c5928c471b6c16 > > Reviewed-on: https://chromium-review.googlesource.com/c/1277693 > > Reviewed-by: Robert Sesek <rsesek@chromium.org> > > Commit-Queue: John Budorick <jbudorick@chromium.org> > > Cr-Commit-Position: refs/heads/master@{#599364} > > TBR=rsesek@chromium.org,jbudorick@chromium.org > > # Not skipping CQ checks because original CL landed > 1 day ago. 
> > Bug: 828031 > Change-Id: I38c447f53d533eeff40abfd462c3bb1f09eac75f > Reviewed-on: https://chromium-review.googlesource.com/c/1282283 > Reviewed-by: Robert Sesek <rsesek@chromium.org> > Reviewed-by: John Budorick <jbudorick@chromium.org> > Commit-Queue: Robert Sesek <rsesek@chromium.org> > Cr-Commit-Position: refs/heads/master@{#599869} TBR=rsesek@chromium.org,jbudorick@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 828031 Change-Id: I297bdbf59e21ebf50aad80f97f7d3aad83e4ccb7 Reviewed-on: https://chromium-review.googlesource.com/c/1287311 Commit-Queue: Robert Sesek <rsesek@chromium.org> Reviewed-by: Robert Sesek <rsesek@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#600841} [modify] https://crrev.com/2ce2c70ed270f200fb16a87fd78347b7e853a8b0/testing/buildbot/chromium.mac.json [modify] https://crrev.com/2ce2c70ed270f200fb16a87fd78347b7e853a8b0/testing/buildbot/test_suite_exceptions.pyl Oct 18,I also tried running each test suite currently in the filter file individually with --gtest_filter and --gtest_repeat=100. Unfortunately, that didn't cause any bot deaths whatsoever, so it may be a bit of a chore to reenable a lot of these tests. For now, let's see if the 25% experiment is stable over the next few days. Oct 23, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d1092c71fdb53f61fb818f6c4d13d64b930e3153 commit d1092c71fdb53f61fb818f6c4d13d64b930e3153 Author: Robert Sesek <rsesek@chromium.org> Date: Tue Oct 23 13:45:27 2018 Increase Mac browser_tests experiment to 50%. 
Bug: 828031 Change-Id: I395c4b33b7ac394645bf626c00185b4f9efcc289 Reviewed-on: https://chromium-review.googlesource.com/c/1294309 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#601928} [modify] https://crrev.com/d1092c71fdb53f61fb818f6c4d13d64b930e3153/testing/buildbot/chromium.mac.json [modify] https://crrev.com/d1092c71fdb53f61fb818f6c4d13d64b930e3153/testing/buildbot/test_suite_exceptions.pyl Oct 25, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/59e58d5c614bb02115eb4adecd43e39f934022c4 commit 59e58d5c614bb02115eb4adecd43e39f934022c4 Author: Robert Sesek <rsesek@chromium.org> Date: Thu Oct 25 20:40:19 2018 Bring back browser_tests on Mac10.13 as non-experimental. The bots appear to be stable with the test filter in place. Bug: 828031 Change-Id: I50f12e581dbc1e3832687326f71505b92ee5d73d Reviewed-on: https://chromium-review.googlesource.com/c/1299655 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#602841} [modify] https://crrev.com/59e58d5c614bb02115eb4adecd43e39f934022c4/testing/buildbot/chromium.mac.json [modify] https://crrev.com/59e58d5c614bb02115eb4adecd43e39f934022c4/testing/buildbot/test_suite_exceptions.pyl Oct 26,
I'm not seeing any bot deaths after #197, so I think we've stopped the damage. I'm going to leave this open through the weekend to make sure we're really not accumulating dead bots, but then I think we can call this Fixed! There are two follow-up items: issue 899286 - Trim down list of excluded tests from mac_window_server_killers.browser_tests.filter (just filed) issue 827420 - Create a watchdog for Mac swarming bots Oct 26,Thanks for your persistence on this! Oct 26,I got a bunch of "shard #N expired, not enough capacity" on https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/171906 -- not sure if that's due to dying bots or just due to us not having enough capacity even without bots dying. Oct 26,No dead bots - we're just out of capacity in the Chrome-quarantine pool. jbudorick@ is going to ramp down the experiment% and then move bots out of quarantine back into Chrome. Oct 26,#200: yeah, rsesek and I noticed that too; I'm moving browser_tests back into pool:Chrome now. Oct 26, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/25eb9e4d58709999fcb50219a4dc1b44b49b7256 commit 25eb9e4d58709999fcb50219a4dc1b44b49b7256 Author: John Budorick <jbudorick@chromium.org> Date: Fri Oct 26 20:41:22 2018 mac: move browser_tests back to pool:Chrome at 50% experiment. Will promote back to non-experimental once pool:Chrome-quarantine has been folded back into pool:Chrome. 
Bug: 828031 Change-Id: Icb4f7da5b0aab18ec951bb612acf8d3e7e155a27 Reviewed-on: https://chromium-review.googlesource.com/c/1302657 Reviewed-by: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#603184} [modify] https://crrev.com/25eb9e4d58709999fcb50219a4dc1b44b49b7256/testing/buildbot/chromium.mac.json [modify] https://crrev.com/25eb9e4d58709999fcb50219a4dc1b44b49b7256/testing/buildbot/test_suite_exceptions.pyl Oct 26, Project MemberThe following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/d41fa26fabd353b1ddcf98207590a9e56d6dfcea commit d41fa26fabd353b1ddcf98207590a9e56d6dfcea Author: John Budorick <jbudorick@google.com> Date: Fri Oct 26 20:52:39 2018 Oct 29, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/05c8b61e89ae1ed55456dc9f4b6910253a327476 commit 05c8b61e89ae1ed55456dc9f4b6910253a327476 Author: John Budorick <jbudorick@chromium.org> Date: Mon Oct 29 02:10:33 2018 Mark browser_tests non-experimental on mac in pool:Chrome. Tbr: rsesek@chromium.org Bug: 828031 Change-Id: I22ec9c5611fa22b40964623c7ff0ff850c6cf911 Reviewed-on: https://chromium-review.googlesource.com/c/1303808 Commit-Queue: John Budorick <jbudorick@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#603401} [modify] https://crrev.com/05c8b61e89ae1ed55456dc9f4b6910253a327476/testing/buildbot/chromium.mac.json [modify] https://crrev.com/05c8b61e89ae1ed55456dc9f4b6910253a327476/testing/buildbot/test_suite_exceptions.pyl Nov 1,
Declaring victory! I'll be working on the follow-ups in #198. Dec 14,test_suite_exceptions.pyl still disables browser_tests on 7 or so bots with a link to this bug, and a bunch of other bots still use mac_window_server_killers.browser_tests.filter with a reference to this bug. Since this is now fixed, I suppose that should all be undone again? Dec 14,Yes, we can run browser tests with the filter file in place. The issues referenced in #198 are about trimming down and removing the filter file. Dec 14, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0c2861bf501684f3838e6900625a41e7e3754851 commit 0c2861bf501684f3838e6900625a41e7e3754851 Author: Nico Weber <thakis@chromium.org> Date: Fri Dec 14 18:05:16 2018 Re-enable browser_tests on a bunch of mac bots. We disabled browser_tests on all mac bots due to issue 828031 . Now that that's resolved, undo that. In particular, re-enables browser_tests on Mac Asan. Bug: 828031 Change-Id: Ic832455a962bb748afb34eebcbf95f0c52e00aca Reviewed-on: https://chromium-review.googlesource.com/c/1377520 Reviewed-by: Robert Sesek <rsesek@chromium.org> Commit-Queue: Nico Weber <thakis@chromium.org> Cr-Commit-Position: refs/heads/master@{#616740} [modify] https://crrev.com/0c2861bf501684f3838e6900625a41e7e3754851/testing/buildbot/chromium.clang.json [modify] https://crrev.com/0c2861bf501684f3838e6900625a41e7e3754851/testing/buildbot/chromium.fyi.json [modify] https://crrev.com/0c2861bf501684f3838e6900625a41e7e3754851/testing/buildbot/chromium.mac.json [modify] https://crrev.com/0c2861bf501684f3838e6900625a41e7e3754851/testing/buildbot/chromium.memory.json [modify] https://crrev.com/0c2861bf501684f3838e6900625a41e7e3754851/testing/buildbot/test_suite_exceptions.pyl Jan 2, Project MemberThe following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7eed462400adc7478d33776190104ad595879905 commit 7eed462400adc7478d33776190104ad595879905 Author: Peter 
Kasting <pkasting@chromium.org> Date: Wed Jan 02 20:42:46 2019 Re-enable disabled tests in chrome/browser/ui/views/media_router/. These pass on Windows. This requires (mostly) reverting https://chromium-review.googlesource.com/c/chromium/src/+/1005103 . The revert is not perfect since many things changed since that CL initially landed (e.g. various tests got disabled). Bug: 658005 , 678472, 817408, 828031 , 843599 , 849146 , 863945 Change-Id: I23a3010be1faf962e0a2dfbaaa4a57a3e2cc89d3 Reviewed-on: https://chromium-review.googlesource.com/c/1351874 Reviewed-by: Ben Wells <benwells@chromium.org> Reviewed-by: Steven Bennetts <stevenjb@chromium.org> Reviewed-by: Robert Sesek <rsesek@chromium.org> Reviewed-by: Dmitry Gozman <dgozman@chromium.org> Reviewed-by: Avi Drissman <avi@chromium.org> Commit-Queue: Peter Kasting <pkasting@chromium.org> Cr-Commit-Position: refs/heads/master@{#619483} [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/apps/platform_apps/app_window_interactive_uitest.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/devtools/devtools_sanity_interactive_browsertest.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/extensions/api/tabs/tabs_test.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/notifications/notification_interactive_uitest.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/ui/exclusive_access/fullscreen_controller_test.h [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/ui/keyboard_lock_interactive_browsertest.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/ui/views/location_bar/zoom_bubble_view_browsertest.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/ui/views/media_router/cast_dialog_view_unittest.cc [modify] 
https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/ui/views/media_router/media_router_ui_browsertest.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/browser/ui/views/media_router/presentation_receiver_window_view_browsertest.cc [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/chrome/test/base/in_process_browser_test.h [modify] https://crrev.com/7eed462400adc7478d33776190104ad595879905/testing/buildbot/filters/mac_window_server_killers.browser_tests.filter
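
The tally in the Oct 12 status comment above (tests that ran in six of the seven runs that ended with a dead bot) is the kind of count a short script can produce from per-run test lists. This is only a sketch of that bookkeeping, not the procedure actually used in the thread: the helper names, the run identifiers, the 0.8 cutoff, and `HypotheticalTest.Unrelated` are all made up for illustration. Only the two real test names come from the comment.

```python
from collections import Counter


def tally_tests(dead_bot_runs):
    """Count, per test, how many dead-bot runs it appeared in."""
    counts = Counter()
    for tests in dead_bot_runs.values():
        counts.update(set(tests))  # count each test at most once per run
    return counts


def suspects(dead_bot_runs, min_fraction=0.8):
    """Return tests seen in at least `min_fraction` of the dead-bot runs."""
    counts = tally_tests(dead_bot_runs)
    threshold = min_fraction * len(dead_bot_runs)
    return sorted(test for test, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical data: seven dead-bot runs keyed by a made-up run id.
    runs = {}
    for i in range(7):
        tests = []
        if i != 3:
            tests += ["UpdateServiceTest.UpdateCheckError",
                      "NativeBindingsApiTest.APICreationFromNewContext"]
        if i < 2:
            tests.append("HypotheticalTest.Unrelated")
        runs["run%d" % i] = tests
    for name in suspects(runs):
        print(tally_tests(runs)[name], name)
```

A test that shows up in most dead-bot runs is only a candidate for the filter file. As the Oct 18 comment notes, running the filtered suites individually did not reproduce any bot deaths.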