Investigate test errors on 10.12 image.
Issue description
,
Nov 17 2016
,
Nov 22 2016
,
Nov 22 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f2a388826d809c7c0491d360d8da5512e4f97879

commit f2a388826d809c7c0491d360d8da5512e4f97879
Author: erikchen <erikchen@chromium.org>
Date: Tue Nov 22 07:05:14 2016

Don't run nacl_integration tests on mac on fyi waterfall.

The tests are no longer run on mac anywhere.

BUG=665691
Review-Url: https://codereview.chromium.org/2519953003
Cr-Commit-Position: refs/heads/master@{#433804}

[modify] https://crrev.com/f2a388826d809c7c0491d360d8da5512e4f97879/testing/buildbot/chromium.fyi.json
,
Nov 28 2016
Failures as of 11/28/2016:

browser_tests: ConstrainedWindowMacTest.BrowserWindowFullscreen ClipboardApiTest.Extension PluginPowerSaverBrowserTest.PosterTests BrowserWindowControllerTest.FullscreenResizeFlags ExtensionApiTest.BookmarkManager SSLClientCertificateSelectorCocoaTest.HideShow WebstoreInlineInstallerTest.BlockInlineInstallFromFullscreenForBrowser OmniboxViewMacBrowserTest.CopyToPasteboard OutOfProcessPPAPITest.FlashClipboard PluginPowerSaverBrowserTest.SmallCrossOrigin ServiceProcessControlBrowserTest.LaunchAndIPC SpellCheckMessageFilterPlatformMacBrowserTest.SpellCheckReturnMessage FindBarBrowserTest.EscapeKey RenderViewContextMenuMacBrowserTest.ServicesFiltering DesktopCaptureApiTest.ChooseDesktopMedia ServiceProcessControlBrowserTest.LaunchAndReconnect BrowserWindowControllerTest.FullscreenToolbarExposedForTabstripChanges

components_unittests: BookmarkNodeDataTest.WriteToClipboardURL BookmarkUtilsTest.CopyPaste BookmarkUtilsTest.PasteNonEditableNodes BookmarkNodeDataTest.WriteToClipboardFolderAndURL BookmarkNodeDataTest.JustURL SpellcheckPlatformMacTest.IgnoreWords_EN_US BookmarkNodeDataTest.MetaInfo BookmarkNodeDataTest.Folder BookmarkNodeDataTest.WriteToClipboardMultipleURLs BookmarkNodeDataTest.FolderWithChild BookmarkNodeDataTest.WriteToClipboardEmptyFolder BookmarkNodeDataTest.MultipleNodes BookmarkUtilsTest.PasteBookmarkFromURL BookmarkNodeDataTest.URL BookmarkUtilsTest.MakeTitleUnique BookmarkUtilsTest.CopyPasteMetaInfo BookmarkNodeDataTest.WriteToClipboardFolderWithChildren SpellcheckPlatformMacTest.SpellCheckIgnoresOrthography SpellcheckPlatformMacTest.SpellCheckSuggestions_EN_US

content_browsertests: CaptureScreenshotTest.CaptureScreenshotArea CaptureScreenshotTest.CaptureScreenshot

content_unittests: WebDragDestTest.Data WebDragDestTest.URL MacSandboxTest.ClipboardAccess

interactive_ui_tests: ClipboardTest/0.GetSequenceNumber OmniboxViewTest.CutTextToClipboard ClipboardTest/0.TextTest ClipboardTest/0.TrickyHTMLTest OmniboxViewTest.CutURLToClipboard SitePerProcessInteractiveBrowserTest.FullscreenElementInSubframe SitePerProcessInteractiveBrowserTest.FullscreenElementInABAAndExitViaJS OmniboxViewTest.CopyURLToClipboard SitePerProcessInteractiveBrowserTest.FullscreenElementInMultipleSubframes ClipboardTest/0.WebSmartPasteTest ClipboardTest/0.UnicodeHTMLTest SitePerProcessInteractiveBrowserTest.FullscreenElementInABAAndExitViaEscapeKey OmniboxViewTest.CopyTextToClipboard ClipboardTest/0.RTFTest ClipboardTest/0.DataTest ClipboardTest/0.MultipleDataTest ClipboardTest/0.HTMLTest ClipboardTest/0.BookmarkTest ClipboardTest/0.SharedBitmapTest ClipboardTest/0.MultiFormatTest OmniboxViewTest.Paste ExtensionApiTest.FocusWindowDoesNotExitFullscreen ClipboardTest/0.URLTest

mojo_system_unittests: WaiterTest.Basic WaiterTest.TimeOut

net_unittests: CertVerifyProcTest.LargeKey KeygenHandlerTest.SmokeTest VerifyMixed/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/0 KeygenHandlerTest.ConcurrencyTest CertVerifyProcTest.MacCRLIntermediate VerifyEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/1 VerifyEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/0 VerifyMixed/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/1 VerifyIncompleteEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/1 VerifyIncompleteEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/0 CertVerifyProcTest.RejectWeakKeys

ui_base_unittests: ClipboardMacTest.ReadImageNonRetina ClipboardUtilMacTest.CheckForLeak ClipboardUtilMacTest.PasteboardItemWithTitle OSExchangeDataTest.TestFileToURLConversion OSExchangeDataTest.URLAndString ClipboardUtilMacTest.PasteboardItemFromUrl OSExchangeDataTest.TestPickledData OSExchangeDataTest.StringDataGetAndSet OSExchangeDataTest.TestURLExchangeFormats ClipboardMacTest.ReadImageRetina ClipboardUtilMacTest.PasteboardItemWithFilePath

unit_tests: UrlDropControllerTest.DragAndDropText ServiceProcessControlMac.TestGTMSMJobSubmitRemove UrlDropControllerTest.DragAndDropURL FindPasteboardTest.ReadingFromPboardUpdatesFindText DownloadUtilMacTest.AddFileToPasteboardTest BookmarkContextMenuControllerTest.CutCopyPasteNode UrlDropControllerTest.DragAndDropTextParsableAsURL FindPasteboardTest.SendsNotificationWhenTextChanges ClipboardUtilsTest.GetClipboardText

views_unittests: DragDropClientMacTest.PasteboardToOSExchangeTest TextfieldTest.DragAndDrop_InitiateDrag TextfieldTest.DragAndDrop_ToTheRight TextfieldTest.DragAndDrop_ToTheLeft TextfieldTest.DragAndDrop_Canceled DragDropClientMacTest.BasicDragDrop
,
Nov 29 2016
,
Nov 29 2016
,
Nov 29 2016
I've been trying to repro these errors with no success. I've tried:
- Running the tests on a 10.12.1 machine.
- Running the tests on a 10.12.0 VM [sdy tried this].
- ssh-ing into build9-m1 and running the tests there.
- vnc-ing into build9-m1 and running the tests there.
The fact that these failures are deterministic on build9-m1, but that I can't repro them when ssh/vnc-ing in, is very suspicious.
,
Nov 29 2016
sdy reports that this also passes on a 10.12.1 VM.
,
Nov 29 2016
I can reproduce the clipboard errors locally. Modify the test ClipboardMacTest.ReadImageRetina to leak 100000 UniquePasteboards. Now the test fails with the same symptoms. Reset ClipboardMacTest.ReadImageRetina. It still fails! There's a PB database somewhere that needs to be cleared.
,
Nov 29 2016
More observations:
- Restarting the machine seems to clear the pasteboard state [everything works].
- After causing ui_base_unittests to fail, I tried running views_unittests: the exact same set of tests fails.
,
Nov 30 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/b3048cc1a9f9feb6a3f38d364cee1a79bd3fcb23

commit b3048cc1a9f9feb6a3f38d364cee1a79bd3fcb23
Author: erikchen <erikchen@chromium.org>
Date: Wed Nov 30 09:50:45 2016

Fix a leak in a MacViews pasteboard test.

BUG=665691
Review-Url: https://codereview.chromium.org/2537953002
Cr-Commit-Position: refs/heads/master@{#435201}

[modify] https://crrev.com/b3048cc1a9f9feb6a3f38d364cee1a79bd3fcb23/ui/views/cocoa/drag_drop_client_mac_unittest.mm
,
Nov 30 2016
"""
sudo log show --info --debug --predicate 'subsystem == "com.apple.CFPasteboard"'
"""
"""
45979 2016-11-29 22:20:49.895567-0800 0x52b02 Default 0x0 30327 browser_tests: (CoreFoundation) [com.apple.CFPasteboard.general] failed to create global data
45980 2016-11-29 22:20:49.895580-0800 0x52b07 Error 0x0 30327 browser_tests: (CoreFoundation) [com.apple.CFPasteboard.general] Connection to 'pboard' server had an error: <error: 0x7fffe4991ca0> { count = 1, transaction: 0, voucher = 0x0, contents =
45981 "XPCErrorDescription" => <string: 0x7fffe4991f18> { length = 18, contents = "Connection invalid" }
45982 }
...
...
...
1860467 2016-11-30 02:33:34.840786-0800 0x1043d7 Default 0x0 68736 sync_integration_tests: (CoreFoundation) [com.apple.CFPasteboard.general] failed to create global data
1860468 2016-11-30 02:33:34.840794-0800 0x104423 Error 0x0 68736 sync_integration_tests: (CoreFoundation) [com.apple.CFPasteboard.general] Connection to 'pboard' server had an error: <error: 0x7fff9ced6ca0> { count = 1, transaction: 0, voucher = 0x0, contents =
1860469 "XPCErrorDescription" => <string: 0x7fff9ced6f18> { length = 18, contents = "Connection invalid" }
1860470 }
1860471 2016-11-30 02:33:37.896625-0800 0x10447e Error 0x0 68768 ui_base_unittests: (CoreFoundation) [com.apple.CFPasteboard.general] Failed to obtain 'pboard' service port: <error: 0x7fff9ced6ca0> { count = 1, transaction: 0, voucher = 0x0, contents =
1860472 "XPCErrorDescription" => <string: 0x7fff9ced6f18> { length = 18, contents = "Connection invalid" }
1860473 }
"""
For some reason, the test binaries can't connect to the pboard service.
In contrast, here's what it looks like when I manually run a test from my ssh session:
"""
2016-11-30 13:34:24.186193-0800 0x1aa19 Debug 0x0 27725 pboard: (CoreFoundation) [com.apple.CFPasteboard.sudden-termination] sudden termination disabled
2016-11-30 13:34:24.186208-0800 0x1aa19 Debug 0x0 27725 pboard: (CoreFoundation) [com.apple.CFPasteboard.sudden-termination] sudden termination enabled
2016-11-30 13:34:24.186964-0800 0x1aa19 Info 0x0 27725 pboard: (CoreFoundation) [com.apple.CFPasteboard.general] Sucessfuly started pboard: 'CFPBS:186A5:'
2016-11-30 13:34:24.186997-0800 0x1aa19 Info 0x0 27725 pboard: (CoreFoundation) [com.apple.CFPasteboard.general] Setting up the 'com.apple.pasteboard.1' connection (for pboard)
2016-11-30 13:34:24.187057-0800 0x1aa19 Info 0x0 27725 pboard: (CoreFoundation) [com.apple.CFPasteboard.general] Setting up the 'com.apple.coreservices.uauseractivitypasteboardclient.xpc' connection
"""
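To see at a glance which test binaries are hitting the pasteboard failure, the `log show` output can be tallied per process. This is a minimal sketch, assuming the column layout seen in the excerpts above (PID, then `process:`, then the `[com.apple.CFPasteboard.general]` category); `count_pboard_errors` is a hypothetical helper, not part of any Chromium tooling.

```python
import re
from collections import Counter

# Matches the tail of a `log show` line as it appears in the excerpts above:
#   "<pid> <process>: (CoreFoundation) [com.apple.CFPasteboard.general] <message>"
LINE_RE = re.compile(
    r"\s(?P<pid>\d+)\s+(?P<process>[\w.\-]+): \(CoreFoundation\) "
    r"\[com\.apple\.CFPasteboard\.general\] (?P<message>.*)$"
)

# Message prefixes copied from the failing log excerpts above.
FAILURE_PREFIXES = (
    "failed to create global data",
    "Connection to 'pboard' server had an error",
    "Failed to obtain 'pboard' service port",
)


def count_pboard_errors(lines):
    """Return a Counter mapping process name -> number of pboard failure lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("message").startswith(FAILURE_PREFIXES):
            counts[match.group("process")] += 1
    return counts
```

Feeding it the lines of the `log show` output (for example, an open file object) gives one count per test binary; the healthy `pboard:` lines are ignored because their messages do not start with a failure prefix.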
,
Nov 30 2016
Based on pseudocode for ___CFPasteboardSetup from CoreFoundation, it looks like the following xpc message is failing:
"""
rax = xpc_connection_create_mach_service("com.apple.pasteboard.1", rax, 0x0);
*___CFPasteboardServerConnection = rax;
...
xpc_connection_resume(*___CFPasteboardServerConnection);
rbx = xpc_dictionary_create(0x0, 0x0, 0x0);
xpc_dictionary_set_string(rbx, "com.apple.pboard.message", "com.apple.pboard.check-in");
r15 = xpc_connection_send_message_with_reply_sync(*___CFPasteboardServerConnection, rbx);
xpc_release(rbx);
if (xpc_get_type(r15) == __xpc_type_error) goto loc_14469d;
"""
I bet our sandbox is getting in the way, although I don't know why it affects ui_base_unittests, which I would expect to not spin up the sandbox.
,
Nov 30 2016
actually, this *should* be called from the browser process.
,
Nov 30 2016
Pseudocode for __CFHandlePasteboardXPCEvent:
"""
r14 = xpc_dictionary_get_string(rbx, "com.apple.pboard.message");
if (strcmp(r14, "com.apple.pboard.check-in") == 0x0) goto loc_156a2c;
...
loc_156a2c:
r14 = xpc_dictionary_create_reply(rbx);
xpc_dictionary_set_mach_send(r14, "com.apple.pboard.port", *(int32_t *)_mach_task_self_);
if (getaudit_addr(var_50, 0x30) == 0x0) {
    xpc_dictionary_set_int64(r14, "com.apple.pboard.token", sign_extend_64(var_2C));
}
rax = xpc_dictionary_get_remote_connection(rbx);
xpc_connection_send_message(rax, r14);
xpc_release(r14);
goto loc_156a8f;
"""
,
Nov 30 2016
The pboard process is never receiving the XPC message [confirmed with lldb], although we could have known this from
"""
"XPCErrorDescription" => <string: 0x7fff9ced6f18> { length = 18, contents = "Connection invalid" }
"""
,
Dec 1 2016
The logs in c#13 are slightly deceptive, as I accidentally dropped one of the lines for browser_tests and interactive_ui_tests. The logs always come in sets of 3:
"""
2016-11-30 14:17:41.971602-0800 0x62da5 Error 0x0 48425 browser_tests: (CoreFoundation) [com.apple.CFPasteboard.general] Connection to 'pboard' server had an error: <error: 0x7fffd05fdca0> { count = 1, transaction: 0, voucher = 0x0, contents =
"XPCErrorDescription" => <string: 0x7fffd05fdf18> { length = 18, contents = "Connection invalid" }
}
2016-11-30 14:17:41.971623-0800 0x62d97 Error 0x0 48425 browser_tests: (CoreFoundation) [com.apple.CFPasteboard.general] Failed to obtain 'pboard' service port: <error: 0x7fffd05fdca0> { count = 1, transaction: 0, voucher = 0x0, contents =
"XPCErrorDescription" => <string: 0x7fffd05fdf18> { length = 18, contents = "Connection invalid" }
}
2016-11-30 14:17:41.971629-0800 0x62d97 Default 0x0 48425 browser_tests: (CoreFoundation) [com.apple.CFPasteboard.general] failed to create global data
"""
"Failed to create mach service connection" is never emitted, so we know that the problem lies somewhere very close to sending the mach msg.
We know that the relevant range is
"""
rax = qos_class_main();
rax = dispatch_get_global_queue(rax, 0x0);
rax = xpc_connection_create_mach_service("com.apple.pasteboard.1", rax, 0x0);
*___CFPasteboardServerConnection = rax;
if (rax == 0x0) goto loc_144642;
loc_1444c7:
xpc_connection_set_event_handler(rax, void ^(void * _block, void * arg1) {
rbx = arg1;
var_20 = *___stack_chk_guard;
if (xpc_get_type(rbx) == __xpc_type_error) {
rbx = xpc_copy_description(rbx);
if (os_log_type_enabled(*__CFPasteboardLog, 0x10) != 0x0) {
r15 = rsp;
rax = rsp;
rsi = *__CFPasteboardLog;
*(int8_t *)(rax + 0xfffffffffffffff0) = 0x2;
*(int8_t *)(rax + 0xfffffffffffffff1) = 0x1;
*(int8_t *)(rax + 0xfffffffffffffff2) = 0x22;
*(int8_t *)(rax + 0xfffffffffffffff3) = 0x8;
*(rax + 0xfffffffffffffff4) = rbx;
_os_log_impl(0xffffffffffeb20c5, rsi, 0x10, "Connection to 'pboard' server had an error: %{public}s", rax + 0xfffffffffffffff0, 0xc);
}
free(rbx);
}
if (*___stack_chk_guard != var_20) {
__stack_chk_fail();
}
return;
});
xpc_connection_resume(*___CFPasteboardServerConnection);
rbx = xpc_dictionary_create(0x0, 0x0, 0x0);
xpc_dictionary_set_string(rbx, "com.apple.pboard.message", "com.apple.pboard.check-in");
r15 = xpc_connection_send_message_with_reply_sync(*___CFPasteboardServerConnection, rbx);
"""
,
Dec 1 2016
I think that the tests are not being run on the VM in the right session. First, some background:
When the machine reboots, it reads and starts a service from:
/Library/LaunchDaemons/org.chromium.infra.service_manager.plist
This service, after many layers of indirection, will eventually run the recipe, and in turn the tests. Using "launchctl procinfo" to examine the test [actually "ps aux | grep browser_tests | awk -v N=2 '{print $N}' | xargs sudo launchctl procinfo"] shows:
"""
audit info
session id = 100000
uid = 4294967295
success mask = 0x0
failure mask = 0x0
flags = is_initial
"""
[full text attached].
I created a LaunchAgent on a 10.12 device to directly run ui_base_unittests.
"""
audit info
session id = 100007
...
flags = has_graphic_access,has_tty,has_console_access,has_authenticated
"""
Finally, I noticed that I had created a LaunchAgent, whereas the VM was using a LaunchDaemon! I tried creating both a LaunchAgent and a LaunchDaemon on my local 10.12 machine and voilà: the LaunchDaemon reproduces the problem and the LaunchAgent does not.
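The comparison of the two audit-info blocks can be automated. This is a minimal sketch, assuming the `launchctl procinfo` output contains an `audit info` header followed by a `flags = ...` line, as in the excerpts above; `parse_audit_flags` and `has_gui_access` are hypothetical helper names.

```python
def parse_audit_flags(procinfo_text):
    """Return the set of audit flags from `launchctl procinfo` output.

    Only the first "flags = a,b,c" line after the "audit info" header is read.
    Returns an empty set if no such line is found.
    """
    in_audit = False
    for line in procinfo_text.splitlines():
        stripped = line.strip()
        if stripped == "audit info":
            in_audit = True
            continue
        if in_audit:
            key, sep, value = stripped.partition("=")
            if sep and key.strip() == "flags":
                return {flag.strip() for flag in value.split(",") if flag.strip()}
    return set()


def has_gui_access(procinfo_text):
    """True if the process's audit session carries has_graphic_access."""
    return "has_graphic_access" in parse_audit_flags(procinfo_text)


# Samples copied from the two procinfo excerpts above.
DAEMON_SAMPLE = """audit info
session id = 100000
uid = 4294967295
success mask = 0x0
failure mask = 0x0
flags = is_initial
"""

AGENT_SAMPLE = """audit info
session id = 100007
flags = has_graphic_access,has_tty,has_console_access,has_authenticated
"""

print(has_gui_access(DAEMON_SAMPLE))  # False
print(has_gui_access(AGENT_SAMPLE))   # True
```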
,
Dec 1 2016
Nice find! That conceptually makes sense, since LaunchAgents are generally meant to be associated with user sessions, whereas LaunchDaemons are system-wide jobs that don't have a GUI session. https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html
,
Dec 1 2016
I ssh-ed into another random bot to see if this was a configuration mistake. build179-m1, from builder Mac10.11 Tests (https://build.chromium.org/p/chromium.mac/builders/Mac10.11%20Tests), also has plists in /Library/LaunchDaemons. Next steps: I'm going to modify build9-m1 to see if moving the plist into /Library/LaunchAgents fixes the problem.
,
Dec 1 2016
This article has a nice description of LaunchAgents vs LaunchDaemons: http://www.grivet-tools.com/blog/2014/launchdaemons-vs-launchagents/ LaunchDaemons are run on system start, but don't have access to the GUI. I'm guessing that in Sierra, macOS no longer allows processes without the "has_graphic_access" audit info flag to access the clipboard. LaunchAgents are run when the user logs in but have access to the GUI. I'm kind of surprised that our browser/interactive tests have ever worked when triggered from LaunchDaemons.
,
Dec 1 2016
+mark, who in conversation said "it should definitely be an agent. I wonder what changed".
,
Dec 1 2016
Bots that run tests and expect a UI session definitely need to have their bot stuff kicked off via a LaunchAgent that specifies LimitLoadToSessionType Aqua, and have auto-login set up. UI stuff won’t work from a LaunchDaemon.
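As an illustration of the kind of job definition described here, the sketch below builds a LaunchAgent property list with Python's standard `plistlib`. `Label`, `ProgramArguments`, `RunAtLoad`, and `LimitLoadToSessionType` are launchd.plist keys; the label `org.example.test_runner` and the program path are placeholders, not the values used on the bots.

```python
import plistlib


def make_launch_agent_plist(label, program_arguments):
    """Serialize a minimal LaunchAgent job restricted to GUI (Aqua) sessions."""
    job = {
        "Label": label,
        "ProgramArguments": list(program_arguments),
        "RunAtLoad": True,
        # Only load the job inside a logged-in GUI session.
        "LimitLoadToSessionType": "Aqua",
    }
    return plistlib.dumps(job)  # XML plist, returned as bytes


# Placeholder label and path, for illustration only.
plist_bytes = make_launch_agent_plist(
    "org.example.test_runner", ["/usr/local/bin/run_tests"]
)
print(plist_bytes.decode("utf-8"))
```

The resulting file would go in /Library/LaunchAgents or ~/Library/LaunchAgents rather than /Library/LaunchDaemons, and it only takes effect once a user session exists, which is why auto-login matters.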
,
Dec 1 2016
+dsansome, who apparently owns service_manager. Dave: it looks like we need to make sure buildbot's process on Mac is launched from /Library/LaunchAgents (or ~/Library/LaunchAgents) with service_manager, to ensure it has full access to the UI in 10.12+ (see comment #22).
,
Dec 1 2016
ddoman is already moving service_manager from a LaunchDaemon to a LaunchAgent. He's started the rollout in https://chrome-internal-review.googlesource.com/c/306975/.
,
Dec 6 2016
It now runs as an agent within an Aqua session: https://chrome-internal.googlesource.com/infra/puppet/+/master/puppetm/etc/puppet/modules/chrome_infra/files/service_manager/org.chromium.infra.service_manager.agent.plist However, browser_tests still fails, although fewer tests are failing: https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14869 erikchen@, would you be able to find out whether those failures still have the same cause?
,
Dec 6 2016
```
audit info
session id = 100006
uid = 500
success mask = 0x3000
failure mask = 0x3000
flags = has_graphic_access,has_tty,has_console_access
sandboxed = no
container = (no container)
```
,
Dec 6 2016
Tests that fail:

browser_tests: PluginPowerSaverBrowserTest.SmallCrossOrigin PluginPowerSaverBrowserTest.PosterTests QUnitBrowserTestRunner.Remoting_Webapp_Js_Unittest SSLClientCertificateSelectorCocoaTest.HideShow

mojo_system_unittests: WaiterTest.Basic WaiterTest.TimeOut

net_unittests: CertVerifyProcTest.LargeKey CertVerifyProcTest.RejectWeakKeys VerifyMixed/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/0 VerifyMixed/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/1 CertVerifyProcTest.MacCRLIntermediate VerifyEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/1 VerifyEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/0 VerifyIncompleteEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/1 VerifyIncompleteEndEntity/CertVerifyProcWeakDigestTest.VerifyDetectsAlgorithm/0

The net_unittests failures are issue 629712.

mojo_system_unittests do not fail when ssh-ed/vnc-ed into the machine. The errors suggest that the process is running in a low-priority mode, which is why all the timings are off.
"""
../../mojo/edk/system/waiter_unittest.cc:130: Failure
Expected: (elapsed) < ((2 + 1) * test::EpsilonDeadline()), actual: 106979 vs 60000
../../mojo/edk/system/waiter_unittest.cc:154: Failure
Expected: (elapsed) < ((2 + 1) * test::EpsilonDeadline()), actual: 89956 vs 60000
../../mojo/edk/system/waiter_unittest.cc:204: Failure
Expected: (elapsed) < ((5 + 1) * test::EpsilonDeadline()), actual: 152557 vs 120000
"""

QUnitBrowserTestRunner.Remoting_Webapp_Js_Unittest does not fail when ssh-ed/vnc-ed into the machine. The logs don't seem to provide very useful information.

The remaining browser_tests failures do reproduce when ssh-ed in.

ddoman: There are still 3 failures [2 from mojo_system_unittests, 1 from browser_tests] that do not reproduce when ssh-ed/vnc-ed into the machine. This suggests some type of test harness difference. Note that the audit info you posted is different from the one I posted in c#19 when creating a launch agent. Can you dig into those issues further? How are you triggering auto-login for the bots? Theoretically, LaunchAgents don't trigger unless someone logs into the machine.
,
Dec 6 2016
There's only one way to trigger auto login, and it's baked into the image (System Preferences -> Users and Groups -> Login Options. Automatic login drop down on the right is set to chrome-bot). Side note: If you're getting double prompted for credentials after the initial VNC password prompt, then that usually is a sign the window server has crashed and the second prompt is actually logging the user back in.
,
Dec 6 2016
How do you automatically log in if there's a password? mark@ said that when we initially set this up, he actually had to bake the password into the scripts he used. [The Window Server is not crashing on this bot].
,
Dec 6 2016
It's done automatically by the package that creates the chrome-bot user during image creation (https://github.com/MagerValp/CreateUserPkg).
,
Dec 6 2016
,
Dec 6 2016
,
Dec 10 2016
ddoman: Ping? The mojo_system_unittests fail on the bot deterministically, but don't fail when I ssh in and run them [or use VNC]. This implies there's a difference between the environments in which the tests are being run. Note that the failing tests strongly suggest that they are running at a lowered priority level. Are we doing anything that would change the scheduling/prioritization of processes?
,
Dec 12 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/puppet/+/cb0b54d0d753676469d20df38ec31bb5290a0df5 commit cb0b54d0d753676469d20df38ec31bb5290a0df5 Author: Scott Lee <ddoman@chromium.org> Date: Mon Dec 12 06:25:23 2016
,
Dec 12 2016
erikchen, I found that the buildbot slave was being given lower values for NumberOfFiles and NumberOfProcesses than it used to get. So I increased NumberOfFiles from 10240 to 20000, and NumberOfProcesses (the number of child processes) from 1064 to 2000. If that doesn't change the result, I will make a change so that the buildbot slave process is launched with its own plist instead of with service_manager.
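One way to confirm which limits a child process actually inherited is to print them from inside that process. This is a sketch using Python's standard `resource` module (Unix only); pairing launchd's NumberOfFiles and NumberOfProcesses keys with `RLIMIT_NOFILE` and `RLIMIT_NPROC` is an assumption based on how launchd.plist describes those keys, and `current_limits` is a hypothetical helper.

```python
import resource

# Assumed pairing between launchd resource-limit keys and setrlimit(2) resources.
LIMIT_NAMES = {
    "NumberOfFiles": resource.RLIMIT_NOFILE,
    "NumberOfProcesses": resource.RLIMIT_NPROC,
}


def current_limits():
    """Return {launchd key: (soft, hard)} for the calling process."""
    return {name: resource.getrlimit(res) for name, res in LIMIT_NAMES.items()}


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

Running this once under the launchd job and once from an ssh session would show whether the two environments differ in these limits.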
,
Dec 12 2016
The limit changes don't seem to have made any difference.
>> Are we doing anything that would change the scheduling/prioritization of processes?
Yes, just to give you some background: the buildbot slave, which takes a build request, compiles Chromium code, and runs the tests, used to run as a launchd agent.
However, we are migrating it to our own service startup program, called service_manager, so that service_manager runs it.
* Before
#1. launchd launches buildbot_slave, whose plist is located in ${HOME}/Library/LaunchAgents
* Now
#1. launchd launches service_manager, whose plist is located in /Library/LaunchAgents
#2. service_manager runs buildbot slave.
I will make a change to run buildbot slave with its own plist, and investigate further.
,
Dec 13 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/puppet/+/7e5465d0cd4fb05b57098ddd849186a2f7c836f6 commit 7e5465d0cd4fb05b57098ddd849186a2f7c836f6 Author: Scott Lee <ddoman@chromium.org> Date: Mon Dec 12 23:58:08 2016
,
Dec 13 2016
,
Dec 13 2016
Hi erikchen, I made a change to run buildbot slave with its own plist, but the same failures are still occurring. For example, the following build shows the same failing tests: https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14927

Here is my summary.
#1. Originally, the buildbot slave daemon ran as an agent with its own plist placed under ${HOME}/Library/LaunchAgents.
#2. A change was made to start the buildbot slave daemon with service_manager. However, this caused many failures in the build because service_manager was running as a daemon and, therefore, its child processes, such as browser_tests and interactive_ui_tests, didn't have graphic access. This is when this bug was reported. Here is an example of a build that failed while service_manager was running as a daemon: https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14842
#3. I made a change to run service_manager as an agent. As a result, the buildbot slave daemon ran within an Aqua session, and "launchctl procinfo" showed that the buildbot daemon process and its child processes had graphic access. Although many test failures went away, a small number of failures remained in browser_tests and mojo.
#4. I made another change to run buildbot with its own plist placed under ${HOME}/Library/LaunchAgents, i.e., to start the buildbot slave daemon with the same plist that was used in #1. However, the same test failures occurred in browser_tests and mojo. https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14927
-----------------
erikchen@, AFAIK, no change has been made to the buildbot slave daemon that would alter its scheduling/prioritization, and it is now running with the same plist file it originally ran with. https://chrome-internal.googlesource.com/infra/puppet/+/master/puppetm/etc/puppet/modules/chrome_infra/templates/setup/darwin/org.chromium.buildbot.slave.plist.erb
> There are still 3 failures [2 from mojo_system_unittests, 1 from browser_tests] that do not reproduce when sshed/vnced into the machine. This suggests some type of test harness difference. Note that the audit you posted is different from the one I posted in c#19 when creating a launch agent. Can you dig into those issues further?
Just for your information, the audit info for #3 and #4 was the same. As I understand it, the session ID is a login session ID; the LaunchAgent that ran ui_base_unittests directly probably showed a different session ID because you created and loaded the agent in your ssh/vnc session.
I don't know what else I can try to resolve those test failures, but in my opinion it is unlikely that the remaining failures are caused by the changes made for the service_manager migration. When you sshed/vnced into the machine and browser_tests ran successfully, what was the nice value of the process? I just sshed into one and found that all browser_tests processes had 0 in NI.
,
Dec 13 2016
ddoman: Thanks for the detailed update. I will continue to investigate.
,
Dec 14 2016
Was able to reproduce the mojo_system_unittests error on a local 10.12 device by launching it as a LaunchAgent. Does not reproduce on 10.11 machines, as expected. Adding the ProcessType "Interactive" key fixes the issue. https://chrome-internal-review.googlesource.com/#/c/311798/
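For reference, the change amounts to one extra key in the job's property list. The sketch below applies it with Python's standard `plistlib`; `ProcessType` and its `Interactive` value are launchd.plist settings, while `with_interactive_process_type` and the sample job are hypothetical and do not reproduce the internal CL.

```python
import plistlib


def with_interactive_process_type(plist_bytes):
    """Return a copy of a launchd job plist with ProcessType set to Interactive."""
    job = plistlib.loads(plist_bytes)
    # Interactive asks launchd not to throttle the job the way it may throttle
    # jobs of other process types.
    job["ProcessType"] = "Interactive"
    return plistlib.dumps(job)


# Placeholder job definition, for illustration only.
original = plistlib.dumps(
    {"Label": "org.example.test_runner", "ProgramArguments": ["/usr/local/bin/run_tests"]}
)
updated = with_interactive_process_type(original)
print(updated.decode("utf-8"))
```

After reloading the job, `launchctl procinfo <pid>` on the running process is one way to check that the setting took effect.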
,
Dec 14 2016
,
Dec 15 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/puppet/+/e1a549b682b4ff1f03453f566656b2b5160fcf33 commit e1a549b682b4ff1f03453f566656b2b5160fcf33 Author: erikchen <erikchen@google.com> Date: Wed Dec 14 02:56:05 2016
,
Dec 15 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/puppet/+/ff08cf35f259726d16f65ae7928f6d44df548da9 commit ff08cf35f259726d16f65ae7928f6d44df548da9 Author: Scott Lee <ddoman@chromium.org> Date: Thu Dec 15 06:29:38 2016
,
Dec 15 2016
+erikchen@, I am sorry that I didn't realize the following CL was adding <ProcessType> to service_manager.plist: https://chrome-internal-review.googlesource.com/#/c/311798/
As I mentioned above, I had made a change to start the buildbot slave daemon with its own plist (#1 in https://bugs.chromium.org/p/chromium/issues/detail?id=665691#c42). As a result, your CL had no effect on the buildbot slave daemon, since it was launched with a different plist. That's why the mojo tests were still failing in the following builds, which were triggered after your CL landed.
- https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14939
- https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14940
I landed a CL to start the buildbot slave daemon with service_manager again, and verified that service_manager is running in interactive mode.
```
$ launchctl procinfo the_pid_of_service_daemon
...
spawn type = interactive
```
I will check the result of mojo_tests again tomorrow: https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14942
,
Dec 16 2016
ddoman: It looks like every 5th build or so is going purple. The latest one is 14947. Note that while we lose contact at 15:55:27, the next build doesn't start until 16:44:42. Looking at the machine logs, it looks like it's still happily chugging away running browser_tests. This suggests that there's something wrong at the infra layer?
,
Dec 16 2016
* #14947
It ended at 15:55:27, and I found that master.chromium.fyi was started at 15:57:21: http://shortn/_REfDmWKBGG
"2016-12-15T15:57:21 master1 master.chromium.fyi _make_start success "
I believe that the master process was stopped (killed) at 15:55 and started with "make start" at 15:57:21. As a result, you can find purple builds in other builders under master.chromium.fyi.
- CrWin7Goma #37119 went purple and ended at 15:55:10
- Blimp Linux Engine #3129 went purple and ended at 15:55:31
* #14941
This was interrupted by my CL, which killed the running buildbot slave process and restarted it with service_manager.
* #14938
It was interrupted and ended at Dec 14, 13:50:02, and I found that the master was started at 13:51:17. https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14947
* #13933
It was interrupted and ended at Dec 13, 17:48:06, and I found that the master was restarted at the following time: 2016-12-13T17:49:14 master1 master.chromium.fyi _make_start success
> This suggests that's there's something with at the infra layer?
It was just a coincidence that chromium.fyi needed to be restarted several times for various reasons. You can find out which CLs were landed to schedule a master restart by looking at the history of the following git repo: https://chrome-internal.googlesource.com/infradata/master-manager.git
,
Dec 16 2016
erikchen: I can see that mojo_tests no longer fails in recent builds, but the following browser tests still fail.
- PluginPowerSaverBrowserTest.SmallCrossOrigin
- PluginPowerSaverBrowserTest.PosterTests
I am not sure whether those are failing only on 10.12. If those tests succeed on 10.11, then feel free to keep this open, continue the investigation, and ask for help if necessary. Otherwise, feel free to close this ticket and file a new one if necessary.
,
Dec 16 2016
Un-cc'ing myself for now, feel free to re-add me if I can help :).
,
Dec 16 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/fde5cb085de250b5ac8b96ff8f95e1d54038ae2d

commit fde5cb085de250b5ac8b96ff8f95e1d54038ae2d
Author: erikchen <erikchen@chromium.org>
Date: Fri Dec 16 22:29:30 2016

Disable two plugin power saver tests on macOS.

The tests fail on macOS 10.12 and need to be investigated by the PPS team.

BUG=599484, 665691
Review-Url: https://codereview.chromium.org/2585433002
Cr-Commit-Position: refs/heads/master@{#439218}

[modify] https://crrev.com/fde5cb085de250b5ac8b96ff8f95e1d54038ae2d/chrome/browser/plugins/plugin_power_saver_browsertest.cc
,
Dec 18 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/puppet/+/1c78b4650e34bb1fd8f3defcd865844e0b33f897 commit 1c78b4650e34bb1fd8f3defcd865844e0b33f897 Author: Scott Lee <ddoman@chromium.org> Date: Thu Dec 15 07:15:01 2016
,
Dec 19 2016
Thanks to everyone here for fixing these test harness problems on 10.12. How close do you think we are to being able to use 10.12 on some of the test bots? I'd like to deploy it on the GPU bots in Issue 673921. Thanks.
,
Dec 19 2016
10.12 toolchain is now green: https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain Currently waiting on Infra-Labs to start rolling out 10.12 [and to determine whether we want to roll out 10.12.1 or 10.12.2]. https://bugs.chromium.org/p/chromium/issues/detail?id=659213#c7
,
Dec 19 2016
Awesome! Thank you! Personally I'd vote for 10.12.2 -- Apple fixed a lot of graphics driver bugs in that release.
,
Dec 19 2016
I'd rather roll 10.12.2. No sense in running old point releases when people are usually forced to upgrade to the latest. I'm reopening this and assigning it to me, because the first thing I'd reinstall is this force toolchain bot, to ensure it still rolls green with 10.12.2.
,
Dec 19 2016
Sweet.
,
Dec 20 2016
build9-m1 is on 10.12.2 starting with https://build.chromium.org/p/chromium.fyi/builders/Chromium%20Mac%2010.11%20Force%20Mac%20Toolchain/builds/14981. I'll check up on it in a few hours.
,
Dec 29 2016
,
Jan 3 2017
10.12.2 image seems fine.
Comment 1 by erikc...@chromium.org
, Nov 16 2016