Window server crashing causes infra failures on mac build steps
Issue description
OS: Mac
I've been working on burning down infra failures on the build bots. I've noticed an error that has become a trend in the build step logs of the Mac Chromium builders. I don't yet know if this is isolated to specific builders, OS versions, or even build steps (it seems like we might also see it in unit_tests if we look). It looks like this:
2016-10-03 17:06:01.530 browser_tests[67041:507] Error (1000) creating CGSWindow
(pulled from https://chromium-swarm.appspot.com/user/task/31a5cfd9354d1910)
These step failures end up looking like a missing shard, where the test runner exits prematurely for seemingly no reason at all. After noticing this error again, I searched for it in the issue tracker, and it pops up in crash reports and flaky test failures as well. Based on those, it seems like this is a Mac-specific issue with the window server crashing. I'll paste a list of related issues below.
I'm not entirely sure of the quantitative impact, but it does appear in ~65% of the logs for browser_tests (with patch) marked with INFRA_FAILURE in the past three days. Below I'll paste a list of relevant build logs.
Build logs:
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309349/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309241/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/308621/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/308245/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/308243/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307902/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307672/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307613/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307568/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307518/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307441/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307269/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307115/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307028/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/306929/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
Issues:
https://bugs.chromium.org/p/chromium/issues/detail?id=515627
https://bugs.chromium.org/p/chromium/issues/detail?id=536195
https://bugs.chromium.org/p/chromium/issues/detail?id=650538
https://bugs.chromium.org/p/chromium/issues/detail?id=647089
https://bugs.chromium.org/p/chromium/issues/detail?id=520668
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.component%3D%27src%2Fui%2Fviews%2Fcocoa%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27-%5BNativeWidgetMacNSWindow%20initWithContentRect%3AstyleMask%3Abacking%3Adefer%3A%5D%27%20AND%20product.version%3D%2752.0.2743.116%27%20OMIT%20RECORD%20IF%20SOME(ProductData.key%3D%27list_annotations%27%20AND%20ProductData.value%3D%27Crashing%20on%20exception%3A%20Error%20(1000)%20creating%20CGSWindow%27)%20!%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=
,
Oct 7 2016
Found some more of these:
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/310651/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/310590/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/310460/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/310395/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/310285/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/310242/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/310151/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309834/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
,
Oct 7 2016
I haven't touched webview in a long time. Removing myself from cc and assigning wjmaclean@
,
Oct 10 2016
,
Oct 11 2016
I spent some time poking through some of the outputs, and while there are lots of examples where the window server failure happens when a WebViewTest is running, there are cases where it happens in a shard that is running some other sort of test, and has yet to run a WebViewTest: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309834/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text I'm going to guess that WebViewTest's multi-renderer-process nature may create some sort of timing situation that is more prone to triggering the error, but I'm not sure that they are the ultimate cause of it. erikchen@ ... what do you think after looking at the output above?
,
Oct 11 2016
Hey katthomas, I just chatted with wjmaclean about ways to investigate this issue. He's going to look into some of the details on the WebView side. It would be really helpful if you could get us some stats from the Infra side:
1) What is the current prevalence of this error on the infra side? Could we come up with a rough metric such as probability of this error occurring on any given run of browser_tests?
2) Was there a spike in this error at some point in the recent past? If we can get a time range on that, we might be able to correlate with changes on the WebView side.
3) wjmaclean@ is going to try turning off WebView tests on Mac for a couple of days. Ideally, we'd have the metric from (1) ready to measure this to determine whether or not WebView tests are related to this error.
,
Oct 11 2016
,
Oct 12 2016
,
Oct 12 2016
@erikchen Thanks! I don't know how to get better numbers on this at the moment, but this is a start.
1) It seems like this happens on roughly 1% of browser_tests runs on mac_chromium_rel_ng, but makes up a majority of legit infra failures on this builder (it accounts for 100% of infra failures in the past day).
2) I don't believe so. It seems to occur consistently at this low rate.
3) Sounds good. Keep me updated!
,
Oct 13 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/3a2ff0e84fc4bbdce80e09650492519ba8397d4f
commit 3a2ff0e84fc4bbdce80e09650492519ba8397d4f
Author: wjmaclean <wjmaclean@chromium.org>
Date: Thu Oct 13 18:10:25 2016
Disable (temporarily) WebView browsertests on Mac.
In order to investigate the causes of the windows-server crashes on the Mac bots, this CL temporarily disables WebView browsertests on Mac. Once their impact is better understood, they will be re-enabled.
BUG= 653353
Review-Url: https://codereview.chromium.org/2414813002
Cr-Commit-Position: refs/heads/master@{#425090}
[modify] https://crrev.com/3a2ff0e84fc4bbdce80e09650492519ba8397d4f/chrome/browser/ui/webui/webui_webview_browsertest.cc
,
Oct 18 2016
Huh, I'm still seeing WebView tests running (and causing that error), e.g. https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/316395 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/316395/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio
,
Oct 19 2016
Hmm, that's my fault ... I only disabled the WebUI WebView browser tests ... I've put up another CL to disable the remaining WebView browser tests as well.
,
Oct 19 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e
commit 0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e
Author: wjmaclean <wjmaclean@chromium.org>
Date: Wed Oct 19 19:00:10 2016
Disable (temporarily) WebView Browsertests on Mac build bots.
In order to investigate the causes of the windows-server crashes on the Mac bots, this CL temporarily disables WebView browsertests on Mac. Once their impact is better understood, they will be re-enabled.
This CL is related to https://codereview.chromium.org/2414813002 which just disabled the WebUI WebView browser tests.
BUG= 653353
Review-Url: https://chromiumcodereview.appspot.com/2433943003
Cr-Commit-Position: refs/heads/master@{#426254}
[modify] https://crrev.com/0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e/chrome/browser/apps/guest_view/web_view_browsertest.cc
,
Oct 21 2016
Still seeing the error, but in different places (progress!)
PrintPreviewWebUITest.TestAdvancedSettings2Options (https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.mac%2Fmac_chromium_rel_ng%2F320075%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch__on_Mac-10.9%2F0%2Fstdout)
NavigatingExtensionPopupBrowserTest.DownloadViaPost (https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.mac%2Fmac_chromium_rel_ng%2F319332%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch__on_Mac-10.9%2F0%2Fstdout)
HotwordPrivateApiTest.HotwordSession (https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.mac%2Fmac_chromium_rel_ng%2F319213%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch__on_Mac-10.9%2F0%2Fstdout)
HotwordPrivateApiTest.OnFinalizeSpeakerModel (https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.mac%2Fmac_chromium_rel_ng%2F318883%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch__on_Mac-10.9%2F0%2Fstdout)
,
Nov 3 2016
Is there a bug to track the error on these other tests? I'm assuming we should disable everything that causes the error, then slowly re-add them in order to look for the root cause?
,
Nov 3 2016
Unfortunately, that makes the assumption that the tests that fail are the ones that caused the failure. For WebView Browsertests, both the failing tests and the previously run tests were WebView Browsertests, which looked very suspicious. At this point, I think we should reenable the WebView Browsertests and take an alternative tack.
Question 1: Do these errors go away if browser_tests use no swarming, and no parallel task execution? When we recently tried to swarm interactive_ui_tests, they had very similar looking failures: https://bugs.chromium.org/p/chromium/issues/detail?id=660582#c33
Basically I want to narrow down the cause to:
1a) swarming
1b) parallel task execution on a single machine
1c) Something broken about the test, even in single-task execution, no swarming.
,
Nov 3 2016
,
Nov 3 2016
Ok, I didn't know if we wanted to keep removing tests to try and get to a simpler (smaller) set of suspect tests (e.g. your 1c hypothesis). But given the other possibilities you mention, there may be alternate approaches required to decide between them. I'll re-enable the WebView browser tests then.
,
Nov 3 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/0921d355baacbff53a1dd4bd88ee48fa5f51bb61
commit 0921d355baacbff53a1dd4bd88ee48fa5f51bb61
Author: wjmaclean <wjmaclean@chromium.org>
Date: Thu Nov 03 19:10:51 2016
Revert of Disable (temporarily) WebView Browsertests on Mac build bots. (patchset #1 id:1 of https://codereview.chromium.org/2433943003/ )
Reason for revert:
This patch was only meant to be temporary. Time to re-enable the tests.
Original issue's description:
> Disable (temporarily) WebView Browsertests on Mac build bots.
>
> In order to investigate the causes of the windows-server crashes on the
> Mac bots, this CL temporarily disables WebView browsertests on Mac. Once
> their impact is better understood, they will be re-enabled.
>
> This CL is related to https://codereview.chromium.org/2414813002 which
> just disabled the WebUI WebView browser tests.
>
> BUG= 653353
>
> Committed: https://crrev.com/0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e
> Cr-Commit-Position: refs/heads/master@{#426254}
TBR=thestig@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 653353
Review-Url: https://codereview.chromium.org/2468423005
Cr-Commit-Position: refs/heads/master@{#429663}
[modify] https://crrev.com/0921d355baacbff53a1dd4bd88ee48fa5f51bb61/chrome/browser/apps/guest_view/web_view_browsertest.cc
,
Nov 9 2016
Assigning to smut to help investigate from the infra side.
,
Nov 10 2016
I mostly know about iOS, I have no idea what this is about.
,
Nov 10 2016
@maruel, @dpranke I'm looking for some direction here. What do you think is the best way to debug this one? Who is an appropriate owner? Some context: I'm working on burning down infra failures on bots and need to delegate the work on the bugs that are coming out of that.
,
Nov 10 2016
@katthomas - this isn't an infra issue IMO (at least, not directly). Someone on the mac dev side should be the primary owner for this, and I don't think this should be classified as an infra failure until we have a strong reason to think it is one. Given that we mostly run things through swarming these days, it's not clear to me that we can easily run *anything* on macs at scale that takes more than a few seconds to run that isn't swarmed. It also seems kinda unlikely to me that swarming is the issue per se here. I'd start by reenabling just a few tests and disabling parallel execution (but leaving swarming on), and seeing if one can repro the issue. But make a mac dev (erikchen/shrike/wjmaclean) do it, unless you have nothing better to do or really want to dig into this yourself.
,
Nov 10 2016
@dpranke - Thanks! @maruel - Why did this get marked purple? I think we may have discussed that test runner crashes should be red. This bug is old, so maybe that has been fixed by now?
,
Nov 11 2016
This bug is getting punted around a lot. I don't have the time to look into it. Over to shrike@ for more triage.
,
Nov 15 2016
@shrike -- can you take a look? Some more recent builds:
https://luci-milo.appspot.com/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/334508
https://luci-milo.appspot.com/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/334584
https://luci-milo.appspot.com/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/334481
https://luci-milo.appspot.com/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/334545
https://luci-milo.appspot.com/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/334698
,
Nov 15 2016
Taking as an example https://chromium-swarm.appspot.com/user/task/327bb526753bc210
Swarming is doing what it can:
- The bot starts the task
- As I understand it, the task kills the GPU driver
- The bot survives the OS breakdown
- The bot updates the task results with failure code -15
- The bot fails to upload outputs, either because there were none, which would be surprising since Pawel changed that, or because run_isolated got killed while trying to do so (fairly possible).
The recipe decides to mark it as an infrastructure failure because there's no JSON, but I agree with Dirk that this should be a test failure. The recipe doesn't have much signal to know that, though it could guess by looking at the first few bytes of stdout.
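For readers decoding the "-15" above: a minimal sketch, assuming the failure code follows the Python subprocess convention where a negative return code means "killed by that signal number". Under that assumption -15 corresponds to SIGTERM. The helper name describe_exit is hypothetical and not part of swarming.

```python
import signal


def describe_exit(returncode):
    """Describe a subprocess-style return code.

    Assumption: negative values mean the process was killed by the
    signal with that number (the Python subprocess convention).
    """
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode
```

For example, describe_exit(-15) reports a SIGTERM kill, while describe_exit(0) reports a normal exit.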
,
Nov 15 2016
If this wasn't a swarmed test (i.e., if it was a different kind of step like a local gtest or a script test), you would be able to tell that the process died and that we didn't get valid test results as a result, and I think the step would end up getting marked as red in that case. @maruel, is there a reason that swarming doesn't do more-or-less the same thing?
Specifically, if you look at the log message from the collect/results step: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.mac%2Fmac_chromium_rel_ng%2F334508%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch__on_Mac-10.9%2F0%2Flogs%2Fsome_shards_did_not_complete%3A_3%2F0
You get:
Missing results from the following shard(s): 3
It can happen in following cases:
* Test failed to start (missing *.dll/*.so dependency for example)
* Test crashed or hung
* Task expired because there are not enough bots available and are all used
* Swarming service experiences problems
Please examine logs to figure out what happened.
It seems like swarming should be able to tell the difference between the first two (which I'd consider to be a test failure) and the last two (infra failures), simply by knowing that a process was launched and it exited (and there's either no JSON or an incomplete JSON file). Would that work?
I can see how there might be cases where a SIGTERM would indicate that there's something wrong with the machine and infra/troopers should look first, but I don't know how common that is?
,
Nov 15 2016
> If this wasn't a swarmed test (i.e., if it was a different kind of step like a local gtest or a script test), you would be able to tell that the process died and that we didn't get valid test results as a result, and I think the step would end up getting marked as red in that case.
No, the buildbot slave would die (because it doesn't try to delay SIGTERM as much as it can like the swarming bot does) and the whole build would abruptly stop.
> It seems like swarming should be able to tell the difference between the first two (which I'd consider to be a test failure) and the last two (infra failures), simply by knowing that a process was launched and it exited (and there's either no JSON or an incomplete JSON file). Would that work?
It is a recipe change. Swarming doesn't care.
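To make the proposed distinction concrete, here is a rough sketch of the classification being discussed. It is not the actual recipe code: the function name classify_shard and its three inputs are hypothetical, and it assumes the recipe can learn whether the task was ever launched on a bot.

```python
def classify_shard(task_ran, exit_code, has_valid_json):
    """Classify one shard's outcome (hypothetical helper, not recipe code).

    task_ran: True if a bot actually launched the test process.
    exit_code: the process exit code, or None if nothing ran.
    has_valid_json: True if a complete JSON results file came back.
    """
    if not task_ran:
        # Task expired or the swarming service had problems: nothing launched.
        return "infra_failure"
    if has_valid_json and exit_code == 0:
        return "success"
    # The process was launched and exited, but failed to start properly,
    # crashed, hung, or left missing/incomplete JSON behind.
    return "test_failure"
```

Under these assumptions, a shard that ran and was killed with -15 without producing JSON would be reported as a test failure (red) rather than an infra failure (purple).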
,
Nov 15 2016
Hi erikchen@ - you seem like you might be the best person to debug this on the Mac side. I was poking around and ran across this discussion of error 1000: https://github.com/TooTallNate/NodObjC/issues/21 They are able to reproduce the problem by creating a window without first creating an NSApplication object. I'm wondering if there might be a race condition that allows this to happen in the tests.
,
Nov 15 2016
As I indicated in c#25, I don't have time to look into this. You're welcome to assign it to me, but it isn't going to get action any time soon.
,
Nov 15 2016
> In comment #29, maruel@ wrote:
>> In comment #28, dpranke@ wrote:
>> If this wasn't a swarmed test (i.e., if it was a different kind of step like a local
>> gtest or a script test), you would be able to tell that the process died and that
>> we didn't get valid test results as a result, and I think the step would end up
>> getting marked as red in that case.
> No, the buildbot slave would die (because it doesn't try to delay SIGTERM as much
> as it can like the swarming bot does) and the whole build would abruptly stop.
You lost me. If browser_tests exits with a SIGTERM, why would that be any different from any other subprocess exit? Or are you saying that *every* process got killed with a SIGTERM? If so, I can see how that'd be bad, yes :).
>> It seems like swarming should be able to tell the difference between the first
>> two (which I'd consider to be a test failure) and the last two (infra failures),
>> simply by knowing that a process was launched and it exited (and there's either
>> no JSON or an incomplete JSON file). Would that work?
>
> It is a recipe change. Swarming doesn't care.
Okay. Who owns the swarming recipe_module? (Though, if every process got nuked and we can't tell the two cases apart, I'm not sure it matters?)
,
Nov 16 2016
My suspicion is that *every* process gets killed with SIGTERM.
,
Nov 23 2016
What are our next steps here? It sounds like there are two things:
1) Infra related: this failure should have been marked as red. What are the changes we need to make to do that and who should own it?
2) Test related: fixing the underlying issue. @erikchen has indicated he doesn't have time. @shrike, do you know of a good alternative? Assigning to you for now.
,
Nov 29 2016
Issue 652409 also brings down the window server, FWIW.
,
Dec 5 2016
Saving the actual swarming task ids as the archived data can be used to reproduce the task: https://chromium-swarm.appspot.com/task?id=32d42fbe7a78ec10 https://chromium-swarm.appspot.com/task?id=32d44712b6711e10 See the help on these pages to learn how to reproduce the task locally.
,
Dec 7 2016
maruel@ - thanks for the info in #37. Re: the instructions on those pages: for "Download inputs files into directory foo", what are the input files, and for "Run this task locally", what is the string at the very end of the command specifying?
While it's true that there's mention of problems with creating windows which suggests a problem with the window server, there are other errors that occur before that, specifically:
[1201/094108.326477:ERROR:kill_posix.cc(84)] Unable to terminate process group 2738: No such process
and
[2738:50447:1201/094107.549351:WARNING:mac_util.mm(222)] Failed to set backup exclusion for file '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/.org.chromium.Chromium.Jd6PoW/d1GvCr1/Default/History': Error Domain=NSOSStatusErrorDomain Code=-50 "The operation couldn’t be completed. (OSStatus error -50.)" (paramErr: error in user parameter list) (-50)
The first error happens here:
bool KillProcessGroup(ProcessHandle process_group_id) {
  bool result = kill(-1 * process_group_id, SIGKILL) == 0;
  ...
}
KillProcessGroup() is called from test_launcher.cc and is preceded by the following comment:
// On POSIX, in case the test does not exit cleanly, either due to a crash
// or due to it timing out, we need to clean up any child processes that
// it might have created. On Windows, child processes are automatically
// cleaned up using JobObjects.
So something has gone wrong and the test is trying to kill off lingering child processes. However, man 2 kill says that the pid parameter to kill() can be > 0, == 0, or == -1. The code above passes in -1 * pid, and I'm guessing that's why it's failing. rsesek@, it looks like you wrote this bit of code - what are your thoughts?
(It seems unlikely this failure would cause the follow-on problems, but I want to fix this and then move on to the next issue.)
,
Dec 7 2016
I didn't author that code originally -- I think the revision you're looking at is just a file move/reorganization. The use of kill(-PID) is because that's how killpg() is implemented: https://opensource.apple.com/source/Libc/Libc-1158.20.4/compat-43/FreeBSD/killpg.c.auto.html https://github.com/lattera/glibc/blob/59ba27a/sysdeps/posix/killpg.c Why the code doesn't use killpg() is a mystery to me, though.
,
Dec 7 2016
> I didn't author that code originally -- I think the revision you're looking at is just a file move/reorganization.
OK.
> The use of kill(-PID) is because that's how killpg() is implemented:
The man page for killpg() doesn't seem to discuss negating the process group id?
,
Dec 7 2016
> The man page for killpg() doesn't seem to discuss negating the process group id?
It's an implementation detail. killpg is implemented in terms of kill(-PID).
,
Dec 7 2016
I see. So this code should really be changed to killpg(pgid), which behind the scenes will call kill(-pgid)? Seems like cleaning things up is the right thing, but it may not solve the problem of this call failing.
,
Dec 7 2016
> So this code should really be changed to killpg(pgid), which behind the scenes will call kill(-pgid)?
Yes, I think so. Maybe killpg didn't exist in some Libc, so it may be worth digging into the history to figure out where it was added. I agree that changing it won't solve the problem though.
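As an illustration of the point above (not the Chromium code), here is a small Python sketch of signalling a process group with a negated id. It assumes a POSIX system; kill_process_group and demo are hypothetical names. Signalling a group that no longer has any members raises ProcessLookupError (ESRCH, "No such process"), which matches the wording of the kill_posix.cc error quoted earlier.

```python
import os
import signal
import subprocess
import sys


def kill_process_group(pgid):
    """Send SIGKILL to a process group; False if the group is already gone."""
    try:
        # A negative pid targets the whole process group, which is what
        # os.killpg(pgid, signal.SIGKILL) does as well.
        os.kill(-pgid, signal.SIGKILL)
        return True
    except ProcessLookupError:  # ESRCH: "No such process"
        return False


def demo():
    """Kill a child's process group twice and report both outcomes."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,  # child becomes leader of a new group
    )
    pgid = os.getpgid(child.pid)
    first = kill_process_group(pgid)   # group exists, signal delivered
    child.wait()                       # reap the child; group is now empty
    second = kill_process_group(pgid)  # nothing left to signal
    return first, second
```

In this sketch the second call fails only because the group has already exited, which would be consistent with the test launcher cleaning up after a process that was already dead.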
,
Dec 7 2016
Re comment #c38: yes, that's what I meant.
,
Dec 7 2016
maruel@ - would you clarify? I was asking questions about the workings of the commands on the pages in c#37.
,
Dec 7 2016
I don't understand. Run the command in monospace and it will do what the title just above it says.
,
Dec 7 2016
Feel free to ping me over IM with follow-up questions; it'll be more efficient than over a bug.
,
Dec 8 2016
,
Dec 16 2016
shrike@ Are you still owning this bug? If not, we need to find a new owner, as this is a P1 and I'm concerned with it languishing for this long. As maruel@ said in comment 47, please ping him if you have questions.
,
Dec 16 2016
Unfortunately I do not have the cycles to look into this. erikchen@, are you able to look at this given that infra feels it is p1? sdy@ might also be a good person, but I don't know how strapped he is for time (sdy@ - if you do look at it, you should try to pick erikchen@'s brain to get his ideas about what's going on). When I last talked to him about it, I believe erikchen@ told me he thinks it's related to the level of swarming, so a workaround might be to reduce the amount of swarming.
,
Dec 16 2016
Do we have system logs from the failures? I'm curious if any messages were logged to the console, or crashes were generated for WindowServer or other system processes. I don't have spare cycles but can divert some if necessary… right now, I'm just interested in seeing what info is available.
,
Dec 16 2016
sdy: You can see my previous work on this here: https://bugs.chromium.org/p/chromium/issues/detail?id=515627
,
Dec 17 2016
Note that this issue may magically get better or worse when we rev the try-bots to 10.12. https://bugs.chromium.org/p/chromium/issues/detail?id=659213#c7 The current rate at which this failure happens is relatively low: I see one failure in the last ~150 builds: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng?numbuilds=200 [it looks like numbuilds=200 doesn't actually show 200 builds]. In my link in c#52, the failures were occurring frequently, and also taking the machine offline [it would reboot, and then proceed to not auto-login, thereby not triggering the service_manager launchagent]. These failures don't actually appear to take the machine offline, which makes them much less severe.
,
Dec 17 2016
Here is a query to get some recent builds w/ this failure: https://goto.google.com/window-server-query This gets all the infra failed builds on master.chromium.mac and master.tryserver.chromium.mac in the last three days where the failure is in the browser_tests (with patch) step. Usually these are due to the window server failure. To double-check I usually search the logs for "CGSWindow" and that helps me find the failures.
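The manual "search the logs for CGSWindow" step could be scripted roughly as follows. This is only a sketch: the regular expression is based on the two log lines quoted in this bug ("Error (1000) creating CGSWindow" and "Error (268435459) creating CGSWindow on line 263"), and the function name find_cgs_window_errors is hypothetical.

```python
import re

# Matches the error text seen in the logs quoted in this bug.
CGS_ERROR = re.compile(r"Error \((\d+)\) creating CGSWindow")


def find_cgs_window_errors(log_text):
    """Return the numeric error codes of all CGSWindow errors in a log."""
    return [int(match.group(1)) for match in CGS_ERROR.finditer(log_text)]
```

A non-empty result for a step's stdio log suggests the step hit the window server failure; an empty list means the infra failure probably had some other cause.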
,
Jan 16 2017
I've filed bug 681684 to figure out if we can reclassify this kind of failure as a test failure (to mark the step red, not purple). As discussed above, we should view this as a problem with the test suite, not the infrastructure (until proven otherwise), so I'm clearing the Infra-Failures label from this bug (and have applied it to the other one). @shrike - what component(s) should this be tracked under? Feel free to clear the owner once you've updated things and this is properly triaged.
,
Jan 18 2017
Haven't seen these errors before, but I don't think they add much information:
2017-01-16 18:55:53.777 browser_tests[48622:303] HIToolbox: received notification of WindowServer event port death.
2017-01-16 18:55:53.777 browser_tests[48622:303] port matched the WindowServer port created in BindCGSToRunLoop
No session list!
[48631:771:0116/185553.950898:WARNING:mac_util.mm(153)] Couldn't get the main display's color space, using generic
2017-01-16 18:55:54.815 browser_tests[48622:303] An uncaught exception was raised
2017-01-16 18:55:54.815 browser_tests[48622:303] Error (268435459) creating CGSWindow on line 263
2017-01-16 18:55:54.815 browser_tests[48622:303] (
0 CoreFoundation 0x00007fff856b925c __exceptionPreprocess + 172
1 browser_tests 0x000000010e019d0c _ZN6chromeL25ObjcExceptionPreprocessorEP11objc_object + 860
2 libobjc.A.dylib 0x00007fff89098e75 objc_exception_throw + 43
3 CoreFoundation 0x00007fff856b910c +[NSException raise:format:] + 204
4 AppKit 0x00007fff83ee9e95 _NSCreateWindowWithOpaqueShape2 + 1403
5 AppKit 0x00007fff83ee8a21 -[NSWindow _commonAwake] + 3720
6 AppKit 0x00007fff83dc4400 -[NSWindow _commonInitFrame:styleMask:backing:defer:] + 882
7 AppKit 0x00007fff83dc3882 -[NSWindow _initContent:styleMask:backing:defer:contentView:] + 1054
8 AppKit 0x00007fff83dc3458 -[NSWindow initWithContentRect:styleMask:backing:defer:] + 45
9 browser_tests 0x000000010f9e008a -[UnderlayOpenGLHostingWindow initWithContentRect:styleMask:backing:defer:] + 218
10 browser_tests 0x0000000111b91053 -[ChromeEventProcessingWindow initWithContentRect:styleMask:backing:defer:] + 83
,
Jan 18 2017
,
Nov 21 2017
Is this still happening, or does it still need an action item?
,
Jan 25 2018
,
Sep 13
Archiving old bugs that haven't been actively assigned in over 180 days. If you feel this issue should still be addressed, feel free to reopen it or to file a new issue. Thanks!
Comment 1 by erikc...@chromium.org
, Oct 6 2016