
Issue 653353

Starred by 2 users

Issue metadata

Status: Archived
Owner: ----
Closed: Sep 13
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug

Blocking:
issue 649391




Window server crashing causes infra failures on mac build steps

Project Member Reported by katthomas@chromium.org, Oct 6 2016

Issue description

OS: Mac

I've been working on burning down infra failures on the build bots. I've noticed an error that has become a trend in the build step logs of the mac chromium builders. I don't yet know if this is isolated to specific builders, OS versions, or even build steps (seems like we might see it in unit_tests if we look).

It looks like this:

2016-10-03 17:06:01.530 browser_tests[67041:507] Error (1000) creating CGSWindow

(pulled from https://chromium-swarm.appspot.com/user/task/31a5cfd9354d1910)

These step failures end up looking like a missing shard, where the test runner exits prematurely for seemingly no reason at all. 

After noticing this error again, I searched the issue tracker for it, and it shows up in crash reports and flaky test failures as well. Based on those, this looks like a Mac-specific issue with the window server crashing. I'll paste a list of related issues below.

I'm not entirely sure of the quantitative impact, but it does appear in ~65% of the logs for browser_tests (with patch) marked with INFRA_FAILURE in the past three days. Below I'll paste a list of relevant build logs.

Build logs:
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309349/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309241/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/308621/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/308245/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/308243/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307902/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307672/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307613/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307568/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307518/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307441/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307269/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307115/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/307028/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/306929/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text

Issues:
https://bugs.chromium.org/p/chromium/issues/detail?id=515627
https://bugs.chromium.org/p/chromium/issues/detail?id=536195
https://bugs.chromium.org/p/chromium/issues/detail?id=650538
https://bugs.chromium.org/p/chromium/issues/detail?id=647089
https://bugs.chromium.org/p/chromium/issues/detail?id=520668

https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.component%3D%27src%2Fui%2Fviews%2Fcocoa%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27-%5BNativeWidgetMacNSWindow%20initWithContentRect%3AstyleMask%3Abacking%3Adefer%3A%5D%27%20AND%20product.version%3D%2752.0.2743.116%27%20OMIT%20RECORD%20IF%20SOME(ProductData.key%3D%27list_annotations%27%20AND%20ProductData.value%3D%27Crashing%20on%20exception%3A%20Error%20(1000)%20creating%20CGSWindow%27)%20!%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=
 
Cc: wjmaclean@chromium.org lazyboy@chromium.org fsam...@chromium.org
The first 3 links I looked at experienced the error in the same test suite. This makes it seem highly likely that something in WebViewTests is taking down Window Server. cc-ing OWNERs.

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309349/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text

WebViewTests/WebViewTest.Shim_TestLoadDataAPI/0

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309241/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
WebViewTests/WebViewTest.Shim_TestRemoveWebviewOnExit/0

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/308621/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text
WebViewTests/WebViewTest.Shim_TestLoadDataAPI/0


Cc: -fsam...@chromium.org
Owner: wjmaclean@chromium.org
I haven't touched webview in a long time. Removing myself from cc and assigning wjmaclean@
Components: -Infra Infra>Client>Chrome
I spent some time poking through some of the outputs, and while there are lots of examples where the window server failure happens when a WebViewTest is running, there are cases where it happens in a shard that is running some other sort of test, and has yet to run a WebViewTest:

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/309834/steps/browser_tests%20%28with%20patch%29%20on%20Mac-10.9/logs/stdio/text

I'm going to guess that WebViewTest's multi-renderer-process nature may create some sort of timing situation that is more prone to triggering the error, but I'm not sure that they are the ultimate cause of it.

erikchen@ ... what do you think after looking at the output above?
Cc: katthomas@chromium.org
Hey katthomas, I just chatted with wjmaclean about ways to investigate this issue. He's going to look into some of the details on the WebView side. It would be really helpful if you could get us some stats from the Infra side:

1) What is the current prevalence of this error on the infra side? Could we come up with a rough metric such as probability of this error occurring on any given run of browser_tests?

2) Was there a spike in this error at some point in the recent past? If we can get a time range on that, we might be able to correlate with changes on the WebView side.

3) wjmaclean@ is going to try turning off WebView tests on Mac for a couple of days. Ideally, we'd have the metric from (1) ready to measure this to determine whether or not WebView tests are related to this error.
Blocking: 649391
Labels: Hotlist-Infra-Flakiness
@erikchen Thanks! I don't know how to get better numbers on this at the moment, but this is a start.

1) It seems like this happens on roughly 1% of browser_tests runs on mac_chromium_rel_ng, but makes up a majority of legit infra failures on this builder (it accounts for 100% of infra failures in the past day). 

2) I don't believe so. It seems to occur consistently at this low rate.

3) Sounds good. Keep me updated! 
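A rough way to put numbers on question (1) is to treat each browser_tests run as an independent trial and report the observed rate together with a confidence interval. The sketch below is illustrative only: the counts are placeholders rather than data from the bots, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder counts for illustration only (not measured values).
failures, runs = 3, 300
low, high = wilson_interval(failures, runs)
print(f"observed rate: {failures / runs:.2%}, interval: {low:.2%} to {high:.2%}")
```

With a failure rate this low, the interval stays wide unless many runs are counted, which matters when comparing the rate before and after disabling a set of tests.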
Project Member

Comment 10 by bugdroid1@chromium.org, Oct 13 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3a2ff0e84fc4bbdce80e09650492519ba8397d4f

commit 3a2ff0e84fc4bbdce80e09650492519ba8397d4f
Author: wjmaclean <wjmaclean@chromium.org>
Date: Thu Oct 13 18:10:25 2016

Disable (temporarily) WebView browsertests on Mac.

In order to investigate the causes of the windows-server crashes on the
Mac bots, this CL temporarily disables WebView browsertests on Mac. Once
their impact is better understood, they will be re-enabled.

BUG= 653353 

Review-Url: https://codereview.chromium.org/2414813002
Cr-Commit-Position: refs/heads/master@{#425090}

[modify] https://crrev.com/3a2ff0e84fc4bbdce80e09650492519ba8397d4f/chrome/browser/ui/webui/webui_webview_browsertest.cc

Hmm, that's my fault ... I only disabled the WebUI WebView browser tests. I've put up another CL to disable the remaining WebView browser tests as well.
Project Member

Comment 13 by bugdroid1@chromium.org, Oct 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e

commit 0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e
Author: wjmaclean <wjmaclean@chromium.org>
Date: Wed Oct 19 19:00:10 2016

Disable (temporarily) WebView Browsertests on Mac build bots.

In order to investigate the causes of the windows-server crashes on the
Mac bots, this CL temporarily disables WebView browsertests on Mac. Once
their impact is better understood, they will be re-enabled.

This CL is related to https://codereview.chromium.org/2414813002 which
just disabled the WebUI WebView browser tests.

BUG= 653353 

Review-Url: https://chromiumcodereview.appspot.com/2433943003
Cr-Commit-Position: refs/heads/master@{#426254}

[modify] https://crrev.com/0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e/chrome/browser/apps/guest_view/web_view_browsertest.cc

Owner: ----
Is there a bug to track the error on these other tests? I'm assuming we should disable everything that causes the error, then slowly re-add them in order to look for the root cause?
Unfortunately, that makes the assumption that the tests that fail are the ones that caused the failure. For WebView Browsertests, both the failing tests and the previously run tests were WebView Browsertests, which looked very suspicious. At this point, I think we should re-enable the WebView Browsertests and take an alternative tack.

Question 1: Do these errors go away if browser_tests use no swarming, and no parallel task execution? When we recently tried to swarm interactive_ui_tests, they had very similar looking failures: https://bugs.chromium.org/p/chromium/issues/detail?id=660582#c33

Basically I want to narrow down the cause to:
  1a) swarming
  1b) parallel task execution on a single machine
  1c) Something broken about the test, even in single-task execution, no swarming.


Cc: shrike@chromium.org
Ok, I didn't know if we wanted to keep removing tests to try and get to a simpler (smaller) set of suspect tests (e.g. your 1c hypothesis). But given the other possibilities you mention, there may be alternate approaches required to decide between them.

I'll re-enable the WebView browser tests then.
Project Member

Comment 19 by bugdroid1@chromium.org, Nov 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0921d355baacbff53a1dd4bd88ee48fa5f51bb61

commit 0921d355baacbff53a1dd4bd88ee48fa5f51bb61
Author: wjmaclean <wjmaclean@chromium.org>
Date: Thu Nov 03 19:10:51 2016

Revert of Disable (temporarily) WebView Browsertests on Mac build bots. (patchset #1 id:1 of https://codereview.chromium.org/2433943003/ )

Reason for revert:
This patch was only meant to be temporary. Time to re-enable the tests.

Original issue's description:
> Disable (temporarily) WebView Browsertests on Mac build bots.
>
> In order to investigate the causes of the windows-server crashes on the
> Mac bots, this CL temporarily disables WebView browsertests on Mac. Once
> their impact is better understood, they will be re-enabled.
>
> This CL is related to https://codereview.chromium.org/2414813002 which
> just disabled the WebUI WebView browser tests.
>
> BUG= 653353 
>
> Committed: https://crrev.com/0cfb5638a524e3bd0ce25d8bc5ab4c7187b3ab5e
> Cr-Commit-Position: refs/heads/master@{#426254}

TBR=thestig@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 653353 

Review-Url: https://codereview.chromium.org/2468423005
Cr-Commit-Position: refs/heads/master@{#429663}

[modify] https://crrev.com/0921d355baacbff53a1dd4bd88ee48fa5f51bb61/chrome/browser/apps/guest_view/web_view_browsertest.cc

Labels: -Pri-2 Pri-1
Owner: smut@chromium.org
Status: Assigned (was: Untriaged)
Assigning to smut to help investigate from the infra side. 

Comment 21 by s...@google.com, Nov 10 2016

Cc: -smut@chromium.org
Owner: ----
Status: Untriaged (was: Assigned)
I mostly know about iOS, I have no idea what this is about.
Cc: dpranke@chromium.org mar...@chromium.org
Status: Available (was: Untriaged)
@maruel, @dpranke I'm looking for some direction here. What do you think is the best way to debug this one? Who is an appropriate owner?

Some context: I'm working on burning down infra failures on bots and need to delegate the work on the bugs that are coming out of that. 
@katthomas - this isn't an infra issue IMO (at least, not directly). Someone on the mac dev side should be the primary owner for this, and I don't think this should be classified as an infra failure until we have a strong reason to think it is one.

Given that we mostly run things through swarming these days, it's not clear to me that we can easily run *anything* on macs at scale that takes more than a few seconds to run that isn't swarmed.

It also seems kinda unlikely to me that swarming is the issue per se here.

I'd start by reenabling just a few tests and disabling parallel execution (but leaving swarming on), and seeing if one can repro the issue.

But make a mac dev (erikchen/shrike/wjmaclean) do it, unless you have nothing better to do or really want to dig into this yourself.
Owner: erikc...@chromium.org
@dpranke - Thanks!

@maruel - Why did this get marked purple? I think we may have discussed test runner crashes should be red. This bug is old, so maybe that has been fixed by now? 
Owner: shrike@chromium.org
Status: Assigned (was: Available)
This bug is getting punted around a lot. I don't have the time to look into it. Over to shrike@ for more triage.
Taking as an example https://chromium-swarm.appspot.com/user/task/327bb526753bc210
Swarming is doing what it can:
- The bot starts the task
- As I understand it, the task kills the GPU driver
- The bot survives the OS breakdown
- The bot updates the task results with failure code -15
- The bot fails to upload outputs, either because there were none, which would be surprising since Pawel changed that, or because run_isolated got killed while trying to do so (fairly possible).

The recipe decides to mark it as an infrastructure failure because there's no JSON, but I agree with Dirk that this should be a test failure. The recipe doesn't have much signal to know that, though it could guess by looking at the first few bytes of stdout.
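The stdout heuristic suggested here might look roughly like the following. This is a sketch, not the actual recipe code: `looks_like_test_started` and the prefix length are hypothetical, and the only assumption about the log is that a gtest-based runner prints `[ RUN      ]` when a test begins.

```python
# Marker printed by gtest when an individual test starts running.
GTEST_RUN_MARKER = "[ RUN      ]"


def looks_like_test_started(stdout: str, prefix_chars: int = 4096) -> bool:
    """Guess whether the test binary got far enough to start running tests.

    Only the first prefix_chars characters are inspected, so the check stays
    cheap even for very large logs. The default of 4096 is an arbitrary
    illustrative value.
    """
    return GTEST_RUN_MARKER in stdout[:prefix_chars]
```

If the marker is present and no JSON was produced, the step could plausibly be treated as a test failure rather than an infrastructure failure.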
If this wasn't a swarmed test (i.e., if it was a different kind of step like a local gtest or a script test), you would be able to tell that the process died and that we didn't get valid test results as a result, and I think the step would end up getting marked as red in that case.

@maruel, is there a reason that swarming doesn't do more-or-less the same thing?

Specifically, if you look at the log message from the collect/results step:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.mac%2Fmac_chromium_rel_ng%2F334508%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch__on_Mac-10.9%2F0%2Flogs%2Fsome_shards_did_not_complete%3A_3%2F0

You get:

Missing results from the following shard(s): 3
It can happen in following cases:
  * Test failed to start (missing *.dll/*.so dependency for example)
  * Test crashed or hung
  * Task expired because there are not enough bots available and are all used
  * Swarming service experiences problems
Please examine logs to figure out what happened.

It seems like swarming should be able to tell the difference between the first two (which I'd consider to be a test failure) and the last two (infra failures), simply by knowing that a process was launched and it exited (and there's either no JSON or an incomplete JSON file). Would that work? 

I can see how there might be cases where a SIGTERM would indicate that there's something wrong with the machine and infra/troopers should look first, but I don't know how common that is?
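The distinction proposed above could be expressed as a small decision function. This is only a sketch of the idea under stated assumptions: `ShardResult` and its fields are hypothetical and are not part of the swarming or recipe API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShardResult:
    """Hypothetical summary of one shard; not a real swarming data type."""
    process_started: bool      # did the bot launch the test binary?
    exit_code: Optional[int]   # None if no exit was ever reported
    has_complete_json: bool    # was a complete results JSON produced?


def classify(shard: ShardResult) -> str:
    if shard.has_complete_json:
        # Results exist; pass or fail is decided from the JSON itself.
        return "OK"
    if shard.process_started and shard.exit_code is not None:
        # Launched and exited without usable results: crashed or was killed.
        return "TEST_FAILURE"
    # Never launched or never reported an exit: expired task or service trouble.
    return "INFRA_FAILURE"
```

An exit code of -15 with no JSON, as in the task discussed earlier in the thread, would land in TEST_FAILURE under this policy, which is exactly the ambiguous SIGTERM case raised above.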
> If this wasn't a swarmed test (i.e., if it was a different kind of step like a local gtest or a script test), you would be able to tell that the process died and that we didn't get valid test results as a result, and I think the step would end up getting marked as red in that case.

No, the buildbot slave would die (because it doesn't try to delay SIGTERM as much as it can like the swarming bot does) and the whole build would abruptly stop.


> It seems like swarming should be able to tell the difference between the first two (which I'd consider to be a test failure) and the last two (infra failures), simply by knowing that a process was launched and it exited (and there's either no JSON or an incomplete JSON file). Would that work? 

It is a recipe change. Swarming doesn't care.
Owner: erikc...@chromium.org
Hi erikchen@ - you seem like you might be the best person to debug this on the Mac side. I was poking around and ran across this discussion of error 1000:

https://github.com/TooTallNate/NodObjC/issues/21

They are able to reproduce the problem by creating a window without first creating an NSApplication object. I'm wondering if there might be a race condition that allows this to happen in the tests.

As I indicated in c#25, I don't have time to look into this. You're welcome to assign it to me, but it isn't going to get action any time soon.
> In comment #29, maruel@ wrote:
>> In comment #28, dpranke@ wrote:
>> If this wasn't a swarmed test (i.e., if it was a different kind of step like a local 
>> gtest or a script test), you would be able to tell that the process died and that 
>> we didn't get valid test results as a result, and I think the step would end up 
>> getting marked as red in that case.

> No, the buildbot slave would die (because it doesn't try to delay SIGTERM as much 
> as it can like the swarming bot does) and the whole build would abruptly stop.

You lost me. If browser_tests exits with a SIGTERM, why would that be any different from any other subprocess exit? 

Or are you saying that *every* process got killed with a SIGTERM? If so, I can see how that'd be bad, yes :).

>> It seems like swarming should be able to tell the difference between the first 
>> two (which I'd consider to be a test failure) and the last two (infra failures), 
>> simply by knowing that a process was launched and it exited (and there's either 
>> no JSON or an incomplete JSON file). Would that work? 
>
> It is a recipe change. Swarming doesn't care.

Okay. Who owns the swarming recipe_module? (though, if every process got nuked and we can't tell the two cases apart, I'm not sure it matters?).

Comment 33 by maruel@google.com, Nov 16 2016

My suspicion is that *every* process gets killed with SIGTERM.
Owner: shrike@chromium.org
What are our next steps here?

It sounds like there are two things:

1) Infra related, this failure should have been marked as red. What are the changes we need to make to do that and who should own it?
2) Test related, fixing the underlying issue. @erikchen has indicated he doesn't have time. @shrike, do you know of a good alternative? Assigning to you for now. 

Comment 35 by sdy@chromium.org, Nov 29 2016

 Issue 652409  also brings down the window server, FWIW.
Saving the actual swarming task ids as the archived data can be used to reproduce the task:
https://chromium-swarm.appspot.com/task?id=32d42fbe7a78ec10
https://chromium-swarm.appspot.com/task?id=32d44712b6711e10

See the help on these pages to learn how to reproduce the task locally.
Cc: rsesek@chromium.org
maruel@ - thanks for the info in #37. Re: the instructions on those pages: for "Download inputs files into directory foo", what are the input files, and for "Run this task locally", what is the string at the very end of the command specifying?

While it's true that there's mention of problems with creating windows which suggests a problem with the window server, there are other errors that occur before that, specifically:

[1201/094108.326477:ERROR:kill_posix.cc(84)] Unable to terminate process group 2738: No such process

and

[2738:50447:1201/094107.549351:WARNING:mac_util.mm(222)] Failed to set backup exclusion for file '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/.org.chromium.Chromium.Jd6PoW/d1GvCr1/Default/History': Error Domain=NSOSStatusErrorDomain Code=-50 "The operation couldn’t be completed. (OSStatus error -50.)" (paramErr: error in user parameter list) (-50)

The first error happens here:

bool KillProcessGroup(ProcessHandle process_group_id) {
 bool result = kill(-1 * process_group_id, SIGKILL) == 0;
...
}

KillProcessGroup() is called from test_launcher.cc and is preceded by the following comment:

      // On POSIX, in case the test does not exit cleanly, either due to a crash
      // or due to it timing out, we need to clean up any child processes that
      // it might have created. On Windows, child processes are automatically
      // cleaned up using JobObjects.

So something has gone wrong and the test is trying to kill off lingering child processes. However, man 2 kill says that the pid parameter to kill() can be > 0, == 0, or == -1. The code above passes in -1 * pid, and I'm guessing that's why it's failing. rsesek@, it looks like you wrote this bit of code - what are your thoughts?

(It seems unlikely this failure would cause the follow-on problems, but I want to fix this and then move on to the next issue.)
I didn't author that code originally -- I think the revision you're looking at is just a file move/reorganization.

The use of kill(-PID) is because that's how killpg() is implemented:

https://opensource.apple.com/source/Libc/Libc-1158.20.4/compat-43/FreeBSD/killpg.c.auto.html
https://github.com/lattera/glibc/blob/59ba27a/sysdeps/posix/killpg.c

Why the code doesn't use killpg() is a mystery to me, though.
> I didn't author that code originally -- I think the revision you're looking at is just a file move/reorganization.

OK.

> The use of kill(-PID) is because that's how killpg() is implemented:

The man page for killpg() doesn't seem to discuss negating the process group id?

> The man page for killpg() doesn't seem to discuss negating the process group id?

It's an implementation detail. killpg is implemented in terms of kill(-PID).
I see.

So this code should really be changed to killpg(pgid), which behind the scenes will call kill(-pgid)?

Seems like cleaning things up is the right thing, but it may not solve the problem of this call failing.

> So this code should really be changed to killpg(pgid), which behind the scenes will call kill(-pgid)?

Yes, I think so. Maybe killpg didn't exist in some Libc, so it may be worth digging into the history to figure out where it was added. I agree that changing it won't solve the problem though.
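For readers following the kill(-PID) versus killpg() exchange, the behavior can be reproduced outside Chromium. The sketch below uses Python rather than the C++ in kill_posix.cc, runs on POSIX systems only, and uses hypothetical helper names (`kill_process_group`, `demo_kill_twice`). It signals a process group once while it exists and once after it is gone; the second attempt fails with ESRCH, the same "No such process" condition reported in the log line quoted above.

```python
import os
import signal
import subprocess
import sys


def kill_process_group(pgid: int) -> bool:
    """Send SIGKILL to every process in group pgid; return True on success.

    os.killpg(pgid, sig) wraps killpg(), which the libc sources linked above
    implement as kill(-pgid, sig).
    """
    try:
        os.killpg(pgid, signal.SIGKILL)
        return True
    except ProcessLookupError:  # errno ESRCH: "No such process"
        return False


def demo_kill_twice():
    # start_new_session=True makes the child a group leader, so pgid == pid.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    pgid = child.pid
    first = kill_process_group(pgid)   # group exists, so the signal is delivered
    child.wait()                       # reap the child; the group disappears
    second = kill_process_group(pgid)  # group is gone, so this reports failure
    return first, second
```

This is consistent with the conclusion in the thread: switching to killpg() would be a cleanup, but the call would still fail whenever the group has already exited.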
Re comment #c38: yes, that's what I meant.
maruel@ - would you clarify? I was asking questions about the workings of the commands on the pages in c#37.
I don't understand, run the command in monospace and it will do what the title just above says.
Feel free to ping me over IM for follow up question, it'll be more efficient than over a bug.
Labels: Infra-Failures
shrike@ Are you still owning this bug?  If not, we need to find a new owner, as this is a P1 and I'm concerned with it languishing for this long.  As maruel@ said in comment 47, please ping him if you have questions.
Cc: sdy@chromium.org
Owner: ----
Status: Available (was: Assigned)
Unfortunately I do not have the cycles to look into this.

erikchen@, are you able to look at this given that infra feels it is p1? sdy@ might also be a good person, but I don't know how strapped he is for time (sdy@ - if you do look at it, you should try to pick erikchen@'s brain to get his ideas about what's going on).

When I last talked to him about it, I believe erikchen@ told me he thinks it's related to the level of swarming, so a workaround might be to reduce the amount of swarming.

Comment 51 by sdy@chromium.org, Dec 16 2016

Do we have system logs from the failures? I'm curious if any messages were logged to the console, or crashes were generated for WindowServer or other system processes.

I don't have spare cycles but can divert some if necessary… right now, I'm just interested in seeing what info is available.
sdy: You can see my previous work on this here: https://bugs.chromium.org/p/chromium/issues/detail?id=515627
Note that this issue may magically get better or worse when we rev the try-bots to 10.12.
https://bugs.chromium.org/p/chromium/issues/detail?id=659213#c7

The current rate at which this failure happens is relatively low: I see one failure in the last ~150 builds:
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng?numbuilds=200
[it looks like numbuilds=200 doesn't actually show 200 builds].

In my link in c#52, the failures were occurring frequently, and also taking the machine offline [it would reboot, and then proceed to not auto-login, thereby not triggering the service_manager launchagent]. These failures don't actually appear to take the machine offline, which makes them much less severe.


Here is a query to get some recent builds w/ this failure:

https://goto.google.com/window-server-query	

This gets all the infra failed builds on master.chromium.mac and master.tryserver.chromium.mac in the last three days where the failure is in the browser_tests (with patch) step. Usually these are due to the window server failure. To double-check I usually search the logs for "CGSWindow" and that helps me find the failures.
Components: -Infra>Client>Chrome UI
Labels: -Pri-1 -Infra-Failures Pri-2
Owner: shrike@chromium.org
Status: Assigned (was: Available)
I've filed bug 681684 to figure out if we can reclassify this kind of failure as a test failure (to mark the step red, not purple). 

As discussed above, we should view this as a problem with the test suite, not the infrastructure (until proven otherwise), so I'm clearing the Infra-Failures label from this bug (and applied it to the other one).

@shrike - what component(s) should this be tracked under? Feel free to clear the owner once you've updated things and this is properly triaged.
Haven't seen these errors before, but I don't think they add much information:

2017-01-16 18:55:53.777 browser_tests[48622:303] HIToolbox: received notification of WindowServer event port death.
2017-01-16 18:55:53.777 browser_tests[48622:303] port matched the WindowServer port created in BindCGSToRunLoop
No session list!
[48631:771:0116/185553.950898:WARNING:mac_util.mm(153)] Couldn't get the main display's color space, using generic
2017-01-16 18:55:54.815 browser_tests[48622:303] An uncaught exception was raised
2017-01-16 18:55:54.815 browser_tests[48622:303] Error (268435459) creating CGSWindow on line 263
2017-01-16 18:55:54.815 browser_tests[48622:303] (
	0   CoreFoundation                      0x00007fff856b925c __exceptionPreprocess + 172
	1   browser_tests                       0x000000010e019d0c _ZN6chromeL25ObjcExceptionPreprocessorEP11objc_object + 860
	2   libobjc.A.dylib                     0x00007fff89098e75 objc_exception_throw + 43
	3   CoreFoundation                      0x00007fff856b910c +[NSException raise:format:] + 204
	4   AppKit                              0x00007fff83ee9e95 _NSCreateWindowWithOpaqueShape2 + 1403
	5   AppKit                              0x00007fff83ee8a21 -[NSWindow _commonAwake] + 3720
	6   AppKit                              0x00007fff83dc4400 -[NSWindow _commonInitFrame:styleMask:backing:defer:] + 882
	7   AppKit                              0x00007fff83dc3882 -[NSWindow _initContent:styleMask:backing:defer:contentView:] + 1054
	8   AppKit                              0x00007fff83dc3458 -[NSWindow initWithContentRect:styleMask:backing:defer:] + 45
	9   browser_tests                       0x000000010f9e008a -[UnderlayOpenGLHostingWindow initWithContentRect:styleMask:backing:defer:] + 218
	10  browser_tests                       0x0000000111b91053 -[ChromeEventProcessingWindow initWithContentRect:styleMask:backing:defer:] + 83
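The log search described earlier in the thread (looking for "CGSWindow") can be sketched as a small script that also pulls out the numeric error code, since the thread shows the signature with two different codes. The function name `find_cgs_window_errors` is hypothetical; the sample lines are copied from this thread.

```python
import re

# Matches lines such as "... Error (1000) creating CGSWindow".
CGS_WINDOW_ERROR = re.compile(r"Error \((\d+)\) creating CGSWindow")


def find_cgs_window_errors(log_text: str) -> list:
    """Return the numeric error code from every CGSWindow failure line."""
    return [int(m.group(1)) for m in CGS_WINDOW_ERROR.finditer(log_text)]


sample = (
    "2016-10-03 17:06:01.530 browser_tests[67041:507] "
    "Error (1000) creating CGSWindow\n"
    "2017-01-16 18:55:54.815 browser_tests[48622:303] "
    "Error (268435459) creating CGSWindow on line 263\n"
)
for code in find_cgs_window_errors(sample):
    # Printing the hex form makes large codes easier to compare.
    print(code, hex(code))
```

The second code is 0x10000003 in hex, which matches the value of MACH_SEND_INVALID_DEST in <mach/message.h>. Whether AppKit is surfacing that Mach error here is an assumption, not something the thread confirms, although it would fit the "WindowServer event port death" message above.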

Cc: -andyb...@chromium.org
Is this still happening or still need an action item?
Cc: -shrike@chromium.org
Owner: ----
Status: Available (was: Assigned)
Status: Archived (was: Available)
Archiving old bugs that haven't been actively assigned in over 180 days.

If you feel this issue should still be addressed, feel free to reopen it or to file a new issue. Thanks!