
Issue 750594

Issue metadata

Status: WontFix
Owner: ----
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows , Mac
Pri: 3
Type: Bug-Regression



mac*_blink_rel bots are missing osmesa on first run after midnight PST

Project Member Reported by raphael....@intel.com, Jul 31 2017

Issue description

For example:
- https://luci-milo.appspot.com/buildbot/tryserver.blink/mac10.10_blink_rel/3252
- https://luci-milo.appspot.com/buildbot/tryserver.blink/mac10.12_blink_rel/1122
- https://luci-milo.appspot.com/buildbot/tryserver.blink/mac10.11_blink_rel/3351

In all cases, the test results are invalid because the runner is crashing:
crash log for gpu (pid <unknown>):
STDOUT: <empty>
STDERR: [48456:34819:0731/003512.746204:50198333299512:ERROR:devtools_http_handler.cc(786)] 
STDERR: DevTools listening on 127.0.0.1:50758
STDERR: 
STDERR: [48459:775:0731/003512.823315:50198410411268:ERROR:gl_initializer_mac.cc(90)] osmesa.so not found at /b/c/b/mac_layout/src/out/Release/osmesa.so
STDERR: [48459:775:0731/003512.834340:50198421428528:ERROR:gl_initializer_mac.cc(90)] osmesa.so not found at /b/c/b/mac_layout/src/out/Release/osmesa.so
STDERR: [48459:775:0731/003512.838383:50198425470971:ERROR:gpu_child_thread.cc(253)] Exiting GPU process due to errors during initialization
STDERR: [48456:33027:0731/003512.847939:50198435031013:ERROR:browser_gpu_channel_host_factory.cc(103)] Failed to launch GPU process.

This doesn't happen every time (some WPT import jobs pass on all the bots).
Another set of failures: https://chromium-review.googlesource.com/c/594856/ (patchset 1)
Labels: -Type-Bug -Pri-3 Pri-2 Type-Bug-Regression
The logs for that set of failures look pretty similar (e.g. https://storage.googleapis.com/chromium-layout-test-archives/mac10_9_blink_rel/3315/layout-test-results/test-expectations.html). Interestingly, the crashing tests seem to be http tests, and some other tests failed in other ways but didn't crash.
Cc: vmp...@chromium.org
Labels: -Pri-2 Pri-1
Owner: qyears...@chromium.org
Status: Assigned (was: Available)
Multiple people have been affected by this, and besides blocking the wpt importer, this also blocks people from using webkit-patch rebaseline-cl to rebaseline tests. Will look at this tomorrow.
Bug 751421 is probably relevant, although in the examples above, I didn't see the analyze step failure.
Bug 751421 was likely caused by a commit that landed yesterday/today, whereas the problems reported here have been happening for at least a few days.
Good point; now that one is fixed. Next, we can find some more recent examples to confirm whether this is still happening and look through the logs more.
Latest case I saw when checking just now was: https://build.chromium.org/p/tryserver.blink/builders/mac10.12_blink_rel/builds/1157, from about 16 hours ago.
Status: WontFix (was: Assigned)
Haven't seen this again, probably was a transient issue.
Status: Available (was: WontFix)
It happened again today: https://chromium-review.googlesource.com/c/604878
Status: Assigned (was: Available)
Links to the set of failed jobs for that CL:

https://build.chromium.org/p/tryserver.blink/builders/mac10.9_blink_rel/builds/3464
https://build.chromium.org/p/tryserver.blink/builders/mac10.10_blink_rel/builds/3423
https://build.chromium.org/p/tryserver.blink/builders/mac10.11_blink_rel/builds/3546
https://build.chromium.org/p/tryserver.blink/builders/mac10.11_retina_blink_rel/builds/3476
https://build.chromium.org/p/tryserver.blink/builders/mac10.12_blink_rel/builds/1272

Quick notes about the crash message:

 - The error message suggests that osmesa.so is not found in the build directory
 - I haven't found anything about osmesa in the compile step yet, though (perhaps it should be there?)
 - osmesa stands for "Off-screen Mesa"
 - osmesa seems to be listed as a dependency of the target webkit_layout_tests, which is a dependency of blink_tests in https://cs.chromium.org/chromium/src/BUILD.gn?l=891
 - The code where the message is printed is https://cs.chromium.org/chromium/src/ui/gl/init/gl_initializer_mac.cc?l=71
This sort of reminds me of bug 739282, since I've only seen this happen once a day, though the symptoms are quite different from that one.
Summary: mac*_blink_rel bots are missing osmesa on first run after midnight PST (was: mac*_blink_rel bots sometimes crash when running layout tests)
Actually, that's a great point, since the timing of the failed jobs appears to be just after midnight California time.

e.g.:
https://build.chromium.org/p/tryserver.blink/builders/mac10.10_blink_rel/builds/3380
Tue Aug 8 00:11:46 2017

https://build.chromium.org/p/tryserver.blink/builders/mac10.10_blink_rel/builds/3423
Fri Aug 11 00:11:46 2017

Other notes:
 - On the waterfall, WebKit Mac Builder compiles and includes osmesa.so in the build package (listed in the "package build" step).
 - Then the testers on the waterfall unpack that in the "extract build" step, so osmesa.so is present.
 - The try bots are different since they "compile (with patch)" for each job.

I just looked at a couple successful try jobs, and they do have several lines in the compile step related to osmesa, including:

[4386/13966] CXX obj/ui/gl/gl/gl_bindings_autogen_osmesa.o
[4420/13966] CXX obj/ui/gl/gl/gl_context_osmesa.o
[4437/13966] CXX obj/ui/gl/gl/gl_surface_osmesa.o

In every non-crashy run I've looked at, these lines are present; but in every crashy run these lines are not present.
Actually, the CXX obj/ui/gl/gl/gl_context_osmesa.o lines occurred in the compile step for https://build.chromium.org/p/tryserver.blink/builders/mac10.10_blink_rel/builds/3380, but interestingly, in build 3381 (the next build), we also get a line that says:

[5400/21662] SOLINK_MODULE osmesa.so
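
The manual log comparison above could be automated. The following is only a sketch, assuming the compile step's stdout is available as text and that ninja progress lines have the `[N/M] TOOL output` shape quoted in this thread; the function names are hypothetical and not part of any Chromium tool.

```python
import re

# Matches ninja progress lines such as "[5400/21662] SOLINK_MODULE osmesa.so".
NINJA_LINE = re.compile(r"^\[(\d+)/(\d+)\]\s+(\S+)\s+(.+)$")


def osmesa_steps(log_text):
    """Return (tool, output) pairs for ninja steps whose output mentions osmesa."""
    steps = []
    for line in log_text.splitlines():
        match = NINJA_LINE.match(line.strip())
        if match and "osmesa" in match.group(4):
            steps.append((match.group(3), match.group(4)))
    return steps


def relinked_osmesa(log_text):
    """True if the log contains a link step whose output is osmesa.so."""
    return any(
        tool == "SOLINK_MODULE" and output == "osmesa.so"
        for tool, output in osmesa_steps(log_text)
    )


# Lines copied from the comments above, plus one unrelated (hypothetical) line.
SAMPLE_LOG = """\
[4386/13966] CXX obj/ui/gl/gl/gl_bindings_autogen_osmesa.o
[4420/13966] CXX obj/ui/gl/gl/gl_context_osmesa.o
[4437/13966] CXX obj/ui/gl/gl/gl_surface_osmesa.o
[5000/13966] CXX obj/base/base/logging.o
"""

if __name__ == "__main__":
    for tool, output in osmesa_steps(SAMPLE_LOG):
        print(tool, output)
    print("relinked osmesa.so:", relinked_osmesa(SAMPLE_LOG))
```

A log with object-file steps but no `SOLINK_MODULE osmesa.so` step, as in the sample, is the pattern described for build 3380.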

Also, it's worth noting that the cleanup_disk configuration (cleanup_disk was identified as being related to issue 739282) contains:

	{Path: `/b/c/b/*/src/out/Release*/*`, MaxAge: twoDays},

https://chrome-internal.googlesource.com/infra/infra_internal/+/master/go/src/infra_internal/tools/cleanup_disk/cmd/cleanup_disk/main.go#32

So, if this is happening after cleanup_disk is run, then we expect that this should happen again at:

Sun Aug 13 just after midnight
Tue Aug 15 just after midnight
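
The prediction above is simple date arithmetic. As a sketch only: it assumes the failure recurs exactly one `MaxAge` (two days) after the previous one, counting from the Fri Aug 11 failure; the function name is hypothetical, and how cleanup_disk actually measures file age is not confirmed here.

```python
from datetime import date, timedelta


def expected_recurrences(last_failure, max_age_days=2, count=2):
    """Return the next `count` dates on which the failure would be expected.

    Assumes the build output is recreated on the day of the failure and then
    removed again once it is `max_age_days` old.
    """
    return [
        last_failure + timedelta(days=max_age_days * k)
        for k in range(1, count + 1)
    ]


if __name__ == "__main__":
    # Fri Aug 11 2017 is the most recent failure mentioned in this thread.
    for day in expected_recurrences(date(2017, 8, 11)):
        print(day.isoformat())
```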
Not sure if this is related as I'm not familiar with gn, but when I was poking around during the rotation, I found the following difference in the JSON output of the "analyze" (mb analyze) step:

* The crashed runs "found dependency" with empty compile targets (e.g. https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.blink%2Fmac10.12_blink_rel%2F1272%2F%2B%2Frecipes%2Fsteps%2Fanalyze%2F0%2Flogs%2Fjson.output%2F0)
* The successful runs "found dependency (all)" with some compile targets (e.g. https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.blink%2Fmac10.12_blink_rel%2F1274%2F%2B%2Frecipes%2Fsteps%2Fanalyze%2F0%2Flogs%2Fjson.output%2F0)

I think we probably need to compile the :blink_tests target? It depends on :webkit_layout_tests, which depends on osmesa.
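
A check for the suspicious analyze result described above might look like the sketch below. The JSON field names `status` and `compile_targets` are assumptions about the shape of the analyze step's json.output, and the sample documents are made up for illustration rather than copied from the linked logs.

```python
import json


def found_dependency_without_targets(analyze_output_text):
    """True if analyze reported a dependency but listed nothing to compile."""
    data = json.loads(analyze_output_text)
    # Assumed field names; compare the status case-insensitively.
    status = str(data.get("status", "")).strip().lower()
    targets = data.get("compile_targets") or []
    return status == "found dependency" and not targets


# Hypothetical outputs shaped like the two cases described above.
CRASHED_RUN = '{"status": "Found dependency", "compile_targets": []}'
SUCCESSFUL_RUN = (
    '{"status": "Found dependency (all)", "compile_targets": ["blink_tests"]}'
)

if __name__ == "__main__":
    print(found_dependency_without_targets(CRASHED_RUN))
    print(found_dependency_without_targets(SUCCESSFUL_RUN))
```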
If that's the case, perhaps osmesa.so is still present when `gn analyze` is run, which leads to it not being rebuilt, but by the time the compile step finishes (or at any point before webkit_tests starts) cleanup_disk has run and removed it.
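
If that hypothesis holds, a cheap guard would be to verify that the runtime files are still in the build directory immediately before the tests start. This is a minimal sketch, not an existing recipe step: the helper name and the list of required files are assumptions (osmesa.so comes from the crash log in this issue).

```python
import os
import tempfile


def missing_runtime_files(build_dir, required=("osmesa.so",)):
    """Return the names in `required` that do not exist in `build_dir`."""
    return [
        name
        for name in required
        if not os.path.exists(os.path.join(build_dir, name))
    ]


if __name__ == "__main__":
    # Demonstrate with a temporary directory standing in for out/Release.
    with tempfile.TemporaryDirectory() as build_dir:
        print("before:", missing_runtime_files(build_dir))
        open(os.path.join(build_dir, "osmesa.so"), "w").close()
        print("after:", missing_runtime_files(build_dir))
```

Failing early with a clear "file missing" message would at least distinguish this infrastructure problem from a genuine test crash.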
Project Member Comment 17 by bugdroid1@chromium.org, Aug 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4eb6c118161fb1e8890d66ce4d5a0dc00fa81b03

commit 4eb6c118161fb1e8890d66ce4d5a0dc00fa81b03
Author: Raphael Kubo da Costa <raphael.kubo.da.costa@intel.com>
Date: Sun Aug 13 17:12:58 2017

Remove wrong expectations from TestExpectations.

These were added incorrectly in
https://chromium-review.googlesource.com/c/612807 due to crashes in the
Mac bots as well as bad results from android_blink_rel (discussed at
https://groups.google.com/a/chromium.org/d/msg/ecosystem-infra/QzH1LlvP5ao/lEnKNDdxAAAJ).

TBR=foolip@chromium.org,qyearsley@chromium.org

Bug:  750594 
Change-Id: Iba524d544e4b29d824add743806c47e9d9cc00f4
Reviewed-on: https://chromium-review.googlesource.com/612077
Reviewed-by: Raphael Kubo da Costa (rakuco) <raphael.kubo.da.costa@intel.com>
Commit-Queue: Raphael Kubo da Costa (rakuco) <raphael.kubo.da.costa@intel.com>
Cr-Commit-Position: refs/heads/master@{#494000}
[modify] https://crrev.com/4eb6c118161fb1e8890d66ce4d5a0dc00fa81b03/third_party/WebKit/LayoutTests/TestExpectations

> I think the same thing's happened to the win7_blink_rel bot

By "same thing" I mean "files getting erased by the cleanup cron job". In this specific case, it looks like icudtl.dat was gone.
Labels: -Pri-1 OS-Windows Pri-2
Owner: ----
Status: Available (was: Assigned)
Haven't seen this recently, but it's probably not actually fixed. Marking as available since I'm not currently working on it.
Cc: robertma@chromium.org
Ecosystem infra bug triage ping: robertma, have you seen any imports failing due to this recently? Just wondering if it should still be Pri-2 (fix soon) or Pri-3 (backlog).
Labels: -Pri-2 Pri-3
Downgrading to P3 as I haven't seen it for a while (I'm also not aware of any intentional effort investigating/fixing the root cause).
Status: WontFix (was: Available)
I think this went away long ago.
