
Issue 916775

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




WASM builds on Mac fails all the time

Project Member Reported by serg...@chromium.org, Dec 19

Issue description

Owner: serg...@chromium.org
Status: Started (was: Untriaged)
My current theory is that running subprocess.check_call with commands that have very long output is what's causing it. I'll try to remove running those commands on Mac to see if this has an effect.
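If the theory above holds, one way to test it without dropping the commands is to stop letting the child inherit stdout and instead forward its output in small chunks. The sketch below is only an illustration under that assumption; it is not what the linked PRs do. It assumes Python 3, and `run_with_chunked_output` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_chunked_output(cmd, out=None, chunk_size=4096):
    """Run cmd and forward its combined stdout/stderr to `out` in small chunks.

    Hypothetical stand-in for subprocess.check_call: like check_call, it raises
    CalledProcessError on a non-zero exit code. Returns the number of bytes
    forwarded.
    """
    out = out if out is not None else sys.stdout.buffer
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    total = 0
    while True:
        chunk = proc.stdout.read(chunk_size)
        if not chunk:  # empty read means the child closed its end of the pipe
            break
        out.write(chunk)
        out.flush()
        total += len(chunk)
    proc.stdout.close()
    returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
    return total
```

Reading through a pipe keeps each write to the parent's stdout bounded by `chunk_size`, which makes it easier to see whether the size of the output is what triggers the failure.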
CL: https://github.com/WebAssembly/waterfall/pull/441
CL: https://github.com/WebAssembly/waterfall/pull/442 (this is needed to prevent timeouts on builds where we actually manage to print the list of binaries, e.g. see https://ci.chromium.org/p/wasm/builders/luci.wasm.ci/mac/4036)
Summary: WASM builds on Mac fails all the time (was: WASM builds on Mac fails while trying to print to stdout)
There are also other exceptions that cause a build to fail, e.g. in https://luci-milo.appspot.com/p/wasm/builders/luci.wasm.ci/mac/4037, the step "Archive binaries" fails with "LookupError: unknown encoding: string-escape".
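For context on that message: the `string-escape` codec exists only in Python 2, and Python 3 raises exactly this `LookupError` when asked for it. Whether the step was in fact run under Python 3 is an assumption; the thread does not say. A minimal sketch of the difference follows, where `escape_text` and `string_escape_available` are hypothetical helpers and `unicode_escape` is the closest Python 3 codec for text.

```python
def string_escape_available():
    """Return True if the Python 2-only 'string-escape' codec can be looked up."""
    try:
        "".encode("string-escape")
    except LookupError:
        # Python 3: "LookupError: unknown encoding: string-escape"
        return False
    return True


def escape_text(text):
    """Escape newlines, tabs and non-ASCII characters in text (Python 3 spelling)."""
    return text.encode("unicode_escape").decode("ascii")
```

The two codecs are not byte-for-byte identical, so any replacement would need to be checked against what the archive step actually expects.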
To summarize, based on an analysis of the last 10 completed builds, there are currently several known reasons why builds fail:

 - IOError while running the "Archive binaries" step
   - seen in 3 builds while printing the list of files to be archived
   - attempt to fix: https://github.com/WebAssembly/waterfall/pull/441
 - Infra failure due to timeout
   - based on Linux bot timings, we probably time out shortly before the tests finish running, so
     increasing the timeout from 3h to 4h should be enough
   - fix: https://github.com/WebAssembly/waterfall/pull/442
 - IOError while running "Link LLVM Torture (lld, O0)" step
   - seen in https://ci.chromium.org/p/wasm/builders/luci.wasm.ci/mac/4029
 - LookupError
   - seen in https://luci-milo.appspot.com/p/wasm/builders/luci.wasm.ci/mac/40

The last two need further investigation. With the first two fixed, we should fix ~80% of the failing builds.
Some background: Number 3 is https://bugs.chromium.org/p/chromium/issues/detail?id=829034
If you look at recent commits (https://github.com/WebAssembly/waterfall/commits/master) you'll see several attempts to fix it, none of which really had any effect except for https://github.com/WebAssembly/waterfall/commit/ef2b08f5755592ea7360ebc67f6b8944eff7c2cd#diff-b894d242ede74f4676033d37fc3f0497R249 (and I think line 249 is the only thing that had any effect, because the fcntl part showed that os.O_NONBLOCK was never set).

After I landed that, number 3 just moved over to https://logs.chromium.org/logs/wasm/buildbucket/cr-buildbucket.appspot.com/8926695009160194720/+/steps/Execute_emscripten_testsuite__emwasm_/0/stdout where we get the same IOError in a different place.
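For readers unfamiliar with the fcntl check mentioned above, this is roughly how one can ask whether `O_NONBLOCK` is set on a file descriptor, and clear it if so. It is a generic POSIX sketch, not the code from the linked commit; `is_nonblocking` and `clear_nonblocking` are hypothetical names.

```python
import fcntl
import os


def is_nonblocking(fd):
    """Return True if O_NONBLOCK is set in the status flags of fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)


def clear_nonblocking(fd):
    """Clear O_NONBLOCK on fd, leaving the other status flags unchanged."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
```

Calling `is_nonblocking(sys.stdout.fileno())` at the start of a step is one way to log the state. Note that the flag belongs to the open file description, so another process sharing the same stdout could change it later.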
Wait, I didn't notice the timeout issue. It's odd because the log of the last step also includes the same IOError (EAGAIN) errors as the other, but maybe that's not affecting anything.

Regarding check_call: it's supposed to hook up the child's stdout to the stdout of the parent process. I can't think of any reason that should behave any differently from what are likely dozens of other instances of check_call elsewhere in Chrome builds.
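An EAGAIN on write normally means the descriptor is in non-blocking mode and its buffer is full. If that is what is happening here (an assumption, since the fcntl check above suggests the flag was not set by the script itself), a writer can tolerate it by waiting until the descriptor is writable and retrying. A minimal POSIX sketch, with `write_all` as a hypothetical helper:

```python
import errno
import os
import select


def write_all(fd, data):
    """Write all of data (bytes) to fd, waiting and retrying on EAGAIN."""
    view = memoryview(data)
    while view:
        try:
            written = os.write(fd, view)
        except OSError as e:
            if e.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                raise
            # The buffer is full: block until fd is writable, then retry.
            select.select([], [fd], [])
            continue
        # os.write may accept only part of the data; keep the remainder.
        view = view[written:]
```

This only works around the symptom. It does not explain why stdout would be non-blocking on the Mac bot in the first place.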
Looks like 4 hours was not enough to fix the timeout: https://ci.chromium.org/p/wasm/builders/luci.wasm.ci/mac/4042. I have re-triggered this build manually (via led) with a 10h timeout to see how much time we actually need: https://ci.chromium.org/swarming/task/41e3f48dddf55d10.

Thanks for referencing an older issue. I forgot about it. Maybe I will dupe this one against it after we've resolved the timeout issue.
The 10h build failed due to the IOError problem. Triggered another one: https://ci.chromium.org/swarming/task/41e4dc1acb952210.
That one is timing out with 7h on the emscripten test suite phase, which should definitely not happen. For reference that phase takes 39 minutes on the Linux bot, and I'd expect it to be similar for mac, modulo differences in hardware.
Owner: ----
Status: Available (was: Started)
Unfortunately I'll have to drop this for now due to low priority.
Status: Untriaged (was: Available)
Available, but no owner or component? Please find a component, as no one will ever find this without one.
