Issue 882852

Starred by 1 user

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: 2018-10-11
OS: ----
Pri: 1
Type: ----



//content/test:content_nocompile_tests_run_nocompile flaky on linux-jumbo-rel

Project Member Reported by eseckler@chromium.org, Sep 11

Issue description

Hi troopers :) Jumbo builds seem to be hiccuping on //content/test:content_nocompile_tests_run_nocompile in a flaky manner:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/linux-jumbo-rel

[2170/2276] ACTION //content/test:content_nocompile_tests_run_nocompile(//build/toolchain/linux:clang_x64)
FAILED: gen/content/test/browser_task_traits_unittest_nc.cc
python ../../tools/nocompile_driver.py 4 ../../content/public/browser/browser_task_traits_unittest.nc gen/content/test/browser_task_traits_unittest_nc.cc -- -nostdinc++ -isystem../../buildtools/third_party/libc++/trunk/include -isystem../../buildtools/third_party/libc++abi/trunk/include -std=c++14 -Wall -Werror -Wfatal-errors -Wthread-safety -I../../ -Igen --sysroot ../../build/linux/debian_sid_amd64-sysroot

Is there a way to obtain the output of the failing step from the bot? I'm having trouble reproducing this locally, but will keep trying.

Thanks!
 
Cc: dpranke@chromium.org
Unfortunately, there's not much more info I can provide. Looking through the logs, it's the following script invocation that fails:

python ../../tools/nocompile_driver.py 4 ../../content/public/browser/browser_task_traits_unittest.nc gen/content/test/browser_task_traits_unittest_nc.cc -- -nostdinc++ -isystem../../buildtools/third_party/libc++/trunk/include -isystem../../buildtools/third_party/libc++abi/trunk/include -std=c++14 -Wall -Werror -Wfatal-errors -Wthread-safety -I../../ -Igen --sysroot ../../build/linux/debian_sid_amd64-sysroot

And it doesn't appear to print anything to stdout/stderr. All we seem to know is that it has a non-zero return code. Though it looks like the script prints log info to a file at "gen/content/test/browser_task_traits_unittest_nc.cc.log". However, that file's since been overwritten by subsequent successful builds.

If you'd like, I can help give you ssh access to the machine running the builds so you can poke around yourself. Might be easier to repro manually on the bot than to do so locally.
Cc: wychen@chromium.org ajwong@chromium.org
Components: Build
Summary: //content/test:content_nocompile_tests_run_nocompile flaky on linux-jumbo-rel (was: Requesting help with troubleshooting a jumbo build issue)
What is weird is that we've seen some problems with this particular step on other bots as well (but fixed those before linux-jumbo-rel started flaking) - and there, the command did print some more useful error messages (see  bug 882234 ).

Looking at nocompile_driver.py, it seems like it only returns a non-zero exit code when a subprocess command fails (apart from exceptions thrown, which should be logged too). And stderr of the failing command should be printed: https://cs.chromium.org/chromium/src/tools/nocompile_driver.py?l=472

I can only assume that stderr is empty for some reason.

+ajwong and +wychen who may know more about the script's internals.
Labels: -Infra-Troopers
Removing trooper label. Feel free to reapply if there's something we can do.
Is this only flaky on jumbo builds? If so, we might want to look into how jumbo builds are different.

If we can't reproduce this locally, we could print resultlog before sys.exit(non-zero) for debugging.

Is it possible that some useful information is in stdout and we throw it away? We could keep stdout and also print that in https://cs.chromium.org/chromium/src/tools/nocompile_driver.py?l=469.

  _, stderr = test['proc'].communicate()
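A minimal sketch of the idea above, keeping both streams instead of discarding stdout. This is not the actual nocompile_driver.py code: the helper name `run_and_collect` and the stand-in child command are hypothetical, chosen only to illustrate `communicate()` returning both streams.

```python
import subprocess
import sys


def run_and_collect(cmd):
    """Run cmd and return (returncode, stdout, stderr) as text.

    Hypothetical helper for illustration; not part of nocompile_driver.py.
    """
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    # communicate() returns both streams; keep stdout rather than
    # throwing it away with "_".
    stdout, stderr = proc.communicate()
    return proc.returncode, stdout, stderr


# Stand-in for a failing compiler invocation: writes to both streams and
# exits with a non-zero code.
child_code = ("import sys; print('to stdout'); "
              "sys.stderr.write('to stderr'); sys.exit(3)")
returncode, stdout, stderr = run_and_collect([sys.executable, "-c", child_code])

if returncode != 0:
    # Print everything we have so a bot log is enough to diagnose the failure.
    print("command failed with return code %d" % returncode)
    print("stdout: %s" % stdout.strip())
    print("stderr: %s" % stderr.strip())
```

With this shape, a failure that writes only to stdout would still show up in the build log.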


These CLs should make nocompile_driver.py easier to diagnose, so they can stay even after this particular issue is fixed.

NoCompile test is an underused feature in Chromium, and you are the first user outside of //base, so I guess there are some rough corners. Hopefully after these fixes, it is more usable.
Project Member

Comment 5 by bugdroid1@chromium.org, Sep 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c1f97a8cd41c4c9a5e82592f2e13732cfc6bb141

commit c1f97a8cd41c4c9a5e82592f2e13732cfc6bb141
Author: Eric Seckler <eseckler@chromium.org>
Date: Thu Sep 13 00:35:37 2018

tools: Add more diagnostic output to nocompile_driver.py

Bug: 882852
Change-Id: Ie88e6fceb726cd69963eaed5eef90f71f55b38e4
Reviewed-on: https://chromium-review.googlesource.com/1222313
Reviewed-by: Nico Weber <thakis@chromium.org>
Reviewed-by: Wei-Yin Chen (陳威尹) <wychen@chromium.org>
Commit-Queue: Eric Seckler <eseckler@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590876}
[modify] https://crrev.com/c1f97a8cd41c4c9a5e82592f2e13732cfc6bb141/tools/nocompile_driver.py

I'll keep an eye out for more failures on the bot, but it has been running fine for the last two days without any changes to the test.
Components: -Infra Infra>Client>Chrome
Owner: eseckler@chromium.org
Status: Assigned (was: Untriaged)
Assigning to remove from our triaging queue; mark as Untriaged to get the infra trooper to take a look at this bug.
Here's another recent failure: https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8934298114243077232/+/steps/compile/0/stdout

From the log:

[4196/4286] ACTION //content/test:content_nocompile_tests_run_nocompile(//build/toolchain/linux:clang_x64)
 FAILED: gen/content/test/browser_task_traits_unittest_nc.cc
 python ../../tools/nocompile_driver.py 4 ../../content/public/browser/browser_task_traits_unittest.nc gen/content/test/browser_task_traits_unittest_nc.cc -- -nostdinc++ -isystem../../buildtools/third_party/libc++/trunk/include -isystem../../buildtools/third_party/libc++abi/trunk/include -std=c++14 -Wall -Werror -Wfatal-errors -Wthread-safety -I../../ -Igen --sysroot ../../build/linux/debian_sid_amd64-sysroot
 No-compile driver failure with return_code -15. Result log:
 TEST(NoCompileBrowserTaskTraitsUnittest): Started 1537986901.797163, Ended 1537987301.102194, Total 399.305031s, Extract 1.120920s, Compile 120.119999s, Process 278.064112s
According to the Python documentation, return code -15 means that the process was terminated because it received signal 15 (SIGTERM). The nocompile driver seems to send this when it thinks the compilation has timed out, see [1]. This timeout is currently 60 seconds [2]. Maybe we can increase it; I'm sending a patch [3].

[1] https://cs.chromium.org/chromium/src/tools/nocompile_driver.py?l=386
[2] https://cs.chromium.org/chromium/src/tools/nocompile_driver.py?l=79
[3] https://chromium-review.googlesource.com/c/chromium/src/+/1248601
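The return-code convention described above can be checked with a small POSIX-only sketch. The sleeping child is a stand-in for a slow compile and is an assumption of this example, not something the driver runs.

```python
import signal
import subprocess
import sys

# Stand-in for a long-running compile: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# On POSIX, terminate() sends SIGTERM, which is what a timeout handler
# would typically do to a process it considers stuck.
child.terminate()
child.wait()

return_code = child.returncode
if return_code < 0:
    # A negative return code -N means the child was killed by signal N.
    print("child was terminated by signal %d" % -return_code)
```

On Linux this prints signal 15, matching the "-15" seen in the bot log.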
Project Member

Comment 11 by bugdroid1@chromium.org, Oct 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e056c759fd28d9ea4cea81ab77be309d23e94b12

commit e056c759fd28d9ea4cea81ab77be309d23e94b12
Author: Eric Seckler <eseckler@chromium.org>
Date: Wed Oct 03 14:11:46 2018

tools: Increase timeout of nocompile tests due to test flakiness.

The linux jumbo bot is flaking on content nocompile tests due to the
nocompilation tests timing out. This patch increases the timeout to
twice what it was before.

Bug: 882852
Change-Id: Ib2bc0023acd8d677ea77eb2769a5f83da39ed0da
Reviewed-on: https://chromium-review.googlesource.com/c/1248601
Reviewed-by: Nico Weber <thakis@chromium.org>
Commit-Queue: Eric Seckler <eseckler@chromium.org>
Cr-Commit-Position: refs/heads/master@{#596199}
[modify] https://crrev.com/e056c759fd28d9ea4cea81ab77be309d23e94b12/tools/nocompile_driver.py

NextAction: 2018-10-11
Cc: thakis@chromium.org
Hmm, this still seems to time out occasionally even with 120 sec timeout, e.g.:

https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8933076481390421392/+/steps/compile/0/stdout

Should we increase the timeout further, or are we better off just disabling the tests on jumbo builds?
The NextAction date has arrived: 2018-10-11
What is the jumbo build?

When I first wrote this driver, we didn't enable it because it turned out the error-reporting path for gcc was way way slower than the success path. Worse, the variance in time for completion was higher too.

If the jumbo build is creating huge translation units, I wonder if that's causing us to perform slowly?
Jumbo: https://chromium.googlesource.com/chromium/src/+/HEAD/docs/jumbo.md

In short, yes, it does create large translation units, which is probably why the nocompile tests take longer there. I think we'd probably be better off simply disabling the tests on the jumbo bots. I'm not familiar enough with the buildbot configs to know how to best accomplish that though.
Status: WontFix (was: Assigned)
Looks like this hasn't been flaking anymore recently. Feel free to reopen if this reoccurs.
Status: Assigned (was: WontFix)
Reopening, since this reoccurred: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/linux-jumbo-rel/10198
Project Member

Comment 21 by bugdroid1@chromium.org, Dec 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/be6feb0aca9d5e0025f3898c01d75ba9db11f1fa

commit be6feb0aca9d5e0025f3898c01d75ba9db11f1fa
Author: Eric Seckler <eseckler@chromium.org>
Date: Wed Dec 19 10:56:40 2018

content: Disable nocompile tests on jumbo builds.

They simply take too long to execute on the FYI bots, and they
are covered by the regular builders already.

Bug: 882852
Change-Id: Ibfd67a54fb243e35b49a2b1a877db7bd13b73517
Reviewed-on: https://chromium-review.googlesource.com/c/1378181
Commit-Queue: Eric Seckler <eseckler@chromium.org>
Reviewed-by: Wei-Yin Chen (陳威尹) <wychen@chromium.org>
Cr-Commit-Position: refs/heads/master@{#617794}
[modify] https://crrev.com/be6feb0aca9d5e0025f3898c01d75ba9db11f1fa/content/test/BUILD.gn

Status: Fixed (was: Assigned)
Cc: brat...@opera.com
Labels: -Restrict-View-Google
The reason this happens is that nocompile tests run one compile for each fail assertion from what I understand, and then jumbo probably packs a bunch of them together.

A real fix is probably to tell jumbo to not jumbo together .nc files. bratell, is that possible to do?
It shouldn't be grouping .nc files. Only .cpp/.cc/.mm and .c files are (or should be) grouped. I looked at the code and I don't see any chance for any other extensions to get into the jumbo files.

Is this goma (i.e. 8 files per jumbo chunk) or not (i.e. 50 files per jumbo chunk)? I'm asking in case there is still a timeout issue hidden somewhere.
Oh, it's the normal builder. Not sure why I thought it was some internal builder. So they should be goma, but still using large jumbo chunks since that is needed to catch the problems for non-goma users.

120 seconds should be enough, but it depends on the actual hardware and other factors. And jumbo will bring it much closer to the limit. Chunks of 50 files take, on average, 5 times longer to compile than 1 file, but I'm sure there are outliers where it's worse.


Status: Started (was: Fixed)
The .nc files get converted to .cc files which are built as a normal test() target: https://cs.chromium.org/chromium/src/build/nocompile.gni?q=nocom&sq=package:chromium&g=0&l=111

Is there a way to not jumbo those?
Adding |never_build_jumbo=true| to a target block disables jumbo for it. It's typically used when jumbo comes from a template, or to disable it for NaCl, but this seems like another valid reason.

I'm still not sure how they became jumbo though. A |test| target will not expand to jumbo compilation. (I tried that once and it was not possible even if we wanted to.) Is there something else, which does support jumbo compilation, that extracts the sources from the |test| target?

Comment 28 by eseckler@chromium.org, Yesterday (46 hours ago)

Owner: brat...@opera.com
Status: Assigned (was: Started)
Nocompile tests run clang manually via a driver script (see the gni file thakis@ links to in #26). I'm not sure how jumbo builds affect that. But I was pretty certain that the bit that's timing out is the compile initiated by the driver script (as opposed to the compile of the generated "result" cc files that are later packaged into a normal test() target).

I'm not sure I'm the best person to look into this further; I'm not an expert in jumbo builds or nocompile tests :)

Comment 29 by brat...@opera.com, Yesterday (39 hours ago)

Cc: eseckler@chromium.org
I've looked at the nocompile tests and there is nothing jumbo there. Just a script that spawns a couple (two) clang processes and waits for them to complete.

It could be that parallel jumbo processes slow down the host but the tests run in hundreds of milliseconds so that would mean that the computer is completely locked up for two minutes. Unlikely.

Another possibility is that the processes fill up the proc stdout/stderr buffers. The code does not read from any process until proc.poll() reports that the process is done.

I've not tested this, but I think the buffers are around 64 KB, so that also seems unlikely. Why would clang suddenly throw out 64 KB of data? Still, it's possible.
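The buffer-filling hypothesis can be illustrated with a short sketch. It assumes a pipe buffer well under 1 MB (the Linux default is 64 KB) and uses a stand-in child that writes 1 MB to stdout; it is not the driver's code, only a demonstration of polling without reading.

```python
import subprocess
import sys
import time

# Stand-in child: writes 1 MB to stdout, far more than a typical pipe buffer.
child_code = "import sys; sys.stdout.write('x' * (1 << 20))"
proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)

# Poll without reading, as described above. Once the pipe buffer is full the
# child blocks in write(), so poll() keeps returning None.
deadline = time.time() + 2.0
while proc.poll() is None and time.time() < deadline:
    time.sleep(0.1)
still_running = proc.poll() is None
print("child still running after polling: %s" % still_running)

# communicate() drains the pipes while waiting, which unblocks the child.
stdout, _ = proc.communicate()
print("read %d bytes, return code %d" % (len(stdout), proc.returncode))
```

If something like this were happening in the driver, the child would look hung until the timeout fired and it got SIGTERM, even though the compile itself had nothing left to do but write its output.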

Comment 30 by brat...@opera.com, Today (19 hours ago)

Status: Started (was: Assigned)