New issue
Advanced search Search tips

Issue 826408 link

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug



Sign in to add a comment

net_unittests timing out on chromium.linux/Linux Tests (dbg)(1)(32)]

Project Member Reported by sheriff-...@appspot.gserviceaccount.com, Mar 27 2018

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of rogerm@chromium.org

net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32)

Builders failed on: 
- Linux Tests (dbg)(1)(32): 
  https://build.chromium.org/p/chromium.linux/builders/Linux%20Tests%20%28dbg%29%281%29%2832%29


 
The test is taking much longer than it used to. It's timing out on swarming.

https://screenshot.googleplex.com/TJrrXi3c0pG

It looks like it occasionally takes forever, and usually takes less than a minute?? Not sure why.

I looked at the test, and it doesn't seem to be taking that long? I wrote a python script to add up all the times in https://chromium-swarm.appspot.com/task?id=3c81d7ad687ca610&refresh=10&show_raw=1, and I got 14 minutes. 
Components: Tests
Labels: OS-Linux
Owner: sergeybe...@chromium.org
Status: Assigned (was: Available)
Looking at martiniss@'s analysis, I'm inclined to think that the problem is in the test itself. Most likely, it fails to recover after a failed test and just hangs there until the timeout. This would explain that most "good" runs take very little time, unless the test hangs.

If this is true, there is very little a trooper can do...
To drill down a bit into this: http://plx/script/#a=qo%7Ci=google%253A%253Ascript_3c._2bcb50_d970_4bf5_ba87_179fef9e77bf

There is a clear jump in the runtime for this test around 5pm PDT on Mar 15. The drop right after that is the weekend, then it picks up again. Something regressed at that time.
Screen Shot 2018-03-28 at 3.02.55 PM.png
39.6 KB View Download
There is one CL that landed roughly around that time (Mar 16 5:30am - 12h after the first peak started) and that touched net/BUILD.gn: https://crrev.com/c/947844

Cc: sergeybe...@chromium.org
Labels: -Sheriff-Chromium
Owner: juliatut...@chromium.org
Because of r546886 reassigning.
Labels: Type-Bug
Still seeing this today on the same bug. e.g. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49607 and 49608 and 49609.

Though in 2 of the 3, the failure at the end was:

[20811:20816:0423/112809.372140:16457192576:FATAL:test_launcher.cc(523)] Check failed: ReadFileToString(output_filename, &output_file_contents). 

Project Member

Comment 9 by bugdroid1@chromium.org, Apr 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e0ed6f1d52bd888c99af8fa6038e627c1a300dd4

commit e0ed6f1d52bd888c99af8fa6038e627c1a300dd4
Author: Lei Zhang <thestig@chromium.org>
Date: Tue Apr 24 17:24:56 2018

Add more logging in DoLaunchChildTestProcess().

When it calls ReadFileToString() and fails, log the name of the file it
is trying to read.

BUG=826408

Change-Id: I546977b985c71cbd631820e7d7950b7f7449faba
Reviewed-on: https://chromium-review.googlesource.com/1024743
Reviewed-by: Julia Tuttle <juliatuttle@chromium.org>
Commit-Queue: Lei Zhang <thestig@chromium.org>
Cr-Commit-Position: refs/heads/master@{#553188}
[modify] https://crrev.com/e0ed6f1d52bd888c99af8fa6038e627c1a300dd4/base/test/launcher/test_launcher.cc

Cc: -sergeybe...@chromium.org
Now we get:

[9735:9743:0424/140831.979081:7761044591:FATAL:test_launcher.cc(523)] Check failed: ReadFileToString(output_filename, &output_file_contents). /b/s/w/it8adCxZ/.org.chromium.Chromium.UJ0NAg

But it's not that helpful.
Labels: -Pri-2 Pri-1
What is the current status of this?

This still happens 25% of times on chromium.linux/Linux Tests (dbg)(1)(32)
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29?limit=200

Sample builds:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49768
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49788
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49804
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49891
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49931

The failing log is
[3696:3702:0508/231208.271283:36881135548:FATAL:test_launcher.cc(523)] Check failed: ReadFileToString(output_filename, &output_file_contents). /b/s/w/ity26N4y/.org.chromium.Chromium.cF6OFD

or

SUCCESS: all tests passed.
[15402:15402:0430/222034.524406:24773201955:ERROR:test_launcher.cc(1215)] Failed to save test launcher output summary.

Comment 13 by st...@chromium.org, May 18 2018

Anyone picking up this bug?

net_unittests seems super flaky on "chromium.linux/Linux Tests (dbg)(1)(32)" recently as vitaliii@ pointed out.
Owner: ----
Status: Available (was: Assigned)
Labels: -Infra-Troopers Sheriff-Chromium
Sheriffs, can you take a look at this? I don't know if troopers can help much here, it looks like it's mostly a test failing.

Feel free to add the infra troopers label if you need trooper help.

Comment 16 by yutak@chromium.org, May 24 2018

Components: Internals>Network

Comment 17 by gab@chromium.org, May 24 2018

Labels: -Restrict-View-Google
Owner: gab@chromium.org
Status: Started (was: Available)
Summary: base's test_launcher is flaky (because mkstemp isn't thread-safe?) [was: net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32)] (was: net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32))
I'll take a stab as today's sheriff.

IMO the problem is that mkstemp (used by CreateAndOpenFdForTemporaryFileInDir()) probably isn't thread-safe. i.e. perhaps it can return a filename which is "unique and non-existing" at the time it's returned but since multiple DoLaunchChildTestProcess() run in parallel, two calls can perhaps sometimes be assigned the same filename (and then flakiness ensues).

Comment 18 by gab@chromium.org, May 24 2018

mkstemp is documented as MT-safe, but that just means it won't result in races/garbage if called from two threads at once, not that it's atomic or transactional in anyway :

"""
Being MT-Safe does not imply a function is atomic, nor that it uses any of the memory synchronization mechanisms POSIX exposes to users. It is even possible that calling MT-Safe functions in sequence does not yield an MT-Safe combination. For example, having a thread call two MT-Safe functions one right after the other does not guarantee behavior equivalent to atomic execution of a combination of both functions, since concurrent calls in other threads may interfere in a destructive way.
"""

Comment 19 by gab@chromium.org, May 24 2018

Hmmm actually that's probably the wrong route... docs says the file is opened with O_EXCL :

O_EXCL
If O_CREAT and O_EXCL are set, open() shall fail if the file exists. The check for the existence of the file and the creation of the file if it does not exist shall be atomic with respect to other threads executing open() naming the same filename in the same directory with O_EXCL and O_CREAT set. If O_EXCL and O_CREAT are set, and path names a symbolic link, open() shall fail and set errno to [EEXIST], regardless of the contents of the symbolic link. If O_EXCL is set and O_CREAT is not set, the result is undefined.
Project Member

Comment 20 by bugdroid1@chromium.org, May 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/20a5f39c907e802ca206b60c28334026cd181875

commit 20a5f39c907e802ca206b60c28334026cd181875
Author: Gabriel Charette <gab@chromium.org>
Date: Thu May 24 22:49:47 2018

[FileUtil] Test thread-safety of temp file APIs.

Ruling out https://crbug.com/826408#c17 as the source of the issue.

R=thestig@chromium.org

Bug: 826408
Change-Id: I00396084d674fa2f700eb917300280b6eed1684d
Reviewed-on: https://chromium-review.googlesource.com/1072030
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Lei Zhang <thestig@chromium.org>
Cr-Commit-Position: refs/heads/master@{#561666}
[modify] https://crrev.com/20a5f39c907e802ca206b60c28334026cd181875/base/files/file_util_unittest.cc

Project Member

Comment 21 by bugdroid1@chromium.org, May 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0420ef0e74e20d6eedaa550d4a12902f82837c81

commit 0420ef0e74e20d6eedaa550d4a12902f82837c81
Author: Michael Moss <mmoss@chromium.org>
Date: Fri May 25 05:11:57 2018

Revert "[FileUtil] Test thread-safety of temp file APIs."

This reverts commit 20a5f39c907e802ca206b60c28334026cd181875.

Reason for revert: Breaking official linux builds.

Original change's description:
> [FileUtil] Test thread-safety of temp file APIs.
> 
> Ruling out https://crbug.com/826408#c17 as the source of the issue.
> 
> R=​thestig@chromium.org
> 
> Bug: 826408
> Change-Id: I00396084d674fa2f700eb917300280b6eed1684d
> Reviewed-on: https://chromium-review.googlesource.com/1072030
> Commit-Queue: Gabriel Charette <gab@chromium.org>
> Reviewed-by: Lei Zhang <thestig@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#561666}

TBR=gab@chromium.org,thestig@chromium.org

Change-Id: Icc838da2c247563b94c5e6508c8c268df68f3ce9
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 826408, 846563
Reviewed-on: https://chromium-review.googlesource.com/1072966
Reviewed-by: Michael Moss <mmoss@chromium.org>
Commit-Queue: Michael Moss <mmoss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#561772}
[modify] https://crrev.com/0420ef0e74e20d6eedaa550d4a12902f82837c81/base/files/file_util_unittest.cc

Project Member

Comment 22 by bugdroid1@chromium.org, May 25 2018

Labels: merge-merged-3440
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b9eb1437134413de27145c8d5fc339c903e517b5

commit b9eb1437134413de27145c8d5fc339c903e517b5
Author: Michael Moss <mmoss@chromium.org>
Date: Fri May 25 05:36:22 2018

Revert "[FileUtil] Test thread-safety of temp file APIs."

This reverts commit 20a5f39c907e802ca206b60c28334026cd181875.

Reason for revert: Breaking official linux builds.

Original change's description:
> [FileUtil] Test thread-safety of temp file APIs.
> 
> Ruling out https://crbug.com/826408#c17 as the source of the issue.
> 
> R=​thestig@chromium.org
> 
> Bug: 826408
> Change-Id: I00396084d674fa2f700eb917300280b6eed1684d
> Reviewed-on: https://chromium-review.googlesource.com/1072030
> Commit-Queue: Gabriel Charette <gab@chromium.org>
> Reviewed-by: Lei Zhang <thestig@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#561666}

TBR=gab@chromium.org,thestig@chromium.org

Change-Id: Icc838da2c247563b94c5e6508c8c268df68f3ce9
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 826408, 846563
Reviewed-on: https://chromium-review.googlesource.com/1072966
Reviewed-by: Michael Moss <mmoss@chromium.org>
Commit-Queue: Michael Moss <mmoss@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#561772}(cherry picked from commit 0420ef0e74e20d6eedaa550d4a12902f82837c81)
Reviewed-on: https://chromium-review.googlesource.com/1073087
Reviewed-by: Krishna Govind <govind@chromium.org>
Cr-Commit-Position: refs/branch-heads/3440@{#3}
Cr-Branched-From: 010ddcfda246975d194964ccf20038ebbdec6084-refs/heads/master@{#561733}
[modify] https://crrev.com/b9eb1437134413de27145c8d5fc339c903e517b5/base/files/file_util_unittest.cc

Project Member

Comment 23 by bugdroid1@chromium.org, May 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/301470a833a72768c5277141f7835a055ca4ee1d

commit 301470a833a72768c5277141f7835a055ca4ee1d
Author: Gabriel Charette <gab@chromium.org>
Date: Fri May 25 14:33:01 2018

Reland "[FileUtil] Test thread-safety of temp file APIs."

This is a reland of 20a5f39c907e802ca206b60c28334026cd181875

Patch set 1 is the original change. Diff fixes the following compile
error on official builders :

../../base/files/file_util_unittest.cc:3674:5: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
    ::write(::fileno(output_file.get()), content.c_str(), content.length());

Original change's description:
> [FileUtil] Test thread-safety of temp file APIs.
>
> Ruling out https://crbug.com/826408#c17 as the source of the issue.
>
> R=thestig@chromium.org
>
> Bug: 826408
> Change-Id: I00396084d674fa2f700eb917300280b6eed1684d
> Reviewed-on: https://chromium-review.googlesource.com/1072030
> Commit-Queue: Gabriel Charette <gab@chromium.org>
> Reviewed-by: Lei Zhang <thestig@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#561666}

TBR=thestig@chromium.org

Bug: 826408
Change-Id: Ifa37f12e708357fe7e6ce7a2dd688eab1e0a2324
Reviewed-on: https://chromium-review.googlesource.com/1073090
Reviewed-by: Gabriel Charette <gab@chromium.org>
Commit-Queue: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#561867}
[modify] https://crrev.com/301470a833a72768c5277141f7835a055ca4ee1d/base/files/file_util_unittest.cc

Project Member

Comment 24 by bugdroid1@chromium.org, May 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fa655227e17a1fed586b3d293af07de142345346

commit fa655227e17a1fed586b3d293af07de142345346
Author: Wez <wez@chromium.org>
Date: Fri May 25 22:59:18 2018

Filter out FileUtilMultiThreadedTest.MultiThreadedTempFiles on Fuchsia.

This test runs too slowly on Fuchsia bots, due to emulation overheads.

Bug: 826408, 844416
Change-Id: Ieada8fc8ca56583a34d8c24f1eecbcc4f6b45852
Reviewed-on: https://chromium-review.googlesource.com/1073731
Reviewed-by: Scott Graham <scottmg@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#562049}
[modify] https://crrev.com/fa655227e17a1fed586b3d293af07de142345346/testing/buildbot/filters/fuchsia.base_unittests.filter

Comment 25 by gab@chromium.org, May 29 2018

Cc: gab@chromium.org
Owner: ----
Status: Available (was: Started)
New test stuck and proves my theory in #17-19 wrong.

I will not have time to look closer, back on sheriffs queue.
I am the sheriff today.

gab: Thanks for working on this. Were you able to repro the issue locally?

Comment 27 by gab@chromium.org, May 29 2018

No, never repro'ed, just speculation.
Components: -Internals>Network
Think this isn't a net issue?  Feel free to add back, if I'm wrong.

Comment 29 by gab@chromium.org, Jun 5 2018

Summary: base's test_launcher is flaky [was: net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32)] (was: base's test_launcher is flaky (because mkstemp isn't thread-safe?) [was: net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32)])

Comment 30 by gab@chromium.org, Jun 5 2018

Components: Internals>Network
Labels: Test-Flaky
Summary: base's test_launcher is flaky when tests timeout [was: net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32)] (was: base's test_launcher is flaky [was: net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32)])
The latest logs reveal interesting information:

[ERROR:kill_posix.cc(83)] Unable to terminate process group 3235: No such process (3)
[FATAL:test_launcher.cc(526)] Check failed: ReadFileToString(output_filename, &output_file_contents). /b/s/w/ity1CRja/.org.chromium.Chromium.V97vpm

All failing tests were in the process of timing out after 1 hour.

So basically the problem is that net_unittests is timing out. But the symptom is that base's test_launcher dies because ReadFileToString() has a hard time reading the redirected stdout of a process that was killed midway (or perhaps failed being killed; according to the log from kill_posix.cc).

Comment 31 by gab@chromium.org, Jun 5 2018

Owner: gab@chromium.org
Status: Started (was: Available)
Previous successful run took 52 minutes to run all 37767 tests [1] while a failing run failed @ [35166/37767] after 1 hour.

Seems the problem is the test suite taking too long, individual tests seem fast (ms), needs sharding? https://chromium-review.googlesource.com/c/chromium/src/+/1087288

I'll fix the check so it goes through with reporting the test suite timeout instead of failing on CHECK(ReadFileToString(output_file...)) in this situation: https://chromium-review.googlesource.com/#/c/chromium/src/+/1087176 

[1]  https://chromium-swarm.appspot.com/task?id=3de780094a686810&refresh=10&show_raw=1
Project Member

Comment 32 by bugdroid1@chromium.org, Jun 5 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2fc11c2fc69e51c977b1d42fccaecceb82014c11

commit 2fc11c2fc69e51c977b1d42fccaecceb82014c11
Author: Gabriel Charette <gab@chromium.org>
Date: Tue Jun 05 19:09:55 2018

[TestLauncher] Allow failure reading test output file after its process was killed.

This check is incorrectly identified as the source of failure for a
testsuite timing out : https://crbug.com/826408#c30.

R=thakis@chromium.org

Bug: 826408
Change-Id: I30d6496d161a15e9e5fdfa8c683667e604c76cc4
Reviewed-on: https://chromium-review.googlesource.com/1087176
Reviewed-by: Nico Weber <thakis@chromium.org>
Commit-Queue: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#564606}
[modify] https://crrev.com/2fc11c2fc69e51c977b1d42fccaecceb82014c11/base/test/launcher/test_launcher.cc

Labels: -Sheriff-Chromium

Comment 34 by gab@chromium.org, Jun 12 2018

Owner: ----
Status: Untriaged (was: Started)
Summary: net_unittests timing out on chromium.linux/Linux Tests (dbg)(1)(32)] (was: base's test_launcher is flaky when tests timeout [was: net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32)])
I did my best at the test_launcher.cc level but the problem is really that net_unittests went from 14 minutes to 52 minutes and now times out a lot. This is where the fix needs to occur.

More sharding is always an option but Dirk rightfully pointed out [1] that we should first figure out why the test suite became slower.

[1] https://chromium-review.googlesource.com/c/chromium/src/+/1087288#message-438a3b3f208925c9d142a3e916641f88c73550d6
Cc: rch@chromium.org
Components: Internals>Network>QUIC
There are a huge number of parameterized QUIC tests, which may be related to this.
Labels: Network-Triaged

Comment 37 by rch@google.com, Jun 16 2018

I've landed a number of QUIC test performance improvements in crbug/806109. This has dropped net_unittests runtime by about 40%. This is great, but QUIC tests are more than 50% of net_unittests runtime and it makes we wonder if we should consider breaking them  out into their own target?

Dirk, what do you think?

Comment 38 by rch@chromium.org, Jun 16 2018

Cc: dpranke@chromium.org
(Argh. Wrong account).

What's the best way to get a visualization of how long net_unittests is taking, similar to the plx script in #4? I can run the script but I only get data from april (And I don't get a graph, just the raw data)

@dpranke: please see comment 37
The data should still be in plx; I will take a look at the scripts/queries when I get the chance.
Looks like gab@ incorrectly removed the RVG label (which is needed because of the links to PLX scripts). Re-adding.

The old script was looking at completed_step times, which is of dubious accuracy. You should be looking at the sum of all of the swarming tasks times for a given step on a build instead.

Here's a different, hopefully better query:

https://plx.corp.google.com/scripts2/script_5b._2971d8_0000_222a_ada9_94eb2c144adc

The time looks significantly shorter, down to ~36 min these days. That is, obviously, still far longer than it was months ago.


net_unittests_on_linux_tests_dbg_1_32_2018_03_20_2018_06_20.png
39.5 KB View Download

Comment 41 Deleted

Here's a more useful graph, over the past 180 days.
net_unittests_linux_tests_dbg_1_32_2018_01_02_2018_06_20.png
34.4 KB View Download

Comment 43 by rch@chromium.org, Jun 20 2018

Owner: rch@chromium.org
Status: Started (was: Untriaged)
Ah, great! Ok, so that confirms my hope that the recent changes I made:

https://chromium-review.googlesource.com/c/chromium/src/+/1098030
https://chromium-review.googlesource.com/c/chromium/src/+/1103393

have move the runtime back into the reasonable range. As such, I think I'll close this bug.

But before I do, I'd love to know if there's some metric we should use to consider if we should split net_unittests up into two targets? Like, if it's regularly running more than 45 minutes, would that be a good time?
In theory we target total cycle times of 5-15 minutes per test suite, but you can adjust that by splitting the test suite across machines by adjusting the 'shards' parameter in the test configuration.

There's no hard and fast rule for splitting up the test suites themselves, it's more what's convenient for devs. If you're likely to make changes that only affect one particular subset of //net, then splitting the suite up can keep you from running unrelated tests, and that can be a win. If you'd want to run only a subset of the tests in some configurations but not others, splitting them into test suites helps with that, as well. 

Otherwise, I wouldn't worry about it too much. webkit_layout_tests and browser_tests are both 8x bigger (~4 hours each in debug).

Comment 45 by yigu@chromium.org, Jun 29 2018

Cc: osc...@opera.com
 Issue 859100  has been merged into this issue.

Comment 46 by yigu@chromium.org, Jun 29 2018

Cc: -osc...@opera.com

Sign in to add a comment