net_unittests timing out on chromium.linux/Linux Tests (dbg)(1)(32)] |
||||||||||||||||||||||||
Issue descriptionFiled by sheriff-o-matic@appspot.gserviceaccount.com on behalf of rogerm@chromium.org net_unittests failing on chromium.linux/Linux Tests (dbg)(1)(32) Builders failed on: - Linux Tests (dbg)(1)(32): https://build.chromium.org/p/chromium.linux/builders/Linux%20Tests%20%28dbg%29%281%29%2832%29
,
Mar 27 2018
https://plx.corp.google.com/script/#a=qo%7Ci=google%253A%253Ascript_95._e18ffc_9493_4bc4_bcdf_fa43e5483ca3 is the relevant query.
,
Mar 28 2018
Looking at martiniss@'s analysis, I'm inclined to think that the problem is in the test itself. Most likely, it fails to recover after a failed test and just hangs there until the timeout. This would explain that most "good" runs take very little time, unless the test hangs. If this is true, there is very little a trooper can do...
,
Mar 28 2018
To drill down a bit into this: http://plx/script/#a=qo%7Ci=google%253A%253Ascript_3c._2bcb50_d970_4bf5_ba87_179fef9e77bf There is a clear jump in the runtime for this test around 5pm PDT on Mar 15. The drop right after that is the weekend, then it picks up again. Something regressed at that time.
,
Mar 28 2018
There is one CL that landed roughly around that time (Mar 16 5:30am - 12h after the first peak started) and that touched net/BUILD.gn: https://crrev.com/c/947844
,
Mar 29 2018
This appears to be happening on win10_chromium_x64_rel_ng as well. Failed build https://ci.chromium.org/buildbot/tryserver.chromium.win/win10_chromium_x64_rel_ng/112873 and swarming task https://chromium-swarm.appspot.com/task?id=3c8c425e6ed86310&refresh=10&show_raw=1.
,
Apr 9 2018
Because of r546886 reassigning.
,
Apr 23 2018
Still seeing this today on the same bug. e.g. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49607 and 49608 and 49609. Though in 2 of the 3, the failure at the end was: [20811:20816:0423/112809.372140:16457192576:FATAL:test_launcher.cc(523)] Check failed: ReadFileToString(output_filename, &output_file_contents).
,
Apr 24 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e0ed6f1d52bd888c99af8fa6038e627c1a300dd4 commit e0ed6f1d52bd888c99af8fa6038e627c1a300dd4 Author: Lei Zhang <thestig@chromium.org> Date: Tue Apr 24 17:24:56 2018 Add more logging in DoLaunchChildTestProcess(). When it calls ReadFileToString() and fails, log the name of the file it is trying to read. BUG=826408 Change-Id: I546977b985c71cbd631820e7d7950b7f7449faba Reviewed-on: https://chromium-review.googlesource.com/1024743 Reviewed-by: Julia Tuttle <juliatuttle@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org> Cr-Commit-Position: refs/heads/master@{#553188} [modify] https://crrev.com/e0ed6f1d52bd888c99af8fa6038e627c1a300dd4/base/test/launcher/test_launcher.cc
,
Apr 24 2018
,
Apr 24 2018
Now we get: [9735:9743:0424/140831.979081:7761044591:FATAL:test_launcher.cc(523)] Check failed: ReadFileToString(output_filename, &output_file_contents). /b/s/w/it8adCxZ/.org.chromium.Chromium.UJ0NAg But it's not that helpful.
,
May 9 2018
What is the current status of this? This still happens 25% of times on chromium.linux/Linux Tests (dbg)(1)(32) https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29?limit=200 Sample builds: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49768 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49788 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49804 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49891 https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/49931 The failing log is [3696:3702:0508/231208.271283:36881135548:FATAL:test_launcher.cc(523)] Check failed: ReadFileToString(output_filename, &output_file_contents). /b/s/w/ity26N4y/.org.chromium.Chromium.cF6OFD or SUCCESS: all tests passed. [15402:15402:0430/222034.524406:24773201955:ERROR:test_launcher.cc(1215)] Failed to save test launcher output summary.
,
May 18 2018
Anyone picking up this bug? net_unittests seems super flaky on "chromium.linux/Linux Tests (dbg)(1)(32)" recently as vitaliii@ pointed out.
,
May 23 2018
,
May 23 2018
Sheriffs, can you take a look at this? I don't know if troopers can help much here, it looks like it's mostly a test failing. Feel free to add the infra troopers label if you need trooper help.
,
May 24 2018
,
May 24 2018
I'll take a stab as today's sheriff. IMO the problem is that mkstemp (used by CreateAndOpenFdForTemporaryFileInDir()) probably isn't thread-safe. i.e. perhaps it can return a filename which is "unique and non-existing" at the time it's returned but since multiple DoLaunchChildTestProcess() run in parallel, two calls can perhaps sometimes be assigned the same filename (and then flakiness ensues).
,
May 24 2018
mkstemp is documented as MT-safe, but that just means it won't result in races/garbage if called from two threads at once, not that it's atomic or transactional in anyway : """ Being MT-Safe does not imply a function is atomic, nor that it uses any of the memory synchronization mechanisms POSIX exposes to users. It is even possible that calling MT-Safe functions in sequence does not yield an MT-Safe combination. For example, having a thread call two MT-Safe functions one right after the other does not guarantee behavior equivalent to atomic execution of a combination of both functions, since concurrent calls in other threads may interfere in a destructive way. """
,
May 24 2018
Hmmm actually that's probably the wrong route... docs says the file is opened with O_EXCL : O_EXCL If O_CREAT and O_EXCL are set, open() shall fail if the file exists. The check for the existence of the file and the creation of the file if it does not exist shall be atomic with respect to other threads executing open() naming the same filename in the same directory with O_EXCL and O_CREAT set. If O_EXCL and O_CREAT are set, and path names a symbolic link, open() shall fail and set errno to [EEXIST], regardless of the contents of the symbolic link. If O_EXCL is set and O_CREAT is not set, the result is undefined.
,
May 24 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/20a5f39c907e802ca206b60c28334026cd181875 commit 20a5f39c907e802ca206b60c28334026cd181875 Author: Gabriel Charette <gab@chromium.org> Date: Thu May 24 22:49:47 2018 [FileUtil] Test thread-safety of temp file APIs. Ruling out https://crbug.com/826408#c17 as the source of the issue. R=thestig@chromium.org Bug: 826408 Change-Id: I00396084d674fa2f700eb917300280b6eed1684d Reviewed-on: https://chromium-review.googlesource.com/1072030 Commit-Queue: Gabriel Charette <gab@chromium.org> Reviewed-by: Lei Zhang <thestig@chromium.org> Cr-Commit-Position: refs/heads/master@{#561666} [modify] https://crrev.com/20a5f39c907e802ca206b60c28334026cd181875/base/files/file_util_unittest.cc
,
May 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0420ef0e74e20d6eedaa550d4a12902f82837c81 commit 0420ef0e74e20d6eedaa550d4a12902f82837c81 Author: Michael Moss <mmoss@chromium.org> Date: Fri May 25 05:11:57 2018 Revert "[FileUtil] Test thread-safety of temp file APIs." This reverts commit 20a5f39c907e802ca206b60c28334026cd181875. Reason for revert: Breaking official linux builds. Original change's description: > [FileUtil] Test thread-safety of temp file APIs. > > Ruling out https://crbug.com/826408#c17 as the source of the issue. > > R=thestig@chromium.org > > Bug: 826408 > Change-Id: I00396084d674fa2f700eb917300280b6eed1684d > Reviewed-on: https://chromium-review.googlesource.com/1072030 > Commit-Queue: Gabriel Charette <gab@chromium.org> > Reviewed-by: Lei Zhang <thestig@chromium.org> > Cr-Commit-Position: refs/heads/master@{#561666} TBR=gab@chromium.org,thestig@chromium.org Change-Id: Icc838da2c247563b94c5e6508c8c268df68f3ce9 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 826408, 846563 Reviewed-on: https://chromium-review.googlesource.com/1072966 Reviewed-by: Michael Moss <mmoss@chromium.org> Commit-Queue: Michael Moss <mmoss@chromium.org> Cr-Commit-Position: refs/heads/master@{#561772} [modify] https://crrev.com/0420ef0e74e20d6eedaa550d4a12902f82837c81/base/files/file_util_unittest.cc
,
May 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b9eb1437134413de27145c8d5fc339c903e517b5 commit b9eb1437134413de27145c8d5fc339c903e517b5 Author: Michael Moss <mmoss@chromium.org> Date: Fri May 25 05:36:22 2018 Revert "[FileUtil] Test thread-safety of temp file APIs." This reverts commit 20a5f39c907e802ca206b60c28334026cd181875. Reason for revert: Breaking official linux builds. Original change's description: > [FileUtil] Test thread-safety of temp file APIs. > > Ruling out https://crbug.com/826408#c17 as the source of the issue. > > R=thestig@chromium.org > > Bug: 826408 > Change-Id: I00396084d674fa2f700eb917300280b6eed1684d > Reviewed-on: https://chromium-review.googlesource.com/1072030 > Commit-Queue: Gabriel Charette <gab@chromium.org> > Reviewed-by: Lei Zhang <thestig@chromium.org> > Cr-Commit-Position: refs/heads/master@{#561666} TBR=gab@chromium.org,thestig@chromium.org Change-Id: Icc838da2c247563b94c5e6508c8c268df68f3ce9 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 826408, 846563 Reviewed-on: https://chromium-review.googlesource.com/1072966 Reviewed-by: Michael Moss <mmoss@chromium.org> Commit-Queue: Michael Moss <mmoss@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#561772}(cherry picked from commit 0420ef0e74e20d6eedaa550d4a12902f82837c81) Reviewed-on: https://chromium-review.googlesource.com/1073087 Reviewed-by: Krishna Govind <govind@chromium.org> Cr-Commit-Position: refs/branch-heads/3440@{#3} Cr-Branched-From: 010ddcfda246975d194964ccf20038ebbdec6084-refs/heads/master@{#561733} [modify] https://crrev.com/b9eb1437134413de27145c8d5fc339c903e517b5/base/files/file_util_unittest.cc
,
May 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/301470a833a72768c5277141f7835a055ca4ee1d commit 301470a833a72768c5277141f7835a055ca4ee1d Author: Gabriel Charette <gab@chromium.org> Date: Fri May 25 14:33:01 2018 Reland "[FileUtil] Test thread-safety of temp file APIs." This is a reland of 20a5f39c907e802ca206b60c28334026cd181875 Patch set 1 is the original change. Diff fixes the following compile error on official builders : ../../base/files/file_util_unittest.cc:3674:5: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result] ::write(::fileno(output_file.get()), content.c_str(), content.length()); Original change's description: > [FileUtil] Test thread-safety of temp file APIs. > > Ruling out https://crbug.com/826408#c17 as the source of the issue. > > R=thestig@chromium.org > > Bug: 826408 > Change-Id: I00396084d674fa2f700eb917300280b6eed1684d > Reviewed-on: https://chromium-review.googlesource.com/1072030 > Commit-Queue: Gabriel Charette <gab@chromium.org> > Reviewed-by: Lei Zhang <thestig@chromium.org> > Cr-Commit-Position: refs/heads/master@{#561666} TBR=thestig@chromium.org Bug: 826408 Change-Id: Ifa37f12e708357fe7e6ce7a2dd688eab1e0a2324 Reviewed-on: https://chromium-review.googlesource.com/1073090 Reviewed-by: Gabriel Charette <gab@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#561867} [modify] https://crrev.com/301470a833a72768c5277141f7835a055ca4ee1d/base/files/file_util_unittest.cc
,
May 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fa655227e17a1fed586b3d293af07de142345346 commit fa655227e17a1fed586b3d293af07de142345346 Author: Wez <wez@chromium.org> Date: Fri May 25 22:59:18 2018 Filter out FileUtilMultiThreadedTest.MultiThreadedTempFiles on Fuchsia. This test runs too slowly on Fuchsia bots, due to emulation overheads. Bug: 826408, 844416 Change-Id: Ieada8fc8ca56583a34d8c24f1eecbcc4f6b45852 Reviewed-on: https://chromium-review.googlesource.com/1073731 Reviewed-by: Scott Graham <scottmg@chromium.org> Commit-Queue: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#562049} [modify] https://crrev.com/fa655227e17a1fed586b3d293af07de142345346/testing/buildbot/filters/fuchsia.base_unittests.filter
,
May 29 2018
New test stuck and proves my theory in #17-19 wrong. I will not have time to look closer, back on sheriffs queue.
,
May 29 2018
I am the sheriff today. gab: Thanks for working on this. Were you able to repro the issue locally?
,
May 29 2018
No, never repro'ed, just speculation.
,
May 30 2018
Think this isn't a net issue? Feel free to add back, if I'm wrong.
,
Jun 5 2018
,
Jun 5 2018
The latest logs reveal interesting information: [ERROR:kill_posix.cc(83)] Unable to terminate process group 3235: No such process (3) [FATAL:test_launcher.cc(526)] Check failed: ReadFileToString(output_filename, &output_file_contents). /b/s/w/ity1CRja/.org.chromium.Chromium.V97vpm All failing tests were in the process of timing out after 1 hour. So basically the problem is that net_unittests is timing out. But the symptom is that base's test_launcher dies because ReadFileToString() has a hard time reading the redirected stdout of a process that was killed midway (or perhaps failed being killed; according to the log from kill_posix.cc).
,
Jun 5 2018
Previous successful run took 52 minutes to run all 37767 tests [1] while a failing run failed @ [35166/37767] after 1 hour. Seems the problem is the test suite taking too long, individual tests seem fast (ms), needs sharding? https://chromium-review.googlesource.com/c/chromium/src/+/1087288 I'll fix the check so it goes through with reporting the test suite timeout instead of failing on CHECK(ReadFileToString(output_file...)) in this situation: https://chromium-review.googlesource.com/#/c/chromium/src/+/1087176 [1] https://chromium-swarm.appspot.com/task?id=3de780094a686810&refresh=10&show_raw=1
,
Jun 5 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2fc11c2fc69e51c977b1d42fccaecceb82014c11 commit 2fc11c2fc69e51c977b1d42fccaecceb82014c11 Author: Gabriel Charette <gab@chromium.org> Date: Tue Jun 05 19:09:55 2018 [TestLauncher] Allow failure reading test output file after its process was killed. This check is incorrectly identified as the source of failure for a testsuite timing out : https://crbug.com/826408#c30. R=thakis@chromium.org Bug: 826408 Change-Id: I30d6496d161a15e9e5fdfa8c683667e604c76cc4 Reviewed-on: https://chromium-review.googlesource.com/1087176 Reviewed-by: Nico Weber <thakis@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#564606} [modify] https://crrev.com/2fc11c2fc69e51c977b1d42fccaecceb82014c11/base/test/launcher/test_launcher.cc
,
Jun 8 2018
,
Jun 12 2018
I did my best at the test_launcher.cc level but the problem is really that net_unittests went from 14 minutes to 52 minutes and now times out a lot. This is where the fix needs to occur. More sharding is always an option but Dirk rightfully pointed out [1] that we should first figure out why the test suite became slower. [1] https://chromium-review.googlesource.com/c/chromium/src/+/1087288#message-438a3b3f208925c9d142a3e916641f88c73550d6
,
Jun 13 2018
There are a huge number of parameterized QUIC tests, which may be related to this.
,
Jun 13 2018
,
Jun 16 2018
I've landed a number of QUIC test performance improvements in crbug/806109. This has dropped net_unittests runtime by about 40%. This is great, but QUIC tests are more than 50% of net_unittests runtime and it makes we wonder if we should consider breaking them out into their own target? Dirk, what do you think?
,
Jun 16 2018
(Argh. Wrong account). What's the best way to get a visualization of how long net_unittests is taking, similar to the plx script in #4? I can run the script but I only get data from april (And I don't get a graph, just the raw data) @dpranke: please see comment 37
,
Jun 20 2018
The data should still be in plx; I will take a look at the scripts/queries when I get the chance.
,
Jun 20 2018
Looks like gab@ incorrectly removed the RVG label (which is needed because of the links to PLX scripts). Re-adding. The old script was looking at completed_step times, which is of dubious accuracy. You should be looking at the sum of all of the swarming tasks times for a given step on a build instead. Here's a different, hopefully better query: https://plx.corp.google.com/scripts2/script_5b._2971d8_0000_222a_ada9_94eb2c144adc The time looks significantly shorter, down to ~36 min these days. That is, obviously, still far longer than it was months ago.
,
Jun 20 2018
Here's a more useful graph, over the past 180 days.
,
Jun 20 2018
Ah, great! Ok, so that confirms my hope that the recent changes I made: https://chromium-review.googlesource.com/c/chromium/src/+/1098030 https://chromium-review.googlesource.com/c/chromium/src/+/1103393 have move the runtime back into the reasonable range. As such, I think I'll close this bug. But before I do, I'd love to know if there's some metric we should use to consider if we should split net_unittests up into two targets? Like, if it's regularly running more than 45 minutes, would that be a good time?
,
Jun 20 2018
In theory we target total cycle times of 5-15 minutes per test suite, but you can adjust that by splitting the test suite across machines by adjusting the 'shards' parameter in the test configuration. There's no hard and fast rule for splitting up the test suites themselves, it's more what's convenient for devs. If you're likely to make changes that only affect one particular subset of //net, then splitting the suite up can keep you from running unrelated tests, and that can be a win. If you'd want to run only a subset of the tests in some configurations but not others, splitting them into test suites helps with that, as well. Otherwise, I wouldn't worry about it too much. webkit_layout_tests and browser_tests are both 8x bigger (~4 hours each in debug).
,
Jun 29 2018
,
Jun 29 2018
|
||||||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||||||
Comment 1 by martiniss@chromium.org
, Mar 27 2018