
Issue 851996

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux, Chrome
Pri: 1
Type: ----




net_unittests failing on multiple builders

Reported by sheriff-...@appspot.gserviceaccount.com (Project Member), Jun 12 2018

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of rogerm@chromium.org

net_unittests failing on multiple builders

Builders failed on: 
- linux-chromeos-dbg: 
  https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/linux-chromeos-dbg


Cc: steve...@chromium.org jkop@chromium.org hzl@chromium.org cra...@chromium.org
Components: Infra
Labels: -Pri-2 OS-Chrome Pri-1
Status: Untriaged (was: Available)
Is this failing on builders other than linux-chromeos-dbg? Or did you mean "multiple builds"?

I have been looking into this. None of the tests seem to be taking very long, but the builder is timing out, even though the sum total of the test times is consistent with other test runs.

Is it possible that the builder load has increased, so that even though the individual test times are similar the overall time consumed is higher?

Unfortunately the log does not appear to include any other timestamps to compare against.

Here is a sample of the failing output:

[38800/40208] EndToEndTestsWithTls/EndToEndTestWithTls.BadEncryptedData/102 (138 ms)
[19108:19116:0611/151901.854133:23319062113:ERROR:kill_posix.cc(83)] Unable to terminate process group 17299: No such process (3)
Failed to get out-of-band test success data, dumping full stdio below:
[19108:19108:0611/151901.855433:23319063397:ERROR:file_util_posix.cc(703)] mkdtemp: No such file or directory (2)
[19108:19108:0611/151901.855519:23319063483:FATAL:unit_test_launcher.cc(664)] Check failed: platform_delegate->CreateResultsFile(&output_file). 
#0 0x7f286fe81fbd (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x3a4fbc)
#1 0x7f286fbd189c (/b/s/w/ir/out/Debug/libbase.so (deleted)+0xf489b)
#2 0x7f286fc3b2bd (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x15e2bc)
#3 0x000003b64865 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b64864)
#4 0x000003b670eb (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b670ea)
#5 0x000003b6a4e6 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b6a4e5)
#6 0x000003b90e90 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b90e8f)
#7 0x000003b90d61 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b90d60)
#8 0x000003b90cad (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b90cac)
#9 0x000003b90ac9 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b90ac8)
#10 0x7f286fb82bce (/b/s/w/ir/out/Debug/libbase.so (deleted)+0xa5bcd)
#11 0x7f286fbd2c8a (/b/s/w/ir/out/Debug/libbase.so (deleted)+0xf5c89)
#12 0x7f286fc5c76a (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x17f769)
#13 0x7f286fc6599b (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x18899a)
#14 0x7f286fc65be5 (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x188be4)
#15 0x7f286fc65ec4 (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x188ec3)
#16 0x7f286fec681b (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x3e981a)
#17 0x7f286fc6528e (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x18828d)
#18 0x7f286fd0e482 (/b/s/w/ir/out/Debug/libbase.so (deleted)+0x231481)
#19 0x000003b7ef3c (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b7ef3b)
#20 0x000003b63b97 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b63b96)
#21 0x000003b63835 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x3b63834)
#22 0x000002b17b90 (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x2b17b8f)
#23 0x7f286e3a7f45 __libc_start_main
#24 0x00000073679a (/b/s/w/ir/out/Debug/net_unittests (deleted)+0x736799)
+---------------------------------------------------------------------------------------------------+
| End of shard 0                                                                                    |
|  Pending: 2.1s  Duration: 3600.0s  Bot: gce-trusty-5967a2b0-us-west1-b-h5xr  Exit: -15  TIMED_OUT |
+---------------------------------------------------------------------------------------------------+
Total duration: 3600.0s
WARNING:root:collect_cmd had non-zero return code: 241
Test runtime exceeded allocated time
some shards did not complete: 0
step returned non-zero exit code: 241
net_unittests alone are indeed taking >1h to run on swarming, and therefore time out (by design). Example build:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/linux-chromeos-dbg/6254
Task: https://chromium-swarm.appspot.com/task?id=3e0d609706986410 (See the purple "net_unittests" step, link "shard #0 timed out, took too much time to complete").

This usually happens when something bad gets committed causing the test to run very long. As such, I don't think it's an infra failure.
Cc: -hzl@chromium.org sergeybe...@chromium.org
That was of course my first thought, but the output logs do not suggest that. Here is a spreadsheet that I put together showing that the total, average, and max times of the tests run were similar:

https://docs.google.com/spreadsheets/d/183U1bxbIOgkhrGxpSp7gWKNZ8KDE4qce8ZM5VuWNqcY/edit#gid=0
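For reference, a minimal sketch of how those per-test numbers can be tallied from the launcher output, assuming result lines of the form "[N/M] TestName (T ms)" as in the log sample above:

import re
import sys

# gtest launcher result lines look like:
#   [38800/40208] EndToEndTestsWithTls/EndToEndTestWithTls.BadEncryptedData/102 (138 ms)
RESULT_RE = re.compile(r'^\[\d+/\d+\] (\S+) \((\d+) ms\)$')

def tally(lines):
    durations = {}
    for line in lines:
        m = RESULT_RE.match(line.strip())
        if m:
            durations[m.group(1)] = int(m.group(2))
    return durations

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        durations = tally(f)
    total_ms = sum(durations.values())
    print('tests: %d  total: %.1f min  avg: %.0f ms  max: %d ms' % (
        len(durations),
        total_ms / 60000.0,
        float(total_ms) / max(len(durations), 1),
        max(durations.values()) if durations else 0))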

Re #c4: Were the runtimes taken directly from the log? Are we sure any overhead between the tests is accounted for?

Two points:

1) 26 min is still a long time for a swarming shard; CQ will be much happier if it were closer to 10-15 min tops;

2) A Swarming task has two timeouts: one on total runtime (1h), another on going 20 min without any output. If the tests indeed finished in 26 min, the task would have been stuck for the remaining 34 min, apparently without output, since basically nothing interesting is printed after the tests. In that case it should have been killed sooner, 46 min into the task (26 min of tests plus the 20 min output timeout). Since it timed out at 1h, it's more likely that the tests actually took longer to run, even if they reported the same runtimes in the logs. This is why I'm wondering about test overhead that the logs might not be reporting.
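To make the arithmetic in point 2 concrete, a quick sketch using only the numbers quoted above:

HARD_TIMEOUT_MIN = 60    # total-runtime limit for the swarming task
OUTPUT_TIMEOUT_MIN = 20  # task is killed after this long with no new output

reported_test_min = 26   # what the logs claim the tests took

# If the tests really finished at the 26 min mark and the task then hung
# silently, the output timeout should have fired at:
expected_kill_min = reported_test_min + OUTPUT_TIMEOUT_MIN  # = 46 min

# The task was instead killed at the 60 min hard timeout, which suggests
# the tests themselves ran longer than the durations they reported.
print('expected kill: %d min, actual kill: %d min'
      % (expected_kill_min, HARD_TIMEOUT_MIN))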

Labels: OS-Linux
It looks like we are having the same problem on non-CrOS Linux builders as well:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Tests%20%28dbg%29%281%29%2832%29/50605

I agree that it seems likely that the tests are taking longer to run than the sum of the reported times, which is why I suspect something on the builder, but the fact that it is failing on Linux as well is suspicious. I'll try to look at the Chrome changes around the respective initial failures.

Looking more closely at the Linux builder, net_unittests have been timing out there for a long time, just more frequently recently.
Sampling a few tasks over the course of the month, you can see how the total number of test cases being run has grown:
https://chromium-swarm.appspot.com/task?id=3db52ebf4c9eb010 37k
https://chromium-swarm.appspot.com/task?id=3e0cf0b7f7263410 40k

Probably just natural growth. We really shouldn't be running such a large test suite in a single shard. The least we can do is double-shard it, which will roughly cut its runtime in half.
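Back of the envelope, using the task numbers above and treating the 3600s shard timeout as a lower bound on the suite's true runtime:

SHARD_TIMEOUT_S = 3600.0  # "Total duration: 3600.0s" from the timed-out shard

tests_now = 40208         # count in the recent timed-out task
tests_then = 37000        # approximate count in the month-old task

# The shard hit the timeout before finishing, so this is a lower bound on
# the effective per-test cost (test time plus launcher/bot overhead).
per_test_s = SHARD_TIMEOUT_S / tests_now            # >= ~0.09 s per test

# At that rate the older, smaller suite was already close to the limit...
print('older suite: >= %.0f min' % (tests_then * per_test_s / 60))  # ~55 min

# ...and splitting today's suite across two shards roughly halves the wall
# time per shard, assuming the launcher balances the shards evenly.
print('two shards:  ~%.0f min each' % (SHARD_TIMEOUT_S / 2 / 60))   # ~30 min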
That sounds reasonable to me. As I mentioned, it's been flaky on linux(dbg) for a very long time; we should do the same there.


stevenjb: would you take the bug then to increase sharding, or triage it to someone who owns the builder or the test?

If there is anything else a trooper can help with here, please let us know; otherwise let's remove the Infra-Troopers label and component:Infra. Thanks!
Owner: bpastene@chromium.org
Status: Assigned (was: Untriaged)
We own the testing specs (https://chromium.googlesource.com/chromium/src/+/master/testing/buildbot/OWNERS), so I'll handle it.
Thanks Ben! I have no idea how to increase sharding :)

Comment 13 by bugdroid1@chromium.org (Project Member), Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8bf205b836e9e218987cbd9aa9f5f91ea37bab24

commit 8bf205b836e9e218987cbd9aa9f5f91ea37bab24
Author: Ben Pastene <bpastene@chromium.org>
Date: Wed Jun 13 04:00:33 2018

Double shard dbg builds of net_unittests on linux(-chromeos) testers.

Bug:  851996 
Change-Id: Id1b01007c65fe417dc9f3beac1aac158e5c97bd0
Reviewed-on: https://chromium-review.googlesource.com/1097701
Reviewed-by: John Budorick <jbudorick@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#566712}
[modify] https://crrev.com/8bf205b836e9e218987cbd9aa9f5f91ea37bab24/testing/buildbot/chromium.chromiumos.json
[modify] https://crrev.com/8bf205b836e9e218987cbd9aa9f5f91ea37bab24/testing/buildbot/chromium.linux.json
[modify] https://crrev.com/8bf205b836e9e218987cbd9aa9f5f91ea37bab24/testing/buildbot/test_suite_exceptions.pyl
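For reference, a sketch of the kind of per-suite override such a change adds in testing/buildbot/test_suite_exceptions.pyl; the builder names and exact schema here are illustrative, not copied from the CL:

# Hypothetical .pyl fragment: bump net_unittests to 2 swarming shards on
# the affected dbg testers. The 'swarming': {'shards': N} shape follows
# the testing/buildbot spec format but may not match the CL verbatim.
{
  'net_unittests': {
    'modifications': {
      'Linux Tests (dbg)(1)(32)': {
        'swarming': {
          'shards': 2,
        },
      },
      'linux-chromeos-dbg': {
        'swarming': {
          'shards': 2,
        },
      },
    },
  },
}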

linux-chromeos-dbg and Linux Tests (dbg) both appear happier since #13. Anything else to do here?
Looks great to me, thanks! Overall test times are better also!


Status: Fixed (was: Assigned)
Closing for now. Feel free to reopen if I missed a bot.
