
Issue metadata

Status: WontFix
Closed: Jan 17
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug


paygen tests from previous canary delayed bvt-arc from canary, causing timeout

Reported by Project Member, Dec 28

Issue description

Labels: -Pri-2 Pri-1
Happened on cave-release as well for the bvt-arc tests.

There is a lot of canary redness today, and I am wondering if it's also related to this. For instance, all bvt-arc tests timed out on pyro-release.
I don't understand exactly what the ask of this bug is.

What you're seeing are suite aborts because they didn't finish in time.

So, not all tests passed: some were aborted because they didn't run in the given time.
Let's keep drilling into that example:

So there were 6 DUTs, as required for the bvt suite (is it the same for arc? I don't know; +ihf, +jrbarnette).

Also, one DUT went through one failure-repair cycle. We should be able to deal with that in the allotted time.

There is a possibility that tests ran slower than usual here, and that caused timeout. But I'll need to dig into some data to determine either way.
I do see a large number of autoupdate tests, each taking 20+ minutes. They are pretty much hogging the DUTs towards the latter half of the suite. Is that intentional?
Owner: ----
I don't have enough evidence yet that infra was actually to blame here (e.g., the shard being too slow), so I'm not taking ownership yet.
Summary: various canary hwtest suites are taking too long, timing out (was: various canary hwtest suites fail, but all tests pass)
I don't quite see why this would NOT be an infra problem, but let's look.

Let's look at edgar-release build 1803. The timing diagram is here:

10:12 Build starts (cleanup)
10:52 buildpackages starts
11:58 buildimage starts
12:43 bvt-arc starts
15:53 bvt-arc is timed out

so bvt-arc started 2h 31m into the build, then ran for 3h 10m before being timed out.

Comparing with successful build 1802:

02:12 start
05:02 bvt-arc start
07:14 bvt-arc ends

Interestingly, in the earlier build getting to bvt-arc took longer (3h instead of 2.5h) but the tests ran faster (finished in 2h 12m instead of timing out at 3h 10m).
The only way this wouldn't be caused by slowness is if we had added more tests to the suite in question. The working assumption is that it is slowness, since there were enough DUTs in each case.
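The durations quoted above can be recomputed from the timestamps with simple clock arithmetic (times are copied from the build timelines in these comments; same-day HH:MM values are assumed):

```python
from datetime import datetime

def elapsed(start, end):
    """Difference between two same-day HH:MM wall-clock times."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

# Build 1803 (failed): bvt-arc started 2h 31m into the build and was
# timed out 3h 10m after it started.
print(elapsed("10:12", "12:43"))  # 2:31:00
print(elapsed("12:43", "15:53"))  # 3:10:00

# Build 1802 (succeeded): bvt-arc started ~2h 50m in and finished in
# about 2h 12m.
print(elapsed("02:12", "05:02"))  # 2:50:00
print(elapsed("05:02", "07:14"))  # 2:12:00
```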
Ah. I noticed one thing that I think is quite relevant: we're running bvt-arc and paygen tests in parallel and in the failing case, some of the paygen tests got intermixed with the bvt-arc tests.
These tests (the autoupdate tests I was complaining about earlier) take a long time, hogged the DUTs, and caused the bvt-arc suite to timeout.

paygen tests have a longer timeout and should have run later.


good case:
build timeline:
bvt-arc suite timeline:

bad case:
build timeline:
bvt-arc suite timeline:

For the bad case (build 1803), here are the logs from the suite job for bvt-arc:

and for paygen-canary-test suite:

We see that the bvt-arc tests were all scheduled ~12:45:

12/27 12:45:19.986 DEBUG|             suite:1120| Scheduling cheets_GTS.5.1_r2.GtsPlacementTestCases
12/27 12:45:20.446 DEBUG|             suite:1390| Adding job keyval for cheets_GTS.5.1_r2.GtsPlacementTestCases=165751485-chromeos-test
12/27 12:45:20.446 DEBUG|             suite:1120| Scheduling cheets_GTS.5.1_r2.GtsNetTestCases
12/27 12:45:20.894 DEBUG|             suite:1390| Adding job keyval for cheets_GTS.5.1_r2.GtsNetTestCases=165751488-chromeos-test

12/27 12:45:37.762 DEBUG|             suite:1390| Adding job keyval for cheets_CameraOrientation=165751625-chromeos-test
12/27 12:45:37.763 DEBUG|             suite:1192| Scheduled 38 tests, writing the total to keyval.
12/27 12:45:37.763 DEBUG|     dynamic_suite:0622| Waiting on suite.

whereas the paygen tests were scheduled ~13:20:
12/27 13:19:17.451 DEBUG|             suite:1172| Discovered 4 tests.
12/27 13:19:17.454 INFO |        server_job:0218| INFO	----	Start paygen_au_canary	timestamp=1514409557	localtime=Dec 27 13:19:17	
12/27 13:19:17.455 DEBUG|             suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_full_10255.0.0
12/27 13:19:17.893 DEBUG|             suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_full_10255.0.0=165762248-chromeos-test
12/27 13:19:17.893 DEBUG|             suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_full_10254.0.0
12/27 13:19:18.313 DEBUG|             suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_full_10254.0.0=165762250-chromeos-test
12/27 13:19:18.314 DEBUG|             suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_delta_10254.0.0
12/27 13:19:18.735 DEBUG|             suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_delta_10254.0.0=165762253-chromeos-test
12/27 13:19:18.735 DEBUG|             suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_delta_10255.0.0
12/27 13:19:19.170 DEBUG|             suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_delta_10255.0.0=165762256-chromeos-test
12/27 13:19:19.171 DEBUG|             suite:1192| Scheduled 4 tests, writing the total to keyval.
12/27 13:19:19.171 DEBUG|     dynamic_suite:0622| Waiting on suite.
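The two log excerpts can be reduced to (timestamp, test) pairs with a quick parse. This is a hypothetical helper, not an autotest tool; the regex is fitted to the exact log format quoted above:

```python
import re

# Hypothetical helper (not part of autotest): pull (timestamp, test)
# pairs out of suite-job log lines like the ones quoted above.
SCHED_RE = re.compile(
    r"(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+DEBUG\|\s*suite:\d+\|"
    r"\s+Scheduling\s+(\S+)")

def scheduled_tests(log_text):
    return SCHED_RE.findall(log_text)

log = (
    "12/27 12:45:19.986 DEBUG|             suite:1120| "
    "Scheduling cheets_GTS.5.1_r2.GtsPlacementTestCases\n"
    "12/27 13:19:17.455 DEBUG|             suite:1120| "
    "Scheduling autoupdate_EndToEndTest_paygen_au_canary_full_10255.0.0\n"
)
for ts, name in scheduled_tests(log):
    print(ts, name)
# The bvt-arc test is scheduled ~12:45 and the paygen test ~13:19,
# yet the paygen jobs still ran intermixed with bvt-arc on the DUTs.
```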

Still, they ended up intermixed, which I can't quite explain. Although it is not impossible, the autotest scheduler is in practice FIFO, all other things being equal, and in this case the scheduled test jobs should have had the same job dependencies.
The timeout must be specific to the tests, because I see much longer builds.  For instance, build 1719 took a total of 6h 35m and succeeded.

Maybe we should just increase the bvt-arc timeout?

Why is there so much variability though?

In fact, we see that _no_ bvt-arc tests were run in the middle 40 minutes in the bad case, which single-handedly explains the timeout later:
OK, no priority inversion.

I looked at one of the interjecting paygen jobs; its parent suite is: http://cautotest/afe/#tab_id=view_job&object_id=165683967

compare that with the bvt-arc suite:

That paygen job is a delayed job from the _last_ canary. :'(

So, paygen jobs got so delayed from the last canary that they got run in the middle of the bvt-arc from the next one, killing it.
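The FIFO point above suggests a simple mental model: jobs dequeue in (priority, submission time) order, so paygen jobs left over from the previous canary, with submission times hours older than today's bvt-arc jobs, get picked first. A toy sketch (not the real autotest scheduler; the names and timestamps are made up):

```python
import heapq

# Toy model only: jobs ordered by (priority, submit_time); equal
# priority means pure FIFO. The leftover paygen job was submitted the
# previous day, so it sorts ahead of the fresh bvt-arc job.
queue = []
heapq.heappush(queue, (1, "12/27 12:45", "cheets_GTS (bvt-arc, current canary)"))
heapq.heappush(queue, (1, "12/26 14:00", "autoupdate_EndToEndTest (delayed, previous canary)"))

order = []
while queue:
    _, submitted, job = heapq.heappop(queue)
    order.append(job)
    print(submitted, job)
# The stale paygen job dequeues first and hogs a DUT for 20+ minutes.
```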
Summary: paygen tests from previous canary delayed bvt-arc from canary, causing timeout (was: various canary hwtest suites are taking too long, timing out)
The overall problem is still either capacity (6 DUTs are not enough for the load we're putting on bvt) or throughput (there were DUTs sitting around but infra was not using them well). Need to determine which.
OK I will stop working on this.  This is what I was about to submit, but you've already answered those questions.  Thanks :)

Sorry, I had not noticed #8 and #9 when I entered #10.  You're ahead.  I was just checking devserver overload for edgar 1803 (it's chromeos4-devserver4) and there is no overload.

So arc-bvt timed out because it was sharing DUTs with paygen?

For that build I see bvt-arc starting at 12:43 and paygen at 12:53.

That's very similar to successful builds.  Paygen starts 9 or 10 minutes later.

Issue 796275 has been merged into this issue.
Issue 796439 has been merged into this issue.
Note issue 796439 comment #2. Richard claims there were idle DUTs during the delay.
Status: Assigned (was: Untriaged)
> Note issue 796439 comment #2. Richard claims there were idle DUTs during the delay.

The defining feature of  bug 796439  isn't the idle DUTs.  The
defining feature is that all of the scheduled tests ran to
successful completion on the shard, but the master never saw
results from the shard and reported the jobs as aborted.  The
DUTs were idle merely because they had finished work.

Someone should check whether that's what's happening here.
Honor of first investigation goes to the primary deputy.

It doesn't look like the tests got a DUT even on the shard.
Status: WontFix (was: Assigned)
I no longer know what this bug is about.  Later analysis (and the bug
summary) suggests that tests were slow to start because of other prior
testing that wasn't done.  But that wasn't the cause of the first failure.
It's also not clear to me that there are any problems still remaining to
be fixed...

As a practical matter, we've had problems recently with shards that start
to run slow because of a problematic query.  See bug 798489.  That became
a particular issue shortly after this bug was filed, and the shards for
edgar and quawks (server104 and server112, respectively) are specifically
named as problems.  So, likely this is a dup of that.

In any event:  WontFix, for want of a current, actionable problem.
