
Issue 809588 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




wificell-pre-cq: incorrect (?) timeout accounting - "your change timed out after 240 minutes"

Project Member Reported by briannorris@chromium.org, Feb 6 2018

Issue description

Example CL:

https://chromium-review.googlesource.com/c/chromiumos/overlays/portage-stable/+/897057

This CL has so far tried to run through 2 wificell-pre-cq runs:

#1
https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/wificell_pre_cq/84

#2
https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/wificell_pre_cq/85

Run #1 may have been a legit timeout (for whatever reason, it didn't get picked up for HW tests for several hours?).

Run #2 looks weird though; either it didn't start when it said it did (2018-02-05 11:23 PM (PST)) or the PreCQ didn't post to Gerrit to say that it started testing the change.

Steps:
1. Select TR+1, at [Feb 06, 2018 9:05:51 AM UTC-8:00]
2. See PreCQ message that it was picked up at [Feb 06, 2018 9:07:32 AM UTC-8:00]
3. 2 seconds later, PreCQ posts message that the change timed out
4. Associated job says:
Start	2018-02-05 11:23 PM (PST)
End	2018-02-06 3:29 AM (PST)
Elapsed	4 hrs 5 mins

Either Step 3 or Step 4 is wrong.
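To make the mismatch concrete, here is a minimal sketch that replays the timestamps quoted in the steps above. It assumes all of them are in the same zone (PST, UTC-8:00) and that the "240 minutes" in the PreCQ message is the timeout being applied; the variable names are only illustrative.

```python
from datetime import datetime, timedelta

# Timestamps copied from the steps above (all assumed to be PST, UTC-8:00).
job_start = datetime(2018, 2, 5, 23, 23)      # "Start 2018-02-05 11:23 PM (PST)"
job_end = datetime(2018, 2, 6, 3, 29)         # "End 2018-02-06 3:29 AM (PST)"
picked_up = datetime(2018, 2, 6, 9, 7, 32)    # PreCQ "picked up" message
timed_out = picked_up + timedelta(seconds=2)  # timeout message 2 seconds later

# Assumption: this is the limit behind "timed out after 240 minutes".
PRE_CQ_TIMEOUT = timedelta(minutes=240)

job_elapsed = job_end - job_start
print("job elapsed:", job_elapsed)
print("job exceeded timeout:", job_elapsed > PRE_CQ_TIMEOUT)
print("pickup to timeout message:", timed_out - picked_up)
print("job end to pickup message:", picked_up - job_end)
```

The job's start and end are only shown to the minute, so the computed elapsed time can differ from the reported "4 hrs 5 mins" by about a minute. Either way, the job had already finished hours before the "picked up" message, so the timeout cannot have been measured from the pickup time in Step 2.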

---

Related question: I don't know why this timeout is occurring at all, as the logging isn't very clear. If I look at the #1 and #2 builds in GoldenEye:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955374033663436864

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955358296036341456

It looks like this preCQ is actually running many independent boards, but it's serializing all of this. That seems like a recipe for making this take WAY too long, so it's basically always going to time out. Is it even *possible* to finish this in under 4 hours?

It also seems like these instances had particularly slow BuildPackages times. According to the GE view above, most of the builds took "XX% longer than average", where XX is often between 30 and 50. Whereas, the last run that worked showed builds taking less than average:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955886830580044944

IOW, this builder looks to be very much borderline for hitting timeouts.
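A rough sketch of why a serialized run would be borderline. The per-board average and the board count below are hypothetical placeholders, not values read from GoldenEye; only the 240-minute limit and the 30 to 50% slowdown range come from this report.

```python
TIMEOUT_MINUTES = 240  # from the "timed out after 240 minutes" message


def serial_minutes(avg_minutes, boards, slowdown_pct):
    """Total wall time when every board is built one after another."""
    per_board = avg_minutes * (100 + slowdown_pct) / 100
    return per_board * boards


# Hypothetical inputs, NOT measured values from GoldenEye.
AVG_MINUTES_PER_BOARD = 25
BOARDS = 8

for pct in (0, 30, 50):
    total = serial_minutes(AVG_MINUTES_PER_BOARD, BOARDS, pct)
    verdict = "over" if total > TIMEOUT_MINUTES else "under"
    print(f"{pct:>2}% slower than average: {total:.0f} min ({verdict} the timeout)")
```

With these made-up numbers, an average run fits, but any run in the 30 to 50% slower range does not. Real averages would need to be plugged in to say how close the actual builder is.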
 
Cc: -kirtika@chromium.org kirtika@google.com
:( 1 more failure for the same CL:

https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/wificell_pre_cq/87

I don't think I see any actual failures (e.g., provisioning). It's just that the build takes way too long:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955305215816276496
The incorrect PreCQ failure notifications, which cause excessive/premature re-rejections, might be bug 808683. So perhaps we should focus on only the (legitimate) timeouts here, and not the poor notification logic. It seems like we'll either need to parallelize the builds, increase the timeouts, or reduce the number of configurations we do at once.
(Also bug 810628, but that was considered a duplicate.)
Labels: wifi-test-failures
Cc: tienchang@chromium.org bhthompson@chromium.org akes...@chromium.org chinyue@chromium.org benzh@chromium.org
Issue 601813 has been merged into this issue.
Cc: -tienchang@chromium.org -chinyue@chromium.org
My latest timeout:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8946771502912122064

It also had a failure because we had a missing (locked for repair) veyron_speedy device, and so we couldn't allocate any testing for it.

I think I can confidently declare that wificell-pre-cq is broken in its current state. Among the things that would need to change:
(a) improve device allocation behaviors -- we are simultaneously testing too little and too much; if we can't allocate many devices for this, it's not good to require 1 of each -- we should probably be flexible; but on the other hand, we don't really have a good recent selection of boards in this pool either
(b) make a sane timeout -- if we're running X builds in sequence, we need more than 4 hours
(c) probably more?
Cc: nxia@chromium.org
Agree that this is not usable at the moment. In terms of expanding the pre-cq pool with more recent boards, there is an existing open bug https://b.corp.google.com/issues/73493055


As for the number of devices / boards of each kind in the pre-cq pool, I am not sure what a good number is, as this pretty much depends on how often it's used and how many folks try to use it at the same time. The existing setup is supposed to have 2 of each of the following: winky, speedy, jerry, snow, elm; and 1 of each of the following: lulu, cyan, samus.


Regarding the 4 hour timeout, I am not sure where that comes from (akeshet@ / nxia@ ?). I see a 60min max runtime in the suite control file https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/test_suites/control.wificell-pre-cq?l=32
> As for the number of devices / boards of each kind in the pre-cq pool, I am not sure what a good number

I'm not directly asking for more types of boards right now; and not necessarily even allocating more boards. I just want the existing setup to be reliable. For the few small sanity tests we run here, it feels like we don't often care about running on *all* board types all the time. If the alternatives are
(a) fail because a DUT or two were down/locked or
(b) run on an incomplete set of boards
I'd always choose the latter for this suite.

BTW, allocating only 1 DUT for a required board type means that this suite will often fail. As soon as a lab tech needs to repair something, swap APs, etc., we're dead. So we definitely either need more than 1, or else modify the suite to do something more flexible.
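A minimal sketch of what "more flexible" could look like: run on whatever subset of boards currently has a usable DUT, and only fail if too few board types are left. This is not how the suite scheduler works today; `choose_boards`, `min_boards`, and the availability counts are hypothetical, and only the board names come from this thread.

```python
def choose_boards(wanted, available_duts, min_boards):
    """Return the wanted boards that have at least one usable DUT.

    Raises RuntimeError only when fewer than min_boards board types
    remain, instead of failing as soon as a single board is missing.
    """
    usable = [board for board in wanted if available_duts.get(board, 0) > 0]
    if len(usable) < min_boards:
        raise RuntimeError(
            f"only {len(usable)} board types available, need {min_boards}"
        )
    return usable


# Board names from this thread; the counts below are made up for illustration.
WANTED = ["winky", "speedy", "jerry", "snow", "elm", "lulu", "cyan", "samus"]
available = {
    "winky": 2, "speedy": 0,  # e.g. speedy locked for repair
    "jerry": 2, "snow": 1, "elm": 2, "lulu": 1, "cyan": 1, "samus": 1,
}

boards = choose_boards(WANTED, available, min_boards=6)
print("testing on:", ", ".join(boards))
```

With a threshold like this, one locked DUT shrinks coverage rather than failing the whole run, which is option (b) above.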

Related question: is it possible to allocate DUTs to more than one pool? e.g., we probably don't fully utilize both pool:wificell and pool:wificell-pre-cq.

> The existing setup is suppose to have 2 of the following; winky, speedy, jerry, snow, elm and 1 of the following: lulu, cyan, samus.

$ atest host list -b pool:wificell-pre-cq | grep '^chromeos' | wc -l
9

That doesn't add up ;)
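Spelling out the arithmetic, assuming the quoted description means 2 DUTs for each of the first five boards and 1 for each of the last three:

```python
# Counts as described in the quoted comment (assumed to be per board).
two_each = ["winky", "speedy", "jerry", "snow", "elm"]
one_each = ["lulu", "cyan", "samus"]

expected = 2 * len(two_each) + 1 * len(one_each)
reported = 9  # output of the atest command above

print("expected DUTs:", expected)
print("reported DUTs:", reported)
print("missing DUTs:", expected - reported)
```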

Comment 10 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
Labels: -Pri-2 Pri-3
[I think we've implicitly stopped using this pre-cq, as it's in bad shape.]
