wificell-pre-cq: incorrect (?) timeout accounting - "your change timed out after 240 minutes"
Issue description (Comment 1 by briannorris@chromium.org, Feb 6 2018)

Example CL: https://chromium-review.googlesource.com/c/chromiumos/overlays/portage-stable/+/897057

This CL has so far tried to run through 2 wificell-pre-cq runs:
#1 https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/wificell_pre_cq/84
#2 https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/wificell_pre_cq/85

Run #1 may have been a legitimate timeout (for whatever reason, it didn't get picked up for HW tests for several hours?). Run #2 looks weird, though; either it didn't start when it said it did (2018-02-05 11:23 PM (PST)), or the PreCQ didn't post to Gerrit to say that it had started testing the change.

Steps:
1. Select TR+1, at [Feb 06, 2018 9:05:51 AM UTC-8:00].
2. See PreCQ message that the change was picked up, at [Feb 06, 2018 9:07:32 AM UTC-8:00].
3. 2 seconds later, the PreCQ posts a message that the change timed out.
4. The associated job says:
   Start:   2018-02-05 11:23 PM (PST)
   End:     2018-02-06 3:29 AM (PST)
   Elapsed: 4 hrs 5 mins

Either step 3 or step 4 is wrong.

---

Related question: I don't know why this timeout is occurring at all, as the logging isn't very clear. If I look at the #1 and #2 builds in GoldenEye:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955374033663436864
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955358296036341456

it looks like this preCQ is actually running many independent boards, but serializing all of them. That seems like a recipe for making this take WAY too long, so it's basically always going to time out. Is it even *possible* to finish this in under 4 hours?

It also seems like these instances had particularly slow BuildPackages times. According to the GE view above, most of the builds took "XX% longer than average", where XX is often between 30 and 50. Whereas the last run that worked showed builds taking less than average:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955886830580044944

IOW, this builder looks to be very much borderline for hitting timeouts.
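To put rough numbers on the serialization concern, here is a hedged back-of-the-envelope sketch. The per-board minutes are assumed values for illustration, not measurements; only the 240-minute budget (from the bug title) and the board list (from a later comment in this thread) come from the bug itself:

  # Hedged sketch: serial vs. parallel wall-clock time for per-board builds.
  # Per-board minutes are assumptions; 240 is the builder's stated budget.
  BUDGET_MINS = 240

  per_board_mins = {
      'winky': 35, 'speedy': 35, 'jerry': 35, 'snow': 35,
      'elm': 35, 'lulu': 35, 'cyan': 35, 'samus': 35,
  }

  serial = sum(per_board_mins.values())      # 280 min
  slow_serial = int(serial * 1.4)            # 392 min, at the "+40%" GE reports
  parallel = max(per_board_mins.values())    # 35 min, if fully parallelized

  for label, mins in (('serial', serial), ('serial +40%', slow_serial),
                      ('parallel', parallel)):
      verdict = 'OVER budget' if mins > BUDGET_MINS else 'within budget'
      print('%-12s %4d min  %s' % (label, mins, verdict))

Even at these assumed averages, the serial total (280 min) already exceeds the 240-minute budget, and the 30-50% slowdowns GE reports push it much further over; a fully parallel run would instead be bounded by the slowest single board.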
Feb 7 2018
:( One more failure for the same CL: https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/wificell_pre_cq/87

I don't think I see any actual failures (e.g., provisioning); it's just that the build takes way too long:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8955305215816276496
Feb 9 2018
The mistaken PreCQ failure messages causing excessive/premature re-rejections might be bug 808683. So perhaps we should focus only on the (legitimate) timeouts here, and not on the poor notification logic. It seems like we'll need to either parallelize the builds, increase the timeouts, or reduce the number of configurations we run at once.
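For concreteness, a minimal sketch of what "parallelize the builds" could mean at the scheduling level. build_board() is a hypothetical stand-in for whatever launches one board's build today; the board list and the 240-minute budget come from elsewhere in this thread:

  # Hedged sketch: fan the per-board builds out in parallel instead of
  # running them back to back. build_board() is a hypothetical placeholder.
  from concurrent.futures import ThreadPoolExecutor, as_completed

  BOARDS = ['winky', 'speedy', 'jerry', 'snow', 'elm', 'lulu', 'cyan', 'samus']
  BUDGET_SECS = 240 * 60

  def build_board(board):
      """Placeholder for the real per-board build-and-test step."""
      return (board, 'passed')

  with ThreadPoolExecutor(max_workers=len(BOARDS)) as pool:
      futures = [pool.submit(build_board, b) for b in BOARDS]
      # as_completed() raises TimeoutError once the budget elapses, which
      # would map cleanly to a single, honest timeout message on Gerrit.
      for future in as_completed(futures, timeout=BUDGET_SECS):
          board, status = future.result()
          print(board, status)

With this shape, the wall-clock cost is roughly the slowest single board rather than the sum of all of them, so the existing 4-hour budget might even become generous.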
Feb 9 2018
(Also bug 810628, but that was considered a duplicate.)
Feb 13 2018
May 14 2018
Issue 601813 has been merged into this issue.
May 14 2018
My latest timeout:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8946771502912122064

It also had a failure because we had a missing (locked for repair) veyron_speedy device, and so we couldn't allocate any testing for it. I think I can confidently declare that wificell-pre-cq is broken in its current state. Among the things that would need to change:

(a) improve device allocation behaviors -- we are simultaneously testing too little and too much; if we can't allocate many devices for this, it's not good to require 1 of each -- we should probably be flexible (see the sketch after this list); but on the other hand, we don't really have a good recent selection of boards in this pool either

(b) make a sane timeout -- if we're running X builds in sequence, we need more than 4 hours

(c) probably more?
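To make (a) a bit more concrete, one hedged way "be flexible" could look: require some minimum board coverage instead of all boards, and skip boards whose only DUTs are locked or in repair. list_available_duts() is a hypothetical stand-in for whatever the allocator actually queries, and MIN_BOARDS is an assumed knob:

  # Hedged sketch of flexible allocation: run on whatever subset of the
  # desired boards currently has a healthy DUT, and fail only when
  # coverage drops below a floor. list_available_duts() is hypothetical.
  DESIRED_BOARDS = ['winky', 'speedy', 'jerry', 'snow',
                    'elm', 'lulu', 'cyan', 'samus']
  MIN_BOARDS = 4  # assumed floor; tune to taste

  def list_available_duts(pool, board):
      """Placeholder: a real version would query the AFE for unlocked,
      non-repair DUTs carrying both the pool and board labels."""
      return ['fake-dut']

  available = [b for b in DESIRED_BOARDS
               if list_available_duts('wificell-pre-cq', b)]
  if len(available) < MIN_BOARDS:
      raise RuntimeError('only %d board(s) available; need at least %d'
                         % (len(available), MIN_BOARDS))
  # ...schedule the suite on `available` only, instead of requiring 1 of each.

This way a single veyron_speedy in repair costs one board's worth of coverage instead of failing the whole run.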
May 14 2018
Agree that this is not usable at the moment.

In terms of expanding the pre-cq pool with more recent boards, there is an existing open bug: https://b.corp.google.com/issues/73493055

As for the number of devices/boards of each kind in the pre-cq pool, I am not sure what a good number is, as this pretty much depends on how often it's used and how many folks try to use it at the same time. The existing setup is supposed to have 2 of each of the following: winky, speedy, jerry, snow, elm; and 1 of each of the following: lulu, cyan, samus.

Regarding the 4-hour timeout, I am not sure where that comes from (akeshet@ / nxia@ ?); I see a 60-min max runtime in the suite control file: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/test_suites/control.wificell-pre-cq?l=32
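For reference, that knob normally lives in the standard dynamic_suite boilerplate at the bottom of a test_suites/control.* file. The following is a paraphrased sketch of that common pattern, not a verbatim copy of control.wificell-pre-cq (control files are exec'd by autotest, which supplies args_dict and job in the execution environment). Note this 60-minute cap applies within the suite and is separate from whatever imposes the 240-minute builder budget in the bug title:

  # Sketch of the typical tail of an autotest suite control file.
  # args_dict and job are provided by the autotest infrastructure.
  from autotest_lib.server.cros.dynamic_suite import dynamic_suite

  NAME = 'wificell-pre-cq'

  args_dict['name'] = NAME
  args_dict['job'] = job
  # Runtime cap in minutes -- the 60 referenced at the link above.
  args_dict['max_runtime_mins'] = 60
  dynamic_suite.reimage_and_run(**args_dict)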
May 15 2018
> As for the number of devices / boards of each kind in the pre-cq pool, I am not sure what a good number is

I'm not directly asking for more types of boards right now, and not necessarily even for allocating more boards; I just want the existing setup to be reliable. For the few small sanity tests we run here, it feels like we don't often care about running on *all* board types all the time. If the alternatives are (a) fail because a DUT or two were down/locked or (b) run on an incomplete set of boards, I'd always choose the latter for this suite.

BTW, allocating 1 DUT for a board type that is required means that this suite will often fail. As soon as a lab tech needs to repair something, swap APs, etc., we're dead. So we definitely need either more than 1, or else to modify the suite to do something more flexible.

Related question: is it possible to allocate DUTs to more than one pool? e.g., we probably don't fully utilize both pool:wificell and pool:wificell-pre-cq.

> The existing setup is supposed to have 2 of each of the following: winky, speedy, jerry, snow, elm; and 1 of each of the following: lulu, cyan, samus.

  $ atest host list -b pool:wificell-pre-cq | grep '^chromeos' | wc -l
  9

That doesn't add up ;)
Jun 8 2018
Dec 5 2018
[I think we've implicitly stopped using this pre-cq, as it's in bad shape.]