CLs timed out in PreCQ despite a low load.
Issue description

This CL is an example: https://crrev.com/c/963758/3

Every PreCQ builder associated with it timed out with:

"We were not able to launch a chromite-pre-cq trybot for your change within 90 minutes."

The timeout was reported on 3/14/18 at 8:20 PM; 90 minutes before that is 6:50 PM. Looking at Viceroy, the in-progress builder loads are low, so there should be plenty of capacity. Why did they time out?
Mar 15 2018
Checking the builders from the logs, I see they failed in the PreCQ sync stage:
Traceback (most recent call last):
  File "/tmp/cbuildbot-tmpFTJgAV/tmpkYi8I4/chromite/lib/failures_lib.py", line 229, in wrapped_functor
    return functor(*args, **kwargs)
  File "/tmp/cbuildbot-tmpFTJgAV/tmpkYi8I4/chromite/cbuildbot/validation_pool.py", line 374, in AcquirePreCQPool
    pool = cls(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'overlay'
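For illustration, here is a minimal sketch of one common way this kind of TypeError arises: a classmethod factory forwards **kwargs to a constructor that does not accept one of the keywords. The names below (FakePool, acquire, overlays) are hypothetical stand-ins, not the actual chromite code, and this does not claim to be the root cause in this bug.

```python
class FakePool(object):
    """Hypothetical stand-in for a pool class; not chromite's ValidationPool."""

    def __init__(self, overlays, dryrun=False):
        # Assumption for illustration: the constructor accepts 'overlays',
        # not 'overlay'.
        self.overlays = overlays
        self.dryrun = dryrun

    @classmethod
    def acquire(cls, *args, **kwargs):
        # Forwards everything to __init__, mirroring the traceback's
        # `pool = cls(*args, **kwargs)` line.
        return cls(*args, **kwargs)


def try_acquire(**kwargs):
    """Return the pool, or the TypeError message if construction fails."""
    try:
        return FakePool.acquire(**kwargs)
    except TypeError as e:
        return str(e)


error = try_acquire(overlay='public')   # keyword the constructor lacks
pool = try_acquire(overlays='public')   # matching keyword works
```

Because the factory forwards keywords blindly, a mismatch between caller and constructor only shows up at run time, inside the sync stage.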
Mar 15 2018
I can see two ways to improve this:

1) Now that we can link to tryjobs via buildbucket_id, annotate the CLs as soon as the jobs are requested. This should be easy to do. It would make the PreCQ feel more responsive, since users would be notified as soon as their CLs are picked up, and it would make edge cases like this a lot easier to diagnose.

2) Use buildbucket, not CIDB, to scan for PreCQ completion and pass/fail, using CIDB only for extended information if needed. That would let the PreCQ handle edge cases like this correctly, and better detect (and thus handle) cases where jobs are scheduled but not running right away (perhaps because of load).
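A rough sketch of what option 2 could look like, assuming each tryjob is represented by a dict with Buildbucket-style 'status' (SCHEDULED, STARTED, COMPLETED) and 'result' (e.g. SUCCESS, FAILURE) fields. The dict shape and the function name are assumptions for illustration; no real Buildbucket client calls are made here.

```python
def classify_build(build):
    """Map a Buildbucket-style build record to a PreCQ-level state.

    `build` is assumed to be a dict with 'status' and, once completed,
    'result' keys, or None if no build was ever requested.
    """
    if build is None:
        return 'not_launched'
    status = build.get('status')
    if status == 'SCHEDULED':
        # Requested but not yet running, e.g. waiting on capacity.
        return 'queued'
    if status == 'STARTED':
        return 'running'
    if status == 'COMPLETED':
        # Anything other than SUCCESS counts as a failure, including a
        # builder that died before recording anything elsewhere.
        return 'passed' if build.get('result') == 'SUCCESS' else 'failed'
    return 'unknown'


states = [
    classify_build(None),
    classify_build({'status': 'SCHEDULED'}),
    classify_build({'status': 'STARTED'}),
    classify_build({'status': 'COMPLETED', 'result': 'FAILURE'}),
    classify_build({'status': 'COMPLETED', 'result': 'SUCCESS'}),
]
```

With a mapping like this, a builder that crashes in the sync stage would be reported as 'failed' instead of surfacing as a 90 minute launch timeout.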
Mar 15 2018
I might try to implement (1) after I reland my CL to have "cros tryjob" export its results as JSON. That should allow the PreCQ launcher to avoid having to know how to generate the relevant URLs at all.
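As a sketch of the idea, assuming the JSON output is a list of objects that each carry a ready-made link, the launcher could just read the URL instead of constructing it. The schema here ('build_config', 'buildbucket_id', 'url') and the sample values are hypothetical; the real "cros tryjob" output format is not shown in this bug.

```python
import json

# Hypothetical sample of what the exported JSON might look like.
SAMPLE_OUTPUT = '''
[
  {"build_config": "example-pre-cq",
   "buildbucket_id": "123",
   "url": "https://example.invalid/build/123"}
]
'''


def tryjob_links(raw_json):
    """Return one 'config: url' line per launched tryjob."""
    jobs = json.loads(raw_json)
    # The URL is taken as-is from the output, so the caller never needs
    # to know how build links are generated.
    return ['%s: %s' % (job['build_config'], job['url']) for job in jobs]


links = tryjob_links(SAMPLE_OUTPUT)
```

Lines like these could then be posted on the CL as soon as the jobs are requested.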
Mar 30 2018
Mar 30 2018
Apr 5 2018
Jun 8 2018
Comment 1 by dgarr...@chromium.org, Mar 15 2018