PreCQ time out, not explained by metrics.
Issue description
This CL timed out in the PreCQ on 2017/5/18 at 9:40 PM.
https://chromium-review.googlesource.com/c/505955
Looking at the PreCQ graphs, I don't see any particular connection to the failure.
https://viceroy.corp.google.com/chromeos/pre-cq?duration=8d&utc_end=1495192685
,
May 23 2017
+xixuan (current deputy), who is also seeing this, and +chris, who has another example.
,
May 23 2017
Can we check the buildbucket id for one of these builds to see if buildbucket results confirm that the build never started?
,
May 23 2017
You can find the buildbucket id in the clActionTable:

mysql> select * from clActionTable where change_number=505955 and patch_number=3;
| 12609885 | 1528888 | 505955 | 3 | external | trybot_launching | rambi-pre-cq | 2017-05-19 03:10:14 | 8979201249746998544 |

https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/pre_cq/builds/33945

Basically, the build failed at InitialCheckout, so it never reached the PreCQSync stage and never inserted any useful information.
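For anyone scripting this lookup, here is a minimal sketch that splits the result row quoted above into named fields. The thread shows only the values, not the column headers, so every field name other than `change_number` and `patch_number` (which appear in the query) is an assumption, as is the reading that the last value is the buildbucket id.

```python
# The row exactly as quoted in the thread.
ROW = ("| 12609885 | 1528888 | 505955 | 3 | external | trybot_launching "
       "| rambi-pre-cq | 2017-05-19 03:10:14 | 8979201249746998544 |")

# ASSUMPTION: these labels are guesses for illustration. Only change_number
# and patch_number are confirmed by the query in the thread.
FIELDS = ("id", "build_id", "change_number", "patch_number", "change_source",
          "action", "reason", "timestamp", "buildbucket_id")


def parse_row(row):
    """Split one pipe-delimited mysql result row into a dict of strings."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != len(FIELDS):
        raise ValueError("expected %d cells, got %d" % (len(FIELDS), len(cells)))
    return dict(zip(FIELDS, cells))


record = parse_row(ROW)
print(record["action"], record["buildbucket_id"])
```

The id is kept as a string so it can be passed to other tools without any risk of numeric reformatting.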
,
May 23 2017
The build ran for 11 minutes, then was killed by the launcher as having timed out while starting. That means it didn't start for about 80 minutes (there is currently a 90 minute launch timeout, right?). Hmm... two issues are happening here.

A) 80 minutes is a long wait for a builder. The delay needs further investigation, since I don't THINK the metrics show all builders in use at that time.

B) The launcher doesn't notice that the build has started until the PreCQ Sync stage runs. How hard would it be to change the launcher to use buildbucket to learn whether the builders have started, and whether the build finished/passed? If the launcher can use buildbucket, we can delete the PreCQ Sync/Completion code on the PreCQ builders. That would be more robust and flexible, but it's not urgent.
,
May 24 2017
A) rambi-pre-cq started in time, but it failed at the InitialCheckout stage and reported nothing back, so it was treated as a timeout. It was triggered at 10:14 and the build started at 10:18.

B) cbuildbot is the place to check the pre-cq runs and to report the status. If the build fails at the InitialCheckout step, no cbuildbot code can run.
,
May 24 2017
A) You're right. I thought the sync timed out, but here is the real error:

error: insufficient permission for adding an object to repository database /b/cbuild/repository/.repo/projects/src/third_party/chromiumos-overlay.git/objects

I *think* the updated launcher code will now recover in this case by forcing a full sync. I'm not 100% certain.

B) Yes, I agree that that's how things work today. I was proposing a change. I'm just not sure how expensive that change would be to implement.
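To make the difference between today's behavior and the proposal concrete, here is a rough sketch. It is not the launcher's actual code: the `Launch` record, the status strings, and the "scheduler" fields (standing in for whatever buildbucket would report) are all hypothetical, and the 90 minute timeout is the figure quoted tentatively in this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Figure quoted (with a question mark) in the thread; treat as an assumption.
LAUNCH_TIMEOUT = timedelta(minutes=90)


@dataclass
class Launch:
    """Hypothetical record of one pre-CQ launch."""
    launched_at: datetime
    sync_reported_at: Optional[datetime] = None      # set once PreCQSync reports
    scheduler_started_at: Optional[datetime] = None  # what a scheduler might report
    scheduler_failed: bool = False


def status_from_sync_only(launch, now):
    """Behavior as described today: only a PreCQSync report counts as started."""
    if launch.sync_reported_at is not None:
        return "started"
    if now - launch.launched_at >= LAUNCH_TIMEOUT:
        return "launch_timeout"
    return "waiting"


def status_with_scheduler(launch, now):
    """Proposed behavior: also consult the scheduler's view of the build."""
    if launch.scheduler_started_at is not None:
        if launch.scheduler_failed and launch.sync_reported_at is None:
            return "failed_before_sync"
        return "started"
    return status_from_sync_only(launch, now)


# Illustrative timings only: the build starts shortly after launch, fails
# before PreCQSync, and the launcher looks again after the timeout expires.
launched = datetime(2017, 5, 19, 3, 10, 14)
launch = Launch(launched_at=launched,
                scheduler_started_at=launched + timedelta(minutes=4),
                scheduler_failed=True)
now = launched + timedelta(minutes=91)
print(status_from_sync_only(launch, now))
print(status_with_scheduler(launch, now))
```

With only the sync report to go on, an early failure such as the InitialCheckout error is indistinguishable from a build that never started. Consulting the scheduler separates the two cases.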
,
May 30 2017
Is there an immediate change needed here? Or are we proposing to wait until the new launcher code is live to see?
,
May 31 2017
crbug.com/726065 will add additional metrics that would help diagnose this kind of problem. Let's close this, and see if we still have unexplained timeouts after we have the better metrics.
,
Jun 6 2017
Re c#9: If you're closing this as WontFix, can you ensure that the follow-up job links back to this job and has an explicit action item to look for these timeouts? Otherwise this just falls through the cracks again.
,
Jun 20 2017
Comment 1 by pho...@chromium.org, May 19 2017
Labels: -Pri-2 Pri-1