factory branch builders in factory-oak-8182.B (elm and hana) are broken
Issue description
Although the build_packages failure in factory-oak is fixed, and some artifacts have been produced, the builder still isn't healthy. https://cros-goldeneye.corp.google.com/chromeos/console/listFFBuild?type=factory&branch=factory-oak-8182.B#/ The recent builds all seem to be aborted. Maybe a timeout issue?
,
Jul 31
Over to the current oncall. Alec, we only need to raise the timeout from 4 hours to 12. Do that and then restart the chromeos.branch waterfall with a 4 hour drain time so that it doesn't interrupt any currently running builds. If builds in progress complete sooner than that, it will restart sooner.
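For anyone unfamiliar with drain times, the behavior described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual master-manager or Buildbot code; the function name `seconds_until_restart` and its parameters are hypothetical, and it assumes the remaining time of each running build is known.

```python
# Hypothetical sketch of a "drain" restart: the master stops accepting new
# builds and restarts once running builds finish, or when the drain window
# expires, whichever comes first. Not the real master-manager logic.

FOUR_HOURS = 4 * 3600  # drain window in seconds, matching the 4 hour drain time


def seconds_until_restart(remaining_build_seconds, drain_seconds=FOUR_HOURS):
    """Return how long a draining master waits before it restarts.

    remaining_build_seconds: assumed time left (in seconds) for each build
    that is still running when the drain starts.
    """
    if not remaining_build_seconds:
        # Nothing is running, so the restart can happen immediately.
        return 0
    # Wait for the slowest running build, but never beyond the drain window.
    return min(max(remaining_build_seconds), drain_seconds)
```

With two running builds that need one and two more hours, this sketch restarts after two hours; a build that needs five more hours would still be cut off when the 4 hour window expires.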
,
Jul 31
How do I go about doing that? Are there any example CLs I can follow or something to get started?
,
Jul 31
https://chrome-internal-review.googlesource.com/c/chrome/tools/build/+/653348/2/masters/master.chromeos.branch/master_cros_branch_cfg.py#34 makes it look like it was already done. You'll need to investigate why that isn't working. See also this other place that we store timeouts: https://cs.corp.google.com/chromeos_public/chromite/config/chromeos_config.py
,
Jul 31
The waterfall was restarted right after this CL: https://chrome-internal-review.googlesource.com/c/infradata/master-manager/+/653784
,
Jul 31
I need help :( I am very lost on this one. This is what I see:
- Found https://luci-milo.appspot.com/buildbot/chromeos.branch/elm%20factory%20factory-oak-8182.B/ which shows some (but not all?) of the builds. That shows the 4-hour timeout being hit.
- I don't see a 4-hour timeout in any of those files; they are all 6 hours or more.
- These are Buildbot builds, so they don't have Buildbucket IDs (I don't understand how they get added to GE; it must be a totally different flow from Legoland).
- Because it's not on Buildbucket, I don't know how I would find any logs explaining why it timed out (normally I would look at raw swarming logs next).
- Some of these builds seem to have timed out while waiting for packages to finish building, which seems odd. Many are just stuck with `Still building <some package>...`, except the last one, which has errors that make no sense to me, like `losetup: /dev/loop0: detach failed: No such device or address WARNING : losetup -d /dev/loop0 failed (try 3)`
,
Jul 31
Don, could you maybe chat with Alec over IM when it's convenient for you to see if you can help steer toward the right direction?
,
Aug 1
This is the chromeos.branch waterfall: https://uberchromegw.corp.google.com/i/chromeos.branch/waterfall All of the firmware and factory builders run there. It's pure buildbot, no buildbucket, and the older branches will report degraded (or no) information into CIDB. Totally old school.
,
Aug 2
How much work would it be to move all of the factory builders to use buildbucket? Is that even possible?
,
Aug 2
I'm currently working on a big refactor of firmware builders to mostly modernize them, even when building older branches. See go/tot-for-firmware-branches and https://crbug.com/855291. When firmware is done, I'll investigate factory in more detail. The factory builders will be migrated to swarming, which involves a move to buildbucket, but they are the trickiest builders to migrate.
,
Aug 2
I'm actively working on this, sorry for the slow progress.
,
Aug 2
Hi ChOPs, any ideas why buildbot might be timing out like this? All the timeouts I see in config and code look to be 6 hours or more, but we are getting timeouts right at 4 hours.
,
Aug 2
FYI: Restarting chromeos.branch waterfall
,
Aug 2
Looks like the restart took, just passed the 4 hour mark.
,
Aug 2
And now it's passed. Nice work, Alec. My best guess for this would be that the previous attempt to increase the timeout didn't take because it was restarted so quickly after the config CL landed.
,
Aug 2
Thanks!
Comment 1 by nsanders@chromium.org
, Jul 27