Buildbucket builds inconsistent between master and cq/rietveld
Issue description

In this CL, https://codereview.chromium.org/1884293003/, a bunch of builds are hanging as pending, but the actual builds finished a long time ago.

Example: The CQ claims it's pending on id 9015336399675264560.

According to https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=9015336399675264560&_h=1& that build is in the started state.

Clicking on the associated buildbot build URL: http://build.chromium.org/p/tryserver.v8/builders/v8_linux_gcc_compile_rel/builds/14756
This claims it belongs to buildbucket id 9015335360657784368. This is a different ID!

https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=9015335360657784368&_h=2& claims it's finished, but it belongs to a different CL/patchset.

I'll manually cancel the started jobs now to unblock this CL.
,
Apr 15 2016
This looks like a bug somewhere.
,
Apr 15 2016
2016-04-14 23:54:03-0700 [-] [buildbucket] Build 9015335360657784368 started as v8_linux_gcc_compile_rel/14756
2016-04-14 23:37:38-0700 [-] [buildbucket] Build 9015336399675264560 started as v8_linux_gcc_compile_rel/14756

Both buildbucket builds started as the same buildbot build. This looks like build merging was on, but according to the logs and the actual files on the master machine, merging was not on...
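A minimal sketch of how such collisions could be spotted in a master log automatically. It assumes the log line format shown above; the function name `find_shared_builds` and the regular expression are illustrative, not part of buildbucket or buildbot.

```python
import re
from collections import defaultdict

# Matches master log lines of the form quoted in this comment, e.g.
#   [buildbucket] Build 9015335360657784368 started as v8_linux_gcc_compile_rel/14756
STARTED_RE = re.compile(r"\[buildbucket\] Build (\d+) started as (\S+)/(\d+)")


def find_shared_builds(log_lines):
    """Return {(builder, build_number): [buildbucket ids]} for every
    buildbot build that more than one buildbucket build started as."""
    seen = defaultdict(list)
    for line in log_lines:
        match = STARTED_RE.search(line)
        if match:
            bucket_id, builder, number = match.groups()
            seen[(builder, int(number))].append(bucket_id)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


LOG = [
    "2016-04-14 23:54:03-0700 [-] [buildbucket] Build 9015335360657784368 "
    "started as v8_linux_gcc_compile_rel/14756",
    "2016-04-14 23:37:38-0700 [-] [buildbucket] Build 9015336399675264560 "
    "started as v8_linux_gcc_compile_rel/14756",
]

for (builder, number), ids in find_shared_builds(LOG).items():
    print(f"{builder}/{number} is shared by buildbucket builds: {', '.join(ids)}")
```

Run against the two lines above, this reports v8_linux_gcc_compile_rel/14756 as shared by both buildbucket ids.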
,
Apr 15 2016
Given that the infra team has lately been fighting duplicate master processes, I suspect that two tryserver.v8 processes were running at that time, leased two different buildbucket builds, and then used the same build number for two new builds (because buildbot's "next build number" is racy: https://code.google.com/p/chromium/codesearch#chromium/build/third_party/buildbot_8_4p1/buildbot/status/builder.py&q=buildnumber%20file:%5Ebuild/third_party/buildbot_8_4p1/&sq=package:chromium&type=cs&l=145).

Philippe, as yesterday's trooper, do you remember seeing more than one master.tryserver.v8 process yesterday?
,
Apr 15 2016
Correction: it is even more racy than that. It just keeps the next build number in process memory and increments it on each build: https://code.google.com/p/chromium/codesearch#chromium/build/third_party/buildbot_8_4p1/buildbot/status/builder.py&q=buildnumber%20file:%5Ebuild/third_party/buildbot_8_4p1/&sq=package:chromium&type=cs&l=532
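To illustrate the suspected failure mode, here is a toy model, not buildbot's actual code. The class `InMemoryBuilderStatus` and its method are hypothetical; the only thing taken from the comment above is that the next build number lives in process memory and is incremented per build.

```python
class InMemoryBuilderStatus:
    """Toy stand-in for a master process that tracks the next build
    number only in its own memory (hypothetical, for illustration)."""

    def __init__(self, next_build_number):
        self.next_build_number = next_build_number

    def new_build(self, buildbucket_id):
        # Nothing is coordinated with other processes here: the counter
        # is read and bumped purely locally.
        number = self.next_build_number
        self.next_build_number += 1
        return buildbucket_id, number


# Two processes that both start from the same build number.
master_a = InMemoryBuilderStatus(14756)
master_b = InMemoryBuilderStatus(14756)

build_a = master_a.new_build("9015336399675264560")
build_b = master_b.new_build("9015335360657784368")

# Different buildbucket builds, same buildbot build number.
print(build_a, build_b)
```

Any two processes that start from the same number and never share the counter will hand out the same build number to different buildbucket builds.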
,
Apr 15 2016
Investigation of the build pair mentioned in #2 revealed that although the build looks complete on the build page (this is stored in a pickled file), the postgres db is corrupted. The build does not have a finish_time:

   id   | number |  brid  | start_time | finish_time
--------+--------+--------+------------+-------------
 340599 |   5146 | 342817 | 1460702531 |

and the buildrequest is incomplete and claimed by the previous process:

   id   | buildsetid |       buildername       | priority | claimed_at |                           claimed_by_name                           | claimed_by_incarnation  | complete | results | submitted_at | complete_at
--------+------------+-------------------------+----------+------------+---------------------------------------------------------------------+-------------------------+----------+---------+--------------+-------------
 342817 |     342817 | v8_win_rel_ng_triggered |        0 | 1460702531 | master4:/home/chrome-bot/buildbot/build/masters/master.tryserver.v8 | pid18995-boot1460554327 |        0 |      -1 |   1460702531 |
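A rough sketch of how rows in this state could be found. It is an assumption-heavy illustration: the table names `builds` and `buildrequests` are assumed (only the columns appear above), an in-memory SQLite database stands in for postgres, only a subset of columns is modelled, and the "live" incarnation string is made up.

```python
import sqlite3


def find_orphaned_builds(conn, live_incarnation):
    """Builds with no finish_time whose request is still incomplete and
    claimed by an incarnation other than the currently running one."""
    query = """
        SELECT b.id, b.number, r.buildername, r.claimed_by_incarnation
        FROM builds AS b
        JOIN buildrequests AS r ON r.id = b.brid
        WHERE b.finish_time IS NULL
          AND r.complete = 0
          AND r.claimed_by_incarnation != ?
    """
    return conn.execute(query, (live_incarnation,)).fetchall()


# In-memory stand-in for the master database (assumed table names).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE builds (id INTEGER, number INTEGER, brid INTEGER, "
    "start_time INTEGER, finish_time INTEGER)"
)
conn.execute(
    "CREATE TABLE buildrequests (id INTEGER, buildername TEXT, "
    "claimed_by_incarnation TEXT, complete INTEGER, results INTEGER)"
)
# The rows quoted in this comment.
conn.execute("INSERT INTO builds VALUES (340599, 5146, 342817, 1460702531, NULL)")
conn.execute(
    "INSERT INTO buildrequests VALUES "
    "(342817, 'v8_win_rel_ng_triggered', 'pid18995-boot1460554327', 0, -1)"
)

# Hypothetical incarnation id of the process that is running now.
orphans = find_orphaned_builds(conn, "pid00000-boot0000000000")
print(orphans)
```

The query is read-only; whatever cleanup follows should be decided separately.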
,
Apr 15 2016
What a mistake it was during buildbucket implementation to assume that buildbot state is consistent with itself...
,
Apr 15 2016
pgervais didn't work on master4 yesterday
,
Apr 15 2016
stip@ confirmed that master.tryserver.v8 was accidentally killed with `kill -9`, hence the corrupted state. So it is not a bug in the code (good); all we need to do is clean up the state.
,
Apr 15 2016
Sounds more like it. I also saw a bunch of these builds in one CL. All triggered around the same time, but not in any other...
,
Apr 15 2016
Yes, that was bad timing. Those builds were scheduled at the ~same time. I've cancelled the builds:

POST https://cr-buildbucket.appspot.com/_ah/api/buildbucket/v1/builds/cancel

{
  "build_ids": [
    "9015336278442014336",
    "9015336277109439408",
    "9015336274166636992",
    "9015336272240125344",
    "9015336258249231456",
    "9015336258162978224",
    "9015336249162494240",
    "9015336187701926704",
    "9015336125417120240",
    "9015336105826381504"
  ]
}
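For reference, a minimal sketch of assembling the same request with Python's standard library. The endpoint and the `build_ids` field are taken from the request above; the helper name `make_cancel_request` is made up, authentication is deliberately left out because the thread does not say how the call was authorized, and the request is only built here, not sent.

```python
import json
import urllib.request

CANCEL_URL = "https://cr-buildbucket.appspot.com/_ah/api/buildbucket/v1/builds/cancel"


def make_cancel_request(build_ids):
    """Build (but do not send) a POST request cancelling the given builds."""
    # IDs are sent as strings, as in the request above.
    body = json.dumps({"build_ids": [str(b) for b in build_ids]}).encode("utf-8")
    return urllib.request.Request(
        CANCEL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = make_cancel_request(["9015336278442014336", "9015336277109439408"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(request), plus whatever
# credentials the service requires (not covered here).
```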
,
Apr 15 2016
,
Apr 15 2016
Thanks a bunch!
,
Apr 27 2016
Comment 1 by machenb...@chromium.org, Apr 15 2016