
Issue 603864


Issue metadata

Status: Verified
Owner:
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Buildbucket builds inconsistent between master and cq/rietveld

Project Member Reported by machenb...@chromium.org, Apr 15 2016

Issue description

In this CL https://codereview.chromium.org/1884293003/ a bunch of builds are stuck as pending, but the actual builds finished a long time ago.

Example:
The CQ claims it's pending on id 9015336399675264560
According to https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=9015336399675264560&_h=1& that build is in started state.

Clicking on the associated buildbot build URL:
http://build.chromium.org/p/tryserver.v8/builders/v8_linux_gcc_compile_rel/builds/14756
This claims it belongs to buildbucket id 9015335360657784368
This is a different ID!

https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=9015335360657784368&_h=2& claims it's finished, but it belongs to a different CL/patchset.

I'll manually cancel the started jobs now to unblock this CL.

 

Comment 2 by no...@chromium.org, Apr 15 2016

Labels: -Restrict-View-Google -Infra-Platform Infra-Buildbucket
Owner: no...@chromium.org
Status: Started (was: Untriaged)
This looks like a bug somewhere.

Comment 3 by no...@chromium.org, Apr 15 2016

2016-04-14 23:54:03-0700 [-] [buildbucket] Build 9015335360657784368 started as v8_linux_gcc_compile_rel/14756
2016-04-14 23:37:38-0700 [-] [buildbucket] Build 9015336399675264560 started as v8_linux_gcc_compile_rel/14756

Both buildbucket builds started as the same buildbot build. This looks like build merging was on, but according to the logs and the actual files on the master machine, merging was not on...
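A minimal sketch of how such collisions could be spotted in a master log, assuming every relevant line has the same shape as the two lines quoted above. The regular expression and the `find_collisions` helper are illustrative, not part of buildbucket or buildbot.

```python
import re
from collections import defaultdict

# Log lines copied from the comment above.
LOG = """\
2016-04-14 23:54:03-0700 [-] [buildbucket] Build 9015335360657784368 started as v8_linux_gcc_compile_rel/14756
2016-04-14 23:37:38-0700 [-] [buildbucket] Build 9015336399675264560 started as v8_linux_gcc_compile_rel/14756
"""

# Assumed line shape: "... [buildbucket] Build <id> started as <builder>/<number>".
LINE_RE = re.compile(r"\[buildbucket\] Build (\d+) started as (\S+)/(\d+)$")


def find_collisions(log_text):
    """Return {(builder, build number): [buildbucket ids]} for every
    buildbot build that more than one buildbucket build started as."""
    claims = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            bucket_id, builder, number = match.groups()
            claims[(builder, int(number))].append(bucket_id)
    return {key: ids for key, ids in claims.items() if len(ids) > 1}


collisions = find_collisions(LOG)
for (builder, number), ids in collisions.items():
    print(f"{builder}/{number} claimed by {', '.join(ids)}")
```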

Comment 4 by no...@chromium.org, Apr 15 2016

Cc: pgervais@chromium.org
Given that the infra team has lately been fighting duplicated master processes, I suspect two tryserver.v8 processes were running at that time, leased two different buildbucket builds, and then used the same build number for the two new builds (because buildbot's "next build number" is racy: https://code.google.com/p/chromium/codesearch#chromium/build/third_party/buildbot_8_4p1/buildbot/status/builder.py&q=buildnumber%20file:%5Ebuild/third_party/buildbot_8_4p1/&sq=package:chromium&type=cs&l=145)

Philippe, as yesterday's trooper, do you remember seeing more than one master.tryserver.v8 process yesterday?

Comment 5 by no...@chromium.org, Apr 15 2016

Correction: it is even more racy than that. It just keeps the next build number in process memory and increments it on each build: https://code.google.com/p/chromium/codesearch#chromium/build/third_party/buildbot_8_4p1/buildbot/status/builder.py&q=buildnumber%20file:%5Ebuild/third_party/buildbot_8_4p1/&sq=package:chromium&type=cs&l=532
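To illustrate the hypothesis in this comment, here is a rough sketch of why an in-memory counter hands out the same build number to two processes. `InMemoryBuilderStatus` is a hypothetical stand-in, not buildbot's actual class, and the starting value is simply the build number from the logs.

```python
class InMemoryBuilderStatus:
    """Hypothetical stand-in for a builder status object that keeps the
    next build number only in process memory."""

    def __init__(self, next_build_number):
        self.next_build_number = next_build_number

    def new_build_number(self):
        # Read and increment happen only inside this process; nothing
        # is coordinated with any other process.
        number = self.next_build_number
        self.next_build_number += 1
        return number


# Two master processes start from the same persisted state (assumed value).
LAST_PERSISTED = 14756
process_a = InMemoryBuilderStatus(LAST_PERSISTED)
process_b = InMemoryBuilderStatus(LAST_PERSISTED)

# Each process leases a different buildbucket build.
build_a = ("9015336399675264560", process_a.new_build_number())
build_b = ("9015335360657784368", process_b.new_build_number())

print(build_a, build_b)
```

Both tuples carry different buildbucket ids but the same buildbot build number, which matches the symptom in the logs.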

Comment 6 by no...@chromium.org, Apr 15 2016

Investigation of the build pair mentioned in #2 revealed that although the build looks complete on the build page (this is stored in a pickled file), the postgres db is corrupted. The build does not have a finish_time:

   id   | number |  brid  | start_time | finish_time
--------+--------+--------+------------+-------------
 340599 |   5146 | 342817 | 1460702531 |

and the buildrequest is incomplete and claimed by the previous process:
   id   | buildsetid |       buildername       | priority | claimed_at |                           claimed_by_name                           | claimed_by_incarnation  | complete | results | submitted_at | complete_at
--------+------------+-------------------------+----------+------------+---------------------------------------------------------------------+-------------------------+----------+---------+--------------+-------------
 342817 |     342817 | v8_win_rel_ng_triggered |        0 | 1460702531 | master4:/home/chrome-bot/buildbot/build/masters/master.tryserver.v8 | pid18995-boot1460554327 |        0 |      -1 |   1460702531 |
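A hedged sketch of a query that would surface this kind of inconsistency. The table names `builds` and `buildrequests` are assumptions (only the column names appear in the output above), and an in-memory SQLite database stands in for the master's postgres db, so the sketch is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed table names; columns are a subset of those shown above.
CREATE TABLE builds (id INTEGER, number INTEGER, brid INTEGER,
                     start_time INTEGER, finish_time INTEGER);
CREATE TABLE buildrequests (id INTEGER, buildername TEXT, claimed_at INTEGER,
                            claimed_by_incarnation TEXT, complete INTEGER,
                            results INTEGER, complete_at INTEGER);
INSERT INTO builds VALUES (340599, 5146, 342817, 1460702531, NULL);
INSERT INTO buildrequests VALUES (342817, 'v8_win_rel_ng_triggered', 1460702531,
                                  'pid18995-boot1460554327', 0, -1, NULL);
""")

# Builds that never recorded a finish time, joined to their build
# requests that are still marked incomplete.
STALE_SQL = """
SELECT b.id, b.number, r.buildername, r.claimed_by_incarnation
FROM builds AS b
JOIN buildrequests AS r ON r.id = b.brid
WHERE b.finish_time IS NULL AND r.complete = 0
"""
stale = conn.execute(STALE_SQL).fetchall()
print(stale)
```

On a live master the result would also include builds that are genuinely still running, so each row's claimed_by_incarnation would need to be compared with the incarnation of the current master process.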

Comment 7 by no...@chromium.org, Apr 15 2016

What a mistake it was during buildbucket implementation to assume that buildbot state is consistent with itself...

Comment 8 by no...@chromium.org, Apr 15 2016

Cc: -pgervais@chromium.org
pgervais didn't work on master4 yesterday

Comment 9 by no...@chromium.org, Apr 15 2016

Cc: stip@chromium.org
stip@ confirmed that master.tryserver.v8 was accidentally killed with `kill -9`, hence the corrupted state. So it is not a bug in the code (good); all we need to do is clean up the state.
Sounds more like it. I also saw a bunch of these builds in one CL. All triggered around the same time, but not in any other...

Comment 11 by no...@chromium.org, Apr 15 2016

Yes, that was bad timing. Those builds were scheduled at the ~same time.

I've cancelled the builds:


POST https://cr-buildbucket.appspot.com/_ah/api/buildbucket/v1/builds/cancel
 
{
 "build_ids": [
  "9015336278442014336",
  "9015336277109439408",
  "9015336274166636992",
  "9015336272240125344",
  "9015336258249231456",
  "9015336258162978224",
  "9015336249162494240",
  "9015336187701926704",
  "9015336125417120240",
  "9015336105826381504"
 ]
}
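For reference, a minimal sketch of assembling the same request from a script. The URL and the `build_ids` field come from the request above; `build_cancel_request` is an illustrative helper, authentication is not shown, and the request is only constructed here, not sent.

```python
import json
import urllib.request

CANCEL_URL = "https://cr-buildbucket.appspot.com/_ah/api/buildbucket/v1/builds/cancel"


def build_cancel_request(build_ids):
    """Build (but do not send) the batch-cancel POST for the given ids.

    Ids are serialized as strings, as in the request body above.
    """
    body = json.dumps({"build_ids": [str(i) for i in build_ids]}).encode("utf-8")
    return urllib.request.Request(
        CANCEL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_cancel_request(["9015336278442014336", "9015336277109439408"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` with whatever credentials the service requires.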

Comment 12 by no...@chromium.org, Apr 15 2016

Status: Fixed (was: Started)
Status: Verified (was: Fixed)
Thanks a bunch!
Components: Infra>Platform>Buildbucket
Labels: -Infra-Buildbucket
