
Issue 599874

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
gone, assign your bugs elsewhere :)
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




trybot doesn't see that webgl_conformance_tests completed

Project Member Reported by jam@chromium.org, Apr 1 2016

Issue description

context: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/203575

I'm not sure what's going on here. It looks like the webgl_conformance_tests swarming task completed successfully in 10 minutes, but the trybot never noticed; it's stuck on pending (yellow) 9 hours later.
 
This should probably be sent to the trooper. Please check first if it is a buildbot issue.

Comment 2 by jam@chromium.org, Apr 5 2016

Labels: Infra-Troopers
Labels: -Infra-Troopers
Hm. Looks like flake to me. CommitQueue team should triage and investigate.
(flake of the infrastructure... it looks like the recipe just sort of... stopped. I'm not sure if this was the slave, buildbot, recipes, CQ or what).

Comment 5 by jam@chromium.org, Apr 5 2016

Given that the tryjob is the one that's stopped, how is this related to CQ? Seems like a problem in the recipe?

Comment 6 by jam@chromium.org, Apr 5 2016

Owner: iannucci@chromium.org
Status: Assigned (was: Untriaged)
assigning an owner for now.
Owner: phajdan.jr@chromium.org
I was thinking that the CQ team ~maintains this recipe, but you're right, I should have assigned to an actual owner. Pawel, can you look into this?
Cc: chrishall@chromium.org
I believe we are seeing this issue across multiple trybots and multiple different tests

Attached a file showing seemingly skipped runs for https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng?numbuilds=200


Attachment: Screenshot from 2016-04-06 16:43:38.png (243 KB)
Cc: phajdan@google.com
Please note the original build https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/203575 was not using logdog.

FWIW, example other build that was in this state is https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/207636 .

This doesn't seem to be related to a specific recipe. I suspect something with buildbot and/or annotator. It might be quite difficult to debug this.

Also see https://goto.google.com/moofe (Google-internal).

Comment 11 by jam@chromium.org, Apr 6 2016

Labels: -Pri-1 Pri-0
I'm bumping up to P0 since this is a serious regression.
Labels: Infra-Troopers
Owner: ----
Status: Untriaged (was: Assigned)
Moving to trooper queue.

It's close to EOD here for me, and I don't have cycles for an urgent initial investigation.

Feel free to move this back to me and/or CQ queue once we know something more. I can also take a more detailed look tomorrow when I have more time to look into it.

Comment 13 by d...@chromium.org, Apr 6 2016

From slave logs for build #0:

2016-03-31 18:33:47-0700 [-] sending app-level keepalive
2016-03-31 18:36:29-0700 [Broker,client] lost remote
2016-03-31 18:36:29-0700 [Broker,client] lost remote
2016-03-31 18:36:29-0700 [Broker,client] lost remote
2016-03-31 18:36:29-0700 [Broker,client] lost remote step
2016-03-31 18:36:29-0700 [Broker,client] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x10d4dfea8>
2016-03-31 18:36:29-0700 [Broker,client] command interrupted, attempting to kill
2016-03-31 18:36:29-0700 [Broker,client] trying to kill process group 995
2016-03-31 18:36:29-0700 [Broker,client]  signal 9 sent successfully
2016-03-31 18:36:29-0700 [Broker,client] Lost connection to master4a.golo.chromium.org:8191
2016-03-31 18:36:29-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x10d409f38> will retry in 2 seconds
2016-03-31 18:36:29-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
2016-03-31 18:36:29-0700 [-] command finished with signal 9, exit code None, elapsedTime: 4481.044728
2016-03-31 18:36:29-0700 [-] would sendStatus but not .running
2016-03-31 18:36:29-0700 [-] SlaveBuilder.commandComplete None
2016-03-31 18:36:31-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
2016-03-31 18:36:31-0700 [-] Connecting to master4a.golo.chromium.org:8191
2016-03-31 18:36:31-0700 [Uninitialized] Connection to master4a.golo.chromium.org:8191 failed: Connection Refused
2016-03-31 18:36:31-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x10d409f38> will retry in 7 seconds
2016-03-31 18:36:31-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
2016-03-31 18:36:39-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
...
2016-03-31 18:49:49-0700 [-] Connecting to master4a.golo.chromium.org:8191
2016-03-31 18:49:49-0700 [Broker,client] message from master: attached

It looks like either the master restarted, or the slave lost master connectivity abruptly.
Yeah those timestamps match up pretty damningly... That said, I don't think that lines up with a master restart, so I think something else is going on.
A row from buildrequests table for jam's build:

   id    | buildsetid |     buildername     | priority | claimed_at |                                claimed_by_name                                 | claimed_by_incarnation | complete | results | submitted_at | complete_at
---------+------------+---------------------+----------+------------+--------------------------------------------------------------------------------+------------------------+----------+---------+--------------+-------------
 1139639 |    1139639 | mac_chromium_rel_ng |        0 | 1459474039 | master4a:/home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.mac | pid3043-boot1459388832 |        0 |      -1 |   1459470108 |

This probably means that the current buildbot process cannot cancel this build because it thinks it belongs to process 3043 (which does not exist).
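
A quick sanity check on master4a along these lines would confirm that the claiming process is gone (just an illustrative sketch, not something that was run as part of this investigation):

# check whether the buildbot process named in the incarnation string still exists
ps -p 3043 -o pid,etime,command || echo "pid 3043 is not running"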
Just saw this. Robbie, I assume you are taking care of this already. I'll be heading home soon.
There are 60 builds like that:

select COUNT(*) from buildrequests where claimed_by_incarnation='pid3043-boot1459388832' and complete=0;

=> 60

Using this method found e.g. https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_10.10_rel_ng/builds/69036
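
For reference, a read-only listing along the same lines (a sketch only; it reuses the incarnation value from the count query above and omits psql connection flags) would enumerate the affected requests so their build URLs can be reconstructed:

# list the build requests still claimed by the dead master incarnation (read-only)
psql -c "select id, buildername from buildrequests where claimed_by_incarnation='pid3043-boot1459388832' and complete=0 order by buildername, id;"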

I am going to cancel them

The root cause seems to be a bad master restart

Comment 18 by d...@chromium.org, Apr 6 2016

Master was restarted via master manager. The restart script waits 10 seconds and then sends kill -9. Kill -9 on the master stops it from updating build state, which is what we're seeing.

IMO never ever ever kill -9 master, even if it takes a year to restart.
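
For illustration, a restart flow that waits for a clean shutdown instead of escalating to SIGKILL could look roughly like this (a sketch only, not the actual master manager script; the 10-minute timeout and the master path are placeholders):

# sketch: stop a buildbot master gracefully and wait for it to exit; never SIGKILL
MASTER_DIR=/home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.mac   # example path
PID=$(cat "$MASTER_DIR/twistd.pid")
kill -TERM "$PID"                        # ask the master to shut down cleanly
for _ in $(seq 1 600); do                # wait up to ~10 minutes, not 10 seconds
  kill -0 "$PID" 2>/dev/null || break    # process gone -> safe to start the new master
  sleep 1
done
if kill -0 "$PID" 2>/dev/null; then
  echo "master still running after timeout; alerting instead of kill -9" >&2
  exit 1
fi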
Cc: dsansome@chromium.org
yeah, it changed recently https://codereview.chromium.org/1754103002
btw modifying the postgres db does not help because build data is stored in pickle files, and unclaiming the build request won't cause the buildbot master to modify an _existing_ build.
Owner: stip@chromium.org
Status: Assigned (was: Untriaged)
Ouch. Thanks for the help nodir/dnj :).

Assigning to stip to fix/triage (e.g. we need to add a real 'waiting for stop' state to master manager instead of kill -9'ing it). To avoid having this happen elsewhere, I won't be restarting any masters with master manager until then :/

Comment 22 by stip@chromium.org, Apr 6 2016

I have a doctor's appt now, won't be able to act on this for about 2 hrs.

Comment 23 by stip@chromium.org, Apr 6 2016

But can we just revert dsansome@'s change?

Comment 24 by stip@chromium.org, Apr 6 2016

I'm back. Investigating.

Comment 25 by stip@chromium.org, Apr 6 2016

I agree that the SIGKILL logic may be causing harm here. I'm curious why the logic was implemented to begin with. I suspect it has to do with crbug.com/588118, for which the core cause should be fixed now. If so, we should remove the SIGKILL logic.
Drive-by comment: we're still seeing duplicate master processes:
https://bugs.chromium.org/p/chromium/issues/detail?id=600952#c5

Heck, there are even two chromium.linux tryservers running right now.

Comment 27 by stip@chromium.org, Apr 6 2016

Ben, did you kill those tryservers? I did a sweep in https://bugs.chromium.org/p/chromium/issues/detail?id=588118 and was about to recommend revert. Your data suggests otherwise.

Comment 28 by stip@chromium.org, Apr 6 2016

Augh, the script I used (infra.tools.list_running_masters) is unreliable for this purpose. There are indeed two tryserver.chromium.linuxes on master4a. Using `ps aux | grep buildbot.tac | awk '{print $2}' | xargs -Ixxx pwdx xxx` from now on.

Comment 29 by stip@chromium.org, Apr 6 2016

So I looked into at least one dupe (tryserver.chromium.linux on master4a):

chrome-bot@master4a:~$ ps aux | grep buildbot.tac | awk '{print $2}' | xargs -Ixxx pwdx xxx 
4308: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.linux
9298: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.linux
9299: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.win
13228: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.custom_tabs_client
13248: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.mac
13336: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.angle
13666: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.mojo
13729: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.syzygy
18306: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.catapult
23380: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.android
25139: No such process
25768: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.pdfium
31443: /home/chrome-bot/buildbot/build/masters/master.tryserver.infra
chrome-bot@master4a:~$ ps -p "4308" -o etime=
124-19:53:31
chrome-bot@master4a:~$ ps -p "9298" -o etime=
   18:04:41

This means that the duplicate tryserver.chromium.linux process is over 124 days old -- it started before any fix had been applied. I suspect the other dupes are the same, although we need to do a sweep of the masters to check. If so, the problem should be over and we can revert dsansome's SIGKILL change.
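
Something like the following would do that sweep (a rough sketch combining the pwdx and etime checks above, not an existing tool), printing each running buildbot master with its working directory and process age so duplicates stand out:

# sketch: list every buildbot master process with its directory and age
for pid in $(pgrep -f buildbot.tac); do
  printf '%8s %15s %s\n' "$pid" "$(ps -p "$pid" -o etime=)" "$(pwdx "$pid" | cut -d' ' -f2-)"
done | sort -k3    # duplicate master directories end up adjacent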

Comment 30 by stip@chromium.org, Apr 6 2016

See comments 15 and 16 on https://bugs.chromium.org/p/chromium/issues/detail?id=588118. It looks like any dupes are significantly old. Proceeding with the revert.

Comment 31 by stip@chromium.org, Apr 6 2016

Status: Started (was: Assigned)
Reverting https://codereview.chromium.org/1754103002/, follow along at https://codereview.chromium.org/1863083003/.

Comment 32 by stip@chromium.org, Apr 6 2016

I'm leaving this open until we can verify that restarts happen without messing up builds.
Labels: -Pri-0 Pri-1
Based on #29 and #31, I assume this is not a P0 anymore. Please bump it back to P0 otherwise.
Labels: -Infra-Troopers
Not sure what troopers could do here. Removing the ticket from the queue.

Comment 35 by stip@chromium.org, Apr 13 2016

Status: Fixed (was: Started)
Without much traffic here, I assume things are working. Marking as fixed.
Components: Infra>CQ
Labels: -Infra-CommitQueue
Components: Infra>Platform>Swarming
Labels: -Infra-Swarming
