trybot doesn't see that webgl_conformance_tests completed
Issue description
Context: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/203575
I'm not sure what's going on here, but it looks like the webgl_conformance_tests swarming task completed successfully in 10 minutes, but the trybot never noticed. It's stuck on pending (yellow) 9 hours later.
Apr 5 2016

Apr 5 2016
Hm. Looks like flake to me. The Commit Queue team should triage and investigate.
Apr 5 2016
(flake of the infrastructure... it looks like the recipe just sort of... stopped. I'm not sure if this was the slave, buildbot, recipes, CQ or what).
Apr 5 2016
Given that the tryjob is the one that's stopped, how is this related to CQ? Seems like a problem in the recipe?
Apr 5 2016
Assigning an owner for now.
Apr 6 2016
I was thinking that the CQ team ~maintains this recipe, but you're right, I should have assigned to an actual owner. Pawel, can you look into this?
Apr 6 2016

Apr 6 2016
I believe we are seeing this issue across multiple trybots and multiple different tests. Attached a file showing seemingly skipped runs for https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng?numbuilds=200
Apr 6 2016
Please note the original build https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/203575 was not using logdog. FWIW, another example build that was in this state is https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/207636. This doesn't seem to be related to a specific recipe; I suspect something in buildbot and/or the annotator. It might be quite difficult to debug. Also see https://goto.google.com/moofe (Google-internal).
Apr 6 2016
I'm bumping up to P0 since this is a serious regression.
Apr 6 2016
Moving to the trooper queue. It's close to EOD for me, and I don't have cycles for an urgent initial investigation. Feel free to move this back to me and/or the CQ queue once we know something more. I can also take a more detailed look tomorrow when I have more time.
Apr 6 2016
From slave logs for build #0:

2016-03-31 18:33:47-0700 [-] sending app-level keepalive
2016-03-31 18:36:29-0700 [Broker,client] lost remote
2016-03-31 18:36:29-0700 [Broker,client] lost remote
2016-03-31 18:36:29-0700 [Broker,client] lost remote
2016-03-31 18:36:29-0700 [Broker,client] lost remote step
2016-03-31 18:36:29-0700 [Broker,client] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x10d4dfea8>
2016-03-31 18:36:29-0700 [Broker,client] command interrupted, attempting to kill
2016-03-31 18:36:29-0700 [Broker,client] trying to kill process group 995
2016-03-31 18:36:29-0700 [Broker,client] signal 9 sent successfully
2016-03-31 18:36:29-0700 [Broker,client] Lost connection to master4a.golo.chromium.org:8191
2016-03-31 18:36:29-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x10d409f38> will retry in 2 seconds
2016-03-31 18:36:29-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
2016-03-31 18:36:29-0700 [-] command finished with signal 9, exit code None, elapsedTime: 4481.044728
2016-03-31 18:36:29-0700 [-] would sendStatus but not .running
2016-03-31 18:36:29-0700 [-] SlaveBuilder.commandComplete None
2016-03-31 18:36:31-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
2016-03-31 18:36:31-0700 [-] Connecting to master4a.golo.chromium.org:8191
2016-03-31 18:36:31-0700 [Uninitialized] Connection to master4a.golo.chromium.org:8191 failed: Connection Refused
2016-03-31 18:36:31-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x10d409f38> will retry in 7 seconds
2016-03-31 18:36:31-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
2016-03-31 18:36:39-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x10d4d26c8>
...
2016-03-31 18:49:49-0700 [-] Connecting to master4a.golo.chromium.org:8191
2016-03-31 18:49:49-0700 [Broker,client] message from master: attached

It looks like either the master restarted, or the slave lost master connectivity abruptly.
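As a minimal sketch for spotting the same disconnect/kill pattern on other slaves, assuming the usual buildslave layout (SLAVE_DIR below is a placeholder path, not taken from this build):

# Sketch only: scan a slave's twistd.log for the disconnect/kill pattern above.
# SLAVE_DIR is a placeholder; point it at the buildslave base directory.
SLAVE_DIR=/b/build/slave
grep -nE 'lost remote|signal 9 sent|Lost connection to master' "$SLAVE_DIR"/twistd.log* | tail -n 50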
Apr 6 2016
Yeah those timestamps match up pretty damningly... That said, I don't think that lines up with a master restart, so I think something else is going on.
Apr 6 2016
The row from the buildrequests table for jam's build:

id: 1139639
buildsetid: 1139639
buildername: mac_chromium_rel_ng
priority: 0
claimed_at: 1459474039
claimed_by_name: master4a:/home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.mac
claimed_by_incarnation: pid3043-boot1459388832
complete: 0
results: -1
submitted_at: 1459470108
complete_at: (empty)

This probably means that the current buildbot process cannot cancel this build, because it thinks the build belongs to process 3043 (which does not exist).
Apr 6 2016
Just saw this. Robbie, I assume you are taking care of this already. I'll be heading home soon.
Apr 6 2016
There are 60 builds like that:

select COUNT(*) from buildrequests where claimed_by_incarnation='pid3043-boot1459388832' and complete=0;
=> 60

Using this method I found, e.g., https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_10.10_rel_ng/builds/69036. I am going to cancel them. The root cause seems to be a bad master restart.
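A minimal sketch of listing the affected requests directly, assuming psql access to the buildbot database on the master (the database name "buildbot" is an assumption, not taken from this thread):

# Sketch only: list the build requests still claimed by the dead incarnation.
psql buildbot -c "
  SELECT id, buildername, submitted_at
  FROM buildrequests
  WHERE claimed_by_incarnation = 'pid3043-boot1459388832'
    AND complete = 0
  ORDER BY submitted_at;"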
Apr 6 2016
The master was restarted via master manager. The restart script waits 10 seconds and then sends kill -9. Kill -9 on a master stops it from updating state, which is what we're seeing. IMO, never ever kill -9 a master, even if it takes a year to restart.
Apr 6 2016

Apr 6 2016
BTW, modifying the postgres DB does not help, because build files are stored in pickle files and unclaiming the build request won't cause the buildbot master to modify an _existing_ build.
Apr 6 2016
Ouch. Thanks for the help nodir/dnj :). Assigning to stip to fix/triage (e.g. we need to add a real 'waiting for stop' state to master manager instead of kill -9'ing it). To avoid having this happen elsewhere, I won't be restarting any masters with master manager until then :/
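To make that concrete, a rough sketch of what a "waiting for stop" loop could look like in place of the timed kill -9 (the path and the use of twistd.pid are assumptions, not the actual master manager code):

# Sketch only, not the real master manager logic: stop the master and
# poll until the process is actually gone, with no SIGKILL escalation.
MASTER_DIR=/home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.mac  # example path
PID="$(cat "$MASTER_DIR/twistd.pid")"
kill -TERM "$PID"                      # ask buildbot to shut down cleanly
while kill -0 "$PID" 2>/dev/null; do   # still running?
  echo "waiting for master (pid $PID) to stop..."
  sleep 10
done
echo "master stopped; safe to restart"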
Apr 6 2016
I have a doctor's appt now, won't be able to act on this for about 2 hrs.
Apr 6 2016
But can we just revert dsansome@'s change?
Apr 6 2016
I'm back. Investigating.
Apr 6 2016
I agree that the SIGKILL logic may be causing harm here. I'm curious why the logic was implemented to begin with. I suspect it has to do with crbug.com/588118, for which the core cause should be fixed now. If so, we should remove the SIGKILL logic.
Apr 6 2016
Drive-by comment: we're still seeing duplicate master processes: https://bugs.chromium.org/p/chromium/issues/detail?id=600952#c5 Heck, there's even two chromium.linux tryservers running right now.
Apr 6 2016
Ben, did you kill those tryservers? I did a sweep in https://bugs.chromium.org/p/chromium/issues/detail?id=588118 and was about to recommend revert. Your data suggests otherwise.
Apr 6 2016
Augh, the script I used (infra.tools.list_running_masters) is unreliable for this purpose. There are indeed two tryserver.chromium.linuxes on master4a. Using `ps aux | grep buildbot.tac | awk '{print $2}' | xargs -Ixxx pwdx xxx` from now on.
Apr 6 2016
So I looked into at least one dupe (tryserver.chromium.linux on master4a):
chrome-bot@master4a:~$ ps aux | grep buildbot.tac | awk '{print $2}' | xargs -Ixxx pwdx xxx
4308: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.linux
9298: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.linux
9299: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.win
13228: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.custom_tabs_client
13248: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.mac
13336: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.angle
13666: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.mojo
13729: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.syzygy
18306: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.catapult
23380: /home/chrome-bot/buildbot/build/masters/master.tryserver.chromium.android
25139: No such process
25768: /home/chrome-bot/buildbot/build/masters/master.tryserver.client.pdfium
31443: /home/chrome-bot/buildbot/build/masters/master.tryserver.infra
chrome-bot@master4a:~$ ps -p "4308" -o etime=
124-19:53:31
chrome-bot@master4a:~$ ps -p "9298" -o etime=
18:04:41
This means that the dupe tryserver.chromium.linux is over 124 days old -- before any fix had been applied. I suspect that other dupes are the same, although we need to do a sweep of masters to check. If so, I think this means the problem should be over and we can revert dsansome's SIGKILL.
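Building on the pwdx sweep above, a small sketch that surfaces duplicate masters automatically (same master directory with more than one live process):

# Sketch only: print master directories that have more than one running process.
ps aux | grep '[b]uildbot\.tac' | awk '{print $2}' | xargs -Ixxx pwdx xxx \
  | awk '{print $2}' | sort | uniq -d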
Apr 6 2016
See comments 15 and 16 on https://bugs.chromium.org/p/chromium/issues/detail?id=588118. It looks like any dupes are significantly old. Proceeding with the revert.
Apr 6 2016
Reverting https://codereview.chromium.org/1754103002/, follow along at https://codereview.chromium.org/1863083003/.
Apr 6 2016
I'm leaving this open until we can verify that restarts happen without messing up builds.
Apr 8 2016
Based on #29 and #31, I assume this is not a P0 anymore. Please bump it back to P0 otherwise.
Apr 13 2016
Not sure what troopers could do here. Removing the ticket from the queue.
Apr 13 2016
Without much traffic I can assume things are working. Marking as fixed.
Apr 26 2016

Apr 26 2016