Issue 611139

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug

Blocked on:
issue 279618
issue 593172

Blocking:
issue 695268



PFQ master is killing slaves too frequently

Project Member Reported by steve...@chromium.org, May 11 2016

Issue description

This PFQ master run:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-chromium-pfq/builds/2853

Killed one of the slaves after 4.5 hours:

MasterSlaveSyncCompletion:
  start:    0:47:02 median 0:50:02 mean 0:33:37 min 0:12:27 max 0:53:21
  duration: 3:42:21 median 2:34:02 mean 1:36:16 min 0:12:09 max 3:09:49
  finish:   4:29:23 median 3:25:01 mean 2:09:54 min 0:24:37 max 4:00:50

The slave:
https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-chrome-pfq/builds/551

Started 40 minutes after the master (reason unknown), then took 51 minutes to complete the SyncChrome stage instead of the usual 9 or so (mean 0:08:38, median 0:05:52):

SyncChrome:
  start:    0:25:26 median 0:13:35 mean 0:14:19 min 0:11:41 max 0:25:26
  duration: 0:51:02 median 0:05:52 mean 0:08:38 min 0:05:03 max 0:51:02
  finish:   1:16:28 median 0:19:38 mean 0:22:58 min 0:16:44 max 1:16:28

(The SyncChrome log shows no obvious cause; overall system load is the likely culprit.)

The combined additional 80 minutes caused the slave to be killed before it completed.
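
For reference, the "80 minutes" figure can be reproduced from the numbers quoted above. This is only a rough sketch: the 40 minute late start is taken from the description rather than from a log, and `parse_hms` is a helper made up for this example.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Parse an H:MM:SS string as printed in the stage stats above."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Late start of the slave relative to the master (from the description).
late_start = timedelta(minutes=40)

# SyncChrome durations copied from the stats block above.
sync_actual = parse_hms("0:51:02")
sync_mean = parse_hms("0:08:38")
sync_median = parse_hms("0:05:52")

# Extra wall-clock time versus a typical run, measured two ways.
extra_vs_mean = late_start + (sync_actual - sync_mean)
extra_vs_median = late_start + (sync_actual - sync_median)

print("extra vs mean:  ", extra_vs_mean)
print("extra vs median:", extra_vs_median)
```

Measured against the mean this comes to 1:22:24, and against the median to 1:25:10, so "about 80 minutes" holds either way.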

I would like to suggest that we are too aggressive in killing slaves. There have been other similar cases where non-fatal, load-related slowness has caused PFQ slaves to be killed prematurely.

Padding the maximum timeout would likely improve overall reliability with the following caveats:
* We need to ensure that aborting a hung slave will cause the master to complete.
* We need to identify slow builders (i.e. total time > 120% of the median). This would be useful anyway to help identify regressions in the time taken for various stages. Builders that were unusually slow should be reported in the master.
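
A minimal sketch of the "slow builder" check from the second caveat, assuming we have each builder's total time for the current run plus a list of its past total times. None of this is existing cbuildbot code: `find_slow_builders`, the data layout, and the sample numbers are all made up for illustration, and I am assuming "median" means the builder's own historical median.

```python
from statistics import median


def find_slow_builders(current, history, threshold=1.2):
    """Return names of builders whose current total time exceeds
    threshold * median of their own past total times.

    current: dict of builder name -> total seconds for this run.
    history: dict of builder name -> list of past total seconds.
    Builders with no history are skipped, since there is no baseline.
    """
    slow = []
    for name in sorted(current):
        past = history.get(name)
        if past and current[name] > threshold * median(past):
            slow.append(name)
    return slow


# Illustrative numbers only (seconds), not taken from real builds.
history = {
    "x86-alex-chrome-pfq": [9000, 9242, 9500],
    "hypothetical-builder": [6000, 6100, 6200],
}
current = {
    "x86-alex-chrome-pfq": 13341,
    "hypothetical-builder": 6150,
}

slow_builders = find_slow_builders(current, history)
print("Unusually slow builders:", slow_builders)
```

The master could print a list like this in its report stage, so a slow but healthy slave is visible without being killed.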


 
The deadline comes from the master builder's timeout, which is set in cbuildbot config (I believe the master sets a slave deadline that is a minute or two less than its own deadline).

If you want to extend this, you just need to increase the master timeout.
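
To illustrate the relationship described above, here is a rough sketch of how a slave deadline could follow from the master timeout. It is not the actual cbuildbot logic: `slave_deadline` and the 120 second margin are assumptions based only on the "minute or two less" remark, and 16200 seconds is the 4.5 hour timeout in effect for the run in this report.

```python
from datetime import datetime, timedelta, timezone


def slave_deadline(master_start, master_timeout_s, margin_s=120):
    """Deadline handed to slaves: the master's own deadline minus a
    small margin (hypothetical value) so the master can still wrap up."""
    return master_start + timedelta(seconds=master_timeout_s - margin_s)


# Hypothetical master start time; 16200 s is the 4.5 hour timeout.
start = datetime(2016, 5, 11, 0, 0, 0, tzinfo=timezone.utc)
deadline = slave_deadline(start, 16200)
print("slaves must finish by", deadline.isoformat())
```

Under this model, raising the master timeout moves every slave deadline by the same amount, which is why changing the one setting is enough.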
Owner: steve...@chromium.org
Cc: abodenha@chromium.org

Comment 4 by benhenry@google.com, Jun 21 2016

Status: Assigned (was: Untriaged)
There's an owner on this bug but the status != Assigned. Fixing. If you feel you don't own this bug, please remove yourself as the owner and mark it as "Available" or "Untriaged".
Components: Infra>Client>ChromeOS
Components: -Infra
Blockedon: 279618 593172
Cc: afakhry@chromium.org derat@chromium.org
Summary: PFQ master is killing slaves too frequently (was: PFQ master is too aggressive killing slaves)
This came up again recently in  issue 695268 .

The master timeout needs to be set here as 'build_timeout':

https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/chromeos_config.py?q=chromeos_config.py&l=2821

The default is 16200 seconds (4.5 hours), set here (although I thought this was generated so I'm not entirely sure where it comes from):
https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/config_dump.json?q=config_dump.json

We should set it to 6 hours (21600 seconds) while we look into various ways to improve PFQ build times (issues linked).
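
A quick sanity check of the proposed value, as a sketch. The constant names are made up for this example; the 80 minute delay is the figure from the original report, and covering it does not guarantee that any particular slave would have finished.

```python
OLD_TIMEOUT_S = 16200        # current default, 4.5 hours
NEW_TIMEOUT_S = 6 * 60 * 60  # proposed 'build_timeout', 6 hours

# Extra time slaves would get before the master gives up, in minutes.
added_headroom_min = (NEW_TIMEOUT_S - OLD_TIMEOUT_S) // 60

# Delay seen on the x86-alex-chrome-pfq run in the original report.
OBSERVED_DELAY_MIN = 80
covers_observed_delay = added_headroom_min >= OBSERVED_DELAY_MIN

print(NEW_TIMEOUT_S, added_headroom_min, covers_observed_delay)
```

So the change adds 90 minutes of headroom, slightly more than the delay that got the slave killed in the original report.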


Blocking: 695268
Cc: lhchavez@chromium.org khmel@chromium.org
Project Member

Comment 10 by bugdroid1@chromium.org, Mar 4 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/6735830d9db646296d3d76412245ec07980a9810

commit 6735830d9db646296d3d76412245ec07980a9810
Author: Ahmed Fakhry <afakhry@google.com>
Date: Sat Mar 04 00:50:34 2017

[chromeos_config]: Increase the master PFQ timeout to 6 hours.

Recently some builders started to timeout due to increased
BuildPackages and HWTest times. This CL increase this timeout to
6 hours from 4.5 hours.

BUG= chromium:611139 ,  chromium:695268 
TEST=none

Change-Id: I27bfd348c1fe9c61d2f5a24084777656620f82ef
Reviewed-on: https://chromium-review.googlesource.com/448109
Trybot-Ready: Ahmed Fakhry <afakhry@chromium.org>
Tested-by: Ahmed Fakhry <afakhry@chromium.org>
Reviewed-by: Ahmed Fakhry <afakhry@chromium.org>

[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/config_dump.json
[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/chromeos_config_unittest.py
[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/chromeos_config.py

Labels: Pri-3
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS
Status: Fixed (was: Assigned)
