PFQ master is killing slaves too frequently
Issue description

This PFQ master run:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-chromium-pfq/builds/2853

Killed one of the slaves after 4.5 hours:

MasterSlaveSyncCompletion:
  start: 0:47:02 median 0:50:02 mean 0:33:37 min 0:12:27 max 0:53:21
  duration: 3:42:21 median 2:34:02 mean 1:36:16 min 0:12:09 max 3:09:49
  finish: 4:29:23 median 3:25:01 mean 2:09:54 min 0:24:37 max 4:00:50

The slave:
https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-chrome-pfq/builds/551

Started 40 minutes after the master (reason unknown), then took 51 minutes to complete the SyncChrome stage instead of the usual 9:

SyncChrome:
  start: 0:25:26 median 0:13:35 mean 0:14:19 min 0:11:41 max 0:25:26
  duration: 0:51:02 median 0:05:52 mean 0:08:38 min 0:05:03 max 0:51:02
  finish: 1:16:28 median 0:19:38 mean 0:22:58 min 0:16:44 max 1:16:28

(Looking at the SyncChrome log, there is no obvious cause except likely overall system load.)

The combined additional 80 minutes caused the slave to be killed before it completed.

I would like to suggest that we are too aggressive in killing slaves. There have been other similar cases where non-fatal, load-related slowness has caused PFQ slaves to be killed prematurely.

Padding the maximum timeout would likely improve overall reliability, with the following caveats:

* We need to ensure that aborting a hung slave will cause the master to complete.
* We need to identify slow builders (i.e. total time > 120% of the median). This would be useful anyway to help identify regressions in the time taken for various stages. Builders that were unusually slow should be reported in the master.
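The "total time > 120% of the median" check suggested above could be sketched roughly as follows. This is only an illustration, not chromite code: `parse_hms`, `find_slow_builders`, and the builder name `hypothetical-fast-pfq` are made up for the example, and the times are the finish times quoted in the report.

```python
from datetime import timedelta

# Threshold taken from the issue text: total time > 120% of the median.
SLOW_FACTOR = 1.2


def parse_hms(text):
    """Parse an 'H:MM:SS' string, as printed in the stage stats, into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def find_slow_builders(finish_times, median, factor=SLOW_FACTOR):
    """Return the sorted names of builders whose finish time exceeds factor * median.

    finish_times maps a builder name to a timedelta. This helper is
    hypothetical and only illustrates the proposed check.
    """
    limit = median * factor
    return sorted(name for name, t in finish_times.items() if t > limit)


# Finish time and median from the MasterSlaveSyncCompletion stats above.
# The second builder is invented for contrast.
median = parse_hms("3:25:01")
finish_times = {
    "x86-alex-chrome-pfq": parse_hms("4:29:23"),
    "hypothetical-fast-pfq": parse_hms("3:25:01"),
}
slow = find_slow_builders(finish_times, median)
print(slow)
```

With these numbers the limit is about 4:06:01, so the x86-alex slave (4:29:23) would be flagged as slow rather than silently killed.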
,
May 11 2016
,
May 11 2016
,
Jun 21 2016
There's an owner on this bug but the status != Assigned. Fixing. If you feel you don't own this bug, please remove yourself as the owner and mark it as "Available" or "Untriaged".
,
Jun 21 2016
,
Jun 21 2016
,
Mar 1 2017
This came up again recently in issue 695268.

The master timeout needs to be set here as 'build_timeout':
https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/chromeos_config.py?q=chromeos_config.py&l=2821

The default is 16200, set here (although I thought this was generated, so I'm not entirely sure where it comes from):
https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/config_dump.json?q=config_dump.json

We should set it to 6 hours (21600) while we look into various ways to improve PFQ build times (issues linked).
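For reference, the two timeout values are in seconds. A quick sanity check of the conversion, with a hypothetical override dict (only the 'build_timeout' key comes from this thread; the variable names and dict shape are assumptions, not the actual chromeos_config.py layout):

```python
from datetime import timedelta

OLD_TIMEOUT = 16200  # current default, in seconds
NEW_TIMEOUT = 6 * 60 * 60  # proposed value, in seconds

print(timedelta(seconds=OLD_TIMEOUT))  # 4:30:00
print(timedelta(seconds=NEW_TIMEOUT))  # 6:00:00

# Hypothetical shape of the override; the real config structure may differ.
master_pfq_overrides = {"build_timeout": NEW_TIMEOUT}
```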
,
Mar 1 2017
,
Mar 3 2017
,
Mar 4 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/6735830d9db646296d3d76412245ec07980a9810

commit 6735830d9db646296d3d76412245ec07980a9810
Author: Ahmed Fakhry <afakhry@google.com>
Date: Sat Mar 04 00:50:34 2017

[chromeos_config]: Increase the master PFQ timeout to 6 hours.

Recently some builders started to timeout due to increased BuildPackages and HWTest times. This CL increase this timeout to 6 hours from 4.5 hours.

BUG=chromium:611139, chromium:695268
TEST=none

Change-Id: I27bfd348c1fe9c61d2f5a24084777656620f82ef
Reviewed-on: https://chromium-review.googlesource.com/448109
Trybot-Ready: Ahmed Fakhry <afakhry@chromium.org>
Tested-by: Ahmed Fakhry <afakhry@chromium.org>
Reviewed-by: Ahmed Fakhry <afakhry@chromium.org>

[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/config_dump.json
[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/chromeos_config_unittest.py
[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/chromeos_config.py
,
Mar 8 2018
,
Mar 30 2018
,
Mar 30 2018
,
Jun 14 2018
Comment 1 by akes...@chromium.org, May 11 2016