cyan-chrome-pfq timing out due to long build_packages step
Issue description
https://luci-milo.appspot.com/buildbot/chromeos/cyan-chrome-pfq/808
We see that we failed during HWTest:
@@@STEP_FAILURE@@@
16:47:38: ERROR: Timeout occurred- waited 16000 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
@@@STEP_FAILURE@@@
16:47:38: ERROR: Timeout occurred- waited 15764 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
16:47:38: INFO: Running cidb query on pid 14005, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x3c74410>
16:47:39: INFO: Running cidb query on pid 14005, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x3762290>
But, the HWTest stage only ran for ~15 mins. We spend nearly 2 hours running build_packages, probably because we're building chrome from source.
,
Feb 27 2017
I'm not sure that this is due to issue 695172, nor that it is necessarily related to the BuildPackages step per se. BuildPackages does take a very long time, and may have gone up across the board at some point (we should investigate that). However, from a quick glance, it takes just as long on other builders. It does appear that we may be taking longer in general, and cyan runs the most tests so tends to take the longest. We need to consider a short term fix (e.g. increase the master timeout), and also investigate why the builders are taking so long in general. +derat@ and jrbarnette@ who have been investigating PFQ cycle times. ->afakhry@ (current gardener)
,
Feb 28 2017
It looks like this build spent about 114 minutes on BuildPackages and another 80 minutes on SimpleChromeArtifacts. As far as I know, we'll always need to run both of those stages serially on the PFQ builders, so even if we're able to cut down the time spent in HWTest (HWTest [sanity] took almost 100 minutes), it feels like we're still going to be awfully close to the current limit (which looks like it's about 267 minutes). Will we be able to use GOMA to make either or both of these stages complete in less time?
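To make the arithmetic above concrete, here is a rough sketch of the time budget. It uses only the figures quoted in this comment; the function name is made up for illustration, and it deliberately ignores every other stage, so the real headroom is smaller than what it prints.

```python
# Figures quoted in the comment above, in minutes.
BUILD_PACKAGES_MIN = 114
SIMPLE_CHROME_ARTIFACTS_MIN = 80
HWTEST_SANITY_MIN = 100   # "almost 100 minutes"
MASTER_LIMIT_MIN = 267    # approximate limit (about 16000 seconds)


def remaining_budget(limit_min, serial_stage_minutes):
    """Minutes left after running the given stages back to back.

    Hypothetical helper: it only subtracts the listed stages and
    does not model sync, image building, or any other stage.
    """
    return limit_min - sum(serial_stage_minutes)


serial_stages = [BUILD_PACKAGES_MIN, SIMPLE_CHROME_ARTIFACTS_MIN]
left = remaining_budget(MASTER_LIMIT_MIN, serial_stages)
print(f"Left after the two serial stages: {left} min")
print(f"HWTest [sanity] fits in what is left: {HWTEST_SANITY_MIN <= left}")
```

Under these assumptions the two serial stages alone use 194 of the roughly 267 available minutes.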
,
Feb 28 2017
Most recent failure is due to the same timeout: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-chrome-pfq/builds/826 Steven, there used to be a ReportStage step in the build that showed a timeline graph. Was that removed?
,
Feb 28 2017
It's there ('Build stages timeline') in other builds, just not in build #826, for reasons I don't understand.
,
Mar 1 2017
BuildPackages seem to take the majority of the time, at least for build: https://00e9e64bac2c953341ef1691bcb4988a2b674130f73f09b1e1-apidata.googleusercontent.com/download/storage/v1/b/chromeos-image-archive/o/cyan-chrome-pfq%2FR58-9327.0.0-rc2%2Ftimeline-stages.html?qk=AD5uMEtU36HNkOg3AkH_B_VBBAtFYhj-GlCwYXxkk5pVeJU5AlXJwcEVoyQtDS_6S8ibXKMwOr9NS8ibpqtT7WJ8eR9-zzKkCzPYZZ9sdTJGJNudylMMRiJ07coZMKEp27gszQp-iS-vEsM_OPHbZrUzWZeiMJ1waJ7L89RnI-DO1ifES2cX5BJXW4qgAYAb8utCSJCkX8pdZe-KnQsBpfaelAYIBrXIfo3rPNcrlSEgLU2hAI8iEa-5L6ygX3UY1bcac5O4jRQOHTNppoTODERF9vqTWxCLcXz2WKpHWJdusGDkn9TNMhLulsIYkD5FBYrIk-dxvNgTPqY4SEJpavrMBu89esqZ2rLNMp-nuad6mDVPkwh1HVqWe1EU0O-uclEEq6mwk_qX_8bAKrOhaBzbMSIXHGwmZupi7HGG6fTeZ4tzbqB_m8IV93W4OI1UKxUumS4gi6Z9djt7B8tFh8xlIgzHSp6qdxuFrBt_anCRJbhwd31APRcGU69ioqLn8d_uF9LnTAJZETx8eztUMq_w3I7UCI4ULYkumrxGYfffif0AebPfKZV3LFqYm6o-MXz4-sPGetnuTstHeqxqS1Cbd_ogE2mSM0psCSIgto7MsAydgKHz5ZgZam0eImDcW7JFI92a1bQ2TV15hFNxUtxkbafe3RFSlYejg5ox0-8CkTEiJG3Bbj86H40NkbG0fv-e-6lasO4-W0jAasmodzmu_UrY3Dxi3lCb50imPL16_gPwlBETLVfyexPM1W8OrN6gEkjRw0FkmZMUmEdGYjhaaGjjfSaO-b7raLVItl8wMTfqHcOAE9ZdJEMhZppa69vWMt7T_HpCXg_vlNnmxw1sflQlWV-wqA As a temporary workaround, can we increase the timeout deadline set by the master builder for cyan? Is the timeout set per builder, or is it a single value for all builders?
,
Mar 1 2017
There's a single master deadline that is shared by all slaves. Yep, you can increase it, the knob is somewhere in chromeos_config I believe.
,
Mar 1 2017
I can see a bunch of timeout overrides for cyan-chrome-pfq in chromite/cbuildbot/config_dump.json (https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/config_dump.json?type=cs&q=cyan-chrome-pfq+p:chromeos_public&l=6481-6486), but I'm not sure whether these are the ones I should be changing?
,
Mar 1 2017
Or maybe this is the value that should be changed: https://cs.corp.google.com/chromeos_public/chromite/lib/config_lib.py?type=cs&q=ASYNC_HW_TEST_TIMEOUT+p:chromeos_public&l=404?
,
Mar 1 2017
Looking at the history of cyan builders: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-chrome-pfq?numbuilds=200 We appear to have been pushing the edge of the master timeout at least as far back as September: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-chrome-pfq/builds/215 BuildPackages appears to always take just under 2 hours, although it appears to have slowly crept up from ~1:40 to ~2:00. HWTest shows the most variation, with times ranging from 0:20 to 1:15. The config_dump.json file is generated, and what you really want to change is the master timeout, not the slaves. akeshet@, jrbarnette@, can you please help us track down where the master sets the slave timeouts?
,
Mar 1 2017
Apologies, I wasn't very clear in that last comment: 1) We appear to have been closer than comfortable to the timeout for a while now but have recently crept up to the point where the slaves (cyan in particular, which runs the most tests) are taking too long. There is no obvious "smoking gun" here - build times and test times both appear to have crept up. 2) This needs a better fix (i.e. we should do less) but for now we should figure out how to increase the master timeout.
,
Mar 1 2017
BTW, the specific failing error is:
@@@STEP_FAILURE@@@
06:22:28: ERROR: Timeout occurred- waited 15595 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
06:22:28: INFO: Running cidb query on pid 30746, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x5370d50>
15595 seconds is just under 4.5 hours (about 4 hours 20 minutes), and I recall 4.5 hours being the master timeout, but I cannot for the life of me recall where that is set. That error message is set here: https://cs.corp.google.com/chromeos_public/chromite/scripts/cbuildbot.py?q=p:chromeos_public+%22timeout+deadline+set+by+the+master%22&l=1308 I am struggling to follow that code. It appears to get the timeout from CIDB, but I can't figure out where that gets set. akeshet@ - git blames you for that message :)
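For reference, a small sketch that converts the "waited N seconds" values seen in this thread into hours, minutes, and seconds, next to the 16200-second (4.5 hour) figure. The helper name is made up for illustration and is not part of chromite.

```python
def format_duration(seconds):
    """Format a whole number of seconds as H:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# "waited" values from the logs in this thread, plus 4.5 hours.
for waited in (15595, 15764, 16000, 16200):
    print(waited, "seconds =", format_duration(waited))
```

All three logged waits come out a little short of 4:30:00, which would be consistent with a 4.5 hour deadline counted from the master's start rather than from the slave's start. That reading is an inference from the numbers, not something confirmed in the code.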
,
Mar 1 2017
akeshet@ suggested that we need to change the master-chromium-pfq config here: https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/chromeos_config.py?q=chromeos_config.py&l=2821 I think we need to set build_timeout to something > 16200 (4.5 hours). The default is here btw: https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/config_dump.json?q=_dump+file:%5Echromite/+package:%5Echromeos_public$&l=23 I'm not sure where that value comes from exactly; I thought that file was generated? I would suggest upping the timeout to 6 hours - I would rather investigate why individual stages are spinning for a ridiculous time than run into this problem again.
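As an illustration of the proposed change, here is a minimal sketch of overriding a default build_timeout for one config. The dict layout and the with_overrides helper are hypothetical stand-ins, not the actual chromeos_config.py API; only the build_timeout name, the master-chromium-pfq name, and the 4.5 hour and 6 hour values come from this thread.

```python
# Values discussed in this thread, expressed in seconds.
DEFAULT_BUILD_TIMEOUT_S = int(4.5 * 60 * 60)   # 16200
PROPOSED_BUILD_TIMEOUT_S = 6 * 60 * 60         # 21600


def with_overrides(base, **overrides):
    """Return a copy of `base` with the given settings replaced.

    Hypothetical helper; it leaves the shared defaults untouched.
    """
    merged = dict(base)
    merged.update(overrides)
    return merged


# Stand-in for the shared defaults that every config inherits.
default_config = {"build_timeout": DEFAULT_BUILD_TIMEOUT_S}

# Stand-in for the master-chromium-pfq entry with a longer deadline.
master_chromium_pfq = with_overrides(
    default_config, build_timeout=PROPOSED_BUILD_TIMEOUT_S)

print(master_chromium_pfq["build_timeout"])
```

Keeping the override on the master entry alone, rather than editing the shared default, would leave other masters on the 4.5 hour deadline.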
,
Mar 1 2017
This is already being tracked as issue 611139 , we should continue the discussion there.
,
Mar 1 2017
,
Mar 4 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/6735830d9db646296d3d76412245ec07980a9810
commit 6735830d9db646296d3d76412245ec07980a9810
Author: Ahmed Fakhry <afakhry@google.com>
Date: Sat Mar 04 00:50:34 2017
[chromeos_config]: Increase the master PFQ timeout to 6 hours.
Recently some builders started to timeout due to increased BuildPackages and HWTest times. This CL increase this timeout to 6 hours from 4.5 hours.
BUG= chromium:611139 , chromium:695268
TEST=none
Change-Id: I27bfd348c1fe9c61d2f5a24084777656620f82ef
Reviewed-on: https://chromium-review.googlesource.com/448109
Trybot-Ready: Ahmed Fakhry <afakhry@chromium.org>
Tested-by: Ahmed Fakhry <afakhry@chromium.org>
Reviewed-by: Ahmed Fakhry <afakhry@chromium.org>
[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/config_dump.json
[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/chromeos_config_unittest.py
[modify] https://crrev.com/6735830d9db646296d3d76412245ec07980a9810/cbuildbot/chromeos_config.py
,
Mar 10 2017
We have increased the master PFQ builder timeout to 6 hours. cyan-chrome-pfq hasn't been timing out since.
,
May 30 2017
,
Aug 1 2017
,
Jan 22 2018
Comment 1 by sha...@chromium.org
, Feb 23 2017