chromeos-server11 is dragging down tick time, and causing timeouts.
Issue description

I am looking at build failures, and some are due to timeouts. I already mentioned that problem in https://bugs.chromium.org/p/chromium/issues/detail?id=583014#c30, but on the strago suite the timeout errors are dwarfed by the eMMC error.

I take leon (slippy) as an example. Currently bvt_timeout is set to 120 minutes (I am not sure where), but that's not enough to complete the test. See http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=60889236

Take that suite as an example. Using its status.log, we see that 6 hosts are used:

  13 chromeos4-row1-rack11-host3
  11 chromeos4-row1-rack9-host1
   8 chromeos4-row1-rack9-host5
   7 chromeos4-row1-rack8-host3
   3 chromeos4-row1-rack8-host1
   3 chromeos4-row1-rack10-host5

Taking chromeos4-row1-rack9-host1 as an example, the test start dates are:

  chromeos4-row1-rack9-host1/platform_DMVerityBitCorruption.first  : Apr 22 08:04:19
  chromeos4-row1-rack9-host1/platform_DMVerityBitCorruption.middle : Apr 22 08:04:35
  chromeos4-row1-rack9-host1/platform_DMVerityBitCorruption.last   : Apr 22 08:04:52
  chromeos4-row1-rack9-host1/provision_AutoUpdate.double           : Apr 22 08:46:01
  chromeos4-row1-rack9-host1/security_NetworkListeners             : Apr 22 09:02:39
  chromeos4-row1-rack9-host1/logging_CrashSender                   : Apr 22 09:10:18
  chromeos4-row1-rack9-host1/security_ChromiumOSLSM                : Apr 22 09:17:25
  chromeos4-row1-rack9-host1/security_EnableChromeTesting          : Apr 22 09:23:14
  chromeos4-row1-rack9-host1/security_Firewall                     : Apr 22 09:29:48
  chromeos4-row1-rack9-host1/security_RootfsStatefulSymlinks       : Apr 22 09:36:54
  chromeos4-row1-rack9-host1/platform_FilePerms                    : Apr 22 09:43:06

Notice the delay of more than 40 minutes between platform_DMVerityBitCorruption.last and provision_AutoUpdate.double on chromeos4-row1-rack9-host1. Looking into cautotest, I don't see any job/reset/provision between 8:11 and 8:42. Why was that machine idle?

If the scheduler was overloaded, we should consider increasing the timeouts of these simple suites, like bvt-inline or HWTest.
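For anyone repeating this check on another host, here is a minimal sketch of the gap computation. It is not part of autotest. The timestamps are copied from the list above, the year 2016 is assumed from the comment dates, and the helper names (`parse`, `gaps`, `largest_gap`) are made up for illustration. It only compares start times, so each gap includes the run time of the earlier test and is an upper bound on idle time.

```python
from datetime import datetime

# Start times copied from the status.log excerpt for chromeos4-row1-rack9-host1.
STARTS = [
    ("platform_DMVerityBitCorruption.first", "Apr 22 08:04:19"),
    ("platform_DMVerityBitCorruption.middle", "Apr 22 08:04:35"),
    ("platform_DMVerityBitCorruption.last", "Apr 22 08:04:52"),
    ("provision_AutoUpdate.double", "Apr 22 08:46:01"),
    ("security_NetworkListeners", "Apr 22 09:02:39"),
    ("logging_CrashSender", "Apr 22 09:10:18"),
    ("security_ChromiumOSLSM", "Apr 22 09:17:25"),
    ("security_EnableChromeTesting", "Apr 22 09:23:14"),
    ("security_Firewall", "Apr 22 09:29:48"),
    ("security_RootfsStatefulSymlinks", "Apr 22 09:36:54"),
    ("platform_FilePerms", "Apr 22 09:43:06"),
]


def parse(ts, year=2016):
    """Parse a 'Mon DD HH:MM:SS' stamp; the year is an assumption."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")


def gaps(starts):
    """Return (previous test, next test, time between their starts)."""
    times = [(name, parse(ts)) for name, ts in starts]
    return [(p, c, t1 - t0) for (p, t0), (c, t1) in zip(times, times[1:])]


def largest_gap(starts):
    """Return the consecutive pair of tests with the longest start-to-start gap."""
    return max(gaps(starts), key=lambda g: g[2])


if __name__ == "__main__":
    prev, cur, delta = largest_gap(STARTS)
    print(f"{prev} -> {cur}: {delta}")
```

Run on the data above, the largest start-to-start gap is the one between platform_DMVerityBitCorruption.last and provision_AutoUpdate.double.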
,
Apr 25 2016
,
Apr 25 2016
I just discovered that chromeos-server11.mtv was misconfigured; it was running suite_scheduler, which is incorrect, since it was a drone. That same drone was slow on the drone refresh check which drove up the tick time, causing long waits in between tests. I've killed the errant suite_scheduler. I want to wait and see what happens next, before we change timeouts.
,
Apr 26 2016
,
Apr 27 2016
The drone seems not to have improved after stopping suite_scheduler. We reduced the maximum jobs on the drone from 400 to 200, and that also seems to have had minimal impact. More work is needed, but as best I can tell, the right fix is to get chromeos-server11 to a reasonable response time, or else remove it. Adjusting the suite timeout is undesirable.
,
May 4 2016
Passing this to the current deputy for follow-up. I don't know if there's still a problem, so the first thing to do is to check drone sync times.
,
Jul 25 2016
,
Jul 26 2016
Assigning to the current deputy. The last attempt had a typo in the owner. Note, the stats server seems to have a disk space issue again; let me fix it, and then we can take a look at the data.
,
Jul 26 2016
From the stats, I don't see that server11 is slow on refresh.
,
Jul 26 2016
We have tick time on monarch/panopticon now. See for instance http://shortn/_OiAZXSPNbL (you can filter out hosts you don't care about; also note this is *rate*, which is the inverse of tick time)
,
Jul 26 2016
Hmm. I don't see data for chromeos-server11. Is it actually in prod?
,
Jul 26 2016
chromeos-server11.hot is a drone. Every tick, the master contacts the drones synchronously. When one drone is slow, the entire tick cycle slows down.

$ atest server list chromeos-server11.hot.corp.google.com
Hostname     : chromeos-server11.hot.corp.google.com
Status       : primary
Roles        : drone
Attributes   : {u'max_processes': u'200'}
Date Created : 2016-04-12 17:02:36
Date Modified: 2016-04-12 17:02:36
Note         : None
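To illustrate why a single slow drone matters, here is a toy model of the synchronous refresh described above. It is a sketch, not the scheduler's code: the function names and latency figures are hypothetical. This thread does not say whether drones are contacted one after another or all at once with a wait for the slowest, so both cases are shown.

```python
def serial_tick(latencies):
    """Tick time if drones are contacted one after another (assumption)."""
    return sum(latencies)


def wait_for_all_tick(latencies):
    """Tick time if drones are contacted together and the master waits for all (assumption)."""
    return max(latencies)


# Hypothetical per-drone refresh latencies, in seconds.
healthy = [0.5, 0.5, 0.5, 0.5]
one_slow = [0.5, 0.5, 0.5, 30.0]

if __name__ == "__main__":
    for label, latencies in (("healthy", healthy), ("one slow drone", one_slow)):
        print(label, serial_tick(latencies), wait_for_all_tick(latencies))
```

Under either assumption the slow drone's latency dominates the tick, which is why the fix is to speed up that drone or remove it, not to tune the other drones.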
,
Jul 17 2017
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed? -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.
,
Jul 18 2017
,
Jul 18 2017
I don't imagine this is still a problem.
Comment 1 by jrbarnette@chromium.org, Apr 22 2016