Issue 606018

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 583014
issue 597711



chromeos-server11 is dragging down tick time, and causing timeouts.

Project Member Reported by gwendal@chromium.org, Apr 22 2016

Issue description

I am looking at build failures, and some are due to timeouts.

I already mentioned this problem in https://bugs.chromium.org/p/chromium/issues/detail?id=583014#c30,
but on the strago suite the timeout errors are dwarfed by the eMMC error.

Take leon (slippy) as an example.

bvt_timeout is currently set to 120 minutes (I am not sure where), but that's not enough to complete the test.
See http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=60889236

Taking that suite as an example, its status.log shows that 6 hosts are used:
     13 chromeos4-row1-rack11-host3 
     11 chromeos4-row1-rack9-host1 
      8 chromeos4-row1-rack9-host5 
      7 chromeos4-row1-rack8-host3 
      3 chromeos4-row1-rack8-host1 
      3 chromeos4-row1-rack10-host5 

Taking chromeos4-row1-rack9-host1 as an example, the test start dates are:
chromeos4-row1-rack9-host1/platform_DMVerityBitCorruption.first : Apr 22 08:04:19
chromeos4-row1-rack9-host1/platform_DMVerityBitCorruption.middle : Apr 22 08:04:35
chromeos4-row1-rack9-host1/platform_DMVerityBitCorruption.last : Apr 22 08:04:52
chromeos4-row1-rack9-host1/provision_AutoUpdate.double : Apr 22 08:46:01
chromeos4-row1-rack9-host1/security_NetworkListeners : Apr 22 09:02:39
chromeos4-row1-rack9-host1/logging_CrashSender : Apr 22 09:10:18
chromeos4-row1-rack9-host1/security_ChromiumOSLSM : Apr 22 09:17:25
chromeos4-row1-rack9-host1/security_EnableChromeTesting : Apr 22 09:23:14
chromeos4-row1-rack9-host1/security_Firewall : Apr 22 09:29:48
chromeos4-row1-rack9-host1/security_RootfsStatefulSymlinks : Apr 22 09:36:54
chromeos4-row1-rack9-host1/platform_FilePerms : Apr 22 09:43:06


Notice the roughly 41-minute gap (08:04:52 to 08:46:01) between the start of platform_DMVerityBitCorruption.last and the start of provision_AutoUpdate.double on chromeos4-row1-rack9-host1. Looking at cautotest, I don't see any job, reset, or provision on that host between 8:11 and 8:42. Why was the machine idle?

If the scheduler was overloaded, we should consider increasing the timeouts for these simple suites, such as bvt-inline or HWTest.
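
The start-time gaps for chromeos4-row1-rack9-host1 can be double-checked with a short script. This is only a sketch: the timestamps are copied from the listing above, the year 2016 is assumed from the report date, and the helper names (start_gaps, largest_gap) are made up for illustration. A gap between two start times also includes the run time of the earlier test, so it is an upper bound on how long the host sat idle.

```python
from datetime import datetime

# Start times copied from the listing above (chromeos4-row1-rack9-host1).
STARTS = [
    ("platform_DMVerityBitCorruption.first", "Apr 22 08:04:19"),
    ("platform_DMVerityBitCorruption.middle", "Apr 22 08:04:35"),
    ("platform_DMVerityBitCorruption.last", "Apr 22 08:04:52"),
    ("provision_AutoUpdate.double", "Apr 22 08:46:01"),
    ("security_NetworkListeners", "Apr 22 09:02:39"),
    ("logging_CrashSender", "Apr 22 09:10:18"),
    ("security_ChromiumOSLSM", "Apr 22 09:17:25"),
    ("security_EnableChromeTesting", "Apr 22 09:23:14"),
    ("security_Firewall", "Apr 22 09:29:48"),
    ("security_RootfsStatefulSymlinks", "Apr 22 09:36:54"),
    ("platform_FilePerms", "Apr 22 09:43:06"),
]


def start_gaps(starts, year=2016):
    """Return (previous_test, next_test, gap_seconds) for consecutive starts.

    The year is an assumption; the log lines themselves carry none.
    """
    parsed = [
        (name, datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
        for name, stamp in starts
    ]
    return [
        (prev_name, next_name, int((next_time - prev_time).total_seconds()))
        for (prev_name, prev_time), (next_name, next_time) in zip(parsed, parsed[1:])
    ]


def largest_gap(starts):
    """Return the consecutive pair of tests with the longest start-to-start gap."""
    return max(start_gaps(starts), key=lambda gap: gap[2])


if __name__ == "__main__":
    for prev_name, next_name, secs in start_gaps(STARTS):
        print(f"{secs // 60:3d}m{secs % 60:02d}s  {prev_name} -> {next_name}")
```

Every other test on this host starts within about 17 minutes of the previous one, so the gap before provision_AutoUpdate.double is the outlier.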



 
60889236-chromeos-test%2Fhostless%2Fstatus.log
25.3 KB View Download
Labels: Infra-ChromeOS
Blocking: 597711
Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)
I just discovered that chromeos-server11.mtv was misconfigured;
it was running suite_scheduler, which is incorrect, since it was
a drone.

That same drone was slow on the drone refresh check which drove
up the tick time, causing long waits in between tests.

I've killed the errant suite_scheduler.  I want to wait and see what
happens next, before we change timeouts.

Comment 4 by benhenry@google.com, Apr 26 2016

Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS
Summary: chromeos-server11 is dragging down tick time, and causing timeouts. (was: bvt-inline: timeout_mins == 120 is too short)
The drone seems not to have improved after stopping suite_scheduler.

We reduced the maximum jobs on the drone from 400 to 200, and that
also seems to have had minimal impact.

More work is needed, but as best I can tell, the right fix is to
get chromeos-server11 to a reasonable response time, or else remove
it.  Adjusting the suite timeout is undesirable.

Owner: sbash...@chromium.org
Passing this to the current deputy for follow-up.

I don't know if there's still a problem, so the first thing
to do is to check drone sync times.

Owner: ----

Comment 8 by dshi@chromium.org, Jul 26 2016

Cc: dshi@chromium.org
Owner: kevcheng@chromium.org
Status: Assigned (was: Started)
Assigning to the current deputy. The previous owner change contained a typo.

Note: the stats server seems to have a disk space issue again. Let me fix it, and then we can take a look at the data.

Comment 9 by dshi@chromium.org, Jul 26 2016

From the stats, I don't see that server11 is slow on refresh.

Attachment: server11.png (301 KB)
We have tick time on monarch/panopticon now. See for instance http://shortn/_OiAZXSPNbL (you can filter out hosts you don't care about; also note this is *rate*, which is the inverse of tick time)
Hmm. I don't see data for chromeos-server11. Is it actually in prod?
chromeos-server11.hot is a drone.  Every tick, the master contacts
the drones synchronously.  When one drone is slow, the entire tick
cycle slows down.

$ atest server list chromeos-server11.hot.corp.google.com

Hostname     : chromeos-server11.hot.corp.google.com
Status       : primary
Roles        : drone
Attributes   : {u'max_processes': u'200'}
Date Created : 2016-04-12 17:02:36
Date Modified: 2016-04-12 17:02:36
Note         : None
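
To make the effect of one slow drone concrete, here is a minimal model of a tick. It assumes the master refreshes the drones one after another, so a tick lasts at least the sum of the per-drone refresh times; the real scheduler may differ. The function names, the drone names other than chromeos-server11, and all latency numbers are hypothetical, not measurements. The rate shown on the dashboard is the inverse of tick time.

```python
def tick_seconds(refresh_seconds, overhead_seconds=0.0):
    """Tick duration under the assumed model: drones are refreshed in sequence,
    so the tick takes the fixed overhead plus the sum of all refresh times."""
    return overhead_seconds + sum(refresh_seconds.values())


def tick_rate(tick_secs):
    """Ticks per second, i.e. the inverse of tick time."""
    return 1.0 / tick_secs


def slowest_drone(refresh_seconds):
    """Name of the drone with the longest refresh time."""
    return max(refresh_seconds, key=refresh_seconds.get)


if __name__ == "__main__":
    # Hypothetical refresh latencies in seconds, for illustration only.
    latencies = {
        "drone-a": 2.0,
        "drone-b": 3.0,
        "chromeos-server11": 55.0,
    }
    tick = tick_seconds(latencies)
    print(f"tick time: {tick:.1f}s, rate: {tick_rate(tick):.4f} ticks/s")
    print(f"slowest drone: {slowest_drone(latencies)}")
```

Under this model, every job waiting on the scheduler pays for the slow drone on each tick, which is why lowering max_processes on that drone alone would not help unless it also shortens the refresh.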

Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.

Comment 14 by sosa@chromium.org, Jul 18 2017

Owner: jrbarnette@chromium.org
Status: WontFix (was: Assigned)
I don't imagine this is still a problem.
