
Issue 874705

Starred by 1 user

Issue metadata

Status: Started
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----




Unnecessary kevin-paladin & kevin-arcnext-paladin failures due to the share of DUTs

Reported by xixuan@chromium.org (Project Member), Aug 16

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-arcnext-paladin/builds/1186
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-arcnext-paladin/builds/1184

https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5326
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5323

Recently we have seen kevin-paladin & kevin-arcnext-paladin experiencing "suite abort" failures. This is caused by the two builders sharing the same pool of DUTs, so the DUTs are re-provisioned very frequently, e.g.

  host: chromeos6-row1-rack24-host13, status: Ready, locked: False diagnosis: Working
  Last 10 jobs within 1:48:00:
  228375576 kevin-arcnext-paladin/R70-10971.0.0-rc1/bvt-arc/cheets_GTS.6.0_r1.GtsAdminTestCases started on: 2018-08-15 07:26:57 status Completed
  1921943 Provision started on: 2018-08-15 07:20:10 status PASS
  228372017 kevin-paladin/R70-10971.0.0-rc1/bvt-arc/cheets_GTS.6.0_r1.GtsPermissionTestCases started on: 2018-08-15 07:12:38 status Completed
  1921775 Provision started on: 2018-08-15 07:06:06 status PASS
  228368872 kevin-arcnext-paladin/R70-10971.0.0-rc1/provision/dummy_Pass started on: 2018-08-15 07:04:54 status Completed
  1921667 Provision started on: 2018-08-15 06:56:07 status PASS
  228368618 kevin-paladin/R70-10971.0.0-rc1/provision/dummy_Pass started on: 2018-08-15 06:54:49 status Completed


Any provision failure causes the test to be retried.
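
To make the thrashing concrete, here is a minimal sketch of the mechanism (illustrative only, not the lab scheduler code; the image names are made up): when the two paladins alternate on one DUT, every job finds the other builder's image installed and has to re-provision.

  # Minimal sketch of provision thrashing; illustrative only, not lab code.
  def needs_provision(installed_image, requested_image):
      """A job re-provisions whenever the DUT carries a different image."""
      return installed_image != requested_image

  jobs = ['kevin-paladin/R70-rc1', 'kevin-arcnext-paladin/R70-rc1'] * 3
  installed = None
  provisions = 0
  for requested in jobs:
      if needs_provision(installed, requested):
          provisions += 1          # time spent provisioning, not testing
          installed = requested
  print('%d provisions for %d jobs' % (provisions, len(jobs)))  # -> 6 for 6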

If 2 or 3 DUTs fail provision, or provisioning takes a long time, the retries consume most of the suite's time budget and the suite times out, e.g.

The failure of https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-arcnext-paladin/builds/1186 happened because the DUT had previously been used for kevin-paladin, so it had to be re-provisioned; the provision took more than 1 hour and the suite timed out.

chromeos6-row2-rack24-host15
    ..., ...
    2018-08-15 08:40:00  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/227684253-chromeos-test/  (kevin-arcnext-paladin/R70-10971.0.0-rc1/bvt-arc/cheets_GTS.6.0_r1.GtsPermissionTestCases)
    2018-08-15 07:20:37  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack24-host15/1921947-provision/
    2018-08-15 07:14:06  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/227677330-chromeos-test/  (kevin-paladin/R70-10971.0.0-rc1/provision/dummy_Pass)
    2018-08-15 06:46:12  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack24-host15/1921539-provision/


Marking this as chase-pending for further discussion.

 
Summary: Unnecessary kevin-paladin & kevin-arcnext-paladin failures due to the share of DUTs (was: Unnecessary kevin-paladin & kevin-arcnext-paladin failures due to the shard of DUTs)
Cc: bhthompson@chromium.org
Labels: -Pri-2 -Chase-Pending Pri-1
Owner: bhthompson@google.com
We need to either disable one of these paladin hwtests or move it to a dedicated pool. We can't support concurrent CQs within a single pool due to provision thrashing, which is a long-standing limitation.

 -> Bernie for input on which course is preferred.
Owner: akes...@chromium.org
This is tricky, so even if we have a 2x deployment of DUTs (as we do for Kevin) it can still step on itself.

I guess we could split the cq pool (say cq and cq-arcnext) for Kevin and put in a hack so that we run on the appropriate pool?

If we have to actually tear one down, I think we could probably replace kevin more easily than kevin-arcnext; we already have bob in the CQ, and if we can get them stable we could probably add in scarlet (dru).
> This is tricky, so even if we have a 2x deployment of DUTs (as we do for Kevin) it can still step on itself.

Yes, with the current scheduler. Quotascheduler should eventually resolve this.

> I guess we could split the cq pool (say cq and cq-arcnext) for Kevin and put in a hack so that we run on the appropriate pool?

Yep, that's doable if we have enough devices. We currently have 16 devices in pool:cq. We'd need to create a new managed pool so that devices get autorepaired, which is a bit of additional development overhead.

> If we have to actually tear one down, I think we could probably replace kevin more easily than kevin-arcnext; we already have bob in the CQ, and if we can get them stable we could probably add in scarlet (dru).

That's easier than the above. Should we just go ahead and do it?
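
As a rough sketch of the split-pool idea from the discussion above (hypothetical pool name and helper, not real chromite or lab config code), the mapping would simply keep the two paladins on disjoint pools:

  # Hypothetical sketch only; 'cq-arcnext' is an assumed new managed pool
  # name, and this helper is not real chromite/autotest config code.
  POOL_BY_BUILDER = {
      'kevin-paladin': 'cq',
      'kevin-arcnext-paladin': 'cq-arcnext',
  }

  def hwtest_pool(builder_name):
      """Pool a paladin's hwtests should be scheduled against."""
      return POOL_BY_BUILDER.get(builder_name, 'cq')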
While I don't like the reduction in coverage, we can drop kevin for now if that is the best course of action available and it is necessary for stability.
Cc: rajatja@google.com zamorzaev@chromium.org
Owner: ----
Is this still relevant? I have a CL languishing that disables these tests, but I'm not sure whether it is still needed.
It seems to me that the recent paladin run failed for similar reasons?

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/19700
Cc: puthik@google.com dgarr...@chromium.org
Cc: ihf@chromium.org
I saw this in the log (cc'ing ihf@ who I mentioned this to over hangouts...)

...
09/26 13:35:19.184 INFO |    tradefed_utils:0032| Waiting for cache lock...
09/26 13:39:37.053 INFO |    tradefed_utils:0032| Waiting for cache lock...
09/26 14:26:08.636 ERROR|    tradefed_utils:0047| Permanent lock failure. Trying to break lock.
09/26 14:26:08.637 WARNI|              test:0606| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 567, in _exec
    _cherry_pick_call(self.initialize, *args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 715, in _cherry_pick_call
    return func(*p_args, **p_dargs)
  File "/usr/local/autotest/server/cros/tradefed_test.py", line 124, in initialize
    self._clean_download_cache_if_needed()
  File "/usr/local/autotest/server/cros/tradefed_test.py", line 483, in _clean_download_cache_if_needed
    with tradefed_utils.lock(self._tradefed_cache_lock):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/autotest/server/cros/tradefed_utils.py", line 54, in lock
    raise error.TestFail('Error: permanent cache lock failure.')
TestFail: Error: permanent cache lock failure.
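
For context, a rough sketch of what a retrying file-based cache lock like the one in this traceback could look like (an approximation only, not the actual tradefed_utils.lock implementation; the retry count and wait time are assumptions):

  # Rough approximation of a retrying file lock; NOT the real
  # tradefed_utils.lock.  Retry count and wait time are assumed.
  import contextlib
  import fcntl
  import logging
  import time

  @contextlib.contextmanager
  def lock(path, retries=10, wait_secs=60):
      with open(path, 'w') as lock_file:
          for _ in range(retries):
              try:
                  fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                  break
              except IOError:
                  logging.info('Waiting for cache lock...')
                  time.sleep(wait_secs)
          else:
              # Mirrors the 'permanent cache lock failure' seen above.
              raise Exception('Error: permanent cache lock failure.')
          try:
              yield
          finally:
              fcntl.flock(lock_file, fcntl.LOCK_UN)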

Re #10, sorry, this is issue 885199.
As for the thrashing, I am not convinced the analysis is correct. We already schedule builds from different branches in the same pool, and scheduling kevin and kevin-arcnext in the same pool is no different (they get the same priority but different spots in the queue). Maybe a race? That said, the lab does provision too aggressively: it tries to grab the whole pool instead of limiting itself to, for instance, the minimum_duts parameter.
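
As a hypothetical illustration of that last point (the throttle and the numbers are invented, not existing lab code): cap how many DUTs a suite claims at minimum_duts instead of grabbing the whole pool.

  # Hypothetical throttle, not existing lab code; numbers are made up.
  def duts_to_claim(pool_size, minimum_duts):
      """Claim at most minimum_duts from a shared pool, if it is set."""
      return min(pool_size, minimum_duts) if minimum_duts else pool_size

  print(duts_to_claim(pool_size=16, minimum_duts=4))  # -> 4, leaving 12 free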
