Unnecessary kevin-paladin & kevin-arcnext-paladin failures due to sharing DUTs
Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-arcnext-paladin/builds/1186
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-arcnext-paladin/builds/1184
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5326
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5323

Recently kevin-paladin & kevin-arcnext-paladin have been hitting "suite abort" failures. The cause is that they share the same pool of DUTs, so those DUTs get re-provisioned very frequently, e.g.:

host: chromeos6-row1-rack24-host13, status: Ready, locked: False, diagnosis: Working
Last 10 jobs within 1:48:00:
  228375576  kevin-arcnext-paladin/R70-10971.0.0-rc1/bvt-arc/cheets_GTS.6.0_r1.GtsAdminTestCases  started on: 2018-08-15 07:26:57  status Completed
  1921943    Provision  started on: 2018-08-15 07:20:10  status PASS
  228372017  kevin-paladin/R70-10971.0.0-rc1/bvt-arc/cheets_GTS.6.0_r1.GtsPermissionTestCases  started on: 2018-08-15 07:12:38  status Completed
  1921775    Provision  started on: 2018-08-15 07:06:06  status PASS
  228368872  kevin-arcnext-paladin/R70-10971.0.0-rc1/provision/dummy_Pass  started on: 2018-08-15 07:04:54  status Completed
  1921667    Provision  started on: 2018-08-15 06:56:07  status PASS
  228368618  kevin-paladin/R70-10971.0.0-rc1/provision/dummy_Pass  started on: 2018-08-15 06:54:49  status Completed

Any provision failure causes the test to retry. If 2 or 3 DUTs fail provisioning, or provisioning takes a long time, the retries consume most of the run and the suite times out. For example, https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-arcnext-paladin/builds/1186 failed because the DUT had previously been used for kevin-paladin, so it was re-provisioned, and that provision took more than an hour, which timed out the suite:

chromeos6-row2-rack24-host15 ..., ...
  2018-08-15 08:40:00  --  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/227684253-chromeos-test/ (kevin-arcnext-paladin/R70-10971.0.0-rc1/bvt-arc/cheets_GTS.6.0_r1.GtsPermissionTestCases)
  2018-08-15 07:20:37  OK  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack24-host15/1921947-provision/
  2018-08-15 07:14:06  --  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/227677330-chromeos-test/ (kevin-paladin/R70-10971.0.0-rc1/provision/dummy_Pass)
  2018-08-15 06:46:12  OK  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack24-host15/1921539-provision/

Marking this as chase-pending for more discussion.
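To make the provisioning cost concrete, here is a back-of-the-envelope sketch in Python (illustrative only: the helper, the job sequences, and all timings below are assumptions, not measurements or real scheduler code). It shows why a DUT shared by the two paladins ends up re-provisioned for nearly every job, while a dedicated pool provisions once per build:

# Illustrative only -- not real scheduler code; numbers are assumptions.
PROVISION_MIN = 8        # assumed typical provision time (minutes)
SLOW_PROVISION_MIN = 65  # assumed worst case, like the >1h provision above
SUITE_TIMEOUT_MIN = 120  # assumed suite timeout

def provisions_needed(image_sequence):
    """Count provisions: one is needed whenever the requested image differs
    from whatever is currently installed on the DUT."""
    current, count = None, 0
    for image in image_sequence:
        if image != current:
            count += 1
            current = image
    return count

# Dedicated pool: four kevin-paladin jobs in a row need a single provision.
dedicated = provisions_needed(['kevin'] * 4)
print(dedicated, dedicated * PROVISION_MIN)            # 1 provision, 8 min

# Shared pool: kevin and kevin-arcnext alternate, so every job re-provisions.
shared = provisions_needed(['kevin', 'arcnext'] * 2)
print(shared, shared * PROVISION_MIN)                  # 4 provisions, 32 min

# A single slow provision already eats more than half of the suite timeout.
print(float(SLOW_PROVISION_MIN) / SUITE_TIMEOUT_MIN)   # ~0.54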
Aug 20
We need to either disable one of these paladin hwtests or move one of them to a dedicated pool. We can't support concurrent CQs within a single pool due to provision thrashing, which is a long-standing limitation. -> Bernie for input on which course is preferred.
Aug 21
This is tricky: even with a 2x deployment of DUTs (as we have for kevin), the two builders can still step on each other. I guess we could split the cq pool (say, cq and cq-arcnext) for kevin and put in a hack so that each builder runs on the appropriate pool? If we have to actually tear one down, I think we could replace kevin more easily than kevin-arcnext: we already have bob in the CQ, and if we can get them stable we could probably add in scarlet (dru).
Aug 21
> This is tricky: even with a 2x deployment of DUTs (as we have for kevin), the two builders can still step on each other.

Yes, with the current scheduler. Quota Scheduler should eventually resolve this.

> I guess we could split the cq pool (say, cq and cq-arcnext) for kevin and put in a hack so that each builder runs on the appropriate pool?

Yep, that's doable if we have enough devices. We currently have 16 devices in pool:cq. We'd need to create a new managed pool so that its devices get auto-repaired, which is a bit of additional development overhead.

> If we have to actually tear one down, I think we could replace kevin more easily than kevin-arcnext: we already have bob in the CQ, and if we can get them stable we could probably add in scarlet (dru).

That's easier than the above. Should we just go ahead and do it?
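For illustration, the "run on the appropriate pool" hack could be as small as a per-builder pool lookup in the HWTest stage. This is a sketch only: the helper and the cq-arcnext pool name are hypothetical, not existing chromite code, and the new pool would still need to be created and stocked as noted above.

# Sketch only -- not actual chromite code.  Picks a DUT pool per paladin so
# kevin and kevin-arcnext stop thrashing each other's DUTs with re-provisions.
# 'cq-arcnext' is a hypothetical new managed pool.

def hw_test_pool(builder_name):
    """Return the DUT pool this builder's HWTest stage should target."""
    if builder_name == 'kevin-arcnext-paladin':
        return 'cq-arcnext'   # hypothetical new pool
    return 'cq'               # existing pool (16 kevin DUTs today)

assert hw_test_pool('kevin-paladin') == 'cq'
assert hw_test_pool('kevin-arcnext-paladin') == 'cq-arcnext'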
Aug 22
While I don't like the reduction in coverage, we can drop kevin for now if that is necessary for stability and the best course of action we have available.
Aug 23
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1187546
Sep 26
Is this still relevant? I have a CL languishing that disables these tests, but I'm not sure whether it's still needed.
Sep 26
It looks to me like a recent master-paladin run failed for similar reasons: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/19700
Sep 26
I saw this in the log (cc'ing ihf@, whom I mentioned this to over Hangouts...)
...
09/26 13:35:19.184 INFO | tradefed_utils:0032| Waiting for cache lock...
09/26 13:39:37.053 INFO | tradefed_utils:0032| Waiting for cache lock...
09/26 14:26:08.636 ERROR| tradefed_utils:0047| Permanent lock failure. Trying to break lock.
09/26 14:26:08.637 WARNI| test:0606| The test failed with the following exception
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/test.py", line 567, in _exec
_cherry_pick_call(self.initialize, *args, **dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 715, in _cherry_pick_call
return func(*p_args, **p_dargs)
File "/usr/local/autotest/server/cros/tradefed_test.py", line 124, in initialize
self._clean_download_cache_if_needed()
File "/usr/local/autotest/server/cros/tradefed_test.py", line 483, in _clean_download_cache_if_needed
with tradefed_utils.lock(self._tradefed_cache_lock):
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/autotest/server/cros/tradefed_utils.py", line 54, in lock
raise error.TestFail('Error: permanent cache lock failure.')
TestFail: Error: permanent cache lock failure.
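For context, the traceback above comes out of a cache-lock helper. The sketch below is a rough reconstruction of what a helper like tradefed_utils.lock() plausibly does, based only on the log lines above; it is not the actual autotest source, and the attempt count and sleep interval are assumptions (the real helper also tries to break a stale lock, per the "Trying to break lock" message).

# Rough reconstruction only -- NOT the real tradefed_utils code.
import contextlib
import fcntl
import logging
import time

class TestFail(Exception):
    """Stand-in for autotest's error.TestFail."""

@contextlib.contextmanager
def lock(filename, attempts=5, sleep_secs=300):
    """Serialize access to the shared tradefed download cache.

    Several tests on one drone share the cache, so only one may populate it
    at a time.  If the lock cannot be taken after a few long waits, fail the
    test, which is what produced the 'permanent cache lock failure' above.
    """
    f = open(filename, 'w')
    try:
        for _ in range(attempts):
            try:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except IOError:
                logging.info('Waiting for cache lock...')
                time.sleep(sleep_secs)
        else:
            logging.error('Permanent lock failure.')
            raise TestFail('Error: permanent cache lock failure.')
        yield
    finally:
        f.close()   # closing the file also releases the flock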
Sep 27
Re #10, sorry, this is issue 885199.
Sep 27
As for the thrashing, I am not convinced the analysis is correct. We already schedule builds from different branches in the same pool; scheduling kevin and kevin-arcnext in the same pool is no different (they get the same priority, just different spots in the queue). Maybe a race? That said, the lab does provision too aggressively: it tries to grab the whole pool instead of orienting itself on, for instance, the minimum_duts parameter.
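To illustrate that last point, here is a toy example (not lab or scheduler code; the helper name and the numbers are assumptions) of the difference between provisioning every idle DUT in the pool and capping at something like minimum_duts:

# Toy illustration -- not lab/scheduler code; names and numbers are assumptions.

def duts_to_provision(idle_duts, minimum_duts):
    """Cap how many idle DUTs one suite grabs, leaving the rest for others."""
    return min(len(idle_duts), minimum_duts)

pool = ['chromeos6-host%02d' % i for i in range(16)]  # e.g. the 16 kevin DUTs

print(len(pool))                   # 16 -- today: a suite provisions every idle DUT
print(duts_to_provision(pool, 4))  # 4  -- capped: the other paladin keeps its DUTs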
Comment 1 by xixuan@chromium.org, Aug 16