caroline-paladin HWTests are taking too long
Issue description

I've seen multiple instances of the suite getting aborted, e.g.:

paladin-build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15089
paladin-build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15096
And trybot: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/release/builds/12172

-----------------
Symptom:
- Suite is aborted with some of the tests still in flight.
- Nothing wrong with the suite job itself; it just took too long.
- (yet) no indication of any of the completed tests misbehaving.

I've seen one related failure where one of the test jobs went rogue and took forever (issue 734690). These haven't shown the same behaviour.

-------------------
Some more thoughts:
- This could be chrome crashing and us taking too long to collect crash logs. Must look into one of the tests to see if we're spending a long time collecting crash reports (a rough way to check this is sketched below).
- Nothing seems wrong with the caroline shard from vi/chromeos.

-------------------
Pri-1. This killed a few CQ runs over the weekend. It may kill a CQ run today, in which case this goes Pri-0.

+sheriffs
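One cheap way to check the crash-collection hypothesis is to look for large time gaps right after crash-related entries in a finished test job's status.log. This is only a sketch: the log path, the timestamp=<epoch> field, and the 60-second threshold are assumptions on my part, not something verified against these jobs.

import re

def crash_collection_gaps(status_log_path, threshold_secs=60):
    # Walk the log in order, remembering the previous timestamped line;
    # report any long gap that follows a line mentioning crashes.
    prev_ts, prev_line = None, None
    for line in open(status_log_path):
        m = re.search(r'timestamp=(\d+)', line)
        if not m:
            continue
        ts = int(m.group(1))
        if prev_ts is not None and 'crash' in prev_line.lower():
            gap = ts - prev_ts
            if gap > threshold_secs:
                print('%4ds gap after: %s' % (gap, prev_line.strip()))
        prev_ts, prev_line = ts, line

# Hypothetical usage against a downloaded results directory:
# crash_collection_gaps('results/124102483-chromeos-test/status.log')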
Jun 19 2017
Looking at https://viceroy.corp.google.com/chromeos/suite_details?job_id=124102483

There are only 6 DUTs in the cq pool. jrbarnette@ says we should have more. One of the tests failed, somehow taking the DUT down with it, so we had only 5 DUTs remaining, apparently not enough for the suite to finish in time (back-of-envelope math below).

Current status of the pool:

pprabhu@pprabhu:chromiumos$ dut-status -p cq -b caroline
hostname                      S   last checked         URL
chromeos2-row8-rack1-host3    OK  2017-06-19 06:58:15  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host3/774465-repair/
chromeos2-row4-rack11-host19  OK  2017-06-19 12:10:48  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/774863-reset/
chromeos2-row4-rack11-host20  OK  2017-06-19 12:10:30  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host20/774861-reset/
chromeos2-row4-rack11-host17  OK  2017-06-19 12:10:08  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host17/774858-reset/
chromeos2-row4-rack11-host21  OK  2017-06-19 12:09:17  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host21/774857-reset/
chromeos2-row4-rack10-host22  OK  2017-06-19 12:10:34  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host22/774862-reset/
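For context on "not enough to finish in time", here is the kind of arithmetic involved. Every number here (test count, per-test runtime, stage timeout) is a made-up placeholder to illustrate the shape of the problem, not a measurement from these runs.

# Assumed values, for illustration only.
tests, minutes_per_test, timeout_minutes = 50, 10.0, 90.0

for duts in (6, 5, 4):
    # Ignores provisioning/reset overhead and uneven test lengths.
    wall_minutes = tests * minutes_per_test / duts
    verdict = 'fits' if wall_minutes <= timeout_minutes else 'blows the timeout'
    print('%d DUTs -> ~%.0f min (%s)' % (duts, wall_minutes, verdict))

With numbers in that ballpark, losing even one DUT out of six pushes the suite past the stage timeout, which matches what we're seeing.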
Jun 19 2017
Same story in the other CQ run: https://viceroy.corp.google.com/chromeos/suite_details?job_id=123929907

Something caused a reset (after a passing test!) on chromeos2-row8-rack1-host3 in the middle of the suite. As a result, the suite fell back to just 4 DUTs for the most part -- not enough. Note that this is the same DUT as the one affected by issue 734690.
Jun 19 2017
AIs:
[1] Remove chromeos2-row8-rack1-host3 from circulation.
[2] Increase the CQ pool size for caroline.

kirtika: I think it's still useful to be sure we don't have too many crashes on these DUTs, so please confirm / reject the crashes hypothesis.
Jun 19 2017
[1]
pprabhu@pprabhu:chromiumos$ atest host mod -l -r ' crbug.com/734701 crbug.com/734690 ' chromeos2-row8-rack1-host3
Locked host: chromeos2-row8-rack1-host3

[2]
pprabhu@pprabhu:chromiumos$ balance_pool -t 10 cq caroline
Balancing caroline cq pool:
Total 6 DUTs, 5 working, 1 broken, 0 reserved.
Target is 10 working DUTs; grow pool by 5 DUTs.
caroline cq pool has 3 spares available.
ERROR: Not enough spares: need 5, only have 3.
Transferring 3 DUTs from suites to cq.
Updating host: chromeos6-row2-rack21-host5.
Removing labels ['pool:suites'] from host chromeos6-row2-rack21-host5
Adding labels ['pool:cq'] to host chromeos6-row2-rack21-host5
Updating host: chromeos6-row2-rack23-host10.
Removing labels ['pool:suites'] from host chromeos6-row2-rack23-host10
Adding labels ['pool:cq'] to host chromeos6-row2-rack23-host10
Updating host: chromeos6-row1-rack23-host21.
Removing labels ['pool:suites'] from host chromeos6-row1-rack23-host21
Adding labels ['pool:cq'] to host chromeos6-row1-rack23-host21

pprabhu@pprabhu:chromiumos$ dut-status -b caroline -p cq
hostname                      S   last checked         URL
chromeos2-row8-rack1-host3    OK  2017-06-19 06:58:15  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host3/774465-repair/
chromeos2-row4-rack11-host19  OK  2017-06-19 12:15:19  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/774904-reset/
chromeos2-row4-rack11-host20  OK  2017-06-19 12:16:22  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host20/774912-reset/
chromeos2-row4-rack11-host17  OK  2017-06-19 12:16:12  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host17/774905-reset/
chromeos2-row4-rack11-host21  OK  2017-06-19 12:17:30  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host21/774927-reset/
chromeos2-row4-rack10-host22  OK  2017-06-19 12:16:46  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host22/774913-reset/
chromeos6-row2-rack21-host5   OK  2017-06-19 03:30:36  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack21-host5/774276-verify/
chromeos6-row2-rack23-host10  OK  2017-06-19 03:30:36  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host10/774277-verify/
chromeos6-row1-rack23-host21  OK  2017-06-19 03:30:36  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack23-host21/774278-verify/

I have increased the cq pool size to 8 working DUTs (one of the hosts above is the bad DUT, which is locked). Looking at why we don't have enough spares: there are 87 caroline DUTs in the lab! (way over the average, because we're doing a DVT --> PVT migration right now). A rough per-pool breakdown is sketched below.
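To see where those 87 DUTs actually are, something like the following would count them per pool. This is a sketch: the pool names listed are guesses, and the parsing assumes the same header-plus-one-line-per-host output dut-status printed above.

import subprocess

for pool in ('cq', 'bvt', 'suites', 'cts'):  # pool names are guesses
    out = subprocess.check_output(['dut-status', '-b', 'caroline', '-p', pool])
    # Skip the 'hostname S last checked URL' header row and count the rest.
    hosts = [line for line in out.splitlines()[1:] if line.strip()]
    print('%-8s %d DUTs' % (pool, len(hosts)))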
Jun 19 2017
The current CQ run again picked up the bad DUT (issue 732999). And the run is actually doomed because of two bad CLs in the CQ: https://chromiumos-build-annotator.googleplex.com/build_annotations/edit_annotations/master-paladin/1604818/

So, I tried to abort the suite job because it was holding the CQ back: http://cautotest/afe/#tab_id=view_job&object_id=124127683

But I couldn't abort the job, nor the suite job. For posterity, the failure was:

pprabhu@pprabhu:chromiumos$ atest job abort 124127683 --debug
Operation abort_host_queue_entries failed:
AssertionError:
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 120, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 160, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 921, in abort_host_queue_entries
    models.AclGroup.check_abort_permissions(query)
  File "/usr/local/autotest/frontend/afe/models.py", line 1007, in check_abort_permissions
    for entry in cannot_abort)
  File "/usr/local/autotest/frontend/afe/models.py", line 1007, in <genexpr>
    for entry in cannot_abort)
  File "/usr/local/autotest/frontend/afe/models.py", line 1714, in host_or_metahost_name
    assert self.meta_host
AssertionError

I'll try (again) to get the DUT out of circulation.
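The assertion itself is shallow: check_abort_permissions wants to name the host for every queue entry the caller isn't allowed to abort, and host_or_metahost_name expects an entry without a concrete host to at least have a meta_host. Paraphrased from the traceback (not a verbatim copy of frontend/afe/models.py), the failing property is roughly:

class HostQueueEntry(object):
    # ...
    def host_or_metahost_name(self):
        if self.host:
            return self.host.hostname
        # The entry in this suite had neither a host nor a meta_host,
        # so this assert is what turned the abort into an AssertionError.
        assert self.meta_host
        return self.meta_host.name

So the abort RPC trips over a queue entry in a state the permission check never expected, which is consistent with the DUT/job being stuck in a weird state (issue 732999).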
Jun 19 2017
Went in with the hammer: aborted caroline-paladin https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-paladin/builds/303

Now I have until the next HWTest to make sure that the rogue DUT is really out of circulation.
Jun 19 2017
Like issue 732999 says, unlocking and re-locking the DUT gets past this problem:

pprabhu@pprabhu:chromiumos$ ateatest host mod -u -r ' crbug.com/734701 crbug.com/734690 ' chromeos2-row8-rack1-host3
ateatest: command not found
pprabhu@pprabhu:chromiumos$ atest host mod -u -r ' crbug.com/734701 crbug.com/734690 ' chromeos2-row8-rack1-host3
Unlocked host: chromeos2-row8-rack1-host3
pprabhu@pprabhu:chromiumos$ atest host mod -l -r ' crbug.com/734701 crbug.com/734690 ' chromeos2-row8-rack1-host3
Locked host: chromeos2-row8-rack1-host3
pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3
Host                        Status        Shard                                  Locked  Lock Reason                        Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  True    crbug.com/734701 crbug.com/734690  pprabhu    caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2
pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3 -w chromeos-server98.mtv
Host                        Status        Shard                                  Locked  Lock Reason                        Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  True    crbug.com/734701 crbug.com/734690  pprabhu    caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2

The rogue test job is still running, but is less of a problem.
Jun 19 2017
The latest caroline-pre-cq passed. I'll use issue 734690 to follow up on what's wrong with that particular DUT.