
Issue 734701

Starred by 0 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug




caroline-paladin HWTests are taking too long

Project Member Reported by pprabhu@chromium.org, Jun 19 2017

Issue description

I've seen multiple instances of the suite getting aborted.

e.g.: 
paladin-build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15089
paladin-build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15096

And trybot: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/release/builds/12172

-----------------
Symptom:
- Suite is aborted with some of the tests still in flight.
- Nothing wrong with the suite job, it just took too long.
- No indication (yet) that any of the completed tests misbehaved.

I've seen one related failure where one of the test jobs went rogue and took forever ( issue 734690 ). These runs haven't shown the same behaviour.

-------------------
Some more thoughts:
- This could be Chrome crashing, with crash log collection taking too long. We should look into one of the tests to check whether collecting crash reports is slow.

- Nothing seems wrong with the caroline shard from vi/chromeos

-------------------
Pri-1: This killed a few CQ runs over the weekend.
It may kill the CQ run today, in which case this goes Pri-0.

+sheriffs
 
Here's the current CQ hwtest on caroline-paladin: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=124127683

If it fails similarly, this bug goes P0.

Comment 2 by kirtika@google.com, Jun 19 2017

Cc: igo@chromium.org
Owner: kirtika@chromium.org
Status: Started (was: Untriaged)
Cc: jrbarnette@chromium.org
Looking at https://viceroy.corp.google.com/chromeos/suite_details?job_id=124102483

There are only 6 DUTs in the cq pool. jrbarnette@ says we should have more.
One of the tests failed and somehow took the DUT down with it, so only 5 DUTs remained, apparently not enough for the suite to finish in time.

Current status of the pool:
pprabhu@pprabhu:chromiumos$ dut-status -p cq -b caroline
hostname                       S   last checked         URL
chromeos2-row8-rack1-host3     OK  2017-06-19 06:58:15  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host3/774465-repair/
chromeos2-row4-rack11-host19   OK  2017-06-19 12:10:48  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/774863-reset/
chromeos2-row4-rack11-host20   OK  2017-06-19 12:10:30  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host20/774861-reset/
chromeos2-row4-rack11-host17   OK  2017-06-19 12:10:08  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host17/774858-reset/
chromeos2-row4-rack11-host21   OK  2017-06-19 12:09:17  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host21/774857-reset/
chromeos2-row4-rack10-host22   OK  2017-06-19 12:10:34  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host22/774862-reset/

Same story in the other CQ run: https://viceroy.corp.google.com/chromeos/suite_details?job_id=123929907

Something caused a reset (after a passing test!) on chromeos2-row8-rack1-host3 in the middle of the suite. As a result, the suite fell back on just 4 DUTs for the most part -- not enough.

Note that this is the same DUT that is affected by issue 734690.
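
A rough sketch of why losing one or two DUTs can push a suite past its timeout. The test count, per-test duration, and timeout here are made-up illustrative numbers, not measurements from these runs, and `estimate_suite_minutes` is a hypothetical helper, not part of autotest:

```python
import math


def estimate_suite_minutes(num_tests, minutes_per_test, num_duts):
    """Rough wall-clock estimate: tests run in waves, one test per DUT per wave."""
    if num_duts <= 0:
        raise ValueError("need at least one working DUT")
    waves = math.ceil(num_tests / num_duts)
    return waves * minutes_per_test


# Illustrative numbers only (assumptions, not taken from the actual suite).
NUM_TESTS = 40
MINUTES_PER_TEST = 10
TIMEOUT_MINUTES = 75

for duts in (6, 5, 4):
    estimate = estimate_suite_minutes(NUM_TESTS, MINUTES_PER_TEST, duts)
    verdict = "fits" if estimate <= TIMEOUT_MINUTES else "exceeds timeout"
    print(f"{duts} DUTs: ~{estimate} min ({verdict})")
```

With a pool this small, every lost DUT adds at least one extra wave of tests, so a suite that barely fits on 6 DUTs has no margin left on 5 or 4.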
Owner: pprabhu@chromium.org
AIs: 

[1] Remove chromeos2-row8-rack1-host3 from circulation
[2] Increase the CQ pool size for caroline

kirtika: I think it's still useful to be sure we don't have too many crashes on these DUTs, so please confirm / reject the crashes hypothesis.
[1]
pprabhu@pprabhu:chromiumos$ atest host mod -l -r ' crbug.com/734701   crbug.com/734690 ' chromeos2-row8-rack1-host3
Locked host: 
        chromeos2-row8-rack1-host3


[2]
pprabhu@pprabhu:chromiumos$ balance_pool -t 10 cq caroline

Balancing caroline cq pool:
Total 6 DUTs, 5 working, 1 broken, 0 reserved.
Target is 10 working DUTs; grow pool by 5 DUTs.
caroline cq pool has 3 spares available.
ERROR: Not enough spares: need 5, only have 3.
Transferring 3 DUTs from suites to cq.
Updating host: chromeos6-row2-rack21-host5.
Removing labels ['pool:suites'] from host chromeos6-row2-rack21-host5
Adding labels ['pool:cq'] to host chromeos6-row2-rack21-host5
Updating host: chromeos6-row2-rack23-host10.
Removing labels ['pool:suites'] from host chromeos6-row2-rack23-host10
Adding labels ['pool:cq'] to host chromeos6-row2-rack23-host10
Updating host: chromeos6-row1-rack23-host21.
Removing labels ['pool:suites'] from host chromeos6-row1-rack23-host21
Adding labels ['pool:cq'] to host chromeos6-row1-rack23-host21
pprabhu@pprabhu:chromiumos$ dut-status -b caroline -p cq
hostname                       S   last checked         URL
chromeos2-row8-rack1-host3     OK  2017-06-19 06:58:15  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host3/774465-repair/
chromeos2-row4-rack11-host19   OK  2017-06-19 12:15:19  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/774904-reset/
chromeos2-row4-rack11-host20   OK  2017-06-19 12:16:22  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host20/774912-reset/
chromeos2-row4-rack11-host17   OK  2017-06-19 12:16:12  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host17/774905-reset/
chromeos2-row4-rack11-host21   OK  2017-06-19 12:17:30  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host21/774927-reset/
chromeos2-row4-rack10-host22   OK  2017-06-19 12:16:46  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack10-host22/774913-reset/
chromeos6-row2-rack21-host5    OK  2017-06-19 03:30:36  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack21-host5/774276-verify/
chromeos6-row2-rack23-host10   OK  2017-06-19 03:30:36  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host10/774277-verify/
chromeos6-row1-rack23-host21   OK  2017-06-19 03:30:36  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack23-host21/774278-verify/


I have increased the cq pool size to 8 working DUTs (the list above shows 9, but one of them is the bad DUT that is locked).
Now looking at why we don't have enough spares. There are 87 caroline DUTs in the lab! (That is way over the average, because we're doing a DVT --> PVT migration right now.)
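
The arithmetic behind the balance_pool output above can be sketched as follows. This is only an illustration of the numbers it printed, not balance_pool's actual implementation; `plan_pool_growth` is a hypothetical name:

```python
def plan_pool_growth(working, target, spares):
    """Return (needed, transferred, shortfall) for growing a pool to `target` working DUTs."""
    needed = max(target - working, 0)      # DUTs required to reach the target
    transferred = min(needed, spares)      # cannot move more than the spares available
    shortfall = needed - transferred       # what is still missing afterwards
    return needed, transferred, shortfall


# Numbers from the balance_pool output above: 5 working, target 10, 3 spares.
needed, transferred, shortfall = plan_pool_growth(working=5, target=10, spares=3)
print(f"need {needed}, transfer {transferred}, short by {shortfall}")
print(f"working DUTs after transfer: {5 + transferred}")
```

This matches the "need 5, only have 3" error and the resulting 8 working DUTs; the remaining shortfall of 2 is why the spares question matters.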

The current CQ run again picked up the bad DUT ( issue 732999 ).
And the run is actually doomed because of two bad CLs in the CQ: https://chromiumos-build-annotator.googleplex.com/build_annotations/edit_annotations/master-paladin/1604818/

So, I tried to abort the suite job because it was holding the CQ back.
http://cautotest/afe/#tab_id=view_job&object_id=124127683
But I couldn't abort the suite job.

For posterity, failure was:
pprabhu@pprabhu:chromiumos$ atest job abort 124127683 --debug
Operation abort_host_queue_entries failed:
    AssertionError: 
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 120, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 160, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 921, in abort_host_queue_entries
    models.AclGroup.check_abort_permissions(query)
  File "/usr/local/autotest/frontend/afe/models.py", line 1007, in check_abort_permissions
    for entry in cannot_abort)
  File "/usr/local/autotest/frontend/afe/models.py", line 1007, in <genexpr>
    for entry in cannot_abort)
  File "/usr/local/autotest/frontend/afe/models.py", line 1714, in host_or_metahost_name
    assert self.meta_host
AssertionError
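
A minimal sketch of how a bare `assert self.meta_host` produces an AssertionError with an empty message, as in the traceback above. `FakeQueueEntry` is a hypothetical stand-in, and the method body is an assumption inferred only from the traceback line, not the real autotest model code. `safe_name` is likewise a hypothetical defensive variant:

```python
class FakeQueueEntry:
    """Hypothetical stand-in for a host queue entry; not the real autotest model."""

    def __init__(self, host=None, meta_host=None):
        self.host = host
        self.meta_host = meta_host

    def host_or_metahost_name(self):
        # Assumed shape of the method, inferred only from the traceback's
        # `assert self.meta_host` line.
        if self.host:
            return self.host
        assert self.meta_host
        return self.meta_host


def safe_name(entry):
    """Defensive variant: fall back to a placeholder instead of asserting."""
    return entry.host or entry.meta_host or "<no host assigned>"


entry = FakeQueueEntry()  # neither host nor meta_host set
try:
    entry.host_or_metahost_name()
except AssertionError as err:
    print(f"AssertionError with message: {str(err)!r}")
print(safe_name(entry))
```

Under that assumption, the abort would fail while building the permission error message for an entry that has neither a host nor a meta-host, which hides the underlying permission problem behind the assertion.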


I'll try (again) to get the DUT out of circulation.
Went in with the hammer: aborted caroline-paladin
https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-paladin/builds/303

Now I have till the next HWTest to make sure that the rogue DUT is really out of circulation.
Labels: -Pri-1 Pri-0
As  issue 732999  says, unlocking and then re-locking the DUT gets past this problem:

pprabhu@pprabhu:chromiumos$ ateatest host mod -u -r ' crbug.com/734701   crbug.com/734690 ' chromeos2-row8-rack1-host3                                                                            
ateatest: command not found
pprabhu@pprabhu:chromiumos$ atest host mod -u -r ' crbug.com/734701   crbug.com/734690 ' chromeos2-row8-rack1-host3                                                                               
Unlocked host: 
        chromeos2-row8-rack1-host3
pprabhu@pprabhu:chromiumos$ atest host mod -l -r ' crbug.com/734701   crbug.com/734690 ' chromeos2-row8-rack1-host3
Locked host: 
        chromeos2-row8-rack1-host3
pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3
Host                        Status        Shard                                  Locked  Lock Reason                        Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  True     crbug.com/734701   crbug.com/734690   pprabhu    caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2
pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3 -w chromeos-server98.mtv
Host                        Status        Shard                                  Locked  Lock Reason                        Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  True     crbug.com/734701   crbug.com/734690   pprabhu    caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2

The rogue test job is still running, but is less of a problem.
Status: Fixed (was: Started)
Latest caroline-pre-cq passed. I'll use  issue 734690  to follow up on what's wrong with that particular DUT.
