
Issue 692179

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 692342
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 691729
issue 692172
issue 692342




kevin-paladin failures plagued by flaky provisioning

Project Member Reported by adurbin@chromium.org, Feb 14 2017

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin

Feb 14 05:56	??	failure	#205	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-inline]
Feb 14 03:04	??	failure	#204	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-cq]
Feb 14 00:17	??	failure	#203	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-inline]
Feb 13 21:30	??	failure	#202	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-inline]
Feb 13 18:37	??	failure	#201	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-cq]
Feb 13 15:48	??	failure	#200	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-inline] failed hwtest [bvt-cq]
Feb 13 12:49	??	failure	#199	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-cq]
Feb 13 10:10	??	failure	#198	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-inline]
Feb 13 07:41	??	failure	#197	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-inline]
Feb 13 05:14	??	failure	#196	Failed steps failed cbuildbot [kevin-paladin] failed hwtest [bvt-inline]


#206
https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fkevin-paladin%2F206%2F%2B%2Frecipes%2Fsteps%2FHWTest__bvt-inline_%2F0%2Fstdout
  host: chromeos2-row8-rack8-host3, status: Ready, locked: False diagnosis: Working
  labels: ['board:kevin', 'arc', 'ec:cros', 'hw_video_acc_enc_vp8', 'audio_loopback_dongle', 'os:cros', 'power:battery', 'cts_abi_arm', 'webcam', 'hw_video_acc_enc_h264', 'hw_video_acc_vp8', 'hw_video_acc_h264', 'storage:mmc', 'kevin', 'internal_display', 'servo', 'phase:PVT', 'touchpad', 'variant:kevin', 'sku:kevin_rk3399_4Gb', 'touchscreen', 'bluetooth', 'pool:cq']
  Last 10 jobs within 1:48:00:
  59984564 Repair started on: 2017-02-14 10:04:59 status PASS
  59983986 Provision started on: 2017-02-14 09:25:39 status FAIL

I'm not sure what job this provision shows up under.

#204
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/204/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio

Some background job failed. I have no idea which one. All the hosts listed pass...

05:43:29: ERROR: BaseException in _RunParallelStages <class 'chromite.lib.failures_lib.StepFailure'>: 
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 440, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/generic_stages.py", line 629, in Run
    raise failures_lib.StepFailure()
StepFailure
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/cbuildbot/builders/generic_builders.py", line 118, in _RunParallelStages
    parallel.RunParallelSteps(steps)
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 677, in RunParallelSteps
    return [queue.get_nowait() for queue in queues]
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 674, in RunParallelSteps
    pass
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 560, in ParallelTasks
    raise BackgroundFailure(exc_infos=errors)
BackgroundFailure: <class 'chromite.lib.failures_lib.StepFailure'>: 
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 440, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/generic_stages.py", line 629, in Run
    raise failures_lib.StepFailure()
StepFailure

#203: 
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/203/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio

  host: chromeos2-row8-rack9-host14, status: Ready, locked: False diagnosis: Working
  labels: ['board:kevin', 'arc', 'hw_video_acc_enc_h264', 'hw_video_acc_enc_vp8', 'os:cros', 'power:battery', 'ec:cros', 'hw_video_acc_h264', 'servo', 'hw_video_acc_vp8', 'cts_abi_arm', 'storage:mmc', 'webcam', 'kevin', 'audio_loopback_dongle', 'internal_display', 'bluetooth', 'pool:cq', 'phase:PVT', 'touchpad', 'variant:kevin', 'sku:kevin_rk3399_4Gb', 'touchscreen', 'cros-version:kevin-paladin/R58-9280.0.0-rc2']
  Last 10 jobs within 1:48:00:
  59979040 Repair started on: 2017-02-14 02:42:58 status PASS
  59978033 Provision started on: 2017-02-14 01:08:35 status FAIL

I think the logs for that are here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101053570-chromeos-test/chromeos2-row8-rack9-host14/

But I have no idea how to correlate a job on a host with the metajobs. This host needs a devserver, but the devserver health check never succeeds.

The host chromeos2-row8-rack9-host14 (100.115.231.55) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.224.0:19
02/13 09:25:23.462 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.197 'curl "http://100.115.245.197:8082/check_health?"''
02/13 09:25:38.586 DEBUG|        dev_server:0892| Error occurred with exit_code 255 when executing the ssh call: ssh: connect to host 100.115.245.197 port 22: Connection timed out
.
02/13 09:25:38.589 WARNI|             retry:0221| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <ssh 100.115.245.197 'curl "http://100.115.245.197:8082/check_health?"'> failed, rc=255, Command returned non-zero exit status
* Command: 
    ssh 100.115.245.197 'curl "http://100.115.245.197:8082/check_health?"'
Exit status: 255
Duration: 15.0399751663
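
For anyone who wants to re-check that devserver by hand, here is a minimal sketch. It assumes the devserver can be reached over plain HTTP from wherever the script runs, which is not what the log above does (the log runs curl on the devserver itself through ssh, and it is the ssh connection that times out). The port 8082 and the `check_health?` path are taken from the log; the function names and the `opener` parameter are hypothetical.

```python
import urllib.error
import urllib.request


def health_url(host, port=8082):
    """Build the health-check URL seen in the provision log."""
    return f"http://{host}:{port}/check_health?"


def is_healthy(host, port=8082, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the devserver answers its health check with HTTP 200.

    Assumption: a direct HTTP probe is an acceptable stand-in for the
    ssh + curl call in the log. `opener` exists only so the network call
    can be swapped out (for example in a test).
    """
    try:
        with opener(health_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
```

Calling `is_healthy("100.115.245.197")` from a machine inside the lab network would be the equivalent of the failing check above. A False result only says the probe failed from that vantage point; it does not distinguish a dead devserver from a routing problem.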

#202:
https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fkevin-paladin%2F202%2F%2B%2Frecipes%2Fsteps%2FHWTest__bvt-inline_%2F0%2Fstdout

  host: chromeos2-row8-rack9-host14, status: Ready, locked: False diagnosis: Working
  labels: ['board:kevin', 'arc', 'hw_video_acc_enc_h264', 'hw_video_acc_enc_vp8', 'os:cros', 'power:battery', 'ec:cros', 'hw_video_acc_h264', 'servo', 'hw_video_acc_vp8', 'cts_abi_arm', 'storage:mmc', 'webcam', 'kevin', 'audio_loopback_dongle', 'internal_display', 'bluetooth', 'pool:cq', 'phase:PVT', 'touchpad', 'variant:kevin', 'sku:kevin_rk3399_4Gb', 'touchscreen', 'cros-version:kevin-paladin/R58-9280.0.0-rc2']
  Last 10 jobs within 1:48:00:
  59977262 Repair started on: 2017-02-13 23:56:02 status PASS
  59976051 Provision started on: 2017-02-13 22:25:42 status FAIL

Same host.  
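
Spotting the repeat offender by eye gets tedious across ten builds. As a rough sketch, the host status blocks pasted above can be tallied per host. The regular expressions are written against the exact line shapes quoted in this issue and are an assumption about the format in general; `provision_failures` is a hypothetical helper, not part of any lab tool.

```python
import re
from collections import Counter

# Matches lines such as
#   "  host: chromeos2-row8-rack9-host14, status: Ready, ..."
HOST_RE = re.compile(r"^\s*host:\s*(?P<host>[^,\s]+),")

# Matches lines such as
#   "  59978033 Provision started on: 2017-02-14 01:08:35 status FAIL"
JOB_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<kind>\w+) started on: "
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) status (?P<status>\w+)"
)


def provision_failures(text):
    """Count failed Provision jobs per host in pasted HWTest status output."""
    failures = Counter()
    host = None
    for line in text.splitlines():
        m = HOST_RE.match(line)
        if m:
            host = m.group("host")  # Following job lines belong to this host.
            continue
        m = JOB_RE.match(line)
        if m and host and m.group("kind") == "Provision" and m.group("status") == "FAIL":
            failures[host] += 1
    return failures


# Status blocks copied from builds #203 and #202 above (labels line omitted).
SAMPLE = """\
  host: chromeos2-row8-rack9-host14, status: Ready, locked: False diagnosis: Working
  Last 10 jobs within 1:48:00:
  59979040 Repair started on: 2017-02-14 02:42:58 status PASS
  59978033 Provision started on: 2017-02-14 01:08:35 status FAIL
  host: chromeos2-row8-rack9-host14, status: Ready, locked: False diagnosis: Working
  Last 10 jobs within 1:48:00:
  59977262 Repair started on: 2017-02-13 23:56:02 status PASS
  59976051 Provision started on: 2017-02-13 22:25:42 status FAIL
"""

print(provision_failures(SAMPLE).most_common())
# [('chromeos2-row8-rack9-host14', 2)]
```

Feeding it the concatenated stdio of several builds would rank hosts by failed provisions, which is enough to tell one bad DUT from a pool-wide problem.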


I really don't know how to get the details of the job for the actual host. If someone could teach me to fish better, I could dig out more info. I'm also not clear on what's triggering the failure for the whole paladin. Does a single host failing provisioning fail the whole run? And, if so, how do I find the provisioning logs for that specific host?
 
Blockedon: 692172 691729
dut-status is a good tool for looking at host history. You can get it here: https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/lab-tools

I believe we have traced this spate of kevin provision failures to issues with two DUTs, but I haven't ruled out that there may be other causes.

Comment 2 by aut...@google.com, Feb 15 2017

Status: Fixed (was: Untriaged)
Should be fixed, please re-open if still happening

Comment 3 by tfiga@chromium.org, Feb 15 2017

Cc: mnissler@chromium.org adurbin@chromium.org
Labels: -Pri-3 Pri-1
Status: Available (was: Fixed)
This still seems to be happening and has been blocking the CQ for two days.

Comment 4 by tfiga@chromium.org, Feb 15 2017

Owner: ejcaruso@chromium.org
Status: Assigned (was: Available)
master-paladin 13669 = kevin-paladin 213: chromeos2-row8-rack9-host1 fails
master-paladin 13668 = kevin-paladin 212: chromeos2-row8-rack9-host1 fails
master-paladin 13667 = kevin-paladin 211: chromeos2-row8-rack9-host1 fails
master-paladin 13666 = kevin-paladin 210: chromeos2-row8-rack9-host1 fails

For kevin-paladin 213:
59999883 Provision started on: 2017-02-15 03:47:08 status FAIL

What does that first identifier mean? Is it anything meaningful that I can use to link into the autotest results? It looks nothing like an autotest ID. I also don't know whether the failure times reported here are in the same timezone as the autotest results for jobs on that machine.

I'm assuming it's this job:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row8-rack9-host1/59999883-provision?pli=1

That job's logs just appear to fall off a cliff. Did the autotest job just blow up?

I guess I'll go look at the next ones, but there's not much to see unless someone would like to teach me to fish better.
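
For what it's worth, the guess above can be written down as a sketch: take the leading number and the task type from the status line and build the results path from them. The `hosts/<hostname>/<id>-<task>` layout is inferred from the single URL quoted in this comment, and lowercasing the task type for other kinds such as Repair is an untested assumption. `special_task_path` is a hypothetical helper.

```python
import re

# Matches the start of lines such as
#   "59999883 Provision started on: 2017-02-15 03:47:08 status FAIL"
STATUS_RE = re.compile(r"^\s*(?P<job_id>\d+)\s+(?P<kind>\w+) started on: ")

# Bucket name as it appears in the storage browser URL quoted above.
BUCKET = "chromeos-autotest-results"


def special_task_path(hostname, status_line):
    """Guess the results path for a per-host task from its status line.

    Assumption: the layout is hosts/<hostname>/<id>-<task type, lowercased>,
    inferred from one example in this issue.
    """
    m = STATUS_RE.match(status_line)
    if m is None:
        raise ValueError(f"unrecognized status line: {status_line!r}")
    return f"{BUCKET}/hosts/{hostname}/{m.group('job_id')}-{m.group('kind').lower()}"


print(special_task_path(
    "chromeos2-row8-rack9-host1",
    "59999883 Provision started on: 2017-02-15 03:47:08 status FAIL",
))
# chromeos-autotest-results/hosts/chromeos2-row8-rack9-host1/59999883-provision
```

This only reproduces the path already found by hand. It says nothing about whether the logs at that path are complete or which timezone the timestamps use.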
master-paladin 13664 = kevin-paladin 208: chromeos2-row8-rack9-host1 fails
master-paladin 13663 = kevin-paladin 207: passes; chromeos2-row8-rack9-host1 isn't used.
Same results for kevin-paladin 208. Logs are clipped. No other information about the error. Is there some magic where this kicks off another job that I can't find the link to?
master-paladin 13670 = kevin-paladin 214: chromeos2-row8-rack9-host1 fails

How am I supposed to kick this device out of the pool?
Cc: shuqianz@chromium.org
Blockedon: 692342
Owner: xixuan@chromium.org
This device was locked by xixuan@. Assigning over there.
Mergedinto: 692342
Status: Duplicate (was: Assigned)
