
Issue 771257

Starred by 2 users

Issue metadata

Status: Closed
Owner: ----
Closed: Dec 13
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 771236




wizpig-paladin failing often in hwtest (5/7 fails from 1093--1099)

Project Member Reported by djkurtz@chromium.org, Oct 3 2017

Issue description

wizpig-paladin failed 5 of the last six builds (1099, 1097, 1096, 1095, 1093), each time with the message:
"** HWTest did not complete due to infrastructure issues (code 3) **"

Each time the suite timed out during bvt-inline:

Suite timed out. Started on 10-03-2017 [09:27:59], timed out on 10-03-2017 [11:00:01]
10-03-2017 [11:00:01] Start collecting test results and dump them to json.
Suite job                               [ FAILED ]
Suite job                                 ABORT: 

There appear to be 8 DUTs in pool:cq:

  host: chromeos2-row8-rack8-host12, status: Ready, locked: False diagnosis: Working 
  host: chromeos2-row8-rack8-host17, status: Ready, locked: False diagnosis: Working
  host: chromeos2-row8-rack8-host18, status: Ready, locked: False diagnosis: Working
  host: chromeos6-row2-rack20-host4, status: Repairing, locked: False diagnosis: Unused
  host: chromeos6-row2-rack20-host20, status: Repairing, locked: False diagnosis: Failed repair
  host: chromeos6-row2-rack20-host22, status: Ready, locked: False diagnosis: Working
  host: chromeos6-row2-rack19-host1, status: Ready, locked: False diagnosis: Working
  host: chromeos6-row2-rack19-host3, status: Ready, locked: False diagnosis: Working


However, looking at the test reports, 3 of them never seem to run a test successfully (or ran just a single test, in -host22's case):

10-03-2017 [11:02:51] Attempting to display pool info: cq
...
  host: chromeos6-row2-rack20-host4, status: Repairing, locked: False diagnosis: Unused
  labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'audio_loopback_dongle', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', '4k_video', 'model:wizpig']
  Last 10 jobs within 1:48:00:
  735094 Provision started on: 2017-10-03 09:30:16 status FAIL
  
  host: chromeos6-row2-rack20-host20, status: Repairing, locked: False diagnosis: Failed repair
  labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'audio_loopback_dongle', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', '4k_video', 'model:wizpig']
  Last 10 jobs within 1:48:00:
  735164 Verify started on: 2017-10-03 09:39:48 status FAIL
  
  host: chromeos6-row2-rack20-host22, status: Ready, locked: False diagnosis: Working
  labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'audio_loopback_dongle', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'cros-version:wizpig-paladin/R63-9998.0.0-rc2', '4k_video']
  Last 10 jobs within 1:48:00:
  147013555 wizpig-paladin/R63-9998.0.0-rc2/suite_attr_wrapper/cheets_StartAndroid.stress.0 started on: 2017-10-03 10:41:25 status Completed
  735091 Provision started on: 2017-10-03 09:30:16 status PASS
  

 
Note: even for the one build that passed (1098), the 3 DUTs at chromeos6-row2-rack20 didn't run tests, but HWTest still managed to finish in ~53 minutes:
Testing started at 2017-10-03 06:45:49
Testing ended at 2017-10-03 07:37:20


https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/1098/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio

host: chromeos6-row2-rack20-host4, status: Ready, locked: False diagnosis: Working
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'audio_loopback_dongle', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', '4k_video', 'model:wizpig']
Last 10 jobs within 1:48:00:

host: chromeos6-row2-rack20-host20, status: Repairing, locked: False diagnosis: Failed repair
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'audio_loopback_dongle', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', '4k_video', 'model:wizpig']
Last 10 jobs within 1:48:00:
733899 Verify started on: 2017-10-03 06:09:01 status FAIL

host: chromeos6-row2-rack20-host22, status: Ready, locked: False diagnosis: Working
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'audio_loopback_dongle', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'cros-version:wizpig-paladin/R63-9998.0.0-rc1']
Last 10 jobs within 1:48:00:
146985372 wizpig-paladin/R63-9998.0.0-rc1/bvt-inline/cheets_StartAndroid.stress.0 started on: 2017-10-03 07:26:51 status Completed
734057 Provision started on: 2017-10-03 06:26:52 status PASS


There are a number of devices that seem to be failing provision/repair repeatedly.  Sheriffs should also be trying to identify whether a bad CL is causing all of these issues.  Since the problems seem localized to the chromeos6-row2-rack20 DUTs (backed up by dut-status), I'm guessing that something else is going on specific to that network.

This might be related to issue 770806.
dut-status  -b wizpig -p cq
hostname                       S   last checked         URL
chromeos2-row8-rack8-host12    OK  2017-10-03 13:06:24  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host12/736330-reset/
chromeos2-row8-rack8-host17    OK  2017-10-03 13:07:12  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host17/736333-reset/
chromeos2-row8-rack8-host18    OK  2017-10-03 13:06:33  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host18/736331-reset/
chromeos6-row2-rack20-host4    OK  2017-10-03 05:07:55  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/733637-repair/
chromeos6-row2-rack20-host20   NO  2017-10-03 10:34:07  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/735562-repair/
chromeos6-row2-rack20-host22   OK  2017-10-03 09:30:16  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/735091-provision/
chromeos6-row2-rack19-host1    OK  2017-10-03 12:24:00  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host1/736113-provision/
chromeos6-row2-rack19-host3    OK  2017-10-03 12:24:00  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host3/736111-provision/

Host chromeos6-row2-rack20-host20 is continuously failing Verify/Repair:

chromeos6-row2-rack20-host20
    2017-10-03 13:10:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/736354-verify/
    2017-10-03 10:34:07  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/735562-repair/
    2017-10-03 09:39:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/735164-verify/
    2017-10-03 07:25:40  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/734462-repair/
    2017-10-03 06:09:01  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/733899-verify/
    2017-10-03 03:20:02  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/733122-repair/
    2017-10-03 02:08:16  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/732893-verify/
    2017-10-02 23:20:26  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/732247-repair/
    2017-10-02 21:48:46  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/731795-provision/
    2017-10-02 17:16:26  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/730468-repair/
    2017-10-02 15:46:47  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/729797-provision/


Host chromeos6-row2-rack20-host4 also started failing Provision (the '--' entries?), though it seems to have worked before 2017-10-03 00:46:30:

chromeos6-row2-rack20-host4
    2017-10-03 10:59:57  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/735672-repair/
    2017-10-03 09:30:16  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/735094-provision/
    2017-10-03 05:07:55  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/733637-repair/
    2017-10-03 04:21:28  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/733469-provision/
    2017-10-03 02:18:18  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/732929-repair/
    2017-10-03 00:46:30  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/732505-provision/
    2017-10-02 23:00:58  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146542520-chromeos-test/
    2017-10-02 21:48:46  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/731794-provision/
    2017-10-02 20:13:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146523567-chromeos-test/
    2017-10-02 18:48:15  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/730861-provision/
    2017-10-02 17:03:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146496980-chromeos-test/
    2017-10-02 15:46:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/729799-provision/

DUT chromeos6-row2-rack20-host22 seems to be behaving better than the others; it's just that it has its tests aborted occasionally (I guess when the suite times out).

chromeos6-row2-rack20-host22
    2017-10-03 13:47:53  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146701813-chromeos-test/
    2017-10-03 13:47:31  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/736518-reset/
    2017-10-03 13:45:21  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146701412-chromeos-test/
    2017-10-03 13:44:52  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/736502-reset/
    2017-10-03 13:38:44  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146693167-chromeos-test/
    2017-10-03 12:24:00  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/736114-provision/
    2017-10-03 10:41:25  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146648690-chromeos-test/
    2017-10-03 09:30:16  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/735091-provision/
    2017-10-03 07:26:51  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146620539-chromeos-test/
    2017-10-03 06:26:52  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/734057-provision/
    2017-10-03 05:02:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146602164-chromeos-test/
    2017-10-03 03:37:28  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/733190-provision/
    2017-10-03 02:18:21  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/732930-cleanup/
    2017-10-03 02:17:21  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146583706-chromeos-test/
    2017-10-03 00:46:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/732509-provision/
    2017-10-02 22:59:20  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146542510-chromeos-test/
    2017-10-02 21:48:46  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/731792-provision/
    2017-10-02 19:54:59  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146523566-chromeos-test/
    2017-10-02 18:48:12  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/730860-provision/
    2017-10-02 17:16:42  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/730469-cleanup/
    2017-10-02 17:15:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146496975-chromeos-test/
    2017-10-02 15:46:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/729794-provision/

Cc: johndhong@chromium.org
Checking 736638-repair failure for chromeos6-row2-rack20-host20, I see a lot of messages about servo failing:

10/03 14:20:07.081 INFO |            repair:0327| Verifying this condition: servod service is taking calls
10/03 14:20:07.081 DEBUG|      abstract_ssh:0892| Full tunnel command: /usr/bin/ssh -a -x -n -N -q -L 43598:localhost:9991 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack20-labstation2
10/03 14:20:07.125 DEBUG|      abstract_ssh:0900| Started ssh tunnel, local = 43598 remote = 9991, pid = 7512
10/03 14:20:07.126 INFO |rpc_server_tracker:0196| Waiting 60 seconds for XMLRPC server to start.
10/03 14:20:07.127 WARNI|             retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
10/03 14:20:07.128 WARNI|             retry:0183| Retrying in 1.450435 seconds...
10/03 14:20:08.586 INFO |rpc_server_tracker:0208| XMLRPC server started successfully.
10/03 14:21:08.587 ERROR|            repair:0332| Failed: servod service is taking calls
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 211, in verify
    host.connect_servo()
  File "/usr/local/autotest/server/hosts/servo_host.py", line 136, in connect_servo
    'Servo initialize timed out.')
AutoservVerifyError: Servo initialize timed out.


----------------------


10/03 14:21:32.928 DEBUG|      abstract_ssh:0892| Full tunnel command: /usr/bin/ssh -a -x -n -N -q -L 38339:localhost:9991 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack20-labstation2
10/03 14:21:32.988 DEBUG|      abstract_ssh:0900| Started ssh tunnel, local = 38339 remote = 9991, pid = 22566
10/03 14:21:32.989 INFO |rpc_server_tracker:0196| Waiting 60 seconds for XMLRPC server to start.
10/03 14:21:32.990 WARNI|             retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
10/03 14:21:32.991 WARNI|             retry:0183| Retrying in 1.405245 seconds...
10/03 14:22:33.403 ERROR|rpc_server_tracker:0201| Failed to start XMLRPC server.
10/03 14:22:33.404 DEBUG|      abstract_ssh:0915| Terminated tunnel, pid 22566
10/03 14:22:33.404 ERROR|            repair:0332| Failed: servod service is taking calls
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 211, in verify
    host.connect_servo()
  File "/usr/local/autotest/server/hosts/servo_host.py", line 130, in connect_servo
    servo_obj = servo.Servo(servo_host=self, servo_serial=self.servo_serial)
  File "/usr/local/autotest/server/cros/servo/servo.py", line 176, in __init__
    self._server = servo_host.get_servod_server_proxy()
  File "/usr/local/autotest/server/hosts/servo_host.py", line 183, in get_servod_server_proxy
    timeout_seconds=60)
  File "/usr/local/autotest/server/hosts/rpc_server_tracker.py", line 197, in xmlrpc_connect
    ready_test()
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 246, in func_retry
    raise error.TimeoutException(exception_message)
TimeoutException: retry exception (function="ready_test()"), timeout = 60s
10/03 14:22:33.405 INFO |        server_job:0214| 		FAIL	----	verify.servod	timestamp=1507065753	localtime=Oct 03 14:22:33	retry exception (function="ready_test()"), timeout = 60s
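
(Schematically, the verify step amounts to: open an ssh tunnel to the servo host, then poll servod's XML-RPC endpoint until it answers or a 60-second deadline passes. A minimal sketch of that readiness loop follows; this is illustrative, not the actual autotest code, and the probe RPC is an assumption:)

    import time
    import xmlrpc.client

    def wait_for_servod(local_port, deadline_s=60.0):
        # Reach servod through the ssh tunnel at localhost:<local_port>
        # (the tunnel maps to port 9991 on the labstation, per the log).
        proxy = xmlrpc.client.ServerProxy('http://localhost:%d' % local_port)
        end = time.time() + deadline_s
        delay = 1.0
        while time.time() < end:
            try:
                # Any cheap RPC works as a readiness probe; listMethods
                # assumes the server has introspection enabled.
                proxy.system.listMethods()
                return proxy
            except OSError:          # e.g. [Errno 111] Connection refused
                time.sleep(delay)    # retry with a short backoff, as above
                delay = min(delay * 1.5, 5.0)
        raise TimeoutError('servod did not answer within %.0fs' % deadline_s)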


-----------------------------

10/03 14:22:34.655 INFO |        dev_server:0707| Getting devservers for host: chromeos6-row2-rack20-labstation2
10/03 14:22:34.694 DEBUG|        dev_server:0728| The host chromeos6-row2-rack20-labstation2 (100.115.172.156) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:18.
10/03 14:22:34.695 DEBUG|        dev_server:0679| Pick one healthy devserver from ['http://100.115.185.226:8082', 'http://100.115.185.227:8082', 'http://100.115.185.228:8082']


-----------------------------


10/03 14:22:44.334 ERROR|             utils:0280| [stderr] [1003/142243:INFO:update_engine_client.cc(493)] Querying Update Engine status...
10/03 14:22:44.340 DEBUG|             utils:0280| [stdout] CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
10/03 14:22:44.507 INFO |        servo_host:0458| servo host has the following duts: ['chromeos6-row2-rack20-host4', 'chromeos6-row2-rack20-host2', 'chromeos6-row2-rack20-host10', 'chromeos6-row2-rack20-host8', 'chromeos6-row2-rack20-host14', 'chromeos6-row2-rack20-host12', 'chromeos6-row2-rack20-host6', 'chromeos6-row2-rack20-host16', 'chromeos6-row2-rack20-host20', 'chromeos6-row2-rack20-host22', 'chromeos6-row2-rack20-host18']
10/03 14:22:44.508 INFO |        servo_host:0460| servo host has multiple duts, scheduling synchronized reboot

10/03 14:22:47.928 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/appengine/cq_stats/build_annotations for control files.

10/03 14:22:48.012 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/.git/objects/d8 for control files.

// It then spends almost 2 hours(!) spewing 27,595(!) more lines like "Unable to search directory ... for control files."...
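
(The repeated chromite/venv/chromite/venv/... segments above suggest the getter is following a symlink cycle while scanning for control files; that would also explain the "servohost_Reboot is not unique" error below, since the same control file gets discovered along multiple paths. A minimal sketch of a cycle-safe scan, assuming control files are simply files named control*; this is not the autotest implementation:)

    import os

    def find_control_files(root):
        # Don't follow symlinked directories, so a self-referencing link
        # (e.g. venv -> ..) can't recurse forever or yield duplicates.
        found = []
        for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
            # followlinks=False already prevents descending into symlinks;
            # prune them from dirnames too so they're never considered.
            dirnames[:] = [d for d in dirnames
                           if not os.path.islink(os.path.join(dirpath, d))]
            found.extend(os.path.join(dirpath, f)
                         for f in filenames if f.startswith('control'))
        return found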

10/03 16:31:40.080 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_141187676_1505126915_24130 for control files.
10/03 16:31:40.080 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_141027216_1505079083_13467 for control files.
10/03 16:31:40.080 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146418336_1506974725_13109 for control files.
10/03 16:31:40.081 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146476177_1506979022_4331 for control files.
10/03 16:31:40.081 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146637435_1507047945_15206 for control files.
10/03 16:31:40.081 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146396008_1506978668_9413 for control files.
10/03 16:31:40.711 ERROR|            repair:0332| Failed: servo host software is up-to-date
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 28, in verify
    host.update_image(wait_for_update=False)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 550, in update_image
    status, current_build_number = self._check_for_reboot(updater)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 462, in _check_for_reboot
    self.schedule_synchronized_reboot(dut_list, afe)
  File "/usr/local/autotest/server/hosts/servo_host.py", line 421, in schedule_synchronized_reboot
    control_file = getter.get_control_file_contents_by_name(test)
  File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 152, in get_control_file_contents_by_name
    path = self.get_control_file_path(test_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 136, in get_control_file_path
    raise error.ControlFileNotFound(test_name + ' is not unique.')
ControlFileNotFound: servohost_Reboot is not unique.
10/03 16:31:40.712 INFO |        server_job:0214| 	FAIL	----	verify.update	timestamp=1507073500	localtime=Oct 03 16:31:40	servohost_Reboot is not unique.

...


10/03 16:41:27.280 ERROR|            repair:0332| Failed: host is available via ssh
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/repair.py", line 37, in verify
    raise hosts.AutoservVerifyError(msg % host.hostname)
AutoservVerifyError: No answer to ping from chromeos6-row2-rack20-host20


.....

10/03 16:41:27.280 ERROR|            repair:0332| Failed: host is available via ssh
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/repair.py", line 37, in verify
    raise hosts.AutoservVerifyError(msg % host.hostname)
AutoservVerifyError: No answer to ping from chromeos6-row2-rack20-host20

...

10/03 16:44:25.895 INFO |     ssh_multiplex:0107| Timed out waiting for master-ssh connection to be established.
10/03 16:45:29.323 ERROR|             utils:0280| [stderr] ssh: connect to host chromeos6-row2-rack20-host20 port 22: Connection timed out
10/03 16:45:30.327 DEBUG|      abstract_ssh:0632| Host chromeos6-row2-rack20-host20 is still down after waiting 227 seconds
10/03 16:45:30.327 ERROR|            repair:0449| Repair failed: Power cycle the host with RPM
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/repair.py", line 96, in repair
    host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 is still offline after powercycling
10/03 16:45:30.328 INFO |        server_job:0214| 		FAIL	----	repair.rpm	timestamp=1507074330	localtime=Oct 03 16:45:30	chromeos6-row2-rack20-host20 is still offline after powercycling
10/03 16:45:30.329 INFO |        server_job:0214| 	END FAIL	----	repair.rpm	timestamp=1507074330	localtime=Oct 03 16:45:30	

...

10/03 16:45:30.329 INFO |            repair:0105| Attempting this repair action: Reset the DUT via keyboard sysrq-x
10/03 16:45:30.329 DEBUG|            repair:0106| Repairing because these triggers failed:
10/03 16:45:30.329 DEBUG|            repair:0108|     host is available via ssh
10/03 16:45:30.330 INFO |        server_job:0214| 	START	----	repair.sysrq	timestamp=1507074330	localtime=Oct 03 16:45:30	
10/03 16:45:30.330 ERROR|            repair:0449| Repair failed: Reset the DUT via keyboard sysrq-x
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 394, in repair
    '%s has no servo support.' % host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 has no servo support.
10/03 16:45:30.331 INFO |        server_job:0214| 		FAIL	----	repair.sysrq	timestamp=1507074330	localtime=Oct 03 16:45:30	chromeos6-row2-rack20-host20 has no servo support.

...

10/03 16:45:30.332 INFO |        server_job:0214| 	START	----	repair.servoreset	timestamp=1507074330	localtime=Oct 03 16:45:30	
10/03 16:45:30.332 ERROR|            repair:0449| Repair failed: Reset the DUT via servo
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 419, in repair
    '%s has no servo support.' % host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 has no servo support.
10/03 16:45:30.333 INFO |        server_job:0214| 		FAIL	----	repair.servoreset	timestamp=1507074330	localtime=Oct 03 16:45:30	chromeos6-row2-rack20-host20 has no servo support.

...

10/03 16:45:30.341 INFO |        server_job:0214| 	START	----	repair.usb	timestamp=1507074330	localtime=Oct 03 16:45:30	
10/03 16:45:30.341 ERROR|            repair:0449| Repair failed: Reinstall from USB using servo
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 491, in repair
    '%s has no servo support.' % host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 has no servo support.


...

10/03 16:45:30.345 ERROR|            repair:0043| Repair failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 38, in repair
    target.repair()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1216, in repair
    self._repair_strategy.repair(self)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 703, in repair
    self._verify_root._verify_host(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 326, in _verify_host
    self._verify_dependencies(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 199, in _verify_dependencies
    self._verify_list(host, self._dependency_list, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 188, in _verify_list
    raise AutoservVerifyDependencyError(self, failures)
AutoservVerifyDependencyError: No answer to ping from chromeos6-row2-rack20-host20

Blockedon: 771236
Possibly related to issue 771236?
Was chromeos6-row2-rack20-host20 replaced by chromeos2-row8-rack8-host10 in pool:cq?

dut-status  -b wizpig -p cq
hostname                       S   last checked         URL
chromeos2-row8-rack8-host12    OK  2017-10-04 14:42:04  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host12/743781-reset/
chromeos2-row8-rack8-host10    OK  2017-10-04 14:48:35  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/743806-repair/
chromeos2-row8-rack8-host17    OK  2017-10-04 14:43:22  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host17/743789-reset/
chromeos2-row8-rack8-host18    OK  2017-10-04 14:51:13  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host18/743819-reset/
chromeos6-row2-rack20-host4    OK  2017-10-04 14:08:18  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/743598-provision/
chromeos6-row2-rack20-host22   OK  2017-10-04 14:08:18  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/743593-provision/
chromeos6-row2-rack19-host1    OK  2017-10-04 14:42:01  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host1/743779-reset/
chromeos6-row2-rack19-host3    OK  2017-10-04 14:41:43  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host3/743776-reset/

We rebalance the pools on a regular basis, so probably. Rebalancing moves bad/locked DUTs into the suites pool (which is our holding ground), and good DUTs from suites into the pool.
So... what will stop the next rebalance from moving host20 back into pool:cq?
Not sure if it is a root cause or a symptom, but I spawned issue 771823 about the 2 hours of "Unable to search directory ... for control files." spewed during verify.update.


> So... what will stop the next rebalance from moving host20 back into pool:cq?

In balancing, the direction of movement is
  "bad devices" -> suites
  "good devices" -> pool being balanced

If a device isn't working, it won't go back into pool:cq.  Also, nothing
changes unless at least one device in pool:cq (or pool:bvt) is broken.
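
(To pin down that invariant, a minimal sketch of the direction rule, with made-up names; the real balancer is more involved:)

    def rebalance(pool_cq, pool_suites, target_size):
        # Nothing changes unless at least one device in the pool is broken.
        broken = [d for d in pool_cq if not d.working]
        if not broken:
            return
        for dut in broken:                 # "bad devices" -> suites
            pool_cq.remove(dut)
            pool_suites.append(dut)
        spares = [d for d in pool_suites if d.working]
        while len(pool_cq) < target_size and spares:
            dut = spares.pop()             # "good devices" -> balanced pool
            pool_suites.remove(dut)
            pool_cq.append(dut)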

Something is really funny with wizpig-paladin; it passed 3/4 runs (1106-1109), but has now been purple for 7 straight runs (1110-1116):

Each time with a message like the following in HWTest bvt-inline:

https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/1116/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio


@@@BUILD_STEP@HWTest [bvt-inline]@@@
************************************************************
@@@STEP_LINK@stdout-->stdio@https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fwizpig-paladin%2F1116%2F%2B%2Frecipes%2Fsteps%2FHWTest__bvt-inline_%2F0%2Fstdout@@@
** Start Stage HWTest [bvt-inline] - Thu, 05 Oct 2017 09:06:39 -0700 (PDT)
** 
** Stage that runs tests in the Autotest lab.
************************************************************
09:06:39: INFO: Created cidb engine bot@130.211.191.11 for pid 8295
09:06:39: INFO: Running cidb query on pid 8295, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f91be6e0610>
09:06:40: INFO: Waiting up to forever for payloads and test artifacts ...
Preconditions for the stage successfully met. Beginning to execute stage...
09:12:52: INFO: Running cidb query on pid 8295, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f91be4f2990>
09:12:52: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpM6r1aL/tmplspkpx/temp_summary.json --raw-cmd --task-name wizpig-paladin/R63-10004.0.0-rc3-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:wizpig-paladin/R63-10004.0.0-rc3' '--tags=task_name:wizpig-paladin/R63-10004.0.0-rc3-bvt-inline' '--tags=board:wizpig' -- /usr/local/autotest/site_utils/run_suite.py --build wizpig-paladin/R63-10004.0.0-rc3 --board wizpig --suite_name bvt-inline --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 58200904L, 'cidb_build_id': 1916954, 'datastore_parent_key': ('Build', 1916954, 'BuildStage', 58200904L)}" -c
Autotest instance created: cautotest-prod
10-05-2017 [09:12:55] Submitted create_suite_job rpc
10-05-2017 [09:13:07] Created suite job: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147078382
@@@STEP_LINK@Link to suite@http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147078382@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
09:13:09: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpM6r1aL/tmp3K7K1I/temp_summary.json --raw-cmd --task-name wizpig-paladin/R63-10004.0.0-rc3-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:wizpig-paladin/R63-10004.0.0-rc3' '--tags=task_name:wizpig-paladin/R63-10004.0.0-rc3-bvt-inline' '--tags=board:wizpig' -- /usr/local/autotest/site_utils/run_suite.py --build wizpig-paladin/R63-10004.0.0-rc3 --board wizpig --suite_name bvt-inline --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 58200904L, 'cidb_build_id': 1916954, 'datastore_parent_key': ('Build', 1916954, 'BuildStage', 58200904L)}" -m 147078382
09:18:09: INFO: Refreshing due to a 401 (attempt 1/2)
09:18:09: INFO: Refreshing access_token
09:21:08: INFO: Refreshing due to a 401 (attempt 1/2)
09:21:08: INFO: Refreshing access_token
09:23:14: INFO: RetriableHttp: attempt 1 receiving status 504, will retry
10:14:45: INFO: Refreshing due to a 401 (attempt 1/2)
10:14:45: INFO: Refreshing access_token
10:18:13: INFO: Refreshing due to a 401 (attempt 1/2)
10:18:13: INFO: Refreshing access_token
10:21:20: INFO: Refreshing due to a 401 (attempt 1/2)
10:21:20: INFO: Refreshing access_token


--------------------------------------------------------------------------------
started: Thu Oct  5 09:17:33 2017
ended: Thu Oct  5 10:39:26 2017
duration: 1 hrs, 21 mins, 53 secs
status: EXCEPTION
status reason: step was interrupted.
Re: #12, I think that's just Nxia@'s logic to self-destruct the CQ once we know it's bad. The "Refreshing due to a 401" is just logspam from ts_mon, which we should remove (I opened crbug.com/772157 to track that).
The last purple build was 1123, with a log a lot like #12:
https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/1123

This time I noticed there is a link there to the suite [0], and the logs [1] show that the suite actually ran for quite a while;
in fact 63 tests ran successfully (END GOOD), but the suite was eventually aborted, as shown below:
[0] http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147252369
[1] https://storage.cloud.google.com/chromeos-autotest-results/147252369-chromeos-test/hostless/debug/autoserv.DEBUG?_ga=2.191751148.-734044362.1501703718


10/06 05:01:41.505 INFO |        server_job:0214| INFO	----	Start bvt-inline	timestamp=1507291301	localtime=Oct 06 05:01:41	
...
10/06 05:02:14.805 DEBUG|             suite:1172| Scheduled 58 tests, writing the total to keyval.

...

10/06 06:35:52.209 INFO |        server_job:0214| START	----	wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken	timestamp=1507296952	localtime=Oct 06 06:35:52	
10/06 06:35:52.210 INFO |        server_job:0214| 	ABORT	----	wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken	timestamp=1507296952	localtime=Oct 06 06:35:52	
10/06 06:35:52.211 INFO |        server_job:0214| END ABORT	----	wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken	timestamp=1507296952	localtime=Oct 06 06:35:52	


So the ABORT looks like it happens because the suite hit its 90-minute timeout.
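
(run_suite was invoked with --timeout_mins 90, per the swarming command in #12, so once the deadline passes, still-pending tests get recorded as ABORT instead of being scheduled. Roughly, with hypothetical names:)

    import time

    def drain_suite(pending_tests, timeout_mins=90):
        # Anything still pending when the suite deadline passes is
        # ABORTed, like login_OwnershipRetaken above.
        deadline = time.time() + timeout_mins * 60
        for test in pending_tests:
            if time.time() >= deadline:
                print('ABORT\t%s' % test.name)
                continue
            test.run()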

Checking back a bit, we can see a spot where the suite stalled for 37 of those minutes:

10/06 05:55:49.851 DEBUG|             suite:1365| Adding job keyval for platform_DMVerityBitCorruption.middle=147260586-chromeos-test
10/06 05:55:49.852 INFO |        server_job:0214| START	147260586-chromeos-test/chromeos2-row8-rack8-host17/platform_DMVerityBitCorruption.last	platform_DMVerityBitCorruption.last	timestamp=1507294460	localtime=Oct 06 05:54:20	
10/06 05:55:49.852 INFO |        server_job:0214| 	GOOD	147260586-chromeos-test/chromeos2-row8-rack8-host17/platform_DMVerityBitCorruption.last	platform_DMVerityBitCorruption.last	timestamp=1507294482	localtime=Oct 06 05:54:42	completed successfully
10/06 05:55:49.853 INFO |        server_job:0214| END GOOD	147260586-chromeos-test/chromeos2-row8-rack8-host17/platform_DMVerityBitCorruption.last	platform_DMVerityBitCorruption.last	timestamp=1507294482	localtime=Oct 06 05:54:42	
10/06 05:55:49.853 DEBUG|             suite:1365| Adding job keyval for platform_DMVerityBitCorruption.last=147260586-chromeos-test

<-----  right here is 37 minutes of silence, followed by a suspicious "MySQL server has gone away" message ----->

10/06 06:32:58.339 ERROR|                db:0023| 06:32:58 10/06/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
10/06 06:33:39.681 INFO |        server_job:0214| START	147252419-chromeos-test/chromeos6-row2-rack20-host4/graphics_GLMark2	graphics_GLMark2	timestamp=1507296636	localtime=Oct 06 06:30:36	
10/06 06:33:39.682 INFO |        server_job:0214| 	GOOD	147252419-chromeos-test/chromeos6-row2-rack20-host4/graphics_GLMark2	graphics_GLMark2	timestamp=1507296656	localtime=Oct 06 06:30:56	completed successfully
10/06 06:33:39.683 INFO |        server_job:0214| END GOOD	147252419-chromeos-test/chromeos6-row2-rack20-host4/graphics_GLMark2	graphics_GLMark2	timestamp=1507296656	localtime=Oct 06 06:30:56	
10/06 06:33:39.683 DEBUG|             suite:1365| Adding job keyval for graphics_GLMark2=147252419-chromeos-test
10/06 06:35:52.209 INFO |        server_job:0214| START	----	wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken	timestamp=1507296952	localtime=Oct 06 06:35:52	
10/06 06:35:52.210 INFO |        server_job:0214| 	ABORT	----	wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken	timestamp=1507296952	localtime=Oct 06 06:35:52	
10/06 06:35:52.211 INFO |        server_job:0214| END ABORT	----	wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken	timestamp=1507296952	localtime=Oct 06 06:35:52	
10/06 06:36:30.840 DEBUG|     dynamic_suite:0623| Finished waiting on suite. Returning from _perform_reimage_and_run.

Status: Verified (was: Available)
From chat, akeshet stated that if a builder is marked EXPERIMENTAL and it takes longer than the other slaves, then the master won't wait for it and it will get aborted.  Thus, the only way to know whether wizpig is fixed well enough to pass is to mark it important again (builds 1126 & 1127).

Closing this bug.
Status: Assigned (was: Verified)
This is still broken with the same symptoms. The last 3 CQ runs failed on wizpig-paladin HWTest:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16549
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16550
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16551

Over the weekend and the past few days there have been many more such failures: https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/

I'll follow up on this later today if phobbs doesn't get to it.
Owner: djkurtz@chromium.org
This builder should be marked as experimental if it's failing this often. Looks like it got un-marked as such by djkurtz.

https://chromiumos-status.appspot.com/
> This builder should be marked as experimental if it's failing this
> often. Looks like it got un-marked as such by djkurtz.

Wait.  Before we do that, we have to prove that the recent failures
aren't a bad CL.  Otherwise, we'll let bad code in, and then _lots_
of stuff will go bad in the lab.

Cc: jclinton@chromium.org
I did a spot-check of recent repairs in wizpig CQ, and compared
them to recent repairs in the BVT.  In both cases, there were
more repairs than I expected.  In both cases, there's a mixture
of "the DUT was offline" and "the failed update but stayed online".

The frequency of "offline" is higher in the CQ, and there were
reports of kernel crashes there.  So, there could have been bad CLs
in the CQ.  Someone should look at the blamelists.

Cc: haoweiw@chromium.org englab-sys-cros@google.com
Owner: pho...@chromium.org
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1178 -> infra DNS flake
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1171 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1170 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1169 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1148 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1143 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1136 -> infra network flake: Connection timed out

I don't see any evidence that a CL is to blame.

Owner: dgarr...@chromium.org
"Connection timed out" CAN be caused by bugs in the product.

It's normally generated when we are unable to reconnect to a DUT after rebooting it. It can also come from all kinds of other causes: bad DUT hardware, overloaded devserver network interfaces, DNS errors, etc.
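
(Concretely, the post-reboot check amounts to waiting for the DUT's ssh port to answer again, along the lines of the sketch below; the 227-second figure is just the example from the repair log earlier in this bug:)

    import socket
    import time

    def wait_for_ssh(hostname, port=22, timeout_s=227, poll_s=5):
        # Return True once the DUT accepts TCP connections on its ssh
        # port again; False if it's still down after timeout_s.
        end = time.time() + timeout_s
        while time.time() < end:
            try:
                with socket.create_connection((hostname, port), timeout=poll_s):
                    return True
            except OSError:
                time.sleep(poll_s)
        return False
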
phobbs@ is OOO

Spot-checking these, it looks like most (all?) of these are failures of the same DUT, which seems to be stuck in a provision fail-loop.

$ dut-status -f chromeos2-row8-rack8-host10
chromeos2-row8-rack8-host10
    2017-10-12 15:17:06  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/800032-repair/
    2017-10-12 14:55:30  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799852-provision/
    2017-10-12 14:34:49  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799713-repair/
    2017-10-12 14:29:33  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799680-verify/
    2017-10-12 13:26:37  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799462-repair/
    2017-10-12 12:55:33  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799183-provision/
    2017-10-12 12:53:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799141-repair/
    2017-10-12 12:39:47  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799041-provision/
    2017-10-12 09:54:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/798137-repair/
    2017-10-12 09:38:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/798039-provision/
    2017-10-12 07:23:09  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/797458-repair/
    2017-10-12 06:48:39  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/797272-provision/
    2017-10-12 04:36:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/796728-repair/
    2017-10-12 04:00:03  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/796426-provision/
    2017-10-12 01:28:44  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/795684-repair/
    2017-10-12 01:11:58  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/795613-provision/
    2017-10-11 22:27:18  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/794861-repair/
    2017-10-11 22:05:15  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/794688-provision/
    2017-10-11 20:13:51  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/794080-repair/
    2017-10-11 19:40:02  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/793785-provision/
    2017-10-11 17:58:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/793151-repair/
Locking DUT for investigation.


I see some sshl crashdumps in at least one of those repair jobs. This might explain the DUT being unreachable after boot.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row8-rack8-host10/799713-repair/20171210143448/crashinfo.chromeos2-row8-rack8-host10/
Owner: jclinton@chromium.org
This looks like the same issue as Issue 766211, which was also afflicting wizpig.

I suggest this is a product bug, returning to the sheriff.
by sshl I meant supplied_sslh_fork
Also telling is that it appears to be stable after repair reinstalls an older version (R62-9901.35.1) via USB.
Why is the repair infrastructure failing considered a product bug?

The repair infrastructure is succeeding. It is repairing a DUT that is unable to come onto the network, by reinstalling an older stable image.
The best explanation is that there is something in the newer image, interacting with something about that particular DUT (or its network dongle?), that is causing it not to come onto the network after boot.
But why is provision failing for only one DUT? If it were a software bug, it would affect all wizpig DUTs, wouldn't it?

Repair is working, provision is failing.

Provision can only succeed if the product can be installed, boot, and listen for an SSH connection.

Many provision failures are infra, but these don't appear to be, given the mix of evidence in #31 and #32.

If our diagnosis is correct, this board is going to continue to cause CQ outages, despite the problematic DUT being locked.
Owner: dgarr...@chromium.org
This DUT has a failing or loose USB ethernet adapter. It's barely able to be interacted with over SSH (my session kept locking up):

[ 9627.819193] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9627.819213] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9627.819227] usb 2-2: Product: Linksys USB3GIGV1
[ 9627.819237] usb 2-2: Manufacturer: Linksys
[ 9627.819246] usb 2-2: SerialNumber: 000001000000
[ 9627.821289] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9627.824802] platform vpd: Driver vpd requests probe deferral
[ 9632.855428] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9632.866625] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9632.866692] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9757.581271] usb 2-2: USB disconnect, device number 109
[ 9757.581916] cdc_ether 2-2:2.0 eth1: unregister 'cdc_ether' usb-0000:00:14.0-2, CDC Ethernet Device
[ 9757.808892] usb 2-2: new SuperSpeed USB device number 110 using xhci_hcd
[ 9757.821347] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9757.821366] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9757.821378] usb 2-2: Product: Linksys USB3GIGV1
[ 9757.821387] usb 2-2: Manufacturer: Linksys
[ 9757.821395] usb 2-2: SerialNumber: 000001000000
[ 9757.824764] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9757.828574] platform vpd: Driver vpd requests probe deferral
[ 9762.859426] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9762.870506] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9762.870572] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9917.615808] usb 2-2: USB disconnect, device number 110
[ 9917.616039] cdc_ether 2-2:2.0 eth1: unregister 'cdc_ether' usb-0000:00:14.0-2, CDC Ethernet Device
[ 9917.854083] usb 2-2: new SuperSpeed USB device number 111 using xhci_hcd
[ 9917.865548] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9917.865566] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9917.865578] usb 2-2: Product: Linksys USB3GIGV1
[ 9917.865587] usb 2-2: Manufacturer: Linksys
[ 9917.865596] usb 2-2: SerialNumber: 000001000000
[ 9917.868491] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9917.870440] platform vpd: Driver vpd requests probe deferral
[ 9922.902049] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9922.910172] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9922.910240] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9937.630804] usb 2-2: USB disconnect, device number 111
[ 9937.631089] cdc_ether 2-2:2.0 eth1: unregister 'cdc_ether' usb-0000:00:14.0-2, CDC Ethernet Device
[ 9937.861848] usb 2-2: new SuperSpeed USB device number 112 using xhci_hcd
[ 9937.876353] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9937.876376] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9937.876391] usb 2-2: Product: Linksys USB3GIGV1
[ 9937.876402] usb 2-2: Manufacturer: Linksys
[ 9937.876413] usb 2-2: SerialNumber: 000001000000
[ 9937.878773] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9937.879039] platform vpd: Driver vpd requests probe deferral
[ 9937.879285] platform vpd: Driver vpd requests probe deferral

Perhaps a newer software release is less forgiving.

Please schedule this DUT for manual repair or replacement.
The initial install done during repair is via USB, so the large file transfers don't go over its network adapter.

It's currently locked, so won't cause issues.
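
(For anyone triaging similar DUTs: the dmesg excerpt above is the signature to look for, with the adapter disconnecting and re-enumerating every few minutes. A quick, offline way to count such flaps from a captured dmesg, illustrative only:)

    import re

    DISCONNECT = re.compile(r'\[\s*(\d+\.\d+)\] usb [\d.-]+: USB disconnect')

    def count_usb_flaps(dmesg_text, window_s=600):
        # Count consecutive 'USB disconnect' events that happen within
        # window_s of the previous one; a healthy adapter should show ~0.
        times = [float(m.group(1)) for m in DISCONNECT.finditer(dmesg_text)]
        return sum(1 for a, b in zip(times, times[1:]) if b - a <= window_s)
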
b/67754356
Owner: nxia@chromium.org
Transferring outstanding deputy bugs.
During the week of 2/25/18, 6 cases failed due to retry exception (function="ready_test()") on the following DUTs.
johndhong@, could you check the following DUTs?
chromeos6-row22-rack13-host9
chromeos6-row22-rack13-host7
chromeos6-row22-rack13-host1
Detailed log:
https://stainless.corp.google.com/search?view=list&first_date=2018-02-26&last_date=2018-03-02&suite=%5Ebluestreak%24&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&reason=retry+exception&exclude_cts=true&exclude_not_run=true&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true

Cc: chromeos-meetings@google.com
@mzhuo how is wizpig failing relevant to Bluestreak devices?
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS
Components: -Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
Looks like maybe this should be closed? If not, please say so.
It seems wizpig-paladin has failed 3+ times in a row due to provision failures.
https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=wizpig-paladin&buildBranch=master

Example log: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927363591518386240/+/steps/HWTest__provision_/0/stdout

As far as I can tell from the log, the status of all DUTs is "Repairing".
Is this something abnormal?
Can anyone look into the infra status?


---------------------------------------------------------------------
 12-12-2018 [00:56:38] Attempting to display pool info: cq
host: chromeos2-row8-rack6-host6, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'hw_video_acc_enc_h264', 'os:cros', 'hw_jpeg_acc_dec', 'power:battery', 'hw_video_acc_vp8', 'hw_video_acc_h264', 'hw_video_acc_vp9', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'audio_loopback_dongle', 'cts_cpu_x86', 'variant:wizpig', 'touchpad', 'phase:PVT', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697540 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos2-row8-rack8-host13, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'cts_cpu_x86', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697542 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos2-row8-rack8-host10, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'ec:cros', 'os:cros', 'power:battery', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'variant:wizpig', 'cts_cpu_x86', 'audio_loopback_dongle', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697538 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos2-row8-rack8-host17, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'audio_loopback_dongle', 'cts_cpu_x86', 'variant:wizpig', 'touchpad', 'phase:PVT', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697545 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos2-row8-rack8-host18, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'cts_cpu_x86', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697539 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos6-row2-rack19-host1, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'usb_nic:r8152', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'ec:cros', 'pool:cq', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'cts_cpu_x86', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697543 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos6-row2-rack19-host6, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'wizpig', 'usb_nic:r8152', 'touchpad', 'phase:PVT', 'bluetooth', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', 'arc', 'internal_display', 'hw_video_acc_enc_vp8', 'hw_jpeg_acc_enc', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'ec:cros', 'servo', 'variant:wizpig', 'webcam', 'cts_cpu_x86', 'audio_loopback_dongle', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_2', 'sparse_coverage_3', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697544 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos6-row2-rack19-host4, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'wizpig', 'internal_display', 'webcam', 'bluetooth', 'usb_nic:r8152', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'servo', 'variant:wizpig', 'hw_video_acc_enc_vp8', 'ec:cros', 'cts_cpu_x86', 'audio_loopback_dongle', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_2', 'sparse_coverage_3', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697541 Provision started on: 2018-12-12 00:44:38 status FAIL

host: chromeos6-row2-rack19-host8, status: Repairing, locked: False diagnosis: Unused
labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'wizpig', 'internal_display', 'usb_nic:r8152', 'model:wizpig', 'touchpad', 'phase:PVT', 'arc', 'hw_jpeg_acc_enc', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'ec:cros', 'hw_video_acc_enc_h264', 'hw_video_acc_enc_vp8', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'hw_video_acc_vp9', 'hw_video_acc_h264', 'webcam', 'cts_cpu_x86', 'sparse_coverage_2', 'sparse_coverage_5', 'sparse_coverage_3', 'pool:cq', 'sku:wizpig_intel_celeron_n3060_2Gb']
Last 10 jobs within 1:48:00:
3697546 Provision started on: 2018-12-12 00:44:38 status FAIL

Reason: Provisioning failed.

Status: Closed (was: Available)
Re: #38 - The failures you linked to were caused by https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1351133.  Use https://crbug.com/914717 to track.

I'm going to close this old bug.  I'm pretty sure whatever provisioning problems wizpig-paladin had in 2017 are gone, even if we have new problems cropping up now.
