wizpig-paladin failing often in hwtest (5/7 fails from 1093-1099)
Issue description

wizpig-paladin failed 5 of the last 7 builds (1099, 1097, 1096, 1095, 1093), each time with the message: "** HWTest did not complete due to infrastructure issues (code 3) **". Each time the suite timed out during bvt-inline:

Suite timed out. Started on 10-03-2017 [09:27:59], timed out on 10-03-2017 [11:00:01]
10-03-2017 [11:00:01] Start collecting test results and dump them to json.
Suite job [ FAILED ]
Suite job ABORT:
There appear to be 8 DUTs in pool:cq:
host: chromeos2-row8-rack8-host12, status: Ready, locked: False diagnosis: Working
host: chromeos2-row8-rack8-host17, status: Ready, locked: False diagnosis: Working
host: chromeos2-row8-rack8-host18, status: Ready, locked: False diagnosis: Working
host: chromeos6-row2-rack20-host4, status: Repairing, locked: False diagnosis: Unused
host: chromeos6-row2-rack20-host20, status: Repairing, locked: False diagnosis: Failed repair
host: chromeos6-row2-rack20-host22, status: Ready, locked: False diagnosis: Working
host: chromeos6-row2-rack19-host1, status: Ready, locked: False diagnosis: Working
host: chromeos6-row2-rack19-host3, status: Ready, locked: False diagnosis: Working

However, looking at the test reports, 3 of them never seem to run a test successfully (or just a single test in -host22's case):

10-03-2017 [11:02:51] Attempting to display pool info: cq
...
host: chromeos6-row2-rack20-host4, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'audio_loopback_dongle', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', '4k_video', 'model:wizpig']
Last 10 jobs within 1:48:00: 735094 Provision started on: 2017-10-03 09:30:16 status FAIL
host: chromeos6-row2-rack20-host20, status: Repairing, locked: False diagnosis: Failed repair labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'audio_loopback_dongle', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', '4k_video', 'model:wizpig']
Last 10 jobs within 1:48:00: 735164 Verify started on: 2017-10-03 09:39:48 status FAIL
host: chromeos6-row2-rack20-host22, status: Ready, locked: False diagnosis: Working labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'arc', 'os:cros', 'power:battery', 'ec:cros', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'audio_loopback_dongle', 'usb_nic:r8152', 'sku:wizpig_intel_celeron_n3060_2Gb', 'variant:wizpig', 'touchpad', 'phase:PVT', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'cros-version:wizpig-paladin/R63-9998.0.0-rc2', '4k_video']
Last 10 jobs within 1:48:00: 147013555 wizpig-paladin/R63-9998.0.0-rc2/suite_attr_wrapper/cheets_StartAndroid.stress.0 started on: 2017-10-03 10:41:25 status Completed 735091 Provision started on: 2017-10-03 09:30:16 status PASS
,
Oct 3 2017
There are a number of devices that seem to be cycling through provision/repair repeatedly. Sheriffs should also be trying to identify whether a bad CL is causing all of these issues. Since the problems seem localized to the chromeos6-row2-rack20 DUTs (backed up by dut-status), I'm guessing that something specific to that network is going on. This might be related to issue 770806.
,
Oct 3 2017
dut-status -b wizpig -p cq
hostname S last checked URL
chromeos2-row8-rack8-host12 OK 2017-10-03 13:06:24 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host12/736330-reset/
chromeos2-row8-rack8-host17 OK 2017-10-03 13:07:12 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host17/736333-reset/
chromeos2-row8-rack8-host18 OK 2017-10-03 13:06:33 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host18/736331-reset/
chromeos6-row2-rack20-host4 OK 2017-10-03 05:07:55 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/733637-repair/
chromeos6-row2-rack20-host20 NO 2017-10-03 10:34:07 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/735562-repair/
chromeos6-row2-rack20-host22 OK 2017-10-03 09:30:16 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/735091-provision/
chromeos6-row2-rack19-host1 OK 2017-10-03 12:24:00 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host1/736113-provision/
chromeos6-row2-rack19-host3 OK 2017-10-03 12:24:00 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host3/736111-provision/

Host chromeos6-row2-rack20-host20 is continuously failing Verify/Repair:

chromeos6-row2-rack20-host20
2017-10-03 13:10:35 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/736354-verify/
2017-10-03 10:34:07 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/735562-repair/
2017-10-03 09:39:48 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/735164-verify/
2017-10-03 07:25:40 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/734462-repair/
2017-10-03 06:09:01 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/733899-verify/
2017-10-03 03:20:02 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/733122-repair/
2017-10-03 02:08:16 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/732893-verify/
2017-10-02 23:20:26 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/732247-repair/
2017-10-02 21:48:46 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/731795-provision/
2017-10-02 17:16:26 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/730468-repair/
2017-10-02 15:46:47 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host20/729797-provision/

Host chromeos6-row2-rack20-host20 also started failing Provision (the '--' entries?) (but used to work, before 2017-10-03 00:46:30?).
chromeos6-row2-rack20-host4
2017-10-03 10:59:57 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/735672-repair/
2017-10-03 09:30:16 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/735094-provision/
2017-10-03 05:07:55 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/733637-repair/
2017-10-03 04:21:28 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/733469-provision/
2017-10-03 02:18:18 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/732929-repair/
2017-10-03 00:46:30 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/732505-provision/
2017-10-02 23:00:58 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146542520-chromeos-test/
2017-10-02 21:48:46 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/731794-provision/
2017-10-02 20:13:48 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146523567-chromeos-test/
2017-10-02 18:48:15 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/730861-provision/
2017-10-02 17:03:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146496980-chromeos-test/
2017-10-02 15:46:47 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/729799-provision/

DUT chromeos6-row2-rack20-host22 seems to be behaving better than the others; it's just that it has its tests aborted occasionally (I guess when the suite times out).

chromeos6-row2-rack20-host22
2017-10-03 13:47:53 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146701813-chromeos-test/
2017-10-03 13:47:31 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/736518-reset/
2017-10-03 13:45:21 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146701412-chromeos-test/
2017-10-03 13:44:52 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/736502-reset/
2017-10-03 13:38:44 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146693167-chromeos-test/
2017-10-03 12:24:00 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/736114-provision/
2017-10-03 10:41:25 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146648690-chromeos-test/
2017-10-03 09:30:16 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/735091-provision/
2017-10-03 07:26:51 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146620539-chromeos-test/
2017-10-03 06:26:52 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/734057-provision/
2017-10-03 05:02:35 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146602164-chromeos-test/
2017-10-03 03:37:28 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/733190-provision/
2017-10-03 02:18:21 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/732930-cleanup/
2017-10-03 02:17:21 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146583706-chromeos-test/
2017-10-03 00:46:30 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/732509-provision/
2017-10-02 22:59:20 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146542510-chromeos-test/
2017-10-02 21:48:46 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/731792-provision/
2017-10-02 19:54:59 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146523566-chromeos-test/
2017-10-02 18:48:12 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/730860-provision/
2017-10-02 17:16:42 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/730469-cleanup/
2017-10-02 17:15:35 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/146496975-chromeos-test/
2017-10-02 15:46:47 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/729794-provision/
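(For reference, this kind of pool health check can be scripted. A minimal sketch, assuming dut-status is on PATH and prints one "hostname STATUS date time URL" row per DUT as in the listings above; the helper name is hypothetical:)

import subprocess

def unhealthy_duts(board, pool):
    """Run dut-status for a pool and return the hostnames whose status column is NO."""
    out = subprocess.check_output(['dut-status', '-b', board, '-p', pool]).decode()
    bad = []
    for line in out.splitlines():
        parts = line.split()
        # Expect rows like: <hostname> <OK|NO|--> <date> <time> <URL>; the header row
        # ("hostname S last checked URL") never matches 'NO'.
        if len(parts) >= 2 and parts[1] == 'NO':
            bad.append(parts[0])
    return bad

if __name__ == '__main__':
    print(unhealthy_duts('wizpig', 'cq'))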
,
Oct 4 2017
Checking 736638-repair failure for chromeos6-row2-rack20-host20, I see a lot of messages about servo failing:
10/03 14:20:07.081 INFO | repair:0327| Verifying this condition: servod service is taking calls
10/03 14:20:07.081 DEBUG| abstract_ssh:0892| Full tunnel command: /usr/bin/ssh -a -x -n -N -q -L 43598:localhost:9991 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack20-labstation2
10/03 14:20:07.125 DEBUG| abstract_ssh:0900| Started ssh tunnel, local = 43598 remote = 9991, pid = 7512
10/03 14:20:07.126 INFO |rpc_server_tracker:0196| Waiting 60 seconds for XMLRPC server to start.
10/03 14:20:07.127 WARNI| retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
10/03 14:20:07.128 WARNI| retry:0183| Retrying in 1.450435 seconds...
10/03 14:20:08.586 INFO |rpc_server_tracker:0208| XMLRPC server started successfully.
10/03 14:21:08.587 ERROR| repair:0332| Failed: servod service is taking calls
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/servo_repair.py", line 211, in verify
host.connect_servo()
File "/usr/local/autotest/server/hosts/servo_host.py", line 136, in connect_servo
'Servo initialize timed out.')
AutoservVerifyError: Servo initialize timed out.
----------------------
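(For context, the "servod service is taking calls" verifier above is essentially a poll-until-deadline against servod's XMLRPC endpoint through the ssh tunnel to the labstation. A minimal sketch of that pattern, with a hypothetical port and probe RPC; this is not the actual autotest rpc_server_tracker code:)

import socket
import time
import xmlrpc.client  # xmlrpclib on the Python 2 autotest servers of this era

def wait_for_servod(local_port, timeout_s=60, poll_s=1.5):
    """Poll the ssh-tunnelled servod XMLRPC endpoint until it answers or the deadline hits."""
    proxy = xmlrpc.client.ServerProxy('http://localhost:%d' % local_port)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # Hypothetical readiness probe; any cheap RPC the server answers would do.
            proxy.system.listMethods()
            return True
        except (OSError, socket.error, xmlrpc.client.Error):
            time.sleep(poll_s)  # the log above shows ~1.4-1.5 s between retries
    raise RuntimeError('servod did not start answering within %ds' % timeout_s)

In the first excerpt the connection is eventually accepted but servo initialization still times out; in the next excerpt the endpoint never answers at all within the 60-second budget.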
10/03 14:21:32.928 DEBUG| abstract_ssh:0892| Full tunnel command: /usr/bin/ssh -a -x -n -N -q -L 38339:localhost:9991 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack20-labstation2
10/03 14:21:32.988 DEBUG| abstract_ssh:0900| Started ssh tunnel, local = 38339 remote = 9991, pid = 22566
10/03 14:21:32.989 INFO |rpc_server_tracker:0196| Waiting 60 seconds for XMLRPC server to start.
10/03 14:21:32.990 WARNI| retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
10/03 14:21:32.991 WARNI| retry:0183| Retrying in 1.405245 seconds...
10/03 14:22:33.403 ERROR|rpc_server_tracker:0201| Failed to start XMLRPC server.
10/03 14:22:33.404 DEBUG| abstract_ssh:0915| Terminated tunnel, pid 22566
10/03 14:22:33.404 ERROR| repair:0332| Failed: servod service is taking calls
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/servo_repair.py", line 211, in verify
host.connect_servo()
File "/usr/local/autotest/server/hosts/servo_host.py", line 130, in connect_servo
servo_obj = servo.Servo(servo_host=self, servo_serial=self.servo_serial)
File "/usr/local/autotest/server/cros/servo/servo.py", line 176, in __init__
self._server = servo_host.get_servod_server_proxy()
File "/usr/local/autotest/server/hosts/servo_host.py", line 183, in get_servod_server_proxy
timeout_seconds=60)
File "/usr/local/autotest/server/hosts/rpc_server_tracker.py", line 197, in xmlrpc_connect
ready_test()
File "/usr/local/autotest/client/common_lib/cros/retry.py", line 246, in func_retry
raise error.TimeoutException(exception_message)
TimeoutException: retry exception (function="ready_test()"), timeout = 60s
10/03 14:22:33.405 INFO | server_job:0214| FAIL ---- verify.servod timestamp=1507065753 localtime=Oct 03 14:22:33 retry exception (function="ready_test()"), timeout = 60s
-----------------------------
10/03 14:22:34.655 INFO | dev_server:0707| Getting devservers for host: chromeos6-row2-rack20-labstation2
10/03 14:22:34.694 DEBUG| dev_server:0728| The host chromeos6-row2-rack20-labstation2 (100.115.172.156) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.128.0:18.
10/03 14:22:34.695 DEBUG| dev_server:0679| Pick one healthy devserver from ['http://100.115.185.226:8082', 'http://100.115.185.227:8082', 'http://100.115.185.228:8082']
-----------------------------
10/03 14:22:44.334 ERROR| utils:0280| [stderr] [1003/142243:INFO:update_engine_client.cc(493)] Querying Update Engine status...
10/03 14:22:44.340 DEBUG| utils:0280| [stdout] CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
10/03 14:22:44.507 INFO | servo_host:0458| servo host has the following duts: ['chromeos6-row2-rack20-host4', 'chromeos6-row2-rack20-host2', 'chromeos6-row2-rack20-host10', 'chromeos6-row2-rack20-host8', 'chromeos6-row2-rack20-host14', 'chromeos6-row2-rack20-host12', 'chromeos6-row2-rack20-host6', 'chromeos6-row2-rack20-host16', 'chromeos6-row2-rack20-host20', 'chromeos6-row2-rack20-host22', 'chromeos6-row2-rack20-host18']
10/03 14:22:44.508 INFO | servo_host:0460| servo host has multiple duts, scheduling synchronized reboot
10/03 14:22:47.928 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/appengine/cq_stats/build_annotations for control files.
10/03 14:22:48.012 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/venv/chromite/.git/objects/d8 for control files.
// It then spends almost 2 hours(!) spewing 27,595(!) more lines like "Unable to search directory ... for control files."...
10/03 16:31:40.080 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_141187676_1505126915_24130 for control files.
10/03 16:31:40.080 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_141027216_1505079083_13467 for control files.
10/03 16:31:40.080 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146418336_1506974725_13109 for control files.
10/03 16:31:40.081 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146476177_1506979022_4331 for control files.
10/03 16:31:40.081 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146637435_1507047945_15206 for control files.
10/03 16:31:40.081 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/containers/test_146396008_1506978668_9413 for control files.
10/03 16:31:40.711 ERROR| repair:0332| Failed: servo host software is up-to-date
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/servo_repair.py", line 28, in verify
host.update_image(wait_for_update=False)
File "/usr/local/autotest/server/hosts/servo_host.py", line 550, in update_image
status, current_build_number = self._check_for_reboot(updater)
File "/usr/local/autotest/server/hosts/servo_host.py", line 462, in _check_for_reboot
self.schedule_synchronized_reboot(dut_list, afe)
File "/usr/local/autotest/server/hosts/servo_host.py", line 421, in schedule_synchronized_reboot
control_file = getter.get_control_file_contents_by_name(test)
File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 152, in get_control_file_contents_by_name
path = self.get_control_file_path(test_name)
File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 136, in get_control_file_path
raise error.ControlFileNotFound(test_name + ' is not unique.')
ControlFileNotFound: servohost_Reboot is not unique.
10/03 16:31:40.712 INFO | server_job:0214| FAIL ---- verify.update timestamp=1507073500 localtime=Oct 03 16:31:40 servohost_Reboot is not unique.
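(The verify.update failure above comes from the control-file getter, which walks directories under the autotest root looking for a control file matching the test name and refuses to proceed unless exactly one match is found. A minimal sketch of that kind of lookup, using a hypothetical file-layout convention rather than the real control_file_getter logic; the two hours of "Unable to search directory" spam is this walk descending into the repeated chromite/venv/... paths, which issue 771823 tracks:)

import os

def find_control_file(root, test_name):
    """Locate exactly one control file for test_name under root (hypothetical layout)."""
    matches = []
    # os.walk does not follow symlinked directories by default; following them would be
    # one way to end up in the chromite/venv/chromite/venv/... recursion seen above,
    # assuming those nested paths are symlink loops.
    for dirpath, dirnames, filenames in os.walk(root):
        for fname in filenames:
            is_match = (fname == 'control' and os.path.basename(dirpath) == test_name)
            is_match = is_match or fname == 'control.' + test_name
            if is_match:
                matches.append(os.path.join(dirpath, fname))
    if not matches:
        raise LookupError(test_name + ' not found')
    if len(matches) > 1:
        raise LookupError(test_name + ' is not unique.')
    return matches[0]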
...
10/03 16:41:27.280 ERROR| repair:0332| Failed: host is available via ssh
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/repair.py", line 37, in verify
raise hosts.AutoservVerifyError(msg % host.hostname)
AutoservVerifyError: No answer to ping from chromeos6-row2-rack20-host20
.....
10/03 16:41:27.280 ERROR| repair:0332| Failed: host is available via ssh
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
self.verify(host)
File "/usr/local/autotest/server/hosts/repair.py", line 37, in verify
raise hosts.AutoservVerifyError(msg % host.hostname)
AutoservVerifyError: No answer to ping from chromeos6-row2-rack20-host20
...
10/03 16:44:25.895 INFO | ssh_multiplex:0107| Timed out waiting for master-ssh connection to be established.
10/03 16:45:29.323 ERROR| utils:0280| [stderr] ssh: connect to host chromeos6-row2-rack20-host20 port 22: Connection timed out
10/03 16:45:30.327 DEBUG| abstract_ssh:0632| Host chromeos6-row2-rack20-host20 is still down after waiting 227 seconds
10/03 16:45:30.327 ERROR| repair:0449| Repair failed: Power cycle the host with RPM
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
self.repair(host)
File "/usr/local/autotest/server/hosts/repair.py", line 96, in repair
host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 is still offline after powercycling
10/03 16:45:30.328 INFO | server_job:0214| FAIL ---- repair.rpm timestamp=1507074330 localtime=Oct 03 16:45:30 chromeos6-row2-rack20-host20 is still offline after powercycling
10/03 16:45:30.329 INFO | server_job:0214| END FAIL ---- repair.rpm timestamp=1507074330 localtime=Oct 03 16:45:30
...
10/03 16:45:30.329 INFO | repair:0105| Attempting this repair action: Reset the DUT via keyboard sysrq-x
10/03 16:45:30.329 DEBUG| repair:0106| Repairing because these triggers failed:
10/03 16:45:30.329 DEBUG| repair:0108| host is available via ssh
10/03 16:45:30.330 INFO | server_job:0214| START ---- repair.sysrq timestamp=1507074330 localtime=Oct 03 16:45:30
10/03 16:45:30.330 ERROR| repair:0449| Repair failed: Reset the DUT via keyboard sysrq-x
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
self.repair(host)
File "/usr/local/autotest/server/hosts/cros_repair.py", line 394, in repair
'%s has no servo support.' % host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 has no servo support.
10/03 16:45:30.331 INFO | server_job:0214| FAIL ---- repair.sysrq timestamp=1507074330 localtime=Oct 03 16:45:30 chromeos6-row2-rack20-host20 has no servo support.
...
10/03 16:45:30.332 INFO | server_job:0214| START ---- repair.servoreset timestamp=1507074330 localtime=Oct 03 16:45:30
10/03 16:45:30.332 ERROR| repair:0449| Repair failed: Reset the DUT via servo
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
self.repair(host)
File "/usr/local/autotest/server/hosts/cros_repair.py", line 419, in repair
'%s has no servo support.' % host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 has no servo support.
10/03 16:45:30.333 INFO | server_job:0214| FAIL ---- repair.servoreset timestamp=1507074330 localtime=Oct 03 16:45:30 chromeos6-row2-rack20-host20 has no servo support.
...
10/03 16:45:30.341 INFO | server_job:0214| START ---- repair.usb timestamp=1507074330 localtime=Oct 03 16:45:30
10/03 16:45:30.341 ERROR| repair:0449| Repair failed: Reinstall from USB using servo
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
self.repair(host)
File "/usr/local/autotest/server/hosts/cros_repair.py", line 491, in repair
'%s has no servo support.' % host.hostname)
AutoservRepairError: chromeos6-row2-rack20-host20 has no servo support.
...
10/03 16:45:30.345 ERROR| repair:0043| Repair failed due to Exception.
Traceback (most recent call last):
File "/usr/local/autotest/server/control_segments/repair", line 38, in repair
target.repair()
File "/usr/local/autotest/server/hosts/cros_host.py", line 1216, in repair
self._repair_strategy.repair(self)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 703, in repair
self._verify_root._verify_host(host, silent)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 326, in _verify_host
self._verify_dependencies(host, silent)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 199, in _verify_dependencies
self._verify_list(host, self._dependency_list, silent)
File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 188, in _verify_list
raise AutoservVerifyDependencyError(self, failures)
AutoservVerifyDependencyError: No answer to ping from chromeos6-row2-rack20-host20
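(The sequence above is the standard repair strategy working through its action chain: each repair action — RPM power cycle, sysrq-x reset, servo reset, USB reinstall — is attempted only because its trigger verifier, "host is available via ssh", failed, and the host is left as Repair Failed once the chain is exhausted. A minimal sketch of the pattern, with hypothetical verifier and action names, not the autotest repair.py implementation:)

class RepairFailed(Exception):
    """Raised when no repair action managed to bring the host back."""

def repair_host(host, verifiers, actions):
    """Run repair actions in order until every verifier passes, else give up.

    verifiers: callables host -> bool (e.g. ssh_reachable).
    actions:   ordered (name, callable) pairs, e.g. [('rpm', rpm_power_cycle),
               ('sysrq', sysrq_reset), ('servoreset', servo_reset), ('usb', usb_reinstall)].
    """
    def all_verified():
        return all(check(host) for check in verifiers)

    if all_verified():
        return
    for name, action in actions:
        try:
            action(host)
        except Exception as err:  # an action can itself bail out, e.g. "no servo support"
            print('repair.%s failed: %s' % (name, err))
            continue
        if all_verified():
            print('repair.%s fixed %s' % (name, host))
            return
    raise RepairFailed('%s is still broken after all repair actions' % host)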
,
Oct 4 2017
,
Oct 5 2017
Was chromeos6-row2-rack20-host20 replaced by chromeos2-row8-rack8-host10 in pool:cq?

dut-status -b wizpig -p cq
hostname S last checked URL
chromeos2-row8-rack8-host12 OK 2017-10-04 14:42:04 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host12/743781-reset/
chromeos2-row8-rack8-host10 OK 2017-10-04 14:48:35 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/743806-repair/
chromeos2-row8-rack8-host17 OK 2017-10-04 14:43:22 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host17/743789-reset/
chromeos2-row8-rack8-host18 OK 2017-10-04 14:51:13 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host18/743819-reset/
chromeos6-row2-rack20-host4 OK 2017-10-04 14:08:18 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host4/743598-provision/
chromeos6-row2-rack20-host22 OK 2017-10-04 14:08:18 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack20-host22/743593-provision/
chromeos6-row2-rack19-host1 OK 2017-10-04 14:42:01 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host1/743779-reset/
chromeos6-row2-rack19-host3 OK 2017-10-04 14:41:43 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack19-host3/743776-reset/
,
Oct 5 2017
We rebalance the pools on a regular basis, so probably. Rebalancing moves bad/locked DUTs into the suites pool (which is our holding ground), and good DUTs from suites into the pool being balanced.
,
Oct 5 2017
So... what will stop the next rebalance from moving host20 back into pool:cq?
,
Oct 5 2017
Not sure if it is a root cause or a symptom, but I spawned issue 771823 about the 2 hours of "Unable to search directory ... for control files." spewed during verify.update.
,
Oct 5 2017
> So... what will stop the next rebalance from moving host20 back into pool:cq?

In balancing, the direction of movement is:
"bad devices" -> suites
"good devices" -> pool being balanced
If a device isn't working, it won't go back into pool:cq. Also, nothing changes unless at least one device in pool:cq (or pool:bvt) is broken.
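(A rough sketch of that balancing rule, with hypothetical data structures; this is not the actual pool-balancing tool:)

def rebalance(cq_pool, suites_pool, is_working, target_size):
    """Swap broken CQ DUTs out to the suites pool and pull working spares back in."""
    broken = [dut for dut in cq_pool if not is_working(dut)]
    if not broken:
        return cq_pool, suites_pool  # nothing changes unless a pool:cq DUT is broken
    spares = [dut for dut in suites_pool if is_working(dut)]
    for dut in broken:               # "bad devices" -> suites (the holding ground)
        cq_pool.remove(dut)
        suites_pool.append(dut)
    while len(cq_pool) < target_size and spares:
        dut = spares.pop()           # "good devices" -> pool being balanced
        suites_pool.remove(dut)
        cq_pool.append(dut)
    return cq_pool, suites_pool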
,
Oct 5 2017
Something is really funny with wizpig-paladin; it passed 3/4 runs (1106-1109), but has now been purple for 7 straight runs (1110-1116). Each time with a message like the following in HWTest bvt-inline:
https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/1116/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio

@@@BUILD_STEP@HWTest [bvt-inline]@@@
************************************************************
@@@STEP_LINK@stdout-->stdio@https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fwizpig-paladin%2F1116%2F%2B%2Frecipes%2Fsteps%2FHWTest__bvt-inline_%2F0%2Fstdout@@@
** Start Stage HWTest [bvt-inline] - Thu, 05 Oct 2017 09:06:39 -0700 (PDT) **
** Stage that runs tests in the Autotest lab.
************************************************************
09:06:39: INFO: Created cidb engine bot@130.211.191.11 for pid 8295
09:06:39: INFO: Running cidb query on pid 8295, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f91be6e0610>
09:06:40: INFO: Waiting up to forever for payloads and test artifacts ...
Preconditions for the stage successfully met. Beginning to execute stage...
09:12:52: INFO: Running cidb query on pid 8295, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f91be4f2990>
09:12:52: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpM6r1aL/tmplspkpx/temp_summary.json --raw-cmd --task-name wizpig-paladin/R63-10004.0.0-rc3-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:wizpig-paladin/R63-10004.0.0-rc3' '--tags=task_name:wizpig-paladin/R63-10004.0.0-rc3-bvt-inline' '--tags=board:wizpig' -- /usr/local/autotest/site_utils/run_suite.py --build wizpig-paladin/R63-10004.0.0-rc3 --board wizpig --suite_name bvt-inline --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 58200904L, 'cidb_build_id': 1916954, 'datastore_parent_key': ('Build', 1916954, 'BuildStage', 58200904L)}" -c
Autotest instance created: cautotest-prod
10-05-2017 [09:12:55] Submitted create_suite_job rpc
10-05-2017 [09:13:07] Created suite job: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147078382
@@@STEP_LINK@Link to suite@http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147078382@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
09:13:09: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpM6r1aL/tmp3K7K1I/temp_summary.json --raw-cmd --task-name wizpig-paladin/R63-10004.0.0-rc3-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:wizpig-paladin/R63-10004.0.0-rc3' '--tags=task_name:wizpig-paladin/R63-10004.0.0-rc3-bvt-inline' '--tags=board:wizpig' -- /usr/local/autotest/site_utils/run_suite.py --build wizpig-paladin/R63-10004.0.0-rc3 --board wizpig --suite_name bvt-inline --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 58200904L, 'cidb_build_id': 1916954, 'datastore_parent_key': ('Build', 1916954, 'BuildStage', 58200904L)}" -m 147078382
09:18:09: INFO: Refreshing due to a 401 (attempt 1/2)
09:18:09: INFO: Refreshing access_token
09:21:08: INFO: Refreshing due to a 401 (attempt 1/2)
09:21:08: INFO: Refreshing access_token
09:23:14: INFO: RetriableHttp: attempt 1 receiving status 504, will retry
10:14:45: INFO: Refreshing due to a 401 (attempt 1/2)
10:14:45: INFO: Refreshing access_token
10:18:13: INFO: Refreshing due to a 401 (attempt 1/2)
10:18:13: INFO: Refreshing access_token
10:21:20: INFO: Refreshing due to a 401 (attempt 1/2)
10:21:20: INFO: Refreshing access_token
--------------------------------------------------------------------------------
started: Thu Oct 5 09:17:33 2017
ended: Thu Oct 5 10:39:26 2017
duration: 1 hrs, 21 mins, 53 secs
status: EXCEPTION
status reason: step was interrupted.
,
Oct 5 2017
Re: #12, I think that's just Nxia@'s logic to self-destruct the CQ once we know it's bad. The "Refreshing due to a 401" is just logspam from ts_mon which we should remove (I opened crbug.com/772157 to track that).
,
Oct 6 2017
The last purple build was 1123, with a log a lot like #12:
https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/1123
This time I noticed there is a link there to the suite [0], and the logs [1] show that the suite actually ran for quite a while - in fact 63 tests ran successfully (END GOOD), but the suite was eventually aborted during:
[0] http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147252369
[1] https://storage.cloud.google.com/chromeos-autotest-results/147252369-chromeos-test/hostless/debug/autoserv.DEBUG?_ga=2.191751148.-734044362.1501703718

10/06 05:01:41.505 INFO | server_job:0214| INFO ---- Start bvt-inline timestamp=1507291301 localtime=Oct 06 05:01:41
...
10/06 05:02:14.805 DEBUG| suite:1172| Scheduled 58 tests, writing the total to keyval.
...
10/06 06:35:52.209 INFO | server_job:0214| START ---- wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken timestamp=1507296952 localtime=Oct 06 06:35:52
10/06 06:35:52.210 INFO | server_job:0214| ABORT ---- wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken timestamp=1507296952 localtime=Oct 06 06:35:52
10/06 06:35:52.211 INFO | server_job:0214| END ABORT ---- wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken timestamp=1507296952 localtime=Oct 06 06:35:52

So the ABORT looks like it happens because the suite ran for 90 minutes. Checking back a bit, we can see a spot where the run stalled for about 37 of those minutes:

10/06 05:55:49.851 DEBUG| suite:1365| Adding job keyval for platform_DMVerityBitCorruption.middle=147260586-chromeos-test
10/06 05:55:49.852 INFO | server_job:0214| START 147260586-chromeos-test/chromeos2-row8-rack8-host17/platform_DMVerityBitCorruption.last platform_DMVerityBitCorruption.last timestamp=1507294460 localtime=Oct 06 05:54:20
10/06 05:55:49.852 INFO | server_job:0214| GOOD 147260586-chromeos-test/chromeos2-row8-rack8-host17/platform_DMVerityBitCorruption.last platform_DMVerityBitCorruption.last timestamp=1507294482 localtime=Oct 06 05:54:42 completed successfully
10/06 05:55:49.853 INFO | server_job:0214| END GOOD 147260586-chromeos-test/chromeos2-row8-rack8-host17/platform_DMVerityBitCorruption.last platform_DMVerityBitCorruption.last timestamp=1507294482 localtime=Oct 06 05:54:42
10/06 05:55:49.853 DEBUG| suite:1365| Adding job keyval for platform_DMVerityBitCorruption.last=147260586-chromeos-test

<----- right here is 37 minutes of silence, followed by a suspicious "MySQL server has gone away" message ----->

10/06 06:32:58.339 ERROR| db:0023| 06:32:58 10/06/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
10/06 06:33:39.681 INFO | server_job:0214| START 147252419-chromeos-test/chromeos6-row2-rack20-host4/graphics_GLMark2 graphics_GLMark2 timestamp=1507296636 localtime=Oct 06 06:30:36
10/06 06:33:39.682 INFO | server_job:0214| GOOD 147252419-chromeos-test/chromeos6-row2-rack20-host4/graphics_GLMark2 graphics_GLMark2 timestamp=1507296656 localtime=Oct 06 06:30:56 completed successfully
10/06 06:33:39.683 INFO | server_job:0214| END GOOD 147252419-chromeos-test/chromeos6-row2-rack20-host4/graphics_GLMark2 graphics_GLMark2 timestamp=1507296656 localtime=Oct 06 06:30:56
10/06 06:33:39.683 DEBUG| suite:1365| Adding job keyval for graphics_GLMark2=147252419-chromeos-test
10/06 06:35:52.209 INFO | server_job:0214| START ---- wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken timestamp=1507296952 localtime=Oct 06 06:35:52
10/06 06:35:52.210 INFO | server_job:0214| ABORT ---- wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken timestamp=1507296952 localtime=Oct 06 06:35:52
10/06 06:35:52.211 INFO | server_job:0214| END ABORT ---- wizpig-paladin/R63-10007.0.0-rc1/bvt-inline/login_OwnershipRetaken timestamp=1507296952 localtime=Oct 06 06:35:52
10/06 06:36:30.840 DEBUG| dynamic_suite:0623| Finished waiting on suite. Returning from _perform_reimage_and_run.
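(For context, the ABORT of login_OwnershipRetaken is just the suite hitting its --timeout_mins 90 budget: once the deadline passes, any child job that has not finished is aborted rather than waited on. A minimal sketch of that bookkeeping, with hypothetical job objects rather than the dynamic_suite wait loop:)

import time

def wait_for_suite(jobs, timeout_mins=90, poll_s=30):
    """Collect child-job results; abort whatever is still pending when the budget runs out."""
    deadline = time.time() + timeout_mins * 60
    results = {}
    while jobs and time.time() < deadline:
        for job in list(jobs):
            if job.finished():                     # hypothetical job object API
                results[job.name] = job.status()   # e.g. GOOD / FAIL
                jobs.remove(job)
        time.sleep(poll_s)
    for job in jobs:                               # deadline hit: remaining jobs become ABORT
        job.abort()
        results[job.name] = 'ABORT'
    return results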
,
Oct 6 2017
From chat, akeshet stated that if a builder is marked EXPERIMENTAL and it takes longer than the other slaves, then the master won't wait for it and it will get aborted. Thus, the only way to know whether wizpig is fixed well enough to pass is to mark it important again (builds 1126 & 1127). Closing this bug.
,
Oct 11 2017
This is still broken with the same symptoms. The last 3 CQ runs failed on wizpig-paladin HWTest:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16549
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16550
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16551
Over the weekend and the past few days there have been many more such failures:
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/
,
Oct 11 2017
I'll follow up on this later today if phobbs doesn't get to it.
,
Oct 11 2017
This builder should be marked as experimental if it's failing this often. Looks like it got un-marked as such by djkurtz. https://chromiumos-status.appspot.com/
,
Oct 11 2017
> This builder should be marked as experimental if it's failing this often. Looks like it got un-marked as such by djkurtz.

Wait. Before we do that, we have to prove that the recent failures aren't a bad CL. Otherwise, we'll let bad code in, and then _lots_ of stuff will go bad in the lab.
,
Oct 11 2017
,
Oct 11 2017
I did a spot-check of recent repairs in the wizpig CQ, and compared them to recent repairs in the BVT. In both cases, there were more repairs than I expected. In both cases, there's a mixture of "the DUT was offline" and "the DUT failed the update but stayed online". The frequency of "offline" is higher in the CQ, and there were reports of kernel crashes there. So, there could have been bad CLs in the CQ. Someone should look at the blamelists.
,
Oct 11 2017
,
Oct 13 2017
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1178 -> infra DNS flake
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1171 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1170 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1169 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1148 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1143 -> infra network flake: Connection timed out
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1136 -> infra network flake: Connection timed out

I don't see any evidence that a CL is to blame.
,
Oct 13 2017
"Connection timed out" CAN be because of bugs in the product. It's normally generated when we are unable to reconnect to a DUT after rebooting it. It can also come from all kinds of other causes: bad DUT hardware, overloaded devserver network interfaces, DNS errors, etc.
,
Oct 13 2017
phobbs@ is OOO
Spot-checking these, it looks like most (all?) of these are failures of the same DUT, which seems to be stuck in a provision fail-loop.
$ dut-status -f chromeos2-row8-rack8-host10
chromeos2-row8-rack8-host10
2017-10-12 15:17:06 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/800032-repair/
2017-10-12 14:55:30 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799852-provision/
2017-10-12 14:34:49 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799713-repair/
2017-10-12 14:29:33 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799680-verify/
2017-10-12 13:26:37 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799462-repair/
2017-10-12 12:55:33 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799183-provision/
2017-10-12 12:53:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799141-repair/
2017-10-12 12:39:47 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/799041-provision/
2017-10-12 09:54:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/798137-repair/
2017-10-12 09:38:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/798039-provision/
2017-10-12 07:23:09 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/797458-repair/
2017-10-12 06:48:39 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/797272-provision/
2017-10-12 04:36:22 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/796728-repair/
2017-10-12 04:00:03 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/796426-provision/
2017-10-12 01:28:44 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/795684-repair/
2017-10-12 01:11:58 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/795613-provision/
2017-10-11 22:27:18 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/794861-repair/
2017-10-11 22:05:15 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/794688-provision/
2017-10-11 20:13:51 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/794080-repair/
2017-10-11 19:40:02 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/793785-provision/
2017-10-11 17:58:11 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack8-host10/793151-repair/
,
Oct 13 2017
Locking DUT for investigation.
,
Oct 13 2017
,
Oct 13 2017
I see some sshl crashdumps in at least one of those repair jobs. This might explain the DUT being unreachable after boot. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row8-rack8-host10/799713-repair/20171210143448/crashinfo.chromeos2-row8-rack8-host10/
,
Oct 13 2017
This looks like the same issue as Issue 766211 which was also afflicting wizpig. I suggest this is a product bug, returning to the sheriff.
,
Oct 13 2017
by sshl I meant supplied_sslh_fork
,
Oct 13 2017
Also telling is that it appears to be stable after a USB reinstall to an older version (R62-9901.35.1) after repair.
,
Oct 13 2017
Why is the repair infrastructure failing considered a product bug?
,
Oct 13 2017
The repair infrastructure is succeeding. It is repairing a DUT that is unable to come onto the network, by reinstalling an older stable image.
,
Oct 13 2017
The best explanation is that there is something in the newer image, interacting with something about that particular DUT (or its network dongle?) that is causing it to not come onto the network after boot.
,
Oct 13 2017
But why is provisioning failing for only one DUT? If it were a software bug, it would affect all wizpig DUTs, wouldn't it?
,
Oct 13 2017
Repair is working; provision is failing. Provision can only succeed if the product can be installed, boot, and listen for an SSH connection. Many provision failures are infra, but these don't appear to be, given the combination of #31 and #32. If our diagnosis is correct, this board is going to continue to cause CQ outages, despite the problematic DUT being locked.
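(Provision, in other words, depends on the freshly installed image coming back on the network: after the update and reboot it has to wait for sshd to answer, and "Connection timed out" is what surfaces when the DUT never does. A minimal sketch of that wait, with hypothetical timeouts:)

import socket
import time

def wait_for_ssh(hostname, port=22, timeout_s=300, poll_s=5):
    """Return True once the DUT's sshd accepts a TCP connection, False if the deadline passes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            sock = socket.create_connection((hostname, port), timeout=poll_s)
            sock.close()
            return True
        except OSError:  # connection refused, unreachable, or timed out; keep polling
            time.sleep(poll_s)
    return False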
,
Oct 13 2017
This DUT has a failing or loose USB ethernet adapter. It's barely able to be interacted with over SSH (my session kept locking up):

[ 9627.819193] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9627.819213] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9627.819227] usb 2-2: Product: Linksys USB3GIGV1
[ 9627.819237] usb 2-2: Manufacturer: Linksys
[ 9627.819246] usb 2-2: SerialNumber: 000001000000
[ 9627.821289] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9627.824802] platform vpd: Driver vpd requests probe deferral
[ 9632.855428] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9632.866625] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9632.866692] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9757.581271] usb 2-2: USB disconnect, device number 109
[ 9757.581916] cdc_ether 2-2:2.0 eth1: unregister 'cdc_ether' usb-0000:00:14.0-2, CDC Ethernet Device
[ 9757.808892] usb 2-2: new SuperSpeed USB device number 110 using xhci_hcd
[ 9757.821347] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9757.821366] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9757.821378] usb 2-2: Product: Linksys USB3GIGV1
[ 9757.821387] usb 2-2: Manufacturer: Linksys
[ 9757.821395] usb 2-2: SerialNumber: 000001000000
[ 9757.824764] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9757.828574] platform vpd: Driver vpd requests probe deferral
[ 9762.859426] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9762.870506] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9762.870572] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9917.615808] usb 2-2: USB disconnect, device number 110
[ 9917.616039] cdc_ether 2-2:2.0 eth1: unregister 'cdc_ether' usb-0000:00:14.0-2, CDC Ethernet Device
[ 9917.854083] usb 2-2: new SuperSpeed USB device number 111 using xhci_hcd
[ 9917.865548] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9917.865566] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9917.865578] usb 2-2: Product: Linksys USB3GIGV1
[ 9917.865587] usb 2-2: Manufacturer: Linksys
[ 9917.865596] usb 2-2: SerialNumber: 000001000000
[ 9917.868491] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9917.870440] platform vpd: Driver vpd requests probe deferral
[ 9922.902049] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9922.910172] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9922.910240] cdc_ether 2-2:2.0 eth1: kevent 12 may have been dropped
[ 9937.630804] usb 2-2: USB disconnect, device number 111
[ 9937.631089] cdc_ether 2-2:2.0 eth1: unregister 'cdc_ether' usb-0000:00:14.0-2, CDC Ethernet Device
[ 9937.861848] usb 2-2: new SuperSpeed USB device number 112 using xhci_hcd
[ 9937.876353] usb 2-2: New USB device found, idVendor=13b1, idProduct=0041
[ 9937.876376] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[ 9937.876391] usb 2-2: Product: Linksys USB3GIGV1
[ 9937.876402] usb 2-2: Manufacturer: Linksys
[ 9937.876413] usb 2-2: SerialNumber: 000001000000
[ 9937.878773] cdc_ether 2-2:2.0 eth1: register 'cdc_ether' at usb-0000:00:14.0-2, CDC Ethernet Device, 14:91:82:3b:ad:49
[ 9937.879039] platform vpd: Driver vpd requests probe deferral
[ 9937.879285] platform vpd: Driver vpd requests probe deferral

Perhaps a newer software release is less forgiving. Please schedule this DUT for manual repair or replacement.
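(One quick way to confirm a flaky adapter like this is to count how often the NIC re-enumerates. A small sketch, assuming dmesg is still reachable over ssh or a servo console; the helper name is hypothetical:)

import re
import subprocess

def count_usb_nic_flaps(dmesg_text=None):
    """Count USB-Ethernet disconnect/re-register events in a dmesg capture."""
    if dmesg_text is None:
        dmesg_text = subprocess.check_output(['dmesg']).decode(errors='replace')
    disconnects = re.findall(r'usb \S+: USB disconnect', dmesg_text)
    registers = re.findall(r"register 'cdc_ether'", dmesg_text)
    return len(disconnects), len(registers)

# A healthy DUT registers its dongle once per boot; the log above shows it dropping
# and re-enumerating every couple of minutes.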
,
Oct 13 2017
The initial install done during repair is via USB, so the large file transfers don't go over its network adapter. It's currently locked, so it won't cause issues.
,
Oct 13 2017
,
Oct 14 2017
Transferring outstanding deputy bugs.
,
Nov 6 2017
wizpig failed provision again:
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1448
Repair fixed that DUT via servoreset:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos6-row2-rack19-host3/1050951-repair/20170611133000
,
Mar 5 2018
During the week of 2/25/18, 6 cases failed due to "retry exception (function="ready_test()")" on the following DUTs. johndhong@, could you check these DUTs?
chromeos6-row22-rack13-host9
chromeos6-row22-rack13-host7
chromeos6-row22-rack13-host1
Detail log: https://stainless.corp.google.com/search?view=list&first_date=2018-02-26&last_date=2018-03-02&suite=%5Ebluestreak%24&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&reason=retry+exception&exclude_cts=true&exclude_not_run=true&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
,
Mar 5 2018
,
Mar 5 2018
@mzhuo how is wizpig failing relevant for Bluestreak devices?
,
Mar 30 2018
,
Mar 30 2018
,
Jul 20
Looks like maybe this should be closed? If not, please say so.
,
Dec 12
It seems wizpig-paladin has failed 3+ times in a row due to provision failure.
https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=wizpig-paladin&buildBranch=master
Example log: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927363591518386240/+/steps/HWTest__provision_/0/stdout
As far as I read the log, the status of all DUTs is "Repairing". Is this something abnormal? Can anyone look into the infra status?
---------------------------------------------------------------------
12-12-2018 [00:56:38] Attempting to display pool info: cq
host: chromeos2-row8-rack6-host6, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'hw_video_acc_enc_h264', 'os:cros', 'hw_jpeg_acc_dec', 'power:battery', 'hw_video_acc_vp8', 'hw_video_acc_h264', 'hw_video_acc_vp9', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'audio_loopback_dongle', 'cts_cpu_x86', 'variant:wizpig', 'touchpad', 'phase:PVT', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697540 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos2-row8-rack8-host13, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'cts_cpu_x86', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697542 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos2-row8-rack8-host10, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'ec:cros', 'os:cros', 'power:battery', 'servo', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'variant:wizpig', 'cts_cpu_x86', 'audio_loopback_dongle', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697538 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos2-row8-rack8-host17, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'audio_loopback_dongle', 'cts_cpu_x86', 'variant:wizpig', 'touchpad', 'phase:PVT', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697545 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos2-row8-rack8-host18, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'pool:cq', 'usb_nic:cdc_ether', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'ec:cros', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'cts_cpu_x86', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697539 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos6-row2-rack19-host1, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'webcam', 'wizpig', 'internal_display', 'usb_nic:r8152', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'sparse_coverage_5', 'sparse_coverage_3', 'sparse_coverage_2', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'hw_video_acc_enc_vp8', 'ec:cros', 'pool:cq', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'cts_cpu_x86', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697543 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos6-row2-rack19-host6, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'wizpig', 'usb_nic:r8152', 'touchpad', 'phase:PVT', 'bluetooth', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', 'arc', 'internal_display', 'hw_video_acc_enc_vp8', 'hw_jpeg_acc_enc', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'ec:cros', 'servo', 'variant:wizpig', 'webcam', 'cts_cpu_x86', 'audio_loopback_dongle', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_2', 'sparse_coverage_3', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697544 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos6-row2-rack19-host4, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'wizpig', 'internal_display', 'webcam', 'bluetooth', 'usb_nic:r8152', 'hw_video_acc_vp8', 'hw_video_acc_vp9', 'hw_jpeg_acc_dec', 'hw_video_acc_h264', 'hw_video_acc_enc_h264', 'model:wizpig', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'hw_jpeg_acc_enc', 'arc', 'touchpad', 'phase:PVT', 'servo', 'variant:wizpig', 'hw_video_acc_enc_vp8', 'ec:cros', 'cts_cpu_x86', 'audio_loopback_dongle', 'pool:cq', 'sparse_coverage_5', 'sparse_coverage_2', 'sparse_coverage_3', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697541 Provision started on: 2018-12-12 00:44:38 status FAIL
host: chromeos6-row2-rack19-host8, status: Repairing, locked: False diagnosis: Unused labels: ['board:wizpig', 'bluetooth', 'accel:cros-ec', 'os:cros', 'power:battery', 'cts_abi_x86', 'cts_abi_arm', 'storage:mmc', 'wizpig', 'internal_display', 'usb_nic:r8152', 'model:wizpig', 'touchpad', 'phase:PVT', 'arc', 'hw_jpeg_acc_enc', 'servo', 'variant:wizpig', 'audio_loopback_dongle', 'ec:cros', 'hw_video_acc_enc_h264', 'hw_video_acc_enc_vp8', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', '4k_video_h264', '4k_video_vp8', '4k_video_vp9', 'hw_video_acc_vp9', 'hw_video_acc_h264', 'webcam', 'cts_cpu_x86', 'sparse_coverage_2', 'sparse_coverage_5', 'sparse_coverage_3', 'pool:cq', 'sku:wizpig_intel_celeron_n3060_2Gb'] Last 10 jobs within 1:48:00: 3697546 Provision started on: 2018-12-12 00:44:38 status FAIL
Reason: Provisioning failed.
,
Dec 13
Re: #38 - The failures you linked to were caused by https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1351133. Use https://crbug.com/914717 to track. I'm going to close this old bug. I'm pretty sure whatever provisioning problems wizpig-paladin had in 2017 are gone, even if we have new problems cropping up now.