skylab_test_push failing (+ reported badly in sml task)
Issue description

2018-09-07 07:08:32,685 INFO| Triggering /opt/infra-tools/usr/bin/skylab_test_push
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): metadata.google.internal
NOTICE:root:ts_mon was set up.
WARNING:root:<class '__main__.TestPushException'>(Not enough DUTs to run push tests. Require 4 gandof DUTs in pool: bvt, only 0 are Ready now)
WARNING:root:Retrying in 34.452737 seconds...
Reverifying all DUTs.
Checking DUT inventory...
Checking DUT inventory...
Autotest instance created: localhost
09-07-2018 [07:09:32] Submitted create_suite_job rpc
09-07-2018 [07:09:34] Created suite job: http://localhost/afe/#tab_id=view_job&object_id=28397
@@@STEP_LINK@Link to suite@http://localhost/afe/#tab_id=view_job&object_id=28397@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
Suite job 28397 is completed.
Comparing test results...
Fatal error: local() encountered an error (return code 1) while executing '/opt/infra-tools/usr/bin/skylab_test_push'
Aborting.
2018-09-07 07:29:33,681 ERRO| /opt/infra-tools/usr/bin/skylab_test_push failed.
...
...
2018-09-07 07:38:12,028 INFO| Completed test push
2018-09-07 07:38:12,216 INFO| RunTestPushTask succeed.
2018-09-07 07:38:12,217 INFO| Printing out task report.
[localhost] local: find /usr/local/autotest//results/* | grep -v 'README' | xargs rm -rf
{
  "sub_reports": [],
  "exception": null,
  "is_successful": true,
  "description": "RunTestPushTask succeed.",
  "arguments_used": {
    "service_account_json": "/creds/service_accounts/cipd-uprev-service.json"
  },
  "task_name": "RunTestPushTask"
}

But that is a lie!
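The report claims is_successful: true even though skylab_test_push itself failed at 07:29. Purely as an illustration (this is not the actual RunTestPushTask code, and the real mechanism may be different), here is a minimal sketch of how a wrapper can end up reporting like this when a retry loop only records the outcome of the last attempt:

import json
import subprocess

def run_test_push_task(cmd='/opt/infra-tools/usr/bin/skylab_test_push', max_attempts=2):
    # Hypothetical wrapper: retries the command and builds the report from the
    # final attempt only, so earlier failures never reach the report.
    exception = None
    for _ in range(max_attempts):
        try:
            subprocess.check_call([cmd])
            exception = None
            break
        except subprocess.CalledProcessError as e:
            exception = e  # overwritten if a later attempt "succeeds"
    report = {
        'task_name': 'RunTestPushTask',
        'is_successful': exception is None,
        'exception': repr(exception) if exception else None,
        'sub_reports': [],
    }
    print(json.dumps(report, indent=2))
    return report

Whichever way the real task is structured, the point stands: the 07:29 failure never makes it into the final report.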
Sep 7
test_push.py: 2018-08-21 13:46:14,495:INFO:test_push:_run_test_push:191:Finished skylab_staging_test suite.
test_push.py: 2018-08-21 13:46:14,495:INFO:test_push:_verify_test_results:198:Comparing test results...
test_push.py: 2018-08-21 13:46:14,500:DEBUG:autotest:find_module:138:Dummying out autotest_lib.tko.common import
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:201:Test results:
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Pass.actionable GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Pass GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:login_LoginSuccess GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Error ERROR
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Fail FAIL
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Warn WARN
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.NAError TEST_NA
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Crash GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_PassServer GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:provision_AutoUpdate.double FAIL
test_push.py: 2018-08-21 13:46:15,327:ERROR:test_push:_verify_test_results:207:Results of 1 test(s) do not match expected values:
provision_AutoUpdate.double Expected: [GOOD], Actual: [FAIL]
test_push.py: 2018-08-21 13:46:15,328:INFO:ts_mon_config:_CleanupMetricsFlushingProcess:207:Waiting for ts_mon flushing process to finish...
test_push.py: 2018-08-21 13:47:15,046:INFO:ts_mon_config:_CleanupMetricsFlushingProcess:213:Finished waiting for ts_mon process.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 255, in <module>
    sys.exit(main())
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 74, in main
    _run_test_push(args)
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 193, in _run_test_push
    _verify_test_results(task_id, _EXPECTED_TEST_RESULTS)
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 208, in _verify_test_results
    raise errors.TestPushError('Test results are not consistent with '
skylab_staging.errors.TestPushError: Test results are not consistent with expected results
provision_AutoUpdate.double is failing in skylab: https://chromium-swarm-dev.appspot.com/task?id=3f76ed8a272d1310&refresh=10
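For reference, _verify_test_results is essentially a comparison of each test's reported status against a hard-coded expectation. The following is a rough paraphrase reconstructed from the log output above, not the actual test_push.py source (in the real code the function takes a task_id and looks the results up itself):

_EXPECTED_TEST_RESULTS = {
    'dummy_Pass': 'GOOD',
    'dummy_Fail.Fail': 'FAIL',
    'dummy_Fail.Error': 'ERROR',
    'login_LoginSuccess': 'GOOD',
    'provision_AutoUpdate.double': 'GOOD',
    # ...
}

class TestPushError(Exception):
    pass

def _verify_test_results(actual_results, expected_results=_EXPECTED_TEST_RESULTS):
    # actual_results: dict mapping test name -> status string (GOOD, FAIL, ...)
    # gathered from the suite's results.
    mismatches = []
    for test, expected in sorted(expected_results.items()):
        actual = actual_results.get(test)
        if actual != expected:
            mismatches.append('%s    Expected: [%s], Actual: [%s]' % (test, expected, actual))
    if mismatches:
        raise TestPushError('Test results are not consistent with expected results:\n'
                            + '\n'.join(mismatches))

So the suite itself ran to completion; the test push fails only because provision_AutoUpdate.double came back FAIL where GOOD is expected.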
Sep 7
Staging blockers are deputy's responsibility.
Sep 7
Failure from https://stainless.corp.google.com/browse/chromeos-autotest-results/swarming-3f76ed8a272d1311/

FAIL provision_AutoUpdate.double provision_AutoUpdate.double timestamp=1534884151 localtime=Aug 21 13:42:31
Download and install failed from chromeos-gt-devserver17.cbf.corp.google.com onto chromeos4-row7-rack10-host3: Update engine status is UPDATE_STATUS_UPDATED_NEED_REBOOT (UPDATE_STATUS_IDLE was expected).

All bots at go/chromeos-skylab-staging-bots are in repair_failed state. This is inconsistent with how these DUTs are behaving in prod: http://shortn/_SclqsxdaiA
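For anyone not familiar with that error string: UPDATE_STATUS_UPDATED_NEED_REBOOT means update_engine applied a payload but the DUT never (successfully) rebooted into it, whereas after a completed provision the engine should be back at UPDATE_STATUS_IDLE. A sketch of the kind of check that produces the message (an assumption about what the autoupdate code does in spirit, not copied from it):

import subprocess

def update_engine_status(host):
    # 'update_engine_client --status' on the DUT prints KEY=VALUE lines,
    # one of which is CURRENT_OP=<status>.
    out = subprocess.check_output(
        ['ssh', 'root@%s' % host, 'update_engine_client --status']).decode()
    for line in out.splitlines():
        if line.startswith('CURRENT_OP='):
            return line.split('=', 1)[1].strip()
    return None

def check_post_provision(host):
    status = update_engine_status(host)
    if status != 'UPDATE_STATUS_IDLE':
        raise Exception('Update engine status is %s (UPDATE_STATUS_IDLE was expected).' % status)

In other words, the check ran at a point where the DUT had staged the update but had not come back up in the new image, which is consistent with the repair_failed state of these bots.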
Sep 7
Focusing on one of the bots: http://shortn/_EIQAW69bRs

The DUT does not respond to ping / ssh. Then, all repair steps fail, including repair-via-usb, because:

09/07 12:58:15.756 INFO | servo:0673| Searching for usb device and copying image to it. Please wait a few minutes...
09/07 13:00:53.804 ERROR| servo:0676| Failed to transfer requested image to USB. Please take a look at Servo Logs.
09/07 13:00:53.811 ERROR| repair:0507| Repair failed: Reinstall from USB using servo
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 505, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 560, in repair
    host.servo_install(host.stage_image_for_servo())
  File "/usr/local/autotest/server/hosts/cros_host.py", line 665, in servo_install
    self.servo.install_recovery_image(image_url)
  File "/usr/local/autotest/server/cros/servo/servo.py", line 705, in install_recovery_image
    self.image_to_servo_usb(image_path, make_image_noninteractive)
  File "/usr/local/autotest/server/cros/servo/servo.py", line 678, in image_to_servo_usb
    raise error.AutotestError('Download image to usb failed.')
AutotestError: Download image to usb failed.

Very helpfully, it doesn't tell us where the servo logs are. I found them at:

ssh root@chromeos4-row7-rack10-host3-servo.cros
localhost ~ # vim /var/log/servod_9999.log
...
2018-08-15 15:05:03,054 - Servod - INFO - Initialized bb_kbd_m2_c1_r3 to off
2018-08-15 15:05:03,064 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-15 15:05:03,079 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-08-21 12:56:06,180 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 13:38:07,268 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 14:20:05,181 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 15:13:21,444 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 16:00:08,101 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )

So, USB transfer has been failing since 8/21. But we have no visibility into the various servo repair failure modes. See issue 735121 for my past complaints about this. I suppose something should try to recover servo in this case, but nothing does...
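Not real code, but the kind of breadcrumb I wish image_to_servo_usb left behind when it fails. The servohost name and log path below just follow the convention used above (DUT hostname + '-servo', /var/log/servod_<port>.log); they are assumptions for illustration, not an official API:

def servo_log_hint(dut_hostname, servo_port=9999):
    # Hypothetical helper: point whoever reads the repair failure at the
    # servod log for this DUT's servo.
    servohost = '%s-servo.cros' % dut_hostname
    log_path = '/var/log/servod_%d.log' % servo_port
    return ('Download image to usb failed; check servod logs at %s:%s'
            % (servohost, log_path))

# e.g. raise error.AutotestError(servo_log_hint('chromeos4-row7-rack10-host3'))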
Sep 7
I restarted servod on chromeos4-row7-rack10-host3-servo.cros. Let's see if it does better on the next admin repair task.

I created a retry of the last failed repair task on that bot: https://chromium-swarm-dev.appspot.com/task?id=3fceb4c3fcd2a610&refresh=10&request_detail=true

(Note to self: go check crosskylabadmin-staging metrics later to ensure that manually retried admin tasks do not confuse the admin task creation app.)
Sep 7
Other dead bots have the same story:

http://shortn/_yZBGOMfzxK
2018-08-15 15:09:38,885 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-15 15:09:38,894 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-08-30 08:04:35,752 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-30 09:01:03,144 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-30 09:42:16,640 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )

http://shortn/_YAAbtyWTdE
2018-08-15 16:03:07,038 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-15 16:03:07,048 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-09-04 22:23:44,790 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-09-04 23:04:43,158 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-09-04 23:45:36,942 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )

http://shortn/_BfqbsWHJVO
2018-08-31 09:04:44,047 - Servod - INFO - Initialized bb_kbd_m2_c1_r2 to off
2018-08-31 09:04:44,084 - Servod - INFO - Initialized bb_kbd_m2_c1_r3 to off
2018-08-31 09:04:44,092 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-31 09:04:44,102 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-08-31 09:30:51,636 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-31 10:16:01,268 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-31 11:10:14,027 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )

I've restarted servod on all of those servohosts. I've also filed b/114461478 to have englab-sys-cros take a look at these DUTs.
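Since this is now several servohosts with the same signature, here's the sort of sweep that could replace checking each one by hand. Sketch only: the host list and port are placeholders, and it assumes root ssh access to the servohosts as used above:

import subprocess

SERVOHOSTS = [
    'chromeos4-row7-rack10-host3-servo.cros',
    # ...the rest of the servohosts behind go/chromeos-skylab-staging-bots
]

def failing_usb_transfers(host, port=9999):
    # Returns matching servod log lines, or '' if there are none or the host
    # is unreachable (both make ssh/grep exit non-zero).
    cmd = ['ssh', 'root@%s' % host,
           'grep "Failed to transfer image to USB device" /var/log/servod_%d.log' % port]
    try:
        return subprocess.check_output(cmd).decode()
    except subprocess.CalledProcessError:
        return ''

for h in SERVOHOSTS:
    lines = failing_usb_transfers(h)
    if lines:
        print('%s has stuck USB transfers:\n%s' % (h, lines))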
Sep 11
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/8a93f212a65baf86f2dfaa77113d3a423f2b751f

commit 8a93f212a65baf86f2dfaa77113d3a423f2b751f
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Sep 11 20:37:18 2018