
Issue 881883

Issue metadata

Status: Fixed
Closed: Sep 13
OS: Chrome
Pri: 1
Type: Bug




skylab_test_push failing (+ reported badly in sml task)

Project Member Reported by pprabhu@chromium.org, Sep 7

Issue description

2018-09-07 07:08:32,685 INFO| Triggering /opt/infra-tools/usr/bin/skylab_test_push
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): metadata.google.internal
NOTICE:root:ts_mon was set up.
WARNING:root:<class '__main__.TestPushException'>(Not enough DUTs to run push tests.
Require 4 gandof DUTs in pool: bvt, only 0 are Ready now)
WARNING:root:Retrying in 34.452737 seconds...
Reverifying all DUTs.
Checking DUT inventory...
Checking DUT inventory...
Autotest instance created: localhost
09-07-2018 [07:09:32] Submitted create_suite_job rpc
09-07-2018 [07:09:34] Created suite job: http://localhost/afe/#tab_id=view_job&object_id=28397
@@@STEP_LINK@Link to suite@http://localhost/afe/#tab_id=view_job&object_id=28397@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
Suite job 28397 is completed.
Comparing test results...

Fatal error: local() encountered an error (return code 1) while executing '/opt/infra-tools/usr/bin/skylab_test_push'

Aborting.
2018-09-07 07:29:33,681 ERRO| /opt/infra-tools/usr/bin/skylab_test_push failed.


...
...
2018-09-07 07:38:12,028 INFO| Completed test push
2018-09-07 07:38:12,216 INFO| RunTestPushTask succeed.
2018-09-07 07:38:12,217 INFO| Printing out task report.
[localhost] local: find /usr/local/autotest//results/* | grep -v 'README' | xargs rm -rf
{
  "sub_reports": [],
  "exception": null,
  "is_successful": true,
  "description": "RunTestPushTask succeed.",
  "arguments_used": {
    "service_account_json": "/creds/service_accounts/cipd-uprev-service.json"
  },
  "task_name": "RunTestPushTask"
}


But that is a lie!
 
Summary: skylab_test_push failing (+ reported badly in sml task) (was: test_push claims success even though skylab_test_push fails)
The task is actually failing overall, as it should; it is just being reported badly: https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/675248
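
For context, a minimal sketch of the reporting bug, assuming a hypothetical report dict shaped like the JSON printed above (the real task/report classes and the actual fix live in the chromeos-admin CL linked above):

import subprocess

def run_test_push_task(report):
    """Run skylab_test_push and make the report reflect its exit status.

    `report` is a hypothetical dict shaped like the JSON above.
    """
    returncode = subprocess.call(['/opt/infra-tools/usr/bin/skylab_test_push'])
    if returncode != 0:
        # The bug: this failure was swallowed, so the report still said
        # "is_successful": true. The exit status must be propagated.
        report['is_successful'] = False
        report['exception'] = 'skylab_test_push exited with %d' % returncode
        report['description'] = 'RunTestPushTask failed.'
    else:
        report['is_successful'] = True
        report['description'] = 'RunTestPushTask succeeded.'
    return report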


test_push.py: 2018-08-21 13:46:14,495:INFO:test_push:_run_test_push:191:Finished skylab_staging_test suite.
test_push.py: 2018-08-21 13:46:14,495:INFO:test_push:_verify_test_results:198:Comparing test results...
test_push.py: 2018-08-21 13:46:14,500:DEBUG:autotest:find_module:138:Dummying out autotest_lib.tko.common import
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:201:Test results:
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Pass.actionable         GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Pass                    GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:login_LoginSuccess            GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Error              ERROR
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Fail               FAIL
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Warn               WARN
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.NAError            TEST_NA
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_Fail.Crash              GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:dummy_PassServer              GOOD
test_push.py: 2018-08-21 13:46:15,324:DEBUG:test_push:_verify_test_results:203:provision_AutoUpdate.double   FAIL
test_push.py: 2018-08-21 13:46:15,327:ERROR:test_push:_verify_test_results:207:Results of 1 test(s) do not match expected values:
provision_AutoUpdate.double Expected: [GOOD], Actual: [FAIL]


test_push.py: 2018-08-21 13:46:15,328:INFO:ts_mon_config:_CleanupMetricsFlushingProcess:207:Waiting for ts_mon flushing process to finish...
test_push.py: 2018-08-21 13:47:15,046:INFO:ts_mon_config:_CleanupMetricsFlushingProcess:213:Finished waiting for ts_mon process.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 255, in <module>
    sys.exit(main())
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 74, in main
    _run_test_push(args)
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 193, in _run_test_push
    _verify_test_results(task_id, _EXPECTED_TEST_RESULTS)
  File "/usr/local/autotest/venv/skylab_staging/test_push.py", line 208, in _verify_test_results
    raise errors.TestPushError('Test results are not consistent with '
skylab_staging.errors.TestPushError: Test results are not consistent with expected results
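
For reference, the comparison that raises here is roughly of this shape (a sketch; the entries and status strings are taken from the log above, everything else is illustrative of _verify_test_results, not the real code):

class TestPushError(Exception):
    """Stand-in for skylab_staging.errors.TestPushError."""

_EXPECTED_TEST_RESULTS = {
    'dummy_Pass': 'GOOD',
    'dummy_Fail.Fail': 'FAIL',
    'provision_AutoUpdate.double': 'GOOD',
    # ... remaining entries elided ...
}

def verify_test_results(actual_results, expected_results):
    """Raise if any test's actual status differs from the expected one."""
    mismatches = []
    for test_name, expected in expected_results.items():
        actual = actual_results.get(test_name)
        if actual != expected:
            mismatches.append('%s Expected: [%s], Actual: [%s]'
                              % (test_name, expected, actual))
    if mismatches:
        raise TestPushError(
            'Results of %d test(s) do not match expected values:\n%s'
            % (len(mismatches), '\n'.join(mismatches)))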


provision_AutoUpdate.double is failing in Skylab: https://chromium-swarm-dev.appspot.com/task?id=3f76ed8a272d1310&refresh=10
Labels: Hotlist-Deputy
Staging blockers are the deputy's responsibility.
Failure from https://stainless.corp.google.com/browse/chromeos-autotest-results/swarming-3f76ed8a272d1311/

	FAIL	provision_AutoUpdate.double	provision_AutoUpdate.double	timestamp=1534884151	localtime=Aug 21 13:42:31	Download and install failed from chromeos-gt-devserver17.cbf.corp.google.com onto chromeos4-row7-rack10-host3: Update engine status is UPDATE_STATUS_UPDATED_NEED_REBOOT (UPDATE_STATUS_IDLE was expected).
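
If you need to poke at this by hand, a quick way to see what update_engine thinks on the DUT (a sketch; the CURRENT_OP field name is an assumption about update_engine_client --status output, and the hostname comes from the failure line above):

import subprocess

def update_engine_status(dut):
    # Assumption: update_engine_client --status prints a CURRENT_OP=... line.
    out = subprocess.check_output(
        ['ssh', 'root@%s' % dut, 'update_engine_client', '--status']).decode()
    for line in out.splitlines():
        if line.startswith('CURRENT_OP='):
            return line.split('=', 1)[1]
    return None

status = update_engine_status('chromeos4-row7-rack10-host3')
if status != 'UPDATE_STATUS_IDLE':
    print('DUT reports %s; it likely needs a reboot to finish the pending update.'
          % status)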


All bots at go/chromeos-skylab-staging-bots are in repair_failed state.
This is inconsistent with how these DUTs are behaving in prod: http://shortn/_SclqsxdaiA
Focusing on one of the bots: http://shortn/_EIQAW69bRs

The DUT does not respond to ping / ssh.
Then all repair steps fail, including repair-via-USB, because:

09/07 12:58:15.756 INFO |             servo:0673| Searching for usb device and copying image to it. Please wait a few minutes...
09/07 13:00:53.804 ERROR|             servo:0676| Failed to transfer requested image to USB. Please take a look at Servo Logs.
09/07 13:00:53.811 ERROR|            repair:0507| Repair failed: Reinstall from USB using servo
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 505, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 560, in repair
    host.servo_install(host.stage_image_for_servo())
  File "/usr/local/autotest/server/hosts/cros_host.py", line 665, in servo_install
    self.servo.install_recovery_image(image_url)
  File "/usr/local/autotest/server/cros/servo/servo.py", line 705, in install_recovery_image
    self.image_to_servo_usb(image_path, make_image_noninteractive)
  File "/usr/local/autotest/server/cros/servo/servo.py", line 678, in image_to_servo_usb
    raise error.AutotestError('Download image to usb failed.')
AutotestError: Download image to usb failed.


Very helpfully, it doesn't tell us where the servo logs are.

I found them via:
ssh root@chromeos4-row7-rack10-host3-servo.cros

localhost ~ # vim /var/log/servod_9999.log

...

2018-08-15 15:05:03,054 - Servod - INFO - Initialized bb_kbd_m2_c1_r3 to off
2018-08-15 15:05:03,064 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-15 15:05:03,079 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-08-21 12:56:06,180 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 13:38:07,268 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 14:20:05,181 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 15:13:21,444 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-21 16:00:08,101 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )


So the USB transfer has been failing since 8/21, but we have no visibility into the various servo repair failure modes. See issue 735121 for my past complaints about this.
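
To make that concrete, a sketch of what the repair error could carry so the next deputy does not have to hunt for the servod logs (the servo host naming and log path come from this bug; the wrapper itself is illustrative, not the real autotest code):

class AutotestError(Exception):
    """Stand-in for autotest_lib.client.common_lib.error.AutotestError."""

SERVOD_LOG = '/var/log/servod_%d.log'

def image_to_usb_with_context(servo, image_path, dut_hostname, servod_port=9999):
    # Illustrative wrapper: on failure, point at the servod log on the
    # servo host instead of just saying "Download image to usb failed."
    try:
        servo.image_to_servo_usb(image_path)
    except Exception as e:
        raise AutotestError(
            'Download image to usb failed (%s). See servod logs at '
            'root@%s-servo.cros:%s'
            % (e, dut_hostname, SERVOD_LOG % servod_port))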

I suppose something should try to recover servo in this case, but nothing does...
I restarted servod on chromeos4-row7-rack10-host3-servo.cros
Let's see if it does better on the next admin repair task.
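
A quick sanity check that servod came back after the restart (just a TCP connect against the port from the log above; it does not exercise the USB transfer path):

import socket

def servod_is_listening(servo_host, port=9999, timeout=5):
    # Returns True if something accepts connections on the servod port.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((servo_host, port))
        return True
    except (socket.error, socket.timeout):
        return False
    finally:
        s.close()

print(servod_is_listening('chromeos4-row7-rack10-host3-servo.cros'))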

I created a retry of the last failed repair task on that bot: https://chromium-swarm-dev.appspot.com/task?id=3fceb4c3fcd2a610&refresh=10&request_detail=true

(Note to self: Go check crosskylabadmin-staging metrics later to ensure that manually retried admin tasks do not confuse the admin task creation app)
Labels: -Pri-2 Pri-1
Other dead bots have the same story:

http://shortn/_yZBGOMfzxK

2018-08-15 15:09:38,885 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-15 15:09:38,894 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-08-30 08:04:35,752 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-30 09:01:03,144 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-30 09:42:16,640 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )

http://shortn/_YAAbtyWTdE

2018-08-15 16:03:07,038 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-15 16:03:07,048 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-09-04 22:23:44,790 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-09-04 23:04:43,158 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-09-04 23:45:36,942 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )

http://shortn/_BfqbsWHJVO

2018-08-31 09:04:44,047 - Servod - INFO - Initialized bb_kbd_m2_c1_r2 to off
2018-08-31 09:04:44,084 - Servod - INFO - Initialized bb_kbd_m2_c1_r3 to off
2018-08-31 09:04:44,092 - Servod - INFO - Initialized ec_ec3po_interp_connect to on
2018-08-31 09:04:44,102 - servod - INFO - Listening on 0.0.0.0 port 9999
2018-08-31 09:30:51,636 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-31 10:16:01,268 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )
2018-08-31 11:10:14,027 - Servod - ERROR - Failed to transfer image to USB device: [Errno 110] Connection timed out ( socket error )


I've restarted servod on all of those servohosts.
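
For the record, the per-host check above, scripted (the host list is a placeholder for the bots behind the shortlinks; the log path and error string come straight from the pasted logs):

import subprocess

SERVO_HOSTS = [
    'chromeos4-row7-rack10-host3-servo.cros',
    # ... the other servo hosts from go/chromeos-skylab-staging-bots ...
]
ERROR = 'Failed to transfer image to USB device'

for host in SERVO_HOSTS:
    # grep -c exits non-zero when there are no matches; `|| true` keeps
    # check_output from raising in that case.
    cmd = 'grep -c "%s" /var/log/servod_9999.log || true' % ERROR
    count = subprocess.check_output(['ssh', 'root@%s' % host, cmd]).decode().strip()
    print('%s: %s USB transfer failures' % (host, count))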

I've also filed b/114461478 to have englab-sys-cros take a look at these DUTs.
Project Member Comment 8 by bugdroid1@chromium.org, Sep 11

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/8a93f212a65baf86f2dfaa77113d3a423f2b751f

commit 8a93f212a65baf86f2dfaa77113d3a423f2b751f
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Sep 11 20:37:18 2018

Status: Fixed (was: Assigned)
