Bogus repair task content on cros-full-0027
Reported by
jrbarnette@chromium.org,
Feb 22 2018
Issue description
I just ran the following command:
$ site_utils/deploy_server.py -x cros-full-0027.mtv.corp.google.com --force_update --skip-update
The command failed.
This command is essentially the same command that will be used to
perform automated push to prod, so we can expect this failure to
affect our next push.
The relevant error messages follow:
======== cros-full-0027.mtv.corp.google.com ========
Running '/usr/local/autotest/site_utils/deploy_server_local.py --force_update --skip-update' on cros-full-0027.mtv.corp.google.com
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/keyval: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.DEBUG: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.INFO: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.WARNING: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.ERROR: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/sysinfo: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/status.log: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753.lock: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/host_keyvals/chromeos6-row1-rack7-host21: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/crashinfo.chromeos6-row1-rack7-host21: Permission denied
Will skip service check for pushing servers in prod.
Checking tree status:
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/deploy_server_local.py", line 538, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/local/autotest/site_utils/deploy_server_local.py", line 504, in main
    verify_repo_clean()
  File "/usr/local/autotest/site_utils/deploy_server_local.py", line 96, in verify_repo_clean
    subprocess.check_output(['git', 'clean', '-fd'])
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['git', 'clean', '-fd']' returned non-zero exit status 1
The servers that failed were:
cros-full-0027.mtv.corp.google.com
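The warnings above all point at a single stray top-level directory. A minimal sketch of how that could be summarized automatically is shown here; the function name `summarize_failed_removals` is hypothetical and not part of `deploy_server_local.py`, and the sketch assumes the warning text has exactly the shape shown in the log above.

```python
import re
from collections import Counter

# Matches lines of the form seen in the log above:
#   warning: failed to remove <path>: Permission denied
_WARNING_RE = re.compile(
    r"^warning: failed to remove (?P<path>.+): Permission denied$"
)


def summarize_failed_removals(output):
    """Count 'failed to remove' warnings per top-level directory.

    Hypothetical helper: `output` is the captured text of a cleanup
    command; lines that are not removal warnings are ignored.
    """
    counts = Counter()
    for line in output.splitlines():
        match = _WARNING_RE.match(line.strip())
        if match:
            # Keep only the first path component, e.g. the stray host directory.
            top_level = match.group("path").split("/", 1)[0]
            counts[top_level] += 1
    return dict(counts)


sample = """\
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/keyval: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/status.log: Permission denied
Will skip service check for pushing servers in prod.
"""
print(summarize_failed_removals(sample))
```

Grouping by the first path component makes it easier to see that every failure belongs to one unexpected directory rather than to scattered files.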
,
Feb 22 2018
$ atest host list chromeos6-row1-rack7-host21
Host  Status  Shard  Locked  Lock Reason  Locked by  Platform  Labels
chromeos6-row1-rack7-host21  Repair Failed  cros-full-0018.mtv.corp.google.com  False  None  banjo  internal_display, bluetooth, touchpad, ec:cros, hw_video_acc_enc_h264, os:cros, sku:banjo_intel_celeron_n2830_2Gb, power:battery, variant:banjo, storage:mmc, cts_abi_arm, phase:PVT, hw_jpeg_acc_dec, cts_abi_x86, hw_video_acc_h264, board:banjo, webcam, pool:suites, model:banjo, 4k_video_h264, audio_loopback_dongle
The host associated with the bogus content is a DUT owned by shard cros-full-0018; the failure is on cros-full-0027, which is a drone.
I'm wondering whether this content might in some way be related to bug 801467.
,
Feb 22 2018
The repair task in question was also uploaded by gs_offloader:
$ gsutil ls gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/keyval
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/result_summary.html
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/status.log
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/debug/
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_keyvals/
,
Feb 22 2018
... And, on the scheduler, the repair task was logged:
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep 62495156 scheduler.log.2018-02-*
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.513 INFO | drone_manager:0812| monitoring pidfile /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.513 INFO | models:2110| Starting: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40)
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.517 INFO | rdb_hosts:0222| Host chromeos6-row1-rack7-host21 in Repair Failed updating {'status': 'Repairing'} through rdb on behalf of: Task: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40)
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.521 INFO | drone_manager:0786| command = ['nice', '-n', '10', '/usr/local/autotest/server/autoserv', '-p', '-r', u'/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair', '-m', u'chromeos6-row1-rack7-host21', '--verbose', '--lab', 'True', '-R', '--host-protection', 'NO_PROTECTION']
scheduler.log.2018-02-21-14.38.36:02/21 19:23:04.700 INFO | models:2121| Finished: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40) (active)
scheduler.log.2018-02-21-14.38.36:02/21 19:23:04.705 INFO | drone_manager:0824| forgetting pidfile /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
scheduler.log.2018-02-21-14.38.36:02/21 19:23:04.706 INFO | rdb_hosts:0222| Host chromeos6-row1-rack7-host21 in Repairing updating {'status': 'Repair Failed'} through rdb on behalf of: Task: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40)
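One way to cross-check these log lines against the stray directory is to extract the task's start and finish times and see whether the directory timestamp from the `ls -ld` output in Comment 1 (Feb 21 19:18) falls between them. The following is only a sketch: `task_window` is a hypothetical helper, the year is passed in because the log timestamps omit it, and it assumes the scheduler and the drone report times in the same time zone.

```python
import re
from datetime import datetime

# Picks out the "MM/DD HH:MM:SS.mmm" stamp and the Starting/Finished event
# from grep output lines like the ones quoted above.
_EVENT_RE = re.compile(
    r"(?P<month>\d{2})/(?P<day>\d{2}) (?P<clock>\d{2}:\d{2}:\d{2})\.\d+"
    r".*?(?P<event>Starting|Finished): Special Task (?P<task_id>\d+)"
)


def task_window(lines, task_id, year):
    """Return (start, end) datetimes for one special task, or None if unseen.

    Hypothetical helper; `year` is an assumption supplied by the caller.
    """
    start = end = None
    for line in lines:
        match = _EVENT_RE.search(line)
        if not match or match.group("task_id") != str(task_id):
            continue
        stamp = datetime.strptime(
            "%d-%s-%s %s" % (year, match.group("month"),
                             match.group("day"), match.group("clock")),
            "%Y-%m-%d %H:%M:%S",
        )
        if match.group("event") == "Starting":
            start = stamp
        else:
            end = stamp
    return start, end


sample_lines = [
    "scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.513 INFO | models:2110| Starting: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40)",
    "scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.517 INFO | rdb_hosts:0222| Host chromeos6-row1-rack7-host21 in Repair Failed updating {'status': 'Repairing'} through rdb on behalf of: Task: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40)",
    "scheduler.log.2018-02-21-14.38.36:02/21 19:23:04.700 INFO | models:2121| Finished: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40) (active)",
]

start, end = task_window(sample_lines, 62495156, 2018)
# Directory timestamp from the ls -ld output; the year 2018 is assumed.
directory_mtime = datetime(2018, 2, 21, 19, 18)
print(start, end, start <= directory_mtime <= end)
```

If the comparison holds, it is consistent with the directory having been written while this repair task was running, although it does not by itself explain why the results landed in the wrong place.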
,
Feb 22 2018
The DUT in question is definitely affected by the same system misbehavior as bug 801467.
The master scheduler shows regular activity on the DUT:
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep chromeos6-row1-rack7-host21 scheduler.log.2018-02-* | grep 'Starting: ' | head -1
scheduler.log.2018-02-14-11.57.00:02/14 11:58:30.891 INFO | models:2110| Starting: Special Task 62399974 (host chromeos6-row1-rack7-host21, task Reset, time 2018-02-14 11:58:12)
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep chromeos6-row1-rack7-host21 scheduler.log.2018-02-* | grep 'Starting: ' | tail -1
scheduler.log.2018-02-22-13.49.10:02/22 14:03:14.022 INFO | models:2110| Starting: Special Task 62504202 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-22 14:03:09)
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep chromeos6-row1-rack7-host21 scheduler.log.2018-02-* | grep 'Starting: ' | wc -l
54
However, the DUT _should_ be owned by its shard, and indeed, the shard shows its own activities:
$ dut-status -d 2 -f chromeos6-row1-rack7-host21
chromeos6-row1-rack7-host21
    2018-02-22 13:55:39  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160557-repair/
    2018-02-22 13:50:40  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160497-verify/
    2018-02-22 13:04:37  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160188-repair/
    2018-02-22 12:59:44  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160127-verify/
    2018-02-22 12:39:07  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/159980-repair/
,
Feb 22 2018
N.B. Although the problem DUT is clearly affected by bug 801467, this bug is not the same problem. Specifically, we need to explain why results from a repair task wound up in /usr/local/autotest, instead of in /usr/local/autotest/results/hosts.
,
Feb 22 2018
I've done what I think is needed to unblock the push to prod:
$ sudo mv chromeos6-row1-rack7-host21 results/hosts
I'm not going to do much more of anything here until bug 801467 is sorted out, and if this problem doesn't recur, it may not be worth doing anything at all. So, let's not pretend that I own it.
,
Jun 8 2018
Comment 1 by jrbarnette@chromium.org, Feb 22 2018
Logging into the server, you can see this:
$ ls -ld /usr/local/autotest/chromeos6-row1-rack7-host21
drwxr-xr-x 3 root root 4096 Feb 21 19:18 /usr/local/autotest/chromeos6-row1-rack7-host21
That directory shouldn't exist; the push to prod checks for content like this, and tries to delete it. Since this directory is owned by root, the delete fails.
The symptom can be eliminated by manually deleting the directory. However: _This content never should have been created_. We need to explain how it got there.
For the moment, here's what's in the problem directory:
chromeos6-row1-rack7-host21/
chromeos6-row1-rack7-host21/62495156-repair
chromeos6-row1-rack7-host21/62495156-repair/keyval
chromeos6-row1-rack7-host21/62495156-repair/debug
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.DEBUG
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.INFO
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.WARNING
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.ERROR
chromeos6-row1-rack7-host21/62495156-repair/sysinfo
chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
chromeos6-row1-rack7-host21/62495156-repair/status.log
chromeos6-row1-rack7-host21/62495156-repair/host_info_store
chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753.lock
chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753
chromeos6-row1-rack7-host21/62495156-repair/host_keyvals
chromeos6-row1-rack7-host21/62495156-repair/host_keyvals/chromeos6-row1-rack7-host21
chromeos6-row1-rack7-host21/62495156-repair/crashinfo.chromeos6-row1-rack7-host21
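Removing an entry requires write permission on its parent directory, so a root-owned directory with mode drwxr-xr-x blocks an unprivileged cleanup of everything beneath it. A rough way to spot such content ahead of a push is to walk a tree and report anything not owned by the expected user. This is a sketch only: `find_paths_not_owned_by` is a hypothetical helper, not part of the deploy scripts, and ownership is used here as a proxy for "cannot be cleaned up", which is an assumption rather than an exact permission check.

```python
import os


def find_paths_not_owned_by(root, uid):
    """Return sorted paths under `root` (inclusive) not owned by `uid`.

    Hypothetical helper. Uses lstat so symlinks are judged by their own
    owner rather than by the owner of their target.
    """
    foreign = []
    if os.lstat(root).st_uid != uid:
        foreign.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != uid:
                foreign.append(path)
    return sorted(foreign)
```

Run as the deploying user with `uid=os.geteuid()` against the directory shown above, a check like this would presumably list the whole `62495156-repair` tree, since the top-level directory is owned by root.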