
Issue 814924


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Bogus repair task content on cros-full-0027

Reported by jrbarnette@chromium.org, Feb 22 2018

Issue description

I just ran the following command:
    $ site_utils/deploy_server.py -x cros-full-0027.mtv.corp.google.com --force_update --skip-update

The command failed.

This is essentially the same command that the automated push to
prod will run, so we can expect this failure to affect our next
push.

The relevant error messages follow:
======== cros-full-0027.mtv.corp.google.com ========
Running '/usr/local/autotest/site_utils/deploy_server_local.py --force_update --skip-update' on cros-full-0027.mtv.corp.google.com
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/keyval: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.DEBUG: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.INFO: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.WARNING: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.ERROR: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/sysinfo: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/status.log: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753.lock: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/host_keyvals/chromeos6-row1-rack7-host21: Permission denied
warning: failed to remove chromeos6-row1-rack7-host21/62495156-repair/crashinfo.chromeos6-row1-rack7-host21: Permission denied
Will skip service check for pushing servers in prod.
Checking tree status:
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/deploy_server_local.py", line 538, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/local/autotest/site_utils/deploy_server_local.py", line 504, in main
    verify_repo_clean()
  File "/usr/local/autotest/site_utils/deploy_server_local.py", line 96, in verify_repo_clean
    subprocess.check_output(['git', 'clean', '-fd'])
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['git', 'clean', '-fd']' returned non-zero exit status 1

The servers that failed were:
cros-full-0027.mtv.corp.google.com

 
Logging into the server, you can see this:
    $ ls -ld /usr/local/autotest/chromeos6-row1-rack7-host21
    drwxr-xr-x 3 root root 4096 Feb 21 19:18 /usr/local/autotest/chromeos6-row1-rack7-host21

That directory shouldn't exist; the push to prod checks for
content like this, and tries to delete it.  Since this directory
is owned by root, the delete fails.
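
To make the failure mode concrete: removing a file requires write
permission on its parent directory, so a root-owned tree with mode
drwxr-xr-x cannot be emptied by a different, unprivileged account.
Below is a minimal Python sketch of a pre-check that lists such
directories. The name `unwritable_dirs` is hypothetical and is not
part of deploy_server_local.py, and the ownership test is only an
approximation (it ignores group permissions and ACLs).

```python
import os
import stat


def unwritable_dirs(top, uid):
    """Return directories under `top` (inclusive) that `uid` likely cannot modify.

    Approximation: a directory counts as blocked when it is owned by a
    different user and is not world-writable. Group membership and ACLs
    are deliberately ignored to keep the sketch short.
    """
    blocked = []
    for dirpath, _dirnames, _filenames in os.walk(top):
        st = os.lstat(dirpath)
        if st.st_uid != uid and not (st.st_mode & stat.S_IWOTH):
            blocked.append(dirpath)
    return sorted(blocked)
```

Something like `unwritable_dirs('/usr/local/autotest/chromeos6-row1-rack7-host21', os.getuid())`
run before the `git clean -fd` step would name the offending
directories instead of failing with a bare CalledProcessError.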

The symptom can be eliminated by manually deleting the directory.
However:  _This content never should have been created_.  We need
to explain how it got there.

For the moment, here's what's in the problem directory:
chromeos6-row1-rack7-host21/
chromeos6-row1-rack7-host21/62495156-repair
chromeos6-row1-rack7-host21/62495156-repair/keyval
chromeos6-row1-rack7-host21/62495156-repair/debug
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.DEBUG
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.INFO
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.WARNING
chromeos6-row1-rack7-host21/62495156-repair/debug/autoserv.ERROR
chromeos6-row1-rack7-host21/62495156-repair/sysinfo
chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
chromeos6-row1-rack7-host21/62495156-repair/status.log
chromeos6-row1-rack7-host21/62495156-repair/host_info_store
chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753.lock
chromeos6-row1-rack7-host21/62495156-repair/host_info_store/store_97c8510f-0824-467a-98db-2e8a382dd753
chromeos6-row1-rack7-host21/62495156-repair/host_keyvals
chromeos6-row1-rack7-host21/62495156-repair/host_keyvals/chromeos6-row1-rack7-host21
chromeos6-row1-rack7-host21/62495156-repair/crashinfo.chromeos6-row1-rack7-host21

$ atest host list chromeos6-row1-rack7-host21
Host                         Status         Shard                               Locked  Lock Reason  Locked by  Platform  Labels
chromeos6-row1-rack7-host21  Repair Failed  cros-full-0018.mtv.corp.google.com  False                None       banjo     internal_display, bluetooth, touchpad, ec:cros, hw_video_acc_enc_h264, os:cros, sku:banjo_intel_celeron_n2830_2Gb, power:battery, variant:banjo, storage:mmc, cts_abi_arm, phase:PVT, hw_jpeg_acc_dec, cts_abi_x86, hw_video_acc_h264, board:banjo, webcam, pool:suites, model:banjo, 4k_video_h264, audio_loopback_dongle

The host associated with the bogus content is a DUT owned by shard
cros-full-0018; the failure is on cros-full-0027, which is a drone.
I'm wondering whether this content might in some way be related to
bug 801467.

The repair task in question was also uploaded by gs_offloader:

$ gsutil ls gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/keyval
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/result_summary.html
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/status.log
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/debug/
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_info_store/
gs://chromeos-autotest-results/hosts/chromeos6-row1-rack7-host21/62495156-repair/host_keyvals/

... And, on the scheduler, the repair task was logged:

chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep 62495156 scheduler.log.2018-02-*
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.513 INFO |     drone_manager:0812| monitoring pidfile /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.513 INFO |            models:2110| Starting: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40)
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.517 INFO |         rdb_hosts:0222| Host chromeos6-row1-rack7-host21 in Repair Failed updating {'status': 'Repairing'} through rdb on behalf of: Task: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40) 
scheduler.log.2018-02-21-14.38.36:02/21 19:13:12.521 INFO |     drone_manager:0786| command = ['nice', '-n', '10', '/usr/local/autotest/server/autoserv', '-p', '-r', u'/usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair', '-m', u'chromeos6-row1-rack7-host21', '--verbose', '--lab', 'True', '-R', '--host-protection', 'NO_PROTECTION']
scheduler.log.2018-02-21-14.38.36:02/21 19:23:04.700 INFO |            models:2121| Finished: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40) (active)
scheduler.log.2018-02-21-14.38.36:02/21 19:23:04.705 INFO |     drone_manager:0824| forgetting pidfile /usr/local/autotest/results/hosts/chromeos6-row1-rack7-host21/62495156-repair/.autoserv_execute
scheduler.log.2018-02-21-14.38.36:02/21 19:23:04.706 INFO |         rdb_hosts:0222| Host chromeos6-row1-rack7-host21 in Repairing updating {'status': 'Repair Failed'} through rdb on behalf of: Task: Special Task 62495156 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-21 19:12:40) 

The DUT in question is definitely affected by the same system
misbehavior as bug 801467.  The master scheduler shows regular
activity on the DUT:
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep chromeos6-row1-rack7-host21 scheduler.log.2018-02-* | grep 'Starting: ' | head -1
scheduler.log.2018-02-14-11.57.00:02/14 11:58:30.891 INFO |            models:2110| Starting: Special Task 62399974 (host chromeos6-row1-rack7-host21, task Reset, time 2018-02-14 11:58:12)
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep chromeos6-row1-rack7-host21 scheduler.log.2018-02-* | grep 'Starting: ' | tail -1
scheduler.log.2018-02-22-13.49.10:02/22 14:03:14.022 INFO |            models:2110| Starting: Special Task 62504202 (host chromeos6-row1-rack7-host21, task Repair, time 2018-02-22 14:03:09)
chromeos-test@cros-full-0036:/usr/local/autotest/logs$ grep chromeos6-row1-rack7-host21 scheduler.log.2018-02-* | grep 'Starting: ' | wc -l
54
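
The grep pipelines above can also be expressed as a small Python
sketch that tallies the "Starting: Special Task" lines for one host
by task type. The regular expression is inferred from the log lines
quoted in this report, and the function name `count_task_starts` is
made up for illustration; neither comes from the autotest code.

```python
import re
from collections import Counter

# Pattern inferred from the scheduler.log lines quoted in this issue.
START_RE = re.compile(
    r"Starting: Special Task (\d+) "
    r"\(host ([^,\s]+), task (\w+), time ([^)]+)\)"
)


def count_task_starts(lines, host):
    """Count 'Starting: Special Task' log lines for `host`, keyed by task type."""
    counts = Counter()
    for line in lines:
        match = START_RE.search(line)
        if match and match.group(2) == host:
            counts[match.group(3)] += 1
    return counts
```

Feeding it the lines of scheduler.log.2018-02-* would break the
total reported by `wc -l` down into Repair, Reset, and other task
types.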

However, the DUT _should_ be owned by its shard, and indeed, the shard
shows its own activities:

$ dut-status -d 2 -f chromeos6-row1-rack7-host21
chromeos6-row1-rack7-host21
    2018-02-22 13:55:39  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160557-repair/
    2018-02-22 13:50:40  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160497-verify/
    2018-02-22 13:04:37  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160188-repair/
    2018-02-22 12:59:44  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/160127-verify/
    2018-02-22 12:39:07  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row1-rack7-host21/159980-repair/

Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)
N.B. Although the problem DUT is clearly affected by bug 801467, this
bug is not the same problem.  Specifically, we need to explain why
results from a repair task wound up in /usr/local/autotest, instead of
in /usr/local/autotest/results/hosts.
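
Until the cause is known, misplaced content of this kind could at
least be detected. The Python sketch below looks for top-level
directories that contain a child named like a special-task results
directory (for example 62495156-repair). The `<digits>-<task>`
naming pattern is taken from the paths in this report, and
`find_stray_results` is a hypothetical helper, not an existing
autotest function. It does nothing to explain how the content got
there.

```python
import os
import re

# Special-task results directories in this report look like "62495156-repair".
TASK_DIR = re.compile(r"^\d+-[a-z]+$")


def find_stray_results(autotest_dir, skip=("results",)):
    """Return top-level directory names that look like misplaced host results."""
    stray = []
    for name in sorted(os.listdir(autotest_dir)):
        path = os.path.join(autotest_dir, name)
        if name in skip or os.path.islink(path) or not os.path.isdir(path):
            continue
        try:
            children = os.listdir(path)
        except OSError:
            # Unreadable directory: skip it rather than guess at its contents.
            continue
        if any(
            TASK_DIR.match(child) and os.path.isdir(os.path.join(path, child))
            for child in children
        ):
            stray.append(name)
    return stray
```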

Labels: -Pri-1 Pri-3
Owner: ----
Status: Available (was: Started)
Summary: Bogus repair task content on cros-full-0027 (was: Bogus content on cros-full-0027 is breaking push)
I've done what I think is needed to unblock the push to prod:

$ sudo mv chromeos6-row1-rack7-host21 results/hosts

I'm not going to do much more of anything here until bug 801467 is
sorted out, and if this problem doesn't recur, it may not be worth
doing anything at all.  So, let's not pretend that I own it.

Comment 9 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
