
Issue 600752

Starred by 3 users

Issue metadata

Status: Verified
Owner:
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Scheduler crash due to exception from copy_file_or_directory

Project Member Reported by dshi@chromium.org, Apr 5 2016

Issue description

We are seeing an increasing number of scheduler crashes caused by exceptions raised from copy_file_or_directory, for example:

04/05 05:43:55.655 ERROR|     email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 179, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/scheduler/site_monitor_db.py", line 106, in tick
    super(SiteDispatcher, self).tick()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 374, in tick
    _drone_manager.execute_actions()
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/scheduler/site_drone_manager.py", line 89, in execute_actions
    super(SiteDroneManager, self).execute_actions()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 563, in execute_actions
    self._drones.values())
  File "/usr/local/autotest/scheduler/thread_lib.py", line 202, in execute
    return self.get_results() if wait else None
  File "/usr/local/autotest/scheduler/thread_lib.py", line 148, in get_results
    self.wait_on_drones()
  File "/usr/local/autotest/scheduler/thread_lib.py", line 136, in wait_on_drones
    raise drone_task_queue.DroneTaskQueueException(exception_msg)
DroneTaskQueueException: Drone chromeos-server2.mtv.corp.google.com raised Exception command execution error
* Command:
    /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_ujfMzbssh-master/socket
    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpCRVcuc -o
    BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l
    chromeos-test -p 22 chromeos-server2.mtv.corp.google.com " python
    /usr/local/autotest/scheduler/drone_utility.py --call_time 1459860230.4"
Exit status: 1
Duration: 1.23830795288

stderr:
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/drone_utility.py", line 618, in <module>
    main()
  File "/usr/local/autotest/scheduler/drone_utility.py", line 612, in main
    return_value = drone_utility.execute_calls(calls)
  File "/usr/local/autotest/scheduler/drone_utility.py", line 540, in execute_calls
    results.append(method_call.execute_on(self))
  File "/usr/local/autotest/scheduler/drone_utility.py", line 54, in execute_on
    return method(*self._args, **self._kwargs)
  File "/usr/local/autotest/scheduler/drone_utility.py", line 409, in copy_file_or_directory
    os.path.join(destination_path, filename))
  File "/usr/local/autotest/scheduler/drone_utility.py", line 411, in copy_file_or_directory
    shutil.copytree(source_path, destination_path, symlinks=True)
  File "/usr/lib/python2.7/shutil.py", line 208, in copytree
    raise Error, errors

shutil.Error: [(u'/usr/local/autotest/results/hosts/chromeos4-row5-rack7-host11/53466533-provision/provision_FirmwareUpdate/sysinfo/reboot_current/mnt/stateful_partition/unencrypted/preserve/log/messages', .... ]

The missing file does not exist in GS either:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row5-rack7-host11/53466533-provision/provision_FirmwareUpdate/

We may want to handle such errors gracefully without crashing the scheduler, since they are not fatal errors that lead to DB corruption.
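
For reference, a minimal sketch of how shutil.copytree reports this kind of failure. It is written for Python 3 (the scheduler in the traceback runs Python 2.7) and uses a dangling symlink with the default symlinks=False as a stand-in for a file that disappears before it can be copied. The function name `reproduce_copytree_error` and all paths are hypothetical and do not come from drone_utility.py.

```python
import os
import shutil
import tempfile


def reproduce_copytree_error():
    """Return (failures, copied_names) from a copytree that hits a missing file."""
    root = tempfile.mkdtemp()
    try:
        src = os.path.join(root, 'src')
        dst = os.path.join(root, 'dst')
        os.mkdir(src)
        with open(os.path.join(src, 'keep.log'), 'w') as f:
            f.write('ok\n')
        # A dangling symlink stands in for a file that is gone by the time
        # copytree reaches it (a hypothetical stand-in for the real race).
        os.symlink(os.path.join(root, 'gone'), os.path.join(src, 'messages'))
        try:
            # Default symlinks=False: the link is followed, so the copy fails.
            shutil.copytree(src, dst)
        except shutil.Error as e:
            # copytree continues past per-file failures and raises once at
            # the end; args[0] is a list of (src, dst, reason) tuples.
            return list(e.args[0]), sorted(os.listdir(dst))
        return [], sorted(os.listdir(dst))
    finally:
        shutil.rmtree(root)
```

The point of the sketch is that shutil.Error is raised only after the rest of the tree has been processed, so the files that do exist are still copied and the exception carries one tuple per file that could not be.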



 

Comment 1 by dshi@chromium.org, Apr 5 2016

Issue 600783 has been merged into this issue.
Labels: -Pri-3 Pri-1
Owner: dshi@chromium.org
This is still a consistent source of alert spam. Can we just turn off this alert if we're going to ignore it anyway?
Project Member

Comment 3 by bugdroid1@chromium.org, Apr 11 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/603628dac0ce6cece13e56288cf342d938a50596

commit 603628dac0ce6cece13e56288cf342d938a50596
Author: Dan Shi <dshi@google.com>
Date: Mon Apr 11 18:47:07 2016

[autotest] Ignore shutil.Error when copy files from special task results folder

Ignore copy directory errors caused by missing files. The cause of this behavior
is that gs_offloader zips up folders with too many files. There is a race
condition: a repair job tries to copy provision job results to the test job
result folder while gs_offloader is uploading the provision job result and
zipping up folders that contain too many files.

The race condition can't be resolved easily. gs_offloader uploads results in a
different thread. Once a special task or job is finished, uploading starts.
There is no reliable way for gs_offloader to wait for all repair/provision/test
jobs to finish before it starts uploading. So the simplest thing we can do now
is to ignore that error. I checked all the code paths; that method is only used
for this purpose.

BUG= chromium:600752 
TEST=None

Change-Id: I14cc1e25230838e2c3613c9298c8e9828c5020bf
Reviewed-on: https://chromium-review.googlesource.com/338112
Commit-Ready: Dan Shi <dshi@google.com>
Tested-by: Dan Shi <dshi@google.com>
Reviewed-by: Fang Deng <fdeng@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/603628dac0ce6cece13e56288cf342d938a50596/scheduler/drone_utility.py
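
The approach described in the commit message above can be sketched roughly as follows. This is not the code from the commit itself; it is a minimal Python 3 illustration that assumes the only change needed is to catch shutil.Error around the copytree call seen in the traceback. The function name `copy_directory_tolerantly`, its `symlinks` parameter, and the log wording are hypothetical.

```python
import logging
import shutil


def copy_directory_tolerantly(source_path, destination_path, symlinks=True):
    """Copy a directory tree, logging and ignoring per-file copy failures.

    Hypothetical helper for illustration; not part of drone_utility.py.
    Returns True if the whole tree was copied, False if some files failed.
    """
    try:
        shutil.copytree(source_path, destination_path, symlinks=symlinks)
    except shutil.Error as e:
        # shutil.Error aggregates the files that could not be copied, e.g.
        # files removed by another process while the copy was running.
        # Other exceptions (such as the destination already existing) are
        # deliberately not caught here and still propagate.
        logging.warning('Ignoring copy errors from %s to %s: %s',
                        source_path, destination_path, e)
        return False
    return True
```

Catching only shutil.Error keeps the scope narrow: files that vanished mid-copy are tolerated, while unrelated failures still surface to the caller.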

Comment 4 by dshi@chromium.org, Apr 17 2016

Status: Fixed (was: Available)

Comment 5 by benhenry@google.com, Apr 27 2016

Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS
Status: Verified (was: Fixed)
Closing. Please reopen if it's not fixed.
