Scheduler crash due to exception from copy_file_or_directory
Issue description
We are seeing an increasing number of scheduler crashes caused by exceptions raised from copy_file_or_directory, for example:
04/05 05:43:55.655 ERROR| email_manager:0054| monitor_db exception
EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 179, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/scheduler/site_monitor_db.py", line 106, in tick
    super(SiteDispatcher, self).tick()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 374, in tick
    _drone_manager.execute_actions()
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/scheduler/site_drone_manager.py", line 89, in execute_actions
    super(SiteDroneManager, self).execute_actions()
  File "/usr/local/autotest/scheduler/drone_manager.py", line 563, in execute_actions
    self._drones.values())
  File "/usr/local/autotest/scheduler/thread_lib.py", line 202, in execute
    return self.get_results() if wait else None
  File "/usr/local/autotest/scheduler/thread_lib.py", line 148, in get_results
    self.wait_on_drones()
  File "/usr/local/autotest/scheduler/thread_lib.py", line 136, in wait_on_drones
    raise drone_task_queue.DroneTaskQueueException(exception_msg)
DroneTaskQueueException: Drone chromeos-server2.mtv.corp.google.com raised Exception command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_ujfMzbssh-master/socket
-o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpCRVcuc -o
BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l
chromeos-test -p 22 chromeos-server2.mtv.corp.google.com " python
/usr/local/autotest/scheduler/drone_utility.py --call_time 1459860230.4"
Exit status: 1
Duration: 1.23830795288
stderr:
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/drone_utility.py", line 618, in <module>
    main()
  File "/usr/local/autotest/scheduler/drone_utility.py", line 612, in main
    return_value = drone_utility.execute_calls(calls)
  File "/usr/local/autotest/scheduler/drone_utility.py", line 540, in execute_calls
    results.append(method_call.execute_on(self))
  File "/usr/local/autotest/scheduler/drone_utility.py", line 54, in execute_on
    return method(*self._args, **self._kwargs)
  File "/usr/local/autotest/scheduler/drone_utility.py", line 409, in copy_file_or_directory
    os.path.join(destination_path, filename))
  File "/usr/local/autotest/scheduler/drone_utility.py", line 411, in copy_file_or_directory
    shutil.copytree(source_path, destination_path, symlinks=True)
  File "/usr/lib/python2.7/shutil.py", line 208, in copytree
    raise Error, errors
shutil.Error: [(u'/usr/local/autotest/results/hosts/chromeos4-row5-rack7-host11/53466533-provision/provision_FirmwareUpdate/sysinfo/reboot_current/mnt/stateful_partition/unencrypted/preserve/log/messages', .... ]
The missing file does not exist in GS either:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row5-rack7-host11/53466533-provision/provision_FirmwareUpdate/
We may want to handle such errors gracefully without crashing the scheduler, since they are not fatal errors that lead to db corruption.
,
Apr 11 2016
This is still a consistent source of alert spam. Can we just turn off this alert if we're going to ignore it anyway?
,
Apr 11 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/603628dac0ce6cece13e56288cf342d938a50596

commit 603628dac0ce6cece13e56288cf342d938a50596
Author: Dan Shi <dshi@google.com>
Date: Mon Apr 11 18:47:07 2016

[autotest] Ignore shutil.Error when copy files from special task results folder

Ignore copy directory error due to missing files. The cause of this behavior is that, gs_offloader zips up folders with too many files. There is a race condition that repair job tries to copy provision job results to the test job result folder, meanwhile gs_offloader is uploading the provision job result and zipping up folders which contains too many files.

The race condition can't be resolved easily. gs_offloader uploads results in a different thread. Once a special task or job is finished, uploading is started. There is no reliable way for gs_offloader to wait for all repair/provision/test job to finish then starts uploading. So the simplest thing we can do now is to ignore that error. Checked all the code path, that method is only used for such purpose.

BUG= chromium:600752
TEST=None

Change-Id: I14cc1e25230838e2c3613c9298c8e9828c5020bf
Reviewed-on: https://chromium-review.googlesource.com/338112
Commit-Ready: Dan Shi <dshi@google.com>
Tested-by: Dan Shi <dshi@google.com>
Reviewed-by: Fang Deng <fdeng@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/603628dac0ce6cece13e56288cf342d938a50596/scheduler/drone_utility.py
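For readers who want to see what "ignore that error" can look like, here is a minimal sketch. It is not the actual change to drone_utility.py. It is written for Python 3 (the traceback above is from Python 2.7), and the names copy_tree_tolerating_missing, _vanishing_copy and demo are hypothetical. shutil.copytree collects per-file failures and raises a single shutil.Error whose first argument is a list of (source, destination, reason) tuples; the sketch catches that error, logs it, and returns the list instead of letting it propagate. The copy_function parameter is there only so the demo can simulate a file disappearing mid-copy.

```python
import logging
import os
import shutil
import tempfile


def copy_tree_tolerating_missing(source_path, destination_path,
                                 copy_function=shutil.copy2):
    """Copy a directory tree; log and return per-file failures, do not raise.

    Hypothetical helper, not the autotest implementation.
    """
    try:
        shutil.copytree(source_path, destination_path, symlinks=True,
                        copy_function=copy_function)
    except shutil.Error as e:
        # copytree raises Error(errors), where errors is a list of
        # (source, destination, reason) tuples, one per failed file.
        failures = list(e.args[0])
        logging.warning('Ignoring %d copy failure(s) from %s to %s: %s',
                        len(failures), source_path, destination_path,
                        failures)
        return failures
    return []


def _vanishing_copy(src, dst):
    """Stand-in for the race: pretend the 'messages' file was removed."""
    if os.path.basename(src) == 'messages':
        raise FileNotFoundError(2, 'No such file or directory', str(src))
    return shutil.copy2(src, dst)


def demo():
    """Build a small tree, copy it with one simulated failure, clean up."""
    root = tempfile.mkdtemp()
    src = os.path.join(root, 'provision')
    os.makedirs(os.path.join(src, 'log'))
    for name in ('messages', 'status.log'):
        with open(os.path.join(src, 'log', name), 'w') as f:
            f.write('data')
    dst = os.path.join(root, 'copy')
    failures = copy_tree_tolerating_missing(src, dst, _vanishing_copy)
    copied = sorted(os.listdir(os.path.join(dst, 'log')))
    shutil.rmtree(root)
    return failures, copied
```

Note that copytree keeps going after a per-file failure, so the files that still exist are copied; only the reporting of the failure changes. Whether swallowing the error is acceptable depends on the caller; the commit message above argues it is here because the method is only used for copying special task results.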
,
Apr 17 2016
,
Apr 27 2016
,
Aug 12 2016
Closing. Please reopen if it's not fixed.
Comment 1 by dshi@chromium.org
, Apr 5 2016