native heap symbolizer test very flaky on catapult win try server
Issue description
Dec 22 2017
I am a bit sick, so I may be slow getting to this. Feel free to take this from me.
Dec 22 2017
Details are in https://github.com/catapult-project/catapult/issues/4134

The stack trace from gsutil is:

  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\__main__.py", line 572, in _RunNamedCommandAndHandleExceptions
    user_project=user_project)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\command_runner.py", line 319, in RunNamedCommand
    return_code = command_inst.RunCommand()
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\commands\ls.py", line 566, in RunCommand
    exp_dirs, exp_objs, exp_bytes = ls_helper.ExpandUrlAndPrint(storage_url)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\ls_helper.py", line 192, in ExpandUrlAndPrint
    top_level_iterator.PeekException()
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\plurality_checkable_iterator.py", line 65, in _PopulateHead
    e = self.base_iterator.next()
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\wildcard_iterator.py", line 476, in IterAll
    expand_top_level_buckets=expand_top_level_buckets):
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\wildcard_iterator.py", line 215, in __iter__
    provider=self.wildcard_url.scheme, fields=listing_fields):
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\gcs_json_api.py", line 609, in ListObjects
    global_params=global_params)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\third_party\storage_apitools\storage_v1_client.py", line 1171, in List
    config, request, global_params=global_params)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\base_api.py", line 720, in _RunMethod
    http, http_request, **opts)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\http_wrapper.py", line 351, in MakeRequest
    max_retry_wait, total_wait_sec))
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\util.py", line 1723, in WarnAfterManyRetriesHandler
    http_wrapper.HandleExceptionsAndRebuildHttpConnections(retry_args)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\http_wrapper.py", line 341, in MakeRequest
    check_response_func=check_response_func)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\http_wrapper.py", line 391, in _MakeRequestNoRetry
    redirections=redirections, connection_type=connection_type)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\client.py", line 616, in new_request
    self._refresh(request_orig)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\client.py", line 875, in _refresh
    self.store.acquire_lock()
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\multistore_file.py", line 240, in acquire_lock
    self._multistore._lock()
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\multistore_file.py", line 300, in _lock
    self._file.open_and_lock()
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\locked_file.py", line 230, in open_and_lock
    self._opener.open_and_lock(timeout, delay)
  File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\_win32_opener.py", line 75, in open_and_lock
    pywintypes.OVERLAPPED())
error: (33, 'LockFileEx', 'The process cannot access the file because another process has locked a portion of the file.')

We use the latest release of gsutil, version 4.28, but that uses an old version of oauth2client, 2.2.0 (4.1.2 is current). The change updating gsutil's oauth2client to 4.1.2 has landed (https://github.com/GoogleCloudPlatform/gsutil/commit/114fc88c47657ab9b1669be7a82dfa847f2a856f), but it hasn't been released yet.

The reason the above matters is that gsutil 4.28 uses oauth2client.contrib.multistore_file, which is flaky on Windows (see trace above) and has since been replaced by oauth2client.contrib.multiprocess_file_storage, which uses methods that behave consistently across platforms. I think what is happening is that the two tests are calling gsutil methods at the same time, and the flaky lock acquisition in multistore_file is causing failures. (The reason for the lock is to rate-limit credentials requests.)

Two possible solutions:

1. Retry. There are two ways to add retries:
   a. The _lock() call in oauth2client is not retrying because timeout is never set; we could edit it to set some positive value. The problem with this approach is that we would be editing third_party code, and I'm not sure what the process is for that. Ned?
   b. Instead, we could add a retry to cloud_storage's _RunCommand when it fails with error code 33 ('LockFileEx'). If we do this, we can add a note to remove the retry once gsutil/oauth2client is updated. I wrote this code already, but I still need to test it, and now I can't get the flaky test to repro anymore. It seems the flake maybe doesn't repro with -v -v turned on?

2. Update the library. I don't think this is a good solution until there's a new release out. I'm just having trouble because the bug isn't reproing for me anymore.
I figured out how to reliably repro it:
add
import time
time.sleep(1)
to multistore_file.py's _lock() function right after the self._file.open_and_lock() call.
Then I can fix it again by adding timeout=2 to the self._file.open_and_lock() call.
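The timeout=2 fix works because, as I understand it, oauth2client's open_and_lock keeps retrying acquisition until the timeout elapses, and with the default timeout it gives up after a single attempt. A minimal sketch of that retry pattern (not the actual oauth2client code; try_acquire stands in for the platform locking call):

```python
import time

class LockContention(Exception):
    pass

def open_and_lock(try_acquire, timeout=0, delay=0.05):
    """Retry try_acquire() until it succeeds or `timeout` seconds elapse.

    With timeout=0 there is exactly one attempt, which is why the
    unpatched multistore_file call fails as soon as another process
    holds the lock. Sketch only, for illustrating the repro above.
    """
    deadline = time.time() + timeout
    while True:
        if try_acquire():
            return True
        if time.time() >= deadline:
            raise LockContention('could not acquire lock in %ss' % timeout)
        time.sleep(delay)  # wait before the next attempt
```

With the injected time.sleep(1) holding the lock for a second, a timeout of 2 seconds gives the second process enough attempts to succeed.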
Jan 8 2018
Here's the change that originally fixed this: https://chromium-review.googlesource.com/c/catapult/+/843489
Comment 1 by erikc...@chromium.org, Dec 19 2017

"""
[2/4] symbolize_trace_end_to_end_test_slow.SymbolizeTraceEndToEndTest.testMacv1Breakpad failed unexpectedly:
Downloading C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\data\mac_trace_v1_presymbolization.json.gz from gcs.
Traceback (most recent call last):
  File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\symbolize_trace_end_to_end_test_slow.py", line 127, in testMacv1Breakpad
    ['--use-breakpad-symbols'])
  File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\symbolize_trace_end_to_end_test_slow.py", line 64, in _RunSymbolizationOnTrace
    _DownloadFromCloudStorage(trace_presymbolization_path)
  File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\symbolize_trace_end_to_end_test_slow.py", line 35, in _DownloadFromCloudStorage
    cloud_storage.GetIfChanged(path, cloud_storage.PARTNER_BUCKET)
  File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\common\py_utils\py_utils\cloud_storage.py", line 451, in GetIfChanged
    _GetLocked(bucket, expected_hash, file_path)
  File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\common\py_utils\py_utils\cloud_storage.py", line 349, in _GetLocked
    _RunCommand(['cp', url, partial_download_path.name])
  File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\common\py_utils\py_utils\cloud_storage.py", line 154, in _RunCommand
    raise GetErrorObjectForCloudStorageStderr(stderr)
CloudStorageError: Failure: (33, 'LockFileEx', 'The process cannot access the file because another process has locked a portion of the file.').
"""

The failing line is:

"""
_DownloadFromCloudStorage(trace_presymbolization_path)
...
def _DownloadFromCloudStorage(path):
  print 'Downloading %s from gcs.' % (path)
  cloud_storage.GetIfChanged(path, cloud_storage.PARTNER_BUCKET)
"""

https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/extras/symbolizer/symbolize_trace_end_to_end_test_slow.py?q=symbolize_trace_end_to_end_test_slow.py&sq=package:chromium&l=64

Maybe this is a bug in cloud_storage.GetIfChanged where the lock for some reason is not deleted and persists across different runs, so whenever the device with the orphaned lock runs the test, it fails? Back over to Ned for further investigation.