New issue
Advanced search Search tips

Issue 796137 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 795899



Sign in to add a comment

native heap symbolizer test very flaky on catapult win try server

Project Member Reported by nedngu...@google.com, Dec 19 2017

Issue description

Owner: nedngu...@google.com
"""
[2/4] symbolize_trace_end_to_end_test_slow.SymbolizeTraceEndToEndTest.testMacv1Breakpad failed unexpectedly:
  Downloading C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\data\mac_trace_v1_presymbolization.json.gz from gcs.
  Traceback (most recent call last):
    File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\symbolize_trace_end_to_end_test_slow.py", line 127, in testMacv1Breakpad
      ['--use-breakpad-symbols'])
    File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\symbolize_trace_end_to_end_test_slow.py", line 64, in _RunSymbolizationOnTrace
      _DownloadFromCloudStorage(trace_presymbolization_path)
    File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\tracing\tracing\extras\symbolizer\symbolize_trace_end_to_end_test_slow.py", line 35, in _DownloadFromCloudStorage
      cloud_storage.GetIfChanged(path, cloud_storage.PARTNER_BUCKET)
    File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\common\py_utils\py_utils\cloud_storage.py", line 451, in GetIfChanged
      _GetLocked(bucket, expected_hash, file_path)
    File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\common\py_utils\py_utils\cloud_storage.py", line 349, in _GetLocked
      _RunCommand(['cp', url, partial_download_path.name])
    File "C:\b\build\slave\Catapult_Windows_Tryserver\build\catapult\common\py_utils\py_utils\cloud_storage.py", line 154, in _RunCommand
      raise GetErrorObjectForCloudStorageStderr(stderr)
  CloudStorageError: Failure: (33, 'LockFileEx', 'The process cannot access the file because another process has locked a portion of the file.').
"""

The failing line is:
"""
_DownloadFromCloudStorage(trace_presymbolization_path)
...

def _DownloadFromCloudStorage(path):
  print 'Downloading %s from gcs.' % (path)
  cloud_storage.GetIfChanged(path, cloud_storage.PARTNER_BUCKET)
"""
https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/extras/symbolizer/symbolize_trace_end_to_end_test_slow.py?q=symbolize_trace_end_to_end_test_slow.py&sq=package:chromium&l=64

Maybe this is a bug in cloud_storage.GetIfChanged where the lock for some reason is not deleted, and persists across different runs - so whenever the device with the orphaned lock runs the test, it fails? Back over to Ned for further investigation.
Cc: crouleau@chromium.org
I am bit sicked, so may be slow getting to this. Feel free to take this from me.
Cc: -crouleau@chromium.org nednguyen@chromium.org
Owner: crouleau@chromium.org
Details are in https://github.com/catapult-project/catapult/issues/4134

The stack trace from gsutil is 
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\__main__.py", line 572, in _RunNamedCommandAndHandleExceptions
      user_project=user_project)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\command_runner.py", line 319, in RunNamedCommand
      return_code = command_inst.RunCommand()
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\commands\ls.py", line 566, in RunCommand
      exp_dirs, exp_objs, exp_bytes = ls_helper.ExpandUrlAndPrint(storage_url)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\ls_helper.py", line 192, in ExpandUrlAndPrint
      top_level_iterator.PeekException()
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\plurality_checkable_iterator.py", line 65, in _PopulateHead
      e = self.base_iterator.next()
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\wildcard_iterator.py", line 476, in IterAll
      expand_top_level_buckets=expand_top_level_buckets):
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\wildcard_iterator.py", line 215, in __iter__
      provider=self.wildcard_url.scheme, fields=listing_fields):
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\gcs_json_api.py", line 609, in ListObjects
      global_params=global_params)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\third_party\storage_apitools\storage_v1_client.py", line 1171, in List
      config, request, global_params=global_params)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\base_api.py", line 720, in _RunMethod
      http, http_request, **opts)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\http_wrapper.py", line 351, in MakeRequest
      max_retry_wait, total_wait_sec))
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\gslib\util.py", line 1723, in WarnAfterManyRetriesHandler
      http_wrapper.HandleExceptionsAndRebuildHttpConnections(retry_args)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\http_wrapper.py", line 341, in MakeRequest
      check_response_func=check_response_func)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\apitools\apitools\base\py\http_wrapper.py", line 391, in _MakeRequestNoRetry
      redirections=redirections, connection_type=connection_type)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\client.py", line 616, in new_request
      self._refresh(request_orig)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\client.py", line 875, in _refresh
      self.store.acquire_lock()
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\multistore_file.py", line 240, in acquire_lock
      self._multistore._lock()
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\multistore_file.py", line 300, in _lock
      self._file.open_and_lock()
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\locked_file.py", line 230, in open_and_lock
      self._opener.open_and_lock(timeout, delay)
    File "C:\Users\crouleau\src\src\third_party\catapult\third_party\gsutil\third_party\oauth2client\oauth2client\contrib\_win32_opener.py", line 75, in open_and_lock
      pywintypes.OVERLAPPED())
  error: (33, 'LockFileEx', 'The process cannot access the file because another process has locked a portion of the file.')



We use the latest release of gsutil, version 4.28, but that uses an old version of oauth2client, 2.2.0 (4.1.2 is current). The change to update gsutil to 4.1.2 is out: https://github.com/GoogleCloudPlatform/gsutil/commit/114fc88c47657ab9b1669be7a82dfa847f2a856f
but it hasn't been released yet.

The reason what's above matters is that gsutil 4.28 uses oauth2client.contrib.multistore_file, which is flaky on Windows (see trace above) and has been replaced with oauth2client.contrib.multiprocess_file_storage, which uses methods that are consistent between platforms.

I think what is happening is that one of the two tests are calling gsutil methods at the same time, and the flaky lock acquisition in multistore_file is causing failures. (The reason for the lock is to rate-limit credentials requests.)


Two possible solutions:

1. Retry:
Two ways to add retries:
a. The _lock() call in oauth2client's is not doing retries because timeout is not being set. We could edit it to set it to some positive value. The problem with this method is that we would be editing third_party code, and I'm not sure what the process is for that. Ned?
b. Instead we could add a retry to cloud_storage's _RunCommand when we fail with error code 33: "LockFileEx". If we do this we can add a note to remove the retry once gsutil/oauth2client is updated. I wrote this code already, but I still need to test it, and now I can't get the flaky test to repro anymore. It seems the flake maybe doesn't repro with -v -v turned on?

2. Update the library:
I don't think this is a good solution until there's a new release out.


I'm just having problems because the bug isn't reproing for me anymore.
Cc: -nednguyen@chromium.org nedngu...@google.com

Comment 5 by kbr@chromium.org, Dec 22 2017

Blockedon: 795899
I figured out how to reliably repro it:

add 
            import time
            time.sleep(1)
to multistore_file.py's _lock() function right after the self._file.open_and_lock() call.

Then I can fix it again by adding timeout=2 to the self._file.open_and_lock() call. 
Status: Fixed (was: Assigned)
Here's the change that originally fixed this: https://chromium-review.googlesource.com/c/catapult/+/843489

Sign in to add a comment