
Issue 772453

Starred by 4 users

Issue metadata

Status: Archived
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 798794




Bots failing bot_update due to googlesource replication lag

Project Member Reported by pauljensen@chromium.org, Oct 6 2017

Issue description

https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29?numbuilds=200

Looks like this bot has failed bot_update with an exception four times in the last few days:
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29/builds/68393
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29/builds/68385
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29/builds/68361
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29/builds/68211

The bot_update exception always looks like this:

Traceback (most recent call last):
  File "/b/build/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/run.py", line 376, in _old_run
    recipe_result = recipe_script.run(api, properties)
  File "/b/build/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 98, in run
    self.run_steps, properties, self.PROPERTIES, api=api)
  File "/b/build/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 627, in invoke_with_properties
    **additional_args)
  File "/b/build/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/loader.py", line 588, in _invoke_with_properties
    return callable_obj(*props, **additional_args)
  File "/b/build/scripts/slave/recipes/cronet.py", line 189, in RunSteps
    chromium_apply_config=builder_config.get('chromium_apply_config'))
  File "/b/build/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/recipe_api.py", line 745, in _inner
    return func(*a, **kw)
  File "/b/build/scripts/slave/recipe_modules/cronet/api.py", line 54, in init_and_sync
    droid.init_and_sync(use_bot_update=True)
  File "/b/build/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/recipe_api.py", line 745, in _inner
    return func(*a, **kw)
  File "/b/build/scripts/slave/recipe_modules/chromium_android/api.py", line 133, in init_and_sync
    spec, refs=refs, with_branch_heads=with_branch_heads)
  File "/b/build/scripts/slave/.recipe_deps/recipe_engine/recipe_engine/recipe_api.py", line 745, in _inner
    return func(*a, **kw)
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/api.py", line 327, in ensure_checkout
    co_root = result['root']
KeyError: 'root'

CC'ing agable@ who made the last change to api.py and jbudorick@ because he always seems to know why Cronet's bots fail :)
 
That's not the real stacktrace. That stacktrace is saying "when reading the output of the bot_update step, the JSON object it wrote to the file didn't have a 'root' key". Which makes total sense, because bot_update itself failed with a stacktrace, and therefore never wrote that file. Here's the real stacktrace:
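To illustrate the masking effect described above, here is a minimal sketch (not the actual recipe code; `read_checkout_root` is a hypothetical name) of how reading the step's JSON output produces the misleading KeyError when bot_update crashed before writing its result:

```python
import json

def read_checkout_root(output_text):
    # The recipe expects bot_update to have written a JSON object with a
    # 'root' key on success. When bot_update itself died with a stacktrace,
    # the output is empty or incomplete, so this raises KeyError: 'root',
    # hiding the real failure underneath.
    result = json.loads(output_text or "{}")
    return result["root"]

try:
    read_checkout_root("{}")  # bot_update failed, wrote no useful output
except KeyError as e:
    print("KeyError:", e)  # mirrors the top-level traceback in this report
```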

===Running git checkout --force 2515e420da7a6e03d2f250a1cee71b13025d750a===
In directory: /b/build/slave/Android_Cronet_x86_Builder__dbg_/build/src
fatal: reference is not a tree: 2515e420da7a6e03d2f250a1cee71b13025d750a
===Failed in 0.0 mins===
Something failed: git checkout --force 2515e420da7a6e03d2f250a1cee71b13025d750a failed with code 128 in /b/build/slave/Android_Cronet_x86_Builder__dbg_/build/src..
Ran 99.1176700592 seconds past deadline. Aborting.
Traceback (most recent call last):
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 1169, in <module>
    sys.exit(main())
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 1153, in main
    checkout(options, git_slns, specs, revisions, step_text, shallow)
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 1059, in checkout
    gclient_output = ensure_checkout(**checkout_parameters)
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 759, in ensure_checkout
    cleanup_dir)
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 560, in git_checkout
    force_revision(sln_dir, revision)
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 456, in force_revision
    git('checkout', '--force', revision, cwd=folder_name)
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 211, in git
    return call(*cmd, **kwargs)
  File "/b/build/scripts/slave/.recipe_deps/depot_tools/recipes/recipe_modules/bot_update/resources/bot_update.py", line 192, in call
    code, outval)
__main__.SubprocessFailed: git checkout --force 2515e420da7a6e03d2f250a1cee71b13025d750a failed with code 128 in /b/build/slave/Android_Cronet_x86_Builder__dbg_/build/src.
step returned non-zero exit code: 1

So the real question is why that hash is corrupt ('not a tree').

Looks like a valid hash to me: https://chromium.googlesource.com/chromium/src/+/2515e420da7a6e03d2f250a1cee71b13025d750a

Maybe a corrupt checkout on disk?
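One way to tell a corrupt checkout apart from a commit that simply isn't present locally is to ask git whether the object exists before checking it out. This is a hedged sketch (`commit_present` is a hypothetical helper, not part of bot_update): `git cat-file -e <sha>^{commit}` exits 0 only if the commit object exists in the local repo; if it doesn't, `git checkout --force <sha>` fails with exactly the "fatal: reference is not a tree" seen above.

```python
import subprocess

def commit_present(repo_dir, sha):
    # Exit code 0 means the commit object exists locally; nonzero means it
    # was never fetched (or the object store is damaged).
    return subprocess.call(
        ["git", "-C", repo_dir, "cat-file", "-e", sha + "^{commit}"],
        stderr=subprocess.DEVNULL) == 0
```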
I should mention that this appears to be affecting many Cronet bots. For example, here's another:
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20ARM64%20Builder%20%28dbg%29?numbuilds=200

The different bots are running on different slaves.
Oh interesting, yep, it's happening in a bunch of places, but with a different hash every time.
Uh oh, I found it. It's googlesource latency. Example:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.android%2FAndroid_Cronet_x86_Builder__dbg_%2F68361%2F%2B%2Frecipes%2Fsteps%2Fbot_update%2F0%2Fstdout

Tries to check out 3df0d0574d1f53e831bd5d9f7ae9c825c0c8cb38

Runs 'git fetch' in the cache, and gets:
From https://chromium.googlesource.com/chromium/src
   0c0ed17d896b..71b25bc67d18  master       -> master

But 71b is the *parent* of 3df (https://chromium.googlesource.com/chromium/src/+/3df0d0574d1f53e831bd5d9f7ae9c825c0c8cb38). So even though 3df has landed in the repo and triggered a job to start, when that job fetches the repo from GoB it doesn't download that commit.

That commit was landed at :49
The job was started at :52
The fetch started at least a minute and a half after that (given step timings prior to bot_update).
So that's a GoB replication latency of ~5 minutes, which is unacceptably slow.
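Given that timeline, a client-side mitigation would be to re-fetch with backoff until the wanted revision shows up, rather than failing the build on the first miss. This is a hypothetical sketch (`fetch_until_present` is not what bot_update did at the time; the real fix was on the GoB side):

```python
import subprocess
import time

def fetch_until_present(repo_dir, revision, attempts=5, delay=30):
    # When the googlesource mirror lags behind the scheduler, the revision
    # that triggered the build may not be downloadable yet. Retry the fetch
    # with linear backoff instead of aborting immediately.
    for i in range(attempts):
        subprocess.check_call(
            ["git", "-C", repo_dir, "fetch", "-q", "origin"])
        if subprocess.call(
                ["git", "-C", repo_dir, "cat-file", "-e",
                 revision + "^{commit}"],
                stderr=subprocess.DEVNULL) == 0:
            return True  # revision finally replicated to this mirror
        time.sleep(delay * (i + 1))
    return False
```

The trade-off is that a genuinely bad revision now costs several minutes of retries before the build fails, so the attempt count should stay small.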

Filing an internal bug against them.
Labels: Infra-Troopers
Status: ExternalDependency (was: Untriaged)
Summary: Bots failing bot_update due to googlesource replication lag (was: Cronet bots fail bot_update occasionally)
b/67501786

Adding Trooper label so trooper is aware of this issue. If others report similar symptoms, please dupe into this bug.

Comment 6 by hinoka@chromium.org, Oct 19 2017

Cc: hinoka@chromium.org machenb...@chromium.org
Issue 776498 has been merged into this issue.
Owner: aga...@chromium.org
New trooper checking in. Looks like the consistency hiccup is still occurring very occasionally:
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29/builds/69451
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29/builds/69430
https://build.chromium.org/p/chromium.android/builders/Android%20Cronet%20x86%20Builder%20%28dbg%29/builds/69417

Not sure what else we can do here. Over to agable to decide whether we should just close this as WontFix.
On the day I noticed it, every time I looked at
https://build.chromium.org/p/chromium.gpu.fyi/console
there were 4 or 5 purple bots.
If this could happen on CQ bots, wouldn't you have a day of people unable to land changes?
Unless there's reason to believe this is extremely rare and won't get worse, it doesn't seem like something to ignore.

Comment 9 by aga...@chromium.org, Oct 20 2017

Please post updates on the internal bug; there's nothing we can do from here.
Status: Archived (was: ExternalDependency)
Keeping this bug open isn't helping; all progress here is tracked in the GoB team's code-yellow bug: b/35998605

Comment 11 by kbr@chromium.org, Jan 3 2018

Blocking: 798794
