New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 842018 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 7
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Builders fail in InitialCheckout with git lock error (concurrent git access?)

Project Member Reported by moragues@chromium.org, May 11 2018

Issue description

Please fill out the relevant section and delete the others.

=== Create or edit a Git repo ===
googlesource host (e.g. chromium, chrome-internal): chrome-internal
name of the repo:
is it a mirror, and of what:
permissions:


=== Link/Unlink Gerrit accounts ===
Link or unlink?
What are your @google and @chromium addresses?

moragues@google.com
moragues@chromium.org

=== Obliterate a change ===
Link to the CL to be erased:


The following trybot failed with no real clear failure reasons given in the log.
Looks like it might be a failure to fetch from gerrit due to locking issues.

cros tryjob --remote -g *613835 -g 1024862 meowth-release-tryjob

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8946875845703520848
 
Cc: la...@chromium.org
Components: -Infra>Git>Admin Infra>Client>ChromeOS>CI
Owner: jclinton@chromium.org
This does not appear to be a Git>Admin bug, and I'm not sure what in particular the bug is asking for. Please include a snippet or more details on bugs like this to make it easier to know what is being asked.

In this case the relevant stacktrace is apparently

Traceback (most recent call last):
  File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch", line 169, in <module>
    DoMain()
  File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/swarming/w/ir/kitchen-workdir/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 481, in main
    return _main(argv)
  File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 454, in _main
    InitialCheckout(repo)
  File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 69, in wrapped_functor
    result = functor(*args, **kwargs)
  File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 310, in InitialCheckout
    repo.Sync(detach=True)
  File "/b/swarming/w/ir/kitchen-workdir/chromite/cbuildbot/repository.py", line 684, in Sync
    raise SrcCheckOutException(err_msg)
chromite.cbuildbot.repository.SrcCheckOutException: return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n
cmd=['repo', '--time', 'sync', '--force-sync', '--cache-dir=/b/swarming/w/ir/cache/git', '-n'], cwd=/b/swarming/w/ir/cache/cbuild/repository

assigning to jclinton TL of CI, and cc'ing lannm who I believe is oncall.


Labels: -Restrict-View-Google OS-Chrome Pri-1 Type-Bug
Owner: ----
Status: Available (was: Untriaged)
Summary: tryjob failed with git lock error (concurrent git access to autotest repo?) (was: tryjob failed with git lock error)
Root cause:
error: cannot lock ref 'refs/remotes/cros/master': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/third_party/autotest/files.git/refs/remotes/cros/master.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
From https://chromium.googlesource.com/a/chromiumos/third_party/autotest
 ! b1241d1f4..6e0a32ca2  master     -> cros/master  (unable to update local ref)
error: Cannot fetch chromiumos/third_party/autotest
error: Exited sync due to fetch errors


Seems like a real problem that we should investigate.

Comment 3 by derat@chromium.org, May 17 2018

Cc: derat@chromium.org
I think this might've just made the amd64-generic-tot-asan-informational master run at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946258699989069648 fail in InitialCheckout too:

error: cannot lock ref 'refs/remotes/cros-internal/release-R67-10575.B': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
From https://chrome-internal.googlesource.com/chromeos/overlays/chromeos-partner-overlay
 ! a7cb3e3366..a60ced0060  release-R67-10575.B -> cros-internal/release-R67-10575.B  (unable to update local ref)
error: Cannot fetch chromeos/overlays/chromeos-partner-overlay
error: Exited sync due to fetch errors
real	1m16.922s
11:08:20: WARNING: Command failed with retriable error.
return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n
cmd=['repo', '--time', 'sync', '--force-sync', '--cache-dir=/b/swarming/w/ir/cache/git', '-n'], cwd=/b/swarming/w/ir/cache/cbuild/repository
11:08:20: ERROR: return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n

Comment 4 by derat@chromium.org, May 17 2018

Cc: pprabhu@chromium.org
Owner: ayatane@chromium.org
Status: Assigned (was: Available)
Summary: tryjob failed with git lock error (concurrent git access?) (was: tryjob failed with git lock error (concurrent git access to autotest repo?))
peach_pit-tot-chrome-pfq-informational as well at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946247633846858096:

error: cannot lock ref 'refs/remotes/cros-internal/release-R67-10575.B': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
From https://chrome-internal.googlesource.com/chromeos/overlays/chromeos-partner-overlay
 ! a7cb3e3366..a60ced0060  release-R67-10575.B -> cros-internal/release-R67-10575.B  (unable to update local ref)
error: Cannot fetch chromeos/overlays/chromeos-partner-overlay
error: Exited sync due to fetch errors
real	1m26.181s
14:03:47: WARNING: Command failed with retriable error.
return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n
cmd=['repo', '--time', 'sync', '--force-sync', '--cache-dir=/b/swarming/w/ir/cache/git', '-n'], cwd=/b/swarming/w/ir/cache/cbuild/repository
14:03:47: ERROR: return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n
----

Deputies, please investigate this.
Cc: jclinton@chromium.org
Owner: gmeinke@chromium.org
Transfering to the CI Bobby. I will also try to look at this tomorrow.
Of the 3 failures that I know about, each of these failed in a git lock in a git repo:
* meowth-release-tryjob failed to lock autotest
* caroline-tot-chrome-pfq-informational failed to lock autotest
* peach_pit-tot-chrome-pfq-informational failed to lock chromeos-partner-overlay

This looks like we're running two git instances at the same time trying to access the same repo. This seems unusual because, AFAICT, we only run repo once. And repo's entire reason for existing is doing exactly this correctly.

Looking into the InitialCheckout stage further.

Comment 7 by derat@chromium.org, May 18 2018

Here's another one, on daisy-tot-chromium-pfq-informational and back on the autotest repo this time: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946157154103229248

error: cannot lock ref 'refs/remotes/cros/master': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/third_party/autotest/files.git/refs/remotes/cros/master.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
From https://chromium.googlesource.com/a/chromiumos/third_party/autotest
 ! b1241d1f4..de693fcc7  master     -> cros/master  (unable to update local ref)
error: Cannot fetch chromiumos/third_party/autotest
error: Exited sync due to fetch errors
real	1m25.056s

It definitely seems a bit suspicious how the autotest and chromeos-partner-overlay repositories keep popping up here.

Comment 8 by derat@chromium.org, May 18 2018

Summary: Bulders fail in InitialCheckout with git lock error (concurrent git access?) (was: tryjob failed with git lock error (concurrent git access?))
And another in chromeos-partner-overlay on amd64-generic-tot-chromium-pfq-informational: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946155893763503280

Comment 9 by derat@chromium.org, May 18 2018

Summary: Builders fail in InitialCheckout with git lock error (concurrent git access?) (was: Bulders fail in InitialCheckout with git lock error (concurrent git access?))
Update: we do have retry logic but the first failure is on the first attempt to run repo sync so the retry logic isn't to blame. Continuing to research. Maybe a stale lock from a previous run?

I logged in to a sampling of Swarming hosts and checked the git caches for stale locks. I couldn't find any.

I logged in to a host which had just been scheduled and managed to capture its process tree while it was running InitialCheckout:

/usr/bin/python -u /opt/chromebuild/chromebuild-startup.py
 \_ /usr/bin/python /b/swarming/swarming_bot.1.zip start_bot
     \_ /usr/bin/python /b/swarming/swarming_bot.1.zip task_runner
         \_ /usr/bin/python /b/swarming/swarming_bot.1.zip run_isolated
             \_ /b/swarming/w/ir/kitchen cook -mode swarming
                 \_ python kitchen-checkout/scripts/slave/recipes.py
                     \_ /b/swarming/w/ir/cipd_bin_packages/bin/python -u /b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipes.py
                         \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch
                             \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch
                             \_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch
                             \_ /b/swarming/w/ir/kitchen-workdir/python_bin/python /b/swarming/w/ir/cache/cbuild/repository/.repo/repo/main.py
                                 \_ git checkout -q 87c50983542c93535092e1c7865849b4467989c6 --

That's a complete list of all the processes that are running. I don't see any concurrent git processes.

Looking for bleed-over from the previous run by examining the proximity of a failing run to a previous build: maybe there is a pattern.

Cc: chromeos-continuous-integration-team@google.com
chromeos-partner-overlay on amd64-generic-tot-chromium-pfq-informational http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946155893763503280 ran on swarm-cros-54 per https://chrome-swarming.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1526754060000&f=buildbucket_build_id-tag%3A8946155893763503280&l=50&n=true&s=created_ts%3Adesc&st=1525371660000 .

swarm-cros-54's task history: https://chrome-swarming.appspot.com/bot?id=swarm-cros-54&sort_stats=total%3Adesc are all failures since 5/16/2018, 11:05:28 PM (MDT).

Logging in we see:
ls -l /b/swarming/c/OR/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock
-rw-r--r-- 1 chrome-bot chrome-bot 41 May 16 22:07 /b/swarming/c/OR/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock

So, this bot has a stuck lock.

It's easy to find the cause. It started with this job being manually killed: https://chrome-swarming.appspot.com/task?id=3d854e9ed5f3f610&refresh=10

Pulling up all of the remaining reports, they all have either swarm-cros-54 or swarm-cros-133 in common. swarm-cros-133's task history is here: https://chrome-swarming.appspot.com/bot?id=swarm-cros-133&sort_stats=total%3Adesc

swarm-cros-133 has been failing every job scheduled on it since 4/24/2018, 11:08:57 PM (MDT) when someone manually killed a job in the InitialCheckout stage. This machine has been responsible for 288 individual failures over the past 3.5 weeks. It also has a stale lock:
/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/third_party/autotest/files.git/refs/remotes/cros/master.lock.

I have manually deleted the stale locks off of both swarm-cros-54 or swarm-cros-133.

We should add something to the cbuildbot_launch script to cleanup those lock files, and file a git bug. It's probably tied to a recent client change since our behavior hasn't changed much but this problem is new.
I'm working on a change to do the cleanup in cbuildbot_launch now.
Owner: jclinton@chromium.org
Status: Started (was: Assigned)
Cc: nxia@chromium.org
 Issue 779940  has been merged into this issue.
Project Member

Comment 18 by bugdroid1@chromium.org, May 31 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/515a27ae50cb92d61e92083d09fd0d6bb693c4f6

commit 515a27ae50cb92d61e92083d09fd0d6bb693c4f6
Author: Jason D. Clinton <jclinton@chromium.org>
Date: Thu May 31 23:22:40 2018

cbuildbot_launch: Clean up stale git locks

Some of our builds are getting killed mid-git operation and leaving
behind git master.lock files that break subsequent builds on the same
machine. This happens because the git repos are cached between builds.

BUG= chromium:842018 
TEST=cbuildbot/run_tests --network

Change-Id: I9662f214c34a4e26bb58f41e891f8aca66c34f2d
Reviewed-on: https://chromium-review.googlesource.com/1075727
Commit-Ready: Jason Clinton <jclinton@chromium.org>
Tested-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/cbuildbot/repository.py
[modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/lib/git_unittest.py
[modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/lib/git.py
[modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/scripts/cbuildbot_launch.py
[modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/cbuildbot/repository_unittest.py

Status: Fixed (was: Started)
Project Member

Comment 20 by bugdroid1@chromium.org, Jun 1 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/423b431b31d73b754f199c09fdb0825a6827d4cd

commit 423b431b31d73b754f199c09fdb0825a6827d4cd
Author: chromite-chromium-autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Fri Jun 01 04:58:16 2018

Roll src/third_party/chromite ffe9e60..a09f2ac (5 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/ffe9e60..a09f2ac


git log ffe9e60..a09f2ac --date=short --no-merges --format='%ad %ae %s'
2018-05-29 dgarrett@google.com Legoland: Update Legoland build URLs.
2018-05-29 rrangel@chromium.org cros_gdb: address post-submit comments from http://crrev.com/c/1052367
2018-05-24 xixuan@chromium.org cbuildbot: Add provision & bvt-inline suite for nyan_blaze-paladin.
2018-05-23 xixuan@chromium.org cbuildbot: Add fake SkylabHWTestStage to nyan_blaze-paladin.
2018-05-28 jclinton@chromium.org cbuildbot_launch: Clean up stale git locks

Created with:
  gclient setdep -r src/third_party/chromite@a09f2ac

The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.



BUG= chromium:834333 ,chromium:none,chromium:845314,chromium:845314,chromium:842018

TBR=chrome-os-gardeners@chromium.org

Change-Id: I6e6012b8c673d0f4d7debda9ec001c42710fce44
Reviewed-on: https://chromium-review.googlesource.com/1081993
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#563539}
[modify] https://crrev.com/423b431b31d73b754f199c09fdb0825a6827d4cd/DEPS

Issue 853949 has been merged into this issue.
Status: Started (was: Fixed)
I didn't actually fix this because my CL didn't handle bare repos correctly. It just skipped over them. All repos are bare when they are referenced from the repo manifest before they are checked out.
Project Member

Comment 23 by bugdroid1@chromium.org, Jun 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf

commit 1deead13f98b72c3c5eb97f18c9f2a09471d4fcf
Author: Jason D. Clinton <jclinton@chromium.org>
Date: Sat Jun 23 00:22:06 2018

lib/git.py: Support bare repos in DeleteStaleLocks

I'm an idiot thus managed to simultaneously write the logic to detect
that a repo is bare and then not actually use it to scan for locks to
delete. :( Added a test for this.

BUG= chromium:842018 
TEST=cbuildbot/run_tests --network succeeds

Change-Id: I48748947065840047e63961418d84888a5a0b31f
Reviewed-on: https://chromium-review.googlesource.com/1111647
Commit-Ready: Jason Clinton <jclinton@chromium.org>
Tested-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/cbuildbot/repository.py
[modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/cros_bisect/simple_chrome_builder.py
[modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/lib/git.py
[modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/lib/git_unittest.py
[modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/cbuildbot/repository_unittest.py

Project Member

Comment 24 by bugdroid1@chromium.org, Jun 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b27b63d85e6eb74d3b5b309ee7b000341991b322

commit b27b63d85e6eb74d3b5b309ee7b000341991b322
Author: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Sat Jun 23 20:59:19 2018

Roll src/third_party/chromite 6ec076ceabc3..1deead13f98b (1 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/6ec076ceabc3..1deead13f98b


git log 6ec076ceabc3..1deead13f98b --date=short --no-merges --format='%ad %ae %s'
2018-06-23 jclinton@chromium.org lib/git.py: Support bare repos in DeleteStaleLocks


Created with:
  gclient setdep -r src/third_party/chromite@1deead13f98b

The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.



BUG= chromium:842018 
TBR=chrome-os-gardeners@chromium.org

Change-Id: Ic551e41c07f15ad9a8973763dc131a400a2f8642
Reviewed-on: https://chromium-review.googlesource.com/1112232
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#569886}
[modify] https://crrev.com/b27b63d85e6eb74d3b5b309ee7b000341991b322/DEPS

Status: Fixed (was: Started)
Seems to be resolved for good.

Sign in to add a comment