Builders fail in InitialCheckout with git lock error (concurrent git access?) |
||||||||||||
Issue descriptionPlease fill out the relevant section and delete the others. === Create or edit a Git repo === googlesource host (e.g. chromium, chrome-internal): chrome-internal name of the repo: is it a mirror, and of what: permissions: === Link/Unlink Gerrit accounts === Link or unlink? What are your @google and @chromium addresses? moragues@google.com moragues@chromium.org === Obliterate a change === Link to the CL to be erased: The following trybot failed with no real clear failure reasons given in the log. Looks like it might be a failure to fetch from gerrit due to locking issues. cros tryjob --remote -g *613835 -g 1024862 meowth-release-tryjob https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8946875845703520848
,
May 14 2018
Root cause: error: cannot lock ref 'refs/remotes/cros/master': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/third_party/autotest/files.git/refs/remotes/cros/master.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue. From https://chromium.googlesource.com/a/chromiumos/third_party/autotest ! b1241d1f4..6e0a32ca2 master -> cros/master (unable to update local ref) error: Cannot fetch chromiumos/third_party/autotest error: Exited sync due to fetch errors Seems like a real problem that we should investigate.
,
May 17 2018
I think this might've just made the amd64-generic-tot-asan-informational master run at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946258699989069648 fail in InitialCheckout too: error: cannot lock ref 'refs/remotes/cros-internal/release-R67-10575.B': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue. From https://chrome-internal.googlesource.com/chromeos/overlays/chromeos-partner-overlay ! a7cb3e3366..a60ced0060 release-R67-10575.B -> cros-internal/release-R67-10575.B (unable to update local ref) error: Cannot fetch chromeos/overlays/chromeos-partner-overlay error: Exited sync due to fetch errors real 1m16.922s [1;33m11:08:20: WARNING: Command failed with retriable error. return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n cmd=['repo', '--time', 'sync', '--force-sync', '--cache-dir=/b/swarming/w/ir/cache/git', '-n'], cwd=/b/swarming/w/ir/cache/cbuild/repository[0m [1;31m11:08:20: ERROR: return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n
,
May 17 2018
peach_pit-tot-chrome-pfq-informational as well at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946247633846858096: error: cannot lock ref 'refs/remotes/cros-internal/release-R67-10575.B': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue. From https://chrome-internal.googlesource.com/chromeos/overlays/chromeos-partner-overlay ! a7cb3e3366..a60ced0060 release-R67-10575.B -> cros-internal/release-R67-10575.B (unable to update local ref) error: Cannot fetch chromeos/overlays/chromeos-partner-overlay error: Exited sync due to fetch errors real 1m26.181s [1;33m14:03:47: WARNING: Command failed with retriable error. return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n cmd=['repo', '--time', 'sync', '--force-sync', '--cache-dir=/b/swarming/w/ir/cache/git', '-n'], cwd=/b/swarming/w/ir/cache/cbuild/repository[0m [1;31m14:03:47: ERROR: return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n ---- Deputies, please investigate this.
,
May 18 2018
Transfering to the CI Bobby. I will also try to look at this tomorrow.
,
May 18 2018
Of the 3 failures that I know about, each of these failed in a git lock in a git repo: * meowth-release-tryjob failed to lock autotest * caroline-tot-chrome-pfq-informational failed to lock autotest * peach_pit-tot-chrome-pfq-informational failed to lock chromeos-partner-overlay This looks like we're running two git instances at the same time trying to access the same repo. This seems unusual because, AFAICT, we only run repo once. And repo's entire reason for existing is doing exactly this correctly. Looking into the InitialCheckout stage further.
,
May 18 2018
Here's another one, on daisy-tot-chromium-pfq-informational and back on the autotest repo this time: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946157154103229248 error: cannot lock ref 'refs/remotes/cros/master': Unable to create '/b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/third_party/autotest/files.git/refs/remotes/cros/master.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue. From https://chromium.googlesource.com/a/chromiumos/third_party/autotest ! b1241d1f4..de693fcc7 master -> cros/master (unable to update local ref) error: Cannot fetch chromiumos/third_party/autotest error: Exited sync due to fetch errors real 1m25.056s It definitely seems a bit suspicious how the autotest and chromeos-partner-overlay repositories keep popping up here.
,
May 18 2018
And another in chromeos-partner-overlay on amd64-generic-tot-chromium-pfq-informational: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946155893763503280
,
May 18 2018
,
May 18 2018
Update: we do have retry logic but the first failure is on the first attempt to run repo sync so the retry logic isn't to blame. Continuing to research. Maybe a stale lock from a previous run?
,
May 19 2018
I logged in to a sampling of Swarming hosts and checked the git caches for stale locks. I couldn't find any.
I logged in to a host which had just been scheduled and managed to capture its process tree while it was running InitialCheckout:
/usr/bin/python -u /opt/chromebuild/chromebuild-startup.py
\_ /usr/bin/python /b/swarming/swarming_bot.1.zip start_bot
\_ /usr/bin/python /b/swarming/swarming_bot.1.zip task_runner
\_ /usr/bin/python /b/swarming/swarming_bot.1.zip run_isolated
\_ /b/swarming/w/ir/kitchen cook -mode swarming
\_ python kitchen-checkout/scripts/slave/recipes.py
\_ /b/swarming/w/ir/cipd_bin_packages/bin/python -u /b/swarming/w/ir/kitchen-checkout/scripts/slave/.recipe_deps/recipe_engine/recipes.py
\_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch
\_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch
\_ python2 /b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch
\_ /b/swarming/w/ir/kitchen-workdir/python_bin/python /b/swarming/w/ir/cache/cbuild/repository/.repo/repo/main.py
\_ git checkout -q 87c50983542c93535092e1c7865849b4467989c6 --
That's a complete list of all the processes that are running. I don't see any concurrent git processes.
Looking for bleed-over from the previous run by examining the proximity of a failing run to a previous build: maybe there is a pattern.
,
May 19 2018
chromeos-partner-overlay on amd64-generic-tot-chromium-pfq-informational http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946155893763503280 ran on swarm-cros-54 per https://chrome-swarming.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1526754060000&f=buildbucket_build_id-tag%3A8946155893763503280&l=50&n=true&s=created_ts%3Adesc&st=1525371660000 . swarm-cros-54's task history: https://chrome-swarming.appspot.com/bot?id=swarm-cros-54&sort_stats=total%3Adesc are all failures since 5/16/2018, 11:05:28 PM (MDT). Logging in we see: ls -l /b/swarming/c/OR/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock -rw-r--r-- 1 chrome-bot chrome-bot 41 May 16 22:07 /b/swarming/c/OR/repository/.repo/projects/src/private-overlays/chromeos-partner-overlay.git/refs/remotes/cros-internal/release-R67-10575.B.lock So, this bot has a stuck lock. It's easy to find the cause. It started with this job being manually killed: https://chrome-swarming.appspot.com/task?id=3d854e9ed5f3f610&refresh=10
,
May 19 2018
Pulling up all of the remaining reports, they all have either swarm-cros-54 or swarm-cros-133 in common. swarm-cros-133's task history is here: https://chrome-swarming.appspot.com/bot?id=swarm-cros-133&sort_stats=total%3Adesc swarm-cros-133 has been failing every job scheduled on it since 4/24/2018, 11:08:57 PM (MDT) when someone manually killed a job in the InitialCheckout stage. This machine has been responsible for 288 individual failures over the past 3.5 weeks. It also has a stale lock: /b/swarming/w/ir/cache/cbuild/repository/.repo/projects/src/third_party/autotest/files.git/refs/remotes/cros/master.lock. I have manually deleted the stale locks off of both swarm-cros-54 or swarm-cros-133.
,
May 20 2018
We should add something to the cbuildbot_launch script to cleanup those lock files, and file a git bug. It's probably tied to a recent client change since our behavior hasn't changed much but this problem is new.
,
May 21 2018
I'm working on a change to do the cleanup in cbuildbot_launch now.
,
May 24 2018
,
May 29 2018
,
May 31 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/515a27ae50cb92d61e92083d09fd0d6bb693c4f6 commit 515a27ae50cb92d61e92083d09fd0d6bb693c4f6 Author: Jason D. Clinton <jclinton@chromium.org> Date: Thu May 31 23:22:40 2018 cbuildbot_launch: Clean up stale git locks Some of our builds are getting killed mid-git operation and leaving behind git master.lock files that break subsequent builds on the same machine. This happens because the git repos are cached between builds. BUG= chromium:842018 TEST=cbuildbot/run_tests --network Change-Id: I9662f214c34a4e26bb58f41e891f8aca66c34f2d Reviewed-on: https://chromium-review.googlesource.com/1075727 Commit-Ready: Jason Clinton <jclinton@chromium.org> Tested-by: Jason Clinton <jclinton@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> Reviewed-by: Mike Nichols <mikenichols@chromium.org> [modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/cbuildbot/repository.py [modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/lib/git_unittest.py [modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/lib/git.py [modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/scripts/cbuildbot_launch.py [modify] https://crrev.com/515a27ae50cb92d61e92083d09fd0d6bb693c4f6/cbuildbot/repository_unittest.py
,
Jun 1 2018
,
Jun 1 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/423b431b31d73b754f199c09fdb0825a6827d4cd commit 423b431b31d73b754f199c09fdb0825a6827d4cd Author: chromite-chromium-autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Date: Fri Jun 01 04:58:16 2018 Roll src/third_party/chromite ffe9e60..a09f2ac (5 commits) https://chromium.googlesource.com/chromiumos/chromite.git/+log/ffe9e60..a09f2ac git log ffe9e60..a09f2ac --date=short --no-merges --format='%ad %ae %s' 2018-05-29 dgarrett@google.com Legoland: Update Legoland build URLs. 2018-05-29 rrangel@chromium.org cros_gdb: address post-submit comments from http://crrev.com/c/1052367 2018-05-24 xixuan@chromium.org cbuildbot: Add provision & bvt-inline suite for nyan_blaze-paladin. 2018-05-23 xixuan@chromium.org cbuildbot: Add fake SkylabHWTestStage to nyan_blaze-paladin. 2018-05-28 jclinton@chromium.org cbuildbot_launch: Clean up stale git locks Created with: gclient setdep -r src/third_party/chromite@a09f2ac The AutoRoll server is located here: https://chromite-chromium-roll.skia.org Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. BUG= chromium:834333 ,chromium:none,chromium:845314,chromium:845314,chromium:842018 TBR=chrome-os-gardeners@chromium.org Change-Id: I6e6012b8c673d0f4d7debda9ec001c42710fce44 Reviewed-on: https://chromium-review.googlesource.com/1081993 Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#563539} [modify] https://crrev.com/423b431b31d73b754f199c09fdb0825a6827d4cd/DEPS
,
Jun 22 2018
Issue 853949 has been merged into this issue.
,
Jun 22 2018
I didn't actually fix this because my CL didn't handle bare repos correctly. It just skipped over them. All repos are bare when they are referenced from the repo manifest before they are checked out.
,
Jun 23 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf commit 1deead13f98b72c3c5eb97f18c9f2a09471d4fcf Author: Jason D. Clinton <jclinton@chromium.org> Date: Sat Jun 23 00:22:06 2018 lib/git.py: Support bare repos in DeleteStaleLocks I'm an idiot thus managed to simultaneously write the logic to detect that a repo is bare and then not actually use it to scan for locks to delete. :( Added a test for this. BUG= chromium:842018 TEST=cbuildbot/run_tests --network succeeds Change-Id: I48748947065840047e63961418d84888a5a0b31f Reviewed-on: https://chromium-review.googlesource.com/1111647 Commit-Ready: Jason Clinton <jclinton@chromium.org> Tested-by: Jason Clinton <jclinton@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/cbuildbot/repository.py [modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/cros_bisect/simple_chrome_builder.py [modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/lib/git.py [modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/lib/git_unittest.py [modify] https://crrev.com/1deead13f98b72c3c5eb97f18c9f2a09471d4fcf/cbuildbot/repository_unittest.py
,
Jun 23 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b27b63d85e6eb74d3b5b309ee7b000341991b322 commit b27b63d85e6eb74d3b5b309ee7b000341991b322 Author: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Date: Sat Jun 23 20:59:19 2018 Roll src/third_party/chromite 6ec076ceabc3..1deead13f98b (1 commits) https://chromium.googlesource.com/chromiumos/chromite.git/+log/6ec076ceabc3..1deead13f98b git log 6ec076ceabc3..1deead13f98b --date=short --no-merges --format='%ad %ae %s' 2018-06-23 jclinton@chromium.org lib/git.py: Support bare repos in DeleteStaleLocks Created with: gclient setdep -r src/third_party/chromite@1deead13f98b The AutoRoll server is located here: https://chromite-chromium-roll.skia.org Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. BUG= chromium:842018 TBR=chrome-os-gardeners@chromium.org Change-Id: Ic551e41c07f15ad9a8973763dc131a400a2f8642 Reviewed-on: https://chromium-review.googlesource.com/1112232 Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#569886} [modify] https://crrev.com/b27b63d85e6eb74d3b5b309ee7b000341991b322/DEPS
,
Aug 7
Seems to be resolved for good. |
||||||||||||
►
Sign in to add a comment |
||||||||||||
Comment 1 by akes...@chromium.org
, May 11 2018Components: -Infra>Git>Admin Infra>Client>ChromeOS>CI
Owner: jclinton@chromium.org
This does not appear to be a Git>Admin bug, and I'm not sure what in particular the bug is asking for. Please include a snippet or more details on bugs like this to make it easier to know what is being asked. In this case the relevant stacktrace is apparently Traceback (most recent call last): File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch", line 169, in <module> DoMain() File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch", line 165, in DoMain commandline.ScriptWrapperMain(FindTarget) File "/b/swarming/w/ir/kitchen-workdir/chromite/lib/commandline.py", line 911, in ScriptWrapperMain ret = target(argv[1:]) File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 481, in main return _main(argv) File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 454, in _main InitialCheckout(repo) File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 69, in wrapped_functor result = functor(*args, **kwargs) File "/b/swarming/w/ir/kitchen-workdir/chromite/scripts/cbuildbot_launch.py", line 310, in InitialCheckout repo.Sync(detach=True) File "/b/swarming/w/ir/kitchen-workdir/chromite/cbuildbot/repository.py", line 684, in Sync raise SrcCheckOutException(err_msg) chromite.cbuildbot.repository.SrcCheckOutException: return code: 1; command: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n cmd=['repo', '--time', 'sync', '--force-sync', '--cache-dir=/b/swarming/w/ir/cache/git', '-n'], cwd=/b/swarming/w/ir/cache/cbuild/repository assigning to jclinton TL of CI, and cc'ing lannm who I believe is oncall.