Branches disappear and don't fetch/clone
Project Member Reported by email@example.com, Jan 15 2010
Affected Version: 18.104.22.168

Sometimes a branch disappears, and it cannot be fetched or cloned anymore. repo sync shows this as:

  $ repo sync
  Fetching projects: 100% (224/224), done.
  error: master in platform/bionic not found

I suspect what's happening is a background `git gc` job runs and moves the branch into the $GIT_DIR/packed-refs file, but JGit doesn't seem to be reloading the packed-refs data after the git gc pass. Since the branch is no longer loose, JGit is not reporting it to a client.
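To make the suspicion above concrete, here is a minimal sketch of how a ref can be looked up in a repository on disk: first as a loose file under $GIT_DIR, then in $GIT_DIR/packed-refs. The function name `resolve_ref` is made up for illustration and this is not JGit's code; it also ignores symbolic refs. A reader that only ever did the first step, or that cached an old copy of packed-refs, would lose the branch after `git gc` packs it.

```python
import os


def resolve_ref(git_dir, ref_name):
    """Return the object id for ref_name, or None if it is not found.

    Hypothetical helper for illustration; symbolic refs are not handled.
    """
    # 1) Loose ref: a file such as $GIT_DIR/refs/heads/master.
    loose_path = os.path.join(git_dir, *ref_name.split("/"))
    if os.path.isfile(loose_path):
        with open(loose_path) as f:
            return f.read().strip()

    # 2) Packed ref: a "<object id> <ref name>" line in $GIT_DIR/packed-refs.
    packed_path = os.path.join(git_dir, "packed-refs")
    if os.path.isfile(packed_path):
        with open(packed_path) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, the header comment and "^<id>" peeled-tag lines.
                if not line or line.startswith("#") or line.startswith("^"):
                    continue
                object_id, name = line.split(" ", 1)
                if name == ref_name:
                    return object_id
    return None
```

After `git gc` the loose file is gone and only step 2 can find the branch, so a server has to re-read packed-refs whenever that file changes.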
Jan 15 2010,
Our experience is now the other way around: when people check in very large commits and the repository is not repacked (the hourly script was disabled, for those of you who read about us doing that), the load on the server increases and at some point this error shows up:

  error: revision sw-integration in platform/<some-git> not found

Once we run git gc, the load goes down and the server serves git as normal again. A restart of the gerrit service may or may not be necessary; we have no good statistics on this yet. Last time we got into this it was not necessary.
Jan 15 2010,
Addition: By "large commits" I mean an external delivery: the commit contains a lot of files, many of which are updated, so uploading the commit means a lot of new blob data and a new ref on the server.
Jan 19 2010,
Given what is happening in issue 394, we might actually be looking at a different variant of issue 394. If the object that a branch points to cannot be read from disk, the branch just silently disappears, and no error is logged to the server log file. So issue 394 can cause the branch to vanish like we are seeing here.
Jan 31 2010,
Feb 22 2010,
Issue 394 has been merged into this issue.
Mar 1 2010,
My organization started seeing this today too, with symptoms similar to those explained in issue 394:

  fatal: protocol error: bad pack header

Has anyone been able to work around this problem temporarily?
Mar 1 2010,
Another update. I just noticed this post: http://groups.google.com/group/repo-discuss/browse_thread/thread/d137c9e55e55542

I dropped down to a shell and ran "git gc" on the problematic git repo as the gerrit2 user, and it fixed the problem.
Mar 3 2010,
Slipped to 2.1.3. I want to get 2.1.2 out.
Mar 3 2010,
Mar 11 2010,
I have finally been able to recreate this problem!

1) Push a commit onto a git (the error is more likely to occur if the commit is big; mine was 800 megs from /dev/urandom).
2) Let the replication to the replication-server finish.
3) Clone the project from the replication server (make sure you are the FIRST person to clone after the replication is done).
4) Ctrl-C the clone.
5) You are now the proud owner of a broken git (we heal it with 'git gc').

Cloning the git again will give you something like this:

  Initialized empty Git repository in /mnt/src/helloworld/helloworld/.git/
  remote: Counting objects: 2765, done
  remote: Compressing objects: 100% (2765/2765)
  fatal: internal server error6/2765), 165.82 MiB | 11346 KiB/s
  fatal: The remote end hung up unexpectedly
  fatal: early EOF
  fatal: index-pack failed
Mar 12 2010,
Btw, forgot to add that the push in step 1) is to refs/heads/master.
Apr 9 2010,
Is this still reproducible?
Apr 26 2010,
Re comment #10, when the replication is running, is that going over the system SSH, writing the objects directly into the repository behind Gerrit's back?

I think it's a red herring that ctrl-c'ing that first clone causes things to break for all subsequent users. And I doubt 800 MiB is actually needed to trigger this. What's probably happening is, your 800 MiB push contained enough *objects* that it was over the 100 object limit and was retained as a pack file, rather than being exploded to loose objects. And the Gerrit server failed to figure out that a new pack file was available on disk.
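For reference, the "100 object limit" mentioned above matches the documented default of git's `receive.unpackLimit` / `transfer.unpackLimit` settings. A minimal sketch of that decision follows; the function and constant names are made up for illustration, and only the rule itself (fewer objects than the limit are exploded to loose objects, otherwise the pack is kept) comes from the git documentation.

```python
# Documented default for transfer.unpackLimit / receive.unpackLimit.
DEFAULT_UNPACK_LIMIT = 100


def storage_for_push(object_count, unpack_limit=DEFAULT_UNPACK_LIMIT):
    """Say how a receiving git stores a push with object_count objects.

    Hypothetical helper: returns "loose" when the objects are exploded
    into loose object files, "pack" when the received pack is kept as-is.
    """
    if object_count < unpack_limit:
        return "loose"
    return "pack"
```

Under this rule the size in bytes does not matter: a push of one huge blob plus its tree and commit (3 objects) would be stored loose, while a push of 2765 small objects would arrive as a new pack file that the server has to notice.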
Apr 26 2010,
Hi Shawn / Comment #11

Yes, we're replicating over OpenSSH. The 800MiB example was mentioned as the safest way to reproduce the bug. But this is certainly not the only possible scenario; we see it quite often when we push more than one object as well.

The test we did, IIRC, was to check in one large 800MiB binary. I'm not sure what that means to git internally. I thought it meant only one huge blob object rather than many (and then tree and commit objects, obviously, but still not hundreds)?

I'd love to tell you more about the differences between when the sync completes versus when you ctrl-c it, but I was not around Ulrik and Ernst when they set about to reproduce it, and hence my answer is less useful than it could've been. They might add their own comments tomorrow morning, EU hours.

Hope it helps!
May 3 2010,
Had any luck with this Shawn? Can you reproduce it if you follow the #10 steps?
May 3 2010,
Nope. I spent about a day on it last week. I wasn't able to reproduce by following comment #10. So I spent some time looking through this section of code in JGit. There is a possibly bad condition relating to a push into Gerrit Code Review confusing a concurrent read. I've posted patches for it to JGit, and I see they got merged over the weekend. I doubt they fix the case described here though, because the push must occur over the Gerrit port to trigger the condition. My week this week is all messed up scheduling wise due to personal stuff that I have going on right now. But I plan to devote most of what I can this week at work to looking at this problem more, maybe I'll have some flash of insight if I stare at the code long enough.
May 4 2010,
Some notes from an IM session with an admin suffering from this bug on their Gerrit server, against a Linux kernel repo:

Them> got again the false missing object exception, I do notice
    > one thing tho, almost all the time it's complaining about
    > the object that is the vanilla 2.6.33 commit (we initialize
    > all our branches to start from that)
Me  > ugh
Them> hi again, wtf, I just found out that we have disabled the
    > repack script sometime in March and are only running the
    > resync-all script every night so it would mean those problems
    > are not because of the external repacking
Me  > yikes
    > so the vanilla 2.6.33 commit went poof solely due to gerrit
    > adding new pack files during pushes.
Them> not sure why it did, but yeah, it seems Gerrit doesn't know
    > about it even tho it exists and works after a restart (neither
    > of the touch or "git gc" solve the issue, only restarting Gerrit
    > does so far)

Makes me start to suspect that the PackFile object which contains the commit got marked as corrupt in memory, or it was simply omitted from the PackList object somehow during a copy of the array.
May 13 2010,
Slightly new theory:

JGit has an open bug [1] where pack files are accessed after their file descriptor was closed. These accesses usually result in an IOException being thrown back at the caller.

In many places within ObjectDirectory, JGit consumes an IOException when accessing the pack file and removes the pack file from its list of known packs. Since the exception is not logged, we don't know if this condition is triggering or not. When the pack gets removed from the list of known packs, it is never put back into the list because the objects/pack mtime doesn't change.

So if this read-after-close bug occurs at the right place, we won't log it, but we'll close the pack and forget it ever existed. Later on, when we can't access the object, we log the missing object error, or simply hide the branch from the client entirely.

[1] https://bugs.eclipse.org/bugs/show_bug.cgi?id=308945
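The theory above can be illustrated with a toy model. This is not JGit's ObjectDirectory; the class, its fields and the integer "mtime" are all invented for the sketch. It only shows the suspected chain of events: a transient read error is swallowed, the pack is dropped from the in-memory list, and it stays dropped until the directory mtime changes.

```python
class ToyPackList:
    """Toy model of a cached list of pack files (hypothetical, not JGit)."""

    def __init__(self, packs, dir_mtime):
        self.packs = dict(packs)        # pack name -> set of object ids on disk
        self.known = set(self.packs)    # packs the server currently trusts
        self.scanned_mtime = dir_mtime  # objects/pack mtime at the last scan
        self.broken = set()             # packs whose next read will fail

    def _read(self, pack, object_id):
        if pack in self.broken:
            # Simulate a one-off "read after close" failure.
            self.broken.discard(pack)
            raise IOError("read after close")
        return object_id in self.packs[pack]

    def has_object(self, object_id, dir_mtime):
        # Rescan the directory only when its mtime has changed.
        if dir_mtime != self.scanned_mtime:
            self.known = set(self.packs)
            self.scanned_mtime = dir_mtime
        for pack in sorted(self.known):
            try:
                if self._read(pack, object_id):
                    return True
            except IOError:
                # Swallowed without logging; the pack is forgotten.
                self.known.discard(pack)
        return False
```

In this model, after a single failed read the object stays "missing" on every later lookup even though the pack on disk is fine, and it only reappears once a lookup sees a different directory mtime (or the process is restarted with a fresh list).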
May 27 2010,
Fixed in Gerrit by change I50a1cd941fe9f0a7dd2a6a15d6bd56a36fc773a0
Jun 1 2010,
Jun 8 2010,
We're hitting this daily now, even on 22.214.171.124. We're running the workaround script that touches the pack objects files. I'll try disabling that to see if it helps.
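The workaround script itself is not shown in this thread. As a rough sketch of the idea only, and assuming it works by bumping the modification time of the objects/pack directory so that the server rescans it, something like the following would do that. The function name and the optional `when` parameter are made up for illustration.

```python
import os
import time


def touch_pack_dir(git_dir, when=None):
    """Bump the mtime of <git_dir>/objects/pack and return the new mtime.

    Assumption: a changed directory mtime makes the server re-read
    its list of pack files. The real workaround script may differ.
    """
    pack_dir = os.path.join(git_dir, "objects", "pack")
    stamp = time.time() if when is None else when
    # Set both access and modification time on the directory itself.
    os.utime(pack_dir, (stamp, stamp))
    return os.stat(pack_dir).st_mtime
```

Note that an earlier comment reports a case where neither the touch nor "git gc" helped and only a restart did, so this is a mitigation at best.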
Jun 8 2010,
My problem could well be issue 585 too. I'll provide the details in 858.
Mar 28 2012,