Starred by 11 users

Issue metadata

Status: Released
Closed: Mar 2012

Blocked on:
issue 394




Branches disappear and don't fetch/clone

Project Member Reported by sop@google.com, Jan 15 2010

Issue description

Affected Version: 2.1.1.1

Sometimes a branch disappears, and it cannot be fetched or
cloned anymore.  repo sync shows this as:

  $ repo sync
  Fetching projects: 100% (224/224), done.
  error: master in platform/bionic not found

I suspect what's happening is a background `git gc` job runs
and moves the branch into the $GIT_DIR/packed-refs file, but
JGit doesn't seem to be reloading the packed-refs data after
the git gc pass.  Since the branch is no longer loose JGit is
not reporting it to a client.
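
To illustrate the suspicion above, here is a minimal Python sketch of ref lookup, not JGit's actual code. It assumes the standard on-disk layout (loose refs as files under $GIT_DIR, packed refs as "<id> <name>" lines in $GIT_DIR/packed-refs). The helper names (read_loose_ref, read_packed_refs, resolve_ref, pack_ref) are made up for this example, and pack_ref is only a stand-in for the ref-packing step of `git gc`. Symbolic refs are not handled.

```python
import os
import tempfile


def read_loose_ref(git_dir, name):
    """Return the object id stored in a loose ref file, or None if absent."""
    path = os.path.join(git_dir, *name.split("/"))
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        return f.read().strip()


def read_packed_refs(git_dir):
    """Parse $GIT_DIR/packed-refs into a {ref_name: object_id} dict."""
    path = os.path.join(git_dir, "packed-refs")
    refs = {}
    if not os.path.isfile(path):
        return refs
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, the "# pack-refs" header, and "^<id>" peeled-tag lines.
            if not line or line[0] in "#^":
                continue
            object_id, ref_name = line.split(" ", 1)
            refs[ref_name] = object_id
    return refs


def resolve_ref(git_dir, name):
    """Look for a loose ref first, then fall back to a fresh read of packed-refs."""
    loose = read_loose_ref(git_dir, name)
    if loose is not None:
        return loose
    return read_packed_refs(git_dir).get(name)


def pack_ref(git_dir, name):
    """Hypothetical stand-in for `git gc`: move one loose ref into packed-refs."""
    object_id = read_loose_ref(git_dir, name)
    with open(os.path.join(git_dir, "packed-refs"), "a") as f:
        f.write(object_id + " " + name + "\n")
    os.remove(os.path.join(git_dir, *name.split("/")))


if __name__ == "__main__":
    git_dir = tempfile.mkdtemp()
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("a" * 40 + "\n")
    pack_ref(git_dir, "refs/heads/master")
    # A reader that only consults loose refs no longer sees the branch.
    print(read_loose_ref(git_dir, "refs/heads/master"))
    # A reader that re-reads packed-refs still finds it.
    print(resolve_ref(git_dir, "refs/heads/master"))
```

A server that caches an old copy of packed-refs behaves like read_loose_ref here for any branch packed after the cache was filled.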
 
Our experience is now the other way around: when people push very large commits
and they are not repacked (the hourly script was disabled, for those of you who read
about us doing that), the load on the server increases, and at some point
this error shows up:

error: revision sw-integration in platform/<some-git> not found

Once we run git gc, the load goes down and the server serves git as normal again.
A restart of the gerrit service may or may not be necessary; we have no good statistics on
this yet. The last time we got into this state, it was not necessary.
Addition: by "large commits" I mean an external delivery: the commit contains a lot
of files, many of which are updated, so uploading the commit means a lot of new blob
data and a new ref on the server.

Comment 3 by sop@google.com, Jan 19 2010

Blockedon: 394
Given what is happening in issue 394, we might actually be
looking at a different variant of issue 394.

If the object that a branch points to cannot be read from
disk, the branch just silently disappears, and no error is
logged to the server log file.  So issue 394 can cause the
branch to vanish like we are seeing here.

Comment 4 by sop@google.com, Jan 31 2010

Labels: -Milestone-Next Milestone-2.1.2

Comment 5 by sop@google.com, Feb 22 2010

Issue 394 has been merged into this issue.

Comment 6 by jjhel...@gmail.com, Mar 1 2010

My organization started seeing this today too, with symptoms similar to those described in
issue 394:

fatal: protocol error: bad pack header

Has anyone been able to temporarily work-around this problem?

Comment 7 by jjhel...@gmail.com, Mar 1 2010

Another update.  I just noticed this post:
http://groups.google.com/group/repo-discuss/browse_thread/thread/d137c9e55e55542

I dropped down to a shell and ran "git gc" on the problematic git repo as the gerrit2
user and it fixed the problem.

Comment 8 by sop@google.com, Mar 3 2010

Labels: Milestone-2.1.3
Slipped to 2.1.3.  I want to get 2.1.2 out.

Comment 9 by sop@google.com, Mar 3 2010

Labels: -Milestone-2.1.2

Comment 10 by ulrik.sj...@gmail.com, Mar 11 2010

I have finally been able to recreate this problem!

1) Push a commit to a git repository (the error is more likely if the commit is big;
mine was 800 megs from /dev/urandom).
2) Let the replication to the replication server finish.
3) Clone the project from the replication server (make sure you are the FIRST person
to clone after the replication is done).
4) Ctrl-C the clone.
5) You are now the proud owner of a broken git repository (we heal it with 'git gc').
Cloning it again will give you something like this:

Initialized empty Git repository in /mnt/src/helloworld/helloworld/.git/
remote: Counting objects: 2765, done
remote: Compressing objects: 100% (2765/2765)
fatal: internal server error6/2765), 165.82 MiB | 11346 KiB/s   
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed



Comment 11 by ulrik.sj...@gmail.com, Mar 12 2010

Btw, forgot to add that the push in step 1) is to refs/heads/master.

Comment 12 by ern...@gmail.com, Apr 9 2010

Is this still reproducible?

Comment 13 by sop@google.com, Apr 26 2010

Re comment #10, when the replication is running is that
going over the system SSH, writing the objects directly
into the repository behind Gerrit's back?

I think it's a red herring that ctrl-c'ing that first
clone causes things to break for all subsequent users.

And I doubt 800 MiB is actually needed to trigger this.
What's probably happening is, your 800 MiB push contained
enough *objects* that it was over the 100 object limit and
was retained as a pack file, rather than being exploded to
loose objects.  And the Gerrit server failed to figure out
that a new pack file was available on disk.
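
As a rough illustration of the 100 object limit mentioned above, the following Python sketch mirrors the rule documented for C git's receive.unpackLimit / transfer.unpackLimit settings (default 100): a push with fewer objects than the limit is exploded into loose objects, and one with at least that many is kept as a pack file. The function name storage_for_push is made up for this example. It only applies to pushes handled by C git (as with replication over the system SSH), and the thread does not state that Gerrit's own receive path applies the same rule.

```python
DEFAULT_UNPACK_LIMIT = 100  # documented default of receive.unpackLimit in C git


def storage_for_push(object_count, unpack_limit=DEFAULT_UNPACK_LIMIT):
    """Return how a receiving C git would store the objects of one push.

    Below the limit the objects are unpacked into loose files; at or above
    the limit the received pack is kept as a pack file.
    """
    if object_count < unpack_limit:
        return "loose"
    return "pack"


if __name__ == "__main__":
    # One large blob plus its tree and commit is only three objects.
    print(storage_for_push(3))
    # The clone output in comment 10 counted 2765 objects in the repository.
    print(storage_for_push(2765))
```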
Hi Shawn / Comment #11
Yes, we're replicating over OpenSSH.

The 800MiB example was mentioned as the safest way to reproduce the bug. But this is 
certainly not the only possible scenario, we see it quite often when we push more 
than one object as well.
The test we did, IIRC, was to check in one large 800MiB binary. I'm not sure what
that means to git internally. I thought it meant only one huuuuuuuuuge blob object
rather than many? (And then tree and commit objects, obviously, but still not hundreds?)

I'd love to tell you more on the differences between when the sync completes versus
when you ctrl-c it, but I was not around Ulrik and Ernst when they set about to
reproduce it, and hence my answer is less useful than it could've been. They might
add their own comments tomorrow morning, EU hours.
Hope it helps!

Comment 15 by ern...@gmail.com, May 3 2010

Had any luck with this Shawn? Can you reproduce it if you follow the #10 steps?

Comment 16 by sop@google.com, May 3 2010

Nope.  I spent about a day on it last week.  I wasn't able to reproduce by
following comment #10.  So I spent some time looking through this section
of code in JGit.  There is a possibly bad condition relating to a push into
Gerrit Code Review confusing a concurrent read.  I've posted patches for it
to JGit, and I see they got merged over the weekend.  I doubt they fix the
case described here though, because the push must occur over the Gerrit port
to trigger the condition.

My week this week is all messed up scheduling wise due to personal stuff that
I have going on right now.  But I plan to devote most of what I can this week
at work to looking at this problem more, maybe I'll have some flash of insight
if I stare at the code long enough.

Comment 17 by sop@google.com, May 4 2010

Labels: Component-JGit
Some notes from an IM session with an admin suffering from
this bug on their Gerrit server, against a Linux kernel repo:

Them> got again the false missing object exception, I do notice
    > one thing tho, almost all the time it's complaining about
    > the object that is the vanilla 2.6.33 commit (we initialize
    > all our branches to start from that)

Me  > ugh

Them> hi again, wtf, I just found out that we have disabled the
    > repack script sometime in March and are only running the
    > resync-all script every night so it would mean those problems
    > are not because of the external repacking

Me  > yikes
    > so the vanilla 2.6.33 commit went poof solely due to gerrit
    > adding new pack files during pushes.

Them> not sure why it did, but yeah, it seems Gerrit doesn't know
    > about it even tho it exists and works after a restart (neither
    > of the touch or "git gc" solve the issue, only restarting Gerrit
    > does so far)

Makes me start to suspect that the PackFile object which contains the
commit got marked as corrupt in memory, or it was simply omitted from
the PackList object somehow during a copy of the array.

Comment 18 by sop@google.com, May 13 2010

Slightly new theory:

JGit has an open bug [1] where pack files are accessed after their
file descriptor was closed.  These usually result in an IOException
being thrown back at the caller.

In many places within ObjectDirectory, JGit consumes an IOException
when accessing the pack file and removes the pack file from its list
of known packs.  Since the exception is not logged, we don't know if
this condition is triggering or not.

When the pack gets removed from the list of known packs, it is never
put back into the list because the objects/pack mtime doesn't change.

So if this read-after-close bug occurs at the right place, we won't
log it, but we'll close the pack and forget it ever exists.  Later on
when we can't access the object we log the missing object error, or
simply hide the branch from the client entirely.

[1] https://bugs.eclipse.org/bugs/show_bug.cgi?id=308945
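
The theory above can be sketched as a toy Python model. This is not JGit's ObjectDirectory code: the classes FakePackDir and PackList and the function read_after_close are invented for illustration, and the pack directory is simulated in memory. The model only shows how swallowing an I/O error and dropping the pack, combined with a rescan that triggers only on an mtime change, makes an object that still exists appear to be missing.

```python
class FakePackDir:
    """In-memory stand-in for objects/pack: packs plus a directory mtime."""

    def __init__(self):
        self.mtime = 0
        self.packs = {}  # pack name -> set of object ids it contains

    def add_pack(self, name, object_ids):
        self.packs[name] = set(object_ids)
        self.mtime += 1  # adding a file changes the directory mtime

    def touch(self):
        self.mtime += 1


class PackList:
    """Cached list of known packs, rescanned only when the mtime changes."""

    def __init__(self, pack_dir):
        self.pack_dir = pack_dir
        self.known = []
        self.seen_mtime = None

    def _rescan_if_modified(self):
        if self.pack_dir.mtime != self.seen_mtime:
            self.known = sorted(self.pack_dir.packs)
            self.seen_mtime = self.pack_dir.mtime

    def _read(self, name, object_id):
        return object_id in self.pack_dir.packs[name]

    def has_object(self, object_id, read=None):
        read = read or self._read
        self._rescan_if_modified()
        for name in list(self.known):
            try:
                if read(name, object_id):
                    return True
            except OSError:
                # The error is swallowed without logging and the pack is forgotten.
                self.known.remove(name)
        return False


def read_after_close(name, object_id):
    """Simulates a read on a pack whose file descriptor was already closed."""
    raise OSError("pack file closed: " + name)


if __name__ == "__main__":
    pack_dir = FakePackDir()
    pack_dir.add_pack("pack-1", {"c0ffee"})
    packs = PackList(pack_dir)
    print(packs.has_object("c0ffee"))                         # pack is known
    print(packs.has_object("c0ffee", read=read_after_close))  # pack gets dropped
    print(packs.has_object("c0ffee"))                         # mtime unchanged, no rescan
    pack_dir.touch()
    print(packs.has_object("c0ffee"))                         # mtime changed, rescan
```

In this model a change to the directory mtime is enough to bring the pack back. Reports in this thread on whether touching the pack directory helps in practice are mixed.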

Comment 19 by sop@google.com, May 27 2010

Labels: -Milestone-2.1.3 FixedIn-2.1.3
Status: Fixed
Fixed in Gerrit by change I50a1cd941fe9f0a7dd2a6a15d6bd56a36fc773a0

Comment 20 by sop@google.com, Jun 1 2010

Labels: -FixedIn-2.1.3 FixedIn-2.1.2.5
We're hitting this daily now, even on 2.1.2.5.  

We're running the workaround script that touches the pack objects files.  I'll try to disable that to see if it helps.

My problem could well be  issue 585  too.  I'll provide the details in 858.

Comment 23 by sop@google.com, Mar 28 2012

Status: Released
