* should git download missing objects?
@ 2006-11-12 15:44 Anand Kumria
2006-11-12 19:41 ` Junio C Hamano
0 siblings, 1 reply; 10+ messages in thread
From: Anand Kumria @ 2006-11-12 15:44 UTC (permalink / raw)
To: git
Hi,
I did an initial clone of Linus' linux-2.6.git tree, via the git protocol,
and then managed to accidentally delete one of the .pack files and its
corresponding .idx file.
I thought that 'cg-fetch' would do the job of bringing down the missing pack
again, and all would be well. Alas, this isn't the case.
<http://pastebin.ca/246678>
Pasky, on IRC, indicated that this might be because git-fetch-pack isn't
downloading missing objects when the git:// protocol is being used.
Should it? Is there a magic invocation of git fetch I can use to fix this
up? I can always re-clone completely (since this is just a tracking repo),
but it would be nice to fix this with the tools themselves.
Any hints?
Thanks,
Anand
* Re: should git download missing objects?
2006-11-12 15:44 should git download missing objects? Anand Kumria
@ 2006-11-12 19:41 ` Junio C Hamano
2006-11-13 19:45 ` Alex Riesen
0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2006-11-12 19:41 UTC (permalink / raw)
To: Anand Kumria; +Cc: git
"Anand Kumria" <wildfire@progsoc.org> writes:
> I did an initial clone of Linus' linux-2.6.git tree, via the git protocol,
> and then managed to accidently delete one of the .pack and
> corresponding .idx files.
>
> I thought that 'cg-fetch' would do the job of bring down the missing pack
> again, and all would be well. Alas this isn't the case.
>
> <http://pastebin.ca/246678>
>
> Pasky, on IRC, indicated that this might be because git-fetch-pack isn't
> downloading missing objects when the git:// protocol is being used.
There are two invariants between refs and objects:
- objects that the refs (files under the .git/refs/ hierarchy that
  record 40-byte hexadecimal object names) point at are never
  missing, or the repository is corrupt.
- objects that are reachable via pointers in another object
  that is not missing (a tag points at another object, a commit
  points at its tree and its parent commits, and a tree points
  at its subtrees and blobs) are never missing, or the repository
  is corrupt.
Git tools first fetch missing objects and then update your refs
only when the fetch succeeds completely, in order to maintain the
above invariants (a partial fetch does not update your refs).
And these invariants are why:
- fsck-objects starts its reachability check from the refs;
- commit walkers can stop at your existing refs;
- git native protocols only need to tell the other end what
  refs you have, in order for the other end to exclude what you
  already have from the set of objects it sends you.
What's missing needs to be determined in a reasonably efficient
manner, and the above invariants spare us from having to do the
equivalent of fsck-objects every time. Being able to trust refs
is fairly fundamental to the fetch operation of git.
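To make that concrete: a ref is just a file holding a 40-hexdigit
object name, and fsck-objects walks outward from those names; when an
object has been removed by hand, it is fsck-objects that reports the
violated invariant. A rough illustration (the object names and output
below are made up):

    $ cat .git/refs/heads/master
    8f3a9c...                      # a 40-hex commit name (truncated here)
    $ git fsck-objects --full
    broken link from    tree 4d1e72...
                  to    blob 9bc0aa...
    missing blob 9bc0aa...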
I am not opposed to the idea of a new tool to fix a corrupted
repository that has broken the above invariants, perhaps caused
by accidental removal of objects and packs by end users. What
it would need to do is the following (a rough sketch follows
the list):
- run fsck-objects to notice what is missing, by noting the
  "broken link from foo to bar" output messages. Object 'bar'
  is what you _ought_ to have according to your refs but do not
  (because you removed the objects that should be there), and
  everything reachable from it needs to be retrieved from the
  other side. Because you do not have 'bar', your end cannot
  determine which of the objects already in your object store
  are reachable from it, so some of the download will be
  redundant.
- run a fetch-pack equivalent to get everything reachable
  starting at the above missing objects, pretending you do not
  have any object, because your refs are not trustworthy.
- run fsck-objects again to make sure that your refs can now be
  trusted.
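Roughly, and with the caveat that step 2 is hypothetical (no such
fetch-pack option or upload-pack-recover exists today; it is shown
only to make the shape concrete), the client side might look like:

    #!/bin/sh
    # Step 1: list the objects our refs promise but we do not have;
    # fsck-objects reports them as "broken link from <type> <sha1>"
    # followed by "              to <type> <sha1>".
    missing=$(git fsck-objects --full 2>&1 |
              sed -n 's/^ *to *[a-z]* *\([0-9a-f]\{40\}\)$/\1/p' |
              sort -u)
    test -z "$missing" && exit 0

    # Step 2 (hypothetical): fetch everything reachable from the
    # missing objects, pretending we have nothing, from a server
    # willing to serve arbitrary objects (the upload-pack-recover
    # discussed here).  No such option exists as of this writing.
    # git fetch-pack --recover git://host.example.com/repo.git $missing

    # Step 3: check that the refs can be trusted again.
    git fsck-objects --full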
To implement the second step above, you need to implement a
modified fetch-pack that does not trust any of your refs. It
also needs to ignore what is offered by the other end and instead
ask for the objects you know are missing ('bar' in the above
example). This program needs to talk to a modified upload-pack
running at the other end (let's call it upload-pack-recover),
because the usual upload-pack does not serve starting from a random
object that happens to be in its repository, but only starting
from objects that are pointed at by its own set of refs, to ensure
integrity.
The upload-pack-recover program would need to start traversal
from object 'bar' in the above example, and when it does so, it
should not just run 'rev-list --objects' starting at 'bar'. It
first needs to prove that its object store has everything that
is reachable from 'bar' (the recipient would still end up with
an incomplete repository if it didn't).
What this means is that it needs to prove that some of its refs can
reach 'bar' (again, on the upstream end only refs are trusted; the
mere existence of an object is not enough) before sending objects
back. The usual upload-pack does not have to do this because it
refuses to serve starting from anything but what its refs point
at (and by the invariants, the objects pointed at by refs are
guaranteed to be complete [an object is "complete" if nothing
reachable from it is missing]).
This is needed because the repository might have discarded a
branch that used to reach 'bar': the object 'bar' may have survived
in a pack while some of its ancestors or component trees and/or
blobs were loose, and a subsequent git-prune removed the latter
without removing 'bar'. The mere existence of the object 'bar'
does not mean 'bar' is complete.
So coming up with such a pair of programs is not rocket
science, but it is fairly delicate. I would rather have them as
specialized commands, not part of the everyday commands, even if
you were to implement them.
Since this is not an everyday operation anyway, a far easier way
would be to clone-pack from the upstream into a new repository,
take the pack you downloaded into that new repository, and mv it
into your corrupt repository. You can then run fsck-objects to see
if you got back everything you lost earlier.
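Spelled out with plain 'git clone' instead of clone-pack (the
directory names and URL below are just examples), that is
something like:

    # grab a fresh pack into a scratch repository
    $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git fresh
    # move the pack and its index into the corrupt repository
    $ mv fresh/.git/objects/pack/pack-*.pack \
         fresh/.git/objects/pack/pack-*.idx  linux-2.6/.git/objects/pack/
    # and verify that everything the refs promise is back
    $ cd linux-2.6 && git fsck-objects --full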
* Re: should git download missing objects?
2006-11-12 19:41 ` Junio C Hamano
@ 2006-11-13 19:45 ` Alex Riesen
2006-11-13 19:54 ` Shawn Pearce
2006-11-13 20:05 ` Junio C Hamano
0 siblings, 2 replies; 10+ messages in thread
From: Alex Riesen @ 2006-11-13 19:45 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Anand Kumria, git
Junio C Hamano, Sun, Nov 12, 2006 20:41:23 +0100:
> Since this is not everyday anyway, a far easier way would be to
> clone-pack from the upstream into a new repository, take the
> pack you downloaded from that new repository and mv it into your
> corrupt repository. You can run fsck-objects to see if you got
> back everything you lost earlier.
I get into such a situation annoyingly often by using
"git clone -l -s from to" and then doing some "cleanup" in the
origin repository. For example, it happens that I remove a tag
or a branch and do a repack or prune afterwards. The related
repositories, which had "accidentally" referenced the pruned
objects, become "corrupt", as you put it.
At the moment, if I run into the situation, I copy packs/objects from
all the repos I have (objects/info/alternates are useful here too), run
fsck-objects/repack and hope nothing is lost. It works, as I almost
always have "accidental" backups somewhere, but it is kind of annoying
to set up. A tool to do this job more effectively would be very handy
(at least it won't have to copy gigabytes of data over a switched
Windows network. Not often, I hope. Not _so_ many gigabytes, possibly).
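Roughly, the shuffle looks like this (the repository names are
made up):

    # scrape packs from every sibling repository into the broken one
    $ for r in ../repo1 ../repo2; do
    >     cp "$r"/.git/objects/pack/pack-* broken/.git/objects/pack/
    > done
    # then see what is still missing and collapse the mess into one pack
    $ cd broken && git fsck-objects --full && git repack -a -d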
* Re: should git download missing objects?
2006-11-13 19:45 ` Alex Riesen
@ 2006-11-13 19:54 ` Shawn Pearce
2006-11-13 20:03 ` Petr Baudis
2006-11-13 20:05 ` Junio C Hamano
1 sibling, 1 reply; 10+ messages in thread
From: Shawn Pearce @ 2006-11-13 19:54 UTC (permalink / raw)
To: Alex Riesen; +Cc: Junio C Hamano, Anand Kumria, git
Alex Riesen <fork0@t-online.de> wrote:
> Junio C Hamano, Sun, Nov 12, 2006 20:41:23 +0100:
> > Since this is not everyday anyway, a far easier way would be to
> > clone-pack from the upstream into a new repository, take the
> > pack you downloaded from that new repository and mv it into your
> > corrupt repository. You can run fsck-objects to see if you got
> > back everything you lost earlier.
>
> I get into such a situation annoyingly often, by using
> "git clone -l -s from to" and doing some "cleanup" in the
> origin repository. For example, it happens that I remove a tag,
> or a branch, and do a repack or prune afterwards. The related
> repositories, which had "accidentally" referenced the pruned
> objects become "corrupt", as you put it.
>
> At the moment, if I run into the situation, I copy packs/objects from
> all repos I have (objects/info/alternates are useful here too), run a
> fsck-objects/repack and hope nothing is lost. It works, as I almost
> always have "accidental" backups somewhere, but is kind of annoying to
> setup. A tool to do this job more effectively will be very handy (at
> least, it wont have to copy gigabytes of data over switched windows
> network. Not often, I hope. Not _so_ many gigabytes, possibly).
One of my coworkers recently lost a single loose tree object.
We suspect his Windows virus scanner deleted the file. :-(
Copying the one missing object over from another repository
immediately fixed the breakage, but it was very annoying not to be
able to run a "git fetch --missing-objects" or some such. Fortunately
it was just the one object, and it was also still loose in another
repository. scp was handy. :-)
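For a single loose object that is easy enough by hand, since its path
is just the object name split after the first two hex digits. Something
like the following (host, paths, and the sed/cut plumbing are only an
illustration):

    # take the object name from fsck-objects' "missing ..." report
    $ sha=$(git fsck-objects --full 2>&1 |
            sed -n 's/^missing [a-z]* \([0-9a-f]\{40\}\)/\1/p' | head -1)
    $ mkdir -p .git/objects/$(echo $sha | cut -c1-2)
    $ scp otherbox:work/repo/.git/objects/$(echo $sha | sed 's|^..|&/|') \
          .git/objects/$(echo $sha | cut -c1-2)/
    $ git fsck-objects --full          # hopefully quiet now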
--
* Re: should git download missing objects?
2006-11-13 19:54 ` Shawn Pearce
@ 2006-11-13 20:03 ` Petr Baudis
2006-11-13 20:10 ` Shawn Pearce
2006-11-13 20:22 ` Junio C Hamano
0 siblings, 2 replies; 10+ messages in thread
From: Petr Baudis @ 2006-11-13 20:03 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Alex Riesen, Junio C Hamano, Anand Kumria, git
On Mon, Nov 13, 2006 at 08:54:14PM CET, Shawn Pearce wrote:
> Alex Riesen <fork0@t-online.de> wrote:
> > Junio C Hamano, Sun, Nov 12, 2006 20:41:23 +0100:
> > > Since this is not everyday anyway, a far easier way would be to
> > > clone-pack from the upstream into a new repository, take the
> > > pack you downloaded from that new repository and mv it into your
> > > corrupt repository. You can run fsck-objects to see if you got
> > > back everything you lost earlier.
> >
> > I get into such a situation annoyingly often, by using
> > "git clone -l -s from to" and doing some "cleanup" in the
> > origin repository. For example, it happens that I remove a tag,
> > or a branch, and do a repack or prune afterwards. The related
> > repositories, which had "accidentally" referenced the pruned
> > objects become "corrupt", as you put it.
> >
> > At the moment, if I run into the situation, I copy packs/objects from
> > all repos I have (objects/info/alternates are useful here too), run a
> > fsck-objects/repack and hope nothing is lost. It works, as I almost
> > always have "accidental" backups somewhere, but is kind of annoying to
> > setup. A tool to do this job more effectively will be very handy (at
> > least, it wont have to copy gigabytes of data over switched windows
> > network. Not often, I hope. Not _so_ many gigabytes, possibly).
cg-fetch -f locally or over HTTP should be able to fix that up, if used
cleverly.
> One of my coworkers recently lost a single loose tree object.
> We suspect his Windows virus scanner deleted the file. :-(
>
> Copying the one bad object from another repository immediately fixed
> the breakage caused, but it was very annoying to not be able to run a
> "git fetch --missing-objects" or some such. Fortunately it was just
> the one object and it was also still loose in another repository.
> scp was handy. :-)
If it's over ssh, this is still where the heavily dusty (and heavily
"plumby") git-ssh-fetch command is useful, since it can be passed an
undocumented --recover argument, and then it will fetch _all_ the objects
you are missing, not assuming anything.
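From memory (so double-check the synopsis), you point it at the commit
you want made complete and at the remote repository:

    $ git-ssh-fetch --recover <commit-sha1> host.example.com:/path/to/repo.git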
Perhaps I should reintroduce support for git-ssh-fetch in cg-fetch, to be
used in the case of -f over SSH. But it would be silly if I did that and
Git then removed the command from its suite. Junio, what's its life
expectancy? I guess this usage scenario is something to take into
account when thinking about removing it; I know that I wanted to get rid
of it in the past, but now my opinion is changing.
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1
* Re: should git download missing objects?
2006-11-13 19:45 ` Alex Riesen
2006-11-13 19:54 ` Shawn Pearce
@ 2006-11-13 20:05 ` Junio C Hamano
2006-11-13 22:52 ` Alex Riesen
1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2006-11-13 20:05 UTC (permalink / raw)
To: Alex Riesen; +Cc: git
fork0@t-online.de (Alex Riesen) writes:
> Junio C Hamano, Sun, Nov 12, 2006 20:41:23 +0100:
>> Since this is not everyday anyway, a far easier way would be to
>> clone-pack from the upstream into a new repository, take the
>> pack you downloaded from that new repository and mv it into your
>> corrupt repository. You can run fsck-objects to see if you got
>> back everything you lost earlier.
>
> I get into such a situation annoyingly often, by using
> "git clone -l -s from to" and doing some "cleanup" in the
> origin repository. For example, it happens that I remove a tag,
> or a branch, and do a repack or prune afterwards. The related
> repositories, which had "accidentally" referenced the pruned
> objects become "corrupt", as you put it.
>
> At the moment, if I run into the situation, I copy packs/objects from
> all repos I have (objects/info/alternates are useful here too), run a
> fsck-objects/repack and hope nothing is lost. It works, as I almost
> always have "accidental" backups somewhere, but is kind of annoying to
> setup. A tool to do this job more effectively will be very handy (at
> least, it wont have to copy gigabytes of data over switched windows
> network. Not often, I hope. Not _so_ many gigabytes, possibly).
I suspect it is a different issue. Maybe you would need reverse
links from the origin directory to the .git/refs/ directory of the
repositories that borrow from it, to prevent pruning. No amount
of butchering fetch-pack to look behind incomplete refs that lie
and claim they are complete would solve your problem if you do
not have any "accidental backups".
In general, 'git clone -l -s' origin directories may not be
writable by the person who is making the clone, so we should not
do this inside 'git clone'. Also you could add alternates after
you set up your repository, so maybe something like this would
help?
#!/bin/sh
#
# Usage: git-add-alternates other_repo
#
# Run this from inside the borrowing repository (the one created
# with 'git clone -l -s'); it records a reverse link to our refs
# inside the origin repository, as a hint that objects reachable
# from them are still needed and should not be pruned.
: ${GIT_DIR=.git}
my_refs=`cd "$GIT_DIR/refs" && pwd` || exit
other=$1
test -d "$other/.git/objects" || {
        echo >&2 "I do not see a repository at $other"
        exit 1
}
mkdir -p "$other/.git/refs/borrowers" || {
        echo >&2 "You cannot write in $other"
        echo >&2 "Arrange with the owner of it to make"
        echo >&2 "sure the objects you need are not pruned."
        exit 2
}
# find an unused slot under refs/borrowers/ and symlink our refs there
cnt=0
while test -d "$other/.git/refs/borrowers/$cnt"
do
        cnt=$(($cnt + 1))
done
ln -s "$my_refs" "$other/.git/refs/borrowers/$cnt"
* Re: should git download missing objects?
2006-11-13 20:03 ` Petr Baudis
@ 2006-11-13 20:10 ` Shawn Pearce
2006-11-13 20:22 ` Junio C Hamano
1 sibling, 0 replies; 10+ messages in thread
From: Shawn Pearce @ 2006-11-13 20:10 UTC (permalink / raw)
To: Petr Baudis; +Cc: Alex Riesen, Junio C Hamano, Anand Kumria, git
Petr Baudis <pasky@suse.cz> wrote:
> On Mon, Nov 13, 2006 at 08:54:14PM CET, Shawn Pearce wrote:
> > Copying the one bad object from another repository immediately fixed
> > the breakage caused, but it was very annoying to not be able to run a
> > "git fetch --missing-objects" or some such. Fortunately it was just
> > the one object and it was also still loose in another repository.
> > scp was handy. :-)
>
> If it's over ssh, this is still where the heavily dusty (and heavily
> "plumby") git-ssh-fetch command is useful, since it can get passed an
> undocumented --recover argument and then it will fetch _all_ the objects
> you are missing, not assuming anything.
Interesting. Since it's undocumented, I didn't know it existed
until now. :)
I'm thinking though that a --recover should just be part of
git-fetch, and that it should work on all transports, not just SSH.
Of course you could get into a whole world of hurt where you keep
doing fsck-objects --full (listing out the missing objects), fetching
them, only to find more missing, etc. After a couple of cycles of that
it may just be better to claim to the other end that you have nothing
but want everything (as in an initial clone) and get a new pack from
which you can pull objects.
But I think that was sort of Junio's point on this topic. I'm just
trying to throw in my +1 in favor of a feature that would have
recovered that sole missing object without making the end user
reclone their entire repository and move pack files around by hand.
And I'm being more verbose about it than just +1. :)
--
* Re: should git download missing objects?
2006-11-13 20:03 ` Petr Baudis
2006-11-13 20:10 ` Shawn Pearce
@ 2006-11-13 20:22 ` Junio C Hamano
2006-11-14 20:08 ` Petr Baudis
1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2006-11-13 20:22 UTC (permalink / raw)
To: Petr Baudis; +Cc: git
Petr Baudis <pasky@suse.cz> writes:
> ... Junio, what's its life
> expectancy? I guess this usage scenario is something to take into
> account when thinking about removing it, I know that I wanted to get rid
> of it in the past but now my opinion is changing.
It uses the same commit walker semantics and mechanism, so I do
not think it is too much of a burden to carry it, but I'd rather have
something that works over the git native protocol if we really care
about this. People without ssh access need to be able to
recover over the git:// protocol.
* Re: should git download missing objects?
2006-11-13 20:05 ` Junio C Hamano
@ 2006-11-13 22:52 ` Alex Riesen
0 siblings, 0 replies; 10+ messages in thread
From: Alex Riesen @ 2006-11-13 22:52 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio C Hamano, Mon, Nov 13, 2006 21:05:48 +0100:
> > Junio C Hamano, Sun, Nov 12, 2006 20:41:23 +0100:
> >> Since this is not everyday anyway, a far easier way would be to
> >> clone-pack from the upstream into a new repository, take the
> >> pack you downloaded from that new repository and mv it into your
> >> corrupt repository. You can run fsck-objects to see if you got
> >> back everything you lost earlier.
> >
> > I get into such a situation annoyingly often, by using
> > "git clone -l -s from to" and doing some "cleanup" in the
> > origin repository. For example, it happens that I remove a tag,
> > or a branch, and do a repack or prune afterwards. The related
> > repositories, which had "accidentally" referenced the pruned
> > objects become "corrupt", as you put it.
> >
> > At the moment, if I run into the situation, I copy packs/objects from
> > all repos I have (objects/info/alternates are useful here too), run a
> > fsck-objects/repack and hope nothing is lost. It works, as I almost
> > always have "accidental" backups somewhere, but is kind of annoying to
> > setup. A tool to do this job more effectively will be very handy (at
> > least, it wont have to copy gigabytes of data over switched windows
> > network. Not often, I hope. Not _so_ many gigabytes, possibly).
>
> I suspect it is a different issue. Maybe you would need reverse
> links from the origin directory to .git/refs/ directroy of
> repositories that borrow from it to prevent pruning. No amount
> of butchering fetch-pack to look behind incomplete refs that lie
> and claim they are complete would solve your problem if you do
> not have any "accidental backups".
It is not about preventing this from happening. It is about
recovering from a user error (which I plainly made). The discussion
about "git fetch --recover" sounds very much like what would have
helped in that situation. I'll just try not to do it next time, but
if I do, it'd be nice to have a tool to help me recover from it. Not
prevent it (I don't see that as possible), just help.
Anyway, it's kind of too late for those repositories. And the scheme
would not be very convenient to work with: the branches in the slave
repos come and go often; they pull from each other and push into the
central (aka origin) repo. Keeping the borrowed refs in sync would be
a nightmare (as in: "I promise to forget doing it").
* Re: should git download missing objects?
2006-11-13 20:22 ` Junio C Hamano
@ 2006-11-14 20:08 ` Petr Baudis
0 siblings, 0 replies; 10+ messages in thread
From: Petr Baudis @ 2006-11-14 20:08 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Mon, Nov 13, 2006 at 09:22:13PM CET, Junio C Hamano wrote:
> Petr Baudis <pasky@suse.cz> writes:
>
> > ... Junio, what's its life
> > expectancy? I guess this usage scenario is something to take into
> > account when thinking about removing it, I know that I wanted to get rid
> > of it in the past but now my opinion is changing.
>
> It uses the same commit walker semantics and mechanism so I do
> not think it is too much burden to carry it, but I'd rather have
> something that works over git native protocol if we really care
> about this. People without ssh access needs to be able to
> recover over git:// protocol.
Even though I obviously agree with the above, it would be useful to have
the flag even if git:// (which is apparently harder to get right
than the others) is not supported. After all, most repositories I've
seen that are available over git:// are available over HTTP as well.
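If I remember the synopsis right, the HTTP commit walker takes the same
flag, so the equivalent over HTTP would be something like (URL is a
placeholder):

    $ git-http-fetch -v --recover <commit-sha1> http://host.example.com/project.git/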
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1