git.vger.kernel.org archive mirror
* git prune pig slow
@ 2006-07-29  9:02 Russell King
  2006-07-29 11:40 ` Johannes Schindelin
  2006-07-29 18:14 ` Linus Torvalds
  0 siblings, 2 replies; 6+ messages in thread
From: Russell King @ 2006-07-29  9:02 UTC (permalink / raw)
  To: git

Hi,

git 1.4.0, P4 2.6GHz, 1GB.

I'm trying to use "git prune" to remove some unreachable objects from
my git tree.  However, it appears to be _extremely_ expensive:

rmk      13376 91.3 15.7 165980 161556 pts/0   R+   09:50   5:14 git-fsck-object

stracing it shows that it's doing lots and lots of brk() calls.

I killed it after 10 minutes and decided to do the job manually -
git-fsck-objects --unreachable and deleting the objects one by one is
_much_ quicker than git-fsck-objects --full --cache --unreachable.

-- 
Russell King


* Re: git prune pig slow
  2006-07-29  9:02 git prune pig slow Russell King
@ 2006-07-29 11:40 ` Johannes Schindelin
  2006-07-29 18:14 ` Linus Torvalds
  1 sibling, 0 replies; 6+ messages in thread
From: Johannes Schindelin @ 2006-07-29 11:40 UTC (permalink / raw)
  To: Russell King; +Cc: git

Hi,

On Sat, 29 Jul 2006, Russell King wrote:

> Hi,
> 
> git 1.4.0, P4 2.6GHz, 1GB.
> 
> I'm trying to use "git prune" to remove some unreachable objects from
> my git tree.  However, it appears to be _extremely_ expensive:
> 
> rmk      13376 91.3 15.7 165980 161556 pts/0   R+   09:50   5:14 git-fsck-object
> 
> stracing it shows that it's doing lots and lots of brk() calls.

Does git-count-objects show a large number of unpacked objects? If so, you 
should try "git-repack -a -d" _before_ git-prune.
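
A minimal sketch of that order of operations, using the command spellings of
current git (git count-objects, git repack, git prune) rather than the dashed
1.4-era names. The scratch repository, the file name and the stray blob are
assumptions made up purely for illustration.

```shell
set -e
# Hypothetical scratch repository; nothing here touches a real tree.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"
echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"

# Create one unreachable loose object so prune has something to remove.
junk=$(echo "stray data" | git hash-object -w --stdin)

git count-objects -v      # loose objects are reported on the "count:" line
git repack -a -d -q       # pack everything reachable, drop redundant loose copies
git prune                 # remove the unreachable loose objects that are left
git count-objects -v      # loose count again, after repack and prune
```

Repacking first means the prune step only has the leftover loose objects to
remove, which is the point of doing it in this order.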

Hth,
Dscho


* Re: git prune pig slow
  2006-07-29  9:02 git prune pig slow Russell King
  2006-07-29 11:40 ` Johannes Schindelin
@ 2006-07-29 18:14 ` Linus Torvalds
  2006-07-29 20:03   ` Linus Torvalds
  1 sibling, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2006-07-29 18:14 UTC (permalink / raw)
  To: Russell King; +Cc: git



On Sat, 29 Jul 2006, Russell King wrote:
> 
> I killed it after 10 minutes and decided to do the job manually -
> git-fsck-objects --unreachable and deleting the objects one by one is
> _much_ quicker than git-fsck-objects --full --cache --unreachable.

It's also very dangerous.

If you have partial packing (which you can get if you fetch data using 
rsync or http, for example), not having the "--full" means that 
git-fsck-objects will report on objects being "unreachable" if they are 
only reachable from another object that is packed.

Now, in practice, if you only use the git native protocol, this should 
never happen, and you're fine. But there's a _very_ real reason why "git 
prune" passes the "--full" flag to git-fsck-objects. "git prune" is simply 
too dangerous without it.

That said, the current "git prune" in 1.4.2-rc is much faster, because it 
does the reachability analysis on its own, and doesn't do all the other 
things that git-fsck-objects does.

Btw, another alternative to "git prune" is actually to do

	git repack -a -d

and then just delete all unpacked objects.

			Linus


* Re: git prune pig slow
  2006-07-29 18:14 ` Linus Torvalds
@ 2006-07-29 20:03   ` Linus Torvalds
  0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2006-07-29 20:03 UTC (permalink / raw)
  To: Russell King; +Cc: git



On Sat, 29 Jul 2006, Linus Torvalds wrote:
> 
> It's also very dangerous.
> 
> If you have partial packing (which you can get if you fetch data using 
> rsync or http, for example), not having the "--full" means that 
> git-fsck-objects will report on objects being "unreachable" if they are 
> only reachable from another object that is packed.
> 
> Now, in practice, if you only use the git native protocol, this should 
> never happen, and you're fine.

Side note: in _practice_, it probably doesn't happen even with rsync and 
http, so in that sense, it's true that "--full" is almost always likely to 
just be a waste of time, and I can't come up with a scenario where you 
really need "--full" for pruning unless you did something strange. All the 
normal workflows mean that if you have an object that is in a pack, 
everything it points to will _also_ be in a pack, and as such, "git prune" 
would never remove anything that wasn't safe to remove, even without the 
"--full".

But just to get an example of how a _strange_ scenario could happen, 
let's say that

 - you're tracking an upstream repository using rsync or http (ie you will 
   get the objects in the same format that upstream tracks them, either as 
   individual objects, or as "packs")

 - that upstream repository does _incremental_ repacks every once in a 
   while. 

 - the last time you fetched was _just_ before upstream did an incremental 
   pack; call this "State A".

	As a result, you now have his old state A all as individual 
	objects in your object database.

 - you fetch again, now after upstream has done _two_ incremental packs 
   (one to pack all the loose objects that you already had, and one to 
   pack the new state). Upstream is now at "State B"

	As a result, you get all of his _new_ objects as one nice pack: 
	you do not get his other pack, because you already have all 
	_those_ objects (which are "state A") as individual objects.

 - so now, since you're only tracking the other end's state, and have no 
   objects of your own (in particular, the last fetch/pull did _not_ 
   generate a merge object of your own to connect the new pack with the 
   old objects), what has happened is that all your heads point into the 
   new incremental pack you just fetched, and that pack itself will have 
   pointers to the individual objects that you fetched last time, because 
   it was an incremental pack to "state A".

 - what happens now is that if you run "git-fsck-objects" without the 
   "--full", it will claim that _all_ of your unpacked objects are 
   unreachable, because they really are reachable only through that new 
   pack.

So in this (very very unusual) circumstance, "git prune" without the 
"--full" would literally prune away objects that you very much need.
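
The object layout in this scenario (old state loose, new state in an
incremental pack) can be reproduced locally as a rough sketch. It uses current
git plumbing (git pack-objects, git prune-packed) in place of an rsync or http
fetch, and the file name and commit messages are assumptions for illustration
only. Current git's prune does the full reachability walk, so its dry run
lists nothing as removable here.

```shell
set -e
# Hypothetical scratch repository standing in for the tracking repo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# "State A": three loose objects (blob, tree, commit).
echo "one" > file
git add file
git commit -q -m "state A"
a=$(git rev-parse HEAD)

# "State B": one more commit on top.
echo "two" > file
git commit -q -a -m "state B"

# Incremental pack: only the objects that are new since state A.
git rev-list --objects "$a"..HEAD | git pack-objects -q .git/objects/pack/pack >/dev/null
git prune-packed          # drop the loose copies of what is now packed

git count-objects -v      # state A is still loose, state B is in the pack
git prune -n              # dry run: lists what prune would remove
```

Every head now points into the pack, and the loose objects are reachable only
by following the packed commit's parent pointer, which is the situation the
list above describes.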

I hope this explains why that "unnecessary" (and admittedly much more 
expensive) --full is there. It really is unnecessary in practice: partly 
because Junio has made "git repack -a -d" so efficient that doing 
incremental packs isn't even worth it for most people, and partly because 
you probably use the native git protocol and repack yourself, and thus 
never use another person's pack directly (which also avoids this problem).

But yeah, the old "git prune" was really very expensive. It's much better 
in the current git branch, although it's still not _cheap_ (because it 
does do the whole reachability analysis, through all pack-files, because it 
wants to get the above special case right).

If we really wanted to, we could add a "core.fullpacks" flag that you 
could set, and that would cause the non-native protocols to not work (or 
alternatively force a re-pack after they have fetched a pack), and that 
would disallow incremental repacking locally, and then we could optimize 
the hell out of "git prune" and say that it never needs to look at any 
reachability for an object that is already packed.

That would make "git prune" basically instantaneous, the way "git 
fsck-objects" is by default. But to be safe, it really needs to have some 
per-repository flag that is honored by the other git commands.

			Linus


* Re: git prune pig slow
@ 2006-07-29 22:41 linux
  2006-07-29 23:48 ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: linux @ 2006-07-29 22:41 UTC (permalink / raw)
  To: git; +Cc: torvalds

> Btw, another alternative to "git prune" is actually to do
>
>	git repack -a -d
>
> and then just delete all unpacked objects.

No, that's dangerous too.  The index file is considered part of the root
set for git-fsck-objects, but not for git-repack.

Example script:

$ git-init-db
$ cat > hello.c
#include <stdio.h>

int
main(void)
{
        puts("Hello, world!");
        return 0;
}
$ git-update-index --add hello.c
$ git-repack -a -d
Generating pack...
Done counting 0 objects.
Nothing new to pack.
$ rm .git/objects/67/159ba959e0a0cd6157bf04d5dad66af59383c2
rm: remove write-protected regular file `.git/objects/67/159ba959e0a0cd6157bf04d5dad66af59383c2'? y
$ git commit
error: invalid object 67159ba959e0a0cd6157bf04d5dad66af59383c2
fatal: git-write-tree: error building trees


* Re: git prune pig slow
  2006-07-29 22:41 linux
@ 2006-07-29 23:48 ` Linus Torvalds
  0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2006-07-29 23:48 UTC (permalink / raw)
  To: linux; +Cc: git



On Sat, 29 Jul 2006, linux@horizon.com wrote:
> 
> No, that's dangerous too.  The index file is considered part of the root
> set for git-fsck-objects, but not for git-repack.

Indeed.

Although at least you won't lose any history - at worst you'll have to 
basically do a "git reset HEAD" to make things right again.
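
A rough sketch of that recovery, assuming (unlike the example script in the
quoted mail) that the repository already has at least one commit so that HEAD
exists. The file names are made up, the loose blob is deleted by hand to
simulate the damage, and the commands use current git spellings.

```shell
set -e
# Hypothetical scratch repository with one existing commit.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"
echo "base" > base.txt
git add base.txt
git commit -q -m "base"

# Stage a new file, then delete its loose blob behind git's back.
echo 'int main(void) { return 0; }' > hello.c
git add hello.c
blob=$(git rev-parse :hello.c)
path=.git/objects/$(echo "$blob" | cut -c1-2)/$(echo "$blob" | cut -c3-)
rm -f "$path"
git cat-file -e "$blob" 2>/dev/null && echo "unexpectedly present" || echo "blob is gone"

# Recovery: throw away the stale index entry, then stage the file again.
git reset -q HEAD
git add hello.c
git cat-file -e "$blob" && echo "blob restored"
```

Only staged-but-uncommitted content is at risk here; the working tree copy is
what lets the blob be rebuilt.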

I was careful when I wrote the new git-prune to take the index into 
account, but I'd forgotten about it wrt the "git repack -a -d" suggestion.

		Linus

