Why repository grows after "git gc"? / Purpose of *.keep files?

* Why repository grows after "git gc"? / Purpose of *.keep files?
@ 2008-05-12 12:29 Teemu Likonen
  2008-05-12 15:52 ` Teemu Likonen
  0 siblings, 1 reply; 35+ messages in thread
From: Teemu Likonen @ 2008-05-12 12:29 UTC (permalink / raw)
  To: git

I have noticed that after cloning a repository (via git protocol) the
repo is packed pretty tightly and takes a relatively small amount of disk
space. After using it for a while and running "git gc" the repo sometimes
grows by 25% or so.

For testing purposes I deleted the objects/pack/*.keep file(s) and ran "git
gc" again. The repo became small again, just like after the initial
clone. I don't have disk space problems but a repo growing about 25%
after a manual "git gc" seems weird. What's the purpose of these *.keep
files? They just contain text like "fetch-pack <number> on <my
hostname>".

PS. I have merged Brandon Casey's new git-gc/repack patches. In case it
    has some effect. See the "pu" branch or "git log 9e7d5019".

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 12:29 Why repository grows after "git gc"? / Purpose of *.keep files? Teemu Likonen
@ 2008-05-12 15:52 ` Teemu Likonen
  2008-05-12 17:13   ` Johannes Schindelin
  2008-05-12 17:17   ` David Tweed
  0 siblings, 2 replies; 35+ messages in thread
From: Teemu Likonen @ 2008-05-12 15:52 UTC (permalink / raw)
  To: git

Teemu Likonen wrote (2008-05-12 15:29 +0300):

> For testing purposes I deleted objects/pack/*.keep file(s) and ran
> "git gc" again. The repo resulted in small again, just like after the
> initial clone.

After playing with a test repo for a while it seems that "git gc" never
touches pack files which have an accompanying .keep file. (And it's
common to have a .keep file after "git clone".) This makes gc run
faster. A side effect seems to be that objects which later become
unreferenced in those packs with a .keep file are never pruned. *.keep
files also seem to prevent really aggressive optimization of the
repository's size.

Probably a crazy idea: What if "gc --aggressive" first removed *.keep
files and after packing and garbage-collecting and whatever it does it
would add a .keep file for the newly created pack?
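
A minimal sketch for checking which packs "git gc" will leave alone,
assuming the usual layout where a marker named pack-<id>.keep sits next
to pack-<id>.pack. The function name list_kept_packs is made up for
illustration; it is not a git command.

```shell
# List packs that have a .keep marker next to them.
# Usage: list_kept_packs [pack-dir]   (defaults to .git/objects/pack)
list_kept_packs() {
    dir=${1:-.git/objects/pack}
    for keep in "$dir"/*.keep; do
        [ -e "$keep" ] || continue      # the glob matched nothing
        pack=${keep%.keep}.pack         # pack-<id>.keep -> pack-<id>.pack
        if [ -e "$pack" ]; then
            basename "$pack"
        fi
    done
}
```

Running it in a fresh clone shows whether the clone left a .keep marker
behind, which is the situation described above.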

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 15:52 ` Teemu Likonen
@ 2008-05-12 17:13   ` Johannes Schindelin
  2008-05-12 18:43     ` Teemu Likonen
  2008-05-12 17:17   ` David Tweed
  1 sibling, 1 reply; 35+ messages in thread
From: Johannes Schindelin @ 2008-05-12 17:13 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: git

Hi,

On Mon, 12 May 2008, Teemu Likonen wrote:

> Probably a crazy idea: What if "gc --aggressive" first removed *.keep 
> files and after packing and garbage-collecting and whatever it does it 
> would add a .keep file for the newly created pack?

Most .keep files are not meant to be removed by git-gc.  Usually, .keep 
files are only created interactively (if you _want_ to keep a pack, e.g. 
when it has been optimally packed and is big), or by git-index-pack while 
it is writing a pack (IIRC).

So I think it would be wrong for "gc --aggressive" to remove the .keep 
files.

Ciao,
Dscho

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 17:13   ` Johannes Schindelin
@ 2008-05-12 18:43     ` Teemu Likonen
  2008-05-12 18:56       ` Nicolas Pitre
  0 siblings, 1 reply; 35+ messages in thread
From: Teemu Likonen @ 2008-05-12 18:43 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin wrote (2008-05-12 18:13 +0100):

> On Mon, 12 May 2008, Teemu Likonen wrote:
> 
> > Probably a crazy idea: What if "gc --aggressive" first removed
> > *.keep files and after packing and garbage-collecting and whatever
> > it does it would add a .keep file for the newly created pack?
> 
> Most .keep files are not meant to be removed by git-gc.  Usually,
> .keep files are only created interactively (if you _want_ to keep
> a pack, e.g. when it has been optimally packed and is big), or by
> git-index-pack while it is writing a pack (IIRC).
> 
> So I think it would be wrong for "gc --aggressive" to remove the .keep
> files.

I guess you're right. Maybe "gc --aggressive" could delete only certain
machine-generated .keep files which have an identifier inside?

Well, I don't really have any problems with the current behaviour; it
just feels a bit strange that, for example, Linus's kernel repository
grew about 90MB after just one update pull and gc. Also, dangling
objects are kept forever in .keep packs (which are created with "git
clone", for example).

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 18:43     ` Teemu Likonen
@ 2008-05-12 18:56       ` Nicolas Pitre
  2008-05-12 19:09         ` Teemu Likonen
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-12 18:56 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Johannes Schindelin, git

On Mon, 12 May 2008, Teemu Likonen wrote:

> Well, I don't really have any problems with the current behaviour; it
> just feels a bit strange that, for example, Linus's kernel repository
> grew about 90MB after just one update pull and gc.

That looks really odd.  Sure the repo might grow a bit, but 90MB seems
really excessive.  How much time passed between the initial clone and
that subsequent pull?

> Also, dangling
> objects are kept forever in .keep packs (which are created with "git
> clone", for example).

A pack obtained via 'git clone' will never contain any dangling objects.


Nicolas

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 18:56       ` Nicolas Pitre
@ 2008-05-12 19:09         ` Teemu Likonen
  2008-05-12 19:36           ` Nicolas Pitre
  0 siblings, 1 reply; 35+ messages in thread
From: Teemu Likonen @ 2008-05-12 19:09 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, git

Nicolas Pitre wrote (2008-05-12 14:56 -0400):

> On Mon, 12 May 2008, Teemu Likonen wrote:
> 
> > Well, I don't really have any problems with the current behaviour;
> > it just feels a bit strange that, for example, Linus's kernel
> > repository grew about 90MB after just one update pull and gc.
> 
> That looks really odd.  Sure the repo might grow a bit, but 90MB seems
> really excessive.  How many time did pass between the initial clone
> and that subsequent pull?

As I used the kernel repo just for testing this behaviour in question
I did both things today. Timestamps tell that there were six hours
between the initial .keep pack and the new pack created by manual "git
gc".

> > Also, dangling objects are kept forever in .keep packs (which are
> > created with "git clone", for example).
> 
> A pack obtained via 'git clone' will never contain any dangling
> objects.

I think it can contain them at some later point. For example, if a user first
fetches all the branches but later decides to track only one branch.
After deleting unneeded tracking branches and expiring the reflog
there'll be dangling objects in the original .keep pack created with
"git clone".

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 19:09         ` Teemu Likonen
@ 2008-05-12 19:36           ` Nicolas Pitre
  2008-05-12 20:10             ` Govind Salinas
  2008-05-12 20:24             ` Teemu Likonen
  0 siblings, 2 replies; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-12 19:36 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Johannes Schindelin, git

On Mon, 12 May 2008, Teemu Likonen wrote:

> Nicolas Pitre wrote (2008-05-12 14:56 -0400):
> 
> > On Mon, 12 May 2008, Teemu Likonen wrote:
> > 
> > > Well, I don't really have any problems with the current behaviour;
> > > it just feels a bit strange that, for example, Linus's kernel
> > > repository grew about 90MB after just one update pull and gc.
> > 
> > That looks really odd.  Sure the repo might grow a bit, but 90MB seems
> > really excessive.  How many time did pass between the initial clone
> > and that subsequent pull?
> 
> As I used the kernel repo just for testing this behaviour in question
> I did both things today. Timestamps tell that there were six hours
> between the initial .keep pack and the new pack created by manual "git
> gc".

This is way too big a difference.  Something is going on.

What git version is this? And can you send me the content of your 
.git/logs directory?

> > > Also, dangling objects are kept forever in .keep packs (which are
> > > created with "git clone", for example).
> > 
> > A pack obtained via 'git clone' will never contain any dangling
> > objects.
> 
> I think it can contain at some later point. For example, if a user first
> fetches all the branches but later decides to track only one branch.
> After deleting unneeded tracking branches and expiring the reflog
> there'll be dangling objects in the original .keep pack created with
> "git clone".

Sure.  But deciding to track only one branch and exclude the others
already requires some higher level of git knowledge.  At that point if
you really care about top packing performance you certainly can deal
with the .keep file as well.


Nicolas

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 19:36           ` Nicolas Pitre
@ 2008-05-12 20:10             ` Govind Salinas
  2008-05-12 21:06               ` Nicolas Pitre
  2008-05-12 20:24             ` Teemu Likonen
  1 sibling, 1 reply; 35+ messages in thread
From: Govind Salinas @ 2008-05-12 20:10 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Teemu Likonen, Johannes Schindelin, git

On Mon, May 12, 2008 at 2:36 PM, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 12 May 2008, Teemu Likonen wrote:
>
>  > Nicolas Pitre wrote (2008-05-12 14:56 -0400):
>  >
>  > > On Mon, 12 May 2008, Teemu Likonen wrote:
>  > >
>  > > > Well, I don't really have any problems with the current behaviour;
>  > > > it just feels a bit strange that, for example, Linus's kernel
>  > > > repository grew about 90MB after just one update pull and gc.
>  > >
>  > > That looks really odd.  Sure the repo might grow a bit, but 90MB seems
>  > > really excessive.  How many time did pass between the initial clone
>  > > and that subsequent pull?
>  >
>  > As I used the kernel repo just for testing this behaviour in question
>  > I did both things today. Timestamps tell that there were six hours
>  > between the initial .keep pack and the new pack created by manual "git
>  > gc".
>
>  This is way too big a difference.  Something is going on.
>
>  What git version is this? And can you send me the content of your
>  .git/logs directory?
>
>
>  > > > Also, dangling objects are kept forever in .keep packs (which are
>  > > > created with "git clone", for example).
>  > >
>  > > A pack obtained via 'git clone' will never contain any dangling
>  > > objects.
>  >
>  > I think it can contain at some later point. For example, if a user first
>  > fetches all the branches but later decides to track only one branch.
>  > After deleting unneeded tracking branches and expiring the reflog
>  > there'll be dangling objects in the original .keep pack created with
>  > "git clone".
>
>  Sure.  But to decide to track only one branch and exclude the others
>  require some higher level of git knowledge already.  At that point if
>  you really care about top packing performances you certainly can deal
>  with the .keep file as well.
>
>

I have had some similar problems with .keep files.  I cloned a repo I
created that had a branch that I wasn't interested in.  I deleted the
branch and then I could never get rid of the (large) number of objects
in that pack until I deleted the .keep and repacked.  I think there
should be some way of forcing git to fix this sort of thing.

It gets even worse, I had pushed up the branch I wanted to get rid of
to my hosted server and there was no way to get git to release that
disk space.  I had to have the hosting admin send me a tarball
of the repo, extract it, delete the .keep file and repack it then send
it back to him.  I was fortunate enough to have a service that would
let me do that.

Thanks,
Govind.

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 20:10             ` Govind Salinas
@ 2008-05-12 21:06               ` Nicolas Pitre
  2008-05-12 21:07                 ` Govind Salinas
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-12 21:06 UTC (permalink / raw)
  To: Govind Salinas; +Cc: Teemu Likonen, Johannes Schindelin, git

On Mon, 12 May 2008, Govind Salinas wrote:

> On Mon, May 12, 2008 at 2:36 PM, Nicolas Pitre <nico@cam.org> wrote:
> >  Sure.  But to decide to track only one branch and exclude the others
> >  require some higher level of git knowledge already.  At that point if
> >  you really care about top packing performances you certainly can deal
> >  with the .keep file as well.
> 
> I have had some similar problems with .keep files.  I cloned a repo I
> created that had a branch that I wasn't interested in.  I deleted the
> branch and then I could never get rid of the (large) number of objects
> in that pack until I deleted the .keep and repacked.

But as soon as you just "git pull" you'll get the deleted branch back.


Nicolas

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 21:06               ` Nicolas Pitre
@ 2008-05-12 21:07                 ` Govind Salinas
  0 siblings, 0 replies; 35+ messages in thread
From: Govind Salinas @ 2008-05-12 21:07 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Teemu Likonen, Johannes Schindelin, git

On Mon, May 12, 2008 at 4:06 PM, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 12 May 2008, Govind Salinas wrote:
>
>  > On Mon, May 12, 2008 at 2:36 PM, Nicolas Pitre <nico@cam.org> wrote:
>
>  > >  Sure.  But to decide to track only one branch and exclude the others
>  > >  require some higher level of git knowledge already.  At that point if
>  > >  you really care about top packing performances you certainly can deal
>  > >  with the .keep file as well.
>  >
>  > I have had some similar problems with .keep files.  I cloned a repo I
>  > created that had a branch that I wasn't interested in.  I deleted the
>  > branch and then I could never get rid of the (large) number of objects
>  > in that pack until I deleted the .keep and repacked.
>
>  But as soon as you just "git pull" you'll get the deleted branch back.
>
>
If you read the rest of my mail, you will see where I removed it from the
hosted server as well.  But with difficulty.

Thanks,
Govind.

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 19:36           ` Nicolas Pitre
  2008-05-12 20:10             ` Govind Salinas
@ 2008-05-12 20:24             ` Teemu Likonen
  2008-05-12 21:03               ` Mike Hommey
  2008-05-12 21:07               ` Nicolas Pitre
  1 sibling, 2 replies; 35+ messages in thread
From: Teemu Likonen @ 2008-05-12 20:24 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, git

Nicolas Pitre wrote (2008-05-12 15:36 -0400):

> On Mon, 12 May 2008, Teemu Likonen wrote:
> 
> > > On Mon, 12 May 2008, Teemu Likonen wrote:
> > > 
> > > > Well, I don't really have any problems with the current
> > > > behaviour; it just feels a bit strange that, for example,
> > > > Linus's kernel repository grew about 90MB after just one update
> > > > pull and gc.

> > As I used the kernel repo just for testing this behaviour in
> > question I did both things today. Timestamps tell that there were
> > six hours between the initial .keep pack and the new pack created by
> > manual "git gc".
> 
> This is way too big a difference.  Something is going on.
> 
> What git version is this? And can you send me the content of your
> .git/logs directory?

I'm using Git from the "master" branch; compiled it today. I have the
following gc/repack-related patches applied from the "pu" branch:

  builtin-gc.c: deprecate --prune, it now really has no effect
  git-gc: always use -A when manually repacking
  repack: modify behavior of -A option to leave unreferenced objects unpacked

But I have experienced the same earlier with some other post-1.5.5
version so I believe you can reproduce this yourself. After cloning
Linus's linux-2.6 repo its .git directory weighs 209MB. After a single
"git pull" and "git gc" it was 298MB in my test.

I'll send you the .git/logs directory but I'm afraid it doesn't tell
much. There are just three files:

  .git/logs/HEAD
  .git/logs/refs/heads/master
  .git/logs/refs/remotes/origin/master

They contain one line for the initial clone and one line for
the fast-forward pull.

> > I think it can contain at some later point. For example, if a user
> > first fetches all the branches but later decides to track only one
> > branch. After deleting unneeded tracking branches and expiring the
> > reflog there'll be dangling objects in the original .keep pack
> > created with "git clone".
> 
> Sure.  But to decide to track only one branch and exclude the others
> require some higher level of git knowledge already.  At that point if
> you really care about top packing performances you certainly can deal
> with the .keep file as well.

Perhaps so. Although I don't consider this very high level Git
knowledge:

  $ git remote rm origin
  $ git remote add -t wanted_branch origin git://...

The first command removes all the tracking branches. The latter starts
to track only one branch.
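
The growth figures quoted above (209MB after the clone, 298MB after pull
and gc) can be reproduced with something like the following sketch. It
assumes the "size-pack" line printed by "git count-objects -v", which
reports pack size in KiB; the helper names pack_size_kib and growth_pct
are hypothetical.

```shell
# Print the size-pack value (KiB) from "git count-objects -v" output on stdin.
pack_size_kib() {
    awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Percentage growth from one size to another, to one decimal place.
growth_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) * 100 / before }'
}

# Typical use inside a repository:
#   before=$(git count-objects -v | pack_size_kib)
#   git pull && git gc
#   after=$(git count-objects -v | pack_size_kib)
#   growth_pct "$before" "$after"
```

With the sizes from this message, growth_pct 209 298 reports a growth of
roughly 43%. Note that size-pack covers only the packs, not the whole
.git directory.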

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 20:24             ` Teemu Likonen
@ 2008-05-12 21:03               ` Mike Hommey
  2008-05-12 21:08                 ` Mike Hommey
  2008-05-12 21:07               ` Nicolas Pitre
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Hommey @ 2008-05-12 21:03 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Nicolas Pitre, Johannes Schindelin, git

On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote:
> But I have experienced the same earlier with some other post-1.5.5
> version so I believe you can reproduce this yourself. After cloning
> Linus's linux-2.6 repo its .git directory weights 209MB. After single
> "git pull" and "git gc" it was 298MB in my test.

I noticed that a while ago: when repacking multiple packs when one has a
.keep file, the resulting additional pack contains too many blobs and
trees, contrary to when only packing loose objects:

$ git init
$ echo a > a; git add a; git commit -m a
$ git gc
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ git verify-pack -v .git/objects/pack/pack-b87e61e2dc18ff37624d7f996f1270f923411530.pack
4bba7c0583de30efff4097299f89b199ab4a6dff commit 160 116 12
78981922613b2afb6025042ff6bd878ac1994e85 blob   2 11 167
aaff74984cccd156a469afa7d9ab10e4777beb24 tree   29 39 128
.git/objects/pack/pack-b87e61e2dc18ff37624d7f996f1270f923411530.pack: ok

$ touch .git/objects/pack/pack-b87e61e2dc18ff37624d7f996f1270f923411530.keep
$ echo b > b; git add b; git commit -m b
$ git gc
Counting objects: 3, done.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ git verify-pack -v .git/objects/pack/pack-aa817046e43f278d67c6b85962676246f57bb855.pack
3683f870be446c7cc05ffaef9fa06415276e1828 tree   58 65 158
61780798228d17af2d34fce4cfbdf35556832472 blob   2 11 223
647aed0360e964adc5cedb12e0719fb8bfc05867 commit 208 146 12
.git/objects/pack/pack-aa817046e43f278d67c6b85962676246f57bb855.pack: ok

$ git gc
Counting objects: 4, done.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), done.
Total 4 (delta 0), reused 4 (delta 0)
$ git verify-pack -v .git/objects/pack/pack-5f692a665e062dedad7b4baf692517adec37899d.pack
3683f870be446c7cc05ffaef9fa06415276e1828 tree   58 65 158
61780798228d17af2d34fce4cfbdf35556832472 blob   2 11 234
647aed0360e964adc5cedb12e0719fb8bfc05867 commit 208 146 12
78981922613b2afb6025042ff6bd878ac1994e85 blob   2 11 223
.git/objects/pack/pack-5f692a665e062dedad7b4baf692517adec37899d.pack: ok

Mike
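
The duplication in the listings above (blob 78981922... shows up in both
the .keep pack and the last pack) can be spotted mechanically. A rough
sketch, assuming 40-character SHA-1 object names as in this thread; the
function name dup_objects is made up for illustration.

```shell
# Read concatenated "git verify-pack -v" output on stdin and print the
# object names that occur more than once, i.e. are stored in several packs.
dup_objects() {
    awk 'length($1) == 40 && $1 ~ /^[0-9a-f]+$/ { seen[$1]++ }
         END { for (id in seen) if (seen[id] > 1) print id }' | sort
}

# Typical use inside a repository:
#   for idx in .git/objects/pack/*.idx; do
#       git verify-pack -v "$idx"
#   done | dup_objects
```

Lines such as "...pack: ok" are skipped because their first field is not
a 40-character hex string.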

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 21:03               ` Mike Hommey
@ 2008-05-12 21:08                 ` Mike Hommey
  2008-05-13  0:12                   ` Shawn O. Pearce
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Hommey @ 2008-05-12 21:08 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Nicolas Pitre, Johannes Schindelin, git

On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote:
> On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote:
> > But I have experienced the same earlier with some other post-1.5.5
> > version so I believe you can reproduce this yourself. After cloning
> > Linus's linux-2.6 repo its .git directory weights 209MB. After single
> > "git pull" and "git gc" it was 298MB in my test.
> 
> I noticed that a while ago: when repacking multiple packs when one has a
> .keep file, the resulting additional pack contains too many blobs and
> trees, contrary to when only packing loose objects:
(...)

That is, it seems to also contain all the blobs and subtrees for all the
commits the pack contains, even when they already are in the pack having
a .keep file.

Mike

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 21:08                 ` Mike Hommey
@ 2008-05-13  0:12                   ` Shawn O. Pearce
  2008-05-13  5:33                     ` Mike Hommey
  2008-05-14  1:03                     ` Nicolas Pitre
  0 siblings, 2 replies; 35+ messages in thread
From: Shawn O. Pearce @ 2008-05-13  0:12 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Teemu Likonen, Nicolas Pitre, Johannes Schindelin, git

Mike Hommey <mh@glandium.org> wrote:
> On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote:
> > On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote:
> > > But I have experienced the same earlier with some other post-1.5.5
> > > version so I believe you can reproduce this yourself. After cloning
> > > Linus's linux-2.6 repo its .git directory weights 209MB. After single
> > > "git pull" and "git gc" it was 298MB in my test.
> > 
> > I noticed that a while ago: when repacking multiple packs when one has a
> > .keep file, the resulting additional pack contains too many blobs and
> > trees, contrary to when only packing loose objects:
> (...)
> 
> That is, it seems to also contain all the blobs and subtrees for all the
> commits the pack contains, even when they already are in the pack having
> a .keep file.

I've noticed this too.  Like since day 1 when we added .keep.
But uh, nobody else complained and I forgot about it.

My theory (totally unproven) is that the new pack has objects we
copied from the .keep pack, because those objects were the best
delta-bases for the loose objects we have deltafied and want to
store in the new pack.  Except they aren't yet packed in the new
pack, so we pack them too.  Tada, duplicates.  :-\

Suddenly your repository nearly doubles in size if we have most
files/trees change, as those delta bases are copied whole into the
new pack.

-- 
Shawn.

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-13  0:12                   ` Shawn O. Pearce
@ 2008-05-13  5:33                     ` Mike Hommey
  2008-05-14  1:03                     ` Nicolas Pitre
  1 sibling, 0 replies; 35+ messages in thread
From: Mike Hommey @ 2008-05-13  5:33 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Teemu Likonen, Nicolas Pitre, Johannes Schindelin, git

On Mon, May 12, 2008 at 08:12:52PM -0400, Shawn O. Pearce wrote:
> Mike Hommey <mh@glandium.org> wrote:
> > On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote:
> > > On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote:
> > > > But I have experienced the same earlier with some other post-1.5.5
> > > > version so I believe you can reproduce this yourself. After cloning
> > > > Linus's linux-2.6 repo its .git directory weights 209MB. After single
> > > > "git pull" and "git gc" it was 298MB in my test.
> > > 
> > > I noticed that a while ago: when repacking multiple packs when one has a
> > > .keep file, the resulting additional pack contains too many blobs and
> > > trees, contrary to when only packing loose objects:
> > (...)
> > 
> > That is, it seems to also contain all the blobs and subtrees for all the
> > commits the pack contains, even when they already are in the pack having
> > a .keep file.
> 
> I've noticed this too.  Like since day 1 when we added .keep.
> But uh, nobody else complained and I forgot about it.
> 
> My theory (totally unproven) is that the new pack has objects we
> copied from the .keep pack, because those objects were the best
> delta-bases for the loose objects we have deltafied and want to
> store in the new pack.  Except they aren't yet packed in the new
> pack, so we pack them too.  Tada, duplicates.  :-\

Well, that does not seem delta related, since my testcase doesn't show
deltas in the second pack.

Mike

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-13  0:12                   ` Shawn O. Pearce
  2008-05-13  5:33                     ` Mike Hommey
@ 2008-05-14  1:03                     ` Nicolas Pitre
  2008-05-14  6:43                       ` Junio C Hamano
  1 sibling, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-14  1:03 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Mike Hommey, Teemu Likonen, Johannes Schindelin, git

On Mon, 12 May 2008, Shawn O. Pearce wrote:

> Mike Hommey <mh@glandium.org> wrote:
> > On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote:
> > > On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote:
> > > > But I have experienced the same earlier with some other post-1.5.5
> > > > version so I believe you can reproduce this yourself. After cloning
> > > > Linus's linux-2.6 repo its .git directory weights 209MB. After single
> > > > "git pull" and "git gc" it was 298MB in my test.
> > > 
> > > I noticed that a while ago: when repacking multiple packs when one has a
> > > .keep file, the resulting additional pack contains too many blobs and
> > > trees, contrary to when only packing loose objects:
> > (...)
> > 
> > That is, it seems to also contain all the blobs and subtrees for all the
> > commits the pack contains, even when they already are in the pack having
> > a .keep file.
> 
> I've noticed this too.  Like since day 1 when we added .keep.
> But uh, nobody else complained and I forgot about it.

Well, now that I've reproduced Teemu Likonen's test case, I can confirm 
this is actually a problem.  Here I get:

|remote: Counting objects: 523, done.
|remote: Compressing objects: 100% (57/57), done.
|remote: Total 362 (delta 305), reused 362 (delta 305)
|Receiving objects: 100% (362/362), 65.37 KiB, done.
|Resolving deltas: 100% (305/305), completed with 105 local objects.
|From ../test1
|   492c2e4..9404ef0  master     -> master

The received pack is 449135 bytes large.  This is much larger than the 
actually received data which is 65.37 KiB, but we're completing a thin 
pack with 105 undeltified objects accounting for the size increase which 
is expected.  So far so good.

Now, in theory, running 'git gc' should only repack those 362 + 105 
objects, since the remaining ones are all found in the .keep flagged 
pack.  But that's not what's happening at all:

|Counting objects: 26559, done.
|Compressing objects: 100% (24708/24708), done.
|Writing objects: 100% (26559/26559), done.
|Total 26559 (delta 3054), reused 14011 (delta 1613)

So... there is something definitely wrong here.  The expectation was
to get a pack in the same size range as the one received during the
fetch, or somewhat smaller due to a better delta compression of the added
objects.  But instead we get a pack containing 26559 objects!!!  And in
that lot, only 3054 (11%) are deltas.  That makes for a pack that
started from 449135 bytes and grew to 72395940 bytes.

> My theory (totally unproven) is that the new pack has objects we
> copied from the .keep pack, because those objects were the best
> delta-bases for the loose objects we have deltafied and want to
> store in the new pack.  Except they aren't yet packed in the new
> pack, so we pack them too.  Tada, duplicates.  :-\

Well, not exactly.

Let's see what happens here even before any packing is attempted

|$ git rev-list --objects 492c2e4..9404ef0
|362
|
|$ git rev-list --objects --all \
|   --unpacked=pack-6a3438b2702be06697023d80b77e67a73a0b0b5c.pack |
|	wc -l
|26559

So this --unpacked= argument (whose undocumented semantics I still have
issues with) is certainly not doing what is expected.


Nicolas
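
To put the numbers in this message side by side, here is a small
worked calculation. All figures are copied from the output quoted above;
the function name summarize_repack is only illustrative.

```shell
# Compare what "git gc" was expected to repack with what it actually wrote.
summarize_repack() {
    awk 'BEGIN {
        expected = 362 + 105    # fetched objects + thin-pack completions
        actual   = 26559        # objects written by "git gc"
        deltas   = 3054         # of those, stored as deltas
        before   = 449135       # bytes, pack as received
        after    = 72395940     # bytes, pack after "git gc"
        printf "objects: %d expected, %d written (%.1fx)\n",
               expected, actual, actual / expected
        printf "deltas:  %.1f%% of written objects\n", deltas * 100 / actual
        printf "size:    %.1fx larger\n", after / before
    }'
}
summarize_repack
```

So the new pack holds about 57 times as many objects as expected, only
about 11.5% of them as deltas, and is about 161 times larger in bytes.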

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  1:03                     ` Nicolas Pitre
@ 2008-05-14  6:43                       ` Junio C Hamano
  2008-05-14  9:10                         ` Juergen Ruehle
  0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2008-05-14  6:43 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Shawn O. Pearce, Mike Hommey, Teemu Likonen, Johannes Schindelin,
	git

Nicolas Pitre <nico@cam.org> writes:

> Let's see what happens here even before any packing is attempted
>
> |$ git rev-list --objects 492c2e4..9404ef0
> |362
> |
> |$ git rev-list --objects --all \
> |   --unpacked=pack-6a3438b2702be06697023d80b77e67a73a0b0b5c.pack |
> |	wc -l
> |26559
>
> So this --unpacked= argument (which undocumented semantics I still have 
> issues with) is certainly not doing what is expected.

The output from rev-list is not surprising.  --unpacked=$this.pack implies
the usual --unpacked behaviour (i.e. only show unpacked objects by not
traversing into commits that are packed) and at the same time pretends
that objects in $this.pack are loose.

It was meant to be used for a partial incremental repacking.  If you have
a pack to be kept (perhaps a highly packed deep pack that holds the
earlier parts of the history), marked with .keep, and a handful of young
packs, you would give these young ones with --unpacked, so that the
resulting single pack contains all that are loose or in these young
packs.  After that, you can remove all the young packs and loose objects.

At least that is the idea.

I am not sure where that rev-list experiment you showed fits in the bigger
picture, but if that is used for repacking the young packs, perhaps the
issue is that after the repacking the code forgets to remove the young
ones whose objects are now moved into the new pack?
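
The "usual --unpacked behaviour" mentioned above can be observed with the
plain option, as in this sketch. It uses only "git rev-list --objects
--all --unpacked"; the --unpacked=<pack> form is taken from this thread
and may not be available in other git versions. The function name
count_unpacked_objects is hypothetical.

```shell
# Count objects reachable from any ref that are not stored in a pack.
# Run inside a git repository.
count_unpacked_objects() {
    git rev-list --objects --all --unpacked | wc -l | tr -d ' '
}

# Typical use:
#   count_unpacked_objects      # loose objects waiting to be packed
#   git repack -a -d -q
#   count_unpacked_objects      # should drop once everything is packed
```

Comparing this count with the number of objects a repack actually writes
is one way to notice the kind of mismatch reported earlier in the thread.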

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  6:43                       ` Junio C Hamano
@ 2008-05-14  9:10                         ` Juergen Ruehle
  2008-05-14 14:24                           ` Nicolas Pitre
                                             ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Juergen Ruehle @ 2008-05-14  9:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Shawn O. Pearce, Mike Hommey, Teemu Likonen,
	Johannes Schindelin, git

Junio C Hamano writes:
 > The output from rev-list is not surprising.  --unpacked=$this.pack implies
 > the usual --unpacked behaviour (i.e. only show unpacked objects by not
 > traversing into commits that are packed)

The problem is unconditional traversing into commits that are
unpacked. This behavior is immediately obvious if the packed blob in
the .keep pack is large. I've been using the following since the large
object discussion with Dana, but it might be completely broken (though
the test case is probably correct).

--

Previously --unpacked would filter on the commit level, ignoring whether the
objects comprising the commit actually were packed or unpacked.

This makes it impossible to store e.g. excessively large blobs in
different packs from the commits referencing them, since the next repack of
such a commit will suck all referenced blobs into the same pack.

This change moves the unpacked check to the output stage and no longer checks
the flag during commit traversal and adds a trivial test demonstrating the
problem.
---
 Note that t6009 is already taken, so it might be better to merge the test
 into one of the other rev-list tests.

 list-objects.c               |    6 ++++--
 revision.c                   |    2 --
 t/t6009-rev-list-unpacked.sh |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 4 deletions(-)
 create mode 100644 t/t6009-rev-list-unpacked.sh

diff --git a/list-objects.c b/list-objects.c
index c8b8375..b378c0f 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -146,7 +146,8 @@ void traverse_commit_list(struct rev_info *revs,
 
 	while ((commit = get_revision(revs)) != NULL) {
 		process_tree(revs, commit->tree, &objects, NULL, "");
-		show_commit(commit);
+		if (!revs->unpacked || !has_sha1_pack(commit->object.sha1, revs->ignore_packed))
+			show_commit(commit);
 	}
 	for (i = 0; i < revs->pending.nr; i++) {
 		struct object_array_entry *pending = revs->pending.objects + i;
@@ -173,7 +174,8 @@ void traverse_commit_list(struct rev_info *revs,
 		    sha1_to_hex(obj->sha1), name);
 	}
 	for (i = 0; i < objects.nr; i++)
-		show_object(&objects.objects[i]);
+		if (!revs->unpacked || !has_sha1_pack(objects.objects[i].item->sha1, revs->ignore_packed))
+			show_object(&objects.objects[i]);
 	free(objects.objects);
 	if (revs->pending.nr) {
 		free(revs->pending.objects);
diff --git a/revision.c b/revision.c
index 4231ea2..0e90d3b 100644
--- a/revision.c
+++ b/revision.c
@@ -1508,8 +1508,6 @@ enum commit_action simplify_commit(struct rev_info *revs, struct commit *commit)
 {
 	if (commit->object.flags & SHOWN)
 		return commit_ignore;
-	if (revs->unpacked && has_sha1_pack(commit->object.sha1, revs->ignore_packed))
-		return commit_ignore;
 	if (revs->show_all)
 		return commit_show;
 	if (commit->object.flags & UNINTERESTING)
diff --git a/t/t6009-rev-list-unpacked.sh b/t/t6009-rev-list-unpacked.sh
new file mode 100644
index 0000000..6b65e83
--- /dev/null
+++ b/t/t6009-rev-list-unpacked.sh
@@ -0,0 +1,32 @@
+#!/bin/sh
+
+test_description='test git rev-list --unpacked --objects'
+
+. ./test-lib.sh
+
+# Create an unpacked commit that references a packed object.
+
+test_expect_success setup '
+	echo Hallo > foo &&
+	git add foo &&
+	test_tick &&
+	git commit -m "A" &&
+	git gc &&
+	echo Cello > bar &&
+	git add bar &&
+	test_tick &&
+	git commit -m "B"
+'
+
+test_expect_success \
+    'object list should contain foo' '
+    git rev-list --all --objects | grep -q "foo"
+'
+
+test_expect_success \
+    'unpacked object list should not contain foo' '
+    test_must_fail "git rev-list --all --unpacked --objects | grep -q \"foo\""
+'
+
+
+test_done
-- 
1.5.5.1.382.g7d84c


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  9:10                         ` Juergen Ruehle
@ 2008-05-14 14:24                           ` Nicolas Pitre
  2008-05-14 17:03                           ` Junio C Hamano
  2008-05-14 20:06                           ` Linus Torvalds
  2 siblings, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-14 14:24 UTC (permalink / raw)
  To: Juergen Ruehle
  Cc: Junio C Hamano, Shawn O. Pearce, Mike Hommey, Teemu Likonen,
	Johannes Schindelin, git

On Wed, 14 May 2008, Juergen Ruehle wrote:

> Junio C Hamano writes:
>  > The output from rev-list is not surprising.  --unpacked=$this.pack implies
>  > the usual --unpacked behaviour (i.e. only show unpacked objects by not
>  > traversing into commits that are packed)
> 
> The problem is unconditional traversing into commits that are
> unpacked. This behavior is immediately obvious if the packed blob in
> the .keep pack is large. 

That's what I was suspecting too.  And because the Linux repo contains
many files, a single commit will indeed pull in a large bunch of
objects.

> I've been using the following since the large
> object discussion with Dana, but it might be completely broken (though
> the test case is probably correct).

This is not some part of git code I'm familiar with, so I can't tell if 
the patch is broken or not.  What I can do is repeat my simple test 
which produces the following results with your patch:

|$ git rev-list --objects 492c2e4..9404ef0
|362
|
|$ git rev-list --objects --all \
|   --unpacked=pack-6a3438b2702be06697023d80b77e67a73a0b0b5c.pack |
|       wc -l
|362

That's exactly what is expected.


Nicolas


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  9:10                         ` Juergen Ruehle
  2008-05-14 14:24                           ` Nicolas Pitre
@ 2008-05-14 17:03                           ` Junio C Hamano
  2008-05-14 20:06                           ` Linus Torvalds
  2 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2008-05-14 17:03 UTC (permalink / raw)
  To: Juergen Ruehle, Linus Torvalds
  Cc: Nicolas Pitre, Shawn O. Pearce, Mike Hommey, Teemu Likonen,
	Johannes Schindelin, git

Juergen Ruehle <j.ruehle@bmiag.de> writes:

> Previously --unpacked would filter on the commit level, ignoring whether the
> objects comprising the commit actually were packed or unpacked.
>
> This makes it impossible to store e.g. excessively large blobs in
> different packs from the commits referencing them, since the next repack of
> such a commit will suck all referenced blobs into the same pack.

Doesn't this patch essentially make the --unpacked option to rev-list and
the --incremental option to pack-objects the same thing?

The semantics of --unpacked have been defined that way from the very
beginning, and I've always wondered how the option and --incremental
should interact with each other.  I think the approach your patch takes
makes sense.

> This change moves the unpacked check to the output stage and no longer checks
> the flag during commit traversal and adds a trivial test demonstrating the
> problem.

Sign-off?

> diff --git a/t/t6009-rev-list-unpacked.sh b/t/t6009-rev-list-unpacked.sh
> new file mode 100644
> index 0000000..6b65e83
> --- /dev/null
> +++ b/t/t6009-rev-list-unpacked.sh
> @@ -0,0 +1,32 @@
> ...
> +test_expect_success \
> +    'unpacked object list should not contain foo' '
> +    test_must_fail "git rev-list --all --unpacked --objects | grep -q \"foo\""
> +'

Ahhh.  Ugly but don't you mean "! (rev-list | grep)"?
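
To spell out the remark above: test_must_fail runs its arguments as a
single command, so a pipeline passed as one quoted string is never
executed as a pipeline. The sketch below uses must_fail as a simplified
stand-in (an assumption; it is not the real helper from test-lib.sh) and
list_objects as a fake rev-list whose output does contain foo.

```shell
#!/bin/sh
# Simplified stand-in for test_must_fail (an assumption; the real helper
# does more): run "$@" and succeed if it exits non-zero.
must_fail() {
    "$@"
    test $? -ne 0
}

# Fake "git rev-list --all --unpacked --objects" whose output wrongly
# contains foo, so a correct check has to report a failure.
list_objects() {
    echo "1234567 foo"
}

# Form used in the patch: the whole pipeline is one quoted word, so the
# shell looks for a command with that literal name, does not find it, and
# the check "passes" without grep ever running.
must_fail "list_objects | grep -q foo" 2>/dev/null
quoted_form=$?

# Suggested form: the pipeline really runs and its result is negated.
! (list_objects | grep -q foo)
negated_form=$?

echo "quoted form: $quoted_form, negated form: $negated_form"
```

The quoted form ends with status 0 (the test would pass although foo is
listed), while the negated form ends with status 1 and catches the
problem.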


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  9:10                         ` Juergen Ruehle
  2008-05-14 14:24                           ` Nicolas Pitre
  2008-05-14 17:03                           ` Junio C Hamano
@ 2008-05-14 20:06                           ` Linus Torvalds
  2008-05-14 20:19                             ` Linus Torvalds
  2 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2008-05-14 20:06 UTC (permalink / raw)
  To: Juergen Ruehle
  Cc: Junio C Hamano, Nicolas Pitre, Shawn O. Pearce, Mike Hommey,
	Teemu Likonen, Johannes Schindelin, git



On Wed, 14 May 2008, Juergen Ruehle wrote:
> 
> Previously --unpacked would filter on the commit level, ignoring whether the
> objects comprising the commit actually were packed or unpacked.

I think this patch is correct, but I wonder why you removed the pruning 
from revision.c? Why do we want to process trees for commits that aren't 
going to be shown? This is going to slow down things a lot, and we've long 
had the rule that commits have to be complete in the packs that are kept 
(ie you should never have a pack-file that points to an unpacked object).

So I'd suggest a slightly less intrusive patch (untested!!) instead, which 
leaves the commit object logic alone.

(Your test-case should obviously be merged regardless)

		Linus

---
 list-objects.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index c8b8375..8cb05ca 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -172,8 +172,12 @@ void traverse_commit_list(struct rev_info *revs,
 		die("unknown pending object %s (%s)",
 		    sha1_to_hex(obj->sha1), name);
 	}
-	for (i = 0; i < objects.nr; i++)
-		show_object(&objects.objects[i]);
+	for (i = 0; i < objects.nr; i++) {
+		struct object_array_entry *entry = &objects.objects[i];
+		if (revs->unpacked && has_sha1_pack(entry->item->sha1, revs->ignore_packed))
+			continue;
+		show_object(entry);
+	}
 	free(objects.objects);
 	if (revs->pending.nr) {
 		free(revs->pending.objects);


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14 20:06                           ` Linus Torvalds
@ 2008-05-14 20:19                             ` Linus Torvalds
  2008-05-14 20:29                               ` Nicolas Pitre
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2008-05-14 20:19 UTC (permalink / raw)
  To: Juergen Ruehle
  Cc: Junio C Hamano, Nicolas Pitre, Shawn O. Pearce, Mike Hommey,
	Teemu Likonen, Johannes Schindelin, git

On Wed, 14 May 2008, Linus Torvalds wrote:
> 
> I think this patch is correct, but I wonder why you removed the pruning 
> from revision.c?

In fact, it might be a good idea to not just keep it in revision.c, but 
move it up a bit, so that a commit that is packed and should be ignored 
won't even have its parents put on the list (which means that we not only 
ignore the trees in that commit, but also all parents).

Of course, the more aggressively we prune, the more we end up having to 
depend on the fact that a commit that is in a pack that is marked "keep" 
must *always* have everything that leads to it in that pack or others also 
marked "keep". We effectively have that already (because we've always 
pruned away the commits early), but it's a thing to keep in mind whenever 
we prune even more aggressively.

		Linus


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14 20:19                             ` Linus Torvalds
@ 2008-05-14 20:29                               ` Nicolas Pitre
  2008-05-14 20:36                                 ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-14 20:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juergen Ruehle, Junio C Hamano, Shawn O. Pearce, Mike Hommey,
	Teemu Likonen, Johannes Schindelin, git

On Wed, 14 May 2008, Linus Torvalds wrote:

> Of course, the more aggressively we prune, the more we end up having to 
> depend on the fact that a commit that is in a pack that is marked "keep" 
> must *always* have everything that leads to it in that pack or others also 
> marked "keep". We effectively have that already (because we've always 
> pruned away the commits early), but it's a thing to keep in mind whenever 
> we prune even more aggressively.

I wonder if this is a good thing.  Such a rule would effectively put 
restrictions on how objects like big blobs could be distributed amongst 
many .keep packs.  I just hope we're not painting ourselves into a corner.


Nicolas


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14 20:29                               ` Nicolas Pitre
@ 2008-05-14 20:36                                 ` Linus Torvalds
  2008-05-14 23:24                                   ` A Large Angry SCM
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2008-05-14 20:36 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Juergen Ruehle, Junio C Hamano, Shawn O. Pearce, Mike Hommey,
	Teemu Likonen, Johannes Schindelin, git

On Wed, 14 May 2008, Nicolas Pitre wrote:

> On Wed, 14 May 2008, Linus Torvalds wrote:
> 
> > Of course, the more aggressively we prune, the more we end up having to 
> > depend on the fact that a commit that is in a pack that is marked "keep" 
> > must *always* have everything that leads to it in that pack or others also 
> > marked "keep". We effectively have that already (because we've always 
> > pruned away the commits early), but it's a thing to keep in mind whenever 
> > we prune even more aggressively.
> 
> I wonder if this is a good thing.  Such a rule would effectively put 
> restrictions on how objects like big blobs could be distributed amongst 
> many .keep packs.  I just hope we're not painting ourselves into a corner.

You can distribute big objects arbitrarily among many .keep packs, but 
what you can *NOT* do (and which has _always_ been a bug to do) is to have 
a *.keep pack that refers to objects that are not in a .keep pack!

So keep<->keep you can do anything you want, and distribute objects any 
way.

But a keep pack must only refer to objects in itself or in other keep 
packs.

Because otherwise, if we ever hit an object in a keep pack, we'll stop 
even looking further when we use --unpacked. And that has always been true 
(admittedly only for "commit" objects, but those are the ones that most 
commonly refer to other objects, so ..)

			Linus


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14 20:36                                 ` Linus Torvalds
@ 2008-05-14 23:24                                   ` A Large Angry SCM
  0 siblings, 0 replies; 35+ messages in thread
From: A Large Angry SCM @ 2008-05-14 23:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Juergen Ruehle, Junio C Hamano, Shawn O. Pearce,
	Mike Hommey, Teemu Likonen, Johannes Schindelin, git

Linus Torvalds wrote:
> 
> On Wed, 14 May 2008, Nicolas Pitre wrote:
> 
>> On Wed, 14 May 2008, Linus Torvalds wrote:
>>
>>> Of course, the more aggressively we prune, the more we end up having to 
>>> depend on the fact that a commit that is in a pack that is marked "keep" 
>>> must *always* have everything that leads to it in that pack or others also 
>>> marked "keep". We effectively have that already (because we've always 
>>> pruned away the commits early), but it's a thing to keep in mind whenever 
>>> we prune even more aggressively.
>> I wonder if this is a good thing.  Such a rule would effectively put 
>> restrictions on how objects like big blobs could be distributed amongst 
>> many .keep packs.  I just hope we're not painting ourselves into a corner.
> 
> You can distribute big objects arbitrarily among many .keep packs, but 
> what you can *NOT* do (and which has _always_ been a bug to do) is to have 
> a *.keep pack that refers to objects that are not in a .keep pack!
> 
> So keep<->keep you can do anything you want, and distribute objects any 
> way.
> 
> But a keep pack must only refer to objects in itself or in other keep 
> packs.
> 
> Because otherwise, if we ever hit an object in a keep pack, we'll stop 
> even looking further when we use --unpacked. And that has always been true 
> (admittedly only for "commit" objects, but those are the ones that most 
> commonly refer to other objects, so ..)

Sounds like git-fsck needs to start checking for this.
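
A check for the rule Linus states could look roughly like the following.
It is only a sketch of the idea, not what git-fsck does: the two input
files (a list of object ids stored in .keep packs and a list of
"referrer referent" pairs) are hypothetical and would have to be
produced by other means.

```shell
#!/bin/sh
# check_keep_closure KEPT EDGES
#   KEPT:  file with one object id per line (objects stored in .keep packs)
#   EDGES: file with "referrer referent" pairs, one pair per line
# Prints every reference that leads from a kept object to a non-kept
# object; the exit status is non-zero if at least one was found.
# Both file formats are assumptions made for this sketch.
check_keep_closure() {
    awk -v keptfile="$1" '
        BEGIN {
            # Load the set of kept object ids.
            while ((getline line < keptfile) > 0)
                kept[line] = 1
            close(keptfile)
        }
        # Only references that start inside a .keep pack matter.
        ($1 in kept) && !($2 in kept) {
            print $1 " -> " $2
            bad = 1
        }
        END { exit bad }
    ' "$2"
}
```

References between kept objects are fine, and objects outside the .keep
packs may point anywhere; only a kept object that refers to a non-kept
one is reported.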


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 20:24             ` Teemu Likonen
  2008-05-12 21:03               ` Mike Hommey
@ 2008-05-12 21:07               ` Nicolas Pitre
  1 sibling, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-12 21:07 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Johannes Schindelin, git

On Mon, 12 May 2008, Teemu Likonen wrote:

> I'll send you the .git/logs directory but I'm afraid it doesn't tell
> much. There are just three files:
> 
>   .git/logs/HEAD
>   .git/logs/refs/heads/master
>   .git/logs/refs/remotes/origin/master
> 
> They contain one line for the initial clone and one line for
> the fast-forward pull.

That's what I want.  This way I should be able to reproduce your exact 
case.


Nicolas


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 15:52 ` Teemu Likonen
  2008-05-12 17:13   ` Johannes Schindelin
@ 2008-05-12 17:17   ` David Tweed
  2008-05-12 23:49     ` Shawn O. Pearce
  1 sibling, 1 reply; 35+ messages in thread
From: David Tweed @ 2008-05-12 17:17 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: git

On Mon, May 12, 2008 at 4:52 PM, Teemu Likonen <tlikonen@iki.fi> wrote:
> Teemu Likonen wrote (2008-05-12 15:29 +0300):
> Probably a crazy idea: What if "gc --aggressive" first removed *.keep
> files and after packing and garbage-collecting and whatever it does it
> would add a .keep file for the newly created pack?

My understanding is that the repacking with -a redoes the computation
to repack ALL the objects in every pack and loose objects, whereas
what would be preferred is to try to delta new objects (loose and
packed) against the existing .keep pack (extending it with the new
objects) but not trying to re-deltify objects in the .keep pack. This
is because .keep files are primarily for those who are cloning onto a
machine that isn't powerful (maybe even a laptop/palmtop) but who are
cloning from a powerful server, so that you wouldn't necessarily want
to apply your strategy unconditionally.

-- 
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 17:17   ` David Tweed
@ 2008-05-12 23:49     ` Shawn O. Pearce
  2008-05-12 23:53       ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Shawn O. Pearce @ 2008-05-12 23:49 UTC (permalink / raw)
  To: David Tweed; +Cc: Teemu Likonen, git

David Tweed <david.tweed@gmail.com> wrote:
> On Mon, May 12, 2008 at 4:52 PM, Teemu Likonen <tlikonen@iki.fi> wrote:
> > Teemu Likonen wrote (2008-05-12 15:29 +0300):
> > Probably a crazy idea: What if "gc --aggressive" first removed *.keep
> > files and after packing and garbage-collecting and whatever it does it
> > would add a .keep file for the newly created pack?
> 
> My understanding is that the repacking with -a redoes the computation
> to repack ALL the objects in every pack and loose objects,

No.  -a means repack all objects in all packs which do not have a
.keep on them.  Without -a we only repack loose objects.

> whereas
> what would be preferred is to try to delta new objects (loose and
> packed) against the existing .keep pack (extending it with the new
> objects) but not trying to re-deltify objects in the .keep pack.

We cannot do that.  Deltas in pack A may not reference base objects
in pack B.  This is a simplification rule that prevents us from
needing to worry about damaging a pack when we repack and delete
another pack.

> This
> is because .keep files are primarily for those who are cloning onto a
> machine that isn't powerful (maybe even a laptop/palmtop) but who are
> cloning from a powerful server, so that you wouldn't necessarily want
> to apply your strategy unconditionally.

Yes, sort of.  We use .keep for two reasons:

  - As a "lock file" to prevent a pack that was just created by a
    git-fetch or git-receive-pack from being deleted by a concurrent
    git-repack before the objects it contains are linked into the
    refs space and thus considered reachable;

  - As a way to avoid _huge_ packs (say >1G) that would take a lot
    of disk IO just to copy with 100% delta reuse from an old pack
    to a new pack each time the user runs git-gc.

I think git-clone marking a 150M linux-2.6 pack with .keep is wrong;
most users working with the linux-2.6 sources have sufficient
hardware to deal with the disk IO required to copy that with 100%
delta reuse.  But I have a repository at day-job with a 600M pack,
that's starting to head into the realm where git-gc while running
on battery on a laptop would prefer to have that .keep.

-- 
Shawn.


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 23:49     ` Shawn O. Pearce
@ 2008-05-12 23:53       ` Junio C Hamano
  2008-05-13  0:09         ` Shawn O. Pearce
  0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2008-05-12 23:53 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: David Tweed, Teemu Likonen, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> David Tweed <david.tweed@gmail.com> wrote:
>> On Mon, May 12, 2008 at 4:52 PM, Teemu Likonen <tlikonen@iki.fi> wrote:
>> > Teemu Likonen wrote (2008-05-12 15:29 +0300):
>> > Probably a crazy idea: What if "gc --aggressive" first removed *.keep
>> > files and after packing and garbage-collecting and whatever it does it
>> > would add a .keep file for the newly created pack?
>> 
>> My understanding is that the repacking with -a redoes the computation
>> to repack ALL the objects in every pack and loose objects,
>
> No.  -a means repack all objects in all packs which do not have a
> .keep on them.  Without -a we only repack loose objects.
>
>> whereas
>> what would be preferred is to try to delta new objects (loose and
>> packed) against the existing .keep pack (extending it with the new
>> objects) but not trying to re-deltify objects in the .keep pack.
>
> We cannot do that.  Deltas in pack A may not reference base objects
> in pack B.  This is a simplification rule that prevents us from
> needing to worry about damaging a pack when we repack and delete
> another pack.
>
>> This
>> is because .keep files are primarily for those who are cloning onto a
>> machine that isn't powerful (maybe even a laptop/palmtop) but who are
>> cloning from a powerful server, so that you wouldn't necessarily want
>> to apply your strategy unconditionally.
>
> Yes, sort of.  We use .keep for two reasons:
>
>   - As a "lock file" to prevent a pack that was just created by a
>     git-fetch or git-receive-pack from being deleted by a concurrent
>     git-repack before the objects it contains are linked into the
>     refs space and thus considered reachable;
>
>   - As a way to avoid _huge_ packs (say >1G) that would take a lot
>     of disk IO just to copy with 100% delta reuse from an old pack
>     to a new pack each time the user runs git-gc.
>
> I think git-clone marking a 150M linux-2.6 pack with .keep is wrong;
> most users working with the linux-2.6 sources have sufficient
> hardware to deal with the disk IO required to copy that with 100%
> delta reuse.  But I have a repository at day-job with a 600M pack,
> that's starting to head into the realm where git-gc while running
> on battery on a laptop would prefer to have that .keep.

Perhaps clone can decide to keep the .keep file depending on the size of
the pack then?
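
A size-based decision could be sketched like this. drop_small_keeps is a
made-up helper, not something git-clone provides; the cut-off is left as
a parameter, and it must only run once the refs are in place, since
.keep also serves as a lock for a fetch in progress.

```shell
#!/bin/sh
# drop_small_keeps PACKDIR CUTOFF_BYTES
# Remove the .keep marker of every pack in PACKDIR that is smaller than
# CUTOFF_BYTES, so that later repacks may fold those packs together.
# Hypothetical helper; packs at or above the cut-off keep their marker.
drop_small_keeps() {
    packdir=$1
    cutoff=$2
    for keep in "$packdir"/pack-*.keep; do
        [ -e "$keep" ] || continue      # the glob matched nothing
        pack=${keep%.keep}.pack
        [ -f "$pack" ] || continue      # stray .keep without a pack
        size=$(wc -c <"$pack" | tr -d ' ')
        if [ "$size" -lt "$cutoff" ]; then
            rm -f "$keep"
        fi
    done
}
```

For example, "drop_small_keeps .git/objects/pack $((512 * 1024 * 1024))"
would unmark every pack below 512M and leave larger ones alone.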


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 23:53       ` Junio C Hamano
@ 2008-05-13  0:09         ` Shawn O. Pearce
  2008-05-13  5:08           ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Shawn O. Pearce @ 2008-05-13  0:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Tweed, Teemu Likonen, git

Junio C Hamano <gitster@pobox.com> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> >
> > I think git-clone marking a 150M linux-2.6 pack with .keep is wrong;
> > most users working with the linux-2.6 sources have sufficient
> > hardware to deal with the disk IO required to copy that with 100%
> > delta reuse.  But I have a repository at day-job with a 600M pack,
> > that's starting to head into the realm where git-gc while running
> > on battery on a laptop would prefer to have that .keep.
> 
> Perhaps clone can decide to keep the .keep file depending on the size of
> the pack then?

Yea, I think that's the better thing to do here.  I'm not sure where
the cut-off is; maybe for packs <512M, delete the .keep once the refs
are in place and the objects are ensured to be reachable.

Of course this does not fix the issue Nico was looking at.
We shouldn't be seeing a 98M explosion with objects duplicated
from the .keep pack into the new pack.

-- 
Shawn.


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-13  0:09         ` Shawn O. Pearce
@ 2008-05-13  5:08           ` Paolo Bonzini
  2008-05-13  5:22             ` Shawn O. Pearce
  2008-05-13  9:22             ` Teemu Likonen
  0 siblings, 2 replies; 35+ messages in thread
From: Paolo Bonzini @ 2008-05-13  5:08 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Junio C Hamano, David Tweed, Teemu Likonen, git

Shawn O. Pearce wrote:
> Junio C Hamano <gitster@pobox.com> wrote:
>> "Shawn O. Pearce" <spearce@spearce.org> writes:
>>> I think git-clone marking a 150M linux-2.6 pack with .keep is wrong;
>>> most users working with the linux-2.6 sources have sufficient
>>> hardware to deal with the disk IO required to copy that with 100%
>>> delta reuse.  But I have a repository at day-job with a 600M pack,
>>> that's starting to head into the realm where git-gc while running
>>> on battery on a laptop would prefer to have that .keep.
>> Perhaps clone can decide to keep the .keep file depending on the size of
>> the pack then?
> 
> Yea, I think that's the better thing to do here.  I'm not sure where
> the cut-off is; maybe for packs <512M, delete the .keep once the refs
> are in place and the objects are ensured to be reachable.

I think separate cutoffs should be in place for file size and number of 
objects.  Very tight packs probably require hours to repack as efficiently.

By the way, another scenario where I used pack files is when I can only
distribute via http because of firewalls.  I make a clone of the
original repository and mark the pack as keep; then I push to the
distribution site, gc, and mark the resulting pack as keep; then a daily
cron job runs git-gc.  This way I know that the user will only have to
download the third pack.  I think I'll modify the cron job to mark as
keep the packs that exceed 2 megabytes or something like that.

Thinking about both use cases, the best would be to have options (common 
to git-clone, git-remote add, git-gc at least; and available via config 
keys too) like

   --keep-packs[=THRES1,THRES2,...]

where:

- exceeding any one threshold would be enough to mark a pack as keep
- thresholds could be in the form "\d+[kmg]?b" for file size, 
"\d+[kmg]?" for number of objects.
- if no threshold is given, the default could be --keep-packs=100k,512MB 
or whatever is in the config.
- to mark all packs, use --keep-packs=0
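
To make the proposed threshold syntax concrete, here is a rough parser
for it. --keep-packs does not exist in git; the sketch also assumes that
suffixes are case-insensitive (the example writes 512MB), that k/m/g
mean powers of 1024 for sizes and powers of 1000 for object counts, and
that reaching a threshold (>=) counts as exceeding it, so that 0 marks
every pack.

```shell
#!/bin/sh
# parse_threshold SPEC
# Prints "size BYTES" or "objects COUNT"; returns 1 if SPEC is malformed.
parse_threshold() {
    spec=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $spec in
        *b) kind=size;    spec=${spec%b}; unit=1024 ;;
        *)  kind=objects; unit=1000 ;;
    esac
    case $spec in
        *k) mult=$unit;                   num=${spec%k} ;;
        *m) mult=$((unit * unit));        num=${spec%m} ;;
        *g) mult=$((unit * unit * unit)); num=${spec%g} ;;
        *)  mult=1;                       num=$spec ;;
    esac
    case $num in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Strip leading zeros so the shell does not read the number as octal.
    while [ "${num#0}" != "$num" ] && [ "${#num}" -gt 1 ]; do
        num=${num#0}
    done
    echo "$kind $((num * mult))"
}

# should_keep PACK_BYTES PACK_OBJECTS THRESHOLDS
# THRESHOLDS is a comma-separated list such as "100k,512MB". Returns 0 if
# the pack reaches at least one threshold, 1 if it reaches none, and 2 if
# a threshold cannot be parsed.
should_keep() {
    size=$1
    count=$2
    old_ifs=$IFS
    IFS=,
    set -- $3
    IFS=$old_ifs
    for t in "$@"; do
        parsed=$(parse_threshold "$t") || return 2
        value=${parsed#* }
        case $parsed in
            size*)    [ "$size" -ge "$value" ] && return 0 ;;
            objects*) [ "$count" -ge "$value" ] && return 0 ;;
        esac
    done
    return 1
}
```

With these assumptions, "should_keep 1000 200000 100k,512MB" succeeds
because the object count reaches 100k even though the pack is tiny, which
matches the "one threshold is enough" rule.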


Paolo


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-13  5:08           ` Paolo Bonzini
@ 2008-05-13  5:22             ` Shawn O. Pearce
  2008-05-13  9:22             ` Teemu Likonen
  1 sibling, 0 replies; 35+ messages in thread
From: Shawn O. Pearce @ 2008-05-13  5:22 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Junio C Hamano, David Tweed, Teemu Likonen, git

Paolo Bonzini <bonzini@gnu.org> wrote:
> Shawn O. Pearce wrote:
> >Junio C Hamano <gitster@pobox.com> wrote:
> >>Perhaps clone can decide to keep the .keep file depending on the size of
> >>the pack then?
> >
> >Yea, I think that's the better thing to do here.  I'm not sure where
> >the cut-off is; maybe for packs <512M, delete the .keep once the refs
> >are in place and the objects are ensured to be reachable.
> 
> I think separate cutoffs should be in place for file size and number of 
> objects.  Very tight packs probably require hours to repack as efficiently.

So long as you don't use `gc --aggressive` or `repack -f` the
tightness of a pack doesn't matter; delta reuse means we copy the
tight delta from the source pack to the new destination pack.

However, you are correct that the more objects in the source pack
the longer it will take to compute what is reachable, which does
extend the time needed for even a simple git-gc.
 
-- 
Shawn.


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-13  5:08           ` Paolo Bonzini
  2008-05-13  5:22             ` Shawn O. Pearce
@ 2008-05-13  9:22             ` Teemu Likonen
  2008-05-13 21:46               ` Stephen R. van den Berg
  1 sibling, 1 reply; 35+ messages in thread
From: Teemu Likonen @ 2008-05-13  9:22 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Shawn O. Pearce, Junio C Hamano, David Tweed, git

Paolo Bonzini wrote (2008-05-13 07:08 +0200):

> I think separate cutoffs should be in place for file size and number
> of  objects.  Very tight packs probably require hours to repack as
> efficiently.
[...]
> Thinking about both use cases, the best would be to have options
> (common  to git-clone, git-remote add, git-gc at least; and available
> via config  keys too) like
>
>   --keep-packs[=THRES1,THRES2,...]

Some thoughts from user interface's point of view. Two assumptions:

  - gc is a daily or weekly operation
  - gc --aggressive is more like a weekly or monthly operation.

In big repositories gc can feel pretty slow if there are no .keep packs
and the user runs the command daily. So I think there's a point in
having a .keep pack in repositories the size of linux-2.6, for example.
But at the same time I think it would be nice to have an easy way in the
UI to repack with better disk space optimization.

This started as a crazy idea but maybe it's not so crazy, so I'll
rephrase my previous suggestion. As its final stage, gc --aggressive
would add a new .keep file which contains an identifier like

  This .keep file was added by "gc --aggressive" and
  will be automatically deleted at next run.

(Or something like that, you get the idea.)

First, gc --aggressive looks for .keep files with such an identifier and
deletes them if found. Then it proceeds normally and finally adds a new
.keep file with the same identifier.

This way the "daily" gc would operate very fast (as it leaves .keep
packs alone), and with gc --aggressive the user could easily decide when
to create new landmark .keep packs (and also prune possible dangling
objects inside previous .keep packs). Normal users don't need to know the
details. Just run gc occasionally and maybe gc --aggressive when better
optimization is needed.

How does this sound?
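
The proposed sequence could be prototyped outside of git like this. It
is a sketch under assumptions: the marker text, the function names and
the idea of passing the gc command as extra arguments are all invented
here. Because only .keep files containing the marker are touched, a
.keep written by fetch-pack or by hand stays in place.

```shell
#!/bin/sh
# Marker line that identifies automatically created .keep files
# (hypothetical wording).
AUTO_KEEP_MARK='auto-keep: added by gc --aggressive, removed at next run'

# Step 1: delete only those .keep files that carry the marker.
drop_auto_keeps() {
    for keep in "$1"/pack-*.keep; do
        [ -f "$keep" ] || continue
        if grep -qxF "$AUTO_KEEP_MARK" "$keep"; then
            rm -f "$keep"
        fi
    done
}

# Step 3: give every pack that has no .keep yet a marked one.
add_auto_keeps() {
    for pack in "$1"/pack-*.pack; do
        [ -f "$pack" ] || continue
        keep=${pack%.pack}.keep
        [ -e "$keep" ] || printf '%s\n' "$AUTO_KEEP_MARK" >"$keep"
    done
}

# aggressive_gc_with_landmark PACKDIR [COMMAND...]
# Runs the three steps in order; COMMAND is the actual repacking step,
# for example: git gc --aggressive
aggressive_gc_with_landmark() {
    packdir=$1
    shift
    drop_auto_keeps "$packdir"
    if [ "$#" -gt 0 ]; then
        "$@" || return 1
    fi
    add_auto_keeps "$packdir"
}
```

A run such as "aggressive_gc_with_landmark .git/objects/pack git gc
--aggressive" would drop the previous landmark, repack, and mark the
result as the new landmark, while the plain "daily" gc never calls it.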


* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-13  9:22             ` Teemu Likonen
@ 2008-05-13 21:46               ` Stephen R. van den Berg
  2008-05-14  5:42                 ` Teemu Likonen
  0 siblings, 1 reply; 35+ messages in thread
From: Stephen R. van den Berg @ 2008-05-13 21:46 UTC (permalink / raw)
  To: Teemu Likonen
  Cc: Paolo Bonzini, Shawn O. Pearce, Junio C Hamano, David Tweed, git

Teemu Likonen wrote:
>This way the "daily" gc would operate very fast (as it leaves .keep
>packs alone), and with gc --aggressive user could easily decide when to
>create new landmark .keep packs (and also prune possible dangling
>objects inside previous .keep packs). Normal user don't need to know the
>details. Just run gc occasionally and maybe gc --aggressive when better
>optimization is needed.

>How does this sound?

It sounds sound :-).
I like the simplicity.
-- 
Sincerely,                                                          srb@cuci.nl
           Stephen R. van den Berg.

* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-13 21:46               ` Stephen R. van den Berg
@ 2008-05-14  5:42                 ` Teemu Likonen
  0 siblings, 0 replies; 35+ messages in thread
From: Teemu Likonen @ 2008-05-14  5:42 UTC (permalink / raw)
  To: Stephen R. van den Berg
  Cc: Paolo Bonzini, Shawn O. Pearce, Junio C Hamano, David Tweed, git

Stephen R. van den Berg wrote (2008-05-14 00:46 +0300):

> Teemu Likonen wrote:
> >This way the "daily" gc would operate very fast (as it leaves .keep
> >packs alone), and with gc --aggressive user could easily decide when to
> >create new landmark .keep packs (and also prune possible dangling
> >objects inside previous .keep packs). Normal user don't need to know the
> >details. Just run gc occasionally and maybe gc --aggressive when better
> >optimization is needed.
> 
> >How does this sound?
> 
> It sounds sound :-).
> I like the simplicity.

It turned out that gc --aggressive is not what I thought it was, i.e.
"pack aggressively and efficiently". So my suggestion assumes semantics
under which --aggressive would actually compress effectively.

Thread overview: 35+ messages
2008-05-12 12:29 Why repository grows after "git gc"? / Purpose of *.keep files? Teemu Likonen
2008-05-12 15:52 ` Teemu Likonen
2008-05-12 17:13   ` Johannes Schindelin
2008-05-12 18:43     ` Teemu Likonen
2008-05-12 18:56       ` Nicolas Pitre
2008-05-12 19:09         ` Teemu Likonen
2008-05-12 19:36           ` Nicolas Pitre
2008-05-12 20:10             ` Govind Salinas
2008-05-12 21:06               ` Nicolas Pitre
2008-05-12 21:07                 ` Govind Salinas
2008-05-12 20:24             ` Teemu Likonen
2008-05-12 21:03               ` Mike Hommey
2008-05-12 21:08                 ` Mike Hommey
2008-05-13  0:12                   ` Shawn O. Pearce
2008-05-13  5:33                     ` Mike Hommey
2008-05-14  1:03                     ` Nicolas Pitre
2008-05-14  6:43                       ` Junio C Hamano
2008-05-14  9:10                         ` Juergen Ruehle
2008-05-14 14:24                           ` Nicolas Pitre
2008-05-14 17:03                           ` Junio C Hamano
2008-05-14 20:06                           ` Linus Torvalds
2008-05-14 20:19                             ` Linus Torvalds
2008-05-14 20:29                               ` Nicolas Pitre
2008-05-14 20:36                                 ` Linus Torvalds
2008-05-14 23:24                                   ` A Large Angry SCM
2008-05-12 21:07               ` Nicolas Pitre
2008-05-12 17:17   ` David Tweed
2008-05-12 23:49     ` Shawn O. Pearce
2008-05-12 23:53       ` Junio C Hamano
2008-05-13  0:09         ` Shawn O. Pearce
2008-05-13  5:08           ` Paolo Bonzini
2008-05-13  5:22             ` Shawn O. Pearce
2008-05-13  9:22             ` Teemu Likonen
2008-05-13 21:46               ` Stephen R. van den Berg
2008-05-14  5:42                 ` Teemu Likonen
