* Why repository grows after "git gc"? / Purpose of *.keep files? @ 2008-05-12 12:29 Teemu Likonen 2008-05-12 15:52 ` Teemu Likonen 0 siblings, 1 reply; 35+ messages in thread From: Teemu Likonen @ 2008-05-12 12:29 UTC (permalink / raw) To: git I have noticed that after cloning a repository (via git protocol) the repo is packed pretty tightly and takes relatively small amount of disk space. After using it a while and running "git gc" the repo sometimes grows 25% or something like that. For testing purposes I deleted objects/pack/*.keep file(s) and ran "git gc" again. The repo resulted in small again, just like after the initial clone. I don't have disk space problems but a repo growing about 25% after manual "git gc" seems weird. What's the purpose of these *.keep files? They just contain text like "fetch-pack <number> on <my hostname>". PS. I have merged Brandon Casey's new git-gc/repack patches. In case it has some effect. See the "pu" branch or "git log 9e7d5019". ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 12:29 Why repository grows after "git gc"? / Purpose of *.keep files? Teemu Likonen @ 2008-05-12 15:52 ` Teemu Likonen 2008-05-12 17:13 ` Johannes Schindelin 2008-05-12 17:17 ` David Tweed 0 siblings, 2 replies; 35+ messages in thread From: Teemu Likonen @ 2008-05-12 15:52 UTC (permalink / raw) To: git Teemu Likonen wrote (2008-05-12 15:29 +0300): > For testing purposes I deleted objects/pack/*.keep file(s) and ran > "git gc" again. The repo resulted in small again, just like after the > initial clone. After playing with test repo a while it seems that "git gc" never touches pack files which have accompanying .keep file around. (And it's common to have a .keep file after "git clone".) This makes gc perform faster. A side effect seems to be that objects which later become unreferenced in those pack-files-with-.keep are never pruned. *.keep files also seem to prevent from really aggressively optimizing the repository's size. Probably a crazy idea: What if "gc --aggressive" first removed *.keep files and after packing and garbage-collecting and whatever it does it would add a .keep file for the newly created pack? ^ permalink raw reply [flat|nested] 35+ messages in thread
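A minimal sketch for checking the observation above on a given repository: since "git gc" appears to skip any pack that has a matching .keep file, listing those files shows which packs are affected. The helper name kept_packs is hypothetical (it is not a git command), and the default path assumes a non-bare repository with the usual .git/objects/pack layout.

```shell
#!/bin/sh
# kept_packs [DIR]: print every pack in DIR that has a matching .keep file.
# kept_packs is a made-up helper name; DIR defaults to .git/objects/pack,
# which assumes the current directory is the top of a non-bare repository.
kept_packs() {
    dir=${1:-.git/objects/pack}
    for keep in "$dir"/pack-*.keep; do
        # If the glob matched nothing, $keep is the literal pattern: skip it.
        [ -e "$keep" ] || continue
        pack="${keep%.keep}.pack"
        if [ -e "$pack" ]; then
            printf '%s\n' "$pack"
        fi
    done
    return 0
}
```

If the observation holds, running kept_packs before and after "git gc" should print the same packs. "git count-objects -v" reports the total pack size (the size-pack field, in KiB), which is a convenient way to compare the repository size between runs.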
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 15:52 ` Teemu Likonen @ 2008-05-12 17:13 ` Johannes Schindelin 2008-05-12 18:43 ` Teemu Likonen 2008-05-12 17:17 ` David Tweed 1 sibling, 1 reply; 35+ messages in thread From: Johannes Schindelin @ 2008-05-12 17:13 UTC (permalink / raw) To: Teemu Likonen; +Cc: git Hi, On Mon, 12 May 2008, Teemu Likonen wrote: > Probably a crazy idea: What if "gc --aggressive" first removed *.keep > files and after packing and garbage-collecting and whatever it does it > would add a .keep file for the newly created pack? Most .keep files are not meant to be removed by git-gc. Usually, .keep files are only created interactively (if you _want_ to keep a pack, e.g. when it has been optimally packed and is big), or by git-index-pack while it is writing a pack (IIRC). So I think it would be wrong for "gc --aggressive" to remove the .keep files. Ciao, Dscho ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 17:13 ` Johannes Schindelin @ 2008-05-12 18:43 ` Teemu Likonen 2008-05-12 18:56 ` Nicolas Pitre 0 siblings, 1 reply; 35+ messages in thread From: Teemu Likonen @ 2008-05-12 18:43 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git Johannes Schindelin wrote (2008-05-12 18:13 +0100): > On Mon, 12 May 2008, Teemu Likonen wrote: > > > Probably a crazy idea: What if "gc --aggressive" first removed > > *.keep files and after packing and garbage-collecting and whatever > > it does it would add a .keep file for the newly created pack? > > Most .keep files are not meant to be removed by git-gc. Usually, > .keep files are only created interactively (if you _want_ to keep > a pack, e.g. when it has been optimally packed and is big), or by > git-index-pack while it is writing a pack (IIRC). > > So I think it would be wrong for "gc --aggressive" to remove the .keep > files. I guess you're right. Maybe "gc --aggressive" could delete only certain machine-generated .keep files which have an identifier inside? Well, I don't really have any problems with the current behaviour; it just feels a bit strange that, for example, Linus's kernel repository grew about 90MB after just one update pull and gc. Also, dangling objects are kept forever in .keep packs (which are created with "git clone", for example). ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 18:43 ` Teemu Likonen @ 2008-05-12 18:56 ` Nicolas Pitre 2008-05-12 19:09 ` Teemu Likonen 0 siblings, 1 reply; 35+ messages in thread From: Nicolas Pitre @ 2008-05-12 18:56 UTC (permalink / raw) To: Teemu Likonen; +Cc: Johannes Schindelin, git On Mon, 12 May 2008, Teemu Likonen wrote: > Well, I don't really have any problems with the current behaviour; it > just feels a bit strange that, for example, Linus's kernel repository > grew about 90MB after just one update pull and gc. That looks really odd. Sure the repo might grow a bit, but 90MB seems really excessive. How many time did pass between the initial clone and that subsequent pull? > Also, dangling > objects are kept forever in .keep packs (which are created with "git > clone", for example). A pack obtained via 'git clone' will never contain any dangling objects. Nicolas ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 18:56 ` Nicolas Pitre @ 2008-05-12 19:09 ` Teemu Likonen 2008-05-12 19:36 ` Nicolas Pitre 0 siblings, 1 reply; 35+ messages in thread From: Teemu Likonen @ 2008-05-12 19:09 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Johannes Schindelin, git Nicolas Pitre wrote (2008-05-12 14:56 -0400): > On Mon, 12 May 2008, Teemu Likonen wrote: > > > Well, I don't really have any problems with the current behaviour; > > it just feels a bit strange that, for example, Linus's kernel > > repository grew about 90MB after just one update pull and gc. > > That looks really odd. Sure the repo might grow a bit, but 90MB seems > really excessive. How many time did pass between the initial clone > and that subsequent pull? As I used the kernel repo just for testing this behaviour in question I did both things today. Timestamps tell that there were six hours between the initial .keep pack and the new pack created by manual "git gc". > > Also, dangling objects are kept forever in .keep packs (which are > > created with "git clone", for example). > > A pack obtained via 'git clone' will never contain any dangling > objects. I think it can contain at some later point. For example, if a user first fetches all the branches but later decides to track only one branch. After deleting unneeded tracking branches and expiring the reflog there'll be dangling objects in the original .keep pack created with "git clone". ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 19:09 ` Teemu Likonen @ 2008-05-12 19:36 ` Nicolas Pitre 2008-05-12 20:10 ` Govind Salinas 2008-05-12 20:24 ` Teemu Likonen 0 siblings, 2 replies; 35+ messages in thread From: Nicolas Pitre @ 2008-05-12 19:36 UTC (permalink / raw) To: Teemu Likonen; +Cc: Johannes Schindelin, git On Mon, 12 May 2008, Teemu Likonen wrote: > Nicolas Pitre wrote (2008-05-12 14:56 -0400): > > > On Mon, 12 May 2008, Teemu Likonen wrote: > > > > > Well, I don't really have any problems with the current behaviour; > > > it just feels a bit strange that, for example, Linus's kernel > > > repository grew about 90MB after just one update pull and gc. > > > > That looks really odd. Sure the repo might grow a bit, but 90MB seems > > really excessive. How many time did pass between the initial clone > > and that subsequent pull? > > As I used the kernel repo just for testing this behaviour in question > I did both things today. Timestamps tell that there were six hours > between the initial .keep pack and the new pack created by manual "git > gc". This is way too big a difference. Something is going on. What git version is this? And can you send me the content of your .git/logs directory? > > > Also, dangling objects are kept forever in .keep packs (which are > > > created with "git clone", for example). > > > > A pack obtained via 'git clone' will never contain any dangling > > objects. > > I think it can contain at some later point. For example, if a user first > fetches all the branches but later decides to track only one branch. > After deleting unneeded tracking branches and expiring the reflog > there'll be dangling objects in the original .keep pack created with > "git clone". Sure. But to decide to track only one branch and exclude the others require some higher level of git knowledge already. At that point if you really care about top packing performances you certainly can deal with the .keep file as well. 
Nicolas
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 19:36 ` Nicolas Pitre @ 2008-05-12 20:10 ` Govind Salinas 2008-05-12 21:06 ` Nicolas Pitre 2008-05-12 20:24 ` Teemu Likonen 1 sibling, 1 reply; 35+ messages in thread From: Govind Salinas @ 2008-05-12 20:10 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Teemu Likonen, Johannes Schindelin, git On Mon, May 12, 2008 at 2:36 PM, Nicolas Pitre <nico@cam.org> wrote: > On Mon, 12 May 2008, Teemu Likonen wrote: > > > Nicolas Pitre wrote (2008-05-12 14:56 -0400): > > > > > On Mon, 12 May 2008, Teemu Likonen wrote: > > > > > > > Well, I don't really have any problems with the current behaviour; > > > > it just feels a bit strange that, for example, Linus's kernel > > > > repository grew about 90MB after just one update pull and gc. > > > > > > That looks really odd. Sure the repo might grow a bit, but 90MB seems > > > really excessive. How many time did pass between the initial clone > > > and that subsequent pull? > > > > As I used the kernel repo just for testing this behaviour in question > > I did both things today. Timestamps tell that there were six hours > > between the initial .keep pack and the new pack created by manual "git > > gc". > > This is way too big a difference. Something is going on. > > What git version is this? And can you send me the content of your > .git/logs directory? > > > > > > Also, dangling objects are kept forever in .keep packs (which are > > > > created with "git clone", for example). > > > > > > A pack obtained via 'git clone' will never contain any dangling > > > objects. > > > > I think it can contain at some later point. For example, if a user first > > fetches all the branches but later decides to track only one branch. > > After deleting unneeded tracking branches and expiring the reflog > > there'll be dangling objects in the original .keep pack created with > > "git clone". > > Sure. 
But to decide to track only one branch and exclude the others > require some higher level of git knowledge already. At that point if > you really care about top packing performances you certainly can deal > with the .keep file as well. > > I have had some similar problems with .keep files. I cloned a repo I created that had a branch that I wasn't interested in. I deleted the branch and then I could never get rid of the (large) number of objects in that pack until I deleted the .keep and repacked. I think there should be some way of forcing git to fix this sort of thing. It gets even worse, I had pushed up the branch I wanted to get rid of to my hosted server and there was no way to get git to release that disk space. I had to have the hosting admin send me a tarball of the repo, extract it, delete the .keep file and repack it then send it back to him. I was fortunate enough to have a service that would let me do that. Thanks, Govind. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 20:10 ` Govind Salinas @ 2008-05-12 21:06 ` Nicolas Pitre 2008-05-12 21:07 ` Govind Salinas 0 siblings, 1 reply; 35+ messages in thread From: Nicolas Pitre @ 2008-05-12 21:06 UTC (permalink / raw) To: Govind Salinas; +Cc: Teemu Likonen, Johannes Schindelin, git On Mon, 12 May 2008, Govind Salinas wrote: > On Mon, May 12, 2008 at 2:36 PM, Nicolas Pitre <nico@cam.org> wrote: > > Sure. But to decide to track only one branch and exclude the others > > require some higher level of git knowledge already. At that point if > > you really care about top packing performances you certainly can deal > > with the .keep file as well. > > I have had some similar problems with .keep files. I cloned a repo I > created that had a branch that I wasn't interested in. I deleted the > branch and then I could never get rid of the (large) number of objects > in that pack until I deleted the .keep and repacked. But as soon as you just "git pull" you'll get the deleted branch back. Nicolas ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 21:06 ` Nicolas Pitre @ 2008-05-12 21:07 ` Govind Salinas 0 siblings, 0 replies; 35+ messages in thread From: Govind Salinas @ 2008-05-12 21:07 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Teemu Likonen, Johannes Schindelin, git On Mon, May 12, 2008 at 4:06 PM, Nicolas Pitre <nico@cam.org> wrote: > On Mon, 12 May 2008, Govind Salinas wrote: > > > On Mon, May 12, 2008 at 2:36 PM, Nicolas Pitre <nico@cam.org> wrote: > > > > Sure. But to decide to track only one branch and exclude the others > > > require some higher level of git knowledge already. At that point if > > > you really care about top packing performances you certainly can deal > > > with the .keep file as well. > > > > I have had some similar problems with .keep files. I cloned a repo I > > created that had a branch that I wasn't interested in. I deleted the > > branch and then I could never get rid of the (large) number of objects > > in that pack until I deleted the .keep and repacked. > > But as soon as you just "git pull" you'll get the deleted branch back. > > If you read the rest of my mail, you will see where I removed it from the hosted server as well. But with difficulty. Thanks, Govind. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
From: Teemu Likonen @ 2008-05-12 20:24 UTC
To: Nicolas Pitre; +Cc: Johannes Schindelin, git

Nicolas Pitre wrote (2008-05-12 15:36 -0400):

> On Mon, 12 May 2008, Teemu Likonen wrote:
>
> > > On Mon, 12 May 2008, Teemu Likonen wrote:
> > >
> > > > Well, I don't really have any problems with the current
> > > > behaviour; it just feels a bit strange that, for example,
> > > > Linus's kernel repository grew about 90MB after just one update
> > > > pull and gc.
>
> > As I used the kernel repo just for testing this behaviour in
> > question I did both things today. Timestamps tell that there were
> > six hours between the initial .keep pack and the new pack created by
> > manual "git gc".
>
> This is way too big a difference. Something is going on.
>
> What git version is this? And can you send me the content of your
> .git/logs directory?

I'm using Git from the "master" branch; compiled it today. I have the
following gc/repack-related patches applied from the "pu" branch:

    builtin-gc.c: deprecate --prune, it now really has no effect
    git-gc: always use -A when manually repacking
    repack: modify behavior of -A option to leave unreferenced objects unpacked

But I have experienced the same earlier with some other post-1.5.5
version so I believe you can reproduce this yourself. After cloning
Linus's linux-2.6 repo its .git directory weights 209MB. After single
"git pull" and "git gc" it was 298MB in my test.

I'll send you the .git/logs directory but I'm afraid it doesn't tell
much. There are just three files:

    .git/logs/HEAD
    .git/logs/refs/heads/master
    .git/logs/refs/remotes/origin/master

They contain one line for the initial clone and one line for the
fast-forward pull.

> > I think it can contain at some later point. For example, if a user
> > first fetches all the branches but later decides to track only one
> > branch. After deleting unneeded tracking branches and expiring the
> > reflog there'll be dangling objects in the original .keep pack
> > created with "git clone".
>
> Sure. But to decide to track only one branch and exclude the others
> require some higher level of git knowledge already. At that point if
> you really care about top packing performances you certainly can deal
> with the .keep file as well.

Perhaps so. Although I don't consider this very high level Git
knowledge:

    $ git remote rm origin
    $ git remote add -t wanted_branch origin git://...

The first command removes all the tracking branches. The latter starts
to track only one branch.
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
From: Mike Hommey @ 2008-05-12 21:03 UTC
To: Teemu Likonen; +Cc: Nicolas Pitre, Johannes Schindelin, git

On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote:
> But I have experienced the same earlier with some other post-1.5.5
> version so I believe you can reproduce this yourself. After cloning
> Linus's linux-2.6 repo its .git directory weights 209MB. After single
> "git pull" and "git gc" it was 298MB in my test.

I noticed that a while ago: when repacking multiple packs when one has a
.keep file, the resulting additional pack contains too many blobs and
trees, contrary to when only packing loose objects:

$ git init
$ echo a > a; git add a; git commit -m a
$ git gc
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ git verify-pack -v .git/objects/pack/pack-b87e61e2dc18ff37624d7f996f1270f923411530.pack
4bba7c0583de30efff4097299f89b199ab4a6dff commit 160 116 12
78981922613b2afb6025042ff6bd878ac1994e85 blob 2 11 167
aaff74984cccd156a469afa7d9ab10e4777beb24 tree 29 39 128
.git/objects/pack/pack-b87e61e2dc18ff37624d7f996f1270f923411530.pack: ok
$ touch .git/objects/pack/pack-b87e61e2dc18ff37624d7f996f1270f923411530.keep
$ echo b > b; git add b; git commit -m b
$ git gc
Counting objects: 3, done.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ git verify-pack -v .git/objects/pack/pack-aa817046e43f278d67c6b85962676246f57bb855.pack
3683f870be446c7cc05ffaef9fa06415276e1828 tree 58 65 158
61780798228d17af2d34fce4cfbdf35556832472 blob 2 11 223
647aed0360e964adc5cedb12e0719fb8bfc05867 commit 208 146 12
.git/objects/pack/pack-aa817046e43f278d67c6b85962676246f57bb855.pack: ok
$ git gc
Counting objects: 4, done.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), done.
Total 4 (delta 0), reused 4 (delta 0)
$ git verify-pack -v .git/objects/pack/pack-5f692a665e062dedad7b4baf692517adec37899d.pack
3683f870be446c7cc05ffaef9fa06415276e1828 tree 58 65 158
61780798228d17af2d34fce4cfbdf35556832472 blob 2 11 234
647aed0360e964adc5cedb12e0719fb8bfc05867 commit 208 146 12
78981922613b2afb6025042ff6bd878ac1994e85 blob 2 11 223
.git/objects/pack/pack-5f692a665e062dedad7b4baf692517adec37899d.pack: ok

Mike
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 21:03 ` Mike Hommey @ 2008-05-12 21:08 ` Mike Hommey 2008-05-13 0:12 ` Shawn O. Pearce 0 siblings, 1 reply; 35+ messages in thread From: Mike Hommey @ 2008-05-12 21:08 UTC (permalink / raw) To: Teemu Likonen; +Cc: Nicolas Pitre, Johannes Schindelin, git On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote: > On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote: > > But I have experienced the same earlier with some other post-1.5.5 > > version so I believe you can reproduce this yourself. After cloning > > Linus's linux-2.6 repo its .git directory weights 209MB. After single > > "git pull" and "git gc" it was 298MB in my test. > > I noticed that a while ago: when repacking multiple packs when one has a > .keep file, the resulting additional pack contains too many blobs and > trees, contrary to when only packing loose objects: (...) That is, it seems to also contain all the blobs and subtrees for all the commits the pack contains, even when they already are in the pack having a .keep file. Mike ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 21:08 ` Mike Hommey @ 2008-05-13 0:12 ` Shawn O. Pearce 2008-05-13 5:33 ` Mike Hommey 2008-05-14 1:03 ` Nicolas Pitre 0 siblings, 2 replies; 35+ messages in thread From: Shawn O. Pearce @ 2008-05-13 0:12 UTC (permalink / raw) To: Mike Hommey; +Cc: Teemu Likonen, Nicolas Pitre, Johannes Schindelin, git Mike Hommey <mh@glandium.org> wrote: > On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote: > > On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote: > > > But I have experienced the same earlier with some other post-1.5.5 > > > version so I believe you can reproduce this yourself. After cloning > > > Linus's linux-2.6 repo its .git directory weights 209MB. After single > > > "git pull" and "git gc" it was 298MB in my test. > > > > I noticed that a while ago: when repacking multiple packs when one has a > > .keep file, the resulting additional pack contains too many blobs and > > trees, contrary to when only packing loose objects: > (...) > > That is, it seems to also contain all the blobs and subtrees for all the > commits the pack contains, even when they already are in the pack having > a .keep file. I've noticed this too. Like since day 1 when we added .keep. But uh, nobody else complained and I forgot about it. My theory (totally unproven) is that the new pack has objects we copied from the .keep pack, because those objects were the best delta-bases for the loose objects we have deltafied and want to store in the new pack. Except they aren't yet packed in the new pack, so we pack them too. Tada, duplicates. :-\ Suddenly your repository nearly doubles in size if we have most files/trees change, as those delta bases are copied whole into the new pack. -- Shawn. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-13 0:12 ` Shawn O. Pearce @ 2008-05-13 5:33 ` Mike Hommey 2008-05-14 1:03 ` Nicolas Pitre 1 sibling, 0 replies; 35+ messages in thread From: Mike Hommey @ 2008-05-13 5:33 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Teemu Likonen, Nicolas Pitre, Johannes Schindelin, git On Mon, May 12, 2008 at 08:12:52PM -0400, Shawn O. Pearce wrote: > Mike Hommey <mh@glandium.org> wrote: > > On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote: > > > On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote: > > > > But I have experienced the same earlier with some other post-1.5.5 > > > > version so I believe you can reproduce this yourself. After cloning > > > > Linus's linux-2.6 repo its .git directory weights 209MB. After single > > > > "git pull" and "git gc" it was 298MB in my test. > > > > > > I noticed that a while ago: when repacking multiple packs when one has a > > > .keep file, the resulting additional pack contains too many blobs and > > > trees, contrary to when only packing loose objects: > > (...) > > > > That is, it seems to also contain all the blobs and subtrees for all the > > commits the pack contains, even when they already are in the pack having > > a .keep file. > > I've noticed this too. Like since day 1 when we added .keep. > But uh, nobody else complained and I forgot about it. > > My theory (totally unproven) is that the new pack has objects we > copied from the .keep pack, because those objects were the best > delta-bases for the loose objects we have deltafied and want to > store in the new pack. Except they aren't yet packed in the new > pack, so we pack them too. Tada, duplicates. :-\ Well, that does not seem delta related, since my testcase doesn't show deltas in the second pack. Mike ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
From: Nicolas Pitre @ 2008-05-14 1:03 UTC
To: Shawn O. Pearce; +Cc: Mike Hommey, Teemu Likonen, Johannes Schindelin, git

On Mon, 12 May 2008, Shawn O. Pearce wrote:

> Mike Hommey <mh@glandium.org> wrote:
> > On Mon, May 12, 2008 at 11:03:04PM +0200, Mike Hommey wrote:
> > > On Mon, May 12, 2008 at 11:24:14PM +0300, Teemu Likonen wrote:
> > > > But I have experienced the same earlier with some other post-1.5.5
> > > > version so I believe you can reproduce this yourself. After cloning
> > > > Linus's linux-2.6 repo its .git directory weights 209MB. After single
> > > > "git pull" and "git gc" it was 298MB in my test.
> > >
> > > I noticed that a while ago: when repacking multiple packs when one has a
> > > .keep file, the resulting additional pack contains too many blobs and
> > > trees, contrary to when only packing loose objects:
> > (...)
> >
> > That is, it seems to also contain all the blobs and subtrees for all the
> > commits the pack contains, even when they already are in the pack having
> > a .keep file.
>
> I've noticed this too. Like since day 1 when we added .keep.
> But uh, nobody else complained and I forgot about it.

Well, now that I've reproduced Teemu Likonen's test case, I can confirm
this is actually a problem. Here I get:

|remote: Counting objects: 523, done.
|remote: Compressing objects: 100% (57/57), done.
|remote: Total 362 (delta 305), reused 362 (delta 305)
|Receiving objects: 100% (362/362), 65.37 KiB, done.
|Resolving deltas: 100% (305/305), completed with 105 local objects.
|From ../test1
| 492c2e4..9404ef0 master -> master

The received pack is 449135 bytes large. This is much larger than the
actually received data which is 65.37 KiB, but we're completing a thin
pack with 105 undeltified objects accounting for the size increase which
is expected. So far so good.

Now, in theory, running 'git gc' should only repack those 362 + 105
objects, since the remaining ones are all found in the .keep flagged
pack. But that's not what's happening at all:

|Counting objects: 26559, done.
|Compressing objects: 100% (24708/24708), done.
|Writing objects: 100% (26559/26559), done.
|Total 26559 (delta 3054), reused 14011 (delta 1613)

So... there is something definitively wrong here. The expectation was to
get a pack in the same size range as the one received during the pack,
or somewhat smaller due to a better delta compression of the added
objects. But instead we get a pack containing 26559 objects!!! And in
that lot, only 3054 (11%) are deltas. That makes for a pack that started
from 449135 bytes and grew to 72395940 bytes.

> My theory (totally unproven) is that the new pack has objects we
> copied from the .keep pack, because those objects were the best
> delta-bases for the loose objects we have deltafied and want to
> store in the new pack. Except they aren't yet packed in the new
> pack, so we pack them too. Tada, duplicates. :-\

Well, not exactly.

Let's see what happens here even before any packing is attempted

|$ git rev-list --objects 492c2e4..9404ef0
|362
|
|$ git rev-list --objects --all \
|	--unpacked=pack-6a3438b2702be06697023d80b77e67a73a0b0b5c.pack |
|	wc -l
|26559

So this --unpacked= argument (which undocumented semantics I still have
issues with) is certainly not doing what is expected.

Nicolas
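The figures quoted in this message can be double-checked with a little arithmetic. The following is only a sketch using awk; the helper names percent and factor are made up for illustration and are not part of git.

```shell
#!/bin/sh
# percent PART TOTAL: print PART as a percentage of TOTAL, one decimal place.
percent() {
    LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# factor OLD NEW: print how many times larger NEW is than OLD, rounded.
factor() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", b / a }'
}

percent 3054 26559        # share of deltas among the repacked objects
factor 449135 72395940    # growth of the pack in bytes, before vs. after gc
```

The first call prints 11.5, consistent with the "11%" above, and the second prints 161, so the repacked pack is roughly 161 times the size of the pack that was received.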
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-14 1:03 ` Nicolas Pitre @ 2008-05-14 6:43 ` Junio C Hamano 2008-05-14 9:10 ` Juergen Ruehle 0 siblings, 1 reply; 35+ messages in thread From: Junio C Hamano @ 2008-05-14 6:43 UTC (permalink / raw) To: Nicolas Pitre Cc: Shawn O. Pearce, Mike Hommey, Teemu Likonen, Johannes Schindelin, git Nicolas Pitre <nico@cam.org> writes: > Let's see what happens here even before any packing is attempted > > |$ git rev-list --objects 492c2e4..9404ef0 > |362 > | > |$ git rev-list --objects --all \ > | --unpacked=pack-6a3438b2702be06697023d80b77e67a73a0b0b5c.pack | > | wc -l > |26559 > > So this --unpacked= argument (which undocumented semantics I still have > issues with) is certainly not doing what is expected. The output from rev-list is not surprising. --unpacked=$this.pack implies the usual --unpacked behaviour (i.e. only show unpacked objects by not traversing into commits that are packed) and at the same time pretends that objects in $this.pack are loose. It was meant to be used for a partial incremental repacking. If you have a pack to be kept (perhaps a highly packed deep pack that holds the earlier parts of the history), marked with .keep, and a handful young packs, you would give these young ones with --unpacked, so that the resulting single pack contains all that are loose or in these young packs. After that, you can remove all the young packs and loose objects. At least that is the idea. I am not sure where that rev-list experiment you showed fits in the bigger picture, but if that is used for repacking the young packs, perhaps the issue is that after the repacking the code forgets to remove the young ones whose objects are now moved into the new pack? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  6:43 ` Junio C Hamano
@ 2008-05-14  9:10   ` Juergen Ruehle
  2008-05-14 14:24     ` Nicolas Pitre
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Juergen Ruehle @ 2008-05-14  9:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Shawn O. Pearce, Mike Hommey, Teemu Likonen,
	Johannes Schindelin, git

Junio C Hamano writes:
 > The output from rev-list is not surprising. --unpacked=$this.pack implies
 > the usual --unpacked behaviour (i.e. only show unpacked objects by not
 > traversing into commits that are packed)

The problem is unconditional traversing into commits that are
unpacked. This behavior is immediately obvious if the packed blob in
the .keep pack is large. I've been using the following since the large
object discussion with Dana, but it might be completely broken (though
the test case is probably correct).

--
Previously --unpacked would filter on the commit level, ignoring whether the
objects comprising the commit actually were packed or unpacked.

This makes it impossible to store e.g. excessively large blobs in
different packs from the commits referencing them, since the next repack of
such a commit will suck all referenced blobs into the same pack.

This change moves the unpacked check to the output stage and no longer checks
the flag during commit traversal and adds a trivial test demonstrating the
problem.
---

Note that t6009 is already taken, so it might be better to merge the
test into one of the other rev-list tests.

 list-objects.c               |    6 ++++--
 revision.c                   |    2 --
 t/t6009-rev-list-unpacked.sh |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 4 deletions(-)
 create mode 100644 t/t6009-rev-list-unpacked.sh

diff --git a/list-objects.c b/list-objects.c
index c8b8375..b378c0f 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -146,7 +146,8 @@ void traverse_commit_list(struct rev_info *revs,
 
 	while ((commit = get_revision(revs)) != NULL) {
 		process_tree(revs, commit->tree, &objects, NULL, "");
-		show_commit(commit);
+		if (!revs->unpacked || !has_sha1_pack(commit->object.sha1, revs->ignore_packed))
+			show_commit(commit);
 	}
 	for (i = 0; i < revs->pending.nr; i++) {
 		struct object_array_entry *pending = revs->pending.objects + i;
@@ -173,7 +174,8 @@ void traverse_commit_list(struct rev_info *revs,
 		    sha1_to_hex(obj->sha1), name);
 	}
 	for (i = 0; i < objects.nr; i++)
-		show_object(&objects.objects[i]);
+		if (!revs->unpacked || !has_sha1_pack(objects.objects[i].item->sha1, revs->ignore_packed))
+			show_object(&objects.objects[i]);
 	free(objects.objects);
 	if (revs->pending.nr) {
 		free(revs->pending.objects);
diff --git a/revision.c b/revision.c
index 4231ea2..0e90d3b 100644
--- a/revision.c
+++ b/revision.c
@@ -1508,8 +1508,6 @@ enum commit_action simplify_commit(struct rev_info *revs, struct commit *commit)
 {
 	if (commit->object.flags & SHOWN)
 		return commit_ignore;
-	if (revs->unpacked && has_sha1_pack(commit->object.sha1, revs->ignore_packed))
-		return commit_ignore;
 	if (revs->show_all)
 		return commit_show;
 	if (commit->object.flags & UNINTERESTING)
diff --git a/t/t6009-rev-list-unpacked.sh b/t/t6009-rev-list-unpacked.sh
new file mode 100644
index 0000000..6b65e83
--- /dev/null
+++ b/t/t6009-rev-list-unpacked.sh
@@ -0,0 +1,32 @@
+#!/bin/sh
+
+test_description='test git rev-list --unpacked --objects'
+
+. ./test-lib.sh
+
+# Create an unpacked commit that references a packed object.
+
+test_expect_success setup '
+	echo Hallo > foo &&
+	git add foo &&
+	test_tick &&
+	git commit -m "A" &&
+	git gc &&
+	echo Cello > bar &&
+	git add bar &&
+	test_tick &&
+	git commit -m "B"
+'
+
+test_expect_success \
+	'object list should contain foo' '
+	git rev-list --all --objects | grep -q "foo"
+'
+
+test_expect_success \
+	'unpacked object list should not contain foo' '
+	test_must_fail "git rev-list --all --unpacked --objects | grep -q \"foo\""
+'
+
+
+test_done
-- 
1.5.5.1.382.g7d84c
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  9:10 ` Juergen Ruehle
@ 2008-05-14 14:24   ` Nicolas Pitre
  2008-05-14 17:03   ` Junio C Hamano
  2008-05-14 20:06   ` Linus Torvalds
  2 siblings, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-14 14:24 UTC (permalink / raw)
  To: Juergen Ruehle
  Cc: Junio C Hamano, Shawn O. Pearce, Mike Hommey, Teemu Likonen,
	Johannes Schindelin, git

On Wed, 14 May 2008, Juergen Ruehle wrote:

> Junio C Hamano writes:
>  > The output from rev-list is not surprising. --unpacked=$this.pack implies
>  > the usual --unpacked behaviour (i.e. only show unpacked objects by not
>  > traversing into commits that are packed)
>
> The problem is unconditional traversing into commits that are
> unpacked. This behavior is immediately obvious if the packed blob in
> the .keep pack is large.

That's what I was suspecting too. And because the Linux repo contains
many files, then a single commit will fetch a large bunch of objects
indeed.

> I've been using the following since the large
> object discussion with Dana, but it might be completely broken (though
> the test case is probably correct).

This is not some part of git code I'm familiar with, so I can't tell if
the patch is broken or not. What I can do is repeat my simple test which
produces the following results with your patch:

|$ git rev-list --objects 492c2e4..9404ef0
|362
|
|$ git rev-list --objects --all \
|	--unpacked=pack-6a3438b2702be06697023d80b77e67a73a0b0b5c.pack |
|	wc -l
|362

That's exactly what is expected.

Nicolas
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-14 9:10 ` Juergen Ruehle 2008-05-14 14:24 ` Nicolas Pitre @ 2008-05-14 17:03 ` Junio C Hamano 2008-05-14 20:06 ` Linus Torvalds 2 siblings, 0 replies; 35+ messages in thread From: Junio C Hamano @ 2008-05-14 17:03 UTC (permalink / raw) To: Juergen Ruehle, Linus Torvalds Cc: Nicolas Pitre, Shawn O. Pearce, Mike Hommey, Teemu Likonen, Johannes Schindelin, git Juergen Ruehle <j.ruehle@bmiag.de> writes: > Previously --unpacked would filter on the commit level, ignoring whether the > objects comprising the commit actually were packed or unpacked. > > This makes it impossible to store e.g. excessively large blobs in > different packs from the commits referencing them, since the next repack of > such a commit will suck all referenced blobs into the same pack. Doesn't this patch essentially make the --unpacked option to rev-list and the --incremental option to pack-objects the same thing? The semantics of the --unpacked has been defined that way from the very beginning, and I've always wondered how the option and --incremental should interact with each other. I think the approach your patch takes makes sense. > This change moves the unpacked check to the output stage and no longer checks > the flag during commit traversal and adds a trivial test demonstrating the > problem. Sign-off? > diff --git a/t/t6009-rev-list-unpacked.sh b/t/t6009-rev-list-unpacked.sh > new file mode 100644 > index 0000000..6b65e83 > --- /dev/null > +++ b/t/t6009-rev-list-unpacked.sh > @@ -0,0 +1,32 @@ > ... > +test_expect_success \ > + 'unpacked object list should not contain foo' ' > + test_must_fail "git rev-list --all --unpacked --objects | grep -q \"foo\"" > +' Ahhh. Ugly but don't you mean "! (rev-list | grep)"? ^ permalink raw reply [flat|nested] 35+ messages in thread
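Junio's remark about the last test can be illustrated in plain shell without a repository. This is only a sketch under stated assumptions: `must_fail` is a hypothetical stand-in that runs its arguments and succeeds when they fail (it is not the real `test_must_fail` from test-lib.sh), and `list_objects` is a made-up substitute for the rev-list call whose output deliberately still contains foo. A pipeline passed as one quoted string is looked up as a single command name, so it fails for the wrong reason and the check passes anyway; the negated subshell actually runs grep.

```shell
#!/bin/sh
# Hypothetical stand-in for a "must fail" helper: succeeds when "$@" fails.
must_fail () {
	"$@" 2>/dev/null
	test $? -ne 0
}

# Made-up replacement for "git rev-list --all --unpacked --objects".
# Its output DOES contain foo, so a correct check ought to fail.
list_objects () {
	printf '%s\n' 'abc123 bar' 'def456 foo'
}

# Quoted form: the whole string is treated as one command name that does
# not exist, so the helper reports success although foo is listed.
if must_fail "list_objects | grep -q foo"
then
	quoted_result=pass
else
	quoted_result=fail
fi

# Negated pipeline: grep really inspects the list and finds foo.
if ! (list_objects | grep -q foo)
then
	negated_result=pass
else
	negated_result=fail
fi

echo "quoted: $quoted_result, negated: $negated_result"
```

Only the second form notices that foo is still being listed, which is presumably what the test was meant to catch.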
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-14  9:10 ` Juergen Ruehle
  2008-05-14 14:24   ` Nicolas Pitre
  2008-05-14 17:03   ` Junio C Hamano
@ 2008-05-14 20:06   ` Linus Torvalds
  2008-05-14 20:19     ` Linus Torvalds
  2 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2008-05-14 20:06 UTC (permalink / raw)
  To: Juergen Ruehle
  Cc: Junio C Hamano, Nicolas Pitre, Shawn O. Pearce, Mike Hommey,
	Teemu Likonen, Johannes Schindelin, git

On Wed, 14 May 2008, Juergen Ruehle wrote:
>
> Previously --unpacked would filter on the commit level, ignoring whether the
> objects comprising the commit actually were packed or unpacked.

I think this patch is correct, but I wonder why you removed the pruning
from revision.c?

Why do we want to process trees for commits that aren't going to be
shown? This is going to slow down things a lot, and we've long had the
rule that commits have to be complete in the packs that are kept (ie you
should never have a pack-file that points to an unpacked object).

So I'd suggest a slightly less intrusive patch (untested!!) instead,
which leaves the commit object logic alone.

(Your test-case should obviously be merged regardless)

		Linus

---
 list-objects.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index c8b8375..8cb05ca 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -172,8 +172,12 @@ void traverse_commit_list(struct rev_info *revs,
 		die("unknown pending object %s (%s)",
 		    sha1_to_hex(obj->sha1), name);
 	}
-	for (i = 0; i < objects.nr; i++)
-		show_object(&objects.objects[i]);
+	for (i = 0; i < objects.nr; i++) {
+		struct object_array_entry *entry = &objects.objects[i];
+		if (revs->unpacked && has_sha1_pack(entry->item->sha1, revs->ignore_packed))
+			continue;
+		show_object(entry);
+	}
 	free(objects.objects);
 	if (revs->pending.nr) {
 		free(revs->pending.objects);
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-14 20:06 ` Linus Torvalds @ 2008-05-14 20:19 ` Linus Torvalds 2008-05-14 20:29 ` Nicolas Pitre 0 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2008-05-14 20:19 UTC (permalink / raw) To: Juergen Ruehle Cc: Junio C Hamano, Nicolas Pitre, Shawn O. Pearce, Mike Hommey, Teemu Likonen, Johannes Schindelin, git On Wed, 14 May 2008, Linus Torvalds wrote: > > I think this patch is correct, but I wonder why you removed the pruning > from revision.c? In fact, it might be a good idea to not just keep it in revision.c, but move it up a bit, so that a commit that is packed and should be ignored won't even have its parents put on the list (which means that we not only ignore the trees in that commit, but also all parents). Of course, the more aggressively we prune, the more we end up having to depend on the fact that a commit that is in a pack that is marked "keep" must *always* have everything that leads to it in that pack or others also marked "keep". We effectively have that already (because we've always pruned away the commits early), but it's a thing to keep in mind whenever we prune even more aggressively. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-14 20:19 ` Linus Torvalds @ 2008-05-14 20:29 ` Nicolas Pitre 2008-05-14 20:36 ` Linus Torvalds 0 siblings, 1 reply; 35+ messages in thread From: Nicolas Pitre @ 2008-05-14 20:29 UTC (permalink / raw) To: Linus Torvalds Cc: Juergen Ruehle, Junio C Hamano, Shawn O. Pearce, Mike Hommey, Teemu Likonen, Johannes Schindelin, git On Wed, 14 May 2008, Linus Torvalds wrote: > Of course, the more aggressively we prune, the more we end up having to > depend on the fact that a commit that is in a pack that is marked "keep" > must *always* have everything that leads to it in that pack or others also > marked "keep". We effectively have that already (because we've always > pruned away the commits early), but it's a thing to keep in mind whenever > we prune even more aggressively. I wonder if this is a good thing. Such a rule would effectively put restrictions on how objects like big blobs could be distributed amongst many .keep packs. I just wish we're not painting ourselves in a corner. Nicolas ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-14 20:29 ` Nicolas Pitre @ 2008-05-14 20:36 ` Linus Torvalds 2008-05-14 23:24 ` A Large Angry SCM 0 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2008-05-14 20:36 UTC (permalink / raw) To: Nicolas Pitre Cc: Juergen Ruehle, Junio C Hamano, Shawn O. Pearce, Mike Hommey, Teemu Likonen, Johannes Schindelin, git On Wed, 14 May 2008, Nicolas Pitre wrote: > On Wed, 14 May 2008, Linus Torvalds wrote: > > > Of course, the more aggressively we prune, the more we end up having to > > depend on the fact that a commit that is in a pack that is marked "keep" > > must *always* have everything that leads to it in that pack or others also > > marked "keep". We effectively have that already (because we've always > > pruned away the commits early), but it's a thing to keep in mind whenever > > we prune even more aggressively. > > I wonder if this is a good thing. Such a rule would effectively put > restrictions on how objects like big blobs could be distributed amongst > many .keep packs. I just wish we're not painting ourselves in a corner. You can distribute big objects arbitrarily among many .keep packs, but what you can *NOT* do (and which has _always_ been a bug to do) is to have a *.keep pack that refers to objects that are not in a .keep pack! So keep<->keep you can do anything you want, and distribute objects any way. But a keep pack must only refer to objects in itself or in other keep packs. Because otherwise, if we ever hit an object in a keep pack, we'll stop even looking further when we use --unpacked. And that has always been true (admittedly only for "commit" objects, but those are the ones that most commonly refer to other objects, so ..) Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-14 20:36 ` Linus Torvalds @ 2008-05-14 23:24 ` A Large Angry SCM 0 siblings, 0 replies; 35+ messages in thread From: A Large Angry SCM @ 2008-05-14 23:24 UTC (permalink / raw) To: Linus Torvalds Cc: Nicolas Pitre, Juergen Ruehle, Junio C Hamano, Shawn O. Pearce, Mike Hommey, Teemu Likonen, Johannes Schindelin, git Linus Torvalds wrote: > > On Wed, 14 May 2008, Nicolas Pitre wrote: > >> On Wed, 14 May 2008, Linus Torvalds wrote: >> >>> Of course, the more aggressively we prune, the more we end up having to >>> depend on the fact that a commit that is in a pack that is marked "keep" >>> must *always* have everything that leads to it in that pack or others also >>> marked "keep". We effectively have that already (because we've always >>> pruned away the commits early), but it's a thing to keep in mind whenever >>> we prune even more aggressively. >> I wonder if this is a good thing. Such a rule would effectively put >> restrictions on how objects like big blobs could be distributed amongst >> many .keep packs. I just wish we're not painting ourselves in a corner. > > You can distribute big objects arbitrarily among many .keep packs, but > what you can *NOT* do (and which has _always_ been a bug to do) is to have > a *.keep pack that refers to objects that are not in a .keep pack! > > So keep<->keep you can do anything you want, and distribute objects any > way. > > But a keep pack must only refer to objects in itself or in other keep > packs. > > Because otherwise, if we ever hit an object in a keep pack, we'll stop > even looking further when we use --unpacked. And that has always been true > (admittedly only for "commit" objects, but those are the ones that most > commonly refer to other objects, so ..) Sounds like git-fsck needs to start checking for this. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 20:24 ` Teemu Likonen
  2008-05-12 21:03   ` Mike Hommey
@ 2008-05-12 21:07   ` Nicolas Pitre
  1 sibling, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2008-05-12 21:07 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Johannes Schindelin, git

On Mon, 12 May 2008, Teemu Likonen wrote:

> I'll send you the .git/logs directory but I'm afraid it doesn't tell
> much. There are just three files:
>
> .git/logs/HEAD
> .git/logs/refs/heads/master
> .git/logs/refs/remotes/origin/master
>
> They contain one line for the initial clone and one line for
> the fast-forward pull.

That's what I want. This way I should be able to reproduce your exact
case.

Nicolas
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 15:52 ` Teemu Likonen 2008-05-12 17:13 ` Johannes Schindelin @ 2008-05-12 17:17 ` David Tweed 2008-05-12 23:49 ` Shawn O. Pearce 1 sibling, 1 reply; 35+ messages in thread From: David Tweed @ 2008-05-12 17:17 UTC (permalink / raw) To: Teemu Likonen; +Cc: git On Mon, May 12, 2008 at 4:52 PM, Teemu Likonen <tlikonen@iki.fi> wrote: > Teemu Likonen wrote (2008-05-12 15:29 +0300): > Probably a crazy idea: What if "gc --aggressive" first removed *.keep > files and after packing and garbage-collecting and whatever it does it > would add a .keep file for the newly created pack? My understanding is that the repacking with -a redoes the computation to repack ALL the objects in every pack and loose objects, whereas what would be preferred is to try to delta new objects (loose and packed) against the existing .keep pack (extending it with the new objects) but not trying to re-deltify objects in the .keep pack. This is because .keep files are primarily for those who are cloning onto a machine that isn't powerful (maybe even a laptop/palmtop) but who are cloning from a powerful server, so that you wouldn't necessarily want to apply your strategy unconditionally. -- cheers, dave tweed__________________________ david.tweed@gmail.com Rm 124, School of Systems Engineering, University of Reading. "while having code so boring anyone can maintain it, use Python." -- attempted insult seen on slashdot ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 17:17 ` David Tweed @ 2008-05-12 23:49 ` Shawn O. Pearce 2008-05-12 23:53 ` Junio C Hamano 0 siblings, 1 reply; 35+ messages in thread From: Shawn O. Pearce @ 2008-05-12 23:49 UTC (permalink / raw) To: David Tweed; +Cc: Teemu Likonen, git David Tweed <david.tweed@gmail.com> wrote: > On Mon, May 12, 2008 at 4:52 PM, Teemu Likonen <tlikonen@iki.fi> wrote: > > Teemu Likonen wrote (2008-05-12 15:29 +0300): > > Probably a crazy idea: What if "gc --aggressive" first removed *.keep > > files and after packing and garbage-collecting and whatever it does it > > would add a .keep file for the newly created pack? > > My understanding is that the repacking with -a redoes the computation > to repack ALL the objects in every pack and loose objects, No. -a means repack all objects in all packs which do not have a .keep on them. Without -a we only repack loose objects. > whereas > what would be preferred is to try to delta new objects (loose and > packed) against the existing .keep pack (extending it with the new > objects) but not trying to re-deltify objects in the .keep pack. We cannot do that. Deltas in pack A may not reference base objects in pack B. This is a simplification rule that prevents us from needing to worry about damaging a pack when we repack and delete another pack. > This > is because .keep files are primarily for those who are cloning onto a > machine that isn't powerful (maybe even a laptop/palmtop) but who are > cloning from a powerful server, so that you wouldn't necessarily want > to apply your strategy unconditionally. Yes, sort of. 
We use .keep for two reasons:

 - As a "lock file" to prevent a pack that was just created by a
   git-fetch or git-receive-pack from being deleted by a concurrent
   git-repack before the objects it contains are linked into the
   refs space and thus considered reachable;

 - As a way to avoid _huge_ packs (say >1G) that would take a lot
   of disk IO just to copy with 100% delta reuse from an old pack
   to a new pack each time the user runs git-gc.

I think git-clone marking a 150M linux-2.6 pack with .keep is wrong;
most users working with the linux-2.6 sources have sufficient
hardware to deal with the disk IO required to copy that with 100%
delta reuse. But I have a repository at day-job with a 600M pack,
that's starting to head into the realm where git-gc while running
on battery on a laptop would prefer to have that .keep.

-- 
Shawn.
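To see which packs a repack would leave alone, it can help to list the pack directory and check for a .keep file next to each pack. The following is a minimal sketch, assuming only the on-disk layout mentioned in this thread (objects/pack/*.pack with an optional matching *.keep); the function name `classify_packs` and its output format are made up for illustration and are not part of git.

```shell
#!/bin/sh
# Print each *.pack in the given pack directory, tagged "kept" when a
# matching .keep file sits next to it and "repackable" otherwise.
classify_packs () {
	pack_dir=$1
	for pack in "$pack_dir"/*.pack
	do
		# When the glob matches nothing it stays literal; skip that case.
		test -f "$pack" || continue
		base=${pack%.pack}
		if test -f "$base.keep"
		then
			echo "kept ${pack##*/}"
		else
			echo "repackable ${pack##*/}"
		fi
	done
}

# Usage (inside a repository): classify_packs .git/objects/pack
```

The sketch only inspects file names; it does not look at what the .keep file contains or why it was created.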
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-12 23:49 ` Shawn O. Pearce @ 2008-05-12 23:53 ` Junio C Hamano 2008-05-13 0:09 ` Shawn O. Pearce 0 siblings, 1 reply; 35+ messages in thread From: Junio C Hamano @ 2008-05-12 23:53 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: David Tweed, Teemu Likonen, git "Shawn O. Pearce" <spearce@spearce.org> writes: > David Tweed <david.tweed@gmail.com> wrote: >> On Mon, May 12, 2008 at 4:52 PM, Teemu Likonen <tlikonen@iki.fi> wrote: >> > Teemu Likonen wrote (2008-05-12 15:29 +0300): >> > Probably a crazy idea: What if "gc --aggressive" first removed *.keep >> > files and after packing and garbage-collecting and whatever it does it >> > would add a .keep file for the newly created pack? >> >> My understanding is that the repacking with -a redoes the computation >> to repack ALL the objects in every pack and loose objects, > > No. -a means repack all objects in all packs which do not have a > .keep on them. Without -a we only repack loose objects. > >> whereas >> what would be preferred is to try to delta new objects (loose and >> packed) against the existing .keep pack (extending it with the new >> objects) but not trying to re-deltify objects in the .keep pack. > > We cannot do that. Deltas in pack A may not reference base objects > in pack B. This is a simplification rule that prevents us from > needing to worry about damaging a pack when we repack and delete > another pack. > >> This >> is because .keep files are primarily for those who are cloning onto a >> machine that isn't powerful (maybe even a laptop/palmtop) but who are >> cloning from a powerful server, so that you wouldn't necessarily want >> to apply your strategy unconditionally. > > Yes, sort of. 
We use .keep for two reasons:
>
>  - As a "lock file" to prevent a pack that was just created by a
>    git-fetch or git-receive-pack from being deleted by a concurrent
>    git-repack before the objects it contains are linked into the
>    refs space and thus considered reachable;
>
>  - As a way to avoid _huge_ packs (say >1G) that would take a lot
>    of disk IO just to copy with 100% delta reuse from an old pack
>    to a new pack each time the user runs git-gc.
>
> I think git-clone marking a 150M linux-2.6 pack with .keep is wrong;
> most users working with the linux-2.6 sources have sufficient
> hardware to deal with the disk IO required to copy that with 100%
> delta reuse. But I have a repository at day-job with a 600M pack,
> that's starting to head into the realm where git-gc while running
> on battery on a laptop would prefer to have that .keep.

Perhaps clone can decide to keep the .keep file depending on the size of
the pack then?
* Re: Why repository grows after "git gc"? / Purpose of *.keep files?
  2008-05-12 23:53 ` Junio C Hamano
@ 2008-05-13  0:09   ` Shawn O. Pearce
  2008-05-13  5:08     ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Shawn O. Pearce @ 2008-05-13  0:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Tweed, Teemu Likonen, git

Junio C Hamano <gitster@pobox.com> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> >
> > I think git-clone marking a 150M linux-2.6 pack with .keep is wrong;
> > most users working with the linux-2.6 sources have sufficient
> > hardware to deal with the disk IO required to copy that with 100%
> > delta reuse. But I have a repository at day-job with a 600M pack,
> > that's starting to head into the realm where git-gc while running
> > on battery on a laptop would prefer to have that .keep.
>
> Perhaps clone can decide to keep the .keep file depending on the size of
> the pack then?

Yeah, I think that's the better thing to do here. I'm not sure where
the cut-off is; maybe if the pack is <512M, delete the .keep once the
refs are in place and the objects are known to be reachable.

Of course this does not fix the issue Nico was looking at. We
shouldn't be seeing a 98M explosion with objects duplicated from
the .keep pack into the new pack.

-- 
Shawn.
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-13 0:09 ` Shawn O. Pearce @ 2008-05-13 5:08 ` Paolo Bonzini 2008-05-13 5:22 ` Shawn O. Pearce 2008-05-13 9:22 ` Teemu Likonen 0 siblings, 2 replies; 35+ messages in thread From: Paolo Bonzini @ 2008-05-13 5:08 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Junio C Hamano, David Tweed, Teemu Likonen, git Shawn O. Pearce wrote: > Junio C Hamano <gitster@pobox.com> wrote: >> "Shawn O. Pearce" <spearce@spearce.org> writes: >>> I think git-clone marking a 150M linux-2.6 pack with .keep is wrong; >>> most users working with the linux-2.6 sources have sufficient >>> hardware to deal with the disk IO required to copy that with 100% >>> delta reuse. But I have a repository at day-job with a 600M pack, >>> that's starting to head into the realm where git-gc while running >>> on battery on a laptop would prefer to have that .keep. >> Perhaps clone can decide to keep the .keep file depending on the size of >> the pack then? > > Yea, I think that's the better thing to do here. I'm not sure where > the cut-off is, maybe its <512M delete the .keep once the refs are > inplace and the objects are ensured to be reachable. I think separate cutoffs should be in place for file size and number of objects. Very tight packs probably require hours to repack as efficiently. By the way, another scenario where I used pack files is when I can only distribute via http because of firewalls. I make a clone of the original repository and mark the pack as keep; then I push to the distribution site, gc, and mark the pack as keep; then I have every day a cron job that does git-gc. This way I know that the user will only have to download the third pack. I think I'll modify the cron job to mark as keep the packs that exceed 2 megabytes or something like that. 
Thinking about both use cases, the best would be to have options
(common to git-clone, git-remote add, git-gc at least; and available
via config keys too) like

   --keep-packs[=THRES1,THRES2,...]

where:

- one threshold would be enough to mark a pack as keep

- thresholds could be in the form "\d+[kmg]?b" for file size,
  "\d+[kmg]?" for number of objects.

- if no threshold is given, the default could be
  --keep-packs=100k,512MB or whatever is in the config.

- to mark all packs, use --keep-packs=0

Paolo
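The threshold syntax proposed above could be parsed roughly as follows. This is a sketch of a proposal only: `--keep-packs` is not an existing git option, and the function names are hypothetical. Several details are assumptions not stated in the message: suffixes are treated case-insensitively, k/m/g mean powers of 1024 for sizes and powers of 1000 for object counts, and a pack "meets" a threshold when its value is greater than or equal to it (which makes `0` match every pack).

```shell
#!/bin/sh
# Hypothetical parser for one threshold of the proposed --keep-packs list.
# Prints "size <bytes>" or "count <objects>"; fails on malformed input.
# Assumed: case-insensitive suffixes, 1024-based sizes, 1000-based counts.
parse_threshold () {
	spec=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
	case "$spec" in
	*b) kind=size; base=1024; spec=${spec%b} ;;
	*)  kind=count; base=1000 ;;
	esac
	case "$spec" in
	*k) mult=$base; num=${spec%k} ;;
	*m) mult=$(($base * $base)); num=${spec%m} ;;
	*g) mult=$(($base * $base * $base)); num=${spec%g} ;;
	*)  mult=1; num=$spec ;;
	esac
	case "$num" in
	''|*[!0-9]*) echo "invalid threshold: $1" >&2; return 1 ;;
	esac
	echo "$kind $(($num * $mult))"
}

# Succeeds when the pack meets at least one threshold in the
# comma-separated list: pack_meets_threshold <bytes> <objects> <list>
pack_meets_threshold () {
	pack_bytes=$1
	pack_objects=$2
	old_ifs=$IFS
	IFS=,
	set -- $3
	IFS=$old_ifs
	for t
	do
		parsed=$(parse_threshold "$t") || return 2
		kind=${parsed% *}
		value=${parsed#* }
		case "$kind" in
		size)  test "$pack_bytes" -ge "$value" && return 0 ;;
		count) test "$pack_objects" -ge "$value" && return 0 ;;
		esac
	done
	return 1
}
```

For example, `pack_meets_threshold 1000 200000 100k,512MB` succeeds because the object count exceeds 100k even though the pack is far smaller than 512MB, matching the "one threshold would be enough" rule.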
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-13 5:08 ` Paolo Bonzini @ 2008-05-13 5:22 ` Shawn O. Pearce 2008-05-13 9:22 ` Teemu Likonen 1 sibling, 0 replies; 35+ messages in thread From: Shawn O. Pearce @ 2008-05-13 5:22 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Junio C Hamano, David Tweed, Teemu Likonen, git Paolo Bonzini <bonzini@gnu.org> wrote: > Shawn O. Pearce wrote: > >Junio C Hamano <gitster@pobox.com> wrote: > >>Perhaps clone can decide to keep the .keep file depending on the size of > >>the pack then? > > > >Yea, I think that's the better thing to do here. I'm not sure where > >the cut-off is, maybe its <512M delete the .keep once the refs are > >inplace and the objects are ensured to be reachable. > > I think separate cutoffs should be in place for file size and number of > objects. Very tight packs probably require hours to repack as efficiently. So long as you don't use `gc --aggressive` or `repack -f` the tightness of a pack doesn't matter; delta reuse means we copy the tight delta from the source pack to the new destination pack. However, you are correct that the more objects in the source pack the longer it will take to compute what is reachable, which does extend the time needed for even a simple git-gc. -- Shawn. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-13 5:08 ` Paolo Bonzini 2008-05-13 5:22 ` Shawn O. Pearce @ 2008-05-13 9:22 ` Teemu Likonen 2008-05-13 21:46 ` Stephen R. van den Berg 1 sibling, 1 reply; 35+ messages in thread From: Teemu Likonen @ 2008-05-13 9:22 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Shawn O. Pearce, Junio C Hamano, David Tweed, git Paolo Bonzini wrote (2008-05-13 07:08 +0200): > I think separate cutoffs should be in place for file size and number > of objects. Very tight packs probably require hours to repack as > efficiently. [...] > Thinking about both use cases, the best would be to have options > (common to git-clone, git-remote add, git-gc at least; and available > via config keys too) like > > --keep-packs[=THRES1,THRES2,...] Some thoughts from user interface's point of view. Two assumptions: - gc is daily or weekly operation - gc --aggressive is more like weekly or monthly operation. In big repositories gc can feel pretty slow if there are not any .keep packs and user runs the command daily. So I think there's a point in having a .keep pack in repositories the size of linux-2.6 for example. But at the same time I think it would be nice to have an easy UI-way to repack with better disk space optimization. This started as a crazy idea but maybe it's not so crazy so I'll rephrase my previous suggestion. At final stage the command gc --aggressive would add new .keep file which contains an identifier like This .keep file was added by "gc --aggressive" and will be automatically deleted at next run. (Or something like that, you get the idea.) At first gc --aggressive looks for .keep files with such identifier and deletes them if found. Then it proceeds normally and finally adds new .keep file with the same identifier. 
This way the "daily" gc would operate very fast (as it leaves .keep
packs alone), and with gc --aggressive the user could easily decide when
to create new landmark .keep packs (and also prune possible dangling
objects inside previous .keep packs). Normal users don't need to know
the details. Just run gc occasionally and maybe gc --aggressive when
better optimization is needed.

How does this sound?
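The marker scheme described above can be sketched on the file level. This is an illustration of the suggestion, not of anything git does: the marker text and both function names are hypothetical, and the actual repacking step between them is left out. The point of the marker is that only .keep files created by this scheme are removed, while .keep files from other sources (for example the "fetch-pack <number> on <hostname>" ones mentioned at the start of the thread) stay untouched.

```shell
#!/bin/sh
# Hypothetical marker text; the message only says "something like that".
AUTO_KEEP_MARKER='added by "gc --aggressive"'

# First step: delete only the .keep files in directory $1 that carry the
# marker, leaving .keep files written by hand or by other tools alone.
remove_auto_keeps () {
	for keep in "$1"/*.keep
	do
		# Skip the literal pattern when no .keep file exists.
		test -f "$keep" || continue
		if grep -q -F "$AUTO_KEEP_MARKER" "$keep"
		then
			rm -f "$keep"
		fi
	done
}

# Final step: mark the freshly written pack $1 (path ending in .pack)
# so that the everyday gc skips it until the next aggressive run.
add_auto_keep () {
	echo "This .keep file was $AUTO_KEEP_MARKER and will be deleted at its next run." >"${1%.pack}.keep"
}
```

A full implementation would have to run the repack between the two steps and would need to cope with a concurrent fetch creating its own .keep file in the meantime; neither is covered here.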
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-13 9:22 ` Teemu Likonen @ 2008-05-13 21:46 ` Stephen R. van den Berg 2008-05-14 5:42 ` Teemu Likonen 0 siblings, 1 reply; 35+ messages in thread From: Stephen R. van den Berg @ 2008-05-13 21:46 UTC (permalink / raw) To: Teemu Likonen Cc: Paolo Bonzini, Shawn O. Pearce, Junio C Hamano, David Tweed, git Teemu Likonen wrote: >This way the "daily" gc would operate very fast (as it leaves .keep >packs alone), and with gc --aggressive user could easily decide when to >create new landmark .keep packs (and also prune possible dangling >objects inside previous .keep packs). Normal user don't need to know the >details. Just run gc occasionally and maybe gc --aggressive when better >optimization is needed. >How does this sound? It sounds sound :-). I like the simplicity. -- Sincerely, srb@cuci.nl Stephen R. van den Berg. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Why repository grows after "git gc"? / Purpose of *.keep files? 2008-05-13 21:46 ` Stephen R. van den Berg @ 2008-05-14 5:42 ` Teemu Likonen 0 siblings, 0 replies; 35+ messages in thread From: Teemu Likonen @ 2008-05-14 5:42 UTC (permalink / raw) To: Stephen R. van den Berg Cc: Paolo Bonzini, Shawn O. Pearce, Junio C Hamano, David Tweed, git Stephen R. van den Berg wrote (2008-05-14 00:46 +0300): > Teemu Likonen wrote: > >This way the "daily" gc would operate very fast (as it leaves .keep > >packs alone), and with gc --aggressive user could easily decide when to > >create new landmark .keep packs (and also prune possible dangling > >objects inside previous .keep packs). Normal user don't need to know the > >details. Just run gc occasionally and maybe gc --aggressive when better > >optimization is needed. > > >How does this sound? > > It sounds sound :-). > I like the simplicity. It turned out that gc --aggressive is not what I thought it was, i.e. "pack aggressively and efficiently". So my suggestion implies the semantics that --aggressive would do effective compressing. ^ permalink raw reply [flat|nested] 35+ messages in thread