* git gc & deleted branches @ 2008-05-08 17:45 Guido Ostkamp 2008-05-08 18:39 ` Jeff King 0 siblings, 1 reply; 50+ messages in thread From: Guido Ostkamp @ 2008-05-08 17:45 UTC (permalink / raw) To: git Hello, I'm trying to reclaim space from an abandoned branch (never involved in any merge) using 'git gc', but it doesn't appear to work: mkdir testrepo cd testrepo git init dd if=/dev/urandom bs=1024k count=10 of=file git add file git commit -a -m 'initial checkin' git checkout -b test dd if=/dev/urandom bs=1024k count=10 of=file git commit -a -m 'branch checkin' git checkout master du -s . # returns 30960 git branch -D test git gc du -s . # returns 30916 Here I had expected ~20000 since the branch uses ~10000. My config is [gc] reflogExpire = 0 reflogExpireUnreachable = 0 rerereresolved = 0 rerereunresolved = 0 packrefs = 1 I also tried 'git-pack-refs --all' or 'git-pack-refs --prune' but to no avail. What am I doing wrong? Thanks for any hints. Regards Guido ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 17:45 git gc & deleted branches Guido Ostkamp @ 2008-05-08 18:39 ` Jeff King 2008-05-08 18:55 ` Guido Ostkamp 0 siblings, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-08 18:39 UTC (permalink / raw) To: Guido Ostkamp; +Cc: git On Thu, May 08, 2008 at 07:45:31PM +0200, Guido Ostkamp wrote: > [gc] > reflogExpire = 0 > reflogExpireUnreachable = 0 > rerereresolved = 0 > rerereunresolved = 0 > packrefs = 1 git-gc uses a "safe" pruning mode, where it only prunes unreferenced objects that are older than a certain period (this makes it safe to run git-gc, even if other processes are creating objects at the same time). So try [gc] pruneExpire = now Alternatively, you can just run 'git prune' manually instead of 'git gc'. > I also tried 'git-pack-refs --all' or 'git-pack-refs --prune' but to no > avail. Those won't help at all; they are purely about moving refs from individual files into the 'packed-refs' file. -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 18:39 ` Jeff King @ 2008-05-08 18:55 ` Guido Ostkamp 2008-05-08 20:07 ` Brandon Casey 2008-05-08 20:51 ` Jeff King 0 siblings, 2 replies; 50+ messages in thread From: Guido Ostkamp @ 2008-05-08 18:55 UTC (permalink / raw) To: Jeff King; +Cc: git On Thu, 8 May 2008, Jeff King wrote: > On Thu, May 08, 2008 at 07:45:31PM +0200, Guido Ostkamp wrote: > >> [gc] >> reflogExpire = 0 >> reflogExpireUnreachable = 0 >> rerereresolved = 0 >> rerereunresolved = 0 >> packrefs = 1 > > git-gc uses a "safe" pruning mode, where it only prunes unreferenced > objects that are older than a certain period (this makes it safe to run > git-gc, even if other processes are creating objects at the same time). > > So try > > [gc] > pruneExpire = now > > Alternatively, you can just run 'git prune' manually instead of 'git > gc'. Jeff, I tried it, but it has no effect (see below). There is only the master branch left, and only one commit therein, still it uses the space former occupied by the branch. I'm using git version 1.5.5.1.147.g867f. Any further ideas? $ git config -l core.repositoryformatversion=0 core.filemode=true core.bare=false core.logallrefupdates=true gc.reflogexpire=0 gc.reflogexpireunreachable=0 gc.rerereresolved=0 gc.rerereunresolved=0 gc.packrefs=1 gc.pruneexpire=now $ git gc Counting objects: 6, done. Compressing objects: 100% (4/4), done. Writing objects: 100% (6/6), done. Total 6 (delta 0), reused 6 (delta 0) $ git prune $ git branch * master $ du -s . 30820 . $ git log commit 9717437cdcb2a4457f28f41db5f6fad9ca55b54e Author: Testuser <testuser@bianca.dialin.t-online.de> Date: Thu May 8 19:40:06 2008 +0200 initial checkin $ ls -l total 10240 -rw-r--r-- 1 testuser users 10485760 May 8 19:40 file Regards Guido ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 18:55 ` Guido Ostkamp @ 2008-05-08 20:07 ` Brandon Casey 2008-05-08 20:52 ` Guido Ostkamp 2008-05-08 20:56 ` Jeff King 2008-05-08 20:51 ` Jeff King 1 sibling, 2 replies; 50+ messages in thread From: Brandon Casey @ 2008-05-08 20:07 UTC (permalink / raw) To: Guido Ostkamp; +Cc: Jeff King, git Guido Ostkamp wrote: > On Thu, 8 May 2008, Jeff King wrote: >> On Thu, May 08, 2008 at 07:45:31PM +0200, Guido Ostkamp wrote: >> >>> [gc] >>> reflogExpire = 0 >>> reflogExpireUnreachable = 0 >>> rerereresolved = 0 >>> rerereunresolved = 0 >>> packrefs = 1 >> >> git-gc uses a "safe" pruning mode, where it only prunes unreferenced >> objects that are older than a certain period (this makes it safe to run >> git-gc, even if other processes are creating objects at the same time). >> >> So try >> >> [gc] >> pruneExpire = now >> >> Alternatively, you can just run 'git prune' manually instead of 'git >> gc'. > > Jeff, I tried it, but it has no effect (see below). There is only the > master branch left, and only one commit therein, still it uses the space > former occupied by the branch. I'm using git version 1.5.5.1.147.g867f. > > Any further ideas? Possibly that object got packed? git-prune only removes loose objects. Try 'git gc --prune' which will call git-repack with the -a option. btw, this is _really_ a non-issue. It seems to keep coming up on the list. Just know that each one of the config options that you set to zero, including the one Jeff suggested setting to "now", is a safety mechanism that is there to ensure that you never ever lose data and that mistakes are recoverable. And be assured that the objects referenced by a deleted branch will be removed from the repository eventually as long as 'git gc --prune' is run periodically. -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 20:07 ` Brandon Casey @ 2008-05-08 20:52 ` Guido Ostkamp 2008-05-08 21:01 ` Jeff King 2008-05-08 20:56 ` Jeff King 1 sibling, 1 reply; 50+ messages in thread From: Guido Ostkamp @ 2008-05-08 20:52 UTC (permalink / raw) To: Brandon Casey; +Cc: Jeff King, git On Thu, 8 May 2008, Brandon Casey wrote: > Possibly that object got packed? git-prune only removes loose objects. > Try 'git gc --prune' which will call git-repack with the -a option. Thanks, Brandon - that did the trick :-) > Just know that each one of the config options that you set to zero, > including the one Jeff suggested setting to "now", is a safety mechanism > that is there to ensure that you never ever lose data and that mistakes > are recoverable. I am aware of this. However, at work I am unfortunately bound to a very restrictive filesystem quota on central development servers, so every single byte counts in (our official versioning control system is ClearCase where less space is required due to working tree and history being supplied through virtual filesystems). > And be assured that the objects referenced by a deleted branch will be > removed from the repository eventually as long as 'git gc --prune' is > run periodically. Ok. I did not know about the 'prune' option yet as it neither mentioned in the "Git Tutorial" nor "Everyday Git", there only 'git gc' is used with no options. Regards Guido ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 20:52 ` Guido Ostkamp @ 2008-05-08 21:01 ` Jeff King 2008-05-08 21:15 ` Nicolas Pitre 2008-05-08 21:33 ` Guido Ostkamp 0 siblings, 2 replies; 50+ messages in thread From: Jeff King @ 2008-05-08 21:01 UTC (permalink / raw) To: Guido Ostkamp; +Cc: Brandon Casey, git On Thu, May 08, 2008 at 10:52:19PM +0200, Guido Ostkamp wrote: >> And be assured that the objects referenced by a deleted branch will be >> removed from the repository eventually as long as 'git gc --prune' is >> run periodically. > > Ok. I did not know about the 'prune' option yet as it neither mentioned in > the "Git Tutorial" nor "Everyday Git", there only 'git gc' is used with no > options. It is deprecated; see 25ee9731. According to that commit message, prune is now a no-op. However, it looks like it is still used for trigger a "repack -a" rather than "repack -A". I don't know if it is worth making that behavior available through some more sane command line option (I would think people who really know that they want "repack -a" would just call it). -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:01 ` Jeff King @ 2008-05-08 21:15 ` Nicolas Pitre 2008-05-08 21:17 ` Jeff King 2008-05-08 21:33 ` Guido Ostkamp 1 sibling, 1 reply; 50+ messages in thread From: Nicolas Pitre @ 2008-05-08 21:15 UTC (permalink / raw) To: Jeff King; +Cc: Guido Ostkamp, Brandon Casey, git On Thu, 8 May 2008, Jeff King wrote: > On Thu, May 08, 2008 at 10:52:19PM +0200, Guido Ostkamp wrote: > > >> And be assured that the objects referenced by a deleted branch will be > >> removed from the repository eventually as long as 'git gc --prune' is > >> run periodically. > > > > Ok. I did not know about the 'prune' option yet as it neither mentioned in > > the "Git Tutorial" nor "Everyday Git", there only 'git gc' is used with no > > options. > > It is deprecated; see 25ee9731. > > According to that commit message, prune is now a no-op. However, it > looks like it is still used for trigger a "repack -a" rather than > "repack -A". I don't know if it is worth making that behavior available > through some more sane command line option (I would think people who > really know that they want "repack -a" would just call it). Well, actually this is a problem. I think it is a good thing to deprecate gc --prune. but if that means that repack -a is never used then unreferenced and expired objects will never be pruned if they're packed if one is always using 'git gc' as we are advocating. Nicolas ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:15 ` Nicolas Pitre @ 2008-05-08 21:17 ` Jeff King 2008-05-08 21:23 ` Brandon Casey 0 siblings, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-08 21:17 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Guido Ostkamp, Brandon Casey, git On Thu, May 08, 2008 at 05:15:34PM -0400, Nicolas Pitre wrote: > > According to that commit message, prune is now a no-op. However, it > > looks like it is still used for trigger a "repack -a" rather than > > "repack -A". I don't know if it is worth making that behavior available > > Well, actually this is a problem. > > I think it is a good thing to deprecate gc --prune. but if that means > that repack -a is never used then unreferenced and expired objects will > never be pruned if they're packed if one is always using 'git gc' as we > are advocating. I thought that -A would eventually put them all into a single pack, killing off the old packs. -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:17 ` Jeff King @ 2008-05-08 21:23 ` Brandon Casey 2008-05-08 21:31 ` Jeff King 0 siblings, 1 reply; 50+ messages in thread From: Brandon Casey @ 2008-05-08 21:23 UTC (permalink / raw) To: Jeff King; +Cc: Nicolas Pitre, Guido Ostkamp, git Jeff King wrote: > On Thu, May 08, 2008 at 05:15:34PM -0400, Nicolas Pitre wrote: > >>> According to that commit message, prune is now a no-op. However, it >>> looks like it is still used for trigger a "repack -a" rather than >>> "repack -A". I don't know if it is worth making that behavior available >> Well, actually this is a problem. >> >> I think it is a good thing to deprecate gc --prune. but if that means >> that repack -a is never used then unreferenced and expired objects will >> never be pruned if they're packed if one is always using 'git gc' as we >> are advocating. > > I thought that -A would eventually put them all into a single pack, > killing off the old packs. '-a' puts everything in a single pack and kills off old packs. Anything that was unreachable is not repacked in the new pack. '-A' does the same thing but it also repacks the unreachable objects that were previously packed. So if something gets packed that subsequently becomes unreachable it will never be removed unless 'repack -a' is used. Possibly --keep-unreachable should instead unpack the unreachable items which would allow them to eventually be pruned based on pruneExpire. Then we could indeed get rid of the --prune option to git-gc. -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:23 ` Brandon Casey @ 2008-05-08 21:31 ` Jeff King 2008-05-08 21:40 ` Brandon Casey 0 siblings, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-08 21:31 UTC (permalink / raw) To: Brandon Casey; +Cc: Nicolas Pitre, Guido Ostkamp, git On Thu, May 08, 2008 at 04:23:53PM -0500, Brandon Casey wrote: > > I thought that -A would eventually put them all into a single pack, > > killing off the old packs. > > '-a' puts everything in a single pack and kills off old packs. Anything that > was unreachable is not repacked in the new pack. > > '-A' does the same thing but it also repacks the unreachable objects that were > previously packed. Ah, indeed. I hadn't looked closely at the -A behavior before. So yes, we are never killing off prunable packed objects. Probably we could use the same solution as "git prune --expire"; perhaps a "--keep-unreachable=2.weeks.ago"? -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:31 ` Jeff King @ 2008-05-08 21:40 ` Brandon Casey 2008-05-08 21:44 ` Jeff King 0 siblings, 1 reply; 50+ messages in thread From: Brandon Casey @ 2008-05-08 21:40 UTC (permalink / raw) To: Jeff King; +Cc: Nicolas Pitre, Guido Ostkamp, git Jeff King wrote: > On Thu, May 08, 2008 at 04:23:53PM -0500, Brandon Casey wrote: > >>> I thought that -A would eventually put them all into a single pack, >>> killing off the old packs. >> '-a' puts everything in a single pack and kills off old packs. Anything that >> was unreachable is not repacked in the new pack. >> >> '-A' does the same thing but it also repacks the unreachable objects that were >> previously packed. > > Ah, indeed. I hadn't looked closely at the -A behavior before. So yes, > we are never killing off prunable packed objects. Probably we could use > the same solution as "git prune --expire"; perhaps a > "--keep-unreachable=2.weeks.ago"? The 'prune --expire' behavior is based on object mtime (i.e. file modification time). That is lost once something is packed right? I was thinking that either repack or pack-objects could be modified to unpack those unreachable objects and leave them loose, and also give them the timestamp of the pack file they came from. Then the --expire behavior of git-prune could work normally and remove them. This seems like it would work nicely since prune follows repack in git-gc. -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:40 ` Brandon Casey @ 2008-05-08 21:44 ` Jeff King 2008-05-08 21:53 ` Brandon Casey 0 siblings, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-08 21:44 UTC (permalink / raw) To: Brandon Casey; +Cc: Nicolas Pitre, Guido Ostkamp, git On Thu, May 08, 2008 at 04:40:20PM -0500, Brandon Casey wrote: > The 'prune --expire' behavior is based on object mtime (i.e. file > modification time). That is lost once something is packed right? Yes. You would have to use the pack mtime. But of course you would have to actually _leave_ them in a pack, or they would just keep getting added to the new pack. > I was thinking that either repack or pack-objects could be modified to > unpack those unreachable objects and leave them loose, and also give > them the timestamp of the pack file they came from. Then the --expire > behavior of git-prune could work normally and remove them. This seems > like it would work nicely since prune follows repack in git-gc. That is sensible, I think. -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:44 ` Jeff King @ 2008-05-08 21:53 ` Brandon Casey 2008-05-08 22:48 ` Jeff King 0 siblings, 1 reply; 50+ messages in thread From: Brandon Casey @ 2008-05-08 21:53 UTC (permalink / raw) To: Jeff King; +Cc: Nicolas Pitre, Guido Ostkamp, git Jeff King wrote: > On Thu, May 08, 2008 at 04:40:20PM -0500, Brandon Casey wrote: > >> The 'prune --expire' behavior is based on object mtime (i.e. file >> modification time). That is lost once something is packed right? > > Yes. You would have to use the pack mtime. But of course you would have > to actually _leave_ them in a pack, or they would just keep getting > added to the new pack. I had the impression that unreachable objects would not be packed. Maybe it was more of an assumption. -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:53 ` Brandon Casey @ 2008-05-08 22:48 ` Jeff King 2008-05-09 1:41 ` Brandon Casey 0 siblings, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-08 22:48 UTC (permalink / raw) To: Brandon Casey; +Cc: Nicolas Pitre, Guido Ostkamp, git On Thu, May 08, 2008 at 04:53:20PM -0500, Brandon Casey wrote: > > Yes. You would have to use the pack mtime. But of course you would have > > to actually _leave_ them in a pack, or they would just keep getting > > added to the new pack. > > I had the impression that unreachable objects would not be packed. Maybe it > was more of an assumption. Look in builtin-pack-objects.c:1981-1982. We basically just say "if it's in a pack now, then it should go into the new pack." -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 22:48 ` Jeff King @ 2008-05-09 1:41 ` Brandon Casey 2008-05-09 3:21 ` Junio C Hamano 2008-05-09 4:19 ` git gc & deleted branches Jeff King 0 siblings, 2 replies; 50+ messages in thread From: Brandon Casey @ 2008-05-09 1:41 UTC (permalink / raw) To: git Jeff King <peff <at> peff.net> writes: > > On Thu, May 08, 2008 at 04:53:20PM -0500, Brandon Casey wrote: > > > > Yes. You would have to use the pack mtime. But of course you would have > > > to actually _leave_ them in a pack, or they would just keep getting > > > added to the new pack. > > > > I had the impression that unreachable objects would not be packed. Maybe it > > was more of an assumption. > > Look in builtin-pack-objects.c:1981-1982. We basically just say "if it's > in a pack now, then it should go into the new pack." > > -Peff > Here's what I was thinking (posted using gmane): diff --git a/git-repack.sh b/git-repack.sh index e18eb3f..064c331 100755 --- a/git-repack.sh +++ b/git-repack.sh @@ -30,7 +30,7 @@ do -n) no_update_info=t ;; -a) all_into_one=t ;; -A) all_into_one=t - keep_unreachable=--keep-unreachable ;; + keep_unreachable=t ;; -d) remove_redundant=t ;; -q) quiet=-q ;; -f) no_reuse=--no-reuse-object ;; @@ -78,9 +78,6 @@ case ",$all_into_one," in if test -z "$args" then args='--unpacked --incremental' - elif test -n "$keep_unreachable" - then - args="$args $keep_unreachable" fi ;; esac @@ -116,7 +113,15 @@ for name in $names ; do echo >&2 "old-pack-$name.{pack,idx} in $PACKDIR." exit 1 } - rm -f "$PACKDIR/old-pack-$name.pack" "$PACKDIR/old-pack-$name.idx" + rm -f "$PACKDIR/old-pack-$name.idx" + test -z "$keep_unreachable" || + ! test -f "$PACKDIR/old-pack-$name.pack" || + git unpack-objects < "$PACKDIR/old-pack-$name.pack" || { + echo >&2 "Failed unpacking unreachable objects from old pack" + echo >&2 "saved as old-pack-$name.pack in $PACKDIR." + exit 1 + } + rm -f "$PACKDIR/old-pack-$name.pack" done if test "$remove_redundant" = t @@ -130,7 +135,18 @@ then do case " $fullbases " in *" $e "*) ;; - *) rm -f "$e.pack" "$e.idx" "$e.keep" ;; + *) + rm -f "$e.idx" "$e.keep" + if test -n "$keep_unreachable" && + test -f "$e.pack" + then + git unpack-objects < "$e.pack" || { + echo >&2 "Fail AVOID GMANE WRAP" + exit 1 + } + fi + rm -f "$e.pack" + ;; esac done ) Is the first invocation of unpack-objects necessary? pack-objects has created a pack which hashes to the same name of a pack we already have, and we replace the original with the new one. Is that what is happening? They will be identical right? Of course this won't set the timestamp on the created objects based on the timestamp of the pack file, but this was easy. Setting the timestamp would be proper, but what's another two weeks. Besides, for those users not manually running git-gc, this code path won't even be executed until there are enough pack files for git-gc to add -A to the repack options. Then, for git-gc it should be enough to just always use -A with repack when manually running it. Then --prune can be deprecated. -brandon ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 1:41 ` Brandon Casey @ 2008-05-09 3:21 ` Junio C Hamano [not found] ` <ee63ef30805082105w7f04a2d1y65a4618aeb787cac@mail.gmail.com> ` (4 more replies) 2008-05-09 4:19 ` git gc & deleted branches Jeff King 1 sibling, 5 replies; 50+ messages in thread From: Junio C Hamano @ 2008-05-09 3:21 UTC (permalink / raw) To: Brandon Casey; +Cc: git Brandon Casey <drafnel@gmail.com> writes: > @@ -116,7 +113,15 @@ for name in $names ; do > echo >&2 "old-pack-$name.{pack,idx} in $PACKDIR." > exit 1 > } > - rm -f "$PACKDIR/old-pack-$name.pack" "$PACKDIR/old-pack-$name.idx" > + rm -f "$PACKDIR/old-pack-$name.idx" > + test -z "$keep_unreachable" || > + ! test -f "$PACKDIR/old-pack-$name.pack" || > + git unpack-objects < "$PACKDIR/old-pack-$name.pack" || { > + echo >&2 "Failed unpacking unreachable objects from old pack" > + echo >&2 "saved as old-pack-$name.pack in $PACKDIR." > + exit 1 > + } Neat trick. Unreachable objects that are only in this pack will get current timestamp and gets a new lease of life for two weeks and then will disappear. ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <ee63ef30805082105w7f04a2d1y65a4618aeb787cac@mail.gmail.com>]
[parent not found: <7v1w4bb291.fsf@gitster.siamese.dyndns.org>]
* Re: git gc & deleted branches [not found] ` <7v1w4bb291.fsf@gitster.siamese.dyndns.org> @ 2008-05-10 3:32 ` Brandon Casey 2008-05-10 4:15 ` Brandon Casey 0 siblings, 1 reply; 50+ messages in thread From: Brandon Casey @ 2008-05-10 3:32 UTC (permalink / raw) To: Junio C Hamano; +Cc: git I didn't read your email until now. It may sound strange, but I can't read my personal email from work. Also, looks like I inadvertantly took this message off-list. On Fri, May 9, 2008 at 4:23 PM, Junio C Hamano <gitster@pobox.com> wrote: > "Brandon Casey" <drafnel@gmail.com> writes: > >> Is this invocation of pack-objects that you commented on necessary >> though? This is the section that is replacing an existing pack file >> with a newly created pack file of the same name... > > Yuck, I failed to see it in the context. You are right. This won't do > anything I suspect. Now I'm confused. I hope you realized I said "pack-objects" when I meant "unpack-objects". It was only the first invocation of unpack-objects that I thought may be unnecessary. This is because pack-objects (yes, pack) created a new pack with the same name as an existing pack which means (if I'm thinking about things correctly) that none of the objects inside the pack are unreachable. So trying to unpack-objects is a waste of time. There is a second invocation of unpack-objects which is run on packs which will be deleted. This one should unpack any unreferenced objects from each pack. -brandon The remainder of your email is below since the list has never seen it. > Perhaps we would want a new option --eject that causes pack-object write > out unreachable packed objects in the loose format? This would require a > minor surgery to write_sha1_file() so that it won't pay attention to the > return value of has_sha1_file(sha1), perhaps like this. > > This is a lunch-time hack and needs more serious work, such as fixing > horrible "write_sha1_file_1" to something more sensible. > > Also it should add --eject as an independent and incompatible option > instead of dropping --keep-unpacked support. > > --- > > builtin-pack-objects.c | 56 ++++++++++++----------------------------------- > sha1_file.c | 9 ++++++- > 2 files changed, 22 insertions(+), 43 deletions(-) > > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c > index 777f272..8240a4b 100644 > --- a/builtin-pack-objects.c > +++ b/builtin-pack-objects.c > @@ -1836,38 +1836,25 @@ struct in_pack { > struct in_pack_object *array; > }; > > -static void mark_in_pack_object(struct object *object, struct packed_git *p, struct in_pack *in_pack) > +static void eject_object(struct object *obj) > { > - in_pack->array[in_pack->nr].offset = find_pack_entry_one(object->sha1, p); > - in_pack->array[in_pack->nr].object = object; > - in_pack->nr++; > -} > - > -/* > - * Compare the objects in the offset order, in order to emulate the > - * "git-rev-list --objects" output that produced the pack originally. > - */ > -static int ofscmp(const void *a_, const void *b_) > -{ > - struct in_pack_object *a = (struct in_pack_object *)a_; > - struct in_pack_object *b = (struct in_pack_object *)b_; > + enum object_type type; > + unsigned long size; > + void *data; > + unsigned char sha1[20]; > > - if (a->offset < b->offset) > - return -1; > - else if (a->offset > b->offset) > - return 1; > - else > - return hashcmp(a->object->sha1, b->object->sha1); > + data = read_sha1_file(obj->sha1, &type, &size); > + assert(data && type == obj->type); > + if (write_sha1_file_1(data, size, typename(type), sha1, 1)) > + die("cannot eject packed object %s", sha1_to_hex(obj->sha1)); > + assert(!memcmp(sha1, obj->sha1)); > } > > -static void add_objects_in_unpacked_packs(struct rev_info *revs) > +static void eject_objects_in_unpacked_packs(struct rev_info *revs) > { > struct packed_git *p; > - struct in_pack in_pack; > uint32_t i; > > - memset(&in_pack, 0, sizeof(in_pack)); > - > for (p = packed_git; p; p = p->next) { > const unsigned char *sha1; > struct object *o; > @@ -1881,28 +1868,15 @@ static void add_objects_in_unpacked_packs(struct rev_info *revs) > if (open_pack_index(p)) > die("cannot open pack index"); > > - ALLOC_GROW(in_pack.array, > - in_pack.nr + p->num_objects, > - in_pack.alloc); > - > for (i = 0; i < p->num_objects; i++) { > sha1 = nth_packed_object_sha1(p, i); > o = lookup_unknown_object(sha1); > - if (!(o->flags & OBJECT_ADDED)) > - mark_in_pack_object(o, p, &in_pack); > + if (o->flags & OBJECT_ADDED) > + continue; > o->flags |= OBJECT_ADDED; > + eject_object(o); > } > } > - > - if (in_pack.nr) { > - qsort(in_pack.array, in_pack.nr, sizeof(in_pack.array[0]), > - ofscmp); > - for (i = 0; i < in_pack.nr; i++) { > - struct object *o = in_pack.array[i].object; > - add_object_entry(o->sha1, o->type, "", 0); > - } > - } > - free(in_pack.array); > } > > static void get_object_list(int ac, const char **av) > @@ -1938,7 +1912,7 @@ static void get_object_list(int ac, const char **av) > traverse_commit_list(&revs, show_commit, show_object); > > if (keep_unreachable) > - add_objects_in_unpacked_packs(&revs); > + eject_objects_in_unpacked_packs(&revs); > } > > static int adjust_perm(const char *path, mode_t mode) > diff --git a/sha1_file.c b/sha1_file.c > index 3516777..ce3a661 100644 > --- a/sha1_file.c > +++ b/sha1_file.c > @@ -2102,7 +2102,7 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type, > return 0; > } > > -int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1) > +int write_sha1_file_1(void *buf, unsigned long len, const char *type, unsigned char *returnsha1, int refresh) > { > int size, ret; > unsigned char *compressed; > @@ -2120,7 +2120,7 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha > filename = sha1_file_name(sha1); > if (returnsha1) > hashcpy(returnsha1, sha1); > - if (has_sha1_file(sha1)) > + if (!refresh && has_sha1_file(sha1)) > return 0; > fd = open(filename, O_RDONLY); > if (fd >= 0) { > @@ -2185,6 +2185,11 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha > return move_temp_to_file(tmpfile, filename); > } > > +int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1) > +{ > + return write_sha1_file_1(buf, len, type, returnsha1, 0); > +} > + > /* > * We need to unpack and recompress the object for writing > * it out to a different file. > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 3:32 ` Brandon Casey @ 2008-05-10 4:15 ` Brandon Casey 0 siblings, 0 replies; 50+ messages in thread From: Brandon Casey @ 2008-05-10 4:15 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, May 9, 2008 at 10:32 PM, Brandon Casey <drafnel@gmail.com> wrote: > On Fri, May 9, 2008 at 4:23 PM, Junio C Hamano <gitster@pobox.com> wrote: > The remainder of your email is below since the list has never seen it. > > >> Perhaps we would want a new option --eject that causes pack-object write >> out unreachable packed objects in the loose format? Two comments on this. 1) If you decide to go in this direction, I think the option should be named --unpack-unreachable rather than --eject. 2) There is some part of me that is against a program named pack-objects doing unpacking. But it is probably more efficient this way. -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 0/3] leave unreferenced objects unpacked 2008-05-09 3:21 ` Junio C Hamano [not found] ` <ee63ef30805082105w7f04a2d1y65a4618aeb787cac@mail.gmail.com> @ 2008-05-10 4:01 ` drafnel 2008-05-10 4:01 ` [PATCH 1/3] repack: modify behavior of -A option to " drafnel ` (2 subsequent siblings) 4 siblings, 0 replies; 50+ messages in thread From: drafnel @ 2008-05-10 4:01 UTC (permalink / raw) To: gitster; +Cc: git Here is a formal patch. I removed the first invocation of unpack-objects since I think it is unneccessary. Followed up with mods to git-gc. -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 1/3] repack: modify behavior of -A option to leave unreferenced objects unpacked 2008-05-09 3:21 ` Junio C Hamano [not found] ` <ee63ef30805082105w7f04a2d1y65a4618aeb787cac@mail.gmail.com> 2008-05-10 4:01 ` [PATCH 0/3] leave unreferenced objects unpacked drafnel @ 2008-05-10 4:01 ` drafnel 2008-05-10 6:03 ` Jeff King 2008-05-10 4:01 ` [PATCH 2/3] git-gc: always use -A when manually repacking drafnel 2008-05-10 4:01 ` [PATCH 3/3] builtin-gc.c: deprecate --prune, it now really has no effect drafnel 4 siblings, 1 reply; 50+ messages in thread From: drafnel @ 2008-05-10 4:01 UTC (permalink / raw) To: gitster; +Cc: git, Brandon Casey From: Brandon Casey <drafnel@gmail.com> The previous behavior of the -A option was to retain any previously packed objects which had become unreferenced, and place them into the newly created pack file. Since git-gc, when run automatically with the --auto option, calls repack with the -A option, this had the effect of retaining unreferenced packed objects indefinitely. To avoid this scenario, the user was required to run git-gc with the little known --prune option or to manually run repack with the -a option. This patch changes the behavior of the -A option so that unreferenced objects that exist in any pack file being replaced, will be unpacked into the repository. The unreferenced loose objects can then be garbage collected by git-gc (i.e. git-prune) based on the gc.pruneExpire setting. Also add new tests for checking whether unreferenced objects which were previously packed are properly left in the repository unpacked after repacking. Signed-off-by: Brandon Casey <drafnel@gmail.com> --- git-repack.sh | 18 +++++++++--- t/t7701-repack-unpack-unreachable.sh | 47 ++++++++++++++++++++++++++++++++++ 2 files changed, 60 insertions(+), 5 deletions(-) create mode 100755 t/t7701-repack-unpack-unreachable.sh diff --git a/git-repack.sh b/git-repack.sh index e18eb3f..a0e06ed 100755 --- a/git-repack.sh +++ b/git-repack.sh @@ -30,7 +30,7 @@ do -n) no_update_info=t ;; -a) all_into_one=t ;; -A) all_into_one=t - keep_unreachable=--keep-unreachable ;; + keep_unreachable=t ;; -d) remove_redundant=t ;; -q) quiet=-q ;; -f) no_reuse=--no-reuse-object ;; @@ -78,9 +78,6 @@ case ",$all_into_one," in if test -z "$args" then args='--unpacked --incremental' - elif test -n "$keep_unreachable" - then - args="$args $keep_unreachable" fi ;; esac @@ -130,7 +127,18 @@ then do case " $fullbases " in *" $e "*) ;; - *) rm -f "$e.pack" "$e.idx" "$e.keep" ;; + *) + rm -f "$e.idx" "$e.keep" + if test -n "$keep_unreachable" && + test -f "$e.pack" + then + git unpack-objects < "$e.pack" || { + echo >&2 "Failed unpacking unreachable objects from redundant pack file $e.pack" + exit 1 + } + fi + rm -f "$e.pack" + ;; esac done ) diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh new file mode 100755 index 0000000..6a5211f --- /dev/null +++ b/t/t7701-repack-unpack-unreachable.sh @@ -0,0 +1,47 @@ +#!/bin/sh + +test_description='git-repack works correctly' + +. ./test-lib.sh + +test_expect_success '-A option leaves unreachable objects unpacked' ' + echo content > file1 && + git add . && + git commit -m initial_commit && + # create a transient branch with unique content + git checkout -b transient_branch && + echo more content >> file1 && + # record the objects created in the database for file, commit, tree + fsha1=$(git hash-object file1) && + git commit -a -m more_content && + csha1=$(git rev-parse HEAD^{commit}) && + tsha1=$(git rev-parse HEAD^{tree}) && + git checkout master && + echo even more content >> file1 && + git commit -a -m even_more_content && + # delete the transient branch + git branch -D transient_branch && + # pack the repo + git repack -A -d -l && + # verify objects are packed in repository + test 3 = $(git verify-pack -v -- .git/objects/pack/*.idx | + grep -e "^$fsha1 " -e "^$csha1 " -e "^$tsha1 " | + sort | uniq | wc -l) && + git show $fsha1 && + git show $csha1 && + git show $tsha1 && + # now expire the reflog + sleep 1 && + git reflog expire --expire-unreachable=now --all && + # and repack + git repack -A -d -l && + # verify objects are retained unpacked + test 0 = $(git verify-pack -v -- .git/objects/pack/*.idx | + grep -e "^$fsha1 " -e "^$csha1 " -e "^$tsha1 " | + sort | uniq | wc -l) && + git show $fsha1 && + git show $csha1 && + git show $tsha1 +' + +test_done -- 1.5.5.67.g9a49 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 1/3] repack: modify behavior of -A option to leave unreferenced objects unpacked 2008-05-10 4:01 ` [PATCH 1/3] repack: modify behavior of -A option to " drafnel @ 2008-05-10 6:03 ` Jeff King 2008-05-11 1:10 ` Nicolas Pitre 2008-05-11 4:16 ` Brandon Casey 0 siblings, 2 replies; 50+ messages in thread From: Jeff King @ 2008-05-10 6:03 UTC (permalink / raw) To: drafnel; +Cc: gitster, git On Fri, May 09, 2008 at 11:01:55PM -0500, drafnel@gmail.com wrote: > - keep_unreachable=--keep-unreachable ;; > + keep_unreachable=t ;; Can we call this something else (like unpack_unreachable) since it now has nothing to do with the --keep-unreachable flag? Also, should --keep-unreachable be deprecated / removed? > + *) > + rm -f "$e.idx" "$e.keep" > + if test -n "$keep_unreachable" && > + test -f "$e.pack" > + then > + git unpack-objects < "$e.pack" || { > + echo >&2 "Failed unpacking unreachable objects from redundant pack file $e.pack" > + exit 1 > + } > + fi I still like Geert's suggestion of unpacking them to a _different_ place. That helps to avoid spurious "gc --auto" invocations caused by too many prunable objects. Though it certainly doesn't solve it, and maybe that just needs to be fixed separately. Possibly the "gc --auto" test should be: - count objects; if too few, exit - count unreachable loose objects; if too few, exit - run gc That means having a lot of unreachable objects will still incur some extra processing, but not as much as a full repack. And it won't bug the user with a "you need to repack" message. -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 1/3] repack: modify behavior of -A option to leave unreferenced objects unpacked 2008-05-10 6:03 ` Jeff King @ 2008-05-11 1:10 ` Nicolas Pitre 2008-05-11 1:23 ` Junio C Hamano 2008-05-11 4:16 ` Brandon Casey 1 sibling, 1 reply; 50+ messages in thread From: Nicolas Pitre @ 2008-05-11 1:10 UTC (permalink / raw) To: Jeff King; +Cc: drafnel, gitster, git On Sat, 10 May 2008, Jeff King wrote: > Also, should --keep-unreachable be deprecated / removed? Depends. If it has no maintenance cost then we might as well keep it around. > I still like Geert's suggestion of unpacking them to a _different_ > place. That helps to avoid spurious "gc --auto" invocations caused by > too many prunable objects. Though it certainly doesn't solve it, and > maybe that just needs to be fixed separately. Having a separate location for objects seems clunky to me. And the fundamental problem isn't solved indeed -- you may end up with many non expired unreachable loose objects already without packing them. > Possibly the "gc --auto" test should be: > > - count objects; if too few, exit > - count unreachable loose objects; if too few, exit Determining the number of unreachable objects is quite costly, packed or not. So that isn't a good thing to do on every 'git gc --auto' invokation. > - run gc > > That means having a lot of unreachable objects will still incur some > extra processing, but not as much as a full repack. And it won't bug the > user with a "you need to repack" message. The auto gc performs incremental packing most of the time. And that is way faster than figuring out which objects are unreachable. For example, running 'git prune' in my Linux repo takes 16 seconds, even when there is nothing to prune. Running 'git repack' (with no option so to perform an incremental repack) took less than 2 seconds to pack 541 reachable objects that happened to be loose. I'm now starting to wonder if there is a reason for keeping unreachable objects that used to be packed. Putting --keep-unreachable aside for now, the only way an unreachable object could have entered a pack is if it used to be reachable before through the commit history or reflog. So if they're not reachable anymore, that's most probably because their reflog expired. So what's the point for keeping them even longer? What's the reasoning that led to the creation of --keep-unreachable in the first place? Nicolas ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 1/3] repack: modify behavior of -A option to leave unreferenced objects unpacked 2008-05-11 1:10 ` Nicolas Pitre @ 2008-05-11 1:23 ` Junio C Hamano 0 siblings, 0 replies; 50+ messages in thread From: Junio C Hamano @ 2008-05-11 1:23 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jeff King, drafnel, git Nicolas Pitre <nico@cam.org> writes: > I'm now starting to wonder if there is a reason for keeping unreachable > objects that used to be packed. Putting --keep-unreachable aside for > now, the only way an unreachable object could have entered a pack is if > it used to be reachable before through the commit history or reflog. > So if they're not reachable anymore, that's most probably because their > reflog expired. So what's the point for keeping them even longer? > What's the reasoning that led to the creation of --keep-unreachable in > the first place? I think the logic went like this. (1) You may have rewound your head since you last repacked; blobs and trees in the rewound commit are already packed now. (2) Now you may be fetching (or somebody else may be pushing) a commit that contains such blobs and/or trees, and the fetch or push is small enough that it unpacks, but the packed and unreachable ones are not unpacked. (3) But before that fetch or push finishes to update the ref, you can race with a "repack -a -d". ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 1/3] repack: modify behavior of -A option to leave unreferenced objects unpacked 2008-05-10 6:03 ` Jeff King 2008-05-11 1:10 ` Nicolas Pitre @ 2008-05-11 4:16 ` Brandon Casey 2008-05-11 4:51 ` Brandon Casey 1 sibling, 1 reply; 50+ messages in thread From: Brandon Casey @ 2008-05-11 4:16 UTC (permalink / raw) To: Jeff King; +Cc: gitster, git On Sat, May 10, 2008 at 1:03 AM, Jeff King <peff@peff.net> wrote: > On Fri, May 09, 2008 at 11:01:55PM -0500, drafnel@gmail.com wrote: > >> - keep_unreachable=--keep-unreachable ;; >> + keep_unreachable=t ;; > > Can we call this something else (like unpack_unreachable) since it now > has nothing to do with the --keep-unreachable flag? Actually I initially changed it to unpack_unreachable, and then changed it back. The reason I did this is because I think keep_unreachable still describes what is being accomplished, that unreachables are being kept. When -A is supplied along with -d, unreachables are kept by being unpacked. When -d is not supplied, unreachables are kept in their original pack file. If Geert's proposal or something else is implemented, keep_unreachable may still be appropriate. hmm? > Also, should --keep-unreachable be deprecated / removed? > >> + *) >> + rm -f "$e.idx" "$e.keep" >> + if test -n "$keep_unreachable" && >> + test -f "$e.pack" >> + then >> + git unpack-objects < "$e.pack" || { >> + echo >&2 "Failed unpacking unreachable objects from redundant pack file $e.pack" >> + exit 1 >> + } >> + fi > > I still like Geert's suggestion of unpacking them to a _different_ > place. That helps to avoid spurious "gc --auto" invocations caused by > too many prunable objects. Though it certainly doesn't solve it, and > maybe that just needs to be fixed separately. That was my thinking. > > Possibly the "gc --auto" test should be: > > - count objects; if too few, exit > - count unreachable loose objects; if too few, exit > - run gc > > That means having a lot of unreachable objects will still incur some > extra processing, but not as much as a full repack. And it won't bug the > user with a "you need to repack" message. I've got a thought. How about limiting how often auto repack repacks by looking at the timestamp of the most recent pack? Wouldn't the packs already be prepared in most cases i.e. prepare_packed_git() -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 1/3] repack: modify behavior of -A option to leave unreferenced objects unpacked 2008-05-11 4:16 ` Brandon Casey @ 2008-05-11 4:51 ` Brandon Casey 0 siblings, 0 replies; 50+ messages in thread From: Brandon Casey @ 2008-05-11 4:51 UTC (permalink / raw) To: Jeff King; +Cc: gitster, git On Sat, May 10, 2008 at 11:16 PM, Brandon Casey <drafnel@gmail.com> wrote: > I've got a thought. How about limiting how often auto repack repacks > by looking at the timestamp of the most recent pack? Wouldn't the > packs already be prepared in most cases i.e. prepare_packed_git() completely untested and hopefully not mangled by google... actually, this will do nothing for the case where there exists many loose unreachable objects and no loose reachable objects since we won't create a new pack with an updated timestamp to compare against. So git-gc will continue to spin its wheels without getting anywhere. Could we update the pack timestamp after running git-gc or use a timestamp from someplace else? -brandon diff --git a/builtin-gc.c b/builtin-gc.c index 48f7d95..16b1455 100644 --- a/builtin-gc.c +++ b/builtin-gc.c @@ -27,6 +27,7 @@ static int aggressive_window = -1; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; static char *prune_expire = "2.weeks.ago"; +static time_t gc_auto_pack_frequency = 21600; /* 6 hours */ #define MAX_ADD 10 static const char *argv_pack_refs[] = {"pack-refs", "--all", "--prune", NULL}; @@ -56,6 +57,10 @@ static int gc_config(const char *var, const char *value) gc_auto_pack_limit = git_config_int(var, value); return 0; } + if (!strcmp(var, "gc.autopackfrequency")) { + gc_auto_pack_frequency = git_config_ulong(var, value); + return 0; + } if (!strcmp(var, "gc.pruneexpire")) { if (!value) return config_error_nonbool(var); @@ -205,6 +210,14 @@ static int need_to_gc(void) else if (!too_many_loose_objects()) return 0; + if (gc_auto_pack_frequency) { + prepare_packed_git(); + if (packed_git && + packed_git->mtime > + approxidate("now") - gc_auto_pack_frequency) + return 0; + } + if (run_hook()) return 0; return 1; ^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 2/3] git-gc: always use -A when manually repacking 2008-05-09 3:21 ` Junio C Hamano ` (2 preceding siblings ...) 2008-05-10 4:01 ` [PATCH 1/3] repack: modify behavior of -A option to " drafnel @ 2008-05-10 4:01 ` drafnel 2008-05-10 4:01 ` [PATCH 3/3] builtin-gc.c: deprecate --prune, it now really has no effect drafnel 4 siblings, 0 replies; 50+ messages in thread From: drafnel @ 2008-05-10 4:01 UTC (permalink / raw) To: gitster; +Cc: git, Brandon Casey From: Brandon Casey <drafnel@gmail.com> Now that repack -A will leave unreferenced objects unpacked, there is no reason to use the -a option to repack (which will discard unreferenced objects). The unpacked unreferenced objects will not be repacked by a subsequent repack, and will eventually be pruned by git-gc based on the gc.pruneExpire config option. --- builtin-gc.c | 13 ++----------- 1 files changed, 2 insertions(+), 11 deletions(-) diff --git a/builtin-gc.c b/builtin-gc.c index f99ebc7..6db2f51 100644 --- a/builtin-gc.c +++ b/builtin-gc.c @@ -256,17 +256,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix) "performance. You may also\n" "run \"git gc\" manually. See " "\"git help gc\" for more information.\n"); - } else { - /* - * Use safer (for shared repos) "-A" option to - * repack when not pruning. Auto-gc makes its - * own decision. - */ - if (prune) - append_option(argv_repack, "-a", MAX_ADD); - else - append_option(argv_repack, "-A", MAX_ADD); - } + } else + append_option(argv_repack, "-A", MAX_ADD); if (pack_refs && run_command_v_opt(argv_pack_refs, RUN_GIT_CMD)) return error(FAILED_RUN, argv_pack_refs[0]); -- 1.5.5.67.g9a49 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 3/3] builtin-gc.c: deprecate --prune, it now really has no effect 2008-05-09 3:21 ` Junio C Hamano ` (3 preceding siblings ...) 2008-05-10 4:01 ` [PATCH 2/3] git-gc: always use -A when manually repacking drafnel @ 2008-05-10 4:01 ` drafnel 4 siblings, 0 replies; 50+ messages in thread From: drafnel @ 2008-05-10 4:01 UTC (permalink / raw) To: gitster; +Cc: git, Brandon Casey From: Brandon Casey <drafnel@gmail.com> --- builtin-gc.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/builtin-gc.c b/builtin-gc.c index 6db2f51..48f7d95 100644 --- a/builtin-gc.c +++ b/builtin-gc.c @@ -219,7 +219,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix) char buf[80]; struct option builtin_gc_options[] = { - OPT_BOOLEAN(0, "prune", &prune, "prune unreferenced objects"), + OPT_BOOLEAN(0, "prune", &prune, "prune unreferenced objects (deprecated)"), OPT_BOOLEAN(0, "aggressive", &aggressive, "be more thorough (increased runtime)"), OPT_BOOLEAN(0, "auto", &auto_gc, "enable auto-gc mode"), OPT_BOOLEAN('q', "quiet", &quiet, "suppress progress reports"), @@ -249,7 +249,6 @@ int cmd_gc(int argc, const char **argv, const char *prefix) /* * Auto-gc should be least intrusive as possible. */ - prune = 0; if (!need_to_gc()) return 0; fprintf(stderr, "Auto packing your repository for optimum " -- 1.5.5.67.g9a49 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 1:41 ` Brandon Casey 2008-05-09 3:21 ` Junio C Hamano @ 2008-05-09 4:19 ` Jeff King 2008-05-09 15:00 ` Geert Bosch 1 sibling, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-09 4:19 UTC (permalink / raw) To: Brandon Casey; +Cc: git On Fri, May 09, 2008 at 01:41:30AM +0000, Brandon Casey wrote: > Here's what I was thinking (posted using gmane): > > diff --git a/git-repack.sh b/git-repack.sh > index e18eb3f..064c331 100755 > --- a/git-repack.sh > +++ b/git-repack.sh I like it. It makes an easy rule to say "packed objects _never_ get pruned, they only get demoted to loose objects." And then of course we have sane rules for pruning loose objects. > - rm -f "$PACKDIR/old-pack-$name.pack" "$PACKDIR/old-pack-$name.idx" > + rm -f "$PACKDIR/old-pack-$name.idx" > + test -z "$keep_unreachable" || > + ! test -f "$PACKDIR/old-pack-$name.pack" || > + git unpack-objects < "$PACKDIR/old-pack-$name.pack" || { > + echo >&2 "Failed unpacking unreachable objects from old pack" > + echo >&2 "saved as old-pack-$name.pack in $PACKDIR." > + exit 1 > + } > + rm -f "$PACKDIR/old-pack-$name.pack" > [...] > > Is the first invocation of unpack-objects necessary? pack-objects has created > a pack which hashes to the same name of a pack we already have, and we replace > the original with the new one. Is that what is happening? They will be > identical right? Yeah, that's what it looks like to me (that the first unpack is unnecessary, because we will just be putting the new pack into place that has all the same objects). AIUI, two packs with identical hashes must contain the exact same objects. > Of course this won't set the timestamp on the created objects based on the > timestamp of the pack file, but this was easy. Setting the timestamp would be > proper, but what's another two weeks. Besides, for those users not manually > running git-gc, this code path won't even be executed until there are enough > pack files for git-gc to add -A to the repack options. I think the extra two weeks is fine. -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 4:19 ` git gc & deleted branches Jeff King @ 2008-05-09 15:00 ` Geert Bosch 2008-05-09 15:14 ` Brandon Casey 0 siblings, 1 reply; 50+ messages in thread From: Geert Bosch @ 2008-05-09 15:00 UTC (permalink / raw) To: Jeff King; +Cc: Brandon Casey, git On May 9, 2008, at 00:19, Jeff King wrote: > I like it. It makes an easy rule to say "packed objects _never_ get > pruned, they only get demoted to loose objects." And then of course > we have sane rules for pruning loose objects. Isn't there an issue with the "git gc" triggering because there may be too many loose unreferenced objects? Still, I do like the approach. Maybe unreferenced objects and old refs should go to a .git/lost+found directory and be expired from there. This has a couple of benefits: - Easy to manually inspect or blow away any crud - One git-gc run can make one pack in lost+found, avoiding huge numbers of loose objects (and massive disk use) when trying to do a large cleanup (to possibly reclaim disk space) - Objects will not be accessible by ordinary git commands for a while, before they are really removed, avoiding surprises Only some tools would look in the lost+found to restore stuff. -Geert ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 15:00 ` Geert Bosch @ 2008-05-09 15:14 ` Brandon Casey 2008-05-09 15:53 ` Jeff King 2008-05-09 16:12 ` Nicolas Pitre 0 siblings, 2 replies; 50+ messages in thread From: Brandon Casey @ 2008-05-09 15:14 UTC (permalink / raw) To: Geert Bosch; +Cc: Jeff King, git Geert Bosch wrote: > > On May 9, 2008, at 00:19, Jeff King wrote: > >> I like it. It makes an easy rule to say "packed objects _never_ get >> pruned, they only get demoted to loose objects." And then of course >> we have sane rules for pruning loose objects. > > Isn't there an issue with the "git gc" triggering because there > may be too many loose unreferenced objects? > Still, I do like the approach. This would be an argument for going the extra mile and having the loose objects adopt the timestamp of their pack file. In the normal case they would probably be pruned immediately during the same git-gc run. > Maybe unreferenced objects and old refs should go to a .git/lost+found > directory and be expired from there. This has a couple of benefits: > - Objects will not be accessible by ordinary git commands for a while, > before they are really removed, avoiding surprises Unreferenced objects are sometimes used by other repositories which have this repository listed as an alternate. So it may not be a good idea to make the unreferenced objects inaccessible. -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 15:14 ` Brandon Casey @ 2008-05-09 15:53 ` Jeff King 2008-05-09 15:56 ` Brandon Casey 2008-05-09 16:12 ` Nicolas Pitre 1 sibling, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-09 15:53 UTC (permalink / raw) To: Brandon Casey; +Cc: Geert Bosch, git On Fri, May 09, 2008 at 10:14:12AM -0500, Brandon Casey wrote: > > - Objects will not be accessible by ordinary git commands for a while, > > before they are really removed, avoiding surprises > > Unreferenced objects are sometimes used by other repositories which have > this repository listed as an alternate. So it may not be a good idea to > make the unreferenced objects inaccessible. But that is precisely what we're going to do, but in two weeks. Isn't it better to have the dependent repo fail while the change is recoverable? -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 15:53 ` Jeff King @ 2008-05-09 15:56 ` Brandon Casey 0 siblings, 0 replies; 50+ messages in thread From: Brandon Casey @ 2008-05-09 15:56 UTC (permalink / raw) To: Jeff King; +Cc: Geert Bosch, git Jeff King wrote: > On Fri, May 09, 2008 at 10:14:12AM -0500, Brandon Casey wrote: > >>> - Objects will not be accessible by ordinary git commands for a while, >>> before they are really removed, avoiding surprises >> Unreferenced objects are sometimes used by other repositories which have >> this repository listed as an alternate. So it may not be a good idea to >> make the unreferenced objects inaccessible. > > But that is precisely what we're going to do, but in two weeks. Isn't it > better to have the dependent repo fail while the change is recoverable? good point. -b ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 15:14 ` Brandon Casey 2008-05-09 15:53 ` Jeff King @ 2008-05-09 16:12 ` Nicolas Pitre 2008-05-09 16:54 ` Brandon Casey 2008-05-09 22:33 ` Junio C Hamano 1 sibling, 2 replies; 50+ messages in thread From: Nicolas Pitre @ 2008-05-09 16:12 UTC (permalink / raw) To: Brandon Casey; +Cc: Geert Bosch, Jeff King, git On Fri, 9 May 2008, Brandon Casey wrote: > Geert Bosch wrote: > > > > On May 9, 2008, at 00:19, Jeff King wrote: > > > >> I like it. It makes an easy rule to say "packed objects _never_ get > >> pruned, they only get demoted to loose objects." And then of course > >> we have sane rules for pruning loose objects. > > > > Isn't there an issue with the "git gc" triggering because there > > may be too many loose unreferenced objects? > > Still, I do like the approach. > > This would be an argument for going the extra mile and having the loose > objects adopt the timestamp of their pack file. In the normal case they > would probably be pruned immediately during the same git-gc run. Well, not necessarily. If you created a large branch yesterday and you are deleting it today, then if you repacked in between means that those loose objects won't be more than one day old. Yet there could be enough of them to trigger auto gc. But that auto gc won't pack those objects since they are unreferenced. Hence auto gc will trigger all the time without making any progress. > > Maybe unreferenced objects and old refs should go to a .git/lost+found > > directory and be expired from there. This has a couple of benefits: > > > - Objects will not be accessible by ordinary git commands for a while, > > before they are really removed, avoiding surprises > > Unreferenced objects are sometimes used by other repositories which have > this repository listed as an alternate. So it may not be a good idea to > make the unreferenced objects inaccessible. Nah. If this is really the case then you shouldn't be running gc at all in the first place. Nicolas ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 16:12 ` Nicolas Pitre @ 2008-05-09 16:54 ` Brandon Casey 2008-05-09 22:33 ` Junio C Hamano 1 sibling, 0 replies; 50+ messages in thread From: Brandon Casey @ 2008-05-09 16:54 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Geert Bosch, Jeff King, git Nicolas Pitre wrote: > On Fri, 9 May 2008, Brandon Casey wrote: > >> Geert Bosch wrote: >>> On May 9, 2008, at 00:19, Jeff King wrote: >>> >>>> I like it. It makes an easy rule to say "packed objects _never_ get >>>> pruned, they only get demoted to loose objects." And then of course >>>> we have sane rules for pruning loose objects. >>> Isn't there an issue with the "git gc" triggering because there >>> may be too many loose unreferenced objects? >>> Still, I do like the approach. >> This would be an argument for going the extra mile and having the loose >> objects adopt the timestamp of their pack file. In the normal case they >> would probably be pruned immediately during the same git-gc run. > > Well, not necessarily. If you created a large branch yesterday and you > are deleting it today, then if you repacked in between means that those > loose objects won't be more than one day old. Yet there could be enough > of them to trigger auto gc. But that auto gc won't pack those objects > since they are unreferenced. Hence auto gc will trigger all the time > without making any progress. That's true, but the intermediate repack is not the cause here. You'd be in the same situation if a large branch was created yesterday and then deleted today even if packing had never occurred. I do see your point, but you should have said a large branch created a month ago, deleted today, but repacked yesterday. :) -brandon ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 16:12 ` Nicolas Pitre 2008-05-09 16:54 ` Brandon Casey @ 2008-05-09 22:33 ` Junio C Hamano 2008-05-09 23:09 ` [PATCH] Updating documentation to match Brandon Casey's proposed git-repack patch Chris Frey 2008-05-10 0:07 ` git gc & deleted branches Jeremy Maitin-Shepard 1 sibling, 2 replies; 50+ messages in thread From: Junio C Hamano @ 2008-05-09 22:33 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Brandon Casey, Geert Bosch, Jeff King, git Nicolas Pitre <nico@cam.org> writes: > On Fri, 9 May 2008, Brandon Casey wrote: > >> Unreferenced objects are sometimes used by other repositories which have >> this repository listed as an alternate. So it may not be a good idea to >> make the unreferenced objects inaccessible. > > Nah. If this is really the case then you shouldn't be running gc at all > in the first place. True. I think the true motivation behind --keep-unreachable is not about the shared object store (aka "alternates") but about races between gc and push (or fetch). Before push (or fetch) finishes and updates refs, the new objects they create would be dangling _and_ the objects these dangling objects refer to may be packed but unreferenced. Repacking unreferenced packed objects was a way to avoid losing them. ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH] Updating documentation to match Brandon Casey's proposed git-repack patch. 2008-05-09 22:33 ` Junio C Hamano @ 2008-05-09 23:09 ` Chris Frey 2008-05-10 0:07 ` git gc & deleted branches Jeremy Maitin-Shepard 1 sibling, 0 replies; 50+ messages in thread From: Chris Frey @ 2008-05-09 23:09 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, Brandon Casey, Geert Bosch, Jeff King, git On Fri, May 09, 2008 at 03:33:41PM -0700, Junio C Hamano wrote: > I think the true motivation behind --keep-unreachable is not about the > shared object store (aka "alternates") but about races between gc and > push (or fetch). Before push (or fetch) finishes and updates refs, the > new objects they create would be dangling _and_ the objects these dangling > objects refer to may be packed but unreferenced. Repacking unreferenced > packed objects was a way to avoid losing them. This is what the log history seems to indicate: git log -p --grep=keep-unreach So pack-objects --keep-unreachable was implemented in order to add repack -A, which now doesn't need --keep-unreachable, and can become obsolete. Which is just as well, since --keep-unreachable never made it to the man pages. :-) If I understand things correctly, there is no user-friendly way to add loose, unreachable objects to a pack. This whole architecture was just to prevent a repack from silently deleting things. If this is right, the patch below updates the docs. - Chris >From 443b1201d54f0b7197d18779ce934823e9897b36 Mon Sep 17 00:00:00 2001 From: Chris Frey <cdfrey@foursquare.net> Date: Fri, 9 May 2008 19:08:26 -0400 Subject: [PATCH] Updating documentation to match Brandon Casey's proposed git-repack patch. This patch clarifies the git-prune man page, documenting that it only prunes unpacked objects. git-repack is documented according to the new git-repack -A behaviour, which does not depend on git-pack-objects --keep-unreachable anymore. Signed-off-by: Chris Frey <cdfrey@foursquare.net> --- Documentation/git-prune.txt | 5 ++++- Documentation/git-repack.txt | 14 +++++++++++++- 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/Documentation/git-prune.txt b/Documentation/git-prune.txt index f92bb8c..3178bc4 100644 --- a/Documentation/git-prune.txt +++ b/Documentation/git-prune.txt @@ -18,12 +18,15 @@ git-prune. See the section "NOTES", below. This runs `git-fsck --unreachable` using all the refs available in `$GIT_DIR/refs`, optionally with additional set of -objects specified on the command line, and prunes all +objects specified on the command line, and prunes all unpacked objects unreachable from any of these head objects from the object database. In addition, it prunes the unpacked objects that are also found in packs by running `git prune-packed`. +Note that unreachable, packed objects will remain. If this is +not desired, see linkgit:git-repack[1]. + OPTIONS ------- diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt index 3d95749..906d3c7 100644 --- a/Documentation/git-repack.txt +++ b/Documentation/git-repack.txt @@ -8,7 +8,7 @@ git-repack - Pack unpacked objects in a repository SYNOPSIS -------- -'git-repack' [-a] [-d] [-f] [-l] [-n] [-q] [--window=N] [--depth=N] +'git-repack' [-a] [-A] [-d] [-f] [-l] [-n] [-q] [--window=N] [--depth=N] DESCRIPTION ----------- @@ -37,6 +37,18 @@ OPTIONS leaves behind, but `git fsck --full` shows as dangling. +-A:: + Same as `-a`, but any unreachable objects in a previous + pack become loose, unpacked objects, instead of being + left in the old pack. Unreachable objects are never + intentionally added to a pack, even when repacking. + When used with '-d', this option + prevents unreachable objects from being immediately + deleted by way of being left in the old pack and then + removed. Instead, the loose unreachable objects + will be pruned according to normal expiry rules + with the next linkgit:git-gc[1]. + -d:: After packing, if the newly created packs make some existing packs redundant, remove the redundant packs. -- 1.5.4.4 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-09 22:33 ` Junio C Hamano 2008-05-09 23:09 ` [PATCH] Updating documentation to match Brandon Casey's proposed git-repack patch Chris Frey @ 2008-05-10 0:07 ` Jeremy Maitin-Shepard 2008-05-10 0:20 ` Shawn O. Pearce 1 sibling, 1 reply; 50+ messages in thread From: Jeremy Maitin-Shepard @ 2008-05-10 0:07 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, Brandon Casey, Geert Bosch, Jeff King, git Junio C Hamano <gitster@pobox.com> writes: > Nicolas Pitre <nico@cam.org> writes: >> On Fri, 9 May 2008, Brandon Casey wrote: >> >>> Unreferenced objects are sometimes used by other repositories which have >>> this repository listed as an alternate. So it may not be a good idea to >>> make the unreferenced objects inaccessible. >> >> Nah. If this is really the case then you shouldn't be running gc at all >> in the first place. > True. > I think the true motivation behind --keep-unreachable is not about the > shared object store (aka "alternates") but about races between gc and > push (or fetch). Before push (or fetch) finishes and updates refs, the > new objects they create would be dangling _and_ the objects these dangling > objects refer to may be packed but unreferenced. Repacking unreferenced > packed objects was a way to avoid losing them. I feel like the current approach of (not very well) keeping track of which objects are still needed is very messy, not very well defined or based on specific solid principles, and prone to errors and losing objects. Things like git clone -shared can only really be used in extremely specialized setups, or if pruning of unreferenced objects is completely disabled in the source repository, or if specialized scripts are used to do the garbage collection that take into account the references of the "child" repository. It is my impression that even repo.or.cz, while it has some safe guards, does not even completely safely handle garbage collection. Probably it would be very useful to examples of such scripts in contrib. I think that ultimately, some general purpose and reliable solution needs to be found to handle the cases of (1) a repository having its objects referenced by another via info/alternates; (2) a repository with multiple working directories (presumably this should warn/error out unless given a force option/detach head and warn if you try to switch HEAD for some working directory to the same branch as some other working directory). It seems, btw, that a third type of clone, one which merely symlinks the objects directory, would also be useful, once there is a solution to the robustness issue. This would be a case (3) that needs to be handled as well. It seems that clear that ultimately, to handle these three cases, every repository needs to know about every other repository, probably via a symlink to other repository's .git directory. Git gc would then also examine any refs in this directory, making sure to avoid circular references that might result from following the symlinks. It should also probably error out if it finds a symlink that doesn't point to a valid git repository, because such a symlink either refers to a now-deleted repository for which the symlink needs to be cleaned up, or it refers to a repository that was moved and therefore the symlink needs to be updated. Simply ignoring invalid symlinks could result in pruning objects that need to be kept for repositories that have moved. It is extremely cumbersome to have to worry about whether there are other concurrent accesses to the repository when running e.g. git gc. For servers, you may never be able to guarantee that nothing else is accessing the repository concurrently. Here is a possible solution: Each git process creates a log file of the references that it has created. The log file should be named in some way with e.g. the process id and start time of the process, and simply consist of a list of 20-byte sha1 hashes to be considered additional in-use references for the purpose of garbage collection. The log file would be cleaned up when the process exits, and would also be deleted by any instance of git gc that notices a stale log file that doesn't correspond to a running process. To handle shell scripts that need to deal with git-hash-object directly, git hash-object could be passed maybe a file descriptor or filename of a log file to use instead of creating one. Maybe the log file format could be more complicated, and also support paths to e.g. alternate index files to also consider for references. Things would need to be one so that race conditions do not occur, but I think something like this would work. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 0:07 ` git gc & deleted branches Jeremy Maitin-Shepard @ 2008-05-10 0:20 ` Shawn O. Pearce 2008-05-10 0:43 ` Jeremy Maitin-Shepard 2008-05-10 1:21 ` Junio C Hamano 0 siblings, 2 replies; 50+ messages in thread From: Shawn O. Pearce @ 2008-05-10 0:20 UTC (permalink / raw) To: Jeremy Maitin-Shepard Cc: Junio C Hamano, Nicolas Pitre, Brandon Casey, Geert Bosch, Jeff King, git Jeremy Maitin-Shepard <jbms@cmu.edu> wrote: > It is extremely cumbersome to have to worry about whether there are > other concurrent accesses to the repository when running e.g. git gc. > For servers, you may never be able to guarantee that nothing else is > accessing the repository concurrently. Here is a possible solution: > > Each git process creates a log file of the references that it has > created. The log file should be named in some way with e.g. the process > id and start time of the process, and simply consist of a list of > 20-byte sha1 hashes to be considered additional in-use references for > the purpose of garbage collection. I believe we partially considered that in the past and discarded it as far too complex implementation-wise for the benefit it gives us. The current approach of leaving unreachable loose objects around for 2 weeks is good enough. Any Git process that has been running for 2 weeks while still not linking everything it needs into the reachable refs of that repository is already braindamaged and shouldn't be running anymore. If we are dealing with a pack file, those are protected by .keep "lock files" between the time they are created on disk and the time that the git-fetch or git-receive-pack process has finished updating the refs to anchor the pack's contents as reachable. Every once in a while a stale .keep file gets left behind when a process gets killed by the OS, and its damn annoying to clean up. I'd hate to clean up logs from every little git-add or git-commit that aborted in the middle uncleanly. -- Shawn. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 0:20 ` Shawn O. Pearce @ 2008-05-10 0:43 ` Jeremy Maitin-Shepard 2008-05-10 1:21 ` Junio C Hamano 1 sibling, 0 replies; 50+ messages in thread From: Jeremy Maitin-Shepard @ 2008-05-10 0:43 UTC (permalink / raw) To: Shawn O. Pearce Cc: Junio C Hamano, Nicolas Pitre, Brandon Casey, Geert Bosch, Jeff King, git "Shawn O. Pearce" <spearce@spearce.org> writes: > Jeremy Maitin-Shepard <jbms@cmu.edu> wrote: >> It is extremely cumbersome to have to worry about whether there are >> other concurrent accesses to the repository when running e.g. git gc. >> For servers, you may never be able to guarantee that nothing else is >> accessing the repository concurrently. Here is a possible solution: >> >> Each git process creates a log file of the references that it has >> created. The log file should be named in some way with e.g. the process >> id and start time of the process, and simply consist of a list of >> 20-byte sha1 hashes to be considered additional in-use references for >> the purpose of garbage collection. > I believe we partially considered that in the past and discarded it > as far too complex implementation-wise for the benefit it gives us. It doesn't seem all that complex, and I'd say that fundamentally it is the _correct_ way to do things. Being sloppy is always easier in the short run, but then either means the system is permanently broken or results in a lot of "fixing up" work later. I think almost all of the work of handling these log files could be done without impacting a lot of code that calls the relevant APIs that would actually use the log files. I think the biggest impact would be on non-C code, but even for that code, appropriate wrapper could be used to avoid having to make many changes. > The current approach of leaving unreachable loose objects around > for 2 weeks is good enough. Any Git process that has been running > for 2 weeks while still not linking everything it needs into the > reachable refs of that repository is already braindamaged and > shouldn't be running anymore. This sort of reasoning just leads to an inherently unreliable system. Sure, two weeks might seem good enough for nearly all cases, but why _shouldn't_ I be able to leave my editor open for two weeks before typing in my commit message and finishing the commit, or wait for two weeks in the middle of a rebase (it seems that in the new implementation, temporary refs are created basically to do the same thing as the log file I described.) I could easily be typing up my commit message, then switch to something else, and happen not to come back to it for two weeks. Because such a "timeout" based solution isn't really the "correct solution" but will work most of the time, potential problems won't be noticed while testing. Another significant issue is that this timeout means that unreferenced junk has to stay around in the repository for two weeks for no (good) reason. > If we are dealing with a pack file, those are protected by .keep > "lock files" between the time they are created on disk and the > time that the git-fetch or git-receive-pack process has finished > updating the refs to anchor the pack's contents as reachable. > Every once in a while a stale .keep file gets left behind when a > process gets killed by the OS, and its damn annoying to clean up. > I'd hate to clean up logs from every little git-add or git-commit > that aborted in the middle uncleanly. First of all, merely exiting due to an error should not cause log files to be left around. The only thing that should cause log files to be left around is kill -9 or a system crash. Second, by storing the process id and a timestamp of when the log file was created, it is possible to reliably determine if a log file is stale. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 0:20 ` Shawn O. Pearce 2008-05-10 0:43 ` Jeremy Maitin-Shepard @ 2008-05-10 1:21 ` Junio C Hamano 2008-05-10 1:51 ` Jeremy Maitin-Shepard 1 sibling, 1 reply; 50+ messages in thread From: Junio C Hamano @ 2008-05-10 1:21 UTC (permalink / raw) To: Shawn O. Pearce Cc: Jeremy Maitin-Shepard, Nicolas Pitre, Brandon Casey, Geert Bosch, Jeff King, git "Shawn O. Pearce" <spearce@spearce.org> writes: > Jeremy Maitin-Shepard <jbms@cmu.edu> wrote: >> It is extremely cumbersome to have to worry about whether there are >> other concurrent accesses to the repository when running e.g. git gc. >> For servers, you may never be able to guarantee that nothing else is >> accessing the repository concurrently. Here is a possible solution: >> >> Each git process creates a log file of the references that it has >> created. The log file should be named in some way with e.g. the process >> id and start time of the process, and simply consist of a list of >> 20-byte sha1 hashes to be considered additional in-use references for >> the purpose of garbage collection. How would that solve the issue that you should not prune/gc the repository "clone --shared" aka "alternates" borrows from? By the way, I do not think your "git-commit stopped for two weeks due to a long editing session of the commit message" should result in any object lossage, as the new objects are all reachable from the index, and the new tree nor the new commit hasn't been built while you are typing (rather, not typing) the log message. Hmm, a partial commit that uses a temporary index file may lose, come to think of it. Perhaps we should teach reachable.c about the temporary index file as well. I dunno. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 1:21 ` Junio C Hamano @ 2008-05-10 1:51 ` Jeremy Maitin-Shepard 2008-05-10 5:25 ` Jeff King 0 siblings, 1 reply; 50+ messages in thread From: Jeremy Maitin-Shepard @ 2008-05-10 1:51 UTC (permalink / raw) To: Junio C Hamano Cc: Shawn O. Pearce, Nicolas Pitre, Brandon Casey, Geert Bosch, Jeff King, git Junio C Hamano <gitster@pobox.com> writes: > "Shawn O. Pearce" <spearce@spearce.org> writes: >> Jeremy Maitin-Shepard <jbms@cmu.edu> wrote: >>> It is extremely cumbersome to have to worry about whether there are >>> other concurrent accesses to the repository when running e.g. git gc. >>> For servers, you may never be able to guarantee that nothing else is >>> accessing the repository concurrently. Here is a possible solution: >>> >>> Each git process creates a log file of the references that it has >>> created. The log file should be named in some way with e.g. the process >>> id and start time of the process, and simply consist of a list of >>> 20-byte sha1 hashes to be considered additional in-use references for >>> the purpose of garbage collection. > How would that solve the issue that you should not prune/gc the repository > "clone --shared" aka "alternates" borrows from? The log files are only for handling in-progress commands editing the repository. I also describe in first part of the e-mail a possible solution to that issue as well as the issues created by having multiple working directories: When you create a new working directory, you would also create in the original repository a symlink named e.g. orig_repo/.git/peers/<some-arbitrary-name-that-doesn't-matter> that points to the .git directory of the newly created working directory. git clone -shared would likewise create such a link in the original repository. There could be a separate simple command to "destroy" a repository created via clone -shared or via new-work-dir that would simply remove this "peer" symlink from any repositories it shares from, and then rm -rf the target repository. The list of repositories that a given target repository shares from would be discovered using perhaps several different methods, depending on whether it is a new work dir, an actual separate repository, or the new type of "shared" repository I suggested in my original e-mail, namely one that has its own refs but completely shares the object store of the original repository, e.g. via a symlink to the original repository's objects directory In any case, I believe the information to go "upstream" is already available, and we just need to add those "peer" symlinks in order to be able to go "downstream". There could also be a simple git command to move a repository that would take care of updating all of the references that other repositories have to it. Currently it is not possible to write such a command, because the "downstream" links are not stored, but with these added symlinks it would be possible. As I said in my previous e-mail, if git gc finds any broken symlinks (i.e. symlinks that point to invalid repositories), it would error out, because user attention is required to specify whether the symlinks correspond to deleted repositories, or to repositories that have been moved without making the proper updates. > By the way, I do not think your "git-commit stopped for two weeks due to a > long editing session of the commit message" should result in any object > lossage, as the new objects are all reachable from the index, and the new > tree nor the new commit hasn't been built while you are typing (rather, > not typing) the log message. > Hmm, a partial commit that uses a temporary index file may lose, come to > think of it. Perhaps we should teach reachable.c about the temporary > index file as well. I dunno. Well, providing a generic mechanism for telling git about reachable things other than the index and refs is precisely what these log files would do, and also because they would record the process id and a timestamp, stale log files would automatically get cleaned up. If each individual git command has its own special way of trying to keep track of temporary references, it is just going to be more complicated and more error prone. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 1:51 ` Jeremy Maitin-Shepard @ 2008-05-10 5:25 ` Jeff King 2008-05-10 5:36 ` Jeremy Maitin-Shepard 0 siblings, 1 reply; 50+ messages in thread From: Jeff King @ 2008-05-10 5:25 UTC (permalink / raw) To: Jeremy Maitin-Shepard Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre, Brandon Casey, Geert Bosch, git On Fri, May 09, 2008 at 09:51:15PM -0400, Jeremy Maitin-Shepard wrote: > When you create a new working directory, you would also create in the > original repository a symlink named > e.g. orig_repo/.git/peers/<some-arbitrary-name-that-doesn't-matter> that > points to the .git directory of the newly created working directory. That assumes you _can_ write to the original repository. That may or may not be the case, depending on your setup. -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 5:25 ` Jeff King @ 2008-05-10 5:36 ` Jeremy Maitin-Shepard 2008-05-10 9:04 ` Johannes Schindelin 0 siblings, 1 reply; 50+ messages in thread From: Jeremy Maitin-Shepard @ 2008-05-10 5:36 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Shawn O. Pearce, Nicolas Pitre, Brandon Casey, Geert Bosch, git Jeff King <peff@peff.net> writes: > On Fri, May 09, 2008 at 09:51:15PM -0400, Jeremy Maitin-Shepard wrote: >> When you create a new working directory, you would also create in the >> original repository a symlink named >> e.g. orig_repo/.git/peers/<some-arbitrary-name-that-doesn't-matter> that >> points to the .git directory of the newly created working directory. > That assumes you _can_ write to the original repository. That may or may > not be the case, depending on your setup. Well, I suppose in that case it could print a warning or maybe fail without some "force" option. If you can't write to the repository, then I think it is safe to say that it will never know or care about you, so you will fundamentally have a fragile setup. I'd say that except in very special circumstances, you are better off just not sharing it at all. Consider, for instance, that even if the repository that you are sharing form never deletes branches and never does non-fast-forward updates of references, it could very well happen to have, due to some temporary operation, some unreferenced object that happens to be exactly the same object that you want to add to your repository. Because you've listed it in info/alternates, you won't write that object to your own repository, but then the source repository will very likely garbage collect the object at some later point, corrupting your repository. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 5:36 ` Jeremy Maitin-Shepard @ 2008-05-10 9:04 ` Johannes Schindelin 2008-05-10 16:24 ` Jeremy Maitin-Shepard 0 siblings, 1 reply; 50+ messages in thread From: Johannes Schindelin @ 2008-05-10 9:04 UTC (permalink / raw) To: Jeremy Maitin-Shepard Cc: Jeff King, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre, Brandon Casey, Geert Bosch, git Hi, On Sat, 10 May 2008, Jeremy Maitin-Shepard wrote: > Jeff King <peff@peff.net> writes: > > > On Fri, May 09, 2008 at 09:51:15PM -0400, Jeremy Maitin-Shepard wrote: > >> When you create a new working directory, you would also create in the > >> original repository a symlink named > >> e.g. orig_repo/.git/peers/<some-arbitrary-name-that-doesn't-matter> that > >> points to the .git directory of the newly created working directory. > > > That assumes you _can_ write to the original repository. That may or may > > not be the case, depending on your setup. FWIW this argument can be found in the mailing list. It does not have to be told over and over again, right? > Well, I suppose in that case it could print a warning or maybe fail > without some "force" option. If you can't write to the repository, then > I think it is safe to say that it will never know or care about you, so > you will fundamentally have a fragile setup. I'd say that except in > very special circumstances, you are better off just not sharing it at > all. Counterexample kernel.org. Counterexample repo.or.cz. Hth, Dscho ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 9:04 ` Johannes Schindelin @ 2008-05-10 16:24 ` Jeremy Maitin-Shepard 2008-05-11 11:11 ` Johannes Schindelin 0 siblings, 1 reply; 50+ messages in thread From: Jeremy Maitin-Shepard @ 2008-05-10 16:24 UTC (permalink / raw) To: Johannes Schindelin Cc: Jeff King, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre, Brandon Casey, Geert Bosch, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Hi, > On Sat, 10 May 2008, Jeremy Maitin-Shepard wrote: >> Jeff King <peff@peff.net> writes: >> >> > On Fri, May 09, 2008 at 09:51:15PM -0400, Jeremy Maitin-Shepard wrote: >> >> When you create a new working directory, you would also create in the >> >> original repository a symlink named >> >> e.g. orig_repo/.git/peers/<some-arbitrary-name-that-doesn't-matter> that >> >> points to the .git directory of the newly created working directory. >> >> > That assumes you _can_ write to the original repository. That may or may >> > not be the case, depending on your setup. > FWIW this argument can be found in the mailing list. It does not have to > be told over and over again, right? Maybe you can point me at the relevant thread. Fundamentally, though, I'd say objects/info/alternates _cannot_ work reliably without the source repository knowing about the objects that the sharing repositories need. Otherwise, there is no way for it to know not to prune them. The only way for it to have that information in general is to write it in the repository. In a site-specific setting, it may indeed be possible to rely on some site-specific database, but that is not particularly relevant. Currently repository sharing seems to be used in many cases in quite unsafe ways. It may seem unfortunate that doing things the "safe way" is much more of a hassle and doesn't work in certain environments, but I'd say that is just the way things have to be. Perhaps you can point me to an existing thread that addresses this idea, though. >> Well, I suppose in that case it could print a warning or maybe fail >> without some "force" option. If you can't write to the repository, then >> I think it is safe to say that it will never know or care about you, so >> you will fundamentally have a fragile setup. I'd say that except in >> very special circumstances, you are better off just not sharing it at >> all. > Counterexample kernel.org. Counterexample repo.or.cz. repo.or.cz is not a counterexample. It is completely "managed", and could quite easily implement the approach I described. I don't know exactly how kernel.org works, but I imagine likewise some setuid helper script could be provided to write these symlinks. There is the issue that these setuid helper scripts would mean at the very least that if user A can "fork" user B's repository, then to some extent user B can make user A use large amounts of disk space (i.e. exceed his quota or something) by just referencing a bunch of temporary objects that user A happens to have in his repository, and it would take careful examination of the git repository to actually figure out that it is user B's fault. I don't think this would be a significant problem in practice, though. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-10 16:24 ` Jeremy Maitin-Shepard @ 2008-05-11 11:11 ` Johannes Schindelin 2008-05-11 18:39 ` Junio C Hamano 0 siblings, 1 reply; 50+ messages in thread From: Johannes Schindelin @ 2008-05-11 11:11 UTC (permalink / raw) To: Jeremy Maitin-Shepard Cc: Jeff King, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre, Brandon Casey, Geert Bosch, git Hi, On Sat, 10 May 2008, Jeremy Maitin-Shepard wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > On Sat, 10 May 2008, Jeremy Maitin-Shepard wrote: > > >> Jeff King <peff@peff.net> writes: > >> > >> > On Fri, May 09, 2008 at 09:51:15PM -0400, Jeremy Maitin-Shepard > >> > wrote: > > >> >> When you create a new working directory, you would also create in > >> >> the original repository a symlink named e.g. > >> >> orig_repo/.git/peers/<some-arbitrary-name-that-doesn't-matter> > >> >> that points to the .git directory of the newly created working > >> >> directory. > >> > >> > That assumes you _can_ write to the original repository. That may > >> > or may not be the case, depending on your setup. > > > FWIW this argument can be found in the mailing list. It does not have > > to be told over and over again, right? > > Maybe you can point me at the relevant thread. Fundamentally, though, > I'd say objects/info/alternates _cannot_ work reliably without the > source repository knowing about the objects that the sharing > repositories need. Otherwise, there is no way for it to know not to > prune them. The only way for it to have that information in general is > to write it in the repository. In a site-specific setting, it may > indeed be possible to rely on some site-specific database, but that is > not particularly relevant. > > Currently repository sharing seems to be used in many cases in quite > unsafe ways. It may seem unfortunate that doing things the "safe way" > is much more of a hassle and doesn't work in certain environments, but > I'd say that is just the way things have to be. > > Perhaps you can point me to an existing thread that addresses this idea, > though. Unfortunately, a quick search did not turn up anything useful. Maybe you try your luck yourself... > >> Well, I suppose in that case it could print a warning or maybe fail > >> without some "force" option. If you can't write to the repository, > >> then I think it is safe to say that it will never know or care about > >> you, so you will fundamentally have a fragile setup. I'd say that > >> except in very special circumstances, you are better off just not > >> sharing it at all. > > > Counterexample kernel.org. Counterexample repo.or.cz. > > repo.or.cz is not a counterexample. It is completely "managed", and > could quite easily implement the approach I described. Half true... you said "if you can't write to the repository..." and on repo.or.cz, the first part is true, the second part not. > There is the issue that these setuid helper scripts would mean at the > very least that if user A can "fork" user B's repository, then to some > extent user B can make user A use large amounts of disk space (i.e. > exceed his quota or something) by just referencing a bunch of temporary > objects that user A happens to have in his repository, and it would take > careful examination of the git repository to actually figure out that it > is user B's fault. I don't think this would be a significant problem in > practice, though. Well, I think that the setuid helper script would open a whole bunch of other issues. I think that the shared repository problem is rather a semantic one, i.e. it is only solvable between the owners of the repository by good-ole talking, not something that can be solved by the tool (Git). Ciao, Dscho ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-11 11:11 ` Johannes Schindelin @ 2008-05-11 18:39 ` Junio C Hamano 0 siblings, 0 replies; 50+ messages in thread From: Junio C Hamano @ 2008-05-11 18:39 UTC (permalink / raw) To: Johannes Schindelin Cc: Jeremy Maitin-Shepard, Jeff King, Shawn O. Pearce, Nicolas Pitre, Brandon Casey, Geert Bosch, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Well, I think that the setuid helper script would open a whole bunch of > other issues. > > I think that the shared repository problem is rather a semantic one, i.e. > it is only solvable between the owners of the repository by good-ole > talking, not something that can be solved by the tool (Git). I very strongly agree with you that the suid helper would be the last ditch thing we would want to avoid doing unless there is no other way. I also agree with you that the owners of the repository need to be talking. But I think the tool _could_ help them do their talking. It is conceivable that just like you can explicitly allow selected others to push into your own repository, you would want to explicitly allow some others to ask you to keep objects their repositories borrow from you. Originally, I wrote "allow others to borrow from you", but that is very ill defined. If they can read from your repository they can unilaterally borrow from you without having any write permission to your repository that is needed to install backpointers. I am not fundamentally opposed to a backpointer that point at borrowers so that the lender can protect the objects that are pointed by them. However, there are two technical issues in the solution of pointing at the borrower's .git/refs with a symlink from the repository that is borrowed from, as I pointed out when this came up last time. - There are systems without symbolic links. - A symref (e.g. refs/remotes/origin/HEAD of borrower that points at refs/remotes/origin/master of borrower) is relative to borrower's repository. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 21:01 ` Jeff King 2008-05-08 21:15 ` Nicolas Pitre @ 2008-05-08 21:33 ` Guido Ostkamp 1 sibling, 0 replies; 50+ messages in thread From: Guido Ostkamp @ 2008-05-08 21:33 UTC (permalink / raw) To: Jeff King; +Cc: Brandon Casey, git On Thu, 8 May 2008, Jeff King wrote: > According to that commit message, prune is now a no-op. However, it > looks like it is still used for trigger a "repack -a" rather than > "repack -A". I tried to look at this but found this option '-A' to be undocumented in the manpage (git/Documentation/git-repack.txt and what is generated from it). Regards Guido ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 20:07 ` Brandon Casey 2008-05-08 20:52 ` Guido Ostkamp @ 2008-05-08 20:56 ` Jeff King 1 sibling, 0 replies; 50+ messages in thread From: Jeff King @ 2008-05-08 20:56 UTC (permalink / raw) To: Brandon Casey; +Cc: Guido Ostkamp, git On Thu, May 08, 2008 at 03:07:53PM -0500, Brandon Casey wrote: > btw, this is _really_ a non-issue. It seems to keep coming up on the list. > > Just know that each one of the config options that you set to zero, including > the one Jeff suggested setting to "now", is a safety mechanism that is there > to ensure that you never ever lose data and that mistakes are recoverable. Yes, I want to chime in since I have been giving advice in such threads: Please don't construe my help as any sort of endorsement of this behavior. Git tries hard not to lose your data, and it is almost always a bad idea to try to override these safety checks unless you really know what you are doing. And even then, try to consider balancing a bit of freed disk space (and generally _no_ performance gain, because git is very good about not looking at objects that aren't necessary to the current operation) versus thinking "oops, I wish I still had that data" in a few days. I can think offhand of only one time when it was truly useful for me to prune aggressively, and it was a very special case: a pathologically large repo for which I was doing a one-shot conversion from another format (and I wanted to prune failed attempts). -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: git gc & deleted branches 2008-05-08 18:55 ` Guido Ostkamp 2008-05-08 20:07 ` Brandon Casey @ 2008-05-08 20:51 ` Jeff King 1 sibling, 0 replies; 50+ messages in thread From: Jeff King @ 2008-05-08 20:51 UTC (permalink / raw) To: Guido Ostkamp; +Cc: git On Thu, May 08, 2008 at 08:55:50PM +0200, Guido Ostkamp wrote: > Jeff, I tried it, but it has no effect (see below). There is only the > master branch left, and only one commit therein, still it uses the space > former occupied by the branch. I'm using git version 1.5.5.1.147.g867f. It worked fine for me; it's possible, as Brandon mentioned, that it is in a pack already, and only a "repack -a" would get rid of it. FWIW, my steps were: # ...same as you for repo and branch creation git branch -D test git config gc.reflogexpire 0 git config gc.reflogexpireunreachable 0 git config gc.pruneexpire now git gc du -s .git ;# shows 10396 -Peff ^ permalink raw reply [flat|nested] 50+ messages in thread
end of thread, other threads:[~2008-05-11 18:40 UTC | newest] Thread overview: 50+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-05-08 17:45 git gc & deleted branches Guido Ostkamp 2008-05-08 18:39 ` Jeff King 2008-05-08 18:55 ` Guido Ostkamp 2008-05-08 20:07 ` Brandon Casey 2008-05-08 20:52 ` Guido Ostkamp 2008-05-08 21:01 ` Jeff King 2008-05-08 21:15 ` Nicolas Pitre 2008-05-08 21:17 ` Jeff King 2008-05-08 21:23 ` Brandon Casey 2008-05-08 21:31 ` Jeff King 2008-05-08 21:40 ` Brandon Casey 2008-05-08 21:44 ` Jeff King 2008-05-08 21:53 ` Brandon Casey 2008-05-08 22:48 ` Jeff King 2008-05-09 1:41 ` Brandon Casey 2008-05-09 3:21 ` Junio C Hamano [not found] ` <ee63ef30805082105w7f04a2d1y65a4618aeb787cac@mail.gmail.com> [not found] ` <7v1w4bb291.fsf@gitster.siamese.dyndns.org> 2008-05-10 3:32 ` Brandon Casey 2008-05-10 4:15 ` Brandon Casey 2008-05-10 4:01 ` [PATCH 0/3] leave unreferenced objects unpacked drafnel 2008-05-10 4:01 ` [PATCH 1/3] repack: modify behavior of -A option to " drafnel 2008-05-10 6:03 ` Jeff King 2008-05-11 1:10 ` Nicolas Pitre 2008-05-11 1:23 ` Junio C Hamano 2008-05-11 4:16 ` Brandon Casey 2008-05-11 4:51 ` Brandon Casey 2008-05-10 4:01 ` [PATCH 2/3] git-gc: always use -A when manually repacking drafnel 2008-05-10 4:01 ` [PATCH 3/3] builtin-gc.c: deprecate --prune, it now really has no effect drafnel 2008-05-09 4:19 ` git gc & deleted branches Jeff King 2008-05-09 15:00 ` Geert Bosch 2008-05-09 15:14 ` Brandon Casey 2008-05-09 15:53 ` Jeff King 2008-05-09 15:56 ` Brandon Casey 2008-05-09 16:12 ` Nicolas Pitre 2008-05-09 16:54 ` Brandon Casey 2008-05-09 22:33 ` Junio C Hamano 2008-05-09 23:09 ` [PATCH] Updating documentation to match Brandon Casey's proposed git-repack patch Chris Frey 2008-05-10 0:07 ` git gc & deleted branches Jeremy Maitin-Shepard 2008-05-10 0:20 ` Shawn O. Pearce 2008-05-10 0:43 ` Jeremy Maitin-Shepard 2008-05-10 1:21 ` Junio C Hamano 2008-05-10 1:51 ` Jeremy Maitin-Shepard 2008-05-10 5:25 ` Jeff King 2008-05-10 5:36 ` Jeremy Maitin-Shepard 2008-05-10 9:04 ` Johannes Schindelin 2008-05-10 16:24 ` Jeremy Maitin-Shepard 2008-05-11 11:11 ` Johannes Schindelin 2008-05-11 18:39 ` Junio C Hamano 2008-05-08 21:33 ` Guido Ostkamp 2008-05-08 20:56 ` Jeff King 2008-05-08 20:51 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).