* Repacking a repository uses up all available disk space
From: Konstantin Ryabitsev @ 2016-06-12 21:25 UTC
To: git

Hello:

I have a problematic repository that:

- Takes up 9GB on disk
- Passes 'git fsck --full' with no errors
- When cloned with --mirror, takes up 38M on the target system
- When attempting to repack, creates millions of files and eventually
  eats up all available disk space

Repacking the result of 'git clone --mirror' shows no problem, so it's
got to be something really weird with that particular instance of the
repository.

If anyone is interested in poking at this particular problem to figure
out what causes the repack process to eat up all available disk space,
you can find the tarball of the problematic repository here:

http://mricon.com/misc/src.git.tar.xz (warning: 6.6GB)

You can clone the non-problematic version of this repository from
git://codeaurora.org/quic/chrome4sdp/breakpad/breakpad/src.git

Best,
-- 
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec

* Re: Repacking a repository uses up all available disk space
From: Jeff King @ 2016-06-12 21:38 UTC
To: Konstantin Ryabitsev; +Cc: git

On Sun, Jun 12, 2016 at 05:25:14PM -0400, Konstantin Ryabitsev wrote:

> Hello:
>
> I have a problematic repository that:
>
> - Takes up 9GB on disk
> - Passes 'git fsck --full' with no errors
> - When cloned with --mirror, takes up 38M on the target system

Cloning will only copy the objects that are reachable from the refs. So
presumably the other 8.9GB is either reachable from reflogs, or not
reachable at all (due to rewinding history or deleting branches).

> - When attempting to repack, creates millions of files and eventually
>   eats up all available disk space

That means these objects fall into the unreachable category. Git will
prune unreachable loose objects after a grace period based on the
filesystem mtime of the objects; the default is 2 weeks.

For unreachable packed objects, their mtime is jumbled in with the rest
of the objects in the packfile. So Git's strategy is to "eject" such
objects from the packfiles into individual loose objects, and let them
"age out" of the grace period individually.

Generally this works just fine, but there are corner cases where you
might have a very large number of such objects, and the loose storage is
much more expensive than the packed (e.g., because each object is stored
individually, not as a delta).

It sounds like this is the case you're running into.

The solution is to lower the grace period time, with something like:

  git gc --prune=5.minutes.ago

or even:

  git gc --prune=now

That will prune the unreachable objects immediately (and the packfile
ejector is smart enough to skip ejecting any file that would just get
deleted immediately anyway).

-Peff

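The ejection described above can be reproduced on a small scale. The following is only a sketch, assuming a throwaway repository created with mktemp; the identity, commit message and blob content are made up for illustration. It packs a blob that no ref points to, lets "repack -A -d" loosen it, and then prunes with no grace period.

```shell
#!/bin/sh
# Sketch only: watch an unreachable packed object get ejected as a loose
# file, then pruned. Repository path and blob content are illustrative.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1

# One reachable commit, so the repack has something to keep.
git -c user.name=demo -c user.email=demo@example.com \
	commit -q --allow-empty -m "reachable history"

# Create a blob no ref points to, pack it, and drop the loose copy.
oid=$(echo unreachable-packed | git hash-object -w --stdin)
echo "$oid" | git pack-objects -q .git/objects/pack/pack >/dev/null
git prune-packed

objpath=.git/objects/$(echo "$oid" | sed 's,..,&/,')

# -A -d rewrites the packs and ejects the unreachable blob as a loose file.
git repack -q -A -d
if test -f "$objpath"; then loosened=yes; else loosened=no; fi

# No grace period: the loose, unreachable blob is deleted as well.
git gc -q --prune=now
if git cat-file -e "$oid" 2>/dev/null; then pruned=no; else pruned=yes; fi

echo "loosened by repack -A -d: $loosened"
echo "gone after gc --prune=now: $pruned"
```

With millions of such objects, the loosening step is where the disk usage in the original report would come from, since every object becomes its own file.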
* Re: Repacking a repository uses up all available disk space
From: Konstantin Ryabitsev @ 2016-06-12 21:54 UTC
To: Jeff King; +Cc: git

On Sun, Jun 12, 2016 at 05:38:04PM -0400, Jeff King wrote:
> > - When attempting to repack, creates millions of files and eventually
> >   eats up all available disk space
>
> That means these objects fall into the unreachable category. Git will
> prune unreachable loose objects after a grace period based on the
> filesystem mtime of the objects; the default is 2 weeks.
>
> For unreachable packed objects, their mtime is jumbled in with the rest
> of the objects in the packfile. So Git's strategy is to "eject" such
> objects from the packfiles into individual loose objects, and let them
> "age out" of the grace period individually.
>
> Generally this works just fine, but there are corner cases where you
> might have a very large number of such objects, and the loose storage is
> much more expensive than the packed (e.g., because each object is stored
> individually, not as a delta).
>
> It sounds like this is the case you're running into.
>
> The solution is to lower the grace period time, with something like:
>
>   git gc --prune=5.minutes.ago
>
> or even:
>
>   git gc --prune=now

You are correct, this solves the problem, however I'm curious. The usual
maintenance for these repositories is a regular run of:

- git fsck --full
- git repack -Adl -b --pack-kept-objects
- git pack-refs --all
- git prune

The reason it's split into repack + prune instead of just gc is because
we use alternates to save on disk space and try not to prune repos that
are used as alternates by other repos in order to avoid potential
corruption.

Am I not doing something that needs to be done in order to avoid the
same problem?

Thanks for your help.

Regards,
-- 
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec

* Re: Repacking a repository uses up all available disk space
From: Jeff King @ 2016-06-12 22:13 UTC
To: Konstantin Ryabitsev; +Cc: git

On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:

> >   git gc --prune=now
>
> You are correct, this solves the problem, however I'm curious. The usual
> maintenance for these repositories is a regular run of:
>
> - git fsck --full
> - git repack -Adl -b --pack-kept-objects
> - git pack-refs --all
> - git prune
>
> The reason it's split into repack + prune instead of just gc is because
> we use alternates to save on disk space and try not to prune repos that
> are used as alternates by other repos in order to avoid potential
> corruption.
>
> Am I not doing something that needs to be done in order to avoid the
> same problem?

Your approach makes sense; we do the same thing at GitHub for the same
reasons[1]. The main thing you are missing that gc will do is that it
knows the prune-time it is going to feed to git-prune[2], and passes
that along to repack. That's what enables the "don't bother ejecting
these, because I'm about to delete them" optimization.

That option is not documented, because it was always assumed to be an
internal thing to git-gc, but it is:

  git repack ... --unpack-unreachable=5.minutes.ago

or whatever.

-Peff

[1] We don't run the fsck at the front, though, because it's really
    expensive. I'm not sure it buys you much, either. The repack will do
    a full walk of the graph, so it gets you a connectivity check, as
    well as a full content check of the commits and trees. The blobs are
    copied as-is from the old pack, but there is a checksum on the pack
    data (to catch any bit flips by the disk storage).

    So the only thing the fsck is getting you is that it fully
    reconstructs the deltas for each blob and checks their sha1. That's
    more robust than a checksum, but it's a lot more expensive.

[2] It's unclear to me if you're passing any options to git-prune, but
    you may want to pass "--expire" with a short grace period. Without
    any options it prunes every unreachable thing, which can lead to
    races if the repository is actively being used.

    At GitHub we actually have a patch to `repack` that keeps all
    objects, reachable or not, in the pack, and use it for all of our
    automated maintenance. Since we don't drop objects at all, we can't
    ever have such a race. Aside from some pathological cases, it wastes
    much less space than you'd expect. We turn the flag off for special
    cases (e.g., somebody has rewound history and wants to expunge a
    sensitive object).

    I'm happy to share the "keep everything" patch if you're interested.

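The advice above (feed repack the same cut-off that prune will use) could be wired into a split maintenance run roughly as follows. This is a sketch under stated assumptions, not anybody's production script: maintain_repo is a made-up helper name, the repack flags are the ones quoted in this thread, and the demo repository is a throwaway created with mktemp.

```shell
#!/bin/sh
# Sketch only: a split repack + prune run that shares one grace period.
# maintain_repo is a hypothetical helper, not part of git.
maintain_repo () {
	target=$1
	expire=$2	# an approxidate such as "2.weeks.ago" or "now"

	# Tell repack which unreachable objects prune would delete anyway,
	# so it does not write them out as loose files first.
	git -C "$target" repack -q -A -d -l -b --pack-kept-objects \
		--unpack-unreachable="$expire" &&
	git -C "$target" pack-refs --all &&
	# Same cut-off here, instead of a bare "git prune".
	git -C "$target" prune --expire="$expire"
}

# Demo in a throwaway repository (names and content are illustrative).
scratch=$(mktemp -d)
git init -q "$scratch"
git -C "$scratch" -c user.name=demo -c user.email=demo@example.com \
	commit -q --allow-empty -m "reachable history"

# An unreachable blob that exists only inside a pack.
oid=$(echo unreachable-packed | git -C "$scratch" hash-object -w --stdin)
echo "$oid" | git -C "$scratch" pack-objects -q .git/objects/pack/pack >/dev/null
git -C "$scratch" prune-packed

# "now" keeps the demo deterministic; a real job would use a grace period.
maintain_repo "$scratch" now
status=$?
```

Passing "now" is only for the demonstration. As footnote [2] points out, pruning without a grace period can race with concurrent use of the repository, so a scheduled job would more plausibly pass something like "2.weeks.ago" to both steps.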
* Re: Repacking a repository uses up all available disk space
From: Duy Nguyen @ 2016-06-13 0:24 UTC
To: Jeff King; +Cc: Konstantin Ryabitsev, Git Mailing List

On Mon, Jun 13, 2016 at 5:13 AM, Jeff King <peff@peff.net> wrote:
> On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:
>
>> >   git gc --prune=now
>>
>> You are correct, this solves the problem, however I'm curious. The usual
>> maintenance for these repositories is a regular run of:
>>
>> - git fsck --full
>> - git repack -Adl -b --pack-kept-objects
>> - git pack-refs --all
>> - git prune
>>
>> The reason it's split into repack + prune instead of just gc is because
>> we use alternates to save on disk space and try not to prune repos that
>> are used as alternates by other repos in order to avoid potential
>> corruption.

Isn't this what extensions.preciousObjects is for? It looks like prune
just refuses to run in precious objects mode though, and repack is
skipped by gc, but if that repack command works, maybe we should do
something like that in git-gc?

BTW Jeff, I think we need more documentation for
extensions.preciousObjects. It's only documented in technical/ which
is practically invisible to all users. Maybe
include::repository-version.txt in config.txt, or somewhere close to
alternates?

> [2] It's unclear to me if you're passing any options to git-prune, but
>     you may want to pass "--expire" with a short grace period. Without
>     any options it prunes every unreachable thing, which can lead to
>     races if the repository is actively being used.
>
>     At GitHub we actually have a patch to `repack` that keeps all
>     objects, reachable or not, in the pack, and use it for all of our
>     automated maintenance. Since we don't drop objects at all, we can't
>     ever have such a race. Aside from some pathological cases, it wastes
>     much less space than you'd expect. We turn the flag off for special
>     cases (e.g., somebody has rewound history and wants to expunge a
>     sensitive object).
>
>     I'm happy to share the "keep everything" patch if you're interested.

Ah ok, I guess this is why we just skip repack. I guess '-Adl -b
--pack-kept-objects' is not enough then.
-- 
Duy

* Re: Repacking a repository uses up all available disk space
From: Jeff King @ 2016-06-13 4:58 UTC
To: Duy Nguyen; +Cc: Konstantin Ryabitsev, Git Mailing List

On Mon, Jun 13, 2016 at 07:24:51AM +0700, Duy Nguyen wrote:

> >> - git fsck --full
> >> - git repack -Adl -b --pack-kept-objects
> >> - git pack-refs --all
> >> - git prune
> >>
> >> The reason it's split into repack + prune instead of just gc is because
> >> we use alternates to save on disk space and try not to prune repos that
> >> are used as alternates by other repos in order to avoid potential
> >> corruption.
>
> Isn't this what extensions.preciousObjects is for? It looks like prune
> just refuses to run in precious objects mode though, and repack is
> skipped by gc, but if that repack command works, maybe we should do
> something like that in git-gc?

Sort of. preciousObjects is a fail-safe so that you do not ever
accidentally run an object-deleting operation where you shouldn't (e.g.,
in the shared repository used by others as an alternate). So the
important step there is that before running "repack", you would want to
make sure you have taken into account the reachability of anybody
sharing from you.

So you could do something like (in your shared repository):

  git config core.repositoryFormatVersion 1
  git config extensions.preciousObjects true

  # this will fail, because it's dangerous!
  git gc

  # but we can do it safely if we take into account the other repos
  for repo in $(somehow_get_list_of_shared_repos); do
	git fetch $repo +refs/*:refs/shared/$repo/*
  done
  git config extensions.preciousObjects false
  git gc
  git config extensions.preciousObjects true

So it really is orthogonal to running the various gc commands yourself;
it's just here to prevent you shooting yourself in the foot.

It may still be useful in such a case to split up the commands in your
own script, though. In my case, you'll note that the commands above are
racy (what happens if somebody pushes a reference to a shared object
between your fetch and the gc invocation?). So we use a custom "repack
-k" to get around that (it just keeps everything).

You _could_ have gc automatically switch to "-k" in a preciousObjects
repository. That's at least safe. But note that it doesn't really solve
all of the problems (you do still want to have ref tips from the leaf
repositories, because it affects things like bitmaps, and packing
order).

> BTW Jeff, I think we need more documentation for
> extensions.preciousObjects. It's only documented in technical/ which
> is practically invisible to all users. Maybe
> include::repository-version.txt in config.txt, or somewhere close to
> alternates?

I'm a little hesitant to document it for end users because it's still
pretty experimental. In fact, even we are not using it at GitHub
currently. We don't have a big problem with "oops, I accidentally ran
something destructive in the shared repository", because nothing except
the maintenance script ever even goes into the shared repository.

The reason I introduced it in the first place is that I was
experimenting with the idea of actually symlinking "objects/" in the
leaf repos into the shared repository. That eliminates the object
writing in the "fetch" step above, which can be a bottleneck in some
cases (not just the I/O, but the shared repo ends up having a _lot_ of
refs, and fetch can be pretty slow).

But in that case, anything that deletes an object in one of the leaf
repos is very dangerous, as it has no idea that its object store is
shared with other leaf repos. So I really wanted a fail safe so that
running "git gc" wasn't catastrophic.

I still think that's a viable approach, but my experiments got
side-tracked and I never produced anything worth looking at. So until
there's something end users can actually make use of, I'm hesitant to
push that stuff into the regular-user documentation. Anybody who is
playing with it at this point probably _should_ be familiar with what's
in Documentation/technical.

-Peff

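The fail-safe role of the extension can be checked directly. The sketch below assumes a throwaway repository created with mktemp and only exercises the two refusals that this thread and the patches mention: prune refusing to run, and repack refusing to delete packs in a precious-objects repository. It does not attempt the multi-repository fetch step from the recipe above.

```shell
#!/bin/sh
# Sketch only: mark a throwaway repository as holding precious objects and
# confirm that object-deleting commands refuse to run.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1

# Same two settings as in the recipe above.
git config core.repositoryFormatVersion 1
git config extensions.preciousObjects true

# Both commands are expected to die instead of deleting anything.
if git repack -a -d >/dev/null 2>&1; then repack_refused=no; else repack_refused=yes; fi
if git prune >/dev/null 2>&1; then prune_refused=no; else prune_refused=yes; fi

echo "repack -a -d refused: $repack_refused"
echo "prune refused: $prune_refused"
```

A maintenance script that really does need to delete objects would have to turn the extension off around the destructive step, as the recipe does, and that window is exactly where the race discussed above lives.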
* Re: Repacking a repository uses up all available disk space
From: Nasser Grainawi @ 2016-06-13 1:43 UTC
To: Jeff King; +Cc: Konstantin Ryabitsev, git

On Jun 12, 2016, at 4:13 PM, Jeff King <peff@peff.net> wrote:
>
>    At GitHub we actually have a patch to `repack` that keeps all
>    objects, reachable or not, in the pack, and use it for all of our
>    automated maintenance. Since we don't drop objects at all, we can't
>    ever have such a race. Aside from some pathological cases, it wastes
>    much less space than you'd expect. We turn the flag off for special
>    cases (e.g., somebody has rewound history and wants to expunge a
>    sensitive object).
>
>    I'm happy to share the "keep everything" patch if you're interested.

We have the same kind of patch actually (for the same reason), but back
on the shell implementation of repack. It'd be great if you could share
your modern version.

Nasser

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

* [PATCH 0/3] repack --keep-unreachable
From: Jeff King @ 2016-06-13 4:33 UTC
To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

On Sun, Jun 12, 2016 at 07:43:27PM -0600, Nasser Grainawi wrote:

> On Jun 12, 2016, at 4:13 PM, Jeff King <peff@peff.net> wrote:
> >
> >    At GitHub we actually have a patch to `repack` that keeps all
> >    objects, reachable or not, in the pack, and use it for all of our
> >    automated maintenance. Since we don't drop objects at all, we can't
> >    ever have such a race. Aside from some pathological cases, it wastes
> >    much less space than you'd expect. We turn the flag off for special
> >    cases (e.g., somebody has rewound history and wants to expunge a
> >    sensitive object).
> >
> >    I'm happy to share the "keep everything" patch if you're interested.
>
> We have the same kind of patch actually (for the same reason), but
> back on the shell implementation of repack. It'd be great if you could
> share your modern version.

Here is a cleaned-up version of what we run at GitHub (so this is a
concept that has been exercised for a few years in production, but I had
to forward port the patches a bit; I _probably_ didn't introduce any
bugs. :) ).

The heavy lifting is done by the existing --keep-unreachable option to
pack-objects, which Junio added a long time ago[1] in support of a safer
"gc --auto". But it doesn't look like we ever documented or exercised
it, and "gc --auto" ended up using the loosen-unreachable strategy
instead. In fact, the rest of that series seems to have been dropped; I
couldn't find any discussion on the list explaining it, or why this one
patch was kept (so I don't think anybody upstream has ever used this
code, but as I said, we have been doing so for a few years, so I feel
confident in it).

  [1/3]: repack: document --unpack-unreachable option
  [2/3]: repack: add --keep-unreachable option
  [3/3]: repack: extend --keep-unreachable to loose objects

-Peff

[1] http://article.gmane.org/gmane.comp.version-control.git/58413

* [PATCH 1/3] repack: document --unpack-unreachable option
From: Jeff King @ 2016-06-13 4:33 UTC
To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

This was added back in 7e52f56 (gc: do not explode objects which will be
immediately pruned, 2012-04-07), but not documented at the time, since
it was an internal detail between git-gc and git-repack. However, as
people with complicated setups may want to effectively reimplement the
steps of git-gc themselves, it is nice for us to document these
interfaces.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-repack.txt | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index b9c02ce..cde7b44 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -128,6 +128,12 @@ other objects in that pack they already have locally.
 	with `-b` or `repack.writeBitmaps`, as it ensures that the bitmapped
 	packfile has the necessary objects.
 
+--unpack-unreachable=<when>::
+	When loosening unreachable objects, do not bother loosening any
+	objects older than `<when>`. This can be used to optimize out
+	the write of any objects that would be immediately pruned by
+	a follow-up `git prune`.
+
 Configuration
 -------------
 
-- 
2.9.0.rc2.149.gd580ccd

* [PATCH 2/3] repack: add --keep-unreachable option
From: Jeff King @ 2016-06-13 4:36 UTC
To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

The usual way to do a full repack (and what is done by git-gc) is to run
"repack -Ad --unpack-unreachable=<when>", which will loosen any
unreachable objects newer than "<when>", and drop any older ones.

This is a safer alternative to "repack -ad", because "<when>" becomes a
grace period during which we will not drop any new objects that are
about to be referenced. However, it isn't perfectly safe. It's always
possible that a process is about to reference an old object. Even if
that process were to take care to update the timestamp on the object,
there is no atomicity with a simultaneously running "repack" process.

So while unlikely, there is a small race wherein we may drop an object
that is in the process of being referenced. If you do automated
repacking on a large number of active repositories, you may hit it
eventually, and the result is a corrupted repository.

It would be nice to fix that race in the long run, but it's complicated.
In the meantime, there is a much simpler strategy for automated
repository maintenance: do not drop objects at all. We already have a
"--keep-unreachable" option in pack-objects; we just need to plumb it
through from git-repack.

Note that this _isn't_ plumbed through from git-gc, so at this point
it's strictly a tool for people doing their own advanced repository
maintenance strategy.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-repack.txt         |  6 ++++++
 builtin/repack.c                     |  9 +++++++++
 t/t7701-repack-unpack-unreachable.sh | 15 +++++++++++++++
 3 files changed, 30 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index cde7b44..68702ea 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -134,6 +134,12 @@ other objects in that pack they already have locally.
 	the write of any objects that would be immediately pruned by
 	a follow-up `git prune`.
 
+-k::
+--keep-unreachable::
+	When used with `-ad`, any unreachable objects from existing
+	packs will be appended to the end of the packfile instead of
+	being removed.
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 858db38..573e66c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -146,6 +146,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
+	int keep_unreachable = 0;
 	const char *window = NULL, *window_memory = NULL;
 	const char *depth = NULL;
 	const char *max_pack_size = NULL;
@@ -175,6 +176,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("write bitmap index")),
 		OPT_STRING(0, "unpack-unreachable", &unpack_unreachable, N_("approxidate"),
 				N_("with -A, do not loosen objects older than this")),
+		OPT_BOOL('k', "keep-unreachable", &keep_unreachable,
+				N_("with -a, repack unreachable objects")),
 		OPT_STRING(0, "window", &window, N_("n"),
 				N_("size of the window used for delta compression")),
 		OPT_STRING(0, "window-memory", &window_memory, N_("bytes"),
@@ -196,6 +199,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant && repository_format_precious_objects)
 		die(_("cannot delete packs in a precious-objects repo"));
 
+	if (keep_unreachable &&
+	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
+		die(_("--keep-unreachable and -A are incompatible"));
+
 	if (pack_kept_objects < 0)
 		pack_kept_objects = write_bitmaps;
 
@@ -239,6 +246,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			} else if (pack_everything & LOOSEN_UNREACHABLE) {
 				argv_array_push(&cmd.args,
 						"--unpack-unreachable");
+			} else if (keep_unreachable) {
+				argv_array_push(&cmd.args, "--keep-unreachable");
 			} else {
 				argv_array_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
 			}
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index b66e383..f13df43 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -122,4 +122,19 @@ test_expect_success 'keep packed objects found only in index' '
 	git cat-file blob :file
 '
 
+test_expect_success 'repack -k keeps unreachable packed objects' '
+	# create packed-but-unreachable object
+	sha1=$(echo unreachable-packed | git hash-object -w --stdin) &&
+	pack=$(echo $sha1 | git pack-objects .git/objects/pack/pack) &&
+	git prune-packed &&
+
+	# -k should keep it
+	git repack -adk &&
+	git cat-file -p $sha1 &&
+
+	# and double check that without -k it would have been removed
+	git repack -ad &&
+	test_must_fail git cat-file -p $sha1
+'
+
 test_done
-- 
2.9.0.rc2.149.gd580ccd

* [PATCH 3/3] repack: extend --keep-unreachable to loose objects
From: Jeff King @ 2016-06-13 4:38 UTC
To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

If you use "repack -adk" currently, we will pack all objects that are
already packed into the new pack, and then drop the old packs. However,
loose unreachable objects will be left as-is. In theory these are meant
to expire eventually with "git prune". But if you are using "repack -k",
you probably want to keep things forever and therefore do not run "git
prune" at all. Meaning those loose objects may build up over time and
end up fooling any object-count heuristics (such as the one done by "gc
--auto", though since git-gc does not support "repack -k", this really
applies to whatever custom scripts people might have driving "repack
-k").

With this patch, we instead stuff any loose unreachable objects into the
pack along with the already-packed unreachable objects. This may seem
wasteful, but it is really no more so than using "repack -k" in the
first place. We are at a slight disadvantage, in that we have no useful
ordering for the result, or names to hand to the delta code. However,
this is again no worse than what "repack -k" is already doing for the
packed objects. The packing of these objects doesn't matter much because
they should not be accessed frequently (unless they actually _do_ become
referenced, but then they would get moved to a different part of the
packfile during the next repack).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-repack.txt         |  3 ++-
 builtin/pack-objects.c               | 31 +++++++++++++++++++++++++++++++
 builtin/repack.c                     |  1 +
 t/t7701-repack-unpack-unreachable.sh | 13 +++++++++++++
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 68702ea..b58b6b5 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -138,7 +138,8 @@ other objects in that pack they already have locally.
 --keep-unreachable::
 	When used with `-ad`, any unreachable objects from existing
 	packs will be appended to the end of the packfile instead of
-	being removed.
+	being removed. In addition, any unreachable loose objects will
+	be packed (and their loose counterparts removed).
 
 Configuration
 -------------
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8f5e358..a2f8cfd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -44,6 +44,7 @@ static int non_empty;
 static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static unsigned long unpack_unreachable_expiration;
+static int pack_loose_unreachable;
 static int local;
 static int incremental;
 static int ignore_packed_keep;
@@ -2378,6 +2379,32 @@ static void add_objects_in_unpacked_packs(struct rev_info *revs)
 	free(in_pack.array);
 }
 
+static int add_loose_object(const unsigned char *sha1, const char *path,
+			    void *data)
+{
+	enum object_type type = sha1_object_info(sha1, NULL);
+
+	if (type < 0) {
+		warning("loose object at %s could not be examined", path);
+		return 0;
+	}
+
+	add_object_entry(sha1, type, "", 0);
+	return 0;
+}
+
+/*
+ * We actually don't even have to worry about reachability here.
+ * add_object_entry will weed out duplicates, so we just add every
+ * loose object we find.
+ */
+static void add_unreachable_loose_objects(void)
+{
+	for_each_loose_file_in_objdir(get_object_directory(),
+				      add_loose_object,
+				      NULL, NULL, NULL);
+}
+
 static int has_sha1_pack_kept_or_nonlocal(const unsigned char *sha1)
 {
 	static struct packed_git *last_found = (void *)1;
@@ -2547,6 +2574,8 @@ static void get_object_list(int ac, const char **av)
 
 	if (keep_unreachable)
 		add_objects_in_unpacked_packs(&revs);
+	if (pack_loose_unreachable)
+		add_unreachable_loose_objects();
 	if (unpack_unreachable)
 		loosen_unused_packed_objects(&revs);
 
@@ -2647,6 +2676,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("include tag objects that refer to objects to be packed")),
 		OPT_BOOL(0, "keep-unreachable", &keep_unreachable,
 			 N_("keep unreachable objects")),
+		OPT_BOOL(0, "pack-loose-unreachable", &pack_loose_unreachable,
+			 N_("pack loose unreachable objects")),
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
diff --git a/builtin/repack.c b/builtin/repack.c
index 573e66c..f7b7409 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -248,6 +248,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 						"--unpack-unreachable");
 			} else if (keep_unreachable) {
 				argv_array_push(&cmd.args, "--keep-unreachable");
+				argv_array_push(&cmd.args, "--pack-loose-unreachable");
 			} else {
 				argv_array_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
 			}
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index f13df43..987573c 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -137,4 +137,17 @@ test_expect_success 'repack -k keeps unreachable packed objects' '
 	test_must_fail git cat-file -p $sha1
 '
 
+test_expect_success 'repack -k packs unreachable loose objects' '
+	# create loose unreachable object
+	sha1=$(echo would-be-deleted-loose | git hash-object -w --stdin) &&
+	objpath=.git/objects/$(echo $sha1 | sed "s,..,&/,") &&
+	test_path_is_file $objpath &&
+
+	# and confirm that the loose object goes away, but we can
+	# still access it (ergo, it is packed)
+	git repack -adk &&
+	test_path_is_missing $objpath &&
+	git cat-file -p $sha1
+'
+
 test_done
-- 
2.9.0.rc2.149.gd580ccd

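Outside git's own test harness, the combined behaviour of patches 2 and 3 can be tried with a standalone script along these lines. It is a sketch only, assuming a git new enough to include the "-k" option from this series and a throwaway repository created with mktemp; the identity and blob contents are made up.

```shell
#!/bin/sh
# Sketch only: "repack -a -d -k" keeps unreachable objects, packed or loose.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git -c user.name=demo -c user.email=demo@example.com \
	commit -q --allow-empty -m "reachable history"

# An unreachable blob that lives only in a pack...
packed=$(echo unreachable-packed | git hash-object -w --stdin)
echo "$packed" | git pack-objects -q .git/objects/pack/pack >/dev/null
git prune-packed

# ...and an unreachable blob that exists only as a loose file.
loose=$(echo unreachable-loose | git hash-object -w --stdin)
loosepath=.git/objects/$(echo "$loose" | sed 's,..,&/,')

git repack -q -a -d -k

# Both blobs should still be readable, and the loose file should be gone
# because its content has been moved into the new pack.
if git cat-file -e "$packed" 2>/dev/null; then packed_kept=yes; else packed_kept=no; fi
if git cat-file -e "$loose" 2>/dev/null; then loose_kept=yes; else loose_kept=no; fi
if test -e "$loosepath"; then loose_file=present; else loose_file=removed; fi

echo "packed blob kept: $packed_kept"
echo "loose blob kept: $loose_kept"
echo "loose file: $loose_file"
```

Because nothing is ever dropped, a script built on this never needs a follow-up "git prune", which is the point made in the commit message of patch 3.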
Thread overview: 11+ messages

2016-06-12 21:25 Repacking a repository uses up all available disk space Konstantin Ryabitsev
2016-06-12 21:38 ` Jeff King
2016-06-12 21:54   ` Konstantin Ryabitsev
2016-06-12 22:13     ` Jeff King
2016-06-13  0:24       ` Duy Nguyen
2016-06-13  4:58         ` Jeff King
2016-06-13  1:43       ` Nasser Grainawi
2016-06-13  4:33         ` [PATCH 0/3] repack --keep-unreachable Jeff King
2016-06-13  4:33           ` [PATCH 1/3] repack: document --unpack-unreachable option Jeff King
2016-06-13  4:36           ` [PATCH 2/3] repack: add --keep-unreachable option Jeff King
2016-06-13  4:38           ` [PATCH 3/3] repack: extend --keep-unreachable to loose objects Jeff King