* [PATCH 0/2] speed up "Counting objects" when there are many packs @ 2016-07-25 18:49 Jeff King 2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King 2016-07-25 18:50 ` [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early Jeff King 0 siblings, 2 replies; 13+ messages in thread From: Jeff King @ 2016-07-25 18:49 UTC (permalink / raw) To: git We sometimes see cases at GitHub where repository maintenance has fallen behind, and you get a large number of packs. The solution is to repack, but that process is itself made a lot slower by the number of packs. We've experimented a bit with fast "just cat all the packfiles together" type approaches, but they have some downsides, so I have nothing to show there yet. However, there are a few easy optimizations we can do to cut out some unnecessary computation in common cases (e.g., when you have no .keep files and when you have no upstream alternates storage). Both of these patches have been in production at GitHub for about 6 months. [1/2]: pack-objects: break out of want_object loop early [2/2]: pack-objects: compute local/ignore_pack_keep early -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-25 18:49 [PATCH 0/2] speed up "Counting objects" when there are many packs Jeff King @ 2016-07-25 18:50 ` Jeff King 2016-07-25 19:56 ` Junio C Hamano 2016-07-25 18:50 ` [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early Jeff King 1 sibling, 1 reply; 13+ messages in thread From: Jeff King @ 2016-07-25 18:50 UTC (permalink / raw) To: git When pack-objects collects the list of objects to pack (either from stdin, or via its internal rev-list), it filters each one through want_object_in_pack(). This function loops through each existing packfile, looking for the object. When we find it, we mark the pack/offset combo for later use. However, we can't just return "yes, we want it" at that point. If --honor-pack-keep is in effect, we must keep looking to find it in _all_ packs, to make sure none of them has a .keep. Likewise, if --local is in effect, we must make sure it is not present in any local pack. As a result, the sum effort of these calls is effectively O(nr_objects * nr_packs). In an ordinary repository, we have only a handful of packs, and this doesn't make a big difference. But in pathological cases, it can slow the counting phase to a crawl. This patch notices the case that we have neither "--local" nor "--honor-pack-keep" in effect and breaks out of the loop early, after finding the first instance. Note that our worst case is still "objects * packs" (i.e., we might find each object in the last pack we look in), but in practice we will often break out early. On an "average" repo, my git.git with 8 packs, this shows a modest 2% (a few dozen milliseconds) improvement in the counting-objects phase of "git pack-objects --all <foo" (hackily instrumented by sticking exit(0) right after list_objects). But in a much more pathological case, it makes a bigger difference. I ran the same command on a real-world example with ~9 million objects across 1300 packs. The counting time dropped from 413s to 45s, an improvement of about 89%. Note that this patch won't do anything by itself for a normal "git gc", as it uses both --honor-pack-keep and --local. Signed-off-by: Jeff King <peff@peff.net> --- builtin/pack-objects.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index a2f8cfd..55ef5a8 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -981,6 +981,8 @@ static int want_object_in_pack(const unsigned char *sha1, return 0; if (ignore_packed_keep && p->pack_local && p->pack_keep) return 0; + if (!ignore_packed_keep && !local) + break; } } -- 2.9.2.512.g8a06708 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King @ 2016-07-25 19:56 ` Junio C Hamano 2016-07-25 21:41 ` Jeff King 0 siblings, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2016-07-25 19:56 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: > When pack-objects collects the list of objects to pack > (either from stdin, or via its internal rev-list), it > filters each one through want_object_in_pack(). > > This function loops through each existing packfile, looking > for the object. When we find it, we mark the pack/offset > combo for later use. However, we can't just return "yes, we > want it" at that point. If --honor-pack-keep is in effect, > we must keep looking to find it in _all_ packs, to make sure > none of them has a .keep. Likewise, if --local is in effect, > we must make sure it is not present in any local pack. s/any local pack/any non-local pack/, no? > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index a2f8cfd..55ef5a8 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -981,6 +981,8 @@ static int want_object_in_pack(const unsigned char *sha1, > return 0; > if (ignore_packed_keep && p->pack_local && p->pack_keep) > return 0; > + if (!ignore_packed_keep && !local) > + break; > } > } OK, so in this loop, we may return "false" (meaning, we do not want to pack the object) if "local" (do not pack objects that appear in non-local packs) or "ignore_packed_keep" (do not pack objects that appear in locally kept packs) are in effect, but if neither of the options is set, we know that one of the preconditions ("local" or "ignore_packed_keep") for these two "reject by returning false" if statements would never trigger for any pack on packed_git list, so it is safe to break out and return the one that we have found. If that is what is going on, I would have expected to see this early break before these "we found that this is available in borrowed pack and we are only packing local" and "we ignore objects in locally kept packs" checks return false. Or am I not following the logic in the loop correctly? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-25 19:56 ` Junio C Hamano @ 2016-07-25 21:41 ` Jeff King 2016-07-25 21:52 ` Junio C Hamano 0 siblings, 1 reply; 13+ messages in thread From: Jeff King @ 2016-07-25 21:41 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Mon, Jul 25, 2016 at 12:56:23PM -0700, Junio C Hamano wrote: > > This function loops through each existing packfile, looking > > for the object. When we find it, we mark the pack/offset > > combo for later use. However, we can't just return "yes, we > > want it" at that point. If --honor-pack-keep is in effect, > > we must keep looking to find it in _all_ packs, to make sure > > none of them has a .keep. Likewise, if --local is in effect, > > we must make sure it is not present in any local pack. > > s/any local pack/any non-local pack/, no? Oops, yeah. > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > > index a2f8cfd..55ef5a8 100644 > > --- a/builtin/pack-objects.c > > +++ b/builtin/pack-objects.c > > @@ -981,6 +981,8 @@ static int want_object_in_pack(const unsigned char *sha1, > > return 0; > > if (ignore_packed_keep && p->pack_local && p->pack_keep) > > return 0; > > + if (!ignore_packed_keep && !local) > > + break; > > } > > } > > OK, so in this loop, we may return "false" (meaning, we do not want > to pack the object) if "local" (do not pack objects that appear in > non-local packs) or "ignore_packed_keep" (do not pack objects that > appear in locally kept packs) are in effect, but if neither of the > options is set, we know that one of the preconditions ("local" or > "ignore_packed_keep") for these two "reject by returning false" if > statements would never trigger for any pack on packed_git list, so > it is safe to break out and return the one that we have found. Correct. > If that is what is going on, I would have expected to see this early > break before these "we found that this is available in borrowed pack > and we are only packing local" and "we ignore objects in locally > kept packs" checks return false. > > Or am I not following the logic in the loop correctly? Yeah, I think that would work. It has to come after "did we find this in the pack", obviously. And it has to come after the other unrelated checks ("are we just finding it to exclude?" and "are we incremental?"). But you could do: if (!*found_pack) { ... first find! fill in found pack, etc ... } if (exclude) return 1; if (incremental) return 0; if (!ignore_packed_keep && !local) break; /* effectively return 1, but I think the break is more clear */ if (local && !p->pack_local) return 0; if (ignore_packed_keep && p->pack_local && p->pack_keep) return 0; which just bumps it up. I don't think there is a way to make it more elegant, e.g., by only checking ignore_packed_keep once, because we have to distinguish between each condition being set independently, or the case where neither is set. So I stuck the new check at the end, because to me logically it was "can we break out of the loop instead of looking at p->next". But I agree it would be equivalent to place it before the related checks, and I don't mind doing that if you think it's more readable. -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-25 21:41 ` Jeff King @ 2016-07-25 21:52 ` Junio C Hamano 2016-07-25 22:14 ` Jeff King 0 siblings, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2016-07-25 21:52 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: > if (!*found_pack) { > ... first find! fill in found pack, etc ... > } > if (exclude) > return 1; > if (incremental) > return 0; > if (!ignore_packed_keep && !local) > break; /* effectively return 1, but I think the break is more clear */ > if (local && !p->pack_local) > return 0; > if (ignore_packed_keep && p->pack_local && p->pack_keep) > return 0; > > which just bumps it up. I don't think there is a way to make it more > elegant, e.g., by only checking ignore_packed_keep once, because we have > to distinguish between each condition being set independently, or the > case where neither is set. > > So I stuck the new check at the end, because to me logically it was "can > we break out of the loop instead of looking at p->next". But I agree it > would be equivalent to place it before the related checks, and I don't > mind doing that if you think it's more readable. I do not mind too much about having to check two bools twice. But given that the reason why I was confused was because I didn't see why we need to pass the two "return 0" conditions at least once before we decide that we do not need the "return 0" thing at all, and started constructing a case where this might break by writing "Suppose you have two packs, one remote and one local in packed_git list in this order, and ..." before I realized that the new "early break" can be hoisted up like the above, I definitely feel that "we found one, and we aren't conditionally pretending that this thing does not need to be packed at all, so return early and say we want to pack it" is easier to understand before the two existing "if" statements. Thanks. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-25 21:52 ` Junio C Hamano @ 2016-07-25 22:14 ` Jeff King 2016-07-26 20:38 ` Junio C Hamano 0 siblings, 1 reply; 13+ messages in thread From: Jeff King @ 2016-07-25 22:14 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Mon, Jul 25, 2016 at 02:52:24PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > if (!*found_pack) { > > ... first find! fill in found pack, etc ... > > } > > if (exclude) > > return 1; > > if (incremental) > > return 0; > > if (!ignore_packed_keep && !local) > > break; /* effectively return 1, but I think the break is more clear */ > > if (local && !p->pack_local) > > return 0; > > if (ignore_packed_keep && p->pack_local && p->pack_keep) > > return 0; > > > > which just bumps it up. I don't think there is a way to make it more > > elegant, e.g., by only checking ignore_packed_keep once, because we have > > to distinguish between each condition being set independently, or the > > case where neither is set. > > > > So I stuck the new check at the end, because to me logically it was "can > > we break out of the loop instead of looking at p->next". But I agree it > > would be equivalent to place it before the related checks, and I don't > > mind doing that if you think it's more readable. > > I do not mind too much about having to check two bools twice. But > given that the reason why I was confused was because I didn't see > why we need to pass the two "return 0" conditions at least once > before we decide that we do not need the "return 0" thing at all, > and started constructing a case where this might break by writing > "Suppose you have two packs, one remote and one local in packed_git > list in this order, and ..." before I realized that the new "early > break" can be hoisted up like the above, I definitely feel that "we > found one, and we aren't conditionally pretending that this thing > does not need to be packed at all, so return early and say we want > to pack it" is easier to understand before the two existing "if" > statements. Ah, right. Now you had me second-guessing for a moment that there might be a bad case in hoisting it up where we would want to return 0 but would break out early to the "return 1". But it cannot be the case, because the break is mutually exclusive with the two conditions. -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-25 22:14 ` Jeff King @ 2016-07-26 20:38 ` Junio C Hamano 2016-07-26 20:48 ` Jeff King 0 siblings, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2016-07-26 20:38 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: >> I do not mind too much about having to check two bools twice. But >> given that the reason why I was confused was because I didn't see >> why we need to pass the two "return 0" conditions at least once >> before we decide that we do not need the "return 0" thing at all, >> and started constructing a case where this might break by writing >> "Suppose you have two packs, one remote and one local in packed_git >> list in this order, and ..." before I realized that the new "early >> break" can be hoisted up like the above, I definitely feel that "we >> found one, and we aren't conditionally pretending that this thing >> does not need to be packed at all, so return early and say we want >> to pack it" is easier to understand before the two existing "if" >> statements. > > Ah, right. Now you had me second-guessing for a moment that there might > be a bad case in hoisting it up where we would want to return 0 but > would break out early to the "return 1". > > But it cannot be the case, because the break is mutually exclusive with > the two conditions. Here is what I amended looks like (with s/local/non-local/ in the log message). diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index a2f8cfd..a46bf5b 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -977,6 +977,21 @@ static int want_object_in_pack(const unsigned char *sha1, return 1; if (incremental) return 0; + + /* + * When asked to do --local (do not include an + * object that appears in a pack we borrow + * from elsewhere) or --honor-pack-keep (do not + * include an object that appears in a pack marked + * with .keep), we need to make sure no copy of this + * object come from in _any_ pack that causes us to + * omit it, and need to complete this loop. When + * neither option is in effect, we know the object + * we just found is going to be packed, so break + * out of the loop to return 1 now. + */ + if (!ignore_packed_keep && !local) + break; if (local && !p->pack_local) return 0; if (ignore_packed_keep && p->pack_local && p->pack_keep) ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-26 20:38 ` Junio C Hamano @ 2016-07-26 20:48 ` Jeff King 2016-07-26 21:38 ` Junio C Hamano 0 siblings, 1 reply; 13+ messages in thread From: Jeff King @ 2016-07-26 20:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Tue, Jul 26, 2016 at 01:38:47PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > >> I do not mind too much about having to check two bools twice. But > >> given that the reason why I was confused was because I didn't see > >> why we need to pass the two "return 0" conditions at least once > >> before we decide that we do not need the "return 0" thing at all, > >> and started constructing a case where this might break by writing > >> "Suppose you have two packs, one remote and one local in packed_git > >> list in this order, and ..." before I realized that the new "early > >> break" can be hoisted up like the above, I definitely feel that "we > >> found one, and we aren't conditionally pretending that this thing > >> does not need to be packed at all, so return early and say we want > >> to pack it" is easier to understand before the two existing "if" > >> statements. > > > > Ah, right. Now you had me second-guessing for a moment that there might > > be a bad case in hoisting it up where we would want to return 0 but > > would break out early to the "return 1". > > > > But it cannot be the case, because the break is mutually exclusive with > > the two conditions. > > Here is what I amended looks like (with s/local/non-local/ in the > log message). Thanks, I was actually just preparing a very similar patch (to move the condition and to add a comment, since clearly it is tricky). I got side-tracked by adding a t/perf test to show off the improvement. It's rather tricky to get right and takes a long time to run. I _think_ I have it now, but am waiting for results. :) > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index a2f8cfd..a46bf5b 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -977,6 +977,21 @@ static int want_object_in_pack(const unsigned char *sha1, > return 1; > if (incremental) > return 0; > + > + /* > + * When asked to do --local (do not include an > + * object that appears in a pack we borrow > + * from elsewhere) or --honor-pack-keep (do not > + * include an object that appears in a pack marked > + * with .keep), we need to make sure no copy of this > + * object come from in _any_ pack that causes us to > + * omit it, and need to complete this loop. When > + * neither option is in effect, we know the object > + * we just found is going to be packed, so break > + * out of the loop to return 1 now. > + */ > + if (!ignore_packed_keep && !local) > + break; This looks great. Given the explanation in the comment, it might be more clear to switch the break to "return 1", but I could go either way. -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-26 20:48 ` Jeff King @ 2016-07-26 21:38 ` Junio C Hamano 2016-07-27 21:13 ` Jeff King 0 siblings, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2016-07-26 21:38 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: > I got side-tracked by adding a t/perf test to show off the improvement. > It's rather tricky to get right and takes a long time to run. I _think_ > I have it now, but am waiting for results. :) Well, then I'd stop here and wait for the reroll to requeue. Thanks. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-26 21:38 ` Junio C Hamano @ 2016-07-27 21:13 ` Jeff King 2016-07-27 21:28 ` Junio C Hamano 0 siblings, 1 reply; 13+ messages in thread From: Jeff King @ 2016-07-27 21:13 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Tue, Jul 26, 2016 at 02:38:28PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I got side-tracked by adding a t/perf test to show off the improvement. > > It's rather tricky to get right and takes a long time to run. I _think_ > > I have it now, but am waiting for results. :) > > Well, then I'd stop here and wait for the reroll to requeue. So I hoped to follow up with my little perf-testing patch last night, but the rabbit hole goes deeper. :) It turns out that I could not replicate my earlier results using a perf script like the one below. The problem is that it heavily depends on the ordering of your packs. The early-break does not do any good if you mostly end up finding your objects in the final pack in the list. And that is the exact situation my perf script creates: imagine you had a big repository, then 1000 pushes came in, and then you tried to run "git repack". We sort reverse-chronologically, so the big pack is at the end, and most lookups have to go through the entire pack list to get there. In the normal object-lookup paths, we cache the last_found_pack and check it first, but this loop has no equivalent (and until now it wouldn't have helped much, because we generally had to walk the whole list anyway). I knew this was a potential issue and even had experimented this back when the original patches were written, but in my experiments then it didn't help much. That makes sense, though; any case that _was_ improved by the first two patches would not benefit from reordering the pack lookups, because that meant it was finding the objects early in the search (or else the first two patches would have helped not at all). And I think the particular case I was experimenting on back then was not a normal "oops, we had some pushes and forgot to run gc". It was more bizarre, and had several packs that had most of the objects duplicated. So I think this is an area worth looking into; a series of small incremental packs on top of a large one is what I would expect to be the most common case, and the first two patches don't handle it well. I tried a hacky version of the last_found_pack trick, and it does cut the counting phase in half for my perf test. But no patch yet. One is just that doing it cleanly is a little tricky. But two is that I've wondered if we can do even better with a most-recently-used cache instead of the last_pack_found hack. So I'm trying to implement and measure that (both for this loop, and to see if it does better in find_pack_entry). Perf script below is just for a sneak peek at what I'm trying to measure. -Peff -- >8 -- #!/bin/sh test_description='performance with large numbers of packs' . ./perf-lib.sh test_perf_large_repo # Pretend we just have a single branch and no reflogs, and that everything is # in objects/pack; that makes our fake pack-building in the next step much # simpler. test_expect_success 'simplify reachability' ' tip=$(git rev-parse --verify HEAD) && git for-each-ref --format="option no-deref%0adelete %(refname)" | git update-ref --stdin && rm -rf .git/logs && git update-ref refs/heads/master $tip && git symbolic-ref HEAD refs/heads/master && git repack -ad ' # A real many-pack situation would probably come from having a lot of pushes # over time. We don't know how big each push would be, but we can fake it by # just walking the first-parent chain and having each commit be its own "push". # This isn't _entirely_ accurate, as real pushes would have some duplicate # objects due to thin-pack fixing, but it's a reasonable approximation. # # And then all of the rest of the objects can go in a single packfile that # represents the state before any of those pushes (actually, we'll generate # that first because in such a setup it would be the oldest pack, and we sort # the packs by reverse mtime inside git). # # We prepare this in a staging area, because we need to install our baseline # set of packs for each iteration of the perf test (which unfortunately counts # against their times, but is a limitation of the perf framework). test_expect_success 'create a large number of packs' ' mkdir staging && pushes() { git rev-list --first-parent -1000 HEAD } && pack() { git pack-objects --revs --delta-base-offset staging/pack } && bottom=$(pushes | tail -n 1) && echo "$bottom^" | pack && pushes | while read rev do printf "%s\n^%s^" $rev $rev | pack || return 1 done && setup_many_packs () { rm -f .git/objects/pack/* && cp staging/* .git/objects/pack/ } && setup_many_packs ' test_perf 'rev-list' ' git rev-list --objects --all ' test_perf 'full repack' ' setup_many_packs && git repack -ad ' test_done ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-27 21:13 ` Jeff King @ 2016-07-27 21:28 ` Junio C Hamano 2016-07-27 22:04 ` Jeff King 0 siblings, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2016-07-27 21:28 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List On Wed, Jul 27, 2016 at 2:13 PM, Jeff King <peff@peff.net> wrote: > ... But two is that I've > wondered if we can do even better with a most-recently-used cache > instead of the last_pack_found hack. So I'm trying to implement and > measure that (both for this loop, and to see if it does better in > find_pack_entry). It is always delightful to hear a well constructed description of a thought process. Thanks. One thing that made me wonder was what would happen to the last_found that is static to has_sha1_pack_kept_or_nonlocal() funciton, when we invalidate the packed_git list, but within the context of pack-objects it is not likely? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] pack-objects: break out of want_object loop early 2016-07-27 21:28 ` Junio C Hamano @ 2016-07-27 22:04 ` Jeff King 0 siblings, 0 replies; 13+ messages in thread From: Jeff King @ 2016-07-27 22:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Wed, Jul 27, 2016 at 02:28:32PM -0700, Junio C Hamano wrote: > On Wed, Jul 27, 2016 at 2:13 PM, Jeff King <peff@peff.net> wrote: > > ... But two is that I've > > wondered if we can do even better with a most-recently-used cache > > instead of the last_pack_found hack. So I'm trying to implement and > > measure that (both for this loop, and to see if it does better in > > find_pack_entry). > > It is always delightful to hear a well constructed description of a > thought process. Thanks. > > One thing that made me wonder was what would happen to the > last_found that is static to has_sha1_pack_kept_or_nonlocal() > funciton, when we invalidate the packed_git list, but within the > context of pack-objects it is not likely? I wondered that, too, and that's part of what I'm working on. :) We leave the old packed_git structs in place when we re-read the pack directory. That works because either we have them already opened (so even though they're gone, we can still access them, and that's why we _must_ keep the structs in place), or we will call is_pack_valid() and notice that it went away (and quietly skip to checking the next pack). There is one place where we do free the packed_git, but I actually think we can stop doing so. Here's what I'm planning as part of my series: -- >8 -- Subject: [PATCH] sha1_file: drop free_pack_by_name The point of this function is to drop an entry from the "packed_git" cache that points to a file we might be overwriting, because our contents may not be the same (and hence the only caller was pack-objects as it moved a temporary packfile into place). In older versions of git, this could happen because the names of packfiles were derived from the set of objects they contained, not the actual bits on disk. But since 1190a1a (pack-objects: name pack files after trailer hash, 2013-12-05), the name reflects the actual bits on disk, and any two packfiles with the same name can be used interchangeably. Dropping this function not only saves a few lines of code, it makes the lifetime of "struct packed_git" much easier to reason about: namely, we now do not ever free these structs. Signed-off-by: Jeff King <peff@peff.net> --- I won't be surprised if this fixes some obscure bug, because there may be parts of the code that store a pointer to our packed_git that could be invalidated (I'm certain the bitmap code does). But perhaps it doesn't matter because it's rare for this function to trigger at all. cache.h | 1 - pack-write.c | 1 - sha1_file.c | 30 ------------------------------ 3 files changed, 32 deletions(-) diff --git a/cache.h b/cache.h index 3855ddf..57ef726 100644 --- a/cache.h +++ b/cache.h @@ -1416,7 +1416,6 @@ extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t extern void close_pack_windows(struct packed_git *); extern void close_all_packs(void); extern void unuse_pack(struct pack_window **); -extern void free_pack_by_name(const char *); extern void clear_delta_base_cache(void); extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local); diff --git a/pack-write.c b/pack-write.c index 33293ce..ea0b788 100644 --- a/pack-write.c +++ b/pack-write.c @@ -354,7 +354,6 @@ void finish_tmp_packfile(struct strbuf *name_buffer, die_errno("unable to make temporary index file readable"); strbuf_addf(name_buffer, "%s.pack", sha1_to_hex(sha1)); - free_pack_by_name(name_buffer->buf); if (rename(pack_tmp_name, name_buffer->buf)) die_errno("unable to rename temporary pack file"); diff --git a/sha1_file.c b/sha1_file.c index d5e1121..e045d2f 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -891,36 +891,6 @@ void close_pack_index(struct packed_git *p) } } -/* - * This is used by git-repack in case a newly created pack happens to - * contain the same set of objects as an existing one. In that case - * the resulting file might be different even if its name would be the - * same. It is best to close any reference to the old pack before it is - * replaced on disk. Of course no index pointers or windows for given pack - * must subsist at this point. If ever objects from this pack are requested - * again, the new version of the pack will be reinitialized through - * reprepare_packed_git(). - */ -void free_pack_by_name(const char *pack_name) -{ - struct packed_git *p, **pp = &packed_git; - - while (*pp) { - p = *pp; - if (strcmp(pack_name, p->pack_name) == 0) { - clear_delta_base_cache(); - close_pack(p); - free(p->bad_object_sha1); - *pp = p->next; - if (last_found_pack == p) - last_found_pack = NULL; - free(p); - return; - } - pp = &p->next; - } -} - static unsigned int get_max_fd_limit(void) { #ifdef RLIMIT_NOFILE -- 2.9.2.607.g98dce7b ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early 2016-07-25 18:49 [PATCH 0/2] speed up "Counting objects" when there are many packs Jeff King 2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King @ 2016-07-25 18:50 ` Jeff King 1 sibling, 0 replies; 13+ messages in thread From: Jeff King @ 2016-07-25 18:50 UTC (permalink / raw) To: git In want_object_in_pack(), we can exit early from our loop if neither "local" nor "ignore_pack_keep" are set. If they are, however, we must examine each pack to see if it has the object and is non-local or has a ".keep". It's quite common for there to be no non-local or .keep packs at all, in which case we know ahead of time that looking further will be pointless. We can pre-compute this by simply iterating over the list of packs ahead of time, and dropping the flags if there are no packs that could match. Another similar strategy would be to modify the loop in want_object_in_pack() to notice that we have already found the object once, and that we are looping only to check for "local" and "keep" attributes. If a pack has neither of those, we can skip the call to find_pack_entry_one(), which is the expensive part of the loop. This has two advantages: - it isn't all-or-nothing; we still get some improvement when there's a small number of kept or non-local packs, and a large number of non-kept local packs - it eliminates any possible race where we add new non-local or kept packs after our initial scan. In practice, I don't think this race matters; we already cache the packed_git information, so somebody who adds a new pack or .keep file after we've started will not be noticed at all, unless we happen to need to call reprepare_packed_git() because a lookup fails. In other words, we're already racy, and the race is not a big deal (losing the race means we might include an object in the pack that would not otherwise be, which is an acceptable outcome). However, it also has a disadvantage: we still loop over the rest of the packs for each object to check their flags. This is much less expensive than doing the object lookup, but still not free. So if we wanted to implement that strategy to cover the non-all-or-nothing cases, we could do so in addition to this one (so you get the most speedup in the all-or-nothing case, and the best we can do in the other cases). But given that the all-or-nothing case is likely the most common, it is probably not worth the trouble, and we can revisit this later if evidence points otherwise. Signed-off-by: Jeff King <peff@peff.net> --- builtin/pack-objects.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 55ef5a8..c5e343c 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -46,6 +46,7 @@ static int keep_unreachable, unpack_unreachable, include_tag; static unsigned long unpack_unreachable_expiration; static int pack_loose_unreachable; static int local; +static int have_non_local_packs; static int incremental; static int ignore_packed_keep; static int allow_ofs_delta; @@ -981,7 +982,7 @@ static int want_object_in_pack(const unsigned char *sha1, return 0; if (ignore_packed_keep && p->pack_local && p->pack_keep) return 0; - if (!ignore_packed_keep && !local) + if (!ignore_packed_keep && (!local || !have_non_local_packs)) break; } } @@ -2785,6 +2786,28 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) progress = 2; prepare_packed_git(); + if (ignore_packed_keep) { + struct packed_git *p; + for (p = packed_git; p; p = p->next) + if (p->pack_local && p->pack_keep) + break; + if (!p) /* no keep-able packs found */ + ignore_packed_keep = 0; + } + if (local) { + /* + * unlike ignore_packed_keep above, we do not want to + * unset "local" based on looking at packs, as it + * also covers non-local objects + */ + struct packed_git *p; + for (p = packed_git; p; p = p->next) { + if (!p->pack_local) { + have_non_local_packs = 1; + break; + } + } + } if (progress) progress_state = start_progress(_("Counting objects"), 0); -- 2.9.2.512.g8a06708 ^ permalink raw reply related [flat|nested] 13+ messages in thread
end of thread, other threads:[~2016-07-27 22:04 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-07-25 18:49 [PATCH 0/2] speed up "Counting objects" when there are many packs Jeff King 2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King 2016-07-25 19:56 ` Junio C Hamano 2016-07-25 21:41 ` Jeff King 2016-07-25 21:52 ` Junio C Hamano 2016-07-25 22:14 ` Jeff King 2016-07-26 20:38 ` Junio C Hamano 2016-07-26 20:48 ` Jeff King 2016-07-26 21:38 ` Junio C Hamano 2016-07-27 21:13 ` Jeff King 2016-07-27 21:28 ` Junio C Hamano 2016-07-27 22:04 ` Jeff King 2016-07-25 18:50 ` [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).