From: "Dr N.W. Filardo" <nwf20@cam.ac.uk>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH v3 4/5] repack: optionally assume transitive kept packs
Date: Tue, 25 Jun 2019 11:32:36 +0100 [thread overview]
Message-ID: <62970caebf1e341f72e46c1f2f98fa09@cam.ac.uk> (raw)
In-Reply-To: <fbbbbd2e-3f0f-ffbb-e617-220602c9950f@gmail.com>
On 2019-06-24 14:21, Derrick Stolee wrote:
> On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
>> If the user is careful to mark .pack files as kept only when they
>> refer
>> to (other) kept packs, then we can rely on this when walking the
>> object
>> graph in subsequent repack operations and reduce the time and memory
>> spent building the object graph.
>>
>> Towards that end, then, teach git repack to enumerate the COMMITs and
>> TREEs in kept packs and mark them as UNINTERESTING for pack-object's
>> operation. Because this creates UNINTERESTING marks, we make repack's
>> --assume-pack-keep-transitive imply --sparse for pack-object to avoid
>> hauling the entire object graph into memory.
>>
>> In testing with a 203GB repository with 80M objects, this amounts to
>> several gigabytes of memory and over ten minutes saved. This machine
>> has 32G of RAM and the repository is on a fast SSD to speed testing.
>
> Thanks for the performance details!
Happy to provide. Hopefully they justify my desire to see these patches
land in mainline. :)
>> All told, this test repository takes about 33 minutes elapsed (28
>> minutes in user code) and 3 GB of RAM to repack 12M objects occupying
>> 33GB scattered across 477 pack files not marked as kept (the remainder
>> of the objects are spread across three kept pack files). The time and
>> RAM usage with --assume-pack-keep-transitive should be dominated by
>> factors proportional to the size and number of un-kept objects.
>>
>> Without these optimizations, it takes about 45 minutes (34 minutes in
>> user code) and 7 GB of RAM to compute the same pack file. The extra
>> time and total memory use are expected to be proportional to the kept
>> packs' size, not the working set of the repack. The time discrepancy
>> can be expected to be more dramatic when the repository is on spinning
>> rust rather than SSD.
>>
>> Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
>> ---
>> Documentation/git-repack.txt | 21 +++++++++++++
>> builtin/repack.c | 57
>> ++++++++++++++++++++++++++++++++++--
>> 2 files changed, 76 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/git-repack.txt
>> b/Documentation/git-repack.txt
>> index 836d81457a..014812c4bb 100644
>> --- a/Documentation/git-repack.txt
>> +++ b/Documentation/git-repack.txt
>> @@ -169,6 +169,27 @@ depth is 4095.
>> Pass the `--sparse` option to `git-pack-objects`; see
>> linkgit:git-pack-objects[1].
>>
>> +--assume-pack-keep-transitive::
>> + This flag accelerates the search for objects to pack by assuming
>> + that commit objects found in kept packfiles make reference only
>> + to that or other kept packfiles. This is very useful if, for
>> + example, this repository periodically repacks all reachable objects
>> + and marks the resulting pack file as kept.
>> +
>> + Because this option may prevent git from descending into kept packs,
>> + no objects inside kept packs will be available for delta processing.
>> + One should therefore restrict the use of this option to situations
>> + where delta processing is disabled or when relatively large amounts
>> + of data are considered and the relative gain of a wider set of
>> + possible delta base objects is reduced.
>> +
>> + The simplest way to ensure transitivity of kept packs is to run `git
>> + repack` with `-a` (or `-A`) and `-d` and mark all resulting packs as
>> + kept, never removing the kept marker. In practice, one may wish to
>> + wait to mark packs as kept until they are sufficiently large
>> + (reducing the number of pack files necessary for a given set of
>> + objects) but not so large as to be unwieldy.
>> +
>> Configuration
>> -------------
>>
>> diff --git a/builtin/repack.c b/builtin/repack.c
>> index 4cfdd62bb8..a2cd479686 100644
>> --- a/builtin/repack.c
>> +++ b/builtin/repack.c
>> @@ -11,6 +11,8 @@
>> #include "midx.h"
>> #include "packfile.h"
>> #include "object-store.h"
>> +#include "revision.h"
>> +#include "list-objects.h"
>>
>> static int delta_base_offset = 1;
>> static int pack_kept_objects = -1;
>> @@ -256,6 +258,30 @@ static void repack_promisor_objects(const struct
>> pack_objects_args *args,
>> die(_("could not finish pack-objects to repack promisor objects"));
>> }
>>
>> +static void apkt_show_commit(struct commit *commit, void *data)
>> +{
>> + struct tree *tree;
>> + struct pack_entry e;
>> +
>> + if (!find_pack_entry(the_repository, &commit->object.oid, &e))
>> + return;
>> +
>> + if (!(e.p->pack_keep || e.p->pack_keep_in_core))
>> + return;
>> +
>> + tree = get_commit_tree(commit);
>> + if (tree)
>> + tree->object.flags |= UNINTERESTING;
>> +
>> + write_oid(&commit->object.oid, e.p, 0, data);
>> + write_oid(&tree->object.oid, NULL, 0, data);
>> +}
>> +
>> +static void apkt_show_object(struct object *obj, const char *name,
>> void *data)
>> +{
>> + return;
>> +}
>> +
>> #define ALL_INTO_ONE 1
>> #define LOOSEN_UNREACHABLE 2
>>
>> @@ -287,6 +313,7 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>> struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>> int no_update_server_info = 0;
>> int midx_cleared = 0;
>> + int assume_pack_keep_transitive = 0;
>> struct pack_objects_args po_args = {NULL};
>> int sparse = 0;
>>
>> @@ -329,6 +356,8 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>> N_("repack objects in packs marked with .keep")),
>> OPT_BOOL(0, "sparse", &sparse,
>> N_("use the sparse reachability algorithm")),
>> + OPT_BOOL(0, "assume-pack-keep-transitive",
>> &assume_pack_keep_transitive,
>> + N_("assume kept packs reference only kept packs")),
>> OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>> N_("do not repack this pack")),
>> OPT_END()
>> @@ -346,6 +375,11 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>> (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>> die(_("--keep-unreachable and -A are incompatible"));
>>
>> + if (assume_pack_keep_transitive) {
>> + /* imply --honor-pack-keep */
>> + pack_kept_objects = 0;
>> + }
>> +
>
> If you also set `sparse = 1;` here, then...
That seems reasonable.
>> - if (sparse)
>> + if (sparse || assume_pack_keep_transitive)
>> argv_array_push(&cmd.args, "--sparse");
>
> ...this logic can stay the same. And be simpler.
Done.
>> if (repository_format_partial_clone)
>> argv_array_push(&cmd.args, "--exclude-promisor-objects");
>> @@ -407,12 +441,31 @@ int cmd_repack(int argc, const char **argv,
>> const char *prefix)
>> argv_array_push(&cmd.args, "--incremental");
>> }
>>
>> - cmd.no_stdin = 1;
>> + cmd.in = -1;
>> + cmd.no_stdin = !assume_pack_keep_transitive;
>
> I wonder if this `cmd.in = -1;` needs to be here, or should be in
> the `if (assume_pack_keep_transitive)` block below.
It doesn't need to be, I suppose, but it seemed a little less subtle to
just assign it either way, as it will either be ignored or do the right
thing below.
>>
>> ret = start_command(&cmd);
>> if (ret)
>> return ret;
>>
>> + if (assume_pack_keep_transitive) {
>> + struct rev_info revs;
>> + const char *revargv[] = { "skip", "--all", "--reflog",
>> "--indexed-objects", NULL };
>> +
>> + repo_init_revisions(the_repository, &revs, NULL);
>
>> + revs.sparse_tree_walk = !!(sparse || assume_pack_keep_transitive);
>
> Here is the bitflag! Excellent. Again, this can be simplified if we
> set `sparse = 1`
> in the first assume_pack_keep_transitive block.
In fact, maybe it should just be set to 1, because we're only here if...
.
I wonder if this is some cruft left over from how I developed the patch.
I'm OK with either "= 1" or "= sparse", as you prefer.
>> + setup_revisions(sizeof(revargv)/sizeof(revargv[0]) - 1 , revargv,
>> &revs, NULL);
>
> Some whitespace strangeness on this line.
Spurious space before , removed.
>> + if (prepare_revision_walk(&revs))
>> + die("revision walk setup failed");
>
> This string should be localizable. It's not entirely your fault, since
> running
> 'git grep -A 1 prepare_revision_walk' shows a variety of different
> uses, most
> of which don't use the localizable string, but enough do that you could
> add it
> here.
Well, I'll try not to be a further bad example!
Thanks for the review.
--nwf;
next prev parent reply other threads:[~2019-06-25 10:52 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
2019-06-24 12:52 ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 2/5] revision walk: optionally use sparse reachability Nathaniel Filardo
2019-06-24 12:54 ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 3/5] repack: add --sparse and pass to pack-objects Nathaniel Filardo
2019-06-24 13:03 ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 4/5] repack: optionally assume transitive kept packs Nathaniel Filardo
2019-06-24 13:21 ` Derrick Stolee
2019-06-25 10:32 ` Dr N.W. Filardo [this message]
2019-06-24 12:07 ` [PATCH v3 5/5] builtin/gc: add --assume-pack-keep-transitive Nathaniel Filardo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=62970caebf1e341f72e46c1f2f98fa09@cam.ac.uk \
--to=nwf20@cam.ac.uk \
--cc=git@vger.kernel.org \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).