git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Dr N.W. Filardo" <nwf20@cam.ac.uk>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH v3 4/5] repack: optionally assume transitive kept packs
Date: Tue, 25 Jun 2019 11:32:36 +0100	[thread overview]
Message-ID: <62970caebf1e341f72e46c1f2f98fa09@cam.ac.uk> (raw)
In-Reply-To: <fbbbbd2e-3f0f-ffbb-e617-220602c9950f@gmail.com>

On 2019-06-24 14:21, Derrick Stolee wrote:
> On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
>> If the user is careful to mark .pack files as kept only when they
>> refer
>> to (other) kept packs, then we can rely on this when walking the
>> object
>> graph in subsequent repack operations and reduce the time and memory
>> spent building the object graph.
>> 
>> Towards that end, then, teach git repack to enumerate the COMMITs and
>> TREEs in kept packs and mark them as UNINTERESTING for pack-object's
>> operation.  Because this creates UNINTERESTING marks, we make repack's
>> --assume-pack-keep-transitive imply --sparse for pack-object to avoid
>> hauling the entire object graph into memory.
>> 
>> In testing with a 203GB repository with 80M objects, this amounts to
>> several gigabytes of memory and over ten minutes saved.  This machine
>> has 32G of RAM and the repository is on a fast SSD to speed testing.
> 
> Thanks for the performance details!

Happy to provide.  Hopefully they justify my desire to see these patches
land in mainline. :)

>> All told, this test repository takes about 33 minutes elapsed (28
>> minutes in user code) and 3 GB of RAM to repack 12M objects occupying
>> 33GB scattered across 477 pack files not marked as kept (the remainder
>> of the objects are spread across three kept pack files).  The time and
>> RAM usage with --assume-pack-keep-transitive should be dominated by
>> factors proportional to the size and number of un-kept objects.
>> 
>> Without these optimizations, it takes about 45 minutes (34 minutes in
>> user code) and 7 GB of RAM to compute the same pack file.  The extra
>> time and total memory use are expected to be proportional to the kept
>> packs' size, not the working set of the repack.  The time discrepancy
>> can be expected to be more dramatic when the repository is on spinning
>> rust rather than SSD.
>> 
>> Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
>> ---
>>  Documentation/git-repack.txt | 21 +++++++++++++
>>  builtin/repack.c             | 57
>> ++++++++++++++++++++++++++++++++++--
>>  2 files changed, 76 insertions(+), 2 deletions(-)
>> 
>> diff --git a/Documentation/git-repack.txt
>> b/Documentation/git-repack.txt
>> index 836d81457a..014812c4bb 100644
>> --- a/Documentation/git-repack.txt
>> +++ b/Documentation/git-repack.txt
>> @@ -169,6 +169,27 @@ depth is 4095.
>>  	Pass the `--sparse` option to `git-pack-objects`; see
>>  	linkgit:git-pack-objects[1].
>> 
>> +--assume-pack-keep-transitive::
>> +	This flag accelerates the search for objects to pack by assuming
>> +	that commit objects found in kept packfiles make reference only
>> +	to that or other kept packfiles.  This is very useful if, for
>> +	example, this repository periodically repacks all reachable objects
>> +	and marks the resulting pack file as kept.
>> +
>> +	Because this option may prevent git from descending into kept packs,
>> +	no objects inside kept packs will be available for delta processing.
>> +	One should therefore restrict the use of this option to situations
>> +	where delta processing is disabled or when relatively large amounts
>> +	of data are considered and the relative gain of a wider set of
>> +	possible delta base objects is reduced.
>> +
>> +	The simplest way to ensure transitivity of kept packs is to run `git
>> +	repack` with `-a` (or `-A`) and `-d` and mark all resulting packs as
>> +	kept, never removing the kept marker.  In practice, one may wish to
>> +	wait to mark packs as kept until they are sufficiently large
>> +	(reducing the number of pack files necessary for a given set of
>> +	objects) but not so large as to be unwieldy.
>> +
>>  Configuration
>>  -------------
>> 
>> diff --git a/builtin/repack.c b/builtin/repack.c
>> index 4cfdd62bb8..a2cd479686 100644
>> --- a/builtin/repack.c
>> +++ b/builtin/repack.c
>> @@ -11,6 +11,8 @@
>>  #include "midx.h"
>>  #include "packfile.h"
>>  #include "object-store.h"
>> +#include "revision.h"
>> +#include "list-objects.h"
>> 
>>  static int delta_base_offset = 1;
>>  static int pack_kept_objects = -1;
>> @@ -256,6 +258,30 @@ static void repack_promisor_objects(const struct
>> pack_objects_args *args,
>>  		die(_("could not finish pack-objects to repack promisor objects"));
>>  }
>> 
>> +static void apkt_show_commit(struct commit *commit, void *data)
>> +{
>> +	struct tree *tree;
>> +	struct pack_entry e;
>> +
>> +	if (!find_pack_entry(the_repository, &commit->object.oid, &e))
>> +		return;
>> +
>> +	if (!(e.p->pack_keep || e.p->pack_keep_in_core))
>> +		return;
>> +
>> +	tree = get_commit_tree(commit);
>> +	if (tree)
>> +		tree->object.flags |= UNINTERESTING;
>> +
>> +	write_oid(&commit->object.oid, e.p, 0, data);
>> +	write_oid(&tree->object.oid, NULL, 0, data);
>> +}
>> +
>> +static void apkt_show_object(struct object *obj, const char *name,
>> void *data)
>> +{
>> +	return;
>> +}
>> +
>>  #define ALL_INTO_ONE 1
>>  #define LOOSEN_UNREACHABLE 2
>> 
>> @@ -287,6 +313,7 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>>  	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>>  	int no_update_server_info = 0;
>>  	int midx_cleared = 0;
>> +	int assume_pack_keep_transitive = 0;
>>  	struct pack_objects_args po_args = {NULL};
>>  	int sparse = 0;
>> 
>> @@ -329,6 +356,8 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>>  				N_("repack objects in packs marked with .keep")),
>>  		OPT_BOOL(0, "sparse", &sparse,
>>  			 N_("use the sparse reachability algorithm")),
>> +		OPT_BOOL(0, "assume-pack-keep-transitive",
>> &assume_pack_keep_transitive,
>> +				N_("assume kept packs reference only kept packs")),
>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>>  				N_("do not repack this pack")),
>>  		OPT_END()
>> @@ -346,6 +375,11 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>>  	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>>  		die(_("--keep-unreachable and -A are incompatible"));
>> 
>> +	if (assume_pack_keep_transitive) {
>> +		/* imply --honor-pack-keep */
>> +		pack_kept_objects = 0;
>> +	}
>> +
> 
> If you also set `sparse = 1;` here, then...

That seems reasonable.

>> -	if (sparse)
>> +	if (sparse || assume_pack_keep_transitive)
>>  		argv_array_push(&cmd.args, "--sparse");
> 
> ...this logic can stay the same. And be simpler.

Done.

>>  	if (repository_format_partial_clone)
>>  		argv_array_push(&cmd.args, "--exclude-promisor-objects");
>> @@ -407,12 +441,31 @@ int cmd_repack(int argc, const char **argv,
>> const char *prefix)
>>  		argv_array_push(&cmd.args, "--incremental");
>>  	}
>> 
>> -	cmd.no_stdin = 1;
>> +	cmd.in = -1;
>> +	cmd.no_stdin = !assume_pack_keep_transitive;
> 
> I wonder if this `cmd.in = -1;` needs to be here, or should be in
> the `if (assume_pack_keep_transitive)` block below.

It doesn't need to be, I suppose, but it seemed a little less subtle to
just assign it either way, as it will either be ignored or do the right
thing below.

>> 
>>  	ret = start_command(&cmd);
>>  	if (ret)
>>  		return ret;
>> 
>> +	if (assume_pack_keep_transitive) {
>> +		struct rev_info revs;
>> +		const char *revargv[] = { "skip", "--all", "--reflog",
>> "--indexed-objects", NULL };
>> +
>> +		repo_init_revisions(the_repository, &revs, NULL);
> 
>> +		revs.sparse_tree_walk = !!(sparse || assume_pack_keep_transitive);
> 
> Here is the bitflag! Excellent. Again, this can be simplified if we
> set `sparse = 1`
> in the first assume_pack_keep_transitive block.

In fact, maybe it should just be set to 1, because we're only here if...
.
I wonder if this is some cruft left over from how I developed the patch.
I'm OK with either "= 1" or "= sparse", as you prefer.

>> +		setup_revisions(sizeof(revargv)/sizeof(revargv[0]) - 1 , revargv,
>> &revs, NULL);
> 
> Some whitespace strangeness on this line.

Spurious space before , removed.

>> +		if (prepare_revision_walk(&revs))
>> +			die("revision walk setup failed");
> 
> This string should be localizable. It's not entirely your fault, since
> running
> 'git grep -A 1 prepare_revision_walk' shows a variety of different
> uses, most
> of which don't use the localizable string, but enough do that you could
> add it
> here.

Well, I'll try not to be a further bad example!

Thanks for the review.
--nwf;


  reply	other threads:[~2019-06-25 10:52 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
2019-06-24 12:52   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 2/5] revision walk: optionally use sparse reachability Nathaniel Filardo
2019-06-24 12:54   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 3/5] repack: add --sparse and pass to pack-objects Nathaniel Filardo
2019-06-24 13:03   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 4/5] repack: optionally assume transitive kept packs Nathaniel Filardo
2019-06-24 13:21   ` Derrick Stolee
2019-06-25 10:32     ` Dr N.W. Filardo [this message]
2019-06-24 12:07 ` [PATCH v3 5/5] builtin/gc: add --assume-pack-keep-transitive Nathaniel Filardo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=62970caebf1e341f72e46c1f2f98fa09@cam.ac.uk \
    --to=nwf20@cam.ac.uk \
    --cc=git@vger.kernel.org \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).