All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Nathaniel Filardo <nwf20@cl.cam.ac.uk>, git@vger.kernel.org
Subject: Re: [PATCH v3 4/5] repack: optionally assume transitive kept packs
Date: Mon, 24 Jun 2019 09:21:23 -0400	[thread overview]
Message-ID: <fbbbbd2e-3f0f-ffbb-e617-220602c9950f@gmail.com> (raw)
In-Reply-To: <20190624120711.27744-5-nwf20@cl.cam.ac.uk>

On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
> If the user is careful to mark .pack files as kept only when they refer
> to (other) kept packs, then we can rely on this when walking the object
> graph in subsequent repack operations and reduce the time and memory
> spent building the object graph.
> 
> Towards that end, then, teach git repack to enumerate the COMMITs and
> TREEs in kept packs and mark them as UNINTERESTING for pack-object's
> operation.  Because this creates UNINTERESTING marks, we make repack's
> --assume-pack-keep-transitive imply --sparse for pack-object to avoid
> hauling the entire object graph into memory.
> 
> In testing with a 203GB repository with 80M objects, this amounts to
> several gigabytes of memory and over ten minutes saved.  This machine
> has 32G of RAM and the repository is on a fast SSD to speed testing.

Thanks for the performance details!

> All told, this test repository takes about 33 minutes elapsed (28
> minutes in user code) and 3 GB of RAM to repack 12M objects occupying
> 33GB scattered across 477 pack files not marked as kept (the remainder
> of the objects are spread across three kept pack files).  The time and
> RAM usage with --assume-pack-keep-transitive should be dominated by
> factors proportional to the size and number of un-kept objects.
> 
> Without these optimizations, it takes about 45 minutes (34 minutes in
> user code) and 7 GB of RAM to compute the same pack file.  The extra
> time and total memory use are expected to be proportional to the kept
> packs' size, not the working set of the repack.  The time discrepancy
> can be expected to be more dramatic when the repository is on spinning
> rust rather than SSD.
> 
> Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
> ---
>  Documentation/git-repack.txt | 21 +++++++++++++
>  builtin/repack.c             | 57 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index 836d81457a..014812c4bb 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -169,6 +169,27 @@ depth is 4095.
>  	Pass the `--sparse` option to `git-pack-objects`; see
>  	linkgit:git-pack-objects[1].
>  
> +--assume-pack-keep-transitive::
> +	This flag accelerates the search for objects to pack by assuming
> +	that commit objects found in kept packfiles make reference only
> +	to that or other kept packfiles.  This is very useful if, for
> +	example, this repository periodically repacks all reachable objects
> +	and marks the resulting pack file as kept.
> +
> +	Because this option may prevent git from descending into kept packs,
> +	no objects inside kept packs will be available for delta processing.
> +	One should therefore restrict the use of this option to situations
> +	where delta processing is disabled or when relatively large amounts
> +	of data are considered and the relative gain of a wider set of
> +	possible delta base objects is reduced.
> +
> +	The simplest way to ensure transitivity of kept packs is to run `git
> +	repack` with `-a` (or `-A`) and `-d` and mark all resulting packs as
> +	kept, never removing the kept marker.  In practice, one may wish to
> +	wait to mark packs as kept until they are sufficiently large
> +	(reducing the number of pack files necessary for a given set of
> +	objects) but not so large as to be unwieldy.
> +
>  Configuration
>  -------------
>  
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 4cfdd62bb8..a2cd479686 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -11,6 +11,8 @@
>  #include "midx.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "revision.h"
> +#include "list-objects.h"
>  
>  static int delta_base_offset = 1;
>  static int pack_kept_objects = -1;
> @@ -256,6 +258,30 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
>  		die(_("could not finish pack-objects to repack promisor objects"));
>  }
>  
> +static void apkt_show_commit(struct commit *commit, void *data)
> +{
> +	struct tree *tree;
> +	struct pack_entry e;
> +
> +	if (!find_pack_entry(the_repository, &commit->object.oid, &e))
> +		return;
> +
> +	if (!(e.p->pack_keep || e.p->pack_keep_in_core))
> +		return;
> +
> +	tree = get_commit_tree(commit);
> +	if (tree)
> +		tree->object.flags |= UNINTERESTING;
> +
> +	write_oid(&commit->object.oid, e.p, 0, data);
> +	write_oid(&tree->object.oid, NULL, 0, data);
> +}
> +
> +static void apkt_show_object(struct object *obj, const char *name, void *data)
> +{
> +	return;
> +}
> +
>  #define ALL_INTO_ONE 1
>  #define LOOSEN_UNREACHABLE 2
>  
> @@ -287,6 +313,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>  	int no_update_server_info = 0;
>  	int midx_cleared = 0;
> +	int assume_pack_keep_transitive = 0;
>  	struct pack_objects_args po_args = {NULL};
>  	int sparse = 0;
>  
> @@ -329,6 +356,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  				N_("repack objects in packs marked with .keep")),
>  		OPT_BOOL(0, "sparse", &sparse,
>  			 N_("use the sparse reachability algorithm")),
> +		OPT_BOOL(0, "assume-pack-keep-transitive", &assume_pack_keep_transitive,
> +				N_("assume kept packs reference only kept packs")),
>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>  				N_("do not repack this pack")),
>  		OPT_END()
> @@ -346,6 +375,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>  		die(_("--keep-unreachable and -A are incompatible"));
>  
> +	if (assume_pack_keep_transitive) {
> +		/* imply --honor-pack-keep */
> +		pack_kept_objects = 0;
> +	}
> +

If you also set `sparse = 1;` here, then...

> -	if (sparse)
> +	if (sparse || assume_pack_keep_transitive)
>  		argv_array_push(&cmd.args, "--sparse");

...this logic can stay the same. And be simpler.

>  	if (repository_format_partial_clone)
>  		argv_array_push(&cmd.args, "--exclude-promisor-objects");
> @@ -407,12 +441,31 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		argv_array_push(&cmd.args, "--incremental");
>  	}
>  
> -	cmd.no_stdin = 1;
> +	cmd.in = -1;
> +	cmd.no_stdin = !assume_pack_keep_transitive;

I wonder if this `cmd.in = -1;` needs to be here, or should be in
the `if (assume_pack_keep_transitive)` block below.

>  
>  	ret = start_command(&cmd);
>  	if (ret)
>  		return ret;
>  
> +	if (assume_pack_keep_transitive) {
> +		struct rev_info revs;
> +		const char *revargv[] = { "skip", "--all", "--reflog", "--indexed-objects", NULL };
> +
> +		repo_init_revisions(the_repository, &revs, NULL);

> +		revs.sparse_tree_walk = !!(sparse || assume_pack_keep_transitive);

Here is the bitflag! Excellent. Again, this can be simplified if we set `sparse = 1`
in the first assume_pack_keep_transitive block.

> +		setup_revisions(sizeof(revargv)/sizeof(revargv[0]) - 1 , revargv, &revs, NULL);

Some whitespace strangeness on this line.

> +		if (prepare_revision_walk(&revs))
> +			die("revision walk setup failed");

This string should be localizable. It's not entirely your fault, since running
'git grep -A 1 prepare_revision_walk' shows a variety of different uses, most
of which don't use the localizable string, but enough do that you could add it
here.

> +
> +		xwrite(cmd.in, "--not\n", 6);
> +		traverse_commit_list(&revs, apkt_show_commit, apkt_show_object,
> +				     &cmd);
> +		xwrite(cmd.in, "--not\n", 6);
> +
> +		close(cmd.in);
> +	}
> +
>  	out = xfdopen(cmd.out, "r");
>  	while (strbuf_getline_lf(&line, out) != EOF) {
>  		if (line.len != the_hash_algo->hexsz)
> 


  reply	other threads:[~2019-06-24 13:21 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
2019-06-24 12:52   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 2/5] revision walk: optionally use sparse reachability Nathaniel Filardo
2019-06-24 12:54   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 3/5] repack: add --sparse and pass to pack-objects Nathaniel Filardo
2019-06-24 13:03   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 4/5] repack: optionally assume transitive kept packs Nathaniel Filardo
2019-06-24 13:21   ` Derrick Stolee [this message]
2019-06-25 10:32     ` Dr N.W. Filardo
2019-06-24 12:07 ` [PATCH v3 5/5] builtin/gc: add --assume-pack-keep-transitive Nathaniel Filardo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fbbbbd2e-3f0f-ffbb-e617-220602c9950f@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=nwf20@cl.cam.ac.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.