From: Jakub Narebski <jnareb@gmail.com>
To: "Jeff King via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, "Derrick Stolee" <stolee@gmail.com>,
"SZEDER Gábor" <szeder.dev@gmail.com>,
"Jonathan Tan" <jonathantanmy@google.com>,
"Jeff Hostetler" <jeffhost@microsoft.com>,
"Taylor Blau" <me@ttaylorr.com>, "Jeff King" <peff@peff.net>,
"Garima Singh" <garimasigit@gmail.com>,
"Christian Couder" <christian.couder@gmail.com>,
"Emily Shaffer" <emilyshaffer@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>,
"Garima Singh" <garima.singh@microsoft.com>
Subject: Re: [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
Date: Tue, 18 Feb 2020 18:59:31 +0100
Message-ID: <86k14jkc8s.fsf@gmail.com>
In-Reply-To: <78e8e49c3a1131ffacf660603de60729b3dbadc9.1580943390.git.gitgitgadget@gmail.com> (Jeff King via GitGitGadget's message of "Wed, 05 Feb 2020 22:56:24 +0000")
"Jeff King via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Jeff King <peff@peff.net>
>
> Looking at the diff of commit objects in pack order is much faster than
> in sha1 order, as it gives locality to the access of tree deltas
Nitpick: should we still say sha1 order? Git is still using SHA-1 as an
*oid*, but hopefully soon it will be transitioning to NewHash = SHA-256.
(No need to change anything.)
> (whereas sha1 order is effectively random). Unfortunately the
> commit-graph code sorts the commits (several times, sometimes as an oid
> and sometimes a pointer-to-commit), and we ultimately traverse in sha1
> order.
Actually, the commit-graph code needs write_commit_graph_context.commits.list
to be in lexicographical order (by object id) to be able to turn a position
in the graph into a reference to a commit. The information about the parents
of a commit is stored using positional references within the graph file.
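To illustrate why the sorted order matters: with the object ids in
ascending order, the position of a commit can be found by binary search,
and that position can then serve as the parent reference. A rough sketch
only, not the actual commit-graph code; `TOY_OID_LEN` and `toy_oid_pos()`
are made-up names, and real object ids are longer than 4 bytes:

```c
#include <stddef.h>
#include <string.h>

#define TOY_OID_LEN 4 /* hypothetical short id length, for illustration */

/*
 * Return the index of `oid` in `sorted` (n entries, ascending memcmp()
 * order), or -1 if it is not present.
 */
static long toy_oid_pos(const unsigned char (*sorted)[TOY_OID_LEN], size_t n,
			const unsigned char *oid)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, sorted[mid], TOY_OID_LEN);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

If the list were kept in pack order instead, this lookup would need a
linear scan or a separate index.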
>
> Instead, let's remember the position at which we see each commit, and
> traverse in that order when looking at bloom filters. This drops my time
> for "git commit-graph write --changed-paths" in linux.git from ~4
> minutes to ~1.5 minutes.
Nitpick: with the reordering of patches (which I think is otherwise a good
thing) this patch actually comes before the one adding the "--changed-paths"
option to "git commit-graph write". So it should be 'This would drop my
time...' rather than 'This drops my time...' ;-)
>
> Probably the "--reachable" code path would want something similar.
Has anyone tried doing this?
>
> Or alternatively, we could use a different data structure (either a
> hash, or maybe even just a bit in "struct commit") to keep track of
> which oids we've seen, etc instead of sorting. And then we could keep
> the original order.
I think it is nice to keep those "what ifs?" thoughts in the commit
message. They add some color.
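For what it's worth, the "bit in struct commit" idea could look roughly
like the toy sketch below: duplicates are dropped with a "seen" flag
instead of sort-and-unique, so the order in which commits were first seen
is kept. `struct toy_seen_commit` and `toy_dedup_in_order()` are made up
for illustration and are not git API:

```c
#include <stddef.h>

/* Hypothetical stand-in for a commit carrying a "seen" bit. */
struct toy_seen_commit {
	unsigned seen : 1;
};

/*
 * Remove duplicate pointers from `list` in place, keeping the first
 * occurrence of each and the original relative order.
 * Returns the new number of entries.
 */
static size_t toy_dedup_in_order(struct toy_seen_commit **list, size_t nr)
{
	size_t i, out = 0;

	for (i = 0; i < nr; i++) {
		if (list[i]->seen)
			continue; /* already kept earlier in the list */
		list[i]->seen = 1;
		list[out++] = list[i];
	}
	return out;
}
```

The flag would have to be cleared again afterwards if the same objects
are processed a second time.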
>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
> commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 724bfcffc4..e125511a1c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -17,6 +17,7 @@
> #include "replace-object.h"
> #include "progress.h"
> #include "bloom.h"
> +#include "commit-slab.h"
>
> #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
> #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -46,6 +47,29 @@
> /* Remember to update object flag allocation in object.h */
> #define REACHABLE (1u<<15)
>
> +/* Keep track of the order in which commits are added to our list. */
> +define_commit_slab(commit_pos, int);
> +static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> +
> +static void set_commit_pos(struct repository *r, const struct object_id *oid)
> +{
> + static int32_t max_pos;
> + struct commit *commit = lookup_commit(r, oid);
> +
> + if (!commit)
> + return; /* should never happen, but be lenient */
> +
> + *commit_pos_at(&commit_pos, commit) = max_pos++;
> +}
All right, that is a nice and universal function.
> +
> +static int commit_pos_cmp(const void *va, const void *vb)
> +{
> + const struct commit *a = *(const struct commit **)va;
> + const struct commit *b = *(const struct commit **)vb;
> + return commit_pos_at(&commit_pos, a) -
> + commit_pos_at(&commit_pos, b);
> +}
Hmmm... I wonder what would happen if commit_pos was not set (e.g. for
commit-graph commits not coming from the packfile). Let's look up
the documentation...
commit_pos_at() returns a pointer to an int... why are we comparing
pointers and not values? Shouldn't it be
+ return *commit_pos_at(&commit_pos, a) -
+ *commit_pos_at(&commit_pos, b);
With commit_pos_at() the location to store the data is allocated as
necessary (if data for the commit doesn't exist yet), and because the slab
is allocated with xcalloc(), *commit_pos_at() is 0-initialized. This means
that if commits didn't come from the packfile, we sort all commits as being
equal. Luckily we fix that in the next patch.
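For illustration, here is a toy comparator that compares the stored
values rather than their addresses. It is only a sketch: `struct
toy_commit`, `toy_pos_cmp()` and `toy_sort_by_pos()` are made-up
stand-ins, with a plain `pos` member in place of the commit_pos slab:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a commit with its slab-stored position. */
struct toy_commit {
	int pos;
};

static int toy_pos_cmp(const void *va, const void *vb)
{
	const struct toy_commit *a = *(const struct toy_commit *const *)va;
	const struct toy_commit *b = *(const struct toy_commit *const *)vb;

	/* Compare the values; this form also cannot overflow. */
	return (a->pos > b->pos) - (a->pos < b->pos);
}

/* Sort an array of commit pointers by ascending position. */
static void toy_sort_by_pos(struct toy_commit **list, size_t nr)
{
	qsort(list, nr, sizeof(*list), toy_pos_cmp);
}
```

Note that when every `pos` is 0 the comparator returns 0 for every pair,
and qsort() does not promise any particular order for equal elements.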
> +
> char *get_commit_graph_filename(const char *obj_dir)
> {
> char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -1027,6 +1051,8 @@ static int add_packed_commits(const struct object_id *oid,
> oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
> ctx->oids.nr++;
>
> + set_commit_pos(ctx->r, oid);
> +
> return 0;
> }
>
> @@ -1147,6 +1173,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> {
> int i;
> struct progress *progress = NULL;
> + struct commit **sorted_by_pos;
In the next patch in the series we will sort commits by generation number
and creation date; shouldn't this variable name be more generic to
reflect this, for example just `sorted_commits` or `commits_sorted`?
>
> load_bloom_filters();
>
> @@ -1155,13 +1182,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> _("Computing commit diff Bloom filters"),
> ctx->commits.nr);
>
> + ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
> + COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
> + QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
> +
All right: allocate array, copy data, sort it.
We need to copy the data because (I think) ctx->commits.list has to stay
in lexicographical order, so that the graph positions under which the
parents of a commit are stored can be turned back into references to
those commits.
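The allocate/copy/sort pattern, reduced to a standalone toy (plain
malloc()/memcpy()/qsort() instead of ALLOC_ARRAY/COPY_ARRAY/QSORT;
`toy_sorted_copy()` and `toy_int_ptr_cmp()` are made-up names), just to
show that the original array keeps its order:

```c
#include <stdlib.h>
#include <string.h>

/* Example comparator: elements are pointers to int, ordered by value. */
static int toy_int_ptr_cmp(const void *va, const void *vb)
{
	const int *a = *(void *const *)va;
	const int *b = *(void *const *)vb;

	return (*a > *b) - (*a < *b);
}

/*
 * Return a newly allocated copy of `list` sorted with `cmp`. `list`
 * itself is left untouched. The caller frees the result. Returns NULL
 * for an empty list or on allocation failure.
 */
static void **toy_sorted_copy(void *const *list, size_t nr,
			      int (*cmp)(const void *, const void *))
{
	void **copy;

	if (!nr)
		return NULL;
	copy = malloc(nr * sizeof(*copy));
	if (!copy)
		return NULL;
	memcpy(copy, list, nr * sizeof(*copy));
	qsort(copy, nr, sizeof(*copy), cmp);
	return copy;
}
```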
> for (i = 0; i < ctx->commits.nr; i++) {
> - struct commit *c = ctx->commits.list[i];
> + struct commit *c = sorted_by_pos[i];
All right: use sorted data.
> struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
> display_progress(progress, i + 1);
> }
>
> + free(sorted_by_pos);
Can we free the slab data, i.e. call `clear_commit_pos(&commit_pos);`
here? Otherwise we are leaking memory (well, except that when the command
finishes the operating system frees the memory for us).
> stop_progress(&progress);
> }
Best,
--
Jakub Narębski