From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>,
git@vger.kernel.org, peff@peff.net, dstolee@microsoft.com,
szeder.dev@gmail.com, gitster@pobox.com
Subject: Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
Date: Fri, 14 Aug 2020 16:10:28 -0400 [thread overview]
Message-ID: <20200814201028.GC30103@syl.lan> (raw)
In-Reply-To: <ee5c0db0-6582-f039-96e1-be64ee4fe7a6@gmail.com>
On Wed, Aug 12, 2020 at 08:29:06AM -0400, Derrick Stolee wrote:
> On 8/11/2020 4:52 PM, Taylor Blau wrote:
> > Introduce a command-line flag and configuration variable to fill in the
> > 'max_new_filters' variable introduced by the previous patch.
> >
> > The command-line option '--max-new-filters' takes precedence over
> > 'commitGraph.maxNewFilters', which is the default value.
> > '--no-max-new-filters' can also be provided, which sets the value back
> > to '-1', indicating that an unlimited number of new Bloom filters may be
> > generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> > '0', hence a custom callback was used instead).
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
>
> ...
>
> > diff --git a/bloom.c b/bloom.c
> > index ed54e96e57..8d07209c6b 100644
> > --- a/bloom.c
> > +++ b/bloom.c
> > @@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
> > else
> > start_index = 0;
> >
> > + if ((start_index == end_index) &&
> > + (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> > + /*
> > + * If the filter is zero-length, either (1) the filter has no
> > + * changes, (2) the filter has too many changes, or (3) it
> > + * wasn't computed (eg., due to '--max-new-filters').
> > + *
> > + * If either (1) or (2) is the case, the 'large' bit will be set
> > + * for this Bloom filter. If it is unset, then it wasn't
> > + * computed. In that case, return nothing, since we don't have
> > + * that filter in the graph.
> > + */
> > + return 0;
> > + }
> > +
>
> Here, you are creating a distinction between an "empty" filter and
> a "too large" filter at a place that I don't think is important.
>
> For instance, this code will be triggered by "git log -- <path>"
> but you only care about the filter being too large when writing the
> commit-graph. I think this check for a "too large" filter should
> instead be inside get_or_compute_bloom_filter(). I include a patch
> below that applies on top of tb/bloom-improvements that gets to
> my point (and how minor the issue might be).
That's a good point. I factored the lazy-loading out in basically the
same way that you did below, but didn't move this call out of
'load_bloom_filter_from_graph'. I think that you're right that this is
better at the calling end (replacing the 'start_index == end_index'
check with a '!filter->len' one instead).
I'm going to "ignore" this patch below since I (a) already have it
locally, and (b) would rather just do this correctly the first time than
introduce and subsequently fix a performance regression.
> Thanks,
> -Stolee
>
> --- >8 ---
>
> From 92a7a3d0f769fad7617426730cb97584a3d07794 Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Wed, 12 Aug 2020 08:20:03 -0400
> Subject: [PATCH] commit-graph: lazy-load large Bloom filter bitmap
>
> The bloom_large bitmap in struct commit_graph is currently loaded
> immediately upon first parse of the commit-graph file (when the chunk
> exists). This is needed because the current implementation of
> get_bloom_filter_from_graph() is special-cased to return NULL when the
> filter is marked as "too large".
>
> This has a slight drawback: before we can read a single commit out of
> the commit-graph file, we need to load this entire chunk into memory.
> This happens even for commands that don't need changed-path Bloom
> filters, such as "git log -1".
>
> This "too large" information is only used when writing a commit-graph
> file, so we can delay the check for a large filter until after we check
> compute_if_not_present in get_or_compute_bloom_filter(). Also, place
> that lazy-load directly in the get_bloom_filter_large_in_graph() method,
> so we ensure it is ready when needed.
>
> This may be overkill. For a repository with one million commits, this
> filter size is approximately 125 *kilobytes* of data. My local
> measurements found that this took between 1 and 2 milliseconds to load
> into memory. Even for repositories with 10 million commits, this
> difference would not be noticeable for end-user commands.
>
> On the other hand, these handfuls of milliseconds could add up when
> running a hosting service using Git, so this extra effort is probably
> worth it.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> bloom.c | 27 ++++++++-------------------
> commit-graph.c | 30 +++++++++++++++++-------------
> commit-graph.h | 8 ++++++++
> 3 files changed, 33 insertions(+), 32 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 8d07209c6b..6d0884fa19 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -51,21 +51,6 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
> else
> start_index = 0;
>
> - if ((start_index == end_index) &&
> - (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> - /*
> - * If the filter is zero-length, either (1) the filter has no
> - * changes, (2) the filter has too many changes, or (3) it
> - * wasn't computed (eg., due to '--max-new-filters').
> - *
> - * If either (1) or (2) is the case, the 'large' bit will be set
> - * for this Bloom filter. If it is unset, then it wasn't
> - * computed. In that case, return nothing, since we don't have
> - * that filter in the graph.
> - */
> - return 0;
> - }
> -
> filter->len = end_index - start_index;
> filter->data = (unsigned char *)(g->chunk_bloom_data +
> sizeof(unsigned char) * start_index +
> @@ -212,16 +197,20 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>
> if (!filter->data) {
> load_commit_graph_info(r, c);
> - if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
> - load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> - return filter;
> + if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
> + load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
> }
>
> - if (filter->data)
> + if (filter && filter->len)
> return filter;
> if (!compute_if_not_present)
> return NULL;
>
> + if (filter && !filter->len &&
> + get_bloom_filter_large_in_graph(r->objects->commit_graph, c,
> + settings->max_changed_paths))
> + return filter;
> +
> repo_diff_setup(r, &diffopt);
> diffopt.flags.recursive = 1;
> diffopt.detect_rename = 0;
> diff --git a/commit-graph.c b/commit-graph.c
> index 4aae5471e3..ea89f431cc 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -435,17 +435,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> if (graph->chunk_bloom_large_filters)
> chunk_repeated = 1;
> else if (r->settings.commit_graph_read_changed_paths) {
> - size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
> + graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
> graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
> graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
> - if (alloc) {
> - size_t j;
> - graph->bloom_large = bitmap_word_alloc(alloc);
> -
> - for (j = 0; j < graph->bloom_large->word_alloc; j++)
> - graph->bloom_large->words[j] = get_be64(
> - graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> - }
> }
> break;
> }
> @@ -953,9 +945,9 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
> return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
> }
>
> -static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> - const struct commit *c,
> - uint32_t max_changed_paths)
> +int get_bloom_filter_large_in_graph(struct commit_graph *g,
> + const struct commit *c,
> + uint32_t max_changed_paths)
> {
> uint32_t graph_pos = commit_graph_position(c);
> if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> @@ -964,8 +956,20 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> while (g && graph_pos < g->num_commits_in_base)
> g = g->base_graph;
>
> - if (!(g && g->bloom_large))
> + if (!g || !g->bloom_large_alloc)
> return 0;
> +
> + if (!g->bloom_large) {
> + size_t j;
> + g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc);
> +
> + for (j = 0; j < g->bloom_large->word_alloc; j++) {
> + const void *data = g->chunk_bloom_large_filters +
> + j * sizeof(eword_t);
> + g->bloom_large->words[j] = get_be64(data);
> + }
> + }
> +
> if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
> /*
> * Force all commits which are subject to a different
> diff --git a/commit-graph.h b/commit-graph.h
> index 75ef83708b..126fd43380 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -51,6 +51,13 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
> struct tree *get_commit_tree_in_graph(struct repository *r,
> const struct commit *c);
>
> +/*
> + * Returns 1 if this commit was marked in the BFXL chunk as having more
> + * than max_changed_paths changes.
> + */
> +int get_bloom_filter_large_in_graph(struct commit_graph *g,
> + const struct commit *c,
> + uint32_t max_changed_paths);
> struct commit_graph {
> const unsigned char *data;
> size_t data_len;
> @@ -74,6 +81,7 @@ struct commit_graph {
> const unsigned char *chunk_bloom_data;
> const unsigned char *chunk_bloom_large_filters;
>
> + size_t bloom_large_alloc;
> struct bitmap *bloom_large;
>
> struct bloom_filter_settings *bloom_filter_settings;
> --
> 2.28.0.38.gc6f546511c1
>
>
Thanks,
Taylor
next prev parent reply other threads:[~2020-08-14 20:10 UTC|newest]
Thread overview: 117+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-04 7:24 ` Jeff King
2020-08-04 20:08 ` Taylor Blau
2020-08-03 18:57 ` [PATCH 02/10] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-03 18:57 ` [PATCH 03/10] t4216: use an '&&'-chain Taylor Blau
2020-08-03 18:57 ` [PATCH 04/10] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-03 18:57 ` [PATCH 05/10] commit-graph: respect 'commitgraph.readChangedPaths' Taylor Blau
2020-08-03 18:57 ` [PATCH 06/10] commit-graph.c: sort index into commits list Taylor Blau
2020-08-04 12:31 ` Derrick Stolee
2020-08-04 20:10 ` Taylor Blau
2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-03 18:59 ` Taylor Blau
2020-08-04 12:57 ` Derrick Stolee
2020-08-03 18:57 ` [PATCH 08/10] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-04 13:00 ` Derrick Stolee
2020-08-04 20:12 ` Taylor Blau
2020-08-03 18:57 ` [PATCH 09/10] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-03 18:57 ` [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-04 13:03 ` Derrick Stolee
2020-08-04 20:14 ` Taylor Blau
2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-05 17:01 ` [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-05 17:02 ` [PATCH v2 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-05 17:02 ` [PATCH v2 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-05 17:02 ` [PATCH v2 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-05 17:02 ` [PATCH v2 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-05 17:02 ` [PATCH v2 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-05 17:02 ` [PATCH v2 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-05 17:02 ` [PATCH v2 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-05 17:02 ` [PATCH v2 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-05 17:02 ` [PATCH v2 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-05 17:02 ` [PATCH v2 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-05 17:02 ` [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-05 21:01 ` Junio C Hamano
2020-08-05 21:17 ` Taylor Blau
2020-08-05 22:21 ` Junio C Hamano
2020-08-05 22:25 ` Taylor Blau
2020-08-11 13:48 ` Taylor Blau
2020-08-11 18:59 ` Junio C Hamano
2020-08-05 17:03 ` [PATCH v2 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-05 17:03 ` [PATCH v2 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-11 20:51 ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-11 21:18 ` SZEDER Gábor
2020-08-11 21:21 ` Taylor Blau
2020-08-11 21:27 ` SZEDER Gábor
2020-08-11 21:34 ` Taylor Blau
2020-08-11 23:55 ` SZEDER Gábor
2020-08-12 11:48 ` Derrick Stolee
2020-08-14 20:17 ` Taylor Blau
2020-08-11 20:51 ` [PATCH v3 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-11 20:51 ` [PATCH v3 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-11 20:51 ` [PATCH v3 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-11 20:51 ` [PATCH v3 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-11 20:51 ` [PATCH v3 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-11 20:51 ` [PATCH v3 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-11 20:51 ` [PATCH v3 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-11 20:51 ` [PATCH v3 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-11 20:51 ` [PATCH v3 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-11 20:52 ` [PATCH v3 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-11 20:52 ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-11 21:11 ` Derrick Stolee
2020-08-11 21:18 ` Taylor Blau
2020-08-11 22:05 ` Taylor Blau
2020-08-19 13:35 ` SZEDER Gábor
2020-09-02 20:23 ` Taylor Blau
2020-09-01 14:35 ` SZEDER Gábor
2020-09-02 20:40 ` Taylor Blau
2020-08-11 20:52 ` [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-19 9:56 ` SZEDER Gábor
2020-09-02 21:02 ` Taylor Blau
2020-08-11 20:52 ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-12 11:49 ` SZEDER Gábor
2020-08-14 20:20 ` Taylor Blau
2020-08-17 22:50 ` SZEDER Gábor
2020-09-02 21:03 ` Taylor Blau
2020-08-12 12:29 ` Derrick Stolee
2020-08-14 20:10 ` Taylor Blau [this message]
2020-08-18 22:23 ` SZEDER Gábor
2020-09-03 16:35 ` Taylor Blau
2020-08-19 8:20 ` SZEDER Gábor
2020-09-03 16:42 ` Taylor Blau
2020-09-04 8:50 ` SZEDER Gábor
2020-09-01 14:36 ` SZEDER Gábor
2020-09-03 18:49 ` Taylor Blau
2020-09-03 21:45 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Junio C Hamano
2020-09-03 22:33 ` Taylor Blau
2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
2020-09-03 22:46 ` [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-03 22:46 ` [PATCH v4 02/14] t4216: use an '&&'-chain Taylor Blau
2020-09-03 22:46 ` [PATCH v4 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-03 22:46 ` [PATCH v4 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-03 22:46 ` [PATCH v4 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-03 22:46 ` [PATCH v4 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-03 22:46 ` [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-05 17:22 ` Jakub Narębski
2020-09-05 17:38 ` Taylor Blau
2020-09-05 17:50 ` Jakub Narębski
2020-09-05 18:01 ` Taylor Blau
2020-09-05 18:18 ` Jakub Narębski
2020-09-05 18:38 ` Taylor Blau
2020-09-05 18:55 ` Taylor Blau
2020-09-05 19:04 ` SZEDER Gábor
2020-09-05 19:49 ` Taylor Blau
2020-09-06 21:52 ` Junio C Hamano
2020-09-03 22:46 ` [PATCH v4 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-03 22:46 ` [PATCH v4 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-03 22:46 ` [PATCH v4 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-09-03 22:46 ` [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-09-04 20:18 ` René Scharfe
2020-09-04 20:22 ` Taylor Blau
2020-09-03 22:46 ` [PATCH v4 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-09-03 22:46 ` [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-04 15:20 ` Taylor Blau
2020-09-03 22:47 ` [PATCH v4 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-04 14:39 ` [PATCH v4 00/14] more miscellaneous Bloom filter improvements Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200814201028.GC30103@syl.lan \
--to=me@ttaylorr.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
--cc=szeder.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.