From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
"Felipe Contreras" <felipe.contreras@gmail.com>,
"SZEDER Gábor" <szeder.dev@gmail.com>,
"Chris Torek" <chris.torek@gmail.com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>,
"Taylor Blau" <me@ttaylorr.com>
Subject: [PATCH v4 0/6] Speed up connectivity checks
Date: Thu, 5 Aug 2021 13:25:19 +0200 [thread overview]
Message-ID: <cover.1628162156.git.ps@pks.im> (raw)
In-Reply-To: <cover.1627896460.git.ps@pks.im>
[-- Attachment #1: Type: text/plain, Size: 17702 bytes --]
Hi,
this is version 4 of my series to speed up connectivity checks. Given
that v3 has received positive feedback, I finally stuck to the approach
and only have a bunch of iterative changes based on your feedback.
Changes compared to v3:
- Patch 1/6 is new and splits up revs->no_walk into revs->no_walk
which now only indicates whether to walk and revs->unserted_input,
which indicates whether to sort input.
- Patch 2/6 got some documentation for the new `--unsorted-input`
option. Furthermore, we refuse `--no-walk`/`--no-walk=sorted` and
`--unsorted-input` if used together. I've also added some tests
for the new option.
- Patch 3/6 has an updated commit message, detailing that
`add_pending_oid()` only is a thin wrapper around
`add_pending_object()`.
- Patch 4/6 has an update commit message, stating that it's a
prerequisite for 6/6.
- Patch 5/6 is new, splitting out the logic to find a commit ID in
the commit graph as a prerequisite for 6/6.
- Patch 6/6 now also verifies that commits parsed only via the
commit-graph exist in the ODB. I've also renamed
`find_object_in_graph()` to `parse_commit_in_graph_gently()` to
better reflect what the function does.
With the added existence check in 6/6, the speedup is not as big as
before (1.47x faster instead of 1.55x). But it's still very much worth
it. In total, this patch series decreases `git rev-list --objects
--unsorted --not --all --not $newrev` from 7.6s to 3.0s in my test
repository.
Thanks for all your feedback!
Patrick
Patrick Steinhardt (6):
revision: separate walk and unsorted flags
connected: do not sort input revisions
revision: stop retrieving reference twice
revision: avoid loading object headers multiple times
commit-graph: split out function to search commit position
revision: avoid hitting packfiles when commits are in commit-graph
Documentation/rev-list-options.txt | 8 +++-
builtin/log.c | 2 +-
builtin/revert.c | 3 +-
commit-graph.c | 75 +++++++++++++++++++++---------
commit-graph.h | 7 +++
connected.c | 1 +
revision.c | 52 +++++++++++++++++----
revision.h | 7 +--
t/t6000-rev-list-misc.sh | 38 +++++++++++++++
9 files changed, 153 insertions(+), 40 deletions(-)
Range-diff against v3:
-: ---------- > 1: 67232910ac revision: separate walk and unsorted flags
1: 1fd83f726a ! 2: 9d7f484907 connected: do not sort input revisions
@@ Commit message
Signed-off-by: Patrick Steinhardt <ps@pks.im>
+ ## Documentation/rev-list-options.txt ##
+@@ Documentation/rev-list-options.txt: list of the missing objects. Object IDs are prefixed with a ``?'' character.
+ objects.
+ endif::git-rev-list[]
+
++--unsorted-input::
++ Show commits in the order they were given on the command line instead
++ of sorting them in reverse chronological order by commit time. Cannot
++ be combined with `--no-walk` or `--no-walk=sorted`.
++
+ --no-walk[=(sorted|unsorted)]::
+ Only show the given commits, but do not traverse their ancestors.
+ This has no effect if a range is specified. If the argument
+@@ Documentation/rev-list-options.txt: endif::git-rev-list[]
+ given on the command line. Otherwise (if `sorted` or no argument
+ was given), the commits are shown in reverse chronological order
+ by commit time.
+- Cannot be combined with `--graph`.
++ Cannot be combined with `--graph`. Cannot be combined with
++ `--unsorted-input` if `sorted` or no argument was given.
+
+ --do-walk::
+ Overrides a previous `--no-walk`.
+
## connected.c ##
@@ connected.c: int check_connected(oid_iterate_fn fn, void *cb_data,
if (opt->progress)
@@ revision.c: static int handle_revision_opt(struct rev_info *revs, int argc, cons
revs->sort_order = REV_SORT_BY_AUTHOR_DATE;
revs->topo_order = 1;
+ } else if (!strcmp(arg, "--unsorted-input")) {
++ if (revs->no_walk && !revs->unsorted_input)
++ die(_("--unsorted-input is incompatible with --no-walk and --no-walk=sorted"));
+ revs->unsorted_input = 1;
} else if (!strcmp(arg, "--early-output")) {
revs->early_output = 100;
revs->topo_order = 1;
-@@ revision.c: int prepare_revision_walk(struct rev_info *revs)
-
- if (!revs->reflog_info)
- prepare_to_use_bloom_filter(revs);
-- if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
-+ if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input)
- commit_list_sort_by_date(&revs->commits);
- if (revs->no_walk)
- return 0;
+@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
+ } else if (!strcmp(arg, "--not")) {
+ *flags ^= UNINTERESTING | BOTTOM;
+ } else if (!strcmp(arg, "--no-walk")) {
++ if (revs->unsorted_input)
++ die(_("--no-walk is incompatible with --no-walk=unsorted and --unsorted-input"));
+ revs->no_walk = 1;
+ } else if (skip_prefix(arg, "--no-walk=", &optarg)) {
+ /*
+@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
+ * not allowed, since the argument is optional.
+ */
+ revs->no_walk = 1;
+- if (!strcmp(optarg, "sorted"))
++ if (!strcmp(optarg, "sorted")) {
++ if (revs->unsorted_input)
++ die(_("--no-walk=sorted is incompatible with --no-walk=unsorted "
++ "and --unsorted-input"));
+ revs->unsorted_input = 0;
+- else if (!strcmp(optarg, "unsorted"))
++ } else if (!strcmp(optarg, "unsorted"))
+ revs->unsorted_input = 1;
+ else
+ return error("invalid argument to --no-walk");
- ## revision.h ##
-@@ revision.h: struct rev_info {
- simplify_history:1,
- show_pulls:1,
- topo_order:1,
-+ unsorted_input:1,
- simplify_merges:1,
- simplify_by_decoration:1,
- single_worktree:1,
+ ## t/t6000-rev-list-misc.sh ##
+@@ t/t6000-rev-list-misc.sh: test_expect_success 'rev-list --count --objects' '
+ test_line_count = $count actual
+ '
+
++test_expect_success 'rev-list --unsorted-input results in different sorting' '
++ git rev-list --unsorted-input HEAD HEAD~ >first &&
++ git rev-list --unsorted-input HEAD~ HEAD >second &&
++ ! test_cmp first second &&
++ sort first >first.sorted &&
++ sort second >second.sorted &&
++ test_cmp first.sorted second.sorted
++'
++
++test_expect_success 'rev-list --unsorted-input compatible with --no-walk=unsorted' '
++ git rev-list --unsorted-input --no-walk=unsorted HEAD HEAD~ >actual &&
++ git rev-parse HEAD >expect &&
++ git rev-parse HEAD~ >>expect &&
++ test_cmp expect actual
++'
++
++test_expect_success 'rev-list --unsorted-input incompatible with --no-walk=sorted' '
++ cat >expect <<-EOF &&
++ fatal: --no-walk is incompatible with --no-walk=unsorted and --unsorted-input
++ EOF
++ test_must_fail git rev-list --unsorted-input --no-walk HEAD 2>error &&
++ test_cmp expect error &&
++
++ cat >expect <<-EOF &&
++ fatal: --no-walk=sorted is incompatible with --no-walk=unsorted and --unsorted-input
++ EOF
++ test_must_fail git rev-list --unsorted-input --no-walk=sorted HEAD 2>error &&
++ test_cmp expect error &&
++
++ cat >expect <<-EOF &&
++ fatal: --unsorted-input is incompatible with --no-walk and --no-walk=sorted
++ EOF
++ test_must_fail git rev-list --no-walk --unsorted-input HEAD 2>error &&
++ test_cmp expect error &&
++ test_must_fail git rev-list --no-walk=sorted --unsorted-input HEAD 2>error &&
++ test_cmp expect error
++'
++
+ test_done
2: db85480649 ! 3: d8e63d0943 revision: stop retrieving reference twice
@@ Commit message
revision: stop retrieving reference twice
When queueing up references for the revision walk, `handle_one_ref()`
- will resolve the reference's object ID and then queue the ID as pending
- object via `add_pending_oid()`. But `add_pending_oid()` will again try
- to resolve the object ID to an object, effectively duplicating the work
- its caller already did before.
+ will resolve the reference's object ID via `get_reference()` and then
+ queue the ID as pending object via `add_pending_oid()`. But given that
+ `add_pending_oid()` is only a thin wrapper around `add_pending_object()`
+ which fist calls `get_reference()`, we effectively resolve the reference
+ twice and thus duplicate some of the work.
- Fix the issue by instead calling `add_pending_object()`, which takes the
- already-resolved object as input. In a repository with lots of refs,
- this translates in a nearly 10% speedup:
+ Fix the issue by instead calling `add_pending_object()` directly, which
+ takes the already-resolved object as input. In a repository with lots of
+ refs, this translates into a near 10% speedup:
Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 5.015 s ± 0.038 s [User: 4.698 s, System: 0.316 s]
3: b9897e102a ! 4: ba8df5cad0 revision: avoid loading object headers multiple times
@@ Commit message
either a tag or a commit, we'd have a modest increase in memory
consumption of about 12.5% here.
+ Note that on its own, this patch may not seem like a clear win. But it
+ is a prerequisite for the following patch, which will result in another
+ 37% speedup.
+
Signed-off-by: Patrick Steinhardt <ps@pks.im>
## revision.c ##
-: ---------- > 5: e33cd51ebf commit-graph: split out function to search commit position
4: f6fc2a5e6d ! 6: 900c5a9c60 revision: avoid hitting packfiles when commits are in commit-graph
@@ Commit message
directly fill in the commit object, otherwise we can still hit the disk
to determine the object's type.
- Expose a new function `find_object_in_graph()`, which fills in an object
- of unknown type in case we find its object ID in the graph. This
- provides a big performance win in cases where there is a commit-graph
- available in the repository in case we load lots of references. The
- following has been executed in a real-world repository with about 2.2
- million refs:
+ Expose a new function `parse_commit_in_graph_gently()`, which fills in
+ an object of unknown type in case we find its object ID in the graph.
+ This provides a big performance win in cases where there is a
+ commit-graph available in the repository in case we load lots of
+ references. The following has been executed in a real-world repository
+ with about 2.2 million refs:
Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
- Time (mean ± σ): 4.465 s ± 0.037 s [User: 4.144 s, System: 0.320 s]
- Range (min … max): 4.411 s … 4.514 s 10 runs
+ Time (mean ± σ): 4.508 s ± 0.039 s [User: 4.131 s, System: 0.377 s]
+ Range (min … max): 4.455 s … 4.576 s 10 runs
Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
- Time (mean ± σ): 2.886 s ± 0.032 s [User: 2.570 s, System: 0.316 s]
- Range (min … max): 2.826 s … 2.933 s 10 runs
+ Time (mean ± σ): 3.072 s ± 0.031 s [User: 2.707 s, System: 0.365 s]
+ Range (min … max): 3.040 s … 3.144 s 10 runs
Summary
'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
- 1.55 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
+ 1.47 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
Signed-off-by: Patrick Steinhardt <ps@pks.im>
## commit-graph.c ##
-@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
- return 1;
+@@ commit-graph.c: static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g,
+ }
}
-+static int find_object_id_in_graph(const struct object_id *id, struct commit_graph *g, uint32_t *pos)
-+{
-+ struct commit_graph *cur_g = g;
-+ uint32_t lex_index;
-+
-+ while (cur_g && !bsearch_graph(cur_g, (struct object_id *)id, &lex_index))
-+ cur_g = cur_g->base_graph;
-+
-+ if (cur_g) {
-+ *pos = lex_index + cur_g->num_commits_in_base;
-+ return 1;
-+ }
-+
-+ return 0;
-+}
-+
-+int find_object_in_graph(struct repository *repo, struct object *object)
++int parse_commit_in_graph_gently(struct repository *repo, struct object *object)
+{
+ struct commit *commit;
+ uint32_t pos;
@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
+ if (!repo->objects->commit_graph)
+ return -1;
+
-+ if (!find_object_id_in_graph(&object->oid, repo->objects->commit_graph, &pos))
++ if (!search_commit_pos_in_graph(&object->oid, repo->objects->commit_graph, &pos))
+ return -1;
+
+ commit = object_as_type(object, OBJ_COMMIT, 1);
@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
+ return 0;
+}
+
- static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
- {
- uint32_t graph_pos = commit_graph_position(item);
-@@ commit-graph.c: static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
- *pos = graph_pos;
- return 1;
- } else {
-- struct commit_graph *cur_g = g;
-- uint32_t lex_index;
--
-- while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
-- cur_g = cur_g->base_graph;
--
-- if (cur_g) {
-- *pos = lex_index + cur_g->num_commits_in_base;
-- return 1;
-- }
--
-- return 0;
-+ return find_object_id_in_graph(&item->object.oid, g, pos);
- }
- }
-
+ static int parse_commit_in_graph_one(struct repository *r,
+ struct commit_graph *g,
+ struct commit *item)
## commit-graph.h ##
-@@ commit-graph.h: int write_commit_graph(struct object_directory *odb,
- enum commit_graph_write_flags flags,
- const struct commit_graph_opts *opts);
+@@ commit-graph.h: int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
+ */
+ int parse_commit_in_graph(struct repository *r, struct commit *item);
-+int find_object_in_graph(struct repository *repo, struct object *object);
++/*
++ * Given an object of unknown type, try to fill in the object in case it is a
++ * commit part of the commit-graph. Returns 0 if the object is a parsed commit
++ * or if it could be filled in via the commit graph, otherwise it returns -1.
++ */
++int parse_commit_in_graph_gently(struct repository *repo, struct object *object);
+
- #define COMMIT_GRAPH_VERIFY_SHALLOW (1 << 0)
-
- int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags);
+ /*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
## revision.c ##
@@ revision.c: static struct object *get_reference(struct rev_info *revs, const char *name,
@@ revision.c: static struct object *get_reference(struct rev_info *revs, const cha
if (object->type == OBJ_NONE) {
- int type = oid_object_info(revs->repo, oid, NULL);
- if (type < 0 || !object_as_type(object, type, 1)) {
-- object = NULL;
-- goto out;
-+ if (find_object_in_graph(revs->repo, object) < 0) {
++ /*
++ * It's likely that the reference points to a commit, so we
++ * first try to look it up via the commit-graph. If successful,
++ * then we know it's a commit and don't have to unpack the
++ * object header. We still need to assert that the object
++ * exists, but given that we don't request any info about the
++ * object this is a lot faster than `oid_object_info()`.
++ */
++ if (parse_commit_in_graph_gently(revs->repo, object) < 0) {
+ int type = oid_object_info(revs->repo, oid, NULL);
+ if (type < 0 || !object_as_type(object, type, 1)) {
+ object = NULL;
+ goto out;
+ }
++ } else if (!repo_has_object_file(revs->repo, oid)) {
+ object = NULL;
+ goto out;
}
- }
-
--
2.32.0
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2021-08-05 11:25 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-28 5:33 [PATCH v2 0/3] Speed up connectivity checks via bitmaps Patrick Steinhardt
2021-06-28 5:33 ` [PATCH v2 1/3] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
2021-06-28 7:49 ` Ævar Arnfjörð Bjarmason
2021-06-29 6:18 ` Patrick Steinhardt
2021-06-29 12:09 ` Ævar Arnfjörð Bjarmason
2021-06-28 5:33 ` [PATCH v2 2/3] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
2021-06-28 8:00 ` Ævar Arnfjörð Bjarmason
2021-06-28 8:06 ` Ævar Arnfjörð Bjarmason
2021-06-29 6:26 ` Patrick Steinhardt
2021-06-30 1:31 ` Jeff King
2021-06-30 1:35 ` Jeff King
2021-06-30 13:52 ` Patrick Steinhardt
2021-06-28 5:33 ` [PATCH v2 3/3] connected: implement connectivity check using bitmaps Patrick Steinhardt
2021-06-28 20:23 ` Taylor Blau
2021-06-29 22:44 ` Taylor Blau
2021-06-30 2:04 ` Jeff King
2021-06-30 3:07 ` Taylor Blau
2021-06-30 5:45 ` Jeff King
2021-07-02 17:44 ` Taylor Blau
2021-07-02 21:21 ` Jeff King
2021-06-30 1:51 ` Jeff King
2021-07-20 14:26 ` Patrick Steinhardt
2021-08-02 9:37 ` [PATCH v3 0/4] Speed up connectivity checks Patrick Steinhardt
2021-08-02 9:38 ` [PATCH v3 1/4] connected: do not sort input revisions Patrick Steinhardt
2021-08-02 12:49 ` Ævar Arnfjörð Bjarmason
2021-08-03 8:50 ` Patrick Steinhardt
2021-08-04 11:01 ` Ævar Arnfjörð Bjarmason
2021-08-02 19:00 ` Junio C Hamano
2021-08-03 8:55 ` Patrick Steinhardt
2021-08-03 21:47 ` Junio C Hamano
2021-08-02 9:38 ` [PATCH v3 2/4] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-02 12:53 ` Ævar Arnfjörð Bjarmason
2021-08-02 9:38 ` [PATCH v3 3/4] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-02 12:55 ` Ævar Arnfjörð Bjarmason
2021-08-05 10:12 ` Patrick Steinhardt
2021-08-02 19:40 ` Junio C Hamano
2021-08-03 9:07 ` Patrick Steinhardt
2021-08-06 14:17 ` Patrick Steinhardt
2021-08-02 9:38 ` [PATCH v3 4/4] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-02 20:01 ` Junio C Hamano
2021-08-03 9:16 ` Patrick Steinhardt
2021-08-03 21:56 ` Junio C Hamano
2021-08-05 11:01 ` Patrick Steinhardt
2021-08-05 16:16 ` Junio C Hamano
2021-08-04 10:51 ` Ævar Arnfjörð Bjarmason
2021-08-05 11:25 ` Patrick Steinhardt [this message]
2021-08-05 11:25 ` [PATCH v4 1/6] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-05 18:47 ` Junio C Hamano
2021-08-05 11:25 ` [PATCH v4 2/6] connected: do not sort input revisions Patrick Steinhardt
2021-08-05 18:44 ` Junio C Hamano
2021-08-06 6:00 ` Patrick Steinhardt
2021-08-06 16:50 ` Junio C Hamano
2021-08-05 11:25 ` [PATCH v4 3/6] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-05 11:25 ` [PATCH v4 4/6] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-05 11:25 ` [PATCH v4 5/6] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-05 11:25 ` [PATCH v4 6/6] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-09 8:00 ` [PATCH v5 0/5] Speed up connectivity checks Patrick Steinhardt
2021-08-09 8:02 ` Patrick Steinhardt
2021-08-09 8:11 ` Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 1/5] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 2/5] connected: do not sort input revisions Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 3/5] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 4/5] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-09 8:12 ` [PATCH v5 5/5] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cover.1628162156.git.ps@pks.im \
--to=ps@pks.im \
--cc=avarab@gmail.com \
--cc=chris.torek@gmail.com \
--cc=felipe.contreras@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=szeder.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.