All of lore.kernel.org
 help / color / mirror / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"René Scharfe" <l.s.r@web.de>
Subject: [PATCH v2 0/7] Speed up mirror-fetches with many refs
Date: Tue, 24 Aug 2021 12:36:51 +0200	[thread overview]
Message-ID: <cover.1629800774.git.ps@pks.im> (raw)
In-Reply-To: <cover.1629452412.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 13857 bytes --]

Hi,

this is the second version of my patch series to speed up mirror-fetches
with many refs. This topic applies on top of Junio's 9d5700f60b (Merge
branch 'ps/connectivity-optim' into jch, 2021-08-23).

Changes compared to v1:

    - Patch 1/7: I've applied Stolee's proposal to only
      opportunistically load objects via the commit-graph in case the
      reference is not in refs/tags/ such that we don't regress repos
      with many annotated tags.

    - Patch 3/7: The return parameter of the iterator is now const to
      allow further optimizations by the compiler, as suggested by
      René. I've also re-benchmarked this, and one can now see a very
      slight performance improvement of ~1%.

    - Patch 4/7: Added my missing DCO, as pointed out by Junio.

    - Patch 5, 6, 7: I've redone these to make it clearer that the
      refactoring I'm doing doesn't cause us to miss any object
      connectivity checks. Most importantly, I've merged `fetch_refs()`
      and `consume_refs()` into `fetch_and_consume_refs()` in 6/7, which
      makes the optimization where we elide the second connectivity
      check in 7/7 trivial.

Thanks for your feedback!

Patrick

Patrick Steinhardt (7):
  fetch: speed up lookup of want refs via commit-graph
  fetch: avoid unpacking headers in object existence check
  connected: refactor iterator to return next object ID directly
  fetch-pack: optimize loading of refs via commit graph
  fetch: refactor fetch refs to be more extendable
  fetch: merge fetching and consuming refs
  fetch: avoid second connectivity check if we already have all objects

 builtin/clone.c        |  8 ++---
 builtin/fetch.c        | 74 +++++++++++++++++++++++-------------------
 builtin/receive-pack.c | 17 ++++------
 connected.c            | 15 +++++----
 connected.h            |  2 +-
 fetch-pack.c           | 14 +++++---
 6 files changed, 68 insertions(+), 62 deletions(-)

Range-diff against v1:
1:  6872979c45 ! 1:  4a819a6830 fetch: speed up lookup of want refs via commit-graph
    @@ Commit message
         referenced objects.
     
         Speed this up by opportunistcally trying to resolve object IDs via the
    -    commit graph: more likely than not, they're going to be a commit anyway,
    -    and this lets us avoid having to unpack object headers completely in
    -    case the object is a commit that is part of the commit-graph. This
    -    significantly speeds up mirror-fetches in a real-world repository with
    +    commit graph. We only do so for any refs which are not in "refs/tags":
    +    more likely than not, these are going to be a commit anyway, and this
    +    lets us avoid having to unpack object headers completely in case the
    +    object is a commit that is part of the commit-graph. This significantly
    +    speeds up mirror-fetches in a real-world repository with
         2.3M refs:
     
             Benchmark #1: HEAD~: git-fetch
    -          Time (mean ± σ):     56.942 s ±  0.449 s    [User: 53.360 s, System: 5.356 s]
    -          Range (min … max):   56.372 s … 57.533 s    5 runs
    +          Time (mean ± σ):     56.482 s ±  0.384 s    [User: 53.340 s, System: 5.365 s]
    +          Range (min … max):   56.050 s … 57.045 s    5 runs
     
             Benchmark #2: HEAD: git-fetch
    -          Time (mean ± σ):     33.657 s ±  0.167 s    [User: 30.302 s, System: 5.181 s]
    -          Range (min … max):   33.454 s … 33.844 s    5 runs
    +          Time (mean ± σ):     33.727 s ±  0.170 s    [User: 30.252 s, System: 5.194 s]
    +          Range (min … max):   33.452 s … 33.871 s    5 runs
     
             Summary
               'HEAD: git-fetch' ran
    -            1.69 ± 0.02 times faster than 'HEAD~: git-fetch'
    +            1.67 ± 0.01 times faster than 'HEAD~: git-fetch'
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## builtin/fetch.c ##
    +@@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
    + 			      int connectivity_checked, struct ref *ref_map)
    + {
    + 	struct fetch_head fetch_head;
    +-	struct commit *commit;
    + 	int url_len, i, rc = 0;
    + 	struct strbuf note = STRBUF_INIT, err = STRBUF_INIT;
    + 	struct ref_transaction *transaction = NULL;
    +@@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
    + 	     want_status <= FETCH_HEAD_IGNORE;
    + 	     want_status++) {
    + 		for (rm = ref_map; rm; rm = rm->next) {
    ++			struct commit *commit = NULL;
    + 			struct ref *ref = NULL;
    + 
    + 			if (rm->status == REF_STATUS_REJECT_SHALLOW) {
     @@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
      				continue;
      			}
    @@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *
     -								1);
     -			if (!commit)
     -				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
    -+			commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
    ++			/*
    ++			 * References in "refs/tags/" are often going to point
    ++			 * to annotated tags, which are not part of the
    ++			 * commit-graph. We thus only try to look up refs in
    ++			 * the graph which are not in that namespace to not
    ++			 * regress performance in repositories with many
    ++			 * annotated tags.
    ++			 */
    ++			if (!starts_with(rm->name, "refs/tags/"))
    ++				commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
     +			if (!commit) {
     +				commit = lookup_commit_reference_gently(the_repository,
     +									&rm->old_oid,
2:  d3dac607f2 = 2:  81ebadabe8 fetch: avoid unpacking headers in object existence check
3:  3bdad7bc8b ! 3:  98e981ced9 connected: refactor iterator to return next object ID directly
    @@ Commit message
         The object ID iterator used by the connectivity checks returns the next
         object ID via an out-parameter and then uses a return code to indicate
         whether an item was found. This is a bit roundabout: instead of a
    -    separate error code, we can just retrun the next object ID directly and
    +    separate error code, we can just return the next object ID directly and
         use `NULL` pointers as indicator that the iterator got no items left.
         Furthermore, this avoids a copy of the object ID.
     
         Refactor the iterator and all its implementations to return object IDs
    -    directly. While I was honestly hoping for a small speedup given that we
    -    can now avoid a copy, both versions perform the same. Still, the end
    -    result is easier to understand and thus it makes sense to keep this
    -    refactoring regardless.
    +    directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs:
    +
    +        Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
    +          Time (mean ± σ):     30.110 s ±  0.148 s    [User: 27.161 s, System: 5.075 s]
    +          Range (min … max):   29.934 s … 30.406 s    10 runs
    +
    +        Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
    +          Time (mean ± σ):     29.899 s ±  0.109 s    [User: 26.916 s, System: 5.104 s]
    +          Range (min … max):   29.696 s … 29.996 s    10 runs
    +
    +        Summary
    +          '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
    +            1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'
    +
    +    While this 1% speedup could be labelled as statistically insignificant,
    +    the speedup is consistent on my machine. Furthermore, this is an end to
    +    end test, so it is expected that the improvement in the connectivity
    +    check itself is more significant.
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
    @@ builtin/clone.c: static void write_followtags(const struct ref *refs, const char
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
    @@ builtin/fetch.c: static int update_local_ref(struct ref *ref,
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
    @@ builtin/receive-pack.c: static void refuse_unconfigured_deny_delete_current(void
      }
      
     -static int command_singleton_iterator(void *cb_data, struct object_id *oid);
    -+static struct object_id *command_singleton_iterator(void *cb_data);
    ++static const struct object_id *command_singleton_iterator(void *cb_data);
      static int update_shallow_ref(struct command *cmd, struct shallow_info *si)
      {
      	struct shallow_lock shallow_lock = SHALLOW_LOCK_INIT;
    @@ builtin/receive-pack.c: static void check_aliased_updates(struct command *comman
      }
      
     -static int command_singleton_iterator(void *cb_data, struct object_id *oid)
    -+static struct object_id *command_singleton_iterator(void *cb_data)
    ++static const struct object_id *command_singleton_iterator(void *cb_data)
      {
      	struct command **cmd_list = cb_data;
      	struct command *cmd = *cmd_list;
    @@ builtin/receive-pack.c: struct iterate_data {
      };
      
     -static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_receive_command_list(void *cb_data)
    ++static const struct object_id *iterate_receive_command_list(void *cb_data)
      {
      	struct iterate_data *data = cb_data;
      	struct command **cmd_list = &data->cmds;
    @@ connected.c: int check_connected(oid_iterate_fn fn, void *cb_data,
      	FILE *rev_list_in;
      	struct check_connected_options defaults = CHECK_CONNECTED_INIT;
     -	struct object_id oid;
    -+	struct object_id *oid;
    ++	const struct object_id *oid;
      	int err = 0;
      	struct packed_git *new_pack = NULL;
      	struct transport *transport;
    @@ connected.h: struct transport;
       * to signal EOF, otherwise return 0.
       */
     -typedef int (*oid_iterate_fn)(void *, struct object_id *oid);
    -+typedef struct object_id *(*oid_iterate_fn)(void *);
    ++typedef const struct object_id *(*oid_iterate_fn)(void *);
      
      /*
       * Named-arguments struct for check_connected. All arguments are
    @@ fetch-pack.c: static void update_shallow(struct fetch_pack_args *args,
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
4:  67917af7ce ! 4:  6311203f08 fetch-pack: optimize loading of refs via commit graph
    @@ Commit message
         In case this fails, we fall back to the old code which peels the
         objects to a commit.
     
    +    Signed-off-by: Patrick Steinhardt <ps@pks.im>
    +
      ## fetch-pack.c ##
     @@ fetch-pack.c: static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
      {
5:  7653f8eabc ! 5:  56a9158ac3 fetch: refactor fetch refs to be more extendable
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
      static int fetch_refs(struct transport *transport, struct ref *ref_map)
      {
     -	int ret = check_exist_and_connected(ref_map);
    --	if (ret) {
    --		trace2_region_enter("fetch", "fetch_refs", the_repository);
    --		ret = transport_fetch_refs(transport, ref_map);
    --		trace2_region_leave("fetch", "fetch_refs", the_repository);
    --	}
     +	int ret;
     +
     +	/*
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
     +	 * refs.
     +	 */
     +	ret = check_exist_and_connected(ref_map);
    - 	if (!ret)
    + 	if (ret) {
    + 		trace2_region_enter("fetch", "fetch_refs", the_repository);
    + 		ret = transport_fetch_refs(transport, ref_map);
    + 		trace2_region_leave("fetch", "fetch_refs", the_repository);
    ++		if (ret) {
    ++			transport_unlock_pack(transport);
    ++			return ret;
    ++		}
    + 	}
    +-	if (!ret)
     -		/*
     -		 * Keep the new pack's ".keep" file around to allow the caller
     -		 * time to update refs to reference the new objects.
     -		 */
    - 		return 0;
    +-		return 0;
     -	transport_unlock_pack(transport);
     -	return ret;
     +
    -+	trace2_region_enter("fetch", "fetch_refs", the_repository);
    -+	ret = transport_fetch_refs(transport, ref_map);
    -+	trace2_region_leave("fetch", "fetch_refs", the_repository);
    -+	if (ret) {
    -+		transport_unlock_pack(transport);
    -+		return ret;
    -+	}
    -+
     +	/*
     +	 * Keep the new pack's ".keep" file around to allow the caller
     +	 * time to update refs to reference the new objects.
6:  646ac90e62 < -:  ---------- fetch: avoid second connectivity check if we already have all objects
-:  ---------- > 6:  31d9f72edf fetch: merge fetching and consuming refs
-:  ---------- > 7:  84e39c847f fetch: avoid second connectivity check if we already have all objects
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2021-08-24 10:37 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
2021-08-20 10:08 ` [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-08-20 14:27   ` Derrick Stolee
2021-08-20 17:18     ` Junio C Hamano
2021-08-23  6:46       ` Patrick Steinhardt
2021-08-25 14:12         ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 2/6] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-08-25 23:44   ` Ævar Arnfjörð Bjarmason
2021-08-20 10:08 ` [PATCH 3/6] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-08-20 14:32   ` Derrick Stolee
2021-08-20 17:43     ` Junio C Hamano
2021-08-20 17:43   ` René Scharfe
2021-08-23  6:47     ` Patrick Steinhardt
2021-08-20 10:08 ` [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-08-20 14:37   ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 5/6] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-08-20 14:41   ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-08-20 14:47   ` Derrick Stolee
2021-08-23  6:52     ` Patrick Steinhardt
2021-08-20 14:50 ` [PATCH 0/6] Speed up mirror-fetches with many refs Derrick Stolee
2021-08-21  0:09 ` Junio C Hamano
2021-08-24 10:36 ` Patrick Steinhardt [this message]
2021-08-24 10:36   ` [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-08-25 14:16     ` Derrick Stolee
2021-08-24 10:37   ` [PATCH v2 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-08-25 14:19     ` Derrick Stolee
2021-09-01 12:48       ` Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
2021-08-25 14:26     ` Derrick Stolee
2021-09-01 12:49       ` Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-08-24 22:48   ` [PATCH v2 0/7] Speed up mirror-fetches with many refs Junio C Hamano
2021-08-25  6:04     ` Patrick Steinhardt
2021-08-25 14:27   ` Derrick Stolee
2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-09-01 13:10   ` [PATCH v3 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
2021-09-01 13:10   ` [PATCH v3 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-09-01 19:58   ` [PATCH v3 0/7] Speed up mirror-fetches with many refs Junio C Hamano
2021-09-08  0:08     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1629800774.git.ps@pks.im \
    --to=ps@pks.im \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.