git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] receive-pack: avoid sending duplicate "have" lines
@ 2015-11-06 23:16 Lukas Fleischer
  2015-11-06 23:38 ` Junio C Hamano
  0 siblings, 1 reply; 3+ messages in thread
From: Lukas Fleischer @ 2015-11-06 23:16 UTC (permalink / raw)
  To: git

Alternates and refs outside the current namespace are advertised as
"have" lines. To this end, the object identifiers of alternates are
collected in an array and repeated hashes are omitted before
transmission. In contrast, refs outside the used namespace are currently
converted into "have" lines and transmitted immediately, without
checking for duplicate lines. This means that exactly the same "have"
line might be transmitted several times.

Optimize this by using a single pool to collect all object identifiers
to be converted into "have" lines (including both alternates and refs
outside the namespace) first and transmit them later, omitting any
duplicates.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Lukas Fleischer <lfleischer@lfos.de>
---
This is based on pu. I am not sure whether we should also change the
name of show_one_alternate_sha1() in this patch since it is now used
to transmit refs outside the current namespace as well...

 builtin/receive-pack.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index f06f70a..548d4ce 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -217,23 +217,24 @@ static void show_ref(const char *path, const unsigned char *sha1)
 }
 
 static int show_ref_cb(const char *path_full, const struct object_id *oid,
-		       int flag, void *unused)
+		       int flag, void *data)
 {
 	const char *path = strip_namespace(path_full);
 
 	if (ref_is_hidden(path, path_full))
 		return 0;
 
-	/*
-	 * Advertise refs outside our current namespace as ".have"
-	 * refs, so that the client can use them to minimize data
-	 * transfer but will otherwise ignore them. This happens to
-	 * cover ".have" that are thrown in by add_one_alternate_ref()
-	 * to mark histories that are complete in our alternates as
-	 * well.
-	 */
-	if (!path)
-		path = ".have";
+	if (!path) {
+		/*
+		 * Advertise refs outside our current namespace as ".have"
+		 * refs, so that the client can use them to minimize data
+		 * transfer but will otherwise ignore them.
+		 */
+		struct sha1_array *sa = data;
+		sha1_array_append(sa, oid->hash);
+		return 0;
+	}
+
 	show_ref(path, oid->hash);
 	return 0;
 }
@@ -254,9 +255,9 @@ static void write_head_info(void)
 	struct sha1_array sa = SHA1_ARRAY_INIT;
 
 	for_each_alternate_ref(collect_one_alternate_ref, &sa);
+	for_each_ref(show_ref_cb, &sa);
 	sha1_array_for_each_unique(&sa, show_one_alternate_sha1, NULL);
 	sha1_array_clear(&sa);
-	for_each_ref(show_ref_cb, NULL);
 	if (!sent_capabilities)
 		show_ref("capabilities^{}", null_sha1);
 
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] receive-pack: avoid sending duplicate "have" lines
  2015-11-06 23:16 [PATCH] receive-pack: avoid sending duplicate "have" lines Lukas Fleischer
@ 2015-11-06 23:38 ` Junio C Hamano
  2015-11-07  9:04   ` Lukas Fleischer
  0 siblings, 1 reply; 3+ messages in thread
From: Junio C Hamano @ 2015-11-06 23:38 UTC (permalink / raw)
  To: Lukas Fleischer; +Cc: git

Lukas Fleischer <lfleischer@lfos.de> writes:

> @@ -254,9 +255,9 @@ static void write_head_info(void)
>  	struct sha1_array sa = SHA1_ARRAY_INIT;
>  
>  	for_each_alternate_ref(collect_one_alternate_ref, &sa);
> +	for_each_ref(show_ref_cb, &sa);
>  	sha1_array_for_each_unique(&sa, show_one_alternate_sha1, NULL);

Heh, I didn't realize that we already have half a support for this
deduping.  Good find.

We used to show ".have" from alternates first and then our own, but
now we show the refs that matter and then ".have"s from alternates
and ".have"s for our repository outside the current namespace.  That
shouldn't cause problems and the result would probably make more
sense from aesthetics point of view ;-)

I suspect that many of these that turn into ".have"s point the same
object as those sit at the tip of our refs (e.g. tags in alternates
we borrow from, which are the folks of the same project that copy
the same tags from the same upstream).  I wonder if it easy to filter
them out?  After all, if we say object X sits at refs/tags/v1.0
there is no point showing ".have" for that same object X.

>  	sha1_array_clear(&sa);
> -	for_each_ref(show_ref_cb, NULL);
>  	if (!sent_capabilities)
>  		show_ref("capabilities^{}", null_sha1);

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] receive-pack: avoid sending duplicate "have" lines
  2015-11-06 23:38 ` Junio C Hamano
@ 2015-11-07  9:04   ` Lukas Fleischer
  0 siblings, 0 replies; 3+ messages in thread
From: Lukas Fleischer @ 2015-11-07  9:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sat, 07 Nov 2015 at 00:38:00, Junio C Hamano wrote:
> [...]
> I suspect that many of these that turn into ".have"s point the same
> object as those sit at the tip of our refs (e.g. tags in alternates
> we borrow from, which are the folks of the same project that copy
> the same tags from the same upstream).  I wonder if it easy to filter
> them out?  After all, if we say object X sits at refs/tags/v1.0
> there is no point showing ".have" for that same object X.
> [...]

It is certainly doable but we would need a better infrastructure. The
easiest implementation (i.e. the one that is closest to what we
currently do) is adding a second SHA1 array to collect all the hashes we
already advertised. We could then pass a struct of both SHA1 arrays as
second parameter to for_each_ref() and always insert the object
identifier into either of the arrays in show_ref_cb(), depending on
whether we immediately advertise it as a ref or decide convert it into a
"have" line. To transmit "have" lines, instead of calling
sha1_array_for_each_unique(), we would then call some function that
prints unique entries from the first array that do not appear in the
second array. Something like this (fully untested):

-- 8< --
void sha1_array_for_each_diff(struct sha1_array *array1,
			      struct sha1_array *array2,
			      for_each_sha1_fn fn,
			      void *data)
{
	int i, j;

	if (!array1->sorted)
		sha1_array_sort(array1);
	if (!array2->sorted)
		sha1_array_sort(array2);

	for (i = j = 0; i < array1->nr; i++) {
		if (i > 0 && !hashcmp(array1->sha1[i], array1->sha1[i - 1]))
			continue;
		while (j < array2->nr && hashcmp(array1->sha1[i], array2->sha1[j]) > 0)
			j++;
		if (j < array2->nr && !hashcmp(array1->sha1[i], array2->sha1[j]))
			continue;
		fn(array1->sha1[i], data);
	}
}
-- >8 --

If we decide to implement all that, however, I wonder whether we should
use a more sophisticated data structure instead. Currently, if there are
n refs outside the current namespace pointing all to the same object, we
append n entries to the SHA1 array and sort them afterwards which
requires O(n) space. If we would maintain a set of hashes instead of an
array in the first place, we could reduce space usage significantly by
only storing what is needed (O(1) in the scenario I mentioned). We could
also add markers to those object identifiers we already advertised
instead of maintaining two separate lists.

Unfortunately, in order to keep the running time of O(n log n), we would
probably need something like self-balancing binary search trees.
Alternatively, a hash map based approach could be taken. Is something
like this already used anywhere in the Git source tree such that we can
borrow the required data structures and functions?

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-11-07  9:04 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-06 23:16 [PATCH] receive-pack: avoid sending duplicate "have" lines Lukas Fleischer
2015-11-06 23:38 ` Junio C Hamano
2015-11-07  9:04   ` Lukas Fleischer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).