git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Lukas Fleischer <lfleischer@lfos.de>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] receive-pack: avoid sending duplicate "have" lines
Date: Sat, 07 Nov 2015 10:04:31 +0100	[thread overview]
Message-ID: <20151107090431.576.3746@typhoon.lan> (raw)
In-Reply-To: <xmqq8u6a3dif.fsf@gitster.mtv.corp.google.com>

On Sat, 07 Nov 2015 at 00:38:00, Junio C Hamano wrote:
> [...]
> I suspect that many of these that turn into ".have"s point the same
> object as those sit at the tip of our refs (e.g. tags in alternates
> we borrow from, which are the folks of the same project that copy
> the same tags from the same upstream).  I wonder if it easy to filter
> them out?  After all, if we say object X sits at refs/tags/v1.0
> there is no point showing ".have" for that same object X.
> [...]

It is certainly doable but we would need a better infrastructure. The
easiest implementation (i.e. the one that is closest to what we
currently do) is adding a second SHA1 array to collect all the hashes we
already advertised. We could then pass a struct of both SHA1 arrays as
second parameter to for_each_ref() and always insert the object
identifier into either of the arrays in show_ref_cb(), depending on
whether we immediately advertise it as a ref or decide convert it into a
"have" line. To transmit "have" lines, instead of calling
sha1_array_for_each_unique(), we would then call some function that
prints unique entries from the first array that do not appear in the
second array. Something like this (fully untested):

-- 8< --
void sha1_array_for_each_diff(struct sha1_array *array1,
			      struct sha1_array *array2,
			      for_each_sha1_fn fn,
			      void *data)
{
	int i, j;

	if (!array1->sorted)
		sha1_array_sort(array1);
	if (!array2->sorted)
		sha1_array_sort(array2);

	for (i = j = 0; i < array1->nr; i++) {
		if (i > 0 && !hashcmp(array1->sha1[i], array1->sha1[i - 1]))
			continue;
		while (j < array2->nr && hashcmp(array1->sha1[i], array2->sha1[j]) > 0)
			j++;
		if (j < array2->nr && !hashcmp(array1->sha1[i], array2->sha1[j]))
			continue;
		fn(array1->sha1[i], data);
	}
}
-- >8 --

If we decide to implement all that, however, I wonder whether we should
use a more sophisticated data structure instead. Currently, if there are
n refs outside the current namespace pointing all to the same object, we
append n entries to the SHA1 array and sort them afterwards which
requires O(n) space. If we would maintain a set of hashes instead of an
array in the first place, we could reduce space usage significantly by
only storing what is needed (O(1) in the scenario I mentioned). We could
also add markers to those object identifiers we already advertised
instead of maintaining two separate lists.

Unfortunately, in order to keep the running time of O(n log n), we would
probably need something like self-balancing binary search trees.
Alternatively, a hash map based approach could be taken. Is something
like this already used anywhere in the Git source tree such that we can
borrow the required data structures and functions?

      reply	other threads:[~2015-11-07  9:04 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-11-06 23:16 [PATCH] receive-pack: avoid sending duplicate "have" lines Lukas Fleischer
2015-11-06 23:38 ` Junio C Hamano
2015-11-07  9:04   ` Lukas Fleischer [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151107090431.576.3746@typhoon.lan \
    --to=lfleischer@lfos.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).