From: Jeff King <peff@peff.net>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: Junio C Hamano <gitster@pobox.com>,
Martin Fick <mfick@codeaurora.org>,
git@vger.kernel.org
Subject: Re: [PATCH] Avoid sorting if references are added to ref_cache in order
Date: Thu, 24 May 2012 17:10:48 -0400 [thread overview]
Message-ID: <20120524211047.GC21329@sigill.intra.peff.net> (raw)
In-Reply-To: <4FBEA16D.4040204@alum.mit.edu>
On Thu, May 24, 2012 at 11:00:29PM +0200, Michael Haggerty wrote:
> >Technically we would still be sorted if strcmp(...) == 0. But I guess it
> >probably doesn't matter, as we shouldn't ever be adding duplicates here.
>
> Yes, duplicate refs should be an exceptional case and needn't be
> handled efficiently. sort_ref_dir() explicitly accepts the
> possibility of duplicate references, de-dups them if they are
> consistent with each other, or dies if they are inconsistent. We
> shouldn't add a way to bypass that logic. We could implement the
> duplicate-detection-and-checking code again in add_entry_to_dir(),
> but my choice was to leave it to sort_ref_dir() to deal with
> duplicates.
Ah, I didn't realize that there was duplicate-detection magic was in
sort_ref_dir. So comparing to 0 is not even "this doesn't matter, but we
could..." but "doing it that way will actively break other code". I
withdraw my suggestion. :)
> More general approaches:
> [...]
> Something that would help read_packed_refs() would be to keep track
> of the directory that it is currently working in. This would
> effectively remove the need to access a directory while working in
> one of its subdirectories, thereby avoiding any need to repeatedly
> sort intermediate directories. It would also avoid having to
> traverse the directories starting at the root for each new entry,
> which itself would save time independent of the time for sorting. I
> have some patches that implement this change but they are stale. I
> want to do some benchmarking first though to see whether the extra
> complication brings measurable benefits.
This is specifically what I was thinking of when I wrote my previous
message. But I rejected it as too complicated to be worth the trouble,
if the only code path helped would be mass-inserting unsorted refs,
which we don't even currently do.
However, I didn't think about the fact that we could avoid traversing
the refs tree entirely for each ref. That might actually yield a
measurable speedup. I'd be curious to see your results.
> Finally, I did some work on the idea of keeping the refs within a
> directory *mostly* sorted. (Probably this technique is already
> known, but I have never run into it.) One would keep the first part
> of the array sorted, and append new references to the tail of the
> array unsorted. Searching would be via a binary search over the
> sorted part of the array, and a linear search over the unsorted tail.
> The trick is that every so often the tail should be sorted and merged
> into the head. How often? It is a tradeoff between the work of
> sorting and merging versus the work saved by avoiding linear searches
> through the tail. I worked out the theory, and I think the optimum
> was to re-sort the array when the size of the unsorted tail reached
> the squareroot of the total size or something like that. This method
> could reduce the cost of (lookup, add, lookup, add, ...) sequences,
> albeit not to the extent of a more optimal data structure.
That's very clever. In fact, way more clever than this problem deserves.
It is not like we are journaling a database; we basically have one
mass-insertion when we read packed-refs, and I think the simpler
solutions will perform just fine.
-Peff
prev parent reply other threads:[~2012-05-24 21:10 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-24 12:16 [PATCH] Avoid sorting if references are added to ref_cache in order mhagger
2012-05-24 17:49 ` Jeff King
2012-05-24 21:00 ` Michael Haggerty
2012-05-24 21:10 ` Jeff King [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120524211047.GC21329@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=mfick@codeaurora.org \
--cc=mhagger@alum.mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).