From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [RFD] annnotating a pair of commit objects?
Date: Thu, 3 Jan 2013 03:14:59 -0500 [thread overview]
Message-ID: <20130103081459.GB32377@sigill.intra.peff.net> (raw)
In-Reply-To: <7vr4m2ycij.fsf@alter.siamese.dyndns.org>
On Wed, Jan 02, 2013 at 11:03:00PM -0800, Junio C Hamano wrote:
> I'd like a datastore that maps a pair of commit object names to
> another object name, such that:
>
> * When looking at two commits A and B, efficiently query all data
> associated with a pair of commits <X,Y> where X is contained in
> the range A..B and not in B..A, and Y is contained in the range
> B..A and not in A..B.
> [...]
> Obviously, with O(cnt(A..B))*O(cnt(B..A)) complexity, this can be
> trivially coded, by trying all pairs in symmetric difference.
>
> But I am hoping we can do better than that.
>
> Ideas?
Just thinking out loud, the problem can be generalized in math terms as:
- you have a universe of elements, `U` (i.e., all commits)
- you have two sets, `X` and `Y`, such that each is a subset of `U`
(these correspond to the two sides of the range, but we can think of
them just as sets of commits). We happen to know that the sets are
disjoint, but I don't know if that is helpful here.
- you have a set of sets `M` that is a subset of the cartesian product
`U x U` (i.e., its elements are "{x,y}" pairs, and membership in
that set is your "bit"; you could also think of it as a mapping if
you wanted more than a bit).
- you want to know the intersection of `X x Y` and `M` (which of your
pairs are in the mapping set).
Without doing any careful analysis, my gut says that in the worst case,
you are going to be stuck with `O(|X|*|Y|)` (i.e., what you are trying
to do better than above). But if we assume that `M` is relatively sparse
(which it should be; you only create entries when you do a merge between
two commits, and even then, only when it is tricky), we can probably do
better in practice.
For example, consider this naive way of doing it. Store `M` as a mapping
of commits to sets of commits, with fast lookup (a hash, or sorted
list). For each element of `X`, look it up in `M`; call its value `V`
(which, remember, is a set itself). For each element of `Y`, look it up
in `V`. The complexity would be:
O(|X| * lg(|M|) * |Y| * lg(V_avg))
where "V_avg" is the average cardinality of each of the value sets we
map in the first step. But imagine we pre-sort `Y`, and then in the
second step, rather than looking up each `Y` in `V`, we instead look up
each `V` in `Y`. Then we have:
O(|X| * lg(|M|) * V_avg * lg(|Y|))
IOW, we get to apply the log to |Y|. This is a win if we expect that
V_avg is going to be much smaller than |Y|. Which I think it would be,
since we would only have entries for merges we've done before[1].
That's just off the top of my head. This seems like it should be a
classic databases problem (since the cartesian product here is really
just a binary relation), but I couldn't find anything obvious online
(and I never paid attention in class, either).
-Peff
[1] You can do the same inversion trick for looking up elements of `M`
in `X` instead of vice versa. It would probably buy you less, as you
have a lot of commits that have merges at all (i.e., `M` is big),
but only a few matching partners for each entry (i.e., `V` is
small).
next prev parent reply other threads:[~2013-01-03 8:15 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-01-03 7:03 [RFD] annnotating a pair of commit objects? Junio C Hamano
2013-01-03 8:14 ` Jeff King [this message]
2013-01-03 9:59 ` Michael Haggerty
2013-04-05 19:36 ` Antoine Pelisse
2013-04-06 7:55 ` Michael Haggerty
2013-01-03 10:23 ` Johannes Sixt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130103081459.GB32377@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).