git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFD] annnotating a pair of commit objects?
@ 2013-01-03  7:03 Junio C Hamano
  2013-01-03  8:14 ` Jeff King
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Junio C Hamano @ 2013-01-03  7:03 UTC (permalink / raw)
  To: git

I'd like a datastore that maps a pair of commit object names to
another object name, such that:

 * When looking at two commits A and B, efficiently query all data
   associated with a pair of commits <X,Y> where X is contained in
   the range A..B and not in B..A, and Y is contained in the range
   B..A and not in A..B.

 * When <X,Y> is registered in the datastore, and X is rewritten to
   X' and/or Y is rewritten to Y', the mapping is updated so that it
   can be queried with <X',Y'> as a new key, similar to the way a
   notes tree that maps object X can be updated to map object X'
   when such a rewrite happens.

The intended use case is to "go beyond rerere".  Given a history of
this shape:

    o---o---o---I      mainline
   / 
  O---o---X---o---A    topic A
   \
    o---Y---o---o---B  topic B

Suppose in the original O we had a function "distimmed_doshes()" to
tell if doshes are already distimmed, with some call sites.  On the
branch leading to A, at commit X, this function was renamed to
"doshes_are_distimmed()", and all existing call sites were adjusted.
On the side branch leading to B, however, at commit Y, a new call
site to the old function was added in a file that was not touched
between O..A at all.

When merging either the topic A or the topic B (but not both) to the
integration branch that did not touch this function or use of it, no
special care needs to be taken, but when merging the second topic
after merging the other one, we need to resolve a semantic conflict.
Namely, the callsite to "distimmed_doshes()" introduced by commit Y
needs to be adjusted to call "doshes_are_distimmed()" instead.

The first step is to recognize the potential issue.  When queuing
the topic that contains X and the other topic that contains Y,
suppose I could register <X,Y> to the datastore I am dreaming.  When
I merge A to the integration branch, I can notice that there is no
such pair <M,N> in the datastore that:

 * M is in A..I and not in I..A
 * N is in I..A and not in A..I

and can create a merge J without semantic adjustment.

    o---o---o---I---J      mainline
   /               /  
  O---o---X---o---A        topic A
   \
    o---Y---o---o---B      topic B

When I later merge topic B to the integration branch, however, there
is <X,Y> in the datastore such that:

 * X is in B..J and not in J..B
 * Y is in J..B and not in B..J

to notice that we need to be careful when creating the merge K:

    o---o---o---I---J---K  mainline
   /               /   /
  O---o---X---o---A   /    topic A
   \                 /
    o---Y---o---o---B      topic B

Of course, the next step is to store not just one bit "<X,Y> exists
in the datastore--be careful", but what semantic adjustment needs to
be applied [*1*]

Obviously, with O(cnt(A..B))*O(cnt(B..A)) complexity, this can be
trivially coded, by trying all pairs in symmetric difference.

But I am hoping we can do better than that.

Ideas?


[Footnote]

*1* We could do it in multiple ways and the details are not
interesting. A blob object that records a patch that can be applied
with "git apply -3", or a commit object that represents necessary
"evil" change that can be cherry-pick'ed, would be two possible
implementations.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFD] annnotating a pair of commit objects?
  2013-01-03  7:03 [RFD] annnotating a pair of commit objects? Junio C Hamano
@ 2013-01-03  8:14 ` Jeff King
  2013-01-03  9:59 ` Michael Haggerty
  2013-01-03 10:23 ` Johannes Sixt
  2 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2013-01-03  8:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Jan 02, 2013 at 11:03:00PM -0800, Junio C Hamano wrote:

> I'd like a datastore that maps a pair of commit object names to
> another object name, such that:
> 
>  * When looking at two commits A and B, efficiently query all data
>    associated with a pair of commits <X,Y> where X is contained in
>    the range A..B and not in B..A, and Y is contained in the range
>    B..A and not in A..B.
> [...]
> Obviously, with O(cnt(A..B))*O(cnt(B..A)) complexity, this can be
> trivially coded, by trying all pairs in symmetric difference.
> 
> But I am hoping we can do better than that.
> 
> Ideas?

Just thinking out loud, the problem can be generalized in math terms as:

  - you have a universe of elements, `U` (i.e., all commits)

  - you have two sets, `X` and `Y`, such that each is a subset of `U`
    (these correspond to the two sides of the range, but we can think of
    them just as sets of commits). We happen to know that the sets are
    disjoint, but I don't know if that is helpful here.

  - you have a set of sets `M` that is a subset of the cartesian product
    `U x U` (i.e., its elements are "{x,y}" pairs, and membership in
    that set is your "bit"; you could also think of it as a mapping if
    you wanted more than a bit).

  - you want to know the intersection of `X x Y` and `M` (which of your
    pairs are in the mapping set).

Without doing any careful analysis, my gut says that in the worst case,
you are going to be stuck with `O(|X|*|Y|)` (i.e., what you are trying
to do better than above). But if we assume that `M` is relatively sparse
(which it should be; you only create entries when you do a merge between
two commits, and even then, only when it is tricky), we can probably do
better in practice.

For example, consider this naive way of doing it. Store `M` as a mapping
of commits to sets of commits, with fast lookup (a hash, or sorted
list).  For each element of `X`, look it up in `M`; call its value `V`
(which, remember, is a set itself).  For each element of `Y`, look it up
in `V`. The complexity would be:

  O(|X| * lg(|M|) * |Y| * lg(V_avg))

where "V_avg" is the average cardinality of each of the value sets we
map in the first step. But imagine we pre-sort `Y`, and then in the
second step, rather than looking up each `Y` in `V`, we instead look up
each `V` in `Y`. Then we have:

  O(|X| * lg(|M|) * V_avg * lg(|Y|))

IOW, we get to apply the log to |Y|. This is a win if we expect that
V_avg is going to be much smaller than |Y|. Which I think it would be,
since we would only have entries for merges we've done before[1].

That's just off the top of my head. This seems like it should be a
classic databases problem (since the cartesian product here is really
just a binary relation), but I couldn't find anything obvious online
(and I never paid attention in class, either).

-Peff

[1] You can do the same inversion trick for looking up elements of `M`
    in `X` instead of vice versa. It would probably buy you less, as you
    have a lot of commits that have merges at all (i.e., `M` is big),
    but only a few matching partners for each entry (i.e., `V` is
    small).

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFD] annnotating a pair of commit objects?
  2013-01-03  7:03 [RFD] annnotating a pair of commit objects? Junio C Hamano
  2013-01-03  8:14 ` Jeff King
@ 2013-01-03  9:59 ` Michael Haggerty
  2013-04-05 19:36   ` Antoine Pelisse
  2013-01-03 10:23 ` Johannes Sixt
  2 siblings, 1 reply; 6+ messages in thread
From: Michael Haggerty @ 2013-01-03  9:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King

On 01/03/2013 08:03 AM, Junio C Hamano wrote:
> I'd like a datastore that maps a pair of commit object names to
> another object name, such that:
> 
>  * When looking at two commits A and B, efficiently query all data
>    associated with a pair of commits <X,Y> where X is contained in
>    the range A..B and not in B..A, and Y is contained in the range
>    B..A and not in A..B.
> 
>  * When <X,Y> is registered in the datastore, and X is rewritten to
>    X' and/or Y is rewritten to Y', the mapping is updated so that it
>    can be queried with <X',Y'> as a new key, similar to the way a
>    notes tree that maps object X can be updated to map object X'
>    when such a rewrite happens.
> 
> The intended use case is to "go beyond rerere".  Given a history of
> this shape:
> 
>     o---o---o---I      mainline
>    / 
>   O---o---X---o---A    topic A
>    \
>     o---Y---o---o---B  topic B

If we ignore rewriting for a moment, the information that you want to
record is essentially the merge M of X and Y, no?  Namely, X and Y
conflict logically with each other (though perhaps not textually) and
you, the human, want to record how to reconcile them:

     o---o---o---I      mainline
    /
   O---o---X---o---A    topic A
    \       \
     \       M
      \     /
       o---Y---o---o---B  topic B

However, you don't necessarily want to go to the trouble to make a
branch to point at M, nor to do the bookkeeping manually that would be
required to take the information stored in M into account when merging A
and B later.

Suppose we had M; how could we make use of it in future merges?

> [...] and can create a merge J without semantic adjustment.
> 
>     o---o---o---I---J      mainline
>    /               /  
>   O---o---X---o---A        topic A
>    \
>     o---Y---o---o---B      topic B

That would become

     o---o---o---I---J      mainline
    /               /
   O---o---X---o---A        topic A
    \       \
     \       M
      \     /
       o---Y---o---o---B  topic B


> When I later merge topic B to the integration branch, however, [...]
> to notice that we need to be careful when creating the merge K:
> 
>     o---o---o---I---J---K  mainline
>    /               /   /
>   O---o---X---o---A   /    topic A
>    \                 /
>     o---Y---o---o---B      topic B

When doing this merge, I think your goal is equivalent to discovering
that M includes part of the merge of J and B, and adding M as an
(implicit or explicit) third parent to the merge:

     o---o---o---I---J-------K  mainline
    /               /    .  /
   O---o---X---o---A    .  /    topic A
    \       \          .  /
     \       M.........  /
      \     /           /
       o---Y---o---o---B        topic B

How could M be stored?  Assuming that these type of premerge merges are
sparse, then Jeff's analysis seems good.  Concretely, one could simply
store pointers to M from both X and Y; e.g.,

* Add a note to X with the information "when merging this commit with Y,
use premerge M"

* Add a note to Y with the information "when merging this commit with X,
use premerge M"

Then, when merging, create the set J..B, scan all of the commits on B..J
for these "premerge" notes (O(|B..J|)), and for each one, look in the
set J..B to see if it is present.  The effort should scale like

    O( |J..B| + |B..J| * lg(|J..B|) )

where, of course J and B could be exchanged for either aesthetic or
performance reasons.  (One would also need a mechanism for preventing M
from being garbage-collected.)

Incidentally, this is just the sort of thing I have been thinking about
using to implement a kind of "incremental merge"; I've started writing
up my thoughts on my blog [1,2,3] (including how to make pretty pictures
of merge conflicts).

Michael

[1]
http://softwareswirl.blogspot.de/2012/12/the-conflict-frontier-of-nightmare-merge.html
[2]
http://softwareswirl.blogspot.de/2012/12/mapping-merge-conflict-frontier.html
[3]
http://softwareswirl.blogspot.de/2012/12/real-world-conflict-diagrams.html

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFD] annnotating a pair of commit objects?
  2013-01-03  7:03 [RFD] annnotating a pair of commit objects? Junio C Hamano
  2013-01-03  8:14 ` Jeff King
  2013-01-03  9:59 ` Michael Haggerty
@ 2013-01-03 10:23 ` Johannes Sixt
  2 siblings, 0 replies; 6+ messages in thread
From: Johannes Sixt @ 2013-01-03 10:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Am 03.01.2013 08:03, schrieb Junio C Hamano:
> The intended use case is to "go beyond rerere".  Given a history of
> this shape:
> 
>     o---o---o---I      mainline
>    / 
>   O---o---X---o---A    topic A
>    \
>     o---Y---o---o---B  topic B
> 
> Suppose in the original O we had a function "distimmed_doshes()" to
> tell if doshes are already distimmed, with some call sites.  On the
> branch leading to A, at commit X, this function was renamed to
> "doshes_are_distimmed()", and all existing call sites were adjusted.
> On the side branch leading to B, however, at commit Y, a new call
> site to the old function was added in a file that was not touched
> between O..A at all.
> 
> When merging either the topic A or the topic B (but not both) to the
> integration branch that did not touch this function or use of it, no
> special care needs to be taken, but when merging the second topic
> after merging the other one, we need to resolve a semantic conflict.
> Namely, the callsite to "distimmed_doshes()" introduced by commit Y
> needs to be adjusted to call "doshes_are_distimmed()" instead.

I guess this issue comes up when you rebuild pu. Perhaps you (and other
integrators with a similar workflow) are sufficiently served with
something that resembles rebase -p --first-parent, as proposed here:

http://thread.gmane.org/gmane.comp.version-control.git/198125/focus=198483

(A proposal of the same idea appeared already years earlier:

http://thread.gmane.org/gmane.comp.version-control.git/62782/focus=62794

but its implementation only re-did the merge, which would not help your
case.)

-- Hannes

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFD] annnotating a pair of commit objects?
  2013-01-03  9:59 ` Michael Haggerty
@ 2013-04-05 19:36   ` Antoine Pelisse
  2013-04-06  7:55     ` Michael Haggerty
  0 siblings, 1 reply; 6+ messages in thread
From: Antoine Pelisse @ 2013-04-05 19:36 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, git, Jeff King, Johannes Sixt

On Thu, Jan 3, 2013 at 10:59 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 01/03/2013 08:03 AM, Junio C Hamano wrote:
>> I'd like a datastore that maps a pair of commit object names to
>> another object name, such that:
>>
>>  * When looking at two commits A and B, efficiently query all data
>>    associated with a pair of commits <X,Y> where X is contained in
>>    the range A..B and not in B..A, and Y is contained in the range
>>    B..A and not in A..B.
>>
>>  * When <X,Y> is registered in the datastore, and X is rewritten to
>>    X' and/or Y is rewritten to Y', the mapping is updated so that it
>>    can be queried with <X',Y'> as a new key, similar to the way a
>>    notes tree that maps object X can be updated to map object X'
>>    when such a rewrite happens.
>>
>> The intended use case is to "go beyond rerere".  Given a history of
>> this shape:
>>
>>     o---o---o---I      mainline
>>    /
>>   O---o---X---o---A    topic A
>>    \
>>     o---Y---o---o---B  topic B

I would indeed also be interested in such a feature for my day-to-day
work where we use a workflow similar to git.git's.

> When doing this merge, I think your goal is equivalent to discovering
> that M includes part of the merge of J and B, and adding M as an
> (implicit or explicit) third parent to the merge:
>
>      o---o---o---I---J-------K  mainline
>     /               /    .  /
>    O---o---X---o---A    .  /    topic A
>     \       \          .  /
>      \       M.........  /
>       \     /           /
>        o---Y---o---o---B        topic B
>
> How could M be stored?  Assuming that these type of premerge merges are
> sparse, then Jeff's analysis seems good.  Concretely, one could simply
> store pointers to M from both X and Y; e.g.,
>
> * Add a note to X with the information "when merging this commit with Y,
> use premerge M"
>
> * Add a note to Y with the information "when merging this commit with X,
> use premerge M"
>
> Then, when merging, create the set J..B, scan all of the commits on B..J
> for these "premerge" notes (O(|B..J|)), and for each one, look in the
> set J..B to see if it is present.  The effort should scale like
>
>     O( |J..B| + |B..J| * lg(|J..B|) )
>
> where, of course J and B could be exchanged for either aesthetic or
> performance reasons.  (One would also need a mechanism for preventing M
> from being garbage-collected.)

I like the idea of using notes and a kind of "pre-merge". The proposal
seems decent to me.

Michael, have you started implementing such a thing ? If you did, I
would love to help as much as I can.
If you didn't, I would like to start implementing this feature (I
think I now have some time to do so).
Maybe that would require some kind of mentoring though. It could be a
nice opportunity for the community to improve that too as a fake
"gsoc" (no google, no summer, no student)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFD] annnotating a pair of commit objects?
  2013-04-05 19:36   ` Antoine Pelisse
@ 2013-04-06  7:55     ` Michael Haggerty
  0 siblings, 0 replies; 6+ messages in thread
From: Michael Haggerty @ 2013-04-06  7:55 UTC (permalink / raw)
  To: Antoine Pelisse; +Cc: Junio C Hamano, git, Jeff King, Johannes Sixt

On 04/05/2013 09:36 PM, Antoine Pelisse wrote:
> On Thu, Jan 3, 2013 at 10:59 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> How could M be stored?  Assuming that these type of premerge merges are
>> sparse, then Jeff's analysis seems good.  Concretely, one could simply
>> store pointers to M from both X and Y; e.g.,
>>
>> * Add a note to X with the information "when merging this commit with Y,
>> use premerge M"
>>
>> * Add a note to Y with the information "when merging this commit with X,
>> use premerge M"
>>
>> Then, when merging, create the set J..B, scan all of the commits on B..J
>> for these "premerge" notes (O(|B..J|)), and for each one, look in the
>> set J..B to see if it is present.  The effort should scale like
>>
>>     O( |J..B| + |B..J| * lg(|J..B|) )
>>
>> where, of course J and B could be exchanged for either aesthetic or
>> performance reasons.  (One would also need a mechanism for preventing M
>> from being garbage-collected.)
> 
> I like the idea of using notes and a kind of "pre-merge". The proposal
> seems decent to me.
> 
> Michael, have you started implementing such a thing ? If you did, I
> would love to help as much as I can.
> If you didn't, I would like to start implementing this feature (I
> think I now have some time to do so).
> Maybe that would require some kind of mentoring though. It could be a
> nice opportunity for the community to improve that too as a fake
> "gsoc" (no google, no summer, no student)

No, I haven't pursued this topic per se.  We don't use such a workflow
on my projects, so it isn't a particular itch of mine.  I am currently
more interested in approaches to merging branches that have diverged
"too far" from each other [1].

There will be some overlap with this idea and my "git-mergemate" project
[2], if I ever find the time to make progress on it.  For that project,
I will also need to store lots of intermediate merge commits, in fact
also merges between parts of two branches.  I had planned to
autogenerate branch names by simply sticking the SHA1s of the parents
together, like maybe

    refs/mergemate/NAME/SHA1-SHA1

or

    refs/mergemate/NAME/SHA1/SHA1

where NAME is the overall name of a merge that is in progress.  These
references would be cleaned up when the merge was complete but would
prevent garbage collection of the intermediate results until then.

I would be happy to collaborate with you but it might not turn out so
well because my time for open-sourcing is so limited and unpredictable.

Michael

[1]
http://softwareswirl.blogspot.de/2012/12/the-conflict-frontier-of-nightmare-merge.html

[2] https://github.com/mhagger/git-mergemate  (yes I know the name sucks)

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-04-06 16:59 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-03  7:03 [RFD] annnotating a pair of commit objects? Junio C Hamano
2013-01-03  8:14 ` Jeff King
2013-01-03  9:59 ` Michael Haggerty
2013-04-05 19:36   ` Antoine Pelisse
2013-04-06  7:55     ` Michael Haggerty
2013-01-03 10:23 ` Johannes Sixt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).