Fake linear history in a deterministic manner.

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Fake linear history in a deterministic manner.
@ 2006-02-13  1:46 Martin Langhoff
  2006-02-13  5:24 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Martin Langhoff @ 2006-02-13  1:46 UTC (permalink / raw)
  To: Git Mailing List, martyn

To emulate `cvs log somepath` I need to munge history to look linear.
I am working on the theory that I will tell the cvs client about *one*
linear history, and show merges from parallel histories as a merge
commit, "flattened" so to speak, and with a commit message where I'll
list the hash and first line of each commit that it involves.

Now, we are keeping a sqlite db (bad dependency, I know) with a list
of the lies we tell to the clients, so we can at least be consistent
(and fast). The fact that true git clients will be able to commit to
the repo means that actual parallel development will happen in the
repo.

Now, I can't think of an approach to drawing the linearized history
that is deterministic. I can't chose any branch with any confidence,
because the repo always has a very limited view. A client could come
in any time and push onto the repo a series of commits based on an
ancient commit, with ancient dates, and a merge to todays head.

We've been talking about updating the sqlite db with a post update
hook, which means that in that context we never have to choose, the
commits that get to the repo first win because they now drive the
linearized history.

But when creating a new lies database, we have no
"pushed-to-this-repo" timestamp in the commits so we'll have to pick.
At the moment, I suspect I'll pick the one with the earliest
"following" commit.

I thought briefly about delaying the decision until I see the merge,
and pick the leftmost, or rightmost, if there is some bias in
git-merge or cg-merge on putting whatever origin has on a particular
side. It'd mean running backwards through history and that the very
last merge can flip the decision entirely. Hmmm... any strategy I can
come up with means that each new merge throws the dice again entirely.

Ideas?

(As a result of this, the git-cvsserver we're drafting is of limited
usefulness to projects that really do use git to do what it does best:
DSCM. Projects with a mostly linearized history -- using git-rebase
liberally to avoid uninteresting merges -- do get a reasonable cvs
history. Or will get. Sometime. Soon.)

martin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fake linear history in a deterministic manner.
  2006-02-13  1:46 Fake linear history in a deterministic manner Martin Langhoff
@ 2006-02-13  5:24 ` Junio C Hamano
  2006-02-13  5:58   ` Martin Langhoff
  2006-02-16 22:29 ` Eric Wong
  2006-02-16 23:54 ` Junio C Hamano
  2 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2006-02-13  5:24 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git

Martin Langhoff <martin.langhoff@gmail.com> writes:

> I thought briefly about delaying the decision until I see the merge,
> and pick the leftmost, or rightmost, if there is some bias in
> git-merge or cg-merge on putting whatever origin has on a particular
> side. It'd mean running backwards through history and that the very
> last merge can flip the decision entirely. Hmmm... any strategy I can
> come up with means that each new merge throws the dice again entirely.
>
> Ideas?

When somebody pushes a ref to your existing commit ancestry
graph, you can easily identify which commits are the new ones
you see for the first time.

                A
           o---o
          /
     o---o---o---o---o B

Suppose you started from two branch repo A and B.  Your sqlite
database knows about all of these commits, and say you earlier
have decided to treat A as a side branch, B as trunk.

Then somebody pushed a new B, making the ancestry graph like
this:

                A
           o---o-------*---*---* B-new
          /           /     
     o---o---o---o---o B-old

When the update-hook runs, as you read in receive-pack.c, 
your refs have not been updated yet, so you can identify the
commits marked with * with:

	git rev-list B-new $(git rev-parse --not --all)

I think you have to do this enumeration of new commits anyway,
if you are keeping tabs on all commits in your database.  And
you check the chain * commits form, and work backwards.  The
topologically oldest ones determine what branch others belong
to.

With CVS you cannot express a merge very well, so you now face a
choice.  Which parent to drop from the leftmost * commit in the
above picture?

It has commits known to your sqlite database as its parents, and
one of them, B (old), is on trunk; the other one is a side
branch, so in this case the decision is clear.  You leave the
link from B-old and cut link from A.  But this heuristics would
not work very well if both are branches.  Which one to prefer?

One approach would be to see the world with eyes of the person
who did such a merge.  Both git and cogito place the current
branch as the first parent, so you can tell which one the person
who did the merge had before that merge that way.  Also in
practice, in a normal branch, you would build sequence of small
commits and occasionally merge from elsewhere, so the side with
more difference tends to be a side branch from the point of view
of the person who performs the merge.

However, this would not help much.  There is no inherent
distinction between trunk and branch in the distributed
development.  Although I treat my "next" branch as a side branch
and "master" branch as the trunk, "next" sometimes needs to
merge from "master", and by looking at only that merge, you
cannot tell which one is the trunk.  My merging "master" into
"next" does not make my "next" the trunk.

Since the ordering is completely arbitrary, I think you can give
preference to the ones you have found out about earlier.  That
is, suppose you already had this, and for some reason track B is
the trunk and track A is a branch (say commits on B have 1.XX,
the ones on A have 1.XX.1.YY where XX depends on which commits
on B the branch was forked from).  At this point, if you see a
push to C:

                A
           o---o
          /
     o---o---o---o---o B
              \
               *---* C

you give 1.XX.2.YY to commits on C.

This may or may not match reality, and C may collect a lot of
commits while B might stay dormant -- and you may regret that
you picked B as the trunk.  But the thing is, there is no
inherent trunk or branch in the distributed world, so the cvs
clients of your server needs to live with it.  To you, you saw B
first and people who track the project through your server also
saw B first.  Then you can have total ordering to side
branches.  When you see a merge, you arbitrarily but
consistently pick which side to pick.  The one that has smaller
number at the third place (which I am using as "the branch
number").

Just thinking out aloud again...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fake linear history in a deterministic manner.
  2006-02-13  5:24 ` Junio C Hamano
@ 2006-02-13  5:58   ` Martin Langhoff
  2006-02-13 16:34     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Martin Langhoff @ 2006-02-13  5:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 2/13/06, Junio C Hamano <junkio@cox.net> wrote:

Yep, I had roughly this exact scenario in my mind, thanks for the
graphs -- help make sense of what's happening.

> When somebody pushes a ref to your existing commit ancestry
> graph, you can easily identify which commits are the new ones
> you see for the first time.
>
>                 A
>            o---o
>           /
>      o---o---o---o---o B
>
> Suppose you started from two branch repo A and B.  Your sqlite
> database knows about all of these commits, and say you earlier
> have decided to treat A as a side branch, B as trunk.

Well, from the point of view of B being a head we know about, either A
is another head, and we don't care about it, or A is someone's repo
and we haven't seen it and thus cannot care about it either.

A bit of background... git-cvsserver at the moment has the following
semantics: what cvs considers a top-level "module" we use to identify
the head. So

    cvs -d :ext:me@server:/var/frob.git checkout  master

will get you frob.git#master. I thought long and hard about this, and
it is horribly hard to mimic CVS's idea of branches. So clients see a
strictly linear history. Any user in this scenario wanting to do
branching and merging is kindly invited to use real tools. There';s
only so much magic Perl can do ;-)

> Then somebody pushed a new B, making the ancestry graph like
> this:
>
>                 A
>            o---o-------*---*---* B-new
>           /           /
>      o---o---o---o---o B-old
>
> When the update-hook runs, as you read in receive-pack.c,
> your refs have not been updated yet, so you can identify the
> commits marked with * with:
>
>         git rev-list B-new $(git rev-parse --not --all)

That's a nifty trick, though I'm not sure I'll be able to use it.

Right now I'm doing something stupid-er just using `git-log
--parents`, skipping commits that don't move the ball forward on the
linear head I know about, and then processing those "other tracks"
when I see a merge commit. I'll probably find the merge base and see
what commits were brought in from the other side.

The problem is that the situation running from post update hook is
very different from the scenario of running the very same script on a
repo where you see all the history.

In any case, I'm undecided whether to use --topo-order or
--merge-order. Does it really matter?

(...)
> With CVS you cannot express a merge very well, so you now face a
> choice.  Which parent to drop from the leftmost * commit in the
> above picture?

Well, we've already lied about having B to clients that won't know
what to do if we tell them about parallel histories, so our pick is B.

(...)
> One approach would be to see the world with eyes of the person
> who did such a merge.  Both git and cogito place the current
> branch as the first parent,

Yes, I thought about that, but that order is ambiguous in the two most
interesting cases:

 - project maintainer pulls from mature feature branches from other
developers - her side is first, show the "pulled" stuff as merges
(flattened with a merge summary in the commit msg). Still, you can
argue the feature development is more interesting.

 - team-shared-git-repo user does cg-update and merges updates from
origin - her side is first, we don't know which side is the
interesting one. At all. Hmmm.

> But the thing is, there is no
> inherent trunk or branch in the distributed world, so the cvs
> clients of your server needs to live with it.

That's the mantra so far, and we'll talk to cvs clients about
perfectly linearized history. Anything else won't be useful as far as
I can tell -- or in any case,until this is going well for basic usage.
If someone's crazy enough to try I won't get in the way.

cheers.

martin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fake linear history in a deterministic manner.
  2006-02-13  5:58   ` Martin Langhoff
@ 2006-02-13 16:34     ` Linus Torvalds
  0 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2006-02-13 16:34 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Junio C Hamano, git

On Mon, 13 Feb 2006, Martin Langhoff wrote:
> 
> In any case, I'm undecided whether to use --topo-order or
> --merge-order. Does it really matter?

Use topo-order.

merge-order is much more expensive to calculate, and doesn't even exist if 
NO_OPENSSL is set.

Anyway, while linearization in general is impossible, I'd suggest a few 
heuristics:

 - you obviously must remember the head of the previous linearization. Any 
   tree choices you made in the past are not something you can change any 
   more for cvs export.

   This may sound obvious, but the fact is, especially if you do the cvs 
   export frequently, and the main maintainer updates the tree in a timely 
   manner, it will limit your choices a LOT. In fact, most of the time 
   you'll have no choice at all: you will have an unambiguous "TRUNK" that 
   is defined by the fact that there is only one linear path from the 
   previous export-head to the current HEAD.

In short, most of the time you won't have any issues in a stable system. 
You'll see a true "fork in the road" choice only when the maintainer 
hasn't pushed his tree to the thing you export in a while, long enough 
that no CVS exporting has taken place over a whole parallel cycle. It's 
probably rare.

So the main issue of "which child to choose" becomes one of the initial 
export, and then just occasionally thereafter. And in that case, there's 
really just one obvious algorithm:

 - simply enumerate the possible paths (not that hard - it's just 
   enumerating a tree once you've generated the graph in memory with child 
   and parent pointers), and just selecting the one with the "longest 
   weighted path".

I say "weighted path", because you might want to consider different 
commits to have different weights. For example, you might want to consider 
merge commits to be less interesting (so weight them as 0.5 commits) in 
order to get as many "real" commits as possible. And you might consider 
commits made by a certain person to be "more TRUNKish", and give them a 
weight of 2, so that you'll be more likely to follow the "maintainer view" 
over any other linearized history.

No other algorithm seems very sane. 

		Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fake linear history in a deterministic manner.
  2006-02-13  1:46 Fake linear history in a deterministic manner Martin Langhoff
  2006-02-13  5:24 ` Junio C Hamano
@ 2006-02-16 22:29 ` Eric Wong
  2006-02-16 23:29   ` Martin Langhoff
  2006-02-16 23:54 ` Junio C Hamano
  2 siblings, 1 reply; 7+ messages in thread
From: Eric Wong @ 2006-02-16 22:29 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List, Paul Mackerras

Martin Langhoff <martin.langhoff@gmail.com> wrote:
> To emulate `cvs log somepath` I need to munge history to look linear.
> I am working on the theory that I will tell the cvs client about *one*
> linear history, and show merges from parallel histories as a merge
> commit, "flattened" so to speak, and with a commit message where I'll
> list the hash and first line of each commit that it involves.

I'd be interested in exporting from git to SVN with something like this.

> I thought briefly about delaying the decision until I see the merge,
> and pick the leftmost, or rightmost, if there is some bias in
> git-merge or cg-merge on putting whatever origin has on a particular
> side. It'd mean running backwards through history and that the very
> last merge can flip the decision entirely. Hmmm... any strategy I can
> come up with means that each new merge throws the dice again entirely.
> 
> Ideas?

I'd actually like to do this interactively in gitk.  Just browse history
visually and pick the path you want to choose each time there's a merge,
and then having it output the revisions to stdout or saved to a file
after you're done picking.  Ideally you'd be able to use saved output
interactively, as well.

-- 
Eric Wong

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fake linear history in a deterministic manner.
  2006-02-16 22:29 ` Eric Wong
@ 2006-02-16 23:29   ` Martin Langhoff
  0 siblings, 0 replies; 7+ messages in thread
From: Martin Langhoff @ 2006-02-16 23:29 UTC (permalink / raw)
  To: Eric Wong; +Cc: Git Mailing List, Paul Mackerras

On 2/17/06, Eric Wong <normalperson@yhbt.net> wrote:
> Martin Langhoff <martin.langhoff@gmail.com> wrote:
> > To emulate `cvs log somepath` I need to munge history to look linear.
> > I am working on the theory that I will tell the cvs client about *one*
> > linear history, and show merges from parallel histories as a merge
> > commit, "flattened" so to speak, and with a commit message where I'll
> > list the hash and first line of each commit that it involves.
>
> I'd be interested in exporting from git to SVN with something like this.

We're hoping to release the code soon, but the truth is that it's
really trivial. It was more agonizing over the fact that there's no
"good" (aka "stable") algorithm for this.

> > I thought briefly about delaying the decision until I see the merge,
> > and pick the leftmost, or rightmost, if there is some bias in
> > git-merge or cg-merge on putting whatever origin has on a particular
> > side. It'd mean running backwards through history and that the very
> > last merge can flip the decision entirely. Hmmm... any strategy I can
> > come up with means that each new merge throws the dice again entirely.
> >
> > Ideas?
>
> I'd actually like to do this interactively in gitk.  Just browse history
> visually and pick the path you want to choose each time there's a merge,
> and then having it output the revisions to stdout or saved to a file
> after you're done picking.  Ideally you'd be able to use saved output
> interactively, as well.

It's cool to be able to pick, but if it's for a git-svnserver
implementation, you can't change your (fake) history you tell after
clients have seen. So a merge that gets pushed to the repo later may
contain more interesting paths, but you're bound to the lies you've
told.

cheers,


martin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Fake linear history in a deterministic manner.
  2006-02-13  1:46 Fake linear history in a deterministic manner Martin Langhoff
  2006-02-13  5:24 ` Junio C Hamano
  2006-02-16 22:29 ` Eric Wong
@ 2006-02-16 23:54 ` Junio C Hamano
  2 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2006-02-16 23:54 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git

Martin, (and Martyn),

I just cloned the gitcvs via git locally, exported it via
gitcvs, and ran "cvs log" and "cvs annotate README" there.  

Good job.  I really had a good laugh looking at the generated
output.  It just looks like .... CVS! ;-)

I haven't read the code in detail (I tend to start nitpicking
the details without looking at a bigger picture, once I start
reading the code, so I try to only give superficial look until I
know I'll have tons of time to spend on it).

But it looks like a fun project.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2006-02-16 23:54 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-02-13  1:46 Fake linear history in a deterministic manner Martin Langhoff
2006-02-13  5:24 ` Junio C Hamano
2006-02-13  5:58   ` Martin Langhoff
2006-02-13 16:34     ` Linus Torvalds
2006-02-16 22:29 ` Eric Wong
2006-02-16 23:29   ` Martin Langhoff
2006-02-16 23:54 ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).