Consistency question

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Consistency question
@ 2014-01-15 10:37 David Kastrup
  2014-01-15 11:13 ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: David Kastrup @ 2014-01-15 10:37 UTC (permalink / raw)
  To: git

Hi,

I am in the process of rewriting the core logic of git blame (the
current speed of which is quite an impediment to some workflows).
I currently have one question I don't see an answer to right away, and
that question arises in doing a reasonably robust traversal of commits
without determining topology first:

The question is what guarantees I have with regard to the commit date of
a commit in relation to that of its parent commits:

a) none
b) commitdate(child) >= commitdate(parent)
c) commitdate(child) > commitdate(parent)

Obviously, I can rely on c) being true "almost always": it's definitely
good for a heuristic used for improving performance (meaning as an
ordering criterion for a commit priority queue).  The problem is how
much I should cater for graceful behavior for the cases where it's not.

Does git do any actual checks before pushing?

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Consistency question
  2014-01-15 10:37 Consistency question David Kastrup
@ 2014-01-15 11:13 ` Jeff King
  2014-01-15 11:40   ` David Kastrup
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jeff King @ 2014-01-15 11:13 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Wed, Jan 15, 2014 at 11:37:08AM +0100, David Kastrup wrote:

> The question is what guarantees I have with regard to the commit date of
> a commit in relation to that of its parent commits:
> 
> a) none
> b) commitdate(child) >= commitdate(parent)
> c) commitdate(child) > commitdate(parent)

a) none

> Obviously, I can rely on c) being true "almost always":

Actually, b) is quite often the case in automated processes (e.g., "git
am" or "git rebase"). The author dates are different, but the committer
dates may be in the same second.

And of course a) is the result of clock skew and software bugs.

> it's definitely
> good for a heuristic used for improving performance (meaning as an
> ordering criterion for a commit priority queue).  The problem is how
> much I should cater for graceful behavior for the cases where it's not.

Yes, this is exactly how git uses it. We generally walk breadth-first
through the graph, relying on commit times for performance but not
correctness.

There are some parts of the code that will behave badly with clock skew.
For example, "--since" will stop traversing when we hit a certain point.
It requires a fixed number of "too old" commits before quitting, though,
in an attempt to bypass small runs of skewed clocks.

The "git describe --contains" algorithm will also produce wrong results
in the face of skew. I believe it uses a "slop" of 24 hours in its skew.

The current "tag --contains" algorithm is currently correct in the face
of skew, but it can go much faster if you accept that skew will cause it
to go wrong. I suspect there are other algorithms that could be sped up,
too, if we had trustworthy generation numbers (I implemented and timed
the "--contains" algorithm, but haven't done so for other algorithms).

I've played with calculating and caching the generation numbers at
repack time, but there aren't any patches currently under consideration.

> Does git do any actual checks before pushing?

No. One problem with checking the commit relationships is that it may
not be the new commit which is broken (by skewing backwards), but rather
the commit it builds on (which has skewed forwards). If you are building
on such a fast-forward commit, it is often to late to fix that commit.
So you have to fast-forward yourself, propagating the bogus value.

If the receiving machine checked the incoming commits against its own
internal clock, that could work. You would still want to check the
commit relationships to catch new commits that erroneously claim to be
from the past, though (that is, before their parents).

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Consistency question
  2014-01-15 11:13 ` Jeff King
@ 2014-01-15 11:40   ` David Kastrup
  2014-01-15 12:44     ` Andreas Krey
  2014-01-15 11:55   ` David Kastrup
  2014-01-18  1:22   ` Mike Hommey
  2 siblings, 1 reply; 8+ messages in thread
From: David Kastrup @ 2014-01-15 11:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> On Wed, Jan 15, 2014 at 11:37:08AM +0100, David Kastrup wrote:
>
>> The question is what guarantees I have with regard to the commit date of
>> a commit in relation to that of its parent commits:
>> 
>> a) none
>> b) commitdate(child) >= commitdate(parent)
>> c) commitdate(child) > commitdate(parent)
>
> a) none
>
>> Obviously, I can rely on c) being true "almost always":
>
> Actually, b) is quite often the case in automated processes (e.g., "git
> am" or "git rebase"). The author dates are different, but the committer
> dates may be in the same second.

Ok, thanks.  Assuming that rebases don't happen 1000/s, I should likely
not worry too much about O(n^2) for this case (and frankly, clearly
nobody worried about O(n^2) in the current blame.c anyway).  It's also
not really relevant for linear parts of the history like that of "git
rebase" since in that case the parent enters my priority queue when its
child is getting processed: nothing to be confused about here.  This is
more about sibling rivalries calling a parent to the queue before the
sibling had a chance to leave.  So it comes into play for my use case
basically only when dealing with merge commits.

> I suspect there are other algorithms that could be sped up, too, if we
> had trustworthy generation numbers (I implemented and timed the
> "--contains" algorithm, but haven't done so for other algorithms).

With a single root, "depth" helps a lot.  When looking for a common
parent of a number of commits, you first shorten all ancestries to the
same size and then you can look for the point of convergence in
lockstep.

But didn't git forego the "single root" requirement in its commit DAG at
some point of time?

Thanks for the speedy reply!  I think I'm good with what I need to know
to go ahead.  The rest is just idle curiosity.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Consistency question
  2014-01-15 11:13 ` Jeff King
  2014-01-15 11:40   ` David Kastrup
@ 2014-01-15 11:55   ` David Kastrup
  2014-01-18  1:22   ` Mike Hommey
  2 siblings, 0 replies; 8+ messages in thread
From: David Kastrup @ 2014-01-15 11:55 UTC (permalink / raw)
  To: git

Jeff King <peff@peff.net> writes:

> There are some parts of the code that will behave badly with clock skew.
> For example, "--since" will stop traversing when we hit a certain point.
> It requires a fixed number of "too old" commits before quitting, though,
> in an attempt to bypass small runs of skewed clocks.

That actually turns out to be a somewhat sore point for me: I use
something like

git shortlog -n --since 2013/12/01 --until 2014/01/01 master

for generating statistics on LilyPond when I am doing my monthly report
begging the community for money.

It turns out that the numbers of commits attributed to me tend to go
_down_ quite regularly in the time from starting the report to sending
it out.  Which might also cause me to overlook a particularly
selfpraiseworthy item.

Not sure how feasible it would be to arrive at a stable and
complementary set of --since/--until.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Consistency question
  2014-01-15 11:40   ` David Kastrup
@ 2014-01-15 12:44     ` Andreas Krey
  2014-01-15 13:00       ` David Kastrup
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Krey @ 2014-01-15 12:44 UTC (permalink / raw)
  To: David Kastrup; +Cc: Jeff King, git

On Wed, 15 Jan 2014 12:40:29 +0000, David Kastrup wrote:
...
> With a single root, "depth" helps a lot.  When looking for a common
> parent of a number of commits, you first shorten all ancestries to the
> same size and then you can look for the point of convergence in
> lockstep.

Hmm, how about traversing from all the start commits downwards
simultaneously, noting which start you say each commit from, and stopping
when you have a commit carrying all start labels?

I don't quite see how the same size plus lockstep works out (but the
'same size' part is possibly the same as my 'concurrent traversal').

> But didn't git forego the "single root" requirement in its commit DAG at
> some point of time?

About at the beginning, I guess. Nothing in the data model ever required it?

> ... The rest is just idle curiosity.

Me too, mostly. I may have to do some traversal for tree/dag painting.

Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Consistency question
  2014-01-15 12:44     ` Andreas Krey
@ 2014-01-15 13:00       ` David Kastrup
  2014-01-15 13:45         ` Andreas Krey
  0 siblings, 1 reply; 8+ messages in thread
From: David Kastrup @ 2014-01-15 13:00 UTC (permalink / raw)
  To: Andreas Krey; +Cc: Jeff King, git

Andreas Krey <a.krey@gmx.de> writes:

> On Wed, 15 Jan 2014 12:40:29 +0000, David Kastrup wrote:
> ...
>> With a single root, "depth" helps a lot.  When looking for a common
>> parent of a number of commits, you first shorten all ancestries to the
>> same size and then you can look for the point of convergence in
>> lockstep.
>
> Hmm, how about traversing from all the start commits downwards
> simultaneously, noting which start you say each commit from, and stopping
> when you have a commit carrying all start labels?

It means that when the start commits are at considerably different
depth, you'll traverse much more material than necessary.

Also you need labels.

> I don't quite see how the same size plus lockstep works out (but the
> 'same size' part is possibly the same as my 'concurrent traversal').

It just equalizes the depth before starting, so you don't need labels:
any common ancestor is reached at the same time by its descendants.

Of course, I conveniently forgot merge commits.  This scheme works out
of the box only with single parenting.  And it works fine without a
common ancestor, too: you just run into a NULL pointer at the same time
when there isn't one.

So, uh, this solution does not really seem to match the problem...

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Consistency question
  2014-01-15 13:00       ` David Kastrup
@ 2014-01-15 13:45         ` Andreas Krey
  0 siblings, 0 replies; 8+ messages in thread
From: Andreas Krey @ 2014-01-15 13:45 UTC (permalink / raw)
  To: David Kastrup; +Cc: Jeff King, git

On Wed, 15 Jan 2014 14:00:30 +0000, David Kastrup wrote:
> Andreas Krey <a.krey@gmx.de> writes:
...
> > Hmm, how about traversing from all the start commits downwards
> > simultaneously, noting which start you say each commit from, and stopping
> > when you have a commit carrying all start labels?
> 
> It means that when the start commits are at considerably different
> depth, you'll traverse much more material than necessary.

But it has the advantage that you don't need to traverse the DAG
to the root when the differences are small - the runtimes are roughly
proportional to the difference between the commits.

> Also you need labels.

Yes, I assume that that (marking commits) is not a cost factor.

> > I don't quite see how the same size plus lockstep works out (but the
> > 'same size' part is possibly the same as my 'concurrent traversal').
> 
> It just equalizes the depth before starting, so you don't need labels:
> any common ancestor is reached at the same time by its descendants.
> 
> Of course, I conveniently forgot merge commits.

I finally noticed but forgot to mention in the last post.

Merges also mean that there is not necessarily a unique common
ancestor between commits.

Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Consistency question
  2014-01-15 11:13 ` Jeff King
  2014-01-15 11:40   ` David Kastrup
  2014-01-15 11:55   ` David Kastrup
@ 2014-01-18  1:22   ` Mike Hommey
  2 siblings, 0 replies; 8+ messages in thread
From: Mike Hommey @ 2014-01-18  1:22 UTC (permalink / raw)
  To: Jeff King; +Cc: David Kastrup, git

On Wed, Jan 15, 2014 at 06:13:30AM -0500, Jeff King wrote:
> On Wed, Jan 15, 2014 at 11:37:08AM +0100, David Kastrup wrote:
> 
> > The question is what guarantees I have with regard to the commit date of
> > a commit in relation to that of its parent commits:
> > 
> > a) none
> > b) commitdate(child) >= commitdate(parent)
> > c) commitdate(child) > commitdate(parent)
> 
> a) none
> 
> > Obviously, I can rely on c) being true "almost always":
> 
> Actually, b) is quite often the case in automated processes (e.g., "git
> am" or "git rebase"). The author dates are different, but the committer
> dates may be in the same second.
> 
> And of course a) is the result of clock skew and software bugs.

... or importing non-git repositories that don't have commit info
separated from author info like git does. In such cases, it's usual to
duplicate the author info as commit info so that clones of the same
non-git repo end up with the same git sha1s. Mercurial easily allows
author dates to be in a non topological order.

Mike

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-01-18  1:22 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-15 10:37 Consistency question David Kastrup
2014-01-15 11:13 ` Jeff King
2014-01-15 11:40   ` David Kastrup
2014-01-15 12:44     ` Andreas Krey
2014-01-15 13:00       ` David Kastrup
2014-01-15 13:45         ` Andreas Krey
2014-01-15 11:55   ` David Kastrup
2014-01-18  1:22   ` Mike Hommey

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).