Re: Git commit generation numbers

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Git commit generation numbers
@ 2011-07-17 18:27 George Spelvin
  2011-07-17 19:00 ` Long, Martin
  2011-07-17 19:30 ` Linus Torvalds
  0 siblings, 2 replies; 89+ messages in thread
From: George Spelvin @ 2011-07-17 18:27 UTC (permalink / raw)
  To: git; +Cc: linux, torvalds

> The thing I hate about it is very fundamental: I think it's a hack around a basic git
> design mistake. And it's a mistake we have known about for a long time.
> 
> Now, I don't think it's a *fatal* mistake, but I do find it very broken to basically
> say "we made a mistake in the original commit design, and instead of fixing it we
> create a separate workaround for it".
> 
> THAT I find distasteful. My reaction is that if we're going to add generation
> numbers, then were should just do it the way we should have done them originally,
> rather than as some separate hack.

There are a few design mistakes in git.  The way the object type
and size are prefixed to the data for hasing purposes, which prevents
aligned fetching from memory-mapped data in the hashing code, isn't too
pretty either.

But git has generally preferred to avoid storing information that can
be recomputed.  File renames are the big example.  given this, why the
heck store generation numbers?

They *can* be computed on demand, so arguably they *should*.  Cacheing is
then an optimization, just like packs, pack indexes, the hashed object
storage directories, and all that.

I'm in the "make it a cache" camp, honestly.  

For example, here's a different possible generation number scheme.
By making the generation number a cache, it becomes a valid alternative
to experiment with.

Simply store a topologically sorted list of commits.  Each commit's
position can serve as a generation number, and is greater than the
positions of all ancestors.  But by using the offset within the list,
the number is stored implicitly.

Generation numbers don't have to be consecutive as long as they're
correctly ordered, so you could, e.g. choose to make them unique.

I don't think this is actually worth it; I'm just using it as a
not-completely-insane example of a different design that nonetheless
achieves the same goal.

Why freeze this in the object format?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-17 18:27 Git commit generation numbers George Spelvin
@ 2011-07-17 19:00 ` Long, Martin
  2011-07-17 19:30 ` Linus Torvalds
  1 sibling, 0 replies; 89+ messages in thread
From: Long, Martin @ 2011-07-17 19:00 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, torvalds

> Why freeze this in the object format?

Because if you put it in the object format, then it gets pushed and
pulled around, thereby putting generation numbers in every clone.

I'm starting to think put them in the object store, for exactly that
reason, and to start moving repositories in a direction where the look
more like they would if this had been done correctly from the start.

Then, because some operations are still going to create a lot of
traversals, a cache is always an option to improve the performance in
that area.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-17 18:27 Git commit generation numbers George Spelvin
  2011-07-17 19:00 ` Long, Martin
@ 2011-07-17 19:30 ` Linus Torvalds
  2011-07-17 23:39   ` George Spelvin
  1 sibling, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-17 19:30 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Sun, Jul 17, 2011 at 11:27 AM, George Spelvin <linux@horizon.com> wrote:
>
> There are a few design mistakes in git.  The way the object type
> and size are prefixed to the data for hasing purposes, which prevents
> aligned fetching from memory-mapped data in the hashing code, isn't too
> pretty either.

Why would you ever care? That makes no sense.

> But git has generally preferred to avoid storing information that can
> be recomputed.  File renames are the big example.  given this, why the
> heck store generation numbers?

Guys, please don't bring up file renames. I explained once already why
bringing up file renames just makes you look like a f^&% moron.

Let me explain one more time:

 - Storing file renames is STUPID. It's stupid for very fundamental
reasons that have absolutely *NOTHING* to do with "it can be computed
later".

It's fundamentally stupid because it will FOREVER SCREW UP YOUR DATA,
and because it will make merging an unmitigated disaster and make your
repository depend on how you *created* your data, rather than on what
the data is. It will totally break the situation of one person doing a
rename, while another person does something else to the metadata (eg a
create of the same filename).

Trying to track file identities will leave to very fundamentally
unsolvable issues like "which file identity do we choose when two
different files get the same name", or "which file identity will we
choose when one file splits in two".

Git doesn't track renames, because unlike pretty much every other SCM
out there, git really does have a good design, and because I damn well
understood the real problems.

So bringing it up as an example of "we don't store it because we can
compute it" is really totally idiotic. It's a sign of not
understanding the problems with renames. Stop doing it. That argument
is totally irrelevant. Really.

It's like saying "We shouldn't do generation numbers because fish
don't use bicycles". The only thing that kind of argument does is to
make me convinced that you don't understand the problem enough to be
worth even arguing with. It is not only a worthless argument, but it
makes your every other argument suspect.

Comprende? Stop it.

> They *can* be computed on demand, so arguably they *should*.

Umm, no.

That's actually a really bad argument.

There are valid things that we "should" do, but they have nothing to
do with "if something can be done, it should be done". That's just a
crazy argument.

A thing we really *should* do is perform well. And be really reliable.
And support a distributed workflow.

Those are real arguments that aren't about "just because it's there".

Now, some of those arguments can then be used to say "don't bother
storing redundant data". For example, redundant data takes disk space
and network bandwidth, and if something can be recomputed cheaply (ie
if it doesn't have a negative impact on performance), then redundant
data is just bad.

And what appears like a much better argument (right now) is that some
data isn't needed AT ALL, because you can make do with other data
entirely (ie dates).

But "just because we could recompute it" is a bad bad reason.

The thing is, the very basic design of git is all about *incomplete*
DAG traversal. The DAG traversal part is pretty obvious and simple,
but the *partial* thing really is very very important. We absolutely
need it for reasonable scalability. We've spent a *lot* of time in git
development on trying to perform really well by avoiding work. Not
just in revision traversal, but in many other areas too (like making
diff and merge much faster by being able to handle whole identical
recursive subdirectories by just checking the SHA1, for example).

That's a *really* fundamental design issue in git. Performance was
always a primary goal. And by primary, I really mean primary. As in
"more important than just about anything else".  There were other
primary goals, but really not very many.

And there really aren't very good ways to limit DAG traversal.
Generation numbers are one of the very few fundamental ones. We hacked
around it with dates, and it works pretty well in practice (well
enough that I'm certainly ok with the hack), but it's definitely one
of the areas where git simply does something "wrong". It's simply not
a entirely reliable algorithm, and that fact makes me a bit
uncomfortable with it.

(Now, in theory, a global *approximate* time is theoretically possible
in a distributed environment, and as such it's arguable that "global
time with a slop that is based on the speed of light and knowledge of
location" is at least theoretically sound. So the real problem with
commit dates is that people simply don't have good clocks. So it's a
practical problem rather than a theoretical one, and it's a practical
problem that doesn't really cause enough problems in practice to not
be workable. But I'm making excuses for it, and I _know_ I'm making
excuses for it, so I'm not really happy about it)

And it's just about the only area where I am aware of git doing
something "wrong". Which is why I would like to have had generation
numbers even though the dates do work.

Anyway, to get back to the actual issue of caching vs not caching: if
you think "we could compute it dynamically" means that we should, then
we damn well shouldn't cache it either - why cache it, when you could
just compute it. And if it's worth it to waste resources on the cache
in order to avoid performance issues, then it damn well would be ok to
waste (fewer) resources on just saving the generation number in the
object data base. And make that *fundamental* fix to a hack that git
has had since pretty much day one.

And btw, git didn't have the date-based hack originally, because I
didn't think it would be problematic enough. I thought that we could
do universally efficient partial DAG traversal - not having to go all
the way to the root -  based purely on the DAG. The code in
"everybody_uninteresting()" tries to be that "limit DAG traversal by
only looking at the DAG itself", and it works for many simple
situations. But it turns out that it does *not* work for many other
cases.

So the generation number really is very very fundamnetal. It's
absolutely not some "additional information that can be computed",
because the whole AND ONLY point of having the number is to not
compute it.

We are never interested in the generation number for its own sake. We
are only interested in it in order to avoid having to look at the rest
of the DAG.

So no, the number fundamentally isn't computable, because computing it
obviates the need for it.

                           Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-17 19:30 ` Linus Torvalds
@ 2011-07-17 23:39   ` George Spelvin
  2011-07-17 23:58     ` Linus Torvalds
  0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-17 23:39 UTC (permalink / raw)
  To: linux, torvalds; +Cc: git

> So the generation number really is very very fundamnetal. It's
> absolutely not some "additional information that can be computed",
> because the whole AND ONLY point of having the number is to not
> compute it.
> 
> We are never interested in the generation number for its own sake. We
> are only interested in it in order to avoid having to look at the rest
> of the DAG.

You're making my point and somehow not seeing it.

What you're describing here is the archetpical cache.

The only reason for having a memory cache is to avoid accessing memory!
The only reason for having a TLB is to avoid walking the page tables!
The only reason for having a page cache is to avoid hitting the disk!
The only reason for having a dcache is to avoid traversing the file
system directories!

And yes, the only reason for having a generation number cache is to avoid
traversing the DAG.  D'oh.  Do you think this is somehow news to anyone?

The fundamental nature of a cache is that it lets you look something up
quickly that you could compute but don't want to.

I'm slapping my forehead like Homer Simpson here.  The fact that computing
the generation number is expensive is why it's worth cacheing.  But the
fact that it *can* be computed is a reason not to clutter the published
commit object format with it.

The generation number is NOT FUNDAMENTAL.  It contains no information
that's not already in the DAG.  The danger of putting it into a commit
is that you'll do it wrong, and thereby screw everything up.

If we have broken code that generates a broken cache, we fix the code
and the bugs magically go away.

If we have broken code that generates a broken commit object, we have
a huge problem.

Just like we don't ship pack indexes around, but recompute them on arrival.
The index is essential for performance, but it's absolutely non-essential
for correctness.

As a general design principle, the exported data structures, like the
commits, should be as simple as possible.  Do not include extraneous
or redundant data, because then you have to deal with the possibility
of inconsistency.  This leads to bugs.  (Frequently buffer overflow bugs.)

Maybe it would have been worth violating that principle during the initial
git design.  I still see a good argument for not doing that even if we
had a time machine.

But now that the commit format is established and widely used, the argument
has far more force.  Changing the commit format provides zero functionality
gain, and the performance gain can be obtained a different way.

Maybe a bit more code, but nothing extraordinary.

To me, the KISS principle says "don't change the commit format!"

Now, you complain about code complexity.  But this is a read-only cache.
The generation number of a commit object never changes.  There's no update
operation.  Like an I-cache, if there's ever any problem, throw it away.

Arguing that "the patch to put it in the commit object is smaller" is
stupidly short-sighted.  Now every version of git from now until forever
has to support both kinds of commit objects.  (And browsing old git
trees will forever be slow.)

You only take on that sort of legacy support burden if you absolutely have to.

> But "just because we could recompute it" is a bad bad reason.

Bull puckey.  You're ugly and stupid and WRONG.

It's an excellent reason.  I'm amazed that you're not seeing it.
The principle is "don't include redundant data in a transport format."
Because it can be recomputed, it's redundant.  Therefore, it shouldn't
be included in the transport format.

It's exactly the same principle as "don't store the indexes in the
database dump" and "don't store filename hashes in file system
archives".

This is a principle, not an iron-clad rule.  It can be violated for
good and sufficient reasons, notably performance.

But in this case, we can get the performance without it.  Without,
in fact, changing the git transport format at all.

And "don't change a widely-used transport format" is ANOTHER important
principle.  Backward-compatible is much better than incompatible, but
far better to avoid changing it at all.

Breaking two such principles without an absolutely iron-clad reason is
ugly and stupid and wrong.

(As you well know, the more general principle is "don't store redundant
data AT ALL unless you need to for performance".  Redundant data is A
Bad Thing.  It can get out of sync.  But if you have to, a private cache
is much better than a exchange format.)

Put another way, it IS stupid, it IS expendable, and therefore it SHOULD go.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-17 23:39   ` George Spelvin
@ 2011-07-17 23:58     ` Linus Torvalds
  2011-07-18  5:13       ` George Spelvin
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-17 23:58 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Sun, Jul 17, 2011 at 4:39 PM, George Spelvin <linux@horizon.com> wrote:
>
> I'm slapping my forehead like Homer Simpson here.  The fact that computing
> the generation number is expensive is why it's worth cacheing.  But the
> fact that it *can* be computed is a reason not to clutter the published
> commit object format with it.

And I'm slapping *my* forehead.

Nobody has *ever* given a reason why the cache would be better than
just making it explicit.

That's my issue.

Why is that so hard for people to understand? The cache is just EXTRA WORK.

To take your TLB example: it's like having a TLB for a page table that
would be as easy to just create in a way that it's *faster* to look up
in the actual data structure than it would be to look up in the cache.

Or to take your disk cache example: wouldn't you say that a disk cache
is a F&*&ING BAD IDEA if it is slower than the disk it caches?

Seriously.

                    Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-17 23:58     ` Linus Torvalds
@ 2011-07-18  5:13       ` George Spelvin
  2011-07-18 10:28         ` Anthony Van de Gejuchte
  0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-18  5:13 UTC (permalink / raw)
  To: linux, torvalds; +Cc: git

> Nobody has *ever* given a reason why the cache would be better than
> just making it explicit.

I thought I listed a few.  Let me be clearer.

1) It involves changing the commit format.  Since the change is
   backward-compatible, it's not too bad, but this is still fundamentally
   A Bad Thing, to be avoided if possible.

2) It can't be retrofitted to help historical browsing.

3) You have to support commits without generation numbers forever.
   This is a support burden.  If you can generate generation numbers for
   an entire repository, including pre-existing commits, you can *throw
   out* the commit date heuristic code entirely.

4) It can't be made to work with grafts or replace objects.

5) It includes information which is redundant, but hard to verify,
   in git objects.  Leading to potentially bizarre and version-dependent
   behaviour if it's wrong.  (Checking that the numbers are consistent
   is the same work as regenerating a cache.)

6) It makes git commits slightly larger.  (Okay, that's reaching.)

> Why is that so hard for people to understand? The cache is just EXTRA WORK.

That's why it *might* have been a good idea to include the number in
the original design.  But now that the design is widely deployed, it's
better to avoid changing the design if not necessary.

With a bit of extra work, it's not necessary.

> To take your TLB example: it's like having a TLB for a page table that
> would be as easy to just create in a way that it's *faster* to look up
> in the actual data structure than it would be to look up in the cache.

You've subtly jumped points.  The original point was that it's worth
precomputing and storing the generation numbers.  I was trying to
say that this is fundamentally a caching operation.

Now we're talking about *where* to store the cached generation numbers.

Your point, which is a very valid one, is that they are to be stored
on disk, exactly one per commit, can be computed when the commit is
generated, and are accessed at the same time as the commit, so it makes
all kinds of sense to store them *with* the commits.  As part of them,
even.

This has the huge benefit that it does away with the need for a *separate*
data structure.  (Kinda sorts like the way AMD stores instruction
boundaries in the L1 I-cache, avoiding the need for a separate data
structure.)

I'm arguing that, despite this annoying overhead, there are valid reasons
to want to store it separately.  There are some practical ones, but the
basic one is an esthetic/maintainability judgement of "less cruft in
the commit objects is worth more cruft in the code".

Git has done very well partly *because* of the minimality of its basic
persistent object database format.  I think we should be very reluctant
to add to that without a demonstrated need that *cannot* be met in
another way.

In this particular case, a TLB is not a transport format.  It's okay
to add redundant cruft to make it faster, because it only lasts until
the next reboot.  (A more apropos, software-oriented analogy might be
"struct page".)

A git commit object *is* a transport format, one specifically designed
for transporting data a very long way forward in time, so it should be
designed with considerable care, and cruft ruthlessly eradicated.

Whatever you add to it has to be supported by every git implementation,
forever.  As does every implementation bug ever produced.

A cache, on the other hand, is purely a local implementation detail.
It can be changed between versions with much less effort.

I agree it's more implementation work.  But the upside is a cleaner
struct commit.  Which is a very good thing.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-18  5:13       ` George Spelvin
@ 2011-07-18 10:28         ` Anthony Van de Gejuchte
  2011-07-18 11:48           ` George Spelvin
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Van de Gejuchte @ 2011-07-18 10:28 UTC (permalink / raw)
  To: George Spelvin; +Cc: torvalds, git

On 18-jul-2011, at 07:13, George Spelvin wrote:

>> Nobody has *ever* given a reason why the cache would be better than
>> just making it explicit.
> 
> I thought I listed a few.  Let me be clearer.
> 
> 1) It involves changing the commit format.  Since the change is
>   backward-compatible, it's not too bad, but this is still fundamentally
>   A Bad Thing, to be avoided if possible.

Git is designed to ignore data in this case afaik, so I do not see any
reason why backwards-compatibility gets broken here.

> 
> 2) It can't be retrofitted to help historical browsing.

I like to see more (valid) arguments, as I do not see what you are
trying to explain.

> 
> 3) You have to support commits without generation numbers forever.
>   This is a support burden.  If you can generate generation numbers for
>   an entire repository, including pre-existing commits, you can *throw
>   out* the commit date heuristic code entirely.

I'll give you a few months to rethink at this statement until this
feature does get used widely. I think there was never a moment where
we would ever think to rebuild older commits as this would break the
hash of the commits where many people are potential looking for.

> 
> 4) It can't be made to work with grafts or replace objects.
> 
> 5) It includes information which is redundant, but hard to verify,
>   in git objects.  Leading to potentially bizarre and version-dependent
>   behaviour if it's wrong.  (Checking that the numbers are consistent
>   is the same work as regenerating a cache.)

The data is *consistent* as long as the hash doesn't change, storing the
data in the commits *can* reduce resource and makes calculations cheaper.
Therefore, I think there are enough reasons to add the generation number
in the commit. Yes, many data can be calculated or can be an overhead,
but as Torvalds already said, it can be used as consistency check.

If the data does get wrong, then its probably caused by something stupid
enough to break the rules. Yes, this is a problem but I think there are
already enough reasons given, look back to the archives of this topic.

Ok, there is one possible thing that *can* go wrong and that is when you
are changing history with generation numbers with an older git client.

(And thats a good reason to communicate with others as clear as possible
about this feature, but its still not version-dependent as it doesn't
require a client to use it)

> 
> 6) It makes git commits slightly larger.  (Okay, that's reaching.)
> 
>> Why is that so hard for people to understand? The cache is just EXTRA WORK.
> 
> That's why it *might* have been a good idea to include the number in
> the original design.  But now that the design is widely deployed, it's
> better to avoid changing the design if not necessary.
> 
> With a bit of extra work, it's not necessary.
> 
>> To take your TLB example: it's like having a TLB for a page table that
>> would be as easy to just create in a way that it's *faster* to look up
>> in the actual data structure than it would be to look up in the cache.
> 
> You've subtly jumped points.  The original point was that it's worth
> precomputing and storing the generation numbers.  I was trying to
> say that this is fundamentally a caching operation.
> 
> Now we're talking about *where* to store the cached generation numbers.
> 
> Your point, which is a very valid one, is that they are to be stored
> on disk, exactly one per commit, can be computed when the commit is
> generated, and are accessed at the same time as the commit, so it makes
> all kinds of sense to store them *with* the commits.  As part of them,
> even.
> 
> This has the huge benefit that it does away with the need for a *separate*
> data structure.  (Kinda sorts like the way AMD stores instruction
> boundaries in the L1 I-cache, avoiding the need for a separate data
> structure.)
> 
> I'm arguing that, despite this annoying overhead, there are valid reasons
> to want to store it separately.  There are some practical ones, but the
> basic one is an esthetic/maintainability judgement of "less cruft in
> the commit objects is worth more cruft in the code".
> 
> Git has done very well partly *because* of the minimality of its basic
> persistent object database format.  I think we should be very reluctant
> to add to that without a demonstrated need that *cannot* be met in
> another way.
> 
> 
> In this particular case, a TLB is not a transport format.  It's okay
> to add redundant cruft to make it faster, because it only lasts until
> the next reboot.  (A more apropos, software-oriented analogy might be
> "struct page".)
> 
> A git commit object *is* a transport format, one specifically designed
> for transporting data a very long way forward in time, so it should be
> designed with considerable care, and cruft ruthlessly eradicated.
> 
> Whatever you add to it has to be supported by every git implementation,
> forever.  As does every implementation bug ever produced.
> 
> A cache, on the other hand, is purely a local implementation detail.
> It can be changed between versions with much less effort.
> 
> I agree it's more implementation work.  But the upside is a cleaner
> struct commit.  Which is a very good thing.

A cache would use more resources because they can become invalid at any
point and *should* be recalculated by every client. We are processing
data that *can* be reused by everybody with a git client which has this
specific feature, but does not break anything with an older client.

So please, calculate things only once as this may save a *lot* of time :-)

I would see more advantage in a cache if the data could differs on
every client, but that still doesn't mean that you should use one.

> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Maybe I shouldn't even have responded to this as I tend not to agree with
the given opinions to use a cache, even when I think that Torvalds starts
throwing arguments as well for certain reasons, but thats probably my wrong
thinking at it.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-18 10:28         ` Anthony Van de Gejuchte
@ 2011-07-18 11:48           ` George Spelvin
  2011-07-20 20:51             ` Nicolas Pitre
  0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-18 11:48 UTC (permalink / raw)
  To: anthonyvdgent, linux; +Cc: git, torvalds

>> 1) It involves changing the commit format.  Since the change is
>>   backward-compatible, it's not too bad, but this is still fundamentally
>>   A Bad Thing, to be avoided if possible.

> Git is designed to ignore data in this case afaik, so I do not see any
> reason why backwards-compatibility gets broken here.

That's what I just wrote.  "The change is backward-compatible"
is a simpler and shorter way of writing "it doesn't break
backwards-compatibility" (to put the generation number in the commit
object).

I just said that *any* change is still undesirable.

>> 2) It can't be retrofitted to help historical browsing.

> I like to see more (valid) arguments, as I do not see what you are
> trying to explain.

I apologize for being unclear.  I meant that if you store the generation
in the commit, then you can't add generation numbers to an existing
repository ("retrofit") in order to speed up --contains and --topo-sort
operations on pre-existing git repositories.

(Without recomputing all the hashes and breaking the ability to merge
with people not using the feature.)

As Linus points out, this is not likely to be a major performance issue
in practice, as operations like finding merge bases overwhelmingly
use recent objects (which will have generation numbers once the feature
goes in), but it is a measurable disadvantage.

>> 3) You have to support commits without generation numbers forever.
>>   This is a support burden.  If you can generate generation numbers for
>>   an entire repository, including pre-existing commits, you can *throw
>>   out* the commit date heuristic code entirely.

> I'll give you a few months to rethink at this statement until this
> feature does get used widely. I think there was never a moment where
> we would ever think to rebuild older commits as this would break the
> hash of the commits where many people are potential looking for.

I'm afraid that your English grammar is sufficiently mangled here that
I don't understand *your* point.  Which is a shame because it's
one of my more important points.

Storing the generation number inside the commit means that a commit
with a generation number has a different hash than a commit without one.
This means that people won't want to break the hashes of existing commits
by adding them.  In many cases, ever.

Which means that git will have to be able to work without the generation
numbers forever.

If the generation numbers are stored in a separate data structure that
can be added to an existing repository, then a new version of git can
do that when needed.  Which lets git depend on always having the the
generation numbers to do all history walking and stop using commit date
based heuristics completely.

>> 4) It can't be made to work with grafts or replace objects.
>>
>> 5) It includes information which is redundant, but hard to verify,
>>   in git objects.  Leading to potentially bizarre and version-dependent
>>   behaviour if it's wrong.  (Checking that the numbers are consistent
>>   is the same work as regenerating a cache.)

> The data is *consistent* as long as the hash doesn't change, storing the
> data in the commits *can* reduce resource and makes calculations cheaper.

You're mixing up two issues.  Storing the generation number *anywhere*
can make calculations cheaper.  Storing them in the commit is indeed the
*simplest* place, but the calculation cost point is equally true if the
numbers are stored somewhere else.

As for consistency...

I'm defining "consistent" as consistency between the generation number
and the parent pointers.  This is the property that the history-walking
optimizations depend on.

A commit's generation number is consistent if it is larger than the
generation number of any of its parents.  (Optionally, you
may require that it be larger by exatly 1.)

A generation number is *not* consistent if is less than or equal to the
generation number of one of its parents.

If this happens, history walking code that uses the generation numbers
will not produce correct output.

Further, the nature of the incorrectness will depend on implementation
details ("potentially bizarre and version-dependent behaviour") of the
history-walking code.

By computing the generation numbers when needed, the entire "what happens
if someone makes a commit with an inconsistent generation number"
problem goes away.  It goes from "not likely to happen" or "somthing
that has to be checked for when receiving objects" to "can't happen".

The computation to verify that an incoming commit's generation number
is consistent is exactly the same computation needed to compute the
generation number it should have: look up all parent commit generation
numbers and take the maximum.  The only question is whether we store
the result after computing it, or compare with the included generation
number and possibly print an error message.

For example, suppose I generate a commit with a generation number of
UINT_MAX.  Will this crash git?  That's a new error condition the code
has to worry about.  If I generate the generation number locally, I know
that can't happen in any repository that I can download in a reasonable
period of time.

If we had generation numbers from day 1, we could just require that they
always be checked, and an inconsistent object could be always rejected.

But since old git versions ignore the generation number in commits, a
bad generation number could spread a long way before someone notices it.
It becomes a visible problem.  Not a really big one (I'm pretty sure
that refusing to pull it introduces no security holes), but it's an
error condition that we have to actually think about.

> A cache would use more resources because they can become invalid at any
> point and *should* be recalculated by every client. We are processing
> data that *can* be reused by everybody with a git client which has this
> specific feature, but does not break anything with an older client.
>
> So please, calculate things only once as this may save a *lot* of time :-)

This is silly.  The cache can't become invalid except by disk corruption,
which can corrupt numbers stored in the commit object just the same.
(The corruption can be detected by git-fsck, but that's also true
independent of where the numbers are stored.)

And the work to recalculate the numbers is far less than the work to
garbage collect, or repack, or generate the index of an incoming pack,
or any of a dozen operations that are normally done by all clients.
(Don't get me started on rename detection!)

This is a completely misplaced optimization.  Walking every commit in
the repository takes a few seconds and enough memory that we don't want
to do it every "git log" operation, but it's barely perceptible compared
to other repository maintenance operations.

Do it once when you install a new git software version and then you can
forget about it.

> I would see more advantage in a cache if the data could differs on
> every client, but that still doesn't mean that you should use one.

If you use grafts or replace objects, it can be.  That's my point 4)
above.  Supporting these makes maintaining a cache trickier, but it's
simply impossible to do with in-commit generation numbers.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-18 11:48           ` George Spelvin
@ 2011-07-20 20:51             ` Nicolas Pitre
  2011-07-20 22:16               ` George Spelvin
  0 siblings, 1 reply; 89+ messages in thread
From: Nicolas Pitre @ 2011-07-20 20:51 UTC (permalink / raw)
  To: George Spelvin; +Cc: anthonyvdgent, git, Linus Torvalds

On Mon, 18 Jul 2011, George Spelvin wrote:

> Storing the generation number inside the commit means that a commit
> with a generation number has a different hash than a commit without one.
> This means that people won't want to break the hashes of existing commits
> by adding them.  In many cases, ever.
> 
> Which means that git will have to be able to work without the generation
> numbers forever.

I've been diverting myself from $day_job by reading through this thread.  
Still, I couldn't make my mind between having the generation number 
stored in the commit object or in a separate cache by reading all the 
arguments for each until now. Admittedly I'm not as involved in the 
design of Git as I once was, so my comments can be considered with the 
same proportions.

Obviously, with a perfect design, we would have had gen numbers from the 
beginning.  But we did mistakes, and now have to regret and live with 
them (and yes I have my own share of responsibility for some of those 
regrets which are now embodied in the Git data format).

> If the generation numbers are stored in a separate data structure that
> can be added to an existing repository, then a new version of git can
> do that when needed.  Which lets git depend on always having the the
> generation numbers to do all history walking and stop using commit date
> based heuristics completely.

To me this is the killer argument.  Being able to forget about the 
broken date heuristics entirely and simplify the code is what makes the 
external cache so fundamentally better as it can be applied to any 
existing repositories.  And it has no backward compatibility issues as 
old Git version won't work any worse if they can't make any usage of 
that cache.

The alternative of having to sometimes use the generation number, 
sometimes use the possibly broken commit date, makes for much more 
complicated code that has to be maintained forever.  Having a solution 
that starts working only after a certain point in history doesn't look 
eleguant to me at all.  It is not like having different pack formats 
where back and forth conversions can be made for the _entire_ history.

And if you don't care about graft/replace then the cached data is 
immutable just like the in-commit version would, so there is no 
consistency issues.  If you do care about graft/replace (or who knows 
what other dag alteration scheme might be created in 5 years from now) 
then a separate cache will be required _anyway_, regardless of any 
in-commit gen number.

So to say that if a generation number is _really_ needed, then it should 
go in a separate cache.  Saying that if we would have done it initially 
then it would have been inside the commit object is not a good enough 
justification to do it today if it can't be applied to the whole of 
already existing repositories and avoid special cases.

I however have not formed any opinion on that fundamental question i.e. 
whether or not gen numbers are worth it in today's conditions. Neither 
did I think about the actual cache format (I don't think that adding it 
to the pack index is a good idea if grafts are to be honored) which 
certainly has bearing on that fundamental question too.

But I don't see the point of starting to add them now to commit objects, 
even if we regret not doing it initially, simply because having them 
appear randomly based on the Git version/implementation being used is 
still much uglier than some ad hoc cache or even not having them at all.

Nicolas

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-20 20:51             ` Nicolas Pitre
@ 2011-07-20 22:16               ` George Spelvin
  2011-07-20 23:26                 ` david
  0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-20 22:16 UTC (permalink / raw)
  To: linux, nico; +Cc: anthonyvdgent, git, torvalds

> The alternative of having to sometimes use the generation number, 
> sometimes use the possibly broken commit date, makes for much more 
> complicated code that has to be maintained forever.  Having a solution 
> that starts working only after a certain point in history doesn't look 
> eleguant to me at all.  It is not like having different pack formats 
> where back and forth conversions can be made for the _entire_ history.

It seemed like a pretty strong argument to me, too.

> And if you don't care about graft/replace then the cached data is 
> immutable just like the in-commit version would, so there is no 
> consistency issues.  If you do care about graft/replace (or who knows 
> what other dag alteration scheme might be created in 5 years from now) 
> then a separate cache will be required _anyway_, regardless of any 
> in-commit gen number.

A possible workaround would be to keep track of the largest generation
number skew introduced by any graft, and add that safety factor into
the history-walking code, but that would be painful if you replace a
single large commit with an equivalent long development history, such
as adding a historical development tree behind a recently-cut-off one.
or development history You can do a workaround at the expense of ine

> Neither did I think about the actual cache format (I don't think that
> adding it to the pack index is a good idea if grafts are to be honored)
> which certainly has bearing on that fundamental question too.

I was thinking of something very close to the V2 pack format.
http://book.git-scm.com/7_the_packfile.html
A magic number, a 256-entry fanout table, a sorted list of 20-byte hashes,
followed by a matching list of 4-byte generation numbers.

Ending with a 20-byte hash of the replaces and grafts state that this
cache is valid for, and a hash of the cache itself.

A bit of code factoring should make it easy to share much of the code.

It would certainly be possible to share the SHA1 table in an existing
pack index and store the generation numbers of the base (no replacement)
case, but you'd have to store null values for all the non-commit objects.

That takes 4 bytes per object, while a separate list of commits
takes 24 bytes per commit.  A separate list is better if commits
are less than 1/6 of all objects.

Looking at git's own object database, we have:
 66125 blobs   (45.50%)
 49292 trees   (33.92%)
 29554 commits (20.33%)
   362 tags    ( 0.25%)
145333 total

So we're actually a bit over the 16.66% optimum. but it's not far enough
to be a real efficiency problem.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-20 22:16               ` George Spelvin
@ 2011-07-20 23:26                 ` david
  2011-07-20 23:36                   ` Nicolas Pitre
  2011-07-21 12:03                   ` Drew Northup
  0 siblings, 2 replies; 89+ messages in thread
From: david @ 2011-07-20 23:26 UTC (permalink / raw)
  To: George Spelvin; +Cc: nico, anthonyvdgent, git, torvalds

On Wed, 20 Jul 2011, George Spelvin wrote:

>> The alternative of having to sometimes use the generation number,
>> sometimes use the possibly broken commit date, makes for much more
>> complicated code that has to be maintained forever.  Having a solution
>> that starts working only after a certain point in history doesn't look
>> eleguant to me at all.  It is not like having different pack formats
>> where back and forth conversions can be made for the _entire_ history.
>
> It seemed like a pretty strong argument to me, too.

except that you then have different caches on different systems. If the 
generation number is part of the repository then it's going to be the same 
for everyone.

in either case, you still have the different heristics depending on what 
version of git someone is running

David Lang

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-20 23:26                 ` david
@ 2011-07-20 23:36                   ` Nicolas Pitre
  2011-07-21  0:08                     ` Phil Hord
  2011-07-21 12:03                   ` Drew Northup
  1 sibling, 1 reply; 89+ messages in thread
From: Nicolas Pitre @ 2011-07-20 23:36 UTC (permalink / raw)
  To: david; +Cc: George Spelvin, anthonyvdgent, git, torvalds

On Wed, 20 Jul 2011, david@lang.hm wrote:

> On Wed, 20 Jul 2011, George Spelvin wrote:
> 
> > > The alternative of having to sometimes use the generation number,
> > > sometimes use the possibly broken commit date, makes for much more
> > > complicated code that has to be maintained forever.  Having a solution
> > > that starts working only after a certain point in history doesn't look
> > > eleguant to me at all.  It is not like having different pack formats
> > > where back and forth conversions can be made for the _entire_ history.
> > 
> > It seemed like a pretty strong argument to me, too.
> 
> except that you then have different caches on different systems.

So what?

> If the generation number is part of the repository then it's going to 
> be the same for everyone.

The actual generation number will be, and has to be, the same for 
everyone with the same repository content, regardless of the cache used.  
It is a well defined number with no room to interpretation.

> in either case, you still have the different heristics depending on what
> version of git someone is running

Indeed.


Nicolas

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-20 23:36                   ` Nicolas Pitre
@ 2011-07-21  0:08                     ` Phil Hord
  2011-07-21  0:18                       ` david
  2011-07-21  0:58                       ` Nicolas Pitre
  0 siblings, 2 replies; 89+ messages in thread
From: Phil Hord @ 2011-07-21  0:08 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: david, George Spelvin, anthonyvdgent, git, torvalds

On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
> On Wed, 20 Jul 2011, david@lang.hm wrote:
>
>> If the generation number is part of the repository then it's going to
>> be the same for everyone.
> The actual generation number will be, and has to be, the same for
> everyone with the same repository content, regardless of the cache used.
> It is a well defined number with no room to interpretation.

Nonsense.

Even if the generation number is well-defined and shared by all clients, 
the only quasi-essential definition is "for each A in ancestors_of(B), 
gen(A) < gen(B)".

In practice, the actual generation number *will be the same* for 
everyone with the same repository content, unless and until someone 
develops a different calculation method.  But there is no reason to 
require that the number *has to be* the same for everyone unless you 
expect (or require) everyone to share their gen-caches.

Surely there will be a competent and efficient gen-cache API.  But most 
code can just ask if B --contains A or even just use rev-list and 
benefit from the increased speed of the answer.  Because most code 
doesn't really care about the gen numbers themselves, but only the speed 
of determining ancestry.

Phil

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  0:08                     ` Phil Hord
@ 2011-07-21  0:18                       ` david
  2011-07-21  0:37                         ` Shawn Pearce
  2011-07-21  0:39                         ` Phil Hord
  2011-07-21  0:58                       ` Nicolas Pitre
  1 sibling, 2 replies; 89+ messages in thread
From: david @ 2011-07-21  0:18 UTC (permalink / raw)
  To: Phil Hord; +Cc: Nicolas Pitre, George Spelvin, anthonyvdgent, git, torvalds

On Wed, 20 Jul 2011, Phil Hord wrote:

> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
>> On Wed, 20 Jul 2011, david@lang.hm wrote:
>> 
>>> If the generation number is part of the repository then it's going to
>>> be the same for everyone.
>> The actual generation number will be, and has to be, the same for
>> everyone with the same repository content, regardless of the cache used.
>> It is a well defined number with no room to interpretation.
>
> Nonsense.
>
> Even if the generation number is well-defined and shared by all clients, the 
> only quasi-essential definition is "for each A in ancestors_of(B), gen(A) < 
> gen(B)".
>
> In practice, the actual generation number *will be the same* for everyone 
> with the same repository content, unless and until someone develops a 
> different calculation method.  But there is no reason to require that the 
> number *has to be* the same for everyone unless you expect (or require) 
> everyone to share their gen-caches.

and I think this is why Linus is not happy with a cache. He is seeing this 
as something that has significantly more value if it is going to be 
consistant in a distributed manner than if it's just something calculated 
locally that can be different from other systems.

if it's just locally generated, then I could easily see generation numbers 
being different on different people's ssstems, dependin on the order that 
they see commits (either locally generated or pulled from others)

If it's part of the commit, then as that commit gets propogated the 
generation number gets propogated as well, and every repository will agree 
on what the generation number is for any commit that's shared.

I agree that this consistancy guarantee seems to be valuable.

> Surely there will be a competent and efficient gen-cache API.  But most code 
> can just ask if B --contains A or even just use rev-list and benefit from the 
> increased speed of the answer.  Because most code doesn't really care about 
> the gen numbers themselves, but only the speed of determining ancestry.

in that case, why bother with generation numbers at all? the improved data 
based heristic seems to solve that problem.

David Lang

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  0:18                       ` david
@ 2011-07-21  0:37                         ` Shawn Pearce
  2011-07-21  0:47                           ` Phil Hord
  2011-07-21  4:26                           ` david
  2011-07-21  0:39                         ` Phil Hord
  1 sibling, 2 replies; 89+ messages in thread
From: Shawn Pearce @ 2011-07-21  0:37 UTC (permalink / raw)
  To: david
  Cc: Phil Hord, Nicolas Pitre, George Spelvin, anthonyvdgent, git,
	torvalds

On Wed, Jul 20, 2011 at 17:18,  <david@lang.hm> wrote:
>
> if it's just locally generated, then I could easily see generation numbers
> being different on different people's ssstems, dependin on the order that
> they see commits (either locally generated or pulled from others)

But this should only happen if the user fudges with their Git sources
and makes Git produce a different generation number.

If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
+ 1" then it doesn't matter who merged what commits, the same commit
appears at the same part of the graph relative to all of its
ancestors, and therefore always has the same generation number. This
is true whether or not the commit contains the generation number.

> If it's part of the commit, then as that commit gets propogated the
> generation number gets propogated as well, and every repository will agree
> on what the generation number is for any commit that's shared.

This isn't really as beneficial as you are making it out to be. We
already can agree on what the generation number should be for any
given commit, if you topo-sort the commit DAG, you get the same
result.

> I agree that this consistancy guarantee seems to be valuable.

Its valuable, but its consistent either with a cache, or not.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  0:37                         ` Shawn Pearce
@ 2011-07-21  0:47                           ` Phil Hord
  2011-07-21  4:26                           ` david
  1 sibling, 0 replies; 89+ messages in thread
From: Phil Hord @ 2011-07-21  0:47 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: david, Nicolas Pitre, George Spelvin, anthonyvdgent, git,
	torvalds


On 07/20/2011 08:37 PM, Shawn Pearce wrote:
> On Wed, Jul 20, 2011 at 17:18,<david@lang.hm>  wrote:
>> if it's just locally generated, then I could easily see generation numbers
>> being different on different people's ssstems, dependin on the order that
>> they see commits (either locally generated or pulled from others)
> But this should only happen if the user fudges with their Git sources
> and makes Git produce a different generation number.
>
> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
> + 1" then it doesn't matter who merged what commits, the same commit
> appears at the same part of the graph relative to all of its
> ancestors, and therefore always has the same generation number. This
> is true whether or not the commit contains the generation number.

Interesting.  I was going to disagree with the latter part of your 
statement, but then I realized you're right.

And that your algorithm allows duplicate generation numbers.

And that there's nothing wrong with that.

Because it meets the one quasi-essential need, "for each A in 
ancestors_of(B), gen(A) < gen(B)".

>> If it's part of the commit, then as that commit gets propogated the
>> generation number gets propogated as well, and every repository will agree
>> on what the generation number is for any commit that's shared.
> This isn't really as beneficial as you are making it out to be. We
> already can agree on what the generation number should be for any
> given commit, if you topo-sort the commit DAG, you get the same
> result.
>
>> I agree that this consistancy guarantee seems to be valuable.
> Its valuable, but its consistent either with a cache, or not.

I still fail to see the value.

Phil

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  0:37                         ` Shawn Pearce
  2011-07-21  0:47                           ` Phil Hord
@ 2011-07-21  4:26                           ` david
  2011-07-21 12:43                             ` George Spelvin
  1 sibling, 1 reply; 89+ messages in thread
From: david @ 2011-07-21  4:26 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Phil Hord, Nicolas Pitre, George Spelvin, anthonyvdgent, git,
	torvalds

On Wed, 20 Jul 2011, Shawn Pearce wrote:

> On Wed, Jul 20, 2011 at 17:18,  <david@lang.hm> wrote:
>>
>> if it's just locally generated, then I could easily see generation numbers
>> being different on different people's ssstems, dependin on the order that
>> they see commits (either locally generated or pulled from others)
>
> But this should only happen if the user fudges with their Git sources
> and makes Git produce a different generation number.
>
> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
> + 1" then it doesn't matter who merged what commits, the same commit
> appears at the same part of the graph relative to all of its
> ancestors, and therefore always has the same generation number. This
> is true whether or not the commit contains the generation number.

I have to think about this more, but I'm wondering about cases where the 
same result ia achieved via different methods, something along the lines 
of one person developing something with _many_ commits (creating a large 
generation number) that one person merges far sooner than another, causing 
the commits that they do after the merge to have much larger generation 
numbers than someone making the same changes, but doing the merge later

something like

   C9
    \
C2 - C10 - C11 - C12

vs
                 C9
                   \
C2 - C3 - C4 - C5 - C10

where the C10-12 in the first set and C3-5 in the second set are 
completely unrelated to what's done in C9 and C12 in the first set and C10 
in the sedond set are identical trees.

now I know that part of a commit is what it's parents are, so that is 
different (and that may be enough to say that generations don't matter 
and this entire issue is moot), but I haven't thought about it long enough 
to convince myself what would (or should) happen in these cases.

David Lang

>> If it's part of the commit, then as that commit gets propogated the
>> generation number gets propogated as well, and every repository will agree
>> on what the generation number is for any commit that's shared.
>
> This isn't really as beneficial as you are making it out to be. We
> already can agree on what the generation number should be for any
> given commit, if you topo-sort the commit DAG, you get the same
> result.
>
>> I agree that this consistancy guarantee seems to be valuable.
>
> Its valuable, but its consistent either with a cache, or not.
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  4:26                           ` david
@ 2011-07-21 12:43                             ` George Spelvin
  2011-07-21 19:19                               ` Jakub Narebski
  0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-21 12:43 UTC (permalink / raw)
  To: david, spearce; +Cc: anthonyvdgent, git, hordp, linux, nico, torvalds

On <david@lang.hm> wrote:
> On Wed, 20 Jul 2011, Shawn Pearce wrote:
>> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
>> + 1" then it doesn't matter who merged what commits, the same commit
>> appears at the same part of the graph relative to all of its
>> ancestors, and therefore always has the same generation number. This
>> is true whether or not the commit contains the generation number.

> I have to think about this more, but I'm wondering about cases where the 
> same result ia achieved via different methods, something along the lines 
> of one person developing something with _many_ commits (creating a large 
> generation number) that one person merges far sooner than another, causing 
> the commits that they do after the merge to have much larger generation 
> numbers than someone making the same changes, but doing the merge later

Can't happen.  Using the basic algorithm as Shawn described, the
generation number is defined uniquely by the ancestor DAG.

The generation number is the length of the longest path to a
root (zero-ancestor) commit through the DAG.

If you look at past discussion, several people have thought it was
okay to bake into the commit precsiely because it can be computed
once and will never change.

However, git does have some ability to amend the history DAG after
it's been written, using grafts and replace objects.  These can
change generation numbers, presisely because they change the DAG.

> something like
> 
>    C9
>     \
> C2 - C10 - C11 - C12
> 
> vs
>                  C9
>                    \
> C2 - C3 - C4 - C5 - C10
>
> where the C10-12 in the first set and C3-5 in the second set are
> completely unrelated to what's done in C9 and C12 in the first set
> and C10 in the second set are identical trees.

The generation numbers in the above are as follows:
First example:
	C2 = C9 = 0
	C10 = 1 = max(C2, C9) + 1
	C11 = 2 = C10 + 1
	C12 = 3 = C11 + 1

Second example:
	C2 = C9 = 0
	C3 = 1 = C2 + 1
	C4 = 2 = C2 + 1
	C5 = 3 = C4 + 1
	C10 = 4 = max(C5, C9) + 1

Now, the history pruning works fine if the "+1" is replaced my any other
non-zero increment, but it's not clear why you'd bother.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 12:43                             ` George Spelvin
@ 2011-07-21 19:19                               ` Jakub Narebski
  2011-07-21 20:27                                 ` George Spelvin
  0 siblings, 1 reply; 89+ messages in thread
From: Jakub Narebski @ 2011-07-21 19:19 UTC (permalink / raw)
  To: George Spelvin; +Cc: david, spearce, anthonyvdgent, git, hordp, nico, torvalds

George Spelvin, could you please try not mangle CC to include only
emails, stripping names (e.g. "spearce@spearce.org" instead of
"Shawn Pearce <spearce@spearce.org>")?

"George Spelvin" <linux@horizon.com> writes:
> On <david@lang.hm> wrote:
>> On Wed, 20 Jul 2011, Shawn Pearce wrote:

>>> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
>>> + 1" then it doesn't matter who merged what commits, the same commit
>>> appears at the same part of the graph relative to all of its
>>> ancestors, and therefore always has the same generation number. This
>>> is true whether or not the commit contains the generation number.
> 
>> I have to think about this more, but I'm wondering about cases where the 
>> same result ia achieved via different methods, something along the lines 
>> of one person developing something with _many_ commits (creating a large 
>> generation number) that one person merges far sooner than another, causing 
>> the commits that they do after the merge to have much larger generation 
>> numbers than someone making the same changes, but doing the merge later
> 
> Can't happen.  Using the basic algorithm as Shawn described, the
> generation number is defined uniquely by the ancestor DAG.
> 
> The generation number is the length of the longest path to a
> root (zero-ancestor) commit through the DAG.
> 
> If you look at past discussion, several people have thought it was
> okay to bake into the commit precsiely because it can be computed
> once and will never change.
> 
> However, git does have some ability to amend the history DAG after
> it's been written, using grafts and replace objects.  These can
> change generation numbers, presisely because they change the DAG.

There is also another issue that I have mentioned, namely incomplete
clones - which currently means shallow clone, without access to full
history.


Nb. grafts are so horrible hack that I would be not against turning
off generation numbers if they are used.

In the case of replace objects you need both non-replaced and replaced
DAG generation numbers.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 19:19                               ` Jakub Narebski
@ 2011-07-21 20:27                                 ` George Spelvin
  2011-07-21 20:33                                   ` Shawn Pearce
  2011-07-22 12:18                                   ` Jakub Narebski
  0 siblings, 2 replies; 89+ messages in thread
From: George Spelvin @ 2011-07-21 20:27 UTC (permalink / raw)
  To: jnareb, linux; +Cc: anthonyvdgent, david, git, hordp, nico, spearce, torvalds

> There is also another issue that I have mentioned, namely incomplete
> clones - which currently means shallow clone, without access to full
> history.

As far as history walking is concerned, you can just consider "missing
parent" the same as "no parent" and start the generation numbers at 0.
As long as you recompute

> Nb. grafts are so horrible hack that I would be not against turning
> off generation numbers if they are used.

Yeah, but it's not too miserable to add support (the logic is very similar
to replace objects), and then you would be able to have the history walking
code depend on the presence of generation numbers.  (The "load the cache"
function would regenerate it if necessary.)

Only do this if you already have support for "no generation numbers" in
the history walking code for (say) loose objects.

> In the case of replace objects you need both non-replaced and replaced
> DAG generation numbers.

Yes, the cache validity/invalidation criteria are the tricky bit.
Honestly, this is where the code gets ugly, not computing and storing
the generation numbers.

One thought on an expanded generation number cache:

There are many git operations that use ONLY the commit DAG, and do not
actually use any information from the commits other than their hashes
and parent pointers.  The ones that come to mind are rev-parse, rev-list,
describe, name-rev, and merge-base.

These could be sped up if, instead of just generation numbers, we kept
a complete cached copy of the commit DAG, so the commit objects didn't
have to be uncompressed and parsed.

This could be provided by an extended form of generation number cache.
In addition to listing the generation number of each commit, it
would list all the ancestors (by file offset rather than hash, for
compactness).  Then simple commit walking could load this cache and
avoid unpacking commit objects from packs.

A compact implementation would abuse the flexibility of generation numbers
to make them serve double duty.  They would be used as offsets into a
table of parent pointers.  By keeping the table topologically sorted,
the offsets would satisfy the requirements for generation numbers, but
would be unique, and there would be additional gaps when a commit had
multiple parents.

The parent pointers would themselves be 31-bit offsets into the table of
SHA-1 hashes, with the msbit meaning "this commit has multiple parents,
also look at the following table entry".  (If we use offset 0 to mean
"no parents", it might be more convenient to have the offset point to
the *end* of the run of parents rather than the beginning, so "following"
would be earlier in the file, but that's an implementation detail.)

I'm assuming that 2^31 commits having (in aggregate) 2^32 parents would
be enough for the time being.  As a local cache, it can be extended
with a software upgrade.  There's no need to ever have support for two
formats in any given release; just notice that the cache format is wrong,
blow it away, and regenerate it.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 20:27                                 ` George Spelvin
@ 2011-07-21 20:33                                   ` Shawn Pearce
  2011-07-22 12:18                                   ` Jakub Narebski
  1 sibling, 0 replies; 89+ messages in thread
From: Shawn Pearce @ 2011-07-21 20:33 UTC (permalink / raw)
  To: George Spelvin; +Cc: jnareb, anthonyvdgent, david, git, hordp, nico, torvalds

On Thu, Jul 21, 2011 at 13:27, George Spelvin <linux@horizon.com> wrote:
>
> be enough for the time being.  As a local cache, it can be extended
> with a software upgrade.  There's no need to ever have support for two
> formats in any given release; just notice that the cache format is wrong,
> blow it away, and regenerate it.

Don't assume that. Consider a repository stored on NFS that is
read-only to you. The NFS server has one version of Git installed, and
is using cache format A. You have a newer version of Git installed on
your workstation, using cache format B. Now you cannot use this
repository as a local filesystem... its only available to you over the
Git protocols. This breaks a number of people's environments.  :-)

Its better if we can avoid having to change file formats very often,
even if they are a local "cache".

-- 
Shawn.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 20:27                                 ` George Spelvin
  2011-07-21 20:33                                   ` Shawn Pearce
@ 2011-07-22 12:18                                   ` Jakub Narebski
  2011-07-22 13:09                                     ` Nicolas Pitre
  2011-07-22 18:02                                     ` david
  1 sibling, 2 replies; 89+ messages in thread
From: Jakub Narebski @ 2011-07-22 12:18 UTC (permalink / raw)
  To: George Spelvin
  Cc: Anthony Van de Gejuchte, David Lang, git, Phil Hord,
	Nicolas Pitre, Shawn Pearce, Linus Torvalds

On Thu, 21 Jul 2011, George Spelvin wrote:

> > There is also another issue that I have mentioned, namely incomplete
> > clones - which currently means shallow clone, without access to full
> > history.
> 
> As far as history walking is concerned, you can just consider "missing
> parent" the same as "no parent" and start the generation numbers at 0.
> As long as you recompute.

Well, shallow clone case can be considered both for putting 'true'
generation numbers in commit header, and against it.

For, because with generation numbers in commits you can use true
generation numbers.

Against, because if there are commits without generation numbers in
header, you cannot assign true generation number, and you can only use
"shallow" generation number, in generation numbers cache.

> > Nb. grafts are so horrible hack that I would be not against turning
> > off generation numbers if they are used.
> 
> Yeah, but it's not too miserable to add support (the logic is very similar
> to replace objects), and then you would be able to have the history walking
> code depend on the presence of generation numbers.  (The "load the cache"
> function would regenerate it if necessary.)
> 
> Only do this if you already have support for "no generation numbers" in
> the history walking code for (say) loose objects.

Grafts are non-transferable, and if you use them to cull rather than add
history they are unsafe against garbage collection... I think.

> > In the case of replace objects you need both non-replaced and replaced
> > DAG generation numbers.
> 
> Yes, the cache validity/invalidation criteria are the tricky bit.
> Honestly, this is where the code gets ugly, not computing and storing
> the generation numbers.

BTW. with storing generation number in commit header there is a problem
what would old version of git, one which does not understand said header,
do during rebase.  Would it strip unknown headers, or would it copy
generation number verbatim - which means that it can be incorrect?

BTW2. code size comparing in-commit and external cache cases must take
into account yet to be written fsck for in-commit generation numbers.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 12:18                                   ` Jakub Narebski
@ 2011-07-22 13:09                                     ` Nicolas Pitre
  2011-07-22 18:02                                       ` david
  2011-07-22 18:02                                     ` david
  1 sibling, 1 reply; 89+ messages in thread
From: Nicolas Pitre @ 2011-07-22 13:09 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: George Spelvin, Anthony Van de Gejuchte, David Lang, git,
	Phil Hord, Shawn Pearce, Linus Torvalds

On Fri, 22 Jul 2011, Jakub Narebski wrote:

> BTW. with storing generation number in commit header there is a problem
> what would old version of git, one which does not understand said header,
> do during rebase.  Would it strip unknown headers, or would it copy
> generation number verbatim - which means that it can be incorrect?

They would indeed be copied verbatim and become incorrect.


Nicolas

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 13:09                                     ` Nicolas Pitre
@ 2011-07-22 18:02                                       ` david
  2011-07-22 18:34                                         ` Jakub Narebski
  0 siblings, 1 reply; 89+ messages in thread
From: david @ 2011-07-22 18:02 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, George Spelvin, Anthony Van de Gejuchte, git,
	Phil Hord, Shawn Pearce, Linus Torvalds

On Fri, 22 Jul 2011, Nicolas Pitre wrote:

> On Fri, 22 Jul 2011, Jakub Narebski wrote:
>
>> BTW. with storing generation number in commit header there is a problem
>> what would old version of git, one which does not understand said header,
>> do during rebase.  Would it strip unknown headers, or would it copy
>> generation number verbatim - which means that it can be incorrect?
>
> They would indeed be copied verbatim and become incorrect.

how would they become incorrect?

David Lang

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 18:02                                       ` david
@ 2011-07-22 18:34                                         ` Jakub Narebski
  2011-07-22 19:06                                           ` Linus Torvalds
  2011-07-22 19:08                                           ` david
  0 siblings, 2 replies; 89+ messages in thread
From: Jakub Narebski @ 2011-07-22 18:34 UTC (permalink / raw)
  To: david
  Cc: Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, git,
	Phil Hord, Shawn Pearce, Linus Torvalds

On Fri, 22 Jul 2011, David Lang <david@lang.hm> wrote:
> On Fri, 22 Jul 2011, Nicolas Pitre wrote:
> > On Fri, 22 Jul 2011, Jakub Narebski wrote:
> >
> > > BTW. with storing generation number in commit header there is a problem
> > > what would old version of git, one which does not understand said header,
> > > do during rebase.  Would it strip unknown headers, or would it copy
> > > generation number verbatim - which means that it can be incorrect?
> >
> > They would indeed be copied verbatim and become incorrect.
> 
> how would they become incorrect?

Let's assume that the following history was created with new git, one
that correcly adds generation number header to commits:


  A(1)---B(2)---C(3)---D(4)---E(5)       <-- master
          \
           \----x(3)---y(4)---z(5)       <-- foo

The numbers are generation numbers in commit object.

Let's assume that this repository is fetched into repository instance
that is managed by older git, one that doesn't understand generation
header.

Then, if we do

  [old]$ git rebase master foo

and if old git _copies_ generation number header _verbatim_, we would
get:

  A(1)---B(2)---C(3)---D(4)---E(5)                         <-- master
                               \
                                \---x'(3)--y'(4)--z'(5)    <-- foo

Those generation numbers are *incorrect*; they should be:

  A(1)---B(2)---C(3)---D(4)---E(5)                         <-- master
                               \
                                \---x'(6)--y'(7)--z'(8)    <-- foo


That is IF unknown headers are copied verbatim during rebase.  For
"encoding" header this is a good thing, for "generation" it isn't.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 18:34                                         ` Jakub Narebski
@ 2011-07-22 19:06                                           ` Linus Torvalds
  2011-07-22 22:02                                             ` Jeff King
  2011-07-28 15:00                                             ` Felipe Contreras
  2011-07-22 19:08                                           ` david
  1 sibling, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-22 19:06 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: david, Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte,
	git, Phil Hord, Shawn Pearce

On Fri, Jul 22, 2011 at 11:34 AM, Jakub Narebski <jnareb@gmail.com> wrote:
>
> That is IF unknown headers are copied verbatim during rebase.  For
> "encoding" header this is a good thing, for "generation" it isn't.

Afaik, they aren't copied verbatim, and never have been. Afaik, the
only thing that has *ever* written commits is "commit_tree()"
(originally "main()" in commit-tree.c). Why is this red herring even
being discussed?

Of course you can always generate bogus commits by writing them by
hand. But that's irrelevant.

                     Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 19:06                                           ` Linus Torvalds
@ 2011-07-22 22:02                                             ` Jeff King
  2011-07-28 15:00                                             ` Felipe Contreras
  1 sibling, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-22 22:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narebski, david, Nicolas Pitre, George Spelvin,
	Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce

On Fri, Jul 22, 2011 at 12:06:08PM -0700, Linus Torvalds wrote:

> On Fri, Jul 22, 2011 at 11:34 AM, Jakub Narebski <jnareb@gmail.com> wrote:
> >
> > That is IF unknown headers are copied verbatim during rebase.  For
> > "encoding" header this is a good thing, for "generation" it isn't.
> 
> Afaik, they aren't copied verbatim, and never have been. Afaik, the
> only thing that has *ever* written commits is "commit_tree()"
> (originally "main()" in commit-tree.c). Why is this red herring even
> being discussed?

In git.git, that is the case. There are other programs that may write
git commits, though. Try:

  http://www.google.com/codesearch#search/&q=hash-object.*commit&type=cs

Many uses seem OK (they are generating a commit from scratch). This one
at least (the sixth result from the search above) would actually
generate buggy generation headers (it modifies parents but passes other
headers through):

  http://www.google.com/codesearch#XUVcT9DKB_U/replace&ct=rc&cd=7&q=hash-object.*commit

It may be worth saying that such code is stupid and ugly and wrong, or
that it is not deployed widely enough to care about.  But it's not
entirely a red herring.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 19:06                                           ` Linus Torvalds
  2011-07-22 22:02                                             ` Jeff King
@ 2011-07-28 15:00                                             ` Felipe Contreras
  2011-09-06 10:02                                               ` Ramkumar Ramachandra
  1 sibling, 1 reply; 89+ messages in thread
From: Felipe Contreras @ 2011-07-28 15:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narebski, david, Nicolas Pitre, George Spelvin,
	Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce

On Fri, Jul 22, 2011 at 10:06 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jul 22, 2011 at 11:34 AM, Jakub Narebski <jnareb@gmail.com> wrote:
>>
>> That is IF unknown headers are copied verbatim during rebase.  For
>> "encoding" header this is a good thing, for "generation" it isn't.
>
> Afaik, they aren't copied verbatim, and never have been. Afaik, the
> only thing that has *ever* written commits is "commit_tree()"
> (originally "main()" in commit-tree.c). Why is this red herring even
> being discussed?
>
> Of course you can always generate bogus commits by writing them by
> hand. But that's irrelevant.

Let's suppose for a moment that the commits do have these wrong
generation numbers, shouldn't a fetch on the newer client check these
and show an error? But what if they are pushed to a central server
that has old version of git? It would be messy.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-28 15:00                                             ` Felipe Contreras
@ 2011-09-06 10:02                                               ` Ramkumar Ramachandra
  0 siblings, 0 replies; 89+ messages in thread
From: Ramkumar Ramachandra @ 2011-09-06 10:02 UTC (permalink / raw)
  To: Git List
  Cc: Linus Torvalds, Jeff King, Jakub Narebski, david, Nicolas Pitre,
	George Spelvin, Anthony Van de Gejuchte, Phil Hord, Shawn Pearce,
	Felipe Contreras

Hi,

First, let me start out by saying that I'm a fairly new contributor to
Git, and I'm far less experienced than the other people on this
thread.  I've read through all the discussions time and again, and
thought about the problem for some time now - I can't say I understand
it as fully as many of you do, but I think I may have a slightly
different perspective to offer.

In what way is Git fundamentally different from Subversion?  It's the
simplicity of the data model.  From the simplest building block, a
key-value store, we have been able to compose and build things on top
of it.  The reason we built centralized version control systems
earlier is because it was *easier* to address the composition
problems.  We dumped all related repository and problems into one
central server.  With so much information in one place, things are
tightly coupled and problems are easier to solve.  Still not
convinced?  What's the weakest component in Git today?  Undoubtedly
submodules.  Ofcourse, a large part of the reason is that many people
don't use submodules, and hence it doesn't improve -- but it's
actually a circular problem.  People don't use submodules, because
it's so featureless and hard to develop.  Why is it so hard?  Back to
the fundamental problem of composition from simple building blocks.
In submodules, we have to take entire DAGs and build a composite DAG.
The key pieces of information are deep inside Git's fundamnetals:
Gitlinks.  Other projects try like Gitslave try to attack the problem
on a more superficial level, but they all hit a barrier when they
discover that they can't compose big blocks of data: you need simple
building blocks to compose.

It's the same story with C (and now, Haskell).  Why does everyone like
C so much?  Because it only provides fundamental building blocks and
gives people the freedom to compose the way they like.  It doesn't
provide big "template blocks" like Java, because they tend to be
restrictive in the long run.  Sure, Java is easier to start out with,
but people soon realize that big blocks can't compose.

More than arguing about backward compatibility, and about how older
versions of Git commits won't have generation numbers, I think this is
what we should be focusing on.  Sure, it'll additionally make sense to
put in a cache to speed things up now, but we need to think about what
Git will be 10~15 years from now.  The fundamental pieces of
information required for composition must be present in the
fundamental building blocks.

The real question we should be asking is: "Should Git have had commit
generation numbers in 2005?".  If the answer is "yes", we should put
them in now before it becomes even harder, bending over backwards for
backward compatibility if necessary.  Otherwise, we'll regret this
decision 10~15 years later, when we're faced with deeper issues.  If
you want a concrete example, think about how you'd compose DAGs
together (again, the submodules problem): where is the information
required to prune each DAG and compose?

I wish I could write this in myself, but I'm afraid I don't have the
engineering skill yet.  I'll be happy to contribute whatever little I
can, and participate in the review process.

Thanks.

-- Ram

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 18:34                                         ` Jakub Narebski
  2011-07-22 19:06                                           ` Linus Torvalds
@ 2011-07-22 19:08                                           ` david
  2011-07-22 19:40                                             ` Nicolas Pitre
  1 sibling, 1 reply; 89+ messages in thread
From: david @ 2011-07-22 19:08 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, git,
	Phil Hord, Shawn Pearce, Linus Torvalds

On Fri, 22 Jul 2011, Jakub Narebski wrote:

> On Fri, 22 Jul 2011, David Lang <david@lang.hm> wrote:
>> On Fri, 22 Jul 2011, Nicolas Pitre wrote:
>>> On Fri, 22 Jul 2011, Jakub Narebski wrote:
>>>
>>>> BTW. with storing generation number in commit header there is a problem
>>>> what would old version of git, one which does not understand said header,
>>>> do during rebase.  Would it strip unknown headers, or would it copy
>>>> generation number verbatim - which means that it can be incorrect?
>>>
>>> They would indeed be copied verbatim and become incorrect.
>>
>> how would they become incorrect?
>
> Let's assume that the following history was created with new git, one
> that correcly adds generation number header to commits:
>
>
>  A(1)---B(2)---C(3)---D(4)---E(5)       <-- master
>          \
>           \----x(3)---y(4)---z(5)       <-- foo
>
> The numbers are generation numbers in commit object.
>
> Let's assume that this repository is fetched into repository instance
> that is managed by older git, one that doesn't understand generation
> header.
>
> Then, if we do
>
>  [old]$ git rebase master foo
>
> and if old git _copies_ generation number header _verbatim_, we would
> get:
>
>  A(1)---B(2)---C(3)---D(4)---E(5)                         <-- master
>                               \
>                                \---x'(3)--y'(4)--z'(5)    <-- foo
>
> Those generation numbers are *incorrect*; they should be:
>
>  A(1)---B(2)---C(3)---D(4)---E(5)                         <-- master
>                               \
>                                \---x'(6)--y'(7)--z'(8)    <-- foo
>
>
> That is IF unknown headers are copied verbatim during rebase.  For
> "encoding" header this is a good thing, for "generation" it isn't.

commit headers are _not_ copied during rebase

a rebase is not the exact same commit, it's a "logically equivalent" 
commit.

so when you do a rebase, you change the commit headers (you have to change 
the parent headers in any case, and you would have to change the 
generation numbers as well)

this was discussed earlier in this thread.

David Lang

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 19:08                                           ` david
@ 2011-07-22 19:40                                             ` Nicolas Pitre
  0 siblings, 0 replies; 89+ messages in thread
From: Nicolas Pitre @ 2011-07-22 19:40 UTC (permalink / raw)
  To: david
  Cc: Jakub Narebski, George Spelvin, Anthony Van de Gejuchte, git,
	Phil Hord, Shawn Pearce, Linus Torvalds

On Fri, 22 Jul 2011, david@lang.hm wrote:

> On Fri, 22 Jul 2011, Jakub Narebski wrote:
> 
> > That is IF unknown headers are copied verbatim during rebase.  For
> > "encoding" header this is a good thing, for "generation" it isn't.
> 
> commit headers are _not_ copied during rebase

Yes, this turns out to be true as I forgot that rebase is constructed on 
top of format-patch+am, and format-patch doesn't preserve the ancillary 
headers such as the existing "encoding" header, or the hypothetical 
"generation" header.


Nicolas

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-22 12:18                                   ` Jakub Narebski
  2011-07-22 13:09                                     ` Nicolas Pitre
@ 2011-07-22 18:02                                     ` david
  1 sibling, 0 replies; 89+ messages in thread
From: david @ 2011-07-22 18:02 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: George Spelvin, Anthony Van de Gejuchte, git, Phil Hord,
	Nicolas Pitre, Shawn Pearce, Linus Torvalds

On Fri, 22 Jul 2011, Jakub Narebski wrote:

>> Yes, the cache validity/invalidation criteria are the tricky bit.
>> Honestly, this is where the code gets ugly, not computing and storing
>> the generation numbers.
>
> BTW. with storing generation number in commit header there is a problem
> what would old version of git, one which does not understand said header,
> do during rebase.  Would it strip unknown headers, or would it copy
> generation number verbatim - which means that it can be incorrect?

Linus has already pointed out that this is safe.

old versions won't create generation numbers, but they will ignore them if 
they exist. Since commits are not modified after they are created, the old 
versions don't copy or modify them.

David Lang

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  0:18                       ` david
  2011-07-21  0:37                         ` Shawn Pearce
@ 2011-07-21  0:39                         ` Phil Hord
  1 sibling, 0 replies; 89+ messages in thread
From: Phil Hord @ 2011-07-21  0:39 UTC (permalink / raw)
  To: david; +Cc: Nicolas Pitre, George Spelvin, anthonyvdgent, git, torvalds

On 07/20/2011 08:18 PM, david@lang.hm wrote:
> On Wed, 20 Jul 2011, Phil Hord wrote:
>
>> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
>>> On Wed, 20 Jul 2011, david@lang.hm wrote:
>>>
>>>> If the generation number is part of the repository then it's going to
>>>> be the same for everyone.
>>> The actual generation number will be, and has to be, the same for
>>> everyone with the same repository content, regardless of the cache 
>>> used.
>>> It is a well defined number with no room to interpretation.
>>
>> Nonsense.
>>
>> Even if the generation number is well-defined and shared by all 
>> clients, the only quasi-essential definition is "for each A in 
>> ancestors_of(B), gen(A) < gen(B)".
>>
>> In practice, the actual generation number *will be the same* for 
>> everyone with the same repository content, unless and until someone 
>> develops a different calculation method.  But there is no reason to 
>> require that the number *has to be* the same for everyone unless you 
>> expect (or require) everyone to share their gen-caches.
>
> and I think this is why Linus is not happy with a cache. He is seeing 
> this as something that has significantly more value if it is going to 
> be consistant in a distributed manner than if it's just something 
> calculated locally that can be different from other systems.

It will only be used locally, so it needn't be consistent with anyone 
else's.

>
> if it's just locally generated, then I could easily see generation 
> numbers being different on different people's ssstems, dependin on the 
> order that they see commits (either locally generated or pulled from 
> others)
>
> If it's part of the commit, then as that commit gets propogated the 
> generation number gets propogated as well, and every repository will 
> agree on what the generation number is for any commit that's shared.
>
> I agree that this consistancy guarantee seems to be valuable.

I can't see why.

>> Surely there will be a competent and efficient gen-cache API.  But 
>> most code can just ask if B --contains A or even just use rev-list 
>> and benefit from the increased speed of the answer.  Because most 
>> code doesn't really care about the gen numbers themselves, but only 
>> the speed of determining ancestry.
>
> in that case, why bother with generation numbers at all? the improved 
> data based heristic seems to solve that problem.

Does it?  Surely the ruckus would've died down in that case.  But I 
haven't been reading pu.

It seems to me that the main drawback to a gen-cache is that it slows 
down the first operation after even a local clone (with just hardlinks).

On the other hand, I see too many nails in the distributed-gen-numbers 
coffin:  legacy commits can't catch up (and therefore suffer), and 
legacy clients can trash or corrupt even "new-style" commits.

Phil

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  0:08                     ` Phil Hord
  2011-07-21  0:18                       ` david
@ 2011-07-21  0:58                       ` Nicolas Pitre
  2011-07-21  1:09                         ` Phil Hord
  1 sibling, 1 reply; 89+ messages in thread
From: Nicolas Pitre @ 2011-07-21  0:58 UTC (permalink / raw)
  To: Phil Hord; +Cc: david, George Spelvin, anthonyvdgent, git, torvalds

On Wed, 20 Jul 2011, Phil Hord wrote:

> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
> > On Wed, 20 Jul 2011, david@lang.hm wrote:
> > 
> > > If the generation number is part of the repository then it's going to
> > > be the same for everyone.
> > The actual generation number will be, and has to be, the same for
> > everyone with the same repository content, regardless of the cache used.
> > It is a well defined number with no room to interpretation.
> 
> Nonsense.
> 
> Even if the generation number is well-defined and shared by all clients, the
> only quasi-essential definition is "for each A in ancestors_of(B), gen(A) <
> gen(B)".

Sure.  But what do you gain by making holes in the sequence?

> In practice, the actual generation number *will be the same* for everyone with
> the same repository content, unless and until someone develops a different
> calculation method.  But there is no reason to require that the number *has to
> be* the same for everyone unless you expect (or require) everyone to share
> their gen-caches.

And with the above you clearly reinforced the argument _against_ storing 
the generation number in the commit object.  If you can imagine a 
different calculation method already, and if it is actually useful, then 
who knows if something even better could be done eventually.


Nicolas

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21  0:58                       ` Nicolas Pitre
@ 2011-07-21  1:09                         ` Phil Hord
  0 siblings, 0 replies; 89+ messages in thread
From: Phil Hord @ 2011-07-21  1:09 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: david, George Spelvin, anthonyvdgent, git, torvalds

On 07/20/2011 08:58 PM, Nicolas Pitre wrote:
> On Wed, 20 Jul 2011, Phil Hord wrote:
>
>> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
>>> On Wed, 20 Jul 2011, david@lang.hm wrote:
>>>
>>>> If the generation number is part of the repository then it's going to
>>>> be the same for everyone.
>>> The actual generation number will be, and has to be, the same for
>>> everyone with the same repository content, regardless of the cache used.
>>> It is a well defined number with no room to interpretation.
>> Nonsense.
>>
>> Even if the generation number is well-defined and shared by all clients, the
>> only quasi-essential definition is "for each A in ancestors_of(B), gen(A)<
>> gen(B)".
> Sure.  But what do you gain by making holes in the sequence?

Depends on the algorithm.  Probably speed.  Possibly more efficient 
limited-cache building (jit-style discovery in reverse, as-needed, for 
example).

What do you gain by enforcing contiguousness?  Why not require all gen 
numbers to be even?  Or prime?  ;)

>> In practice, the actual generation number *will be the same* for everyone with
>> the same repository content, unless and until someone develops a different
>> calculation method.  But there is no reason to require that the number *has to
>> be* the same for everyone unless you expect (or require) everyone to share
>> their gen-caches.
> And with the above you clearly reinforced the argument _against_ storing
> the generation number in the commit object.  If you can imagine a
> different calculation method already, and if it is actually useful, then
> who knows if something even better could be done eventually.

Good.  Nice to see I'm being self-consistent, then.

Phil

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-20 23:26                 ` david
  2011-07-20 23:36                   ` Nicolas Pitre
@ 2011-07-21 12:03                   ` Drew Northup
  2011-07-21 12:55                     ` George Spelvin
  1 sibling, 1 reply; 89+ messages in thread
From: Drew Northup @ 2011-07-21 12:03 UTC (permalink / raw)
  To: david; +Cc: George Spelvin, nico, anthonyvdgent, git, torvalds

On Wed, 2011-07-20 at 16:26 -0700, david@lang.hm wrote:
> On Wed, 20 Jul 2011, George Spelvin wrote:
> 
> >> The alternative of having to sometimes use the generation number,
> >> sometimes use the possibly broken commit date, makes for much more
> >> complicated code that has to be maintained forever.  Having a solution
> >> that starts working only after a certain point in history doesn't look
> >> eleguant to me at all.  It is not like having different pack formats
> >> where back and forth conversions can be made for the _entire_ history.
> >
> > It seemed like a pretty strong argument to me, too.
> 
> except that you then have different caches on different systems. If the 
> generation number is part of the repository then it's going to be the same 
> for everyone.

I keep hearing (reading) people stating this utterly unfounded argument.
The fact is that for any work not yet integrated back into a shared
repository it just isn't true--and even after upstream integration the
truth of such a statement may be limited.

I have not read yet one discussion about how generation numbers [baked
into a commit] deal with rebasing, for instance. Do we assign one more
than the revision prior to the base of the rebase operation or do we
start with the revision one after the highest of those original commits
included in the rebase? Depending on how that is done
_drastically_different_ numbers can come out of different repository
instances for the same _final_ DAG. This is one major reason why, as I
see it, local storage is good for generation numbers and putting them in
the commit is bad. 

I have no problem with putting an _advisory_ "revision number" in the
commit. It would not be expected to have a proper "1-to-1 and onto"
functional association with the _final_ DAG, but it could potentially
get us some nice benefits. We would still need to answer questions like
the one I ask above, but it would hurt less to change if we need to.

One other sane option that was mentioned at least once in passing was to
store the generation number in some Git "filesystem-level" object. This
could then be reconciled with each "git gc" or "git fsck" operation if
not more often. This is less ad-hoc and messy than a separate cache,
becomes amenable to the standard tool-set, and always gets updated (no
invalid cache). If an _advisory_ revision number is available in commits
that are sent along those could conceivably be used to help build up the
local git-fs generation numbers more quickly. (If a "git pull" is issued
to our repo, or we push to another, we don't send the generation numbers
locally stored--we expect the git-fs machinery to regenerate those on
the fly.)

I may not be one of the "resident rocket scientists," but that's how I
see it.

-- 
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 12:03                   ` Drew Northup
@ 2011-07-21 12:55                     ` George Spelvin
  2011-07-21 15:57                       ` Drew Northup
  0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-21 12:55 UTC (permalink / raw)
  To: david, drew.northup; +Cc: anthonyvdgent, git, linux, nico, torvalds

> I have not read yet one discussion about how generation numbers [baked
> into a commit] deal with rebasing, for instance. Do we assign one more
> than the revision prior to the base of the rebase operation or do we
> start with the revision one after the highest of those original commits
> included in the rebase? Depending on how that is done
> _drastically_different_ numbers can come out of different repository
> instances for the same _final_ DAG. This is one major reason why, as I
> see it, local storage is good for generation numbers and putting them in
> the commit is bad. 

Er, no.  Whenever a new commit object is generated (as the result
of a rebase or not), its commit number is computed based on its
parent commits.  It is NEVER copied.

Just like the parent pointers themselves.  Remember, even though we talk
about "the same commit" after rebasing, it's really just an EQUIVALENT
commit according to some higher-level concept of similarity.  As far
as the core git engine is concerned, it's always a DIFFERENT commit,
with different parent hashes and a different hash itself.

This point hasn't been mentioned explicltly precisely because it's
so obvious; the history-walking code that the generation numbers are
for requires this property to function.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 12:55                     ` George Spelvin
@ 2011-07-21 15:57                       ` Drew Northup
  2011-07-21 16:24                         ` Phil Hord
  2011-07-21 17:36                         ` George Spelvin
  0 siblings, 2 replies; 89+ messages in thread
From: Drew Northup @ 2011-07-21 15:57 UTC (permalink / raw)
  To: George Spelvin; +Cc: david, anthonyvdgent, git, nico, torvalds


On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:
> > I have not read yet one discussion about how generation numbers [baked
> > into a commit] deal with rebasing, for instance. Do we assign one more
> > than the revision prior to the base of the rebase operation or do we
> > start with the revision one after the highest of those original commits
> > included in the rebase? Depending on how that is done
> > _drastically_different_ numbers can come out of different repository
> > instances for the same _final_ DAG. This is one major reason why, as I
> > see it, local storage is good for generation numbers and putting them in
> > the commit is bad. 
> 
> Er, no.  Whenever a new commit object is generated (as the result
> of a rebase or not), its commit number is computed based on its
> parent commits.  It is NEVER copied.

I don't see the word "copy" in my original. 

B-O1-O2-O3-O4-O5-O6
 \
  R1----R2-------R3

What's the correct generation number for R3? I would say gen(B)+3. My
reading of the posts made by some others was that they thought gen(O6)
was the correct answer. Still others seemed to indicate gen(O6)+1 was
the correct answer. I don't think everybody MEANT to be saying such
different things--that's just how they appeared on this end.

Now, did you mean something different by "commit number?"

-- 
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 15:57                       ` Drew Northup
@ 2011-07-21 16:24                         ` Phil Hord
  2011-07-21 22:40                           ` Pēteris Kļaviņš
  2011-07-21 17:36                         ` George Spelvin
  1 sibling, 1 reply; 89+ messages in thread
From: Phil Hord @ 2011-07-21 16:24 UTC (permalink / raw)
  To: Drew Northup; +Cc: George Spelvin, david, anthonyvdgent, git, nico, torvalds

On 07/21/2011 11:57 AM, Drew Northup wrote:
> On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:
>>> I have not read yet one discussion about how generation numbers [baked
>>> into a commit] deal with rebasing, for instance. Do we assign one more
>>> than the revision prior to the base of the rebase operation or do we
>>> start with the revision one after the highest of those original commits
>>> included in the rebase? Depending on how that is done
>>> _drastically_different_ numbers can come out of different repository
>>> instances for the same _final_ DAG. This is one major reason why, as I
>>> see it, local storage is good for generation numbers and putting them in
>>> the commit is bad.
>> Er, no.  Whenever a new commit object is generated (as the result
>> of a rebase or not), its commit number is computed based on its
>> parent commits.  It is NEVER copied.
> I don't see the word "copy" in my original.
>
> B-O1-O2-O3-O4-O5-O6
>   \
>    R1----R2-------R3
>
> What's the correct generation number for R3? I would say gen(B)+3.
And you would be correct if you follow the SoP algorithm.

> My
> reading of the posts made by some others was that they thought gen(O6)
> was the correct answer. Still others seemed to indicate gen(O6)+1 was
> the correct answer.
Maybe the confusion comes from the different storage mechanisms being 
discussed.  If the generation numbers are in a local cache and used by a 
single client, the determinism of the specific numbers doesn't much 
matter.  If they are part of the commit, it still doesn't need to be 
completely deterministic. However, interoperability requires standards, 
and standards favor determinism, so dogmatic determinism may triumph in 
that case.

1. gen(06) might make sense if you mean to implement --date-order using 
gen-numbers, for example.  But I don't think it's practical in any case.

2. gen(06)+1 might make sense if you mean to require that gen-numbers 
are unique per repo.  But this is both unsupportable and unnecessary, so 
it's a non-starter.

3. gen(B)+1 is what you'd get from the the algorithm I saw proposed.

All three of these are provably correct by my definition of "correct": 
"for each A in ancestors_of(B), gen(A) < gen(B)".

However, [1] and [2] have some extra features of dubious value.  Simpler 
is better for interoperability, so I like [3] for this purpose.

Even [3] has an extra feature I think is unnecessary: determinism.  If 
that "requirement" is dropped, I think all three of these algorithms are 
(functionally) roughly equivalent.

> I don't think everybody MEANT to be saying such
> different things--that's just how they appeared on this end.
>
> Now, did you mean something different by "commit number?"

I remain unconvinced that there is value in gen-number distribution, so 
to my mind, the specific algorithm and whether or not it is 
deterministic are unimportant.

Phil ~ who wasn't really being asked, but felt like answering

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 16:24                         ` Phil Hord
@ 2011-07-21 22:40                           ` Pēteris Kļaviņš
  2011-07-22  9:30                             ` Christian Couder
  0 siblings, 1 reply; 89+ messages in thread
From: Pēteris Kļaviņš @ 2011-07-21 22:40 UTC (permalink / raw)
  To: git

On 21/07/2011 5:24 PM, Phil Hord wrote:
> Maybe the confusion comes from the different storage mechanisms being
> discussed. If the generation numbers are in a local cache and used by a
> single client, the determinism of the specific numbers doesn't much
> matter. If they are part of the commit, it still doesn't need to be
> completely deterministic. However, interoperability requires standards,
> and standards favor determinism, so dogmatic determinism may triumph in
> that case.
>
> 1. gen(06) might make sense if you mean to implement --date-order using
> gen-numbers, for example. But I don't think it's practical in any case.
>
> 2. gen(06)+1 might make sense if you mean to require that gen-numbers
> are unique per repo. But this is both unsupportable and unnecessary, so
> it's a non-starter.
>
> 3. gen(B)+1 is what you'd get from the the algorithm I saw proposed.
>
> All three of these are provably correct by my definition of "correct":
> "for each A in ancestors_of(B), gen(A) < gen(B)".
>
> However, [1] and [2] have some extra features of dubious value. Simpler
> is better for interoperability, so I like [3] for this purpose.
>
> Even [3] has an extra feature I think is unnecessary: determinism. If
> that "requirement" is dropped, I think all three of these algorithms are
> (functionally) roughly equivalent.
>
>> I don't think everybody MEANT to be saying such
>> different things--that's just how they appeared on this end.
>>
>> Now, did you mean something different by "commit number?"
>
> I remain unconvinced that there is value in gen-number distribution, so
> to my mind, the specific algorithm and whether or not it is
> deterministic are unimportant.
>

The beauty of Git is that no two copies of a Git repository as a whole 
are the same:  some people make shallow copies;  others prune away all 
branches except for the one they are interested in;  yet others graft 
together multiple original repositories.  The upshot is that two copies 
of the same repository may end up having different commits as their root 
commits, and so the generation numbers computed for their repositories 
would be different.  Indeed, the shallow repository copy could later be 
filled out with additional underlying commits, and so on.

Given this context, I can't see the value in fixing generation numbers 
within commits.  In my mind generation numbers are extremely useful 
transient helper objects in every Git repository but they have no 
meaning outside that repository, sort of like GIT_WORK_TREE.

Peter

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 22:40                           ` Pēteris Kļaviņš
@ 2011-07-22  9:30                             ` Christian Couder
  0 siblings, 0 replies; 89+ messages in thread
From: Christian Couder @ 2011-07-22  9:30 UTC (permalink / raw)
  To: Pēteris Kļaviņš; +Cc: git

On Fri, Jul 22, 2011 at 12:40 AM, Pēteris Kļaviņš
<klavins@netspace.net.au> wrote:
>
> The beauty of Git is that no two copies of a Git repository as a whole are
> the same:  some people make shallow copies;  others prune away all branches
> except for the one they are interested in;  yet others graft together
> multiple original repositories.  The upshot is that two copies of the same
> repository may end up having different commits as their root commits, and so
> the generation numbers computed for their repositories would be different.
>  Indeed, the shallow repository copy could later be filled out with
> additional underlying commits, and so on.

Not only people want different repos, but with their own repo they
want different "views" (or "virtual graph") of it.

> Given this context, I can't see the value in fixing generation numbers
> within commits.  In my mind generation numbers are extremely useful
> transient helper objects in every Git repository but they have no meaning
> outside that repository, sort of like GIT_WORK_TREE.

It's not even per repository that they have a meaning, it's per "view"
of the commit graph.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-21 15:57                       ` Drew Northup
  2011-07-21 16:24                         ` Phil Hord
@ 2011-07-21 17:36                         ` George Spelvin
  1 sibling, 0 replies; 89+ messages in thread
From: George Spelvin @ 2011-07-21 17:36 UTC (permalink / raw)
  To: drew.northup, linux; +Cc: anthonyvdgent, david, git, nico, torvalds

Drew Northup wrote:
> On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:
>> I have not read yet one discussion about how generation numbers [baked
>> into a commit] deal with rebasing, for instance. Do we assign one more
>> than the revision prior to the base of the rebase operation or do we
>> start with the revision one after the highest of those original commits
>> included in the rebase? Depending on how that is done
>> _drastically_different_ numbers can come out of different repository
>> instances for the same _final_ DAG. This is one major reason why, as I
>> see it, local storage is good for generation numbers and putting them in
>> the commit is bad. 
> 
> Er, no.  Whenever a new commit object is generated (as the result
> of a rebase or not), its commit number is computed based on its
> parent commits.  It is NEVER copied.

> I don't see the word "copy" in my original. 

Indeed, you didn't use it; it was my simplified mental model of your
suggestion that the rebased commits would have generation numbers that
somehow depended on the generation numbers before rebasing.

Althouugh you suggested something different, the mistake is the same:
the rebased commits' generation numbers have simply no relationship to
those of the original pre-rebase commits.  The generation numbers depend
only on the commits explicitly listed as parents in the commit objects.

That's why I went on to explain that the equivalence of the commits
produced by a rebase operation is a higher-level concept; the core git
object database just knows that they aren't identical, and therefore
are different.

Thus, they would retain the same relative order as before the rebase
(unless you permuted them with rebase -i), but start with the generation
number of the rebase target.

> B-O1-O2-O3-O4-O5-O6
>  \
>   R1----R2-------R3

> What's the correct generation number for R3? I would say gen(B)+3. My
> reading of the posts made by some others was that they thought gen(O6)
> was the correct answer. Still others seemed to indicate gen(O6)+1 was
> the correct answer. I don't think everybody MEANT to be saying such
> different things--that's just how they appeared on this end.

According to the canonical algorithm, it's gen(B)+3 = gen(R2)+1.

However, any non-decreasing series is equally permissible for
optimizing history walking, so you could add jumps to (for example)
make the numbers unique if that simplified anything.

I don't think it does simplify anything, so the issue hasn't been
discussed much.

For the purpose of the optimization enabled by the generation
numbers, however, it doesn't actually matter.

What matters is that if I am listing commits down multiple branches,
once I have walked back on each branch to commits of generation N or
less, I know that I have found all possible descendants of all commits
of generation N or more.

This lets me display the recent part of the commit DAG (back to generation
N) without exploring the entire commit treem or worrying that I'll have to
"back up" to insert a commit in its proper order.  Without precomputed
generation numbers, the only way to be sure of this is to explore back
to generation 0 (parentless commits) or to use date-based heuristics.

> Now, did you mean something different by "commit number?"

No, just a bran fart I didn't catch before posting.
I meant "generation number".

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Git commit generation numbers
@ 2011-07-14 18:24 Linus Torvalds
  2011-07-14 18:37 ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 18:24 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano, Jeff King

Ok, so I see that the old discussion about generation numbers has resurfaced.

And I have to say, with six years of git use, I think it's not a
coincidence that the notion of generation numbers has come up several
times over the years: I think the lack of them is literally the only
real design mistake we have.

And I absolutely *detest* the generation number cache thing I see on the list.

Maybe I missed the discussion that actually added them to the commits
(I don't read the git mailing list regularly any more) but I think
it's a mistake to add an external cache to work around the fact that I
didn't add the generation numbers originally.

So I think we should just add the generation numbers now. We can make
the rule be that if a commit doesn't have a generation number, we end
up having to compute it (with no real need for caching). Yes, it's
expensive. But it's going to be a *lot* less expensive over time as
people start using a git version that adds the generation numbers to
commits.

And we can easily mix this - there's no "flag-day" issues. Old
versions of git will ignore the generation number and generate new
commits that doesn't have it. New versions of git will generate them,
and use them. And once the project starts having generation numbers in
some commits, the "generating them" part will get cheaper over time.

I'll send out a patch that admittedly does not have much testing as a
reply to this one. It ends up being really simple. Of course, maybe
it's simple because I did something incredibly stupid, but please take
a look.

                                 Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:24 Linus Torvalds
@ 2011-07-14 18:37 ` Jeff King
  2011-07-14 18:47   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Jeff King @ 2011-07-14 18:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:24:27AM -0700, Linus Torvalds wrote:

> And I have to say, with six years of git use, I think it's not a
> coincidence that the notion of generation numbers has come up several
> times over the years: I think the lack of them is literally the only
> real design mistake we have.

Agreed.

> And I absolutely *detest* the generation number cache thing I see on
> the list.

I'd love to have in-commit generation numbers. I'm just not sure we can
get the speeds we want without caching them for existing commits.

> Maybe I missed the discussion that actually added them to the commits
> (I don't read the git mailing list regularly any more) but I think
> it's a mistake to add an external cache to work around the fact that I
> didn't add the generation numbers originally.
> 
> So I think we should just add the generation numbers now. We can make
> the rule be that if a commit doesn't have a generation number, we end
> up having to compute it (with no real need for caching). Yes, it's
> expensive. But it's going to be a *lot* less expensive over time as
> people start using a git version that adds the generation numbers to
> commits.

I'm not sure that is the best plan. Calculating generation numbers
involves going to all roots. So once you have to find any generation
number, it's going to be expensive, no matter how many recent commits
have generation numbers already in them (but it won't get _more_
expensive as more commits are added; you'll always be traversing from
the commit in question down to the roots).

As we add new commits with generation numbers, we won't need to do a
calculation to get their numbers. But if you are doing something like
"tag --contains", you are going to want to know the generation number of
old tags (otherwise, you can't know whether your cutoff might hit them
or not). IOW, even if we add generation numbers _today_, every "tag
--contains" in linux-2.6 is going to end up traversing from v3.0-rc7
down to the roots to get its generation number (v3.0-rc8 would get an
embedded generation, of course).

So if you aren't going to cache generation numbers, then you might as
well write your traversal algorithm to assume you don't know them for
old commits. Because calculating them needs to touch every ancestor, and
that's probably equivalent to the worst-case for your algorithm.

There's also one other issue with generation numbers. How do you handle
grafts and object-replacement refs?  If you graft history, your embedded
generation numbers will all be junk, and you can't trust them.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:37 ` Jeff King
@ 2011-07-14 18:47   ` Linus Torvalds
  2011-07-14 18:55     ` Linus Torvalds
  2011-07-14 19:08     ` Jeff King
  2011-07-14 18:52   ` Linus Torvalds
  2011-07-14 20:26   ` Junio C Hamano
  2 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 18:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote:
>
> I'd love to have in-commit generation numbers. I'm just not sure we can
> get the speeds we want without caching them for existing commits.

So my argument would be that we'd simply be much better off fixing the
fundamental data structure (which we can), and let it become the
long-term solution.

Now, if *may* turn out that we'd want to have some cache for
generation numbers in commits that don't have them, but I absolutely
think that that should be a "add-on" rather than anything fundamental.
For example, if we just merge the "add generation numbers to the
commit object" logic first, then the "cache" case never really needs
to care about us generating new commits. They simply won't need the
cache.

Also, I suspect that the cache could easily be done as a *small* and
*incomplete* cache, ie you don't need to cache all commits, it would
be sufficient to cache a few hundred spread-out commits, and just know
that "from any commit, the cached commit will be quickly reachable".

> I'm not sure that is the best plan. Calculating generation numbers
> involves going to all roots. So once you have to find any generation
> number, it's going to be expensive, no matter how many recent commits
> have generation numbers already in them (but it won't get _more_
> expensive as more commits are added; you'll always be traversing from
> the commit in question down to the roots).

It only ends up being expensive if the commit has parents that don't
have generation numbers.

That's a fairly short-term problem. For the kernel, for example,
basically no development happens on a base that is older than one or
two releases. So if I (and Greg, with the stable tree) start using my
patch, within a couple of weeks, pretty much all development would
have a generation number in its history.

Sure, sometimes I'd merge from people who based their tree on
something old, and I'd end up calculating it all. But it would get
progressively rarer.

> As we add new commits with generation numbers, we won't need to do a
> calculation to get their numbers. But if you are doing something like
> "tag --contains", you are going to want to know the generation number of
> old tags (otherwise, you can't know whether your cutoff might hit them
> or not). IOW, even if we add generation numbers _today_, every "tag
> --contains" in linux-2.6 is going to end up traversing from v3.0-rc7
> down to the roots to get its generation number (v3.0-rc8 would get an
> embedded generation, of course).

So that could easily be handled by caching. In fact, I suspect that
you could make the cache no associate with a commit ID, but be
associated with the tags and heads. But again, then the cache would be
a "secondary" issue, not something fundamental.

> So if you aren't going to cache generation numbers, then you might as
> well write your traversal algorithm to assume you don't know them for
> old commits.

But that's how our algorithms are *already* written.

So why not have that as the fallback? You get the advantage of
generation numbers only with modern things, but those are the ones you
actually tend to use.

Merge bases are *very* seldom historical, for example.

                     Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:47   ` Linus Torvalds
@ 2011-07-14 18:55     ` Linus Torvalds
  2011-07-14 19:12       ` Jeff King
  2011-07-14 19:46       ` Ted Ts'o
  2011-07-14 19:08     ` Jeff King
  1 sibling, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 18:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:47 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Also, I suspect that the cache could easily be done as a *small* and
> *incomplete* cache, ie you don't need to cache all commits, it would
> be sufficient to cache a few hundred spread-out commits, and just know
> that "from any commit, the cached commit will be quickly reachable".

Put another way: we could do the cache not as a real dynamic entity,
but as something that gets generated at "git clone" time or when
re-packing.

I'm actually much more nervous about a cache being inconsistent than I
would be about having generation numbers in the tree. The latter we
can (and should - but my patch didn't) add a fsck test for, and then
you would never get into some situation where there's some really
subtle issue with merge base calculation due to a corrupt cache.

                    Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:55     ` Linus Torvalds
@ 2011-07-14 19:12       ` Jeff King
  2011-07-14 19:46       ` Ted Ts'o
  1 sibling, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-14 19:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:55:39AM -0700, Linus Torvalds wrote:

> I'm actually much more nervous about a cache being inconsistent than I
> would be about having generation numbers in the tree. The latter we
> can (and should - but my patch didn't) add a fsck test for, and then
> you would never get into some situation where there's some really
> subtle issue with merge base calculation due to a corrupt cache.

Interesting. I'm nervous about that, too, which is why I _favor_ the
cache. Because we calculate the cache ourselves, we know its accurate
according to the parent pointers. If we find a bug, we fix it and bump
the cache version, which forces it to regenerate.

Contrast that with a bogus generation number that makes its way into an
actual commit object. That's there for eternity, just like the commit
timestamp skew we already have. I find it much less likely to happen
than skew in the commit timestamp, if only because generations are a
dirt-simple concept. But it is a case where there is duplicated
information in the actual DAG, and if that information doesn't match up
we are screwed.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:55     ` Linus Torvalds
  2011-07-14 19:12       ` Jeff King
@ 2011-07-14 19:46       ` Ted Ts'o
  2011-07-14 19:51         ` Linus Torvalds
  1 sibling, 1 reply; 89+ messages in thread
From: Ted Ts'o @ 2011-07-14 19:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:55:39AM -0700, Linus Torvalds wrote:
> On Thu, Jul 14, 2011 at 11:47 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Also, I suspect that the cache could easily be done as a *small* and
> > *incomplete* cache, ie you don't need to cache all commits, it would
> > be sufficient to cache a few hundred spread-out commits, and just know
> > that "from any commit, the cached commit will be quickly reachable".
> 
> Put another way: we could do the cache not as a real dynamic entity,
> but as something that gets generated at "git clone" time or when
> re-packing.

Would it be considered evil if we put the generation number in the
pack, but not consider it part of the formal object (i.e., it would be
just a cache, but one that wouldn't change once the pack was created)?

       	      	      	   	    	   	- Ted

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 19:46       ` Ted Ts'o
@ 2011-07-14 19:51         ` Linus Torvalds
  2011-07-14 20:07           ` Jeff King
  2011-07-14 20:08           ` Ted Ts'o
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 19:51 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 12:46 PM, Ted Ts'o <tytso@mit.edu> wrote:
>
> Would it be considered evil if we put the generation number in the
> pack, but not consider it part of the formal object (i.e., it would be
> just a cache, but one that wouldn't change once the pack was created)?

That would actually be a major change to data structures, and would
require some serious surgery and be hard to support in a
backwards-compatible way (think different git versions accessing the
same repository).

Much bigger patch than the one I did.

So it sounds like it would work - and it would probably be a simple
matter of just incrementing the pack version number if you just say
"cannot access the pack with old versions" - but I think it's a really
fragile approach.

                  Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 19:51         ` Linus Torvalds
@ 2011-07-14 20:07           ` Jeff King
  2011-07-14 20:08           ` Ted Ts'o
  1 sibling, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-14 20:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ted Ts'o, Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 12:51:39PM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 12:46 PM, Ted Ts'o <tytso@mit.edu> wrote:
> >
> > Would it be considered evil if we put the generation number in the
> > pack, but not consider it part of the formal object (i.e., it would be
> > just a cache, but one that wouldn't change once the pack was created)?
> 
> That would actually be a major change to data structures, and would
> require some serious surgery and be hard to support in a
> backwards-compatible way (think different git versions accessing the
> same repository).

If we put it in the index, but not the pack, then it wouldn't be any
more painful than pack index v2. I don't recall there being huge fallout
from that; we just gave a reasonable deprecation period before switching
it on as the default.

I'm not sure it is much less crappy than having the cache in a separate
file. It does take less space, since the pack index already contains all
of the sha1s. But if we don't like the on-the-fly writing of what was in
my series, it would not be hard to generate the same cache during
pack-index time. Not having it in a separate file makes it hard to
invalidate the cache when the graph changes (due to grafts or replace
refs). But maybe we don't care about that. Or maybe it's OK to tell the
user to manually rebuild the pack index if they tweak those features.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 19:51         ` Linus Torvalds
  2011-07-14 20:07           ` Jeff King
@ 2011-07-14 20:08           ` Ted Ts'o
  1 sibling, 0 replies; 89+ messages in thread
From: Ted Ts'o @ 2011-07-14 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 12:51:39PM -0700, Linus Torvalds wrote:
> 
> So it sounds like it would work - and it would probably be a simple
> matter of just incrementing the pack version number if you just say
> "cannot access the pack with old versions" - but I think it's a really
> fragile approach.

So if we ever change the pack format again, it's something to think
about adding, but probably not worth it on its own...

What if we simply have a cache file per pack, which again is generated
when the pack is first received or generated, but is otherwise not
dynamic?  It's an extra file which is icky, but it would keep things
simpler.

							- Ted

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:47   ` Linus Torvalds
  2011-07-14 18:55     ` Linus Torvalds
@ 2011-07-14 19:08     ` Jeff King
  2011-07-14 19:23       ` Linus Torvalds
  1 sibling, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-14 19:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:47:45AM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote:
> >
> > I'd love to have in-commit generation numbers. I'm just not sure we can
> > get the speeds we want without caching them for existing commits.
> 
> So my argument would be that we'd simply be much better off fixing the
> fundamental data structure (which we can), and let it become the
> long-term solution.
> 
> Now, if *may* turn out that we'd want to have some cache for
> generation numbers in commits that don't have them, but I absolutely
> think that that should be a "add-on" rather than anything fundamental.
> For example, if we just merge the "add generation numbers to the
> commit object" logic first, then the "cache" case never really needs
> to care about us generating new commits. They simply won't need the
> cache.

Sure, I'd be fine with that (modulo the graft issue, which you don't
seem to care about). I half-toyed with making an extra "add generation
numbers to commit header" on top of my series, but I wanted to first
prove that generation numbers actually could yield speedups.

> Also, I suspect that the cache could easily be done as a *small* and
> *incomplete* cache, ie you don't need to cache all commits, it would
> be sufficient to cache a few hundred spread-out commits, and just know
> that "from any commit, the cached commit will be quickly reachable".

Yeah, that would work. Is it worth the trouble? Your cache size is still
O(n). And you still have the complexity of _having_ a cache.  Yes, the
size is 1/100th of what it was (dropping from 6M to 600K on linux-2.6).
But you're also going to spend more time calculating. I think you'd have
to measure to see how it performs in practice.

> It only ends up being expensive if the commit has parents that don't
> have generation numbers.
> 
> That's a fairly short-term problem. For the kernel, for example,
> basically no development happens on a base that is older than one or
> two releases. So if I (and Greg, with the stable tree) start using my
> patch, within a couple of weeks, pretty much all development would
> have a generation number in its history.

Sure, that makes generation during commit-time cheaper, and eventually
the cost just goes away. I'm more concerned that it won't actually speed
up algorithms where you look at old commits, which was the whole point
in the first place.

> > As we add new commits with generation numbers, we won't need to do a
> > calculation to get their numbers. But if you are doing something like
> > "tag --contains", you are going to want to know the generation number of
> > old tags (otherwise, you can't know whether your cutoff might hit them
> > or not). IOW, even if we add generation numbers _today_, every "tag
> > --contains" in linux-2.6 is going to end up traversing from v3.0-rc7
> > down to the roots to get its generation number (v3.0-rc8 would get an
> > embedded generation, of course).
> 
> So that could easily be handled by caching. In fact, I suspect that
> you could make the cache no associate with a commit ID, but be
> associated with the tags and heads. But again, then the cache would be
> a "secondary" issue, not something fundamental.

Yeah, you could do that. And it would handle "tag --contains" and
"branch --contains" (the latter doesn't even really need a cache; as the
branch tips move, they will get new commits with generation numbers). I
suspect we could get faster topo-sorting and possibly faster merge-base
calculation out of generation numbers, too.  But that won't happen if we
only have generation numbers for a handful of specific commits.

> > So if you aren't going to cache generation numbers, then you might as
> > well write your traversal algorithm to assume you don't know them for
> > old commits.
> 
> But that's how our algorithms are *already* written.

Sort of. We tend to rely on commit timestamps as a proxy for generation
numbers. But in the face of clock skew, git will give wrong answers
(e.g., Ted posted some examples of name-rev giving wrong answers near
some skew in linux-2.6).

If we aren't going to go whole-hog on generation numbers, I'm much more
tempted to simply keep using commit timestamps. It's easy to build a
cache of commits with bogus timestamps (which I've already posted a
patch for) if you want to better accuracy at the cost of more
complexity. And as time progresses, you tend to ask about commits near
the skewed ones less often (and hopefully lessons learned from seeing
how the skew occurred will help us prevent them from reocurring in new
commits).

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 19:08     ` Jeff King
@ 2011-07-14 19:23       ` Linus Torvalds
  2011-07-14 20:01         ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 19:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 12:08 PM, Jeff King <peff@peff.net> wrote:
>
> If we aren't going to go whole-hog on generation numbers, I'm much more
> tempted to simply keep using commit timestamps.

Sure. I think it's entirely reasonable to say that the issue basically
boils down to one git question: "can commit X be an ancestor of commit
Y" (as a way to basically limit certain algorithms from having to walk
all the way down). We've used commit dates for it, and realistically
it really has worked very well. But it was always a broken heuristic.

So yes, I personally see generation counters as a way to do the commit
date comparisons right. And it would be perfectly fine to just say "if
there are no generation numbers, we'll use the datestamps instead, and
know that they could be incorrect".

That "use the datestamps" fallback thing may well involve all the
heuristics we already do (ie check for the stamps looking sane, and
not trusting just one individual one).

                           Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 19:23       ` Linus Torvalds
@ 2011-07-14 20:01         ` Jeff King
  2011-07-14 20:19           ` Linus Torvalds
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-14 20:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 12:23:31PM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 12:08 PM, Jeff King <peff@peff.net> wrote:
> >
> > If we aren't going to go whole-hog on generation numbers, I'm much more
> > tempted to simply keep using commit timestamps.
> 
> Sure. I think it's entirely reasonable to say that the issue basically
> boils down to one git question: "can commit X be an ancestor of commit
> Y" (as a way to basically limit certain algorithms from having to walk
> all the way down). We've used commit dates for it, and realistically
> it really has worked very well. But it was always a broken heuristic.

Yeah, I agree with that.

> So yes, I personally see generation counters as a way to do the commit
> date comparisons right. And it would be perfectly fine to just say "if
> there are no generation numbers, we'll use the datestamps instead, and
> know that they could be incorrect".

In that case, is it really worth adding generation numbers to the cache?
Because they _can_ be wrong, too. I suspect they will be wrong less
often than commit timestamps, if only because they're dirt simple to
calculate. But all it takes is some crappy porcelain doing:

  git cat-file commit $foo |
  munge_the_parents |
  git hash-object -t commit --stdin -w

to give us a bogus object. Sure, we can catch it via fsck. But we could
also catch commit timestamp skew via fsck just as easily.

> That "use the datestamps" fallback thing may well involve all the
> heuristics we already do (ie check for the stamps looking sane, and
> not trusting just one individual one).

Those aren't foolproof, of course. I asked people a few months ago to
run my skew-detection program on various repos, and some repos have long
runs of skew (think somebody with a bad clock or a bogus program doing a
whole series). But they're fast and work OK in practice. We should apply
them more consistently (name-rev, for example, will tolerate a day of
skew, but will not look past a single commit).

And if people really want to be thorough, we can mark the skewed commits
in a cache during "git gc" for them (or they can just say "for this
traversal, I want to be thorough; turn off timestamp cutoffs").

Out of curiosity, what don't you like about the generation cache? The
idea of using external storage? Generating it on the fly? The particular
implementation is too slow or crappy?

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 20:01         ` Jeff King
@ 2011-07-14 20:19           ` Linus Torvalds
  2011-07-14 20:31             ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 20:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

Jeff King <peff@peff.net> wrote:
>
>Out of curiosity, what don't you like about the generation cache?

The thing I hate about it is very fundamental: I think it's a hack around a basic git design mistake. And it's a mistake we have known about for a long time.

Now, I don't think it's a *fatal* mistake, but I do find it very broken to basically say "we made a mistake in the original commit design, and instead of fixing it we create a separate workaround for it".

THAT I find distasteful. My reaction is that if we're going to add generation numbers, then were should just do it the way we should have done them originally, rather than as some separate hack.

See? That's why I wouldn't have any problem with adding a separate cache on top of it, if it's really required, but I would hope that it isn't really needed.

So a cache in itself is not necessarily wrong. But leaving the original design mistake in place IS.

And fixing it really ended up being a very tiny patch, no?

     Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 20:19           ` Linus Torvalds
@ 2011-07-14 20:31             ` Jeff King
  2011-07-15  1:19               ` Linus Torvalds
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-14 20:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 01:19:51PM -0700, Linus Torvalds wrote:

> >Out of curiosity, what don't you like about the generation cache?
> 
> The thing I hate about it is very fundamental: I think it's a hack
> around a basic git design mistake. And it's a mistake we have known
> about for a long time.
> 
> Now, I don't think it's a *fatal* mistake, but I do find it very
> broken to basically say "we made a mistake in the original commit
> design, and instead of fixing it we create a separate workaround for
> it".
> 
> THAT I find distasteful. My reaction is that if we're going to add
> generation numbers, then were should just do it the way we should have
> done them originally, rather than as some separate hack.
> 
> See? That's why I wouldn't have any problem with adding a separate
> cache on top of it, if it's really required, but I would hope that it
> isn't really needed.
> 
> So a cache in itself is not necessarily wrong. But leaving the
> original design mistake in place IS.

Thanks, that makes some sense to me.

However, I'm not 100% convinced leaving generation numbers out was a
mistake. The git philosophy seems always to have been to keep the
minimal required information in the DAG. And I think that has served us
well, because we're not saddled with cruft that seemed like a good idea
early on, but isn't.

Generation numbers are _completely_ redundant with the actual structure
of history represented by the parent pointers. Having them in there is
not about giving git more information that it doesn't have, but about
being a cheap place to stuff a value that is a little expensive to
calculate.

And so that seems a bit hack-ish to me.

I liken it somewhat to the "don't store renames" debate. We don't want
to crystallize forever in the history whatever crappy rename-detection
algorithm is done at the time of commit. We put the minimum amount of
information in the DAG, and it's the runtime's responsibility to get the
answer.

I think the decision is a little more gray with generation numbers,
because it's not about "you got this information with a wrong and crappy
algorithm" like it might be with rename detection, but rather "we're
sticking this redundant number in the commit object, and we assume that
it will always be useful enough to future algorithms to merit being
here".

> And fixing it really ended up being a very tiny patch, no?

Well, yes. But it also doesn't yield a 100-fold speedup in "git tag
--contains" for existing repositories. So it's not quite a full
solution.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 20:31             ` Jeff King
@ 2011-07-15  1:19               ` Linus Torvalds
  2011-07-15  2:41                 ` Geert Bosch
                                   ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15  1:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <peff@peff.net> wrote:
>
> However, I'm not 100% convinced leaving generation numbers out was a
> mistake. The git philosophy seems always to have been to keep the
> minimal required information in the DAG.

Yes.

And until I saw the patches trying to add generation numbers, I didn't
really try to push adding generation numbers to commits (although it
actually came up as early as July 2005, so the "let's use generation
numbers in commits" thing is *really* old).

In other words, I do agree that we should strive for minimal required
information.

But dammit, if you start using generation numbers, then they *are*
required information. The fact that you then hide them in some
unarchitected random file doesn't change anything! It just makes it
ugly and random, for chrissake!

I really don't understand your logic that says that the cache is
somehow cleaner. It's a random hack! It's saying "we don't have it in
the main data structure, so let's add it to some other one instead,
and now we have a consistency and cache generation problem instead".

Just look at the size of the patches in question. Your caching patches
are bigger and more complicated. Sure, part of it is that your series
adds the code to _use_ the generation number, but look purely at the
code to maintain them.

Why do you think the odd separate cache is somehow better than just
doing it right? Seriously? If we require the generation numbers, then
they have *become* that minimal information that we should save!

 And I think that has served us
> well, because we're not saddled with cruft that seemed like a good idea
> early on, but isn't.

Again - we discussed adding generation numbers about 6 years ago. We
clearly *should* have done it. Instead, we went with the hacky "let's
use commit time", that everybody really knew was technically wrong,
and was a hack, but avoided the need.

Now, six years later, you clearly are saying that we need the
generation numbers, but then you go off and try to say that they
should be in some secondary non-architected random collection of data
structures that isn't covered by the security and maintenance
guarantees that the core git objects are.

Dammit, one of the things that makes git special is that the data
structures are NOT random odd ad-hoc files. There is a design to them.

> Generation numbers are _completely_ redundant with the actual structure
> of history represented by the parent pointers.

Not true. That's only true if you add ".. if you parse the whole
history" to that statement.

And we've *never* parsed the whole history, because it's just too
expensive and doesn't scale. So right now we depend on commit dates
with a few hacks.

So no, generation numbers are not at all redundant. They are
fundamental. It's why we had this discussion six years ago.

> And so that seems a bit hack-ish to me.

Um? If you feel that way, then why the hell are you pushing your EVEN
MORE HACKISH CACHE PATCHES?

That's what this really boils down to. I think that if we have a value
that we need, then it should be recorded. In the data structures. Not
in some random other location that isn't part of the real git data
structures.

We don't do caches in git, because we don't NEED to. Sure, gitk has
it's hacky cache, but that's not core functionality.

I think it's a sign of good design that we can do a "find .git" and
explain every single file, and show that it's all core functionality
(again, with the exception of "gitk.cache", and I suspect that's
because gitk is a script, not because of any really fundamental data
issues), and explain it.

I think the *cache* is a hell of a lot more hacky than just doing it right.

> I liken it somewhat to the "don't store renames" debate.

That's total and utter bullshit.

Storing renames is *wrong*. I've explained a million times why it's
wrong. Doing it is a disaster. I know. I've used systems that did it.
It's crap. It's fundamentally information that is actively misleading
and WRONG. It's not even that you can do rename detection at run-time,
it's that you *HAVE* to do rename detection at run-time, because doing
it at commit time is simply utterly and fundamentally *wrong*.

Just look at "git blame -C" to remind yourself why rename information is wrong.

But even more importantly, look at git merges. Look at how git has
gotten merging right since pretty much day #1, and has absolutely no
issues with files that got generated two different ways. Look at every
SCM that tries to do rename detection, and look at how THEY CANNOT DO
MERGES RIGHT.

It's that simple. Rename detection is not about avoiding "redundant
data". It's about doing the right thing.

                          Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15  1:19               ` Linus Torvalds
@ 2011-07-15  2:41                 ` Geert Bosch
  2011-07-15  7:46                 ` Jeff King
  2011-07-15  9:12                 ` Jakub Narebski
  2 siblings, 0 replies; 89+ messages in thread
From: Geert Bosch @ 2011-07-15  2:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Jul 14, 2011, at 21:19, Linus Torvalds wrote:
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

Generation numbers never will be required information, because we
can always compute them. These numbers are really much more similar 
to other pack index information than anything else.

<aside>
Sometimes I wish we'd have general "depth" information for each
SHA1, which would be the maximum number of steps in the DAG to reach
a leaf. This way, if we want to do something like "git log
drivers/net/slip.c", we don't have to bother reading the majority
of trees that have a depth less than two. The depth can also be used
as a limiter for "contains" operations, where we want to see if
commit X contains commit Y: depth (X) has to be at least depth (Y).

However, any such notion, wether generation or depth or whatever
else we'll think of tomorrow, is something particular to a certain
implementation of git. It does not add anything to the information
we stored.
</aside>

I don't think my commit should have a different SHA1 from yours,
because your tree has a more generation numbers than mine.

The beauty and genius of GIT is that it just takes the minimum
amount of data needed to uniquely identify the information to be
stored, and stores that in a UNIQUE format. By allowing generation
numbers to either be present or absent, that's all broken.

It's like computing the SHA1 of compressed data: it doesn't depend
on the data we store, just about the particular representation we
choose. Fortunately we have done away with the first mistake.

So, if you're going to add generation numbers, there has to be a
flag day, after which generation numbers are required everywhere. 
Of course it would be possible to recognize "old style" commits 
and convert them on the fly, but that is true for pretty much 
any format change. However, adding redundant information seems 
like a poor excuse for having a flag day.

Storing generation data in pack indices on the other hand makes
perfect sense: when we generate these indices, we do complete
traversals and have all required information trivially at hand.  We
can never have that many loose objects, so lack of generation
information there isn't a big deal. By storing generation information
in the index, we can be sure it is consistent with the data contained
in the pack, so there are no cache invalidation issues.

I know I must have missed some stupid and obvious reason why
this is all wrong, I just don't quite see it yet.

  -Geert

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15  1:19               ` Linus Torvalds
  2011-07-15  2:41                 ` Geert Bosch
@ 2011-07-15  7:46                 ` Jeff King
  2011-07-15 16:10                   ` Linus Torvalds
  2011-07-15  9:12                 ` Jakub Narebski
  2 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-15  7:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 06:19:30PM -0700, Linus Torvalds wrote:

> Yes.
> 
> And until I saw the patches trying to add generation numbers, I didn't
> really try to push adding generation numbers to commits (although it
> actually came up as early as July 2005, so the "let's use generation
> numbers in commits" thing is *really* old).
> 
> In other words, I do agree that we should strive for minimal required
> information.
> 
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

So you don't see a difference between storing the information directly
in the commit object, where it affects the sha1 of the commit, and
calculating and storing it somewhere else? That is what seems ungit to
me. You aren't adding new information to the DAG (note I said "DAG" and
not commit) that is not already there, but you are changing the ids of
commits in the DAG.

I'm not saying that's a reason to ultimately reject the idea of putting
generation numbers in commit objects. But it is a reason to give us
pause and figure out if there are other solutions, because it will be
the first time such redundant information has been added. And that's
what I've been trying to do during this discussion with you: work out
what the options are and evaluate them.

> I really don't understand your logic that says that the cache is
> somehow cleaner. It's a random hack! It's saying "we don't have it in
> the main data structure, so let's add it to some other one instead,
> and now we have a consistency and cache generation problem instead".

Are packfiles unclean, or a random hack? How about pack indices? What
about Nico's and Shawn's ideas for a packv4 that would gain efficiency
by storing objects not in their whole format, but in a way that would
make tree examination faster (but would be able to restore the whole
objects byte for byte)?

Those things rely on the idea that the git DAG is a data model that we
present to the user, but that we're allowed to do things behind the
scenes to make things faster. We're allowed to make an index of offsets
of objects in the packfile for faster lookup. Why are we not allowed to
use an index for other object data if it will speed up our local
algorithms?

Again, I'm not saying that the patches I posted are necessarily the
answer. Maybe my cache implementation sucks. Maybe the value should go
into a pack index instead. Maybe the whole idea is stupid. But I don't
think it's worth rejecting out-of-hand the idea that the generation
number might be stored outside of the commit object. I do think it's
worth talking about what the actual downsides are, as compared to other
options.

For example, you mentioned there a consistency problem in the paragraph
above. What is it?

If you mean the problem with refs/replace, then yes, that is an open
problem to be solved (though not a hard one, as I already mentioned a
solution elsewhere). But is that problem better or worse with this
solution versus an embedded generation number? It seems to me that an
embedded generation number is even worse.

> Just look at the size of the patches in question. Your caching patches
> are bigger and more complicated. Sure, part of it is that your series
> adds the code to _use_ the generation number, but look purely at the
> code to maintain them.

It's 300 lines of code. That can also be used to store arbitrary
meta-information for commits. I've already achieved significant speedups
in some workflows by caching patch-id calculations, which would reuse
this code.  And I certainly don't think _that_ should go into the commit
object.

Yes, it's more complex than simply adding a generation number to the
commit header. But simply adding a generation number does not actually
give the 100-fold speedup I'm seeing. So again, I'm not interested in
rejecting solutions out of hand; I'm interested in things like: is the
complexity of the cache worth this speedup? What other options do we
have, and what speedup do they provide? Do we care enough about this
speedup to even bother?

> Why do you think the odd separate cache is somehow better than just
> doing it right? Seriously? If we require the generation numbers, then
> they have *become* that minimal information that we should save!

What do you do when generation numbers don't match the DAG represented
by the parent pointers? Are you proposing to just ignore it? I'm not
asking that question adversarially; ignoring may be the sane thing to
do, and we say "generation numbers are to be trusted, even if they don't
match parent pointers".

> Now, six years later, you clearly are saying that we need the
> generation numbers, but then you go off and try to say that they
> should be in some secondary non-architected random collection of data
> structures that isn't covered by the security and maintenance
> guarantees that the core git objects are.

I don't think I said we clearly need them. I said we can get speedups by
using them, and I showed some patches. I _also_ posted patches showing
how to accomplish similar speedups using timestamps. Note that all of my
patches started with "RFC". I am trying to figure out which is the best
way to proceed.

And why _would_ they need to be covered by the security and maintenance
guarantees of core objects? You can trivially calculate them from the
core objects. Are pack indices also a "secondary non-architected random
collection of data structures"?

> Dammit, one of the things that makes git special is that the data
> structures are NOT random odd ad-hoc files. There is a design to them.

There is just as much documentation and design for the new file format I
added as there is for pack indices (in fact, they're quite similar in
design). I really see them at the same level: something we calculate to
speed up some algorithms, but something we could regenerate at any time
if we felt like.

> > And so that seems a bit hack-ish to me.
> 
> Um? If you feel that way, then why the hell are you pushing your EVEN
> MORE HACKISH CACHE PATCHES?

Please, there is really no need to shout. And I find it quite silly that
you would refer to me as "pushing" these patches when they have been
clearly listed as RFC, and everything I have posted in the nearby
threads has been about comparing different strategies (with patches and
timings for some of those other strategies!).

> We don't do caches in git, because we don't NEED to. Sure, gitk has
> it's hacky cache, but that's not core functionality.

I'm sorry to tell you that there is already a cache for external
conversion of diffs for blobs. And that I have a patch series which
makes "git cherry" much more pleasant to use by caching patch ids.

Do we "need" those? No, of course not. Git works just fine without them,
albeit a bit slower. But is it sometimes worth making a space-time
tradeoff to make some algorithms faster? I think it sometimes is,
depending on the space and time factors, and the complexity of the
storage (e.g., consistency problems with caching).

> I think it's a sign of good design that we can do a "find .git" and
> explain every single file, and show that it's all core functionality
> (again, with the exception of "gitk.cache", and I suspect that's
> because gitk is a script, not because of any really fundamental data
> issues), and explain it.

Would it make you happier if we stored the generation data in the pack
index when we index the packs?

> I think the *cache* is a hell of a lot more hacky than just doing it right.

You still haven't explained how we would "do it right" and get the same
speedups. When I responded to your initial email, your answers were
along the lines of "we could cache fewer things". If your position is to
damn the speedup, the cache is not worth the complexity, I can buy that.
If your position is that the complexity is not worth it, and we are
better off to keep using timestamps, I can buy that. If your position is
that you can find a clever way, using only the generation numbers in
newly created commits, to get similar speedups in "git {tag,branch}
--contains", I'd love to hear it.

> > I liken it somewhat to the "don't store renames" debate.
> 
> That's total and utter bullshit.
> 
> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.

Yes, I am well aware that stored renames are wrong for merging. The
problem is that they not a function of a tree state (which is what a
commit stores), but rather of the difference between two states. So when
you diff the commit's state with some other arbitrary merge-base, any
renames recorded at commit time would be worthless.

But consider another case. Each time I run "git log -M --raw", I compute
the same renames over and over. Let's say I have a case in which this is
annoyingly slow, and want to speed it up. The state of a particular
commit and the state of its parents are invariants for a particular sha1
commit id; this is a fundamental property of git, as you well know. So
for a given rename-detection algorithm (and any parameters it has), the
set of renames between the states will also be an invariant.

Now imagine I create a persistent cache mapping the commit sha1 for some
sane default set of rename algorithm parameters to a set of rename
pairs. My annoyingly slow "log -M" is now faster, and I'm happier.

I think you encounter a similar set of questions here as you do with the
concept of a generation header. If the information is an invariant for a
particular commit sha1, can we and should we store it in the commit
object? Is the speedup worth the complexity of a cache? What are the
circumstances under which the cache is not applicable, and how often do
they come up? Can we accurately detect when the cache is not applicable?

And that is why I compared it to the idea of storing renames. Please
note that I did _not_ say they were exactly the same situation, or that
the answers to one set of questions were the same as the answers to
another.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15  7:46                 ` Jeff King
@ 2011-07-15 16:10                   ` Linus Torvalds
  2011-07-15 16:18                     ` Shawn Pearce
  2011-07-15 19:48                     ` Jeff King
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 16:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 12:46 AM, Jeff King <peff@peff.net> wrote:
>
> So you don't see a difference between storing the information directly
> in the commit object, where it affects the sha1 of the commit, and
> calculating and storing it somewhere else?

Sure, I see the difference. And I think it's uglier to have two
different places for required information.

>  That is what seems ungit to
> me. You aren't adding new information to the DAG (note I said "DAG" and
> not commit) that is not already there, but you are changing the ids of
> commits in the DAG.

Umm. It's redundant, but so what? We have tons of redundant
information in there already. Those commits are very explicitly using
a 40-byte ASCII representation of the 20-byte SHA1 names. The very
original deeper object structure is also redundant: we repeat the
object size in the object itself, even though it's part of the
implicit object format itself.

We also very purposefully repeat the type of the object there, even
though the type is basically always redundant (in fact, the core git
functions require you to give the type of the object as part of the
lookup, and will error out if the SHA1 points to the wrong type). That
was one of my original design decisions, exactly because I wanted the
redundancy for verification.

Redundancy isn't a problem. It's a source of sanity checking.

I'm not seeing why you are harping on it.

I think it's much worse to have the same information in two different
places where it can cause inconsistencies that are hard to see and may
not be repeatable. If git ever finds the wrong merge base (because,
say, the generation numbers are wrong), I want it to be a *repeatable*
thing. I want to be able to repeat on the git mailing list "hey, guys,
look at what happens when I try to merge commits ABC and XYZ". If you
go "yeah, it works for me", then that is bad.

What I tried very hard to do in the git data structures is to make
them (a) immutable (so the DAG could never have two-way links, for
example) and (b) "simple".

Right now, we do *have* a "generation number". It's just that it's
very easy to corrupt even by mistake. It's called "committer date". We
could improve on it.

> Are packfiles unclean, or a random hack? How about pack indices?

No. Neither of them are unclean or random. The original git design was
very much about thinking of the object space as a "filesystem". Now,
the original object layout actually used the native OS filesystem, and
I naively thought that would be ok. Using  aspecialized filesystem
instead doesn't really change anything. It's not fundamentally
different from the difference between running git on ext3 or btrfs or
nfs or whatever. In fact, I think we've had more filesystem-related
bugs wrt NFS than we've had with pack-files.

The pack indices are actually kind of ugly - and I would have
preferred having them in the same file instead of having the worry of
consistency across two different files. They *are* the kind of thing
that could cause local inconsistency, but they are fairly simple, and
they have some serious protection in them (ie they aren't just SHA1'd
in themselves, they contain a SHA1 of the pack-file they index in them
to make sure that any inconsistency is findable). Again, that's
"redundancy". But I consider the packfile/index to be just a
filesystem. It really fundamentally *is* that.

Partly for that reason, I do think that if the generation count was
embedded in the pack-file, that would not be an "ugly" decision. The
pack-files have definitely become "core git data structures", and are
more than just a local filesystem representation of the objects:
they're obviously also the data transport method, even if the rules
there are slightly different (no index, thank god, and incomplete
"thin" packs).

That said, I don't think a generation count necessarily "fits" in the
pack-file. They are designed to be incremental, so it's not very
natural there. But I do think it would be conceptually prettier to
have the "depth of commit" be part of the "filesystem" data than to
have it as a separate ad-hoc cache.

> Those things rely on the idea that the git DAG is a data model that we
> present to the user, but that we're allowed to do things behind the
> scenes to make things faster.

.. and that is relevant to this discussion exactly *how*?

It's not. It's totally irrelevant. I certainly would never walk away
from the DAG model. It's a fundamental git decision, and it's the
correct one.

But it all boils down to one simple issue: we should have added
generation counts back in 2005. It's likely the *one* data format
decision that I regret. Using commit dates was wrong. Everybody knew
it was wrong, but we ended up going with it just to keep the format
constant.

If I had realized how small the patch was to add generation counters,
and that it wouldn't have broken backwards compatibility (ie fsck
doesn't start complaining). I would have done it originally, instead
of all the crazy hacks we did for commit date verification.

And that is what this discussion fundamentally boils down to for me.

If we should have fixed it in the original specification, we damn well
should fix it today. It's been "ignorable" because it's just not been
important enough. But if git now adds a fundamental cache for them,
then that information is clearly no longer "not important enough".

                               Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 16:10                   ` Linus Torvalds
@ 2011-07-15 16:18                     ` Shawn Pearce
  2011-07-15 16:44                       ` Linus Torvalds
  2011-07-15 19:48                     ` Jeff King
  1 sibling, 1 reply; 89+ messages in thread
From: Shawn Pearce @ 2011-07-15 16:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 09:10, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Right now, we do *have* a "generation number". It's just that it's
> very easy to corrupt even by mistake. It's called "committer date". We
> could improve on it.
...
> If I had realized how small the patch was to add generation counters,
> and that it wouldn't have broken backwards compatibility (ie fsck
> doesn't start complaining). I would have done it originally, instead
> of all the crazy hacks we did for commit date verification.

What about going forward making the requirement that a new commit must
have a committer date whose date is >= the maximum date of its
parents?

We could also add a check during fast-forward merges to refuse to
perform the merge if the incoming commit has a committer date too far
forward in the future (e.g. more than 5 minutes). If you pull from a
moron whose system clock is set such that the committer date isn't a
proxy for generation number, Git would just refuse the merge, and you
could ask them to fix their objects.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 16:18                     ` Shawn Pearce
@ 2011-07-15 16:44                       ` Linus Torvalds
  2011-07-15 18:42                         ` Ted Ts'o
  2011-07-15 18:46                         ` Tony Luck
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 16:44 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 9:18 AM, Shawn Pearce <spearce@spearce.org> wrote:
>
> What about going forward making the requirement that a new commit must
> have a committer date whose date is >= the maximum date of its
> parents?

So you suggest just making commit dates be the generation number.

I'd be ok with that. It's basically what we've been doing for the last
six years.

But in that case, we shouldn't be doing the generation count cache either.

Btw, I do agree that we probably should add a warning for the case
("your clock is wrong - your commit date is before the commit date of
your parents") and maybe require the use of "-f" or something to
override it. That would certainly be a good thing quite independently
of anything else. So regardless of generation counts, it's probably
worth it.

But if you think commit date is good enough for generation counts -
and I'm not arguing against it - then please tell me why you would
then want to have a separate generation count cache.

So I would like to repeat: I think our commit-date based hack has been
pretty successful. We've lived with it for years and years. Even the
"let's try to fix it by adding slop" code is from three years ago
(commit 7d004199d1), which means that for three years we never really
saw any serious problems. I forget what problem we actually did see -
I have this dim memory of it being Ted that had problems with a merge
because git picked a crap merge base, but that may just be my
Alzheimer's speaking.

Obviously there are cases where we miss some merge base and it doesn't
really end up mattering, so we may well have a *ton* of commits that
have bad dates, but they just haven't affected us enough for us to
care. That's fine too - I dislike how our algorithm isn't truly
reliable, but at the same time I think we're so robust that it all
works regardless.

So I think it's ugly and fairly hacky, but it has worked well enough
in practice. I dislike our commit dates, but I don't _hate_ them. I do
think it was a mistake, but not one I'm especially ashamed of.

So why do I dislike the generation count cache so much? I dislike it
exactly because

  "if the commit date isn't good enough, then dammit, we should have
just added a generation count".

And if we should have added it six years ago, then we should add it
today. Not say "oh, we made a mistake six years ago, let's work around
the mistake instead of fixing it".

That's really what it boils down to. Let's not paper over a mistake.
Either we need the generation depth or we don't. And if we do need it,
we should replace the date-based hackery with it (where "replace" may
well be "still fall back on our traditional date-based hackery in the
absense of generation counters").

But if we decide that we don't really need generation counters AT ALL,
and can just continue with the commit date hack, then I'm personally
ok with that too.

So to me, it's a "either or" situation. Either the commit dates are
good enough, or we should add generation counts to the commits.

But in *neither* case is it ok to do some external cache to work around it.

                          Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 16:44                       ` Linus Torvalds
@ 2011-07-15 18:42                         ` Ted Ts'o
  2011-07-15 19:00                           ` Linus Torvalds
  2011-07-16  9:16                           ` Christian Couder
  2011-07-15 18:46                         ` Tony Luck
  1 sibling, 2 replies; 89+ messages in thread
From: Ted Ts'o @ 2011-07-15 18:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 09:44:21AM -0700, Linus Torvalds wrote:
> So I would like to repeat: I think our commit-date based hack has been
> pretty successful. We've lived with it for years and years. Even the
> "let's try to fix it by adding slop" code is from three years ago
> (commit 7d004199d1), which means that for three years we never really
> saw any serious problems. I forget what problem we actually did see -
> I have this dim memory of it being Ted that had problems with a merge
> because git picked a crap merge base, but that may just be my
> Alzheimer's speaking.

My original main issue was simply that "git tag --contains" and "git
branch --contains" was either (a) incorrect, or (b) slower than
popping up gitk and pulling the information out of the GUI.  The
reason for (b) is because of gitk.cache.

Maybe the answer then is creating a command-line tool (it doesn't have to
be in "core" of git) which just pulls the dammned information out of
gitk.cache....

(Yes, it's gross, but I'm not worrying about the long-term
architecture of git or anything high-falutin' like that.  I'm just a
poor dumb user who just wants git tag --contains and git branch
--contains to be fast and accurate...)

						- Ted

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 18:42                         ` Ted Ts'o
@ 2011-07-15 19:00                           ` Linus Torvalds
  2011-07-16  9:16                           ` Christian Couder
  1 sibling, 0 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 19:00 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 11:42 AM, Ted Ts'o <tytso@mit.edu> wrote:
>
> My original main issue was simply that "git tag --contains" and "git
> branch --contains" was either (a) incorrect, or (b) slower than
> popping up gitk and pulling the information out of the GUI.  The
> reason for (b) is because of gitk.cache.

With "original issue" I actually meant the case that caused us to add
the "slop" commit (7d004199d1). But I was too lazy to try to find the
archives from March 2008..

                  Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 18:42                         ` Ted Ts'o
  2011-07-15 19:00                           ` Linus Torvalds
@ 2011-07-16  9:16                           ` Christian Couder
  2011-07-18  3:41                             ` Jeff King
  1 sibling, 1 reply; 89+ messages in thread
From: Christian Couder @ 2011-07-16  9:16 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Linus Torvalds, Shawn Pearce, Jeff King, Git Mailing List,
	Junio C Hamano

On Fri, Jul 15, 2011 at 8:42 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Fri, Jul 15, 2011 at 09:44:21AM -0700, Linus Torvalds wrote:
>> So I would like to repeat: I think our commit-date based hack has been
>> pretty successful. We've lived with it for years and years. Even the
>> "let's try to fix it by adding slop" code is from three years ago
>> (commit 7d004199d1), which means that for three years we never really
>> saw any serious problems. I forget what problem we actually did see -
>> I have this dim memory of it being Ted that had problems with a merge
>> because git picked a crap merge base, but that may just be my
>> Alzheimer's speaking.
>
> My original main issue was simply that "git tag --contains" and "git
> branch --contains" was either (a) incorrect, or (b) slower than
> popping up gitk and pulling the information out of the GUI.  The
> reason for (b) is because of gitk.cache.
>
> Maybe the answer then is creating a command-line tool (it doesn't have to
> be in "core" of git) which just pulls the dammned information out of
> gitk.cache....
>
> (Yes, it's gross, but I'm not worrying about the long-term
> architecture of git or anything high-falutin' like that.  I'm just a
> poor dumb user who just wants git tag --contains and git branch
> --contains to be fast and accurate...)

If  "git tag --contains" and "git branch --contains" give incorrect
answers because the commiter date is wrong in some commits, then why
not use "git replace" to "change" the commiter date in the commits
that have a wrong date? Is it because you don't want to use "git
replace", or because there is no script to do it automatically, or is
there another reason?

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-16  9:16                           ` Christian Couder
@ 2011-07-18  3:41                             ` Jeff King
  2011-07-19  4:14                               ` Christian Couder
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-18  3:41 UTC (permalink / raw)
  To: Christian Couder
  Cc: Ted Ts'o, Linus Torvalds, Shawn Pearce, Git Mailing List,
	Junio C Hamano

On Sat, Jul 16, 2011 at 11:16:45AM +0200, Christian Couder wrote:

> If  "git tag --contains" and "git branch --contains" give incorrect
> answers because the commiter date is wrong in some commits, then why
> not use "git replace" to "change" the commiter date in the commits
> that have a wrong date? Is it because you don't want to use "git
> replace", or because there is no script to do it automatically, or is
> there another reason?

That would work. There are a few tricky things, though:

  1. Most commits have less than 100 skewed commits. But some have many
     (e.g., thousands in the mesa repo). How well does git cope with
     large numbers of replace refs, performance-wise?

  2. Declaring which commits are skewed is actually tricky. You can find
     a commit whose timestamp is less than the timestamp of one of its
     ancestors. But you don't know whether it is skewed, or the
     ancestor.

     If you are implementing a list of commits whose timestamps
     shouldn't be used for traversal cutoff, it doesn't really matter
     who is _right_; you just care about whether the timestamps are
     strictly increasing from that point.

     But once you start replacing commits, you need to put in a
     reasonable value for the timestamp. So you may well be replacing a
     perfectly valid commit with one that has bogus, skewed information
     in the commit timestamp.

  3. Any value you put in is actually going to be a lie during things
     like "git log --pretty=raw". That may be OK. But it is letting an
     optimization meant to make traversal fast and accurate bleed into
     the actual data we show the user.

  4. Sometimes we need to do traversals on the real objects (e.g.,
     because we are doing upload-pack). To get the benefit, those
     traversals would presumably need to look at both the original
     object and the replacement, use the timestamp from the replacement
     for traversal, but otherwise use the original object.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-18  3:41                             ` Jeff King
@ 2011-07-19  4:14                               ` Christian Couder
  2011-07-19 20:00                                 ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Christian Couder @ 2011-07-19  4:14 UTC (permalink / raw)
  To: Jeff King
  Cc: Christian Couder, Ted Ts'o, Linus Torvalds, Shawn Pearce,
	Git Mailing List, Junio C Hamano

On Monday 18 July 2011 05:41:06 Jeff King wrote:
> On Sat, Jul 16, 2011 at 11:16:45AM +0200, Christian Couder wrote:
> > If  "git tag --contains" and "git branch --contains" give incorrect
> > answers because the commiter date is wrong in some commits, then why
> > not use "git replace" to "change" the commiter date in the commits
> > that have a wrong date? Is it because you don't want to use "git
> > replace", or because there is no script to do it automatically, or is
> > there another reason?
> 
> That would work. There are a few tricky things, though:
> 
>   1. Most commits have less than 100 skewed commits. But some have many
>      (e.g., thousands in the mesa repo). How well does git cope with
>      large numbers of replace refs, performance-wise?

If it did not cope well, it should be possible to improve the performance.

Anyway, another way to fix the problem with "git replace" could be to create 
branches with commits that have a fixed commiter date and then to use "git 
replace" only to connect these branches to the graph.

For example if you have this:

A - B - X1 - X2 - X3 - C - D

where X1, X2 and X3 are skewed, then you can create this:

A - B - X1 - X2 - X3 - C - D
         \ Y1 - Y2 - Y3

where Y1, Y2, Y3 are the same as X1, X2, X3 except they are not skewed.
Then you only need to do "git replace X3 Y3" so you create only one replace 
ref.

> 
>   2. Declaring which commits are skewed is actually tricky. You can find
>      a commit whose timestamp is less than the timestamp of one of its
>      ancestors. But you don't know whether it is skewed, or the
>      ancestor.
> 
>      If you are implementing a list of commits whose timestamps
>      shouldn't be used for traversal cutoff, it doesn't really matter
>      who is _right_; you just care about whether the timestamps are
>      strictly increasing from that point.
> 
>      But once you start replacing commits, you need to put in a
>      reasonable value for the timestamp. So you may well be replacing a
>      perfectly valid commit with one that has bogus, skewed information
>      in the commit timestamp.

Perhaps but with "git replace" you can choose to create new replace refs and 
deprecate the old replace refs to fix this where you got it wrong.

It would be easier to do that if "git replace" supported sub directories like 
"refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs 
more easily.

For example you could create new refs in "refs/replace/clock-skew/ted-
july-2011-2/" if you found a better fix. And then use these new refs instead of 
those in "refs/replace/clock-skew/ted-july-2011/".

>   3. Any value you put in is actually going to be a lie during things
>      like "git log --pretty=raw". That may be OK. But it is letting an
>      optimization meant to make traversal fast and accurate bleed into
>      the actual data we show the user.

With replace refs, the user could choose the "lies" told to him/her by 
selecting the replace refs or set of replace refs that are used.

As commits are immutable, when they are created with bad data, the best we can 
do is let the user choose if they want to see the original or another "fixed" 
version. Because the original will always be "true" in a way.

>   4. Sometimes we need to do traversals on the real objects (e.g.,
>      because we are doing upload-pack). To get the benefit, those
>      traversals would presumably need to look at both the original
>      object and the replacement, use the timestamp from the replacement
>      for traversal, but otherwise use the original object.

Yeah, or maybe when we do traversals on real objects we could afford not to 
rely on commiter date or some other "fragile" data.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-19  4:14                               ` Christian Couder
@ 2011-07-19 20:00                                 ` Jeff King
  2011-07-21  6:29                                   ` Christian Couder
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-19 20:00 UTC (permalink / raw)
  To: Christian Couder
  Cc: Christian Couder, Ted Ts'o, Linus Torvalds, Shawn Pearce,
	Git Mailing List, Junio C Hamano

On Tue, Jul 19, 2011 at 06:14:38AM +0200, Christian Couder wrote:

> >      But once you start replacing commits, you need to put in a
> >      reasonable value for the timestamp. So you may well be replacing a
> >      perfectly valid commit with one that has bogus, skewed information
> >      in the commit timestamp.
> 
> Perhaps but with "git replace" you can choose to create new replace refs and 
> deprecate the old replace refs to fix this where you got it wrong.
> 
> It would be easier to do that if "git replace" supported sub directories like 
> "refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs 
> more easily.

I think all of the arguments I cut from your email are reasonable, but
the crux of the issue comes down to this point.

If you are interested in actually correcting the skew, then yes, replace
refs are a good solution. But doing so is going to involve somebody
looking at the commits and deciding which ones are wrong, and what they
should be. And maybe that's a good thing to do for people who really
care about cleaning history.

But for something like "speed up revision traversal by assuming commit
timestamps are roughly increasing", we want something very automated,
and what is needs to say is much weaker (not "this is what this commit
_should_ say", but rather "this commit might be right, but it is not a
good point for cutting off a traversal"). So that's a much easier
problem, and it's easy to do in an automated way.

So I think while you could use replace refs to handle this issue, it is
not always going to be the right solution, and there is room for
something simpler (and weaker).

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-19 20:00                                 ` Jeff King
@ 2011-07-21  6:29                                   ` Christian Couder
  0 siblings, 0 replies; 89+ messages in thread
From: Christian Couder @ 2011-07-21  6:29 UTC (permalink / raw)
  To: Jeff King
  Cc: Christian Couder, Ted Ts'o, Linus Torvalds, Shawn Pearce,
	Git Mailing List, Junio C Hamano

On Tue, Jul 19, 2011 at 10:00 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Jul 19, 2011 at 06:14:38AM +0200, Christian Couder wrote:
>
>> Perhaps but with "git replace" you can choose to create new replace refs and
>> deprecate the old replace refs to fix this where you got it wrong.
>>
>> It would be easier to do that if "git replace" supported sub directories like
>> "refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs
>> more easily.
>
> I think all of the arguments I cut from your email are reasonable, but
> the crux of the issue comes down to this point.
>
> If you are interested in actually correcting the skew, then yes, replace
> refs are a good solution. But doing so is going to involve somebody
> looking at the commits and deciding which ones are wrong, and what they
> should be.

I think that we can help the user a lot to find the skew, and then to
decide which commits are wrong, and then to fix the skew even if the
fix we suggest is far from being perfect.

> And maybe that's a good thing to do for people who really
> care about cleaning history.

Yeah, so maybe at one point we will want to help these people even if
we have implemented automatic generation numbers. Then this means that
automated generation numbers are useful only if:

1) there are commits with skews
2) the heuristics to deal with some skew don't work
3) the user is too lazy to use the help we (can) provide to fix the skews

I think that we can probably find heuristics that will deal with at
least 95% of the cases. For example we could perhaps decide that we
don't cut off a traversal until the date difference is greater than 5
days.

Then in the hopefully few cases where there are really big skews that
won't be caught by our heuristics, (but that we can automatically
detect when fetching or commiting,) we can perhaps afford to ask the
user to do a small analysis to properly fix the skew.

I mean that at one point when things are too weird it is ok and
perhaps even a good thing to involve the user.

> But for something like "speed up revision traversal by assuming commit
> timestamps are roughly increasing", we want something very automated,
> and what is needs to say is much weaker (not "this is what this commit
> _should_ say", but rather "this commit might be right, but it is not a
> good point for cutting off a traversal"). So that's a much easier
> problem, and it's easy to do in an automated way.

Yeah, generation numbers look like an easy thing to do. And yeah,
being automated is great too. But it does not mean it is the right
thing to do. (Or perhaps we could have them but not save them in any
cache, nor in the commit object.)

> So I think while you could use replace refs to handle this issue, it is
> not always going to be the right solution, and there is room for
> something simpler (and weaker).

You know, replace refs can be used to fix or improve a lot of things
like bad authors, clock skews, bisecting on a fixed up history,
working on a larger or smaller repository than the original, and so
on. And of course for each of these problems you may find another
solution tailored to the problem at hand that will seem simpler or
easier. But in the end if you develop all these other solutions you
will have developed a lot of stuff that will be harder to maintain,
less generic, more complex and so on, that properly developed replace
refs.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 16:44                       ` Linus Torvalds
  2011-07-15 18:42                         ` Ted Ts'o
@ 2011-07-15 18:46                         ` Tony Luck
  2011-07-15 18:58                           ` Linus Torvalds
  1 sibling, 1 reply; 89+ messages in thread
From: Tony Luck @ 2011-07-15 18:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 9:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Btw, I do agree that we probably should add a warning for the case
> ("your clock is wrong - your commit date is before the commit date of
> your parents") and maybe require the use of "-f" or something to
> override it. That would certainly be a good thing quite independently
> of anything else. So regardless of generation counts, it's probably
> worth it.

What if my clock is wrong in the opposite direction - set to some time
out in 2025.
It would pass the check you propose and let the commit go in - but would
cause problems for everyone if that tree was pulled into upstream.

You'd also want a check in pull(merge) that none of the commits being
added were in the future (as defined by the time on your machine).

-Tony

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 18:46                         ` Tony Luck
@ 2011-07-15 18:58                           ` Linus Torvalds
  0 siblings, 0 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 18:58 UTC (permalink / raw)
  To: Tony Luck; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 11:46 AM, Tony Luck <tony.luck@intel.com> wrote:
>
> What if my clock is wrong in the opposite direction - set to some time
> out in 2025.
> It would pass the check you propose and let the commit go in - but would
> cause problems for everyone if that tree was pulled into upstream.

I think Shawn suggested that we just notice it at merge time.

But yes, it's why (a) I'd suggest we have a "-f" to override and (b) I
do think that generation counts are a better idea. You could still
screw them up, but it would be due to an outright bug or malicious
behavior, rather than simple incompetence on the part of a user.

Incompetent users (where "date on the machine set to the wrong
century" is just _one_ sign of incompetence) are something git should
pretty much take for granted. It may not be the common case, but it's
certainly something we should design for and take into account.

In contrast, if somebody *wants* to screw his repository up by
re-writing objects with "git hash-object" etc, be my guest. We should
just make sure fsck catches anything serious.

So I would suggest checking the date regardless of any generation
count issues, because it would possibly find badly configured machines
that should be fixed. The same way we complain when we find no name.

Whether it should then be a correctness issue or not is kind of separate.

> You'd also want a check in pull(merge) that none of the commits being
> added were in the future (as defined by the time on your machine).

I don't think you need to care about "none of the commits", just
making sure the tip is reasonable. That would not only be expensive,
and not what we normally do (we show the diff against endpoints, not
all changes, etc). It would also cause problems for "fixed"
repositories (ie anything that has historical dates that are wrong,
but are ok now).

                  Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 16:10                   ` Linus Torvalds
  2011-07-15 16:18                     ` Shawn Pearce
@ 2011-07-15 19:48                     ` Jeff King
  2011-07-15 20:07                       ` Jeff King
  2011-07-15 21:17                       ` Linus Torvalds
  1 sibling, 2 replies; 89+ messages in thread
From: Jeff King @ 2011-07-15 19:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 09:10:48AM -0700, Linus Torvalds wrote:

> I think it's much worse to have the same information in two different
> places where it can cause inconsistencies that are hard to see and may
> not be repeatable. If git ever finds the wrong merge base (because,
> say, the generation numbers are wrong), I want it to be a *repeatable*
> thing. I want to be able to repeat on the git mailing list "hey, guys,
> look at what happens when I try to merge commits ABC and XYZ". If you
> go "yeah, it works for me", then that is bad.

Having the information in two different places is my concern, too. And I
think the fundamental difference between putting it inside or outside
the commit sha1 (where outside encompasses putting it in a cache, in the
pack-index, or whatever), is that I see the commit sha1 as somehow more
"definitive". That is, it is the sole data we pass from repo to repo
during pushes and pulls, and it is the thing that is consistency-checked
by hashes.

So if there is an inconsistency between what the parent pointers
represent, and what the generation number in "outside" storage says,
then the outside storage is wrong, and the parent pointers are the right
answer. It becomes a lot more fuzzy to me if there is an inconsistency
between what the parent pointers represent, and what the generation
number says.

How should that situation be handled? Should fsck check for it and
complain? Should we just ignore it, even though it may cause our
traversal algorithms to be inaccurate? Like clock skew, there's not much
that can be done if the commits are published.

Those are serious questions that I think should be considered if we are
going to put a generation header into the commit object, and I haven't
seen answers for them yet.

> Partly for that reason, I do think that if the generation count was
> embedded in the pack-file, that would not be an "ugly" decision. The
> pack-files have definitely become "core git data structures", and are
> more than just a local filesystem representation of the objects:
> they're obviously also the data transport method, even if the rules
> there are slightly different (no index, thank god, and incomplete
> "thin" packs).
> 
> That said, I don't think a generation count necessarily "fits" in the
> pack-file. They are designed to be incremental, so it's not very
> natural there. But I do think it would be conceptually prettier to
> have the "depth of commit" be part of the "filesystem" data than to
> have it as a separate ad-hoc cache.

Sure, I would be fine with that. When you say "packfile", do you mean
the the general concept, as in it could go in the pack index as opposed
to the packfile itself? Or specifically in the packfile? The latter
seems a lot more problematic to me in terms of implementation.

> > Those things rely on the idea that the git DAG is a data model that we
> > present to the user, but that we're allowed to do things behind the
> > scenes to make things faster.
> 
> .. and that is relevant to this discussion exactly *how*?

Because keeping the generation information outside of the DAG keeps the
model we present to the user simple (and not just the user; the
information that we present to other programs), but lets git still use
the information without calculating it from scratch each time. Just like
we present the data as a DAG of loose objects via things like "git
cat-file", even though the underlying storage inside a packfile may be
very different. I just don't see those two ideas as fundamentally
different.

> It's not. It's totally irrelevant. I certainly would never walk away
> from the DAG model. It's a fundamental git decision, and it's the
> correct one.

Of course not. I never suggested we should.

> And that is what this discussion fundamentally boils down to for me.
> 
> If we should have fixed it in the original specification, we damn well
> should fix it today. It's been "ignorable" because it's just not been
> important enough. But if git now adds a fundamental cache for them,
> then that information is clearly no longer "not important enough".

OK, so let's say we add generation headers to each commit. What happens
next? Are we going to convert algorithms that use timestamps to use
commit generations? How are we going to handle performance issues when
dealing with older parts of history that don't have generations?

Again, those are serious questions that need answered. I respect that
you think the lack of a generation header is a design decision that
should be corrected. As I said before, I'm not 100% sure I agree, but
nor do I completely disagree (and I think it largely boils down to a
philosophical distinction, which I think you will agree should take a
backseat to real, practical concerns). But it's not 2005, and we have a
ton of history without generation numbers. So adding them now is only
one piece of the puzzle.

What's your solution for the rest of it?

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 19:48                     ` Jeff King
@ 2011-07-15 20:07                       ` Jeff King
  2011-07-15 21:17                       ` Linus Torvalds
  1 sibling, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-15 20:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 03:48:07PM -0400, Jeff King wrote:

> OK, so let's say we add generation headers to each commit. What happens
> next? Are we going to convert algorithms that use timestamps to use
> commit generations? How are we going to handle performance issues when
> dealing with older parts of history that don't have generations?
> 
> Again, those are serious questions that need answered. I respect that
> you think the lack of a generation header is a design decision that
> should be corrected. As I said before, I'm not 100% sure I agree, but
> nor do I completely disagree (and I think it largely boils down to a
> philosophical distinction, which I think you will agree should take a
> backseat to real, practical concerns). But it's not 2005, and we have a
> ton of history without generation numbers. So adding them now is only
> one piece of the puzzle.
> 
> What's your solution for the rest of it?

I just read some of your later emails to others in the thread. It seems
like your answer is "assume the timestamp-based limiting is good enough
for old history".

I'm OK with that. It obviously falls down in a few specific situations,
but certainly has not been an unbearable problem for the past 5 years.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 19:48                     ` Jeff King
  2011-07-15 20:07                       ` Jeff King
@ 2011-07-15 21:17                       ` Linus Torvalds
  2011-07-15 21:54                         ` Jeff King
  2011-07-15 23:10                         ` Linus Torvalds
  1 sibling, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 21:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 12:48 PM, Jeff King <peff@peff.net> wrote:
>
> Having the information in two different places is my concern, too. And I
> think the fundamental difference between putting it inside or outside
> the commit sha1 (where outside encompasses putting it in a cache, in the
> pack-index, or whatever), is that I see the commit sha1 as somehow more
> "definitive". That is, it is the sole data we pass from repo to repo
> during pushes and pulls, and it is the thing that is consistency-checked
> by hashes.

Sure. That is also the data that is the same for everybody.

That's a big deal, in the sense that it's the only thing we should
rely on if we want consistent behavior. Immediately if core
functionality starts using any other data, behavior becomes "local".
And I think that's really *really* dangerous.

Sure, we have "local behavior" in a lot of small details. We very much
intentionally have it in the ref-logs, and since branches and tags are
local we also have it in things like "--decorate", which obviously
depends on exactly which local refs you have.

We also have local behavior in things like .git/config etc files, so
git can behave very differently for different people even with what is
otherwise an identical repository.

So local behavior is good and expected for some things. We *want* it
for things like colorization decisions, we want it for aliases, and we
want it for branch naming.

But really core behavior shouldn't depend on local information. I
think it would be wrong if something like a merge base decision would
be based on any local information.

For example, should it matter whether something is packed or not? I
really don't think so. That's a pretty random implementation detail,
and if we get different end results because some commit happens to be
packed, vs not packed (because, say, we'd be hiding generation
information in the pack) that would be wrong.

Now, we do have things like merge resolution caches etc (which
obviously do save and use local information again), but I think that's
pretty well clarified.

> So if there is an inconsistency between what the parent pointers
> represent, and what the generation number in "outside" storage says,
> then the outside storage is wrong, and the parent pointers are the right
> answer. It becomes a lot more fuzzy to me if there is an inconsistency
> between what the parent pointers represent, and what the generation
> number says.

So I really don't see why you harp on that. If the generation counters
are in the objects THEY BY DEFINITION CANNOT BE INCONSISTENT.

That's a big issue.

Sure, they may be LYING, but that's a different thing entirely. They
will be lying to everybody consistently. There would never be any
question about what the generation number of a commit is.

See what I'm trying to say? There's no way that they would cause
different behavior for different people. Everything is 100%
consistent.

The exact same thing is true of commit dates, btw. They may be
confused as hell, and they may cause us to do bad things when we
traverse the history, but different clocks on different machines will
still not cause git to act differently on different machines. There's
no possibility of inconsistency.

(Of course, different *versions* of git may traverse the history
differently, since we've changed the heuristics over time. So we do
have that kind of inconsistent behavior, where we give different
results from different versions of git).

And btw, having "incorrect" data in the git objects is not the end of
the world. You can generate merge commits that simply have the wrong
parents. That will be confusing as hell to the user, and it will make
future merges not work very well, but it's a bug in the archive, and
that's "ok". The developers may not be very happy about it. In fact,
afaik we've had a few cases like that in the kernel tree, because
early git had bugs where it would not properly forget parents after a
failed merge. Most of them are ARM-related, because the ARM tree was
one of the first users of git (outside of me, but I had fewer issues
with what happens when things go wrong).

So I would not be *too* shocked if we'd end up with "odd" generation
counts due to some odd bug. It sounds unlikely, but my point is that
that is not at all what I'd *worry* about.

> How should that situation be handled? Should fsck check for it and
> complain? Should we just ignore it, even though it may cause our
> traversal algorithms to be inaccurate? Like clock skew, there's not much
> that can be done if the commits are published.

Right. I simply think it's not a big deal.

IOW, if we would rely on generation counts instead of clock dates,
maybe the generation counts would have occasional problems too, but I
suspect they'd be *much* rarer than time-based issues, because at
least the generation count is a well-defined number rather than a
random thing we pick out of emails and badly maintained machines.

That said, I'm not 100% sure at all that we want generation numbers at
all. Their use is pretty limited. If we had had them from the
beginning, I think we would simply have replaced the date-based commit
list sorting with a generation-number-based one, and it should have
been possible to guarantee that we never output a parent before the
commit in rev-parse.

As it is, I have to admit that looking at it, I shudder at changing
the current date-based logic and replacing it with a "date or
generation number".

The date-based one, despite all its fuzziness and not being very well
defined ("Global clock in a distributed system? You're a moron") and
up being a *nice* heuristic for certain human interaction. So it's not
a wonderful solution from a technical standpoint, but it does have (I
think) some nice UI advantages.

(For an example of that: using "--topo-sort" for revision history may
be a very good thing technically, but even if it wasn't for the fact
that it's more expensive, I think that our largely time-based default
order for "git log" in many ways is a better interface for humans. Of
course, when mixed with actually giving a history graph, that changes,
because then you want the "related" commits to group together, rather
than by time. So I think it's just basically a fuzzy area, without any
clear hard rules - which is probably why using that fuzzy timestamp
works so well in practice)

> Those are serious questions that I think should be considered if we are
> going to put a generation header into the commit object, and I haven't
> seen answers for them yet.

I do agree that the really *big* question is "do we even need it at
all". I do like perhaps just tightening the commit timestamp rules.
Because I do think they would probably work very well for the
"contains" problem too.

With the exact same fuzzy downsides, of course. Timestamps aren't
perfect, and they need that annoying fuzz factor thing.

>> That said, I don't think a generation count necessarily "fits" in the
>> pack-file. They are designed to be incremental, so it's not very
>> natural there. But I do think it would be conceptually prettier to
>> have the "depth of commit" be part of the "filesystem" data than to
>> have it as a separate ad-hoc cache.
>
> Sure, I would be fine with that. When you say "packfile", do you mean
> the the general concept, as in it could go in the pack index as opposed
> to the packfile itself? Or specifically in the packfile? The latter
> seems a lot more problematic to me in terms of implementation.

I was thinking the "general" issue - it might make most sense to put
them in the index.
>> If we should have fixed it in the original specification, we damn well
>> should fix it today. It's been "ignorable" because it's just not been
>> important enough. But if git now adds a fundamental cache for them,
>> then that information is clearly no longer "not important enough".
>
> OK, so let's say we add generation headers to each commit. What happens
> next? Are we going to convert algorithms that use timestamps to use
> commit generations? How are we going to handle performance issues when
> dealing with older parts of history that don't have generations?

So I do think the _initial_ question need to be the other way around:
do we have to have generation numbers at all?

I think it's likely a design misfeature not to have them, but
considering that we don't, and have been able to make do without for
so long, I'm also perfectly willing to believe that we could speed up
"contains" dramatically with the same kind of (crazy and inexact)
tricks we use for merge bases.

(Looking at a profile, a third - and the top entry - of the "git tag
--contains" profile cost is just in "clear_commit_marks()" - not doing
any real work, rather *undoing* the work in order to re-do things. So
it's entirely possible that the real issue is simply that
"in_merge_bases()" is badly done, and we could speed things up a lot
independently of anything else).

For example, for the "git tag --contains" thing, what's the
performance effect of just skipping tags that are much older than the
commit we ask for?

                Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 21:17                       ` Linus Torvalds
@ 2011-07-15 21:54                         ` Jeff King
  2011-07-15 23:10                         ` Linus Torvalds
  1 sibling, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-15 21:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 02:17:26PM -0700, Linus Torvalds wrote:

> > Having the information in two different places is my concern, too. And I
> > think the fundamental difference between putting it inside or outside
> > the commit sha1 (where outside encompasses putting it in a cache, in the
> > pack-index, or whatever), is that I see the commit sha1 as somehow more
> > "definitive". That is, it is the sole data we pass from repo to repo
> > during pushes and pulls, and it is the thing that is consistency-checked
> > by hashes.
> 
> Sure. That is also the data that is the same for everybody.
> 
> That's a big deal, in the sense that it's the only thing we should
> rely on if we want consistent behavior. Immediately if core
> functionality starts using any other data, behavior becomes "local".
> And I think that's really *really* dangerous.

Yes, I see your argument. I just don't think it's all that big a deal,
because the information is so easily derived from data that _is_ the
same for everybody (and when you _do_ want it to be different locally,
because you are grafting, that is easy to do).

But I think at this point we have both said all there is to say. There
is no actual data to be brought forth in this argument, and we obviously
disagree on this point. So I think we may have to agree to disagree.

And as I said before, I am willing to concede generation numbers in the
commit header. But we need the rest of the solution, too.

> So I really don't see why you harp on that. If the generation counters
> are in the objects THEY BY DEFINITION CANNOT BE INCONSISTENT.
> 
> That's a big issue.
> 
> Sure, they may be LYING, but that's a different thing entirely. They
> will be lying to everybody consistently. There would never be any
> question about what the generation number of a commit is.
>
> See what I'm trying to say? There's no way that they would cause
> different behavior for different people. Everything is 100%
> consistent.

Read my email again. I am clearly talking about inconsistency between
two data items in the sha1-checked DAG itself. You then proceed to yell
at me that they are not inconsistent, now talking about inconsistency
between different people with the same DAG, but different caches. In
other words, you are talking about an entirely different type of
inconsistency. And then you proceed to say that the generation numbers
may be lying, which is _exactly_ what I meant when I said inconsistency.

I don't mind arguing with you, even if I think you use capital letters
too frequently; but when you do use them, please take care that I really
am being a bonehead, and it is not you misrepresenting what I said.

As to lying (aka inconsistency between items within the DAG), you say:

> And btw, having "incorrect" data in the git objects is not the end of
> the world. You can generate merge commits that simply have the wrong
> parents. That will be confusing as hell to the user, and it will make
> future merges not work very well, but it's a bug in the archive, and
> that's "ok". The developers may not be very happy about it. In fact,
> afaik we've had a few cases like that in the kernel tree, because
> early git had bugs where it would not properly forget parents after a
> failed merge. Most of them are ARM-related, because the ARM tree was
> one of the first users of git (outside of me, but I had fewer issues
> with what happens when things go wrong).

No, it's not the end of the world. I just think it's worse than the
possibility of inconsistency between two users' idea of the graph,
because the bug stays with you for all of history, instead of getting
fixed with a new version of git.

> That said, I'm not 100% sure at all that we want generation numbers at
> all. Their use is pretty limited. If we had had them from the
> beginning, I think we would simply have replaced the date-based commit
> list sorting with a generation-number-based one, and it should have
> been possible to guarantee that we never output a parent before the
> commit in rev-parse.
> 
> As it is, I have to admit that looking at it, I shudder at changing
> the current date-based logic and replacing it with a "date or
> generation number".
> 
> The date-based one, despite all its fuzziness and not being very well
> defined ("Global clock in a distributed system? You're a moron") and
> up being a *nice* heuristic for certain human interaction. So it's not
> a wonderful solution from a technical standpoint, but it does have (I
> think) some nice UI advantages.

That is the conclusion I am coming to, also. I don't find the external
cache as odious as you obviously do. But that was why I posted the
patches with an RFC tag. I wanted to see how painful people found the
concept. But if it's too ugly a concept, I think the path of least
resistance is just making timestamps suck less (by using more consistent
and robust skew avoidance[1] in our various algorithms, and by perhaps
taking more care to notify the user of skew early, before commits are
published).

And then we don't really need generation numbers anymore.  As elegant as
they might have been if they were there from day one, it's just not
worth the hassle of maintaining the dual solution.

[1] We use "N slop commits" in some places and "allow 86400 seconds of
    skew" in other places. We should probably use both, and apply them
    consistently.

> > Those are serious questions that I think should be considered if we are
> > going to put a generation header into the commit object, and I haven't
> > seen answers for them yet.
> 
> I do agree that the really *big* question is "do we even need it at
> all". I do like perhaps just tightening the commit timestamp rules.
> Because I do think they would probably work very well for the
> "contains" problem too.
> 
> With the exact same fuzzy downsides, of course. Timestamps aren't
> perfect, and they need that annoying fuzz factor thing.

Yeah. But in practice, that fuzz is really easy to implement, has worked
pretty well so far, and doesn't actually hurt performance measurably,
because skew is rare, and a constant, small timestamp tends to equate to
a constant, small number of commits.

> > Sure, I would be fine with that. When you say "packfile", do you mean
> > the the general concept, as in it could go in the pack index as opposed
> > to the packfile itself? Or specifically in the packfile? The latter
> > seems a lot more problematic to me in terms of implementation.
> 
> I was thinking the "general" issue - it might make most sense to put
> them in the index.

If we were to go the cache route, I think I am leaning that way, too, if
only because we don't duplicate the 20-byte sha1 per commit, which keeps
our I/O down.

> > OK, so let's say we add generation headers to each commit. What happens
> > next? Are we going to convert algorithms that use timestamps to use
> > commit generations? How are we going to handle performance issues when
> > dealing with older parts of history that don't have generations?
> 
> So I do think the _initial_ question need to be the other way around:
> do we have to have generation numbers at all?

No, we don't need them. My "contains" patches were already implemented
using timestamps, and it's pretty fast. They fall down only in the face
lying timestamps (i.e., skew). The whole reason to switch to generation
headers was that we could assume they would be correct, and our
algorithms using them would be more likely to be correct.

And I do think a generation header would be more likely to be correct
than a timestamp, if only because timestamps are harder to get right.

> I think it's likely a design misfeature not to have them, but
> considering that we don't, and have been able to make do without for
> so long, I'm also perfectly willing to believe that we could speed up
> "contains" dramatically with the same kind of (crazy and inexact)
> tricks we use for merge bases.

Already done. I can point you to the patches if you want.

> For example, for the "git tag --contains" thing, what's the
> performance effect of just skipping tags that are much older than the
> commit we ask for?

It's as fast as using generations. See these two patches:

  http://article.gmane.org/gmane.comp.version-control.git/150261

  http://article.gmane.org/gmane.comp.version-control.git/150262

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 21:17                       ` Linus Torvalds
  2011-07-15 21:54                         ` Jeff King
@ 2011-07-15 23:10                         ` Linus Torvalds
  2011-07-15 23:16                           ` Linus Torvalds
  2011-07-16  0:40                           ` Jeff King
  1 sibling, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 23:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

On Fri, Jul 15, 2011 at 2:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> For example, for the "git tag --contains" thing, what's the
> performance effect of just skipping tags that are much older than the
> commit we ask for?

Hmm.

Maybe there is something seriously wrong with this trivial patch, but
it gave the right results for the test-cases I threw at it, and passes
the tests.

Before:

   [torvalds@i5 linux]$ time git tag --contains v2.6.24 > correct

   real	0m7.548s
   user	0m7.344s
   sys	0m0.116s

After:

   [torvalds@i5 linux]$ time ~/git/git tag --contains v2.6.24 > date-cut-off

   real	0m0.161s
   user	0m0.140s
   sys	0m0.016s

and 'correct' and 'date-cut-off' both give the same answer.

The date-based "slop" thing is (at least *meant* to be - note the lack
of any extensive testing) "at least five consecutive commits that have
dates that are more than five days off".

Somebody should double-check my logic. Maybe I'm doing something
stupid. Because that's a *big* difference.

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1614 bytes --]

 commit.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/commit.c b/commit.c
index ac337c7d7dc1..0d33c33a6520 100644
--- a/commit.c
+++ b/commit.c
@@ -737,16 +737,56 @@ struct commit_list *get_merge_bases(struct commit *one, struct commit *two,
 	return get_merge_bases_many(one, 1, &two, cleanup);
 }
 
+#define VISITED (1 << 16)
+
+static int is_recursive_descendant(struct commit *commit, struct commit *target)
+{
+	int slop = 5;
+	parse_commit(target);
+	for (;;) {
+		struct commit_list *parents;
+		if (commit == target)
+			return 1;
+		if (commit->object.flags & VISITED)
+			return 0;
+		commit->object.flags |= VISITED;
+		parse_commit(commit);
+		if (commit->date + 5*24*60*60 < target->date) {
+			if (--slop <= 0)
+				return 0;
+		} else
+			slop = 5;
+		parents = commit->parents;
+		if (!parents)
+			return 0;
+		commit = parents->item;
+		parents = parents->next;
+		while (parents) {
+			if (is_recursive_descendant(parents->item, target))
+				return 1;
+			parents = parents->next;
+		}
+	}
+}
+
+static int is_descendant(struct commit *commit, struct commit *target)
+{
+	int ret = is_recursive_descendant(commit, target);
+	clear_commit_marks(commit, VISITED);
+	return ret;
+}
+
 int is_descendant_of(struct commit *commit, struct commit_list *with_commit)
 {
 	if (!with_commit)
 		return 1;
+
 	while (with_commit) {
 		struct commit *other;
 
 		other = with_commit->item;
 		with_commit = with_commit->next;
-		if (in_merge_bases(other, &commit, 1))
+		if (is_descendant(commit, other))
 			return 1;
 	}
 	return 0;

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 23:10                         ` Linus Torvalds
@ 2011-07-15 23:16                           ` Linus Torvalds
  2011-07-15 23:36                             ` Linus Torvalds
  2011-07-16  0:40                           ` Jeff King
  1 sibling, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 23:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 4:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Maybe there is something seriously wrong with this trivial patch, but
> it gave the right results for the test-cases I threw at it, and passes
> the tests.
>
> Before:

I have fewer branches than tags, but I get something similar for "git
branch --contains":

  [torvalds@i5 linux]$ time git branch --contains v2.6.12 | sha1sum
  9d4224eec98ec7b0bcd5331dfa5badb9ef1fd510  -

  real	0m4.205s
  user	0m4.112s
  sys	0m0.084s
  [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.12 | sha1sum
  9d4224eec98ec7b0bcd5331dfa5badb9ef1fd510  -

  real	0m0.112s
  user	0m0.100s
  sys	0m0.008s

ie identical results, except one took 4.2s and with the patch it took 0.1s.

This is all hot-cache, of course, and on a fast machine.

                    Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 23:16                           ` Linus Torvalds
@ 2011-07-15 23:36                             ` Linus Torvalds
  2011-07-16  0:42                               ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 23:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

And one last comment:

On Fri, Jul 15, 2011 at 4:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I have fewer branches than tags, but I get something similar for "git
> branch --contains":

The time-based heuristic does seem to be important. If I just remove
it, I get increasingly long times for things that aren't contained in
my branches.

And in fact, I think that is why the code used the merge-base helper
functions - not because it wanted merge bases, but because the merge
base stuff will work from either end until it decides things aren't
relevant any more. Because *without* the time-based heuristics, the
trivial "is this a descendant" algorithm ends up working very badly
for the case where the target doesn't exist in the branches. Examples
of NOT having a date-based cut-off, but just doing the straightforward
(non-merge-base) ancestry walk:

  time ~/git/git branch --contains v2.6.12
  real	0m0.113s

  [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.39
  real	0m3.691s

and what ends up happening is that in the latter case, every branch
walks all the way to the root and checks every commit (walking all the
merges too). While in the first case, it's very quick because it will
find that particular commit when it walk straight backwards (so it
doesn't even have to do a lot of recursion - the first branch that
hits that commit will be a success), so it won't have to look at all
the side ways of getting there.

Of course, the above particular difference happens to be due to the
"depth-first" implementation working well for the thing I am searching
for.  But it does show that the date-based cut-off matters due to
traversal issues like that.

                          Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 23:36                             ` Linus Torvalds
@ 2011-07-16  0:42                               ` Jeff King
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-16  0:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 04:36:40PM -0700, Linus Torvalds wrote:

> On Fri, Jul 15, 2011 at 4:16 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I have fewer branches than tags, but I get something similar for "git
> > branch --contains":
> 
> The time-based heuristic does seem to be important. If I just remove
> it, I get increasingly long times for things that aren't contained in
> my branches.
> 
> And in fact, I think that is why the code used the merge-base helper
> functions - not because it wanted merge bases, but because the merge
> base stuff will work from either end until it decides things aren't
> relevant any more. Because *without* the time-based heuristics, the
> trivial "is this a descendant" algorithm ends up working very badly
> for the case where the target doesn't exist in the branches. Examples
> of NOT having a date-based cut-off, but just doing the straightforward
> (non-merge-base) ancestry walk:
> 
>   time ~/git/git branch --contains v2.6.12
>   real	0m0.113s
> 
>   [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.39
>   real	0m3.691s

Yes, exactly. That is why my first patch (which goes to a recursive
search), takes about the same amount of time as "git rev-list --all"
(and I suspect your 3.691s above is similar). And then the second one
drops that again to .03s.

I think you are simply recreating the strategy and timings I have posted
several times now.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 23:10                         ` Linus Torvalds
  2011-07-15 23:16                           ` Linus Torvalds
@ 2011-07-16  0:40                           ` Jeff King
  1 sibling, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-16  0:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 04:10:23PM -0700, Linus Torvalds wrote:

> On Fri, Jul 15, 2011 at 2:17 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > For example, for the "git tag --contains" thing, what's the
> > performance effect of just skipping tags that are much older than the
> > commit we ask for?
> 
> Hmm.
> 
> Maybe there is something seriously wrong with this trivial patch, but
> it gave the right results for the test-cases I threw at it, and passes
> the tests.
> 
> Before:
> 
>    [torvalds@i5 linux]$ time git tag --contains v2.6.24 > correct
> 
>    real	0m7.548s
>    user	0m7.344s
>    sys	0m0.116s
> 
> After:
> 
>    [torvalds@i5 linux]$ time ~/git/git tag --contains v2.6.24 > date-cut-off
> 
>    real	0m0.161s
>    user	0m0.140s
>    sys	0m0.016s
> 
> and 'correct' and 'date-cut-off' both give the same answer.

Without even looking carefully at your patches for any minor mistakes, I
can tell you that the speedup you're seeing is approximately right.
Because it's almost exactly the same optimization I made in my
timestamp-based patches (links to which I sent you earlier today).

However, you can make it even faster. The "tag --contains" code will ask
"is_descendant_of" repeatedly for the same set of "want" commits. So you
end up traversing some parts of the graph over and over. My patches
share the marks over a set of contains traversals, so you only ever
touch each commit once. And that's what my patches do.

With yours, on my box:

  $ time git tag --contains HEAD~1000 >/dev/null
  real    0m0.113s
  user    0m0.104s
  sys     0m0.008s

and mine:

  $ time git tag --contains HEAD~1000 >/dev/null
  real    0m0.035s
  user    0m0.020s
  sys     0m0.012s

I suspect you can make the difference even more prominent by having more
tags, or by having multiple "want" commits.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15  1:19               ` Linus Torvalds
  2011-07-15  2:41                 ` Geert Bosch
  2011-07-15  7:46                 ` Jeff King
@ 2011-07-15  9:12                 ` Jakub Narebski
  2011-07-15  9:17                   ` Long, Martin
  2 siblings, 1 reply; 89+ messages in thread
From: Jakub Narebski @ 2011-07-15  9:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff King, Git Mailing List, Junio C Hamano, Jakub Narebski

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <peff@peff.net> wrote:
> >
> > However, I'm not 100% convinced leaving generation numbers out was a
> > mistake. The git philosophy seems always to have been to keep the
> > minimal required information in the DAG.
> 
> Yes.
> 
> And until I saw the patches trying to add generation numbers, I didn't
> really try to push adding generation numbers to commits (although it
> actually came up as early as July 2005, so the "let's use generation
> numbers in commits" thing is *really* old).
> 
> In other words, I do agree that we should strive for minimal required
> information.
> 
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!
> 
> I really don't understand your logic that says that the cache is
> somehow cleaner. It's a random hack! It's saying "we don't have it in
> the main data structure, so let's add it to some other one instead,
> and now we have a consistency and cache generation problem instead".

You store redundant information, one that is used to speed up
calculations, in a cache.

[...]
> > Generation numbers are _completely_ redundant with the actual structure
> > of history represented by the parent pointers.

What is more important the perceived structure of history can change
by three mechanisms:

 * grafts
 * replace objects
 * shallow clone

I can understand that you don't want to worry about grafts - they are
a terrible hack.  We can simply turn off using generation numbers
stored in commit if they are present.

The problem with shallow clones is only at beginning, when some of
commits in shallow repository does not have generation numbers.  You
cannot simply calculate generation number for a new commit in such
case.

But what about REPLACE OBJECTS?  If one for example use "git replace"
on root commit to join contemporary repository with historical
repository... this is not addressed in your emails.

And let's not forget the fact that we need cache for old commits which
don't have yet generation number in a commit.

BTW. you are not fair comparing size of code.  

First, some of Peff code is about _using_ generation numbers, which
will be needed regardless of whether generation numbers are stored in
cache or packfile index, or whether they are embedded in commit
objects.

Second, with generation number commit header you need to write fsck
code, and have to consider size of this yet-to-be-written code.

[...]
> > I liken it somewhat to the "don't store renames" debate.
> 
> That's total and utter bullshit.

I think Peff meant here that if you make mistakes in calculating
rename info or generation number, and have incorrect information
stored in commit object, you are f**ked.

> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.
> 
> Just look at "git blame -C" to remind yourself why rename information is wrong.

Also doing full code movement and copying detection (that is what "git
blame -C" does) rather than simplistic whole-file rename detection is
pretty much impossible at commit time.

Nb. most SCMs that use path-id based rename tracking require that user
explicitly marks renames using "scm move" or "scm rename" (well,
Mercurial has a tool for rename detection before commit, "hg
addremove").  But asking user to mark code movements is simply
infeasible.

> But even more importantly, look at git merges. Look at how git has
> gotten merging right since pretty much day #1, and has absolutely no
> issues with files that got generated two different ways. Look at every
> SCM that tries to do rename detection, and look at how THEY CANNOT DO
> MERGES RIGHT.
> 
> It's that simple. Rename detection is not about avoiding "redundant
> data". It's about doing the right thing.

Well, rename tracking supporters say that heuristic rename detection
can be wrong.

By the way, what happened to "wholesame directory rename detection"
patches?  Without them in the situation where one side renamed
directory, and other created new file in said directory git on merge
creates file in re-created old name of directory...

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15  9:12                 ` Jakub Narebski
@ 2011-07-15  9:17                   ` Long, Martin
  2011-07-15 15:33                     ` Long, Martin
  0 siblings, 1 reply; 89+ messages in thread
From: Long, Martin @ 2011-07-15  9:17 UTC (permalink / raw)
  To: git

I strongly agree with Linus that the cache should not form part of the
solution to this problem, but could maybe be a later add-on, which
improved performance.

There is a possible improvement, which may remove the need for the
cache. It doesn't solve the issue of broken numbers, but I think the
key to that is just to ensure the traversal algorithm is
deterministic, stable, and immutable.

Firstly, I presume the generation number would not form part of the
SHA1 calculation? No? Cool.

When calculating a generation number by doing a traversal, would it
not be possible to update some, or all, commit objects touched, with
their generation numbers. Again, this would be expensive, but there
would possibly be even quicker gains than Linus's original proposal to
just add numbers to the new commit.

A compromise might be to only update some commits - notably those with
2 or more parents, so that both parents don't need to be traversed,
and possibly every nth commit (to give regular checkpoints that can be
utilised when traversing a branch). I would suggest commits with 2
children for the latter, but with my limited knowledge of the
implementation, I understand that Is more difficult to find.
Obviously, these numbers would only be pegged locally, and wouldn't by
synced on push, as they already exist on the far end. However, it
could be possible to run a process on a bare repo to shoot through and
peg commits, then at least new clones will be "well pegged"

Martin Long
UK

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15  9:17                   ` Long, Martin
@ 2011-07-15 15:33                     ` Long, Martin
  2011-07-15 16:15                       ` Drew Northup
  0 siblings, 1 reply; 89+ messages in thread
From: Long, Martin @ 2011-07-15 15:33 UTC (permalink / raw)
  To: git

>
> Firstly, I presume the generation number would not form part of the
> SHA1 calculation? No? Cool.

I suspect this may be where my suggestion falls down. Though I suspect
there is a case for object metadata which doesn't form part of the
SHA. Would generation number tampering be a concern?

Caching offers the ability to store that metadata, to provide the same
performance gain, but maintain the integrity of the SHA chain.
However, it does still leave the generation number liable to
tampering, meaning a generic non-SHA metadata solution might be
better.

TBH, there are few situations where historical generations are useful
- finding gen numbers of tags is one of them. Most cases are going to
be for new commits, and in that case, a few new commits at the tip of
each branch will very quickly reduce the number of traversals. What
use case would really create enough traversals that it should be a
performance concern?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-15 15:33                     ` Long, Martin
@ 2011-07-15 16:15                       ` Drew Northup
  0 siblings, 0 replies; 89+ messages in thread
From: Drew Northup @ 2011-07-15 16:15 UTC (permalink / raw)
  To: Long, Martin
  Cc: git, Jeff King, Junio C Hamano, Jakub Narebski, Linus Torvalds,
	Geert Bosch

On Fri, 2011-07-15 at 16:33 +0100, Long, Martin wrote:
> >
> > Firstly, I presume the generation number would not form part of the
> > SHA1 calculation? No? Cool.
> 
> I suspect this may be where my suggestion falls down. Though I suspect
> there is a case for object metadata which doesn't form part of the
> SHA. Would generation number tampering be a concern?

If you take Jeff's perspective on the purpose of generation numbers
(representing metadata about the DAG in a more readily-available format)
then "tampering" is not really a concern as the metadata is merely local
(to the running instance of Git) ephemera that we can cache between runs
for the sake of efficiency. Linus' perspective on generation numbers
seems to be of a more hard and fast type of data.

So, are we really talking about [corpus] generation numbers (used to
describe the state of the DAG in the way one describes his known family
tree) or are we talking about _revision_numbers_ (used to describe the
commit, as Subversion does)? I think we've got two (or more) groups
talking about different things (and aims) and trying to use the same
words to do so.

> Caching offers the ability to store that metadata, to provide the same
> performance gain, but maintain the integrity of the SHA chain.
> However, it does still leave the generation number liable to
> tampering, meaning a generic non-SHA metadata solution might be
> better.

I'm not sure where you are going with this. I wouldn't think "tampering"
with _current_DAG-based ephemera would do much other than create a
performance hit. If you are really talking about a static
_revision_number_ then that belongs in the commit, where it cannot be
changed (and may be completely meaningless when taken out of context, as
SVN revision numbers are). What such a number may entail is probably up
for discussion, but perhaps in a different thread.

> TBH, there are few situations where historical generations are useful
> - finding gen numbers of tags is one of them. Most cases are going to
> be for new commits, and in that case, a few new commits at the tip of
> each branch will very quickly reduce the number of traversals. What
> use case would really create enough traversals that it should be a
> performance concern?

The answer to this is found in a previous thread
http://article.gmane.org/gmane.comp.version-control.git/176807

(remember, generation number vs. revision number...)

Also, please don't cull the CC list! (Added Geert Bosch)

-- 
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:37 ` Jeff King
  2011-07-14 18:47   ` Linus Torvalds
@ 2011-07-14 18:52   ` Linus Torvalds
  2011-07-14 19:08     ` Jakub Narebski
  2011-07-14 20:26   ` Junio C Hamano
  2 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 18:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote:
>
> There's also one other issue with generation numbers. How do you handle
> grafts and object-replacement refs?  If you graft history, your embedded
> generation numbers will all be junk, and you can't trust them.

So I don't think this is a real problem in practice.

Grafts are already unreliable. You cannot sanely merge over a graft,
and it has nothing to do with generation numbers.

I'm actually sorry that we ever did grafting. It's fundamentally
broken, and can actually destroy your repository (by hiding real
parents and then causing the commits to get garbage collected). So I
don't think grafting should be used as an argument for or against
anything - it's a hack that breaks some fundamental git database
constraints.

                             Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:52   ` Linus Torvalds
@ 2011-07-14 19:08     ` Jakub Narebski
  0 siblings, 0 replies; 89+ messages in thread
From: Jakub Narebski @ 2011-07-14 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff King, Git Mailing List, Junio C Hamano, Jakub Narebski

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote:
> >
> > There's also one other issue with generation numbers. How do you handle
> > grafts and object-replacement refs?  If you graft history, your embedded
> > generation numbers will all be junk, and you can't trust them.
> 
> So I don't think this is a real problem in practice.
> 
> Grafts are already unreliable. You cannot sanely merge over a graft,
> and it has nothing to do with generation numbers.
> 
> I'm actually sorry that we ever did grafting. It's fundamentally
> broken, and can actually destroy your repository (by hiding real
> parents and then causing the commits to get garbage collected). So I
> don't think grafting should be used as an argument for or against
> anything - it's a hack that breaks some fundamental git database
> constraints.

What about object-replacement refs (i.e. "git replace" and refs/replace/)?

This is modern replacement for grafts mechanism, which is safe against
garbage collecting, and contrary to grafts it is transferable (as a ref).

With replacement objects (e.g. to repair some fragment of history to
make it bisectable - I think that was original idea behind introducing
git-replace, or instead of grafts to join with historical repository -
IIRC the reason why grafts mechanism was created) you can also have
invalid generation numbers if they are stored in commit headers.  With
generation cache we can simply invaliate it if grafts or replacements
change...

P.S. grafts are quite useful when doing history surgery.  Create
grafts, check history, use git-filter-branch to make new DAG
permanent, remove grafts.

P.P.S. What about "grafts lite", i.e. shallow clone?  With generation
cache we can invalidate it when depth changes...

-- 
Jakub Narębski
Poland

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 18:37 ` Jeff King
  2011-07-14 18:47   ` Linus Torvalds
  2011-07-14 18:52   ` Linus Torvalds
@ 2011-07-14 20:26   ` Junio C Hamano
  2011-07-14 20:41     ` Jeff King
  2 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2011-07-14 20:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Git Mailing List

Jeff King <peff@peff.net> writes:

> There's also one other issue with generation numbers. How do you handle
> grafts and object-replacement refs?  If you graft history, your embedded
> generation numbers will all be junk, and you can't trust them.

By the way, I doubt your "invalidate and recompute generation cache when
replacement changes" would really work when we consider object transfer
(which is the whole point of deprecating graft with object replacement
mechanism). For the purpose of connectivity check during object transfer,
we deliberately _ignore_ the object replacements, so you would at least
want to have an ability to show the generation number according to the
"true" history recorded in commits (which can come from Linus's in-commit
generation number once everybody migrates) and the generation number that
takes grafts and replacements into account (for which we cannot depend on
in-commit record).

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 20:26   ` Junio C Hamano
@ 2011-07-14 20:41     ` Jeff King
  2011-07-14 21:30       ` Junio C Hamano
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-14 20:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List

On Thu, Jul 14, 2011 at 01:26:32PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > There's also one other issue with generation numbers. How do you handle
> > grafts and object-replacement refs?  If you graft history, your embedded
> > generation numbers will all be junk, and you can't trust them.
> 
> By the way, I doubt your "invalidate and recompute generation cache when
> replacement changes" would really work when we consider object transfer
> (which is the whole point of deprecating graft with object replacement
> mechanism). For the purpose of connectivity check during object transfer,
> we deliberately _ignore_ the object replacements, so you would at least
> want to have an ability to show the generation number according to the
> "true" history recorded in commits (which can come from Linus's in-commit
> generation number once everybody migrates) and the generation number that
> takes grafts and replacements into account (for which we cannot depend on
> in-commit record).

It should actually work in that scenario, at least with replace refs,
but the performance is suboptimal. The copy of git doing the object
transfer will turn off read_replace_refs, our validity token will
not match, we will see that our cache is no longer valid, and
regenerate it. Another run with replace-refs turned on will do the same
thing in reverse. Even two programs running simultaneously will still be
correct, because the cache is replaced atomically.

However, there are two issues:

  1. I don't think grafts have a "respect grafts" flag in the same way;
     I haven't looked at how the packing code decides not to respect
     them, but the "stir graft info into the checksum" data should use
     the same check.

  2. If you do a lot of object transfer, you will ping-pong back and
     forth between cache versions, which is inefficient. It would
     probably be better to store the cache that is valid under condition
     $SHA1 as:

       .git/cache/generations/$SHA1

     In most cases, you would have a single file (i.e., you are not
     using replace refs at all). But if you did, then you keep two
     separate caches, one for the view from replace-refs, and one for
     the standard view.

If we ignore replace refs and grafts, as Linus suggested, and always
store the true generation number, then we could generate it at pack time
(and even put it in the pack index if we want to deal with a version
bump there).

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Git commit generation numbers
  2011-07-14 20:41     ` Jeff King
@ 2011-07-14 21:30       ` Junio C Hamano
  0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2011-07-14 21:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Git Mailing List

Jeff King <peff@peff.net> writes:

> It should actually work in that scenario, at least with replace refs,...
> regenerate it. Another run ...

I know; that is what I called "doubt it would really work". Having to
regenerate twice does not count as working.

> However, there are two issues:
>
>   1. I don't think grafts have a "respect grafts" flag in the same way;
>      I haven't looked at how the packing code decides not to respect
>      them, but the "stir graft info into the checksum" data should use
>      the same check.

I do not think graft and object transfer meshes well at all, so I wouldn't
worry about it.

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2011-09-06 10:02 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-17 18:27 Git commit generation numbers George Spelvin
2011-07-17 19:00 ` Long, Martin
2011-07-17 19:30 ` Linus Torvalds
2011-07-17 23:39   ` George Spelvin
2011-07-17 23:58     ` Linus Torvalds
2011-07-18  5:13       ` George Spelvin
2011-07-18 10:28         ` Anthony Van de Gejuchte
2011-07-18 11:48           ` George Spelvin
2011-07-20 20:51             ` Nicolas Pitre
2011-07-20 22:16               ` George Spelvin
2011-07-20 23:26                 ` david
2011-07-20 23:36                   ` Nicolas Pitre
2011-07-21  0:08                     ` Phil Hord
2011-07-21  0:18                       ` david
2011-07-21  0:37                         ` Shawn Pearce
2011-07-21  0:47                           ` Phil Hord
2011-07-21  4:26                           ` david
2011-07-21 12:43                             ` George Spelvin
2011-07-21 19:19                               ` Jakub Narebski
2011-07-21 20:27                                 ` George Spelvin
2011-07-21 20:33                                   ` Shawn Pearce
2011-07-22 12:18                                   ` Jakub Narebski
2011-07-22 13:09                                     ` Nicolas Pitre
2011-07-22 18:02                                       ` david
2011-07-22 18:34                                         ` Jakub Narebski
2011-07-22 19:06                                           ` Linus Torvalds
2011-07-22 22:02                                             ` Jeff King
2011-07-28 15:00                                             ` Felipe Contreras
2011-09-06 10:02                                               ` Ramkumar Ramachandra
2011-07-22 19:08                                           ` david
2011-07-22 19:40                                             ` Nicolas Pitre
2011-07-22 18:02                                     ` david
2011-07-21  0:39                         ` Phil Hord
2011-07-21  0:58                       ` Nicolas Pitre
2011-07-21  1:09                         ` Phil Hord
2011-07-21 12:03                   ` Drew Northup
2011-07-21 12:55                     ` George Spelvin
2011-07-21 15:57                       ` Drew Northup
2011-07-21 16:24                         ` Phil Hord
2011-07-21 22:40                           ` Pēteris Kļaviņš
2011-07-22  9:30                             ` Christian Couder
2011-07-21 17:36                         ` George Spelvin
  -- strict thread matches above, loose matches on Subject: below --
2011-07-14 18:24 Linus Torvalds
2011-07-14 18:37 ` Jeff King
2011-07-14 18:47   ` Linus Torvalds
2011-07-14 18:55     ` Linus Torvalds
2011-07-14 19:12       ` Jeff King
2011-07-14 19:46       ` Ted Ts'o
2011-07-14 19:51         ` Linus Torvalds
2011-07-14 20:07           ` Jeff King
2011-07-14 20:08           ` Ted Ts'o
2011-07-14 19:08     ` Jeff King
2011-07-14 19:23       ` Linus Torvalds
2011-07-14 20:01         ` Jeff King
2011-07-14 20:19           ` Linus Torvalds
2011-07-14 20:31             ` Jeff King
2011-07-15  1:19               ` Linus Torvalds
2011-07-15  2:41                 ` Geert Bosch
2011-07-15  7:46                 ` Jeff King
2011-07-15 16:10                   ` Linus Torvalds
2011-07-15 16:18                     ` Shawn Pearce
2011-07-15 16:44                       ` Linus Torvalds
2011-07-15 18:42                         ` Ted Ts'o
2011-07-15 19:00                           ` Linus Torvalds
2011-07-16  9:16                           ` Christian Couder
2011-07-18  3:41                             ` Jeff King
2011-07-19  4:14                               ` Christian Couder
2011-07-19 20:00                                 ` Jeff King
2011-07-21  6:29                                   ` Christian Couder
2011-07-15 18:46                         ` Tony Luck
2011-07-15 18:58                           ` Linus Torvalds
2011-07-15 19:48                     ` Jeff King
2011-07-15 20:07                       ` Jeff King
2011-07-15 21:17                       ` Linus Torvalds
2011-07-15 21:54                         ` Jeff King
2011-07-15 23:10                         ` Linus Torvalds
2011-07-15 23:16                           ` Linus Torvalds
2011-07-15 23:36                             ` Linus Torvalds
2011-07-16  0:42                               ` Jeff King
2011-07-16  0:40                           ` Jeff King
2011-07-15  9:12                 ` Jakub Narebski
2011-07-15  9:17                   ` Long, Martin
2011-07-15 15:33                     ` Long, Martin
2011-07-15 16:15                       ` Drew Northup
2011-07-14 18:52   ` Linus Torvalds
2011-07-14 19:08     ` Jakub Narebski
2011-07-14 20:26   ` Junio C Hamano
2011-07-14 20:41     ` Jeff King
2011-07-14 21:30       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).