* Re: Git commit generation numbers
@ 2011-07-17 18:27 George Spelvin
2011-07-17 19:00 ` Long, Martin
2011-07-17 19:30 ` Linus Torvalds
0 siblings, 2 replies; 89+ messages in thread
From: George Spelvin @ 2011-07-17 18:27 UTC (permalink / raw)
To: git; +Cc: linux, torvalds
> The thing I hate about it is very fundamental: I think it's a hack around a basic git
> design mistake. And it's a mistake we have known about for a long time.
>
> Now, I don't think it's a *fatal* mistake, but I do find it very broken to basically
> say "we made a mistake in the original commit design, and instead of fixing it we
> create a separate workaround for it".
>
> THAT I find distasteful. My reaction is that if we're going to add generation
> numbers, then we should just do it the way we should have done them originally,
> rather than as some separate hack.
There are a few design mistakes in git. The way the object type
and size are prefixed to the data for hashing purposes, which prevents
aligned fetching from memory-mapped data in the hashing code, isn't too
pretty either.
But git has generally preferred to avoid storing information that can
be recomputed. File renames are the big example. Given this, why the
heck store generation numbers?
They *can* be computed on demand, so arguably they *should*. Caching is
then an optimization, just like packs, pack indexes, the hashed object
storage directories, and all that.
I'm in the "make it a cache" camp, honestly.
For example, here's a different possible generation number scheme.
By making the generation number a cache, it becomes a valid alternative
to experiment with.
Simply store a topologically sorted list of commits. Each commit's
position can serve as a generation number, and is greater than the
positions of all ancestors. But by using the offset within the list,
the number is stored implicitly.
Generation numbers don't have to be consecutive as long as they're
correctly ordered, so you could, e.g. choose to make them unique.
I don't think this is actually worth it; I'm just using it as a
not-completely-insane example of a different design that nonetheless
achieves the same goal.
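To make the alternative scheme concrete, here is a minimal Python sketch
(an illustration only, not git code). It assumes a hypothetical toy history
given as a dict mapping each commit id to its parent ids, and uses the
standard-library graphlib module to produce a topological order in which
parents come before children. Each commit's index in that order then acts
as a unique, implicitly stored generation number.

```python
from graphlib import TopologicalSorter

# Hypothetical toy history: commit id -> list of parent ids.
parents = {
    "a": [],
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],  # merge commit
}


def topo_positions(parents):
    """Map each commit to its index in a topological order (parents first)."""
    # TopologicalSorter treats the dict values as predecessors, so every
    # parent is emitted before any of its children.
    order = TopologicalSorter(parents).static_order()
    return {commit: pos for pos, commit in enumerate(order)}


positions = topo_positions(parents)

# Every commit sorts after all of its parents (and hence all ancestors).
for commit, ps in parents.items():
    for p in ps:
        assert positions[commit] > positions[p]
```

Only the ordered list itself would need to be kept on disk; the number is
recovered from a commit's offset in it.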
Why freeze this in the object format?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-17 18:27 Git commit generation numbers George Spelvin
@ 2011-07-17 19:00 ` Long, Martin
2011-07-17 19:30 ` Linus Torvalds
1 sibling, 0 replies; 89+ messages in thread
From: Long, Martin @ 2011-07-17 19:00 UTC (permalink / raw)
To: George Spelvin; +Cc: git, torvalds

> Why freeze this in the object format?

Because if you put it in the object format, then it gets pushed and
pulled around, thereby putting generation numbers in every clone.

I'm starting to think put them in the object store, for exactly that
reason, and to start moving repositories in a direction where they look
more like they would if this had been done correctly from the start.

Then, because some operations are still going to create a lot of
traversals, a cache is always an option to improve the performance in
that area.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-17 18:27 Git commit generation numbers George Spelvin
2011-07-17 19:00 ` Long, Martin
@ 2011-07-17 19:30 ` Linus Torvalds
2011-07-17 23:39 ` George Spelvin
1 sibling, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-17 19:30 UTC (permalink / raw)
To: George Spelvin; +Cc: git

On Sun, Jul 17, 2011 at 11:27 AM, George Spelvin <linux@horizon.com> wrote:
>
> There are a few design mistakes in git. The way the object type
> and size are prefixed to the data for hashing purposes, which prevents
> aligned fetching from memory-mapped data in the hashing code, isn't too
> pretty either.

Why would you ever care? That makes no sense.

> But git has generally preferred to avoid storing information that can
> be recomputed. File renames are the big example. Given this, why the
> heck store generation numbers?

Guys, please don't bring up file renames. I explained once already why
bringing up file renames just makes you look like a f^&% moron.

Let me explain one more time:

 - Storing file renames is STUPID.

It's stupid for very fundamental reasons that have absolutely *NOTHING*
to do with "it can be computed later". It's fundamentally stupid because
it will FOREVER SCREW UP YOUR DATA, and because it will make merging an
unmitigated disaster and make your repository depend on how you
*created* your data, rather than on what the data is.

It will totally break the situation of one person doing a rename, while
another person does something else to the metadata (eg a create of the
same filename). Trying to track file identities will lead to very
fundamentally unsolvable issues like "which file identity do we choose
when two different files get the same name", or "which file identity
will we choose when one file splits in two".

Git doesn't track renames, because unlike pretty much every other SCM
out there, git really does have a good design, and because I damn well
understood the real problems.
So bringing it up as an example of "we don't store it because we can
compute it" is really totally idiotic. It's a sign of not understanding
the problems with renames.

Stop doing it. That argument is totally irrelevant. Really. It's like
saying "We shouldn't do generation numbers because fish don't use
bicycles". The only thing that kind of argument does is to make me
convinced that you don't understand the problem enough to be worth even
arguing with. It is not only a worthless argument, but it makes your
every other argument suspect.

Comprende? Stop it.

> They *can* be computed on demand, so arguably they *should*.

Umm, no. That's actually a really bad argument.

There are valid things that we "should" do, but they have nothing to do
with "if something can be done, it should be done". That's just a crazy
argument.

A thing we really *should* do is perform well. And be really reliable.
And support a distributed workflow. Those are real arguments that aren't
about "just because it's there".

Now, some of those arguments can then be used to say "don't bother
storing redundant data". For example, redundant data takes disk space
and network bandwidth, and if something can be recomputed cheaply (ie if
it doesn't have a negative impact on performance), then redundant data
is just bad.

And what appears like a much better argument (right now) is that some
data isn't needed AT ALL, because you can make do with other data
entirely (ie dates).

But "just because we could recompute it" is a bad bad reason.

The thing is, the very basic design of git is all about *incomplete* DAG
traversal. The DAG traversal part is pretty obvious and simple, but the
*partial* thing really is very very important. We absolutely need it for
reasonable scalability.

We've spent a *lot* of time in git development on trying to perform
really well by avoiding work.
Not just in revision traversal, but in many other areas too (like making
diff and merge much faster by being able to handle whole identical
recursive subdirectories by just checking the SHA1, for example).

That's a *really* fundamental design issue in git. Performance was
always a primary goal. And by primary, I really mean primary. As in
"more important than just about anything else". There were other primary
goals, but really not very many.

And there really aren't very good ways to limit DAG traversal.
Generation numbers are one of the very few fundamental ones. We hacked
around it with dates, and it works pretty well in practice (well enough
that I'm certainly ok with the hack), but it's definitely one of the
areas where git simply does something "wrong". It's simply not an
entirely reliable algorithm, and that fact makes me a bit uncomfortable
with it.

(Now, in theory, a global *approximate* time is theoretically possible
in a distributed environment, and as such it's arguable that "global
time with a slop that is based on the speed of light and knowledge of
location" is at least theoretically sound. So the real problem with
commit dates is that people simply don't have good clocks. So it's a
practical problem rather than a theoretical one, and it's a practical
problem that doesn't really cause enough problems in practice to not be
workable. But I'm making excuses for it, and I _know_ I'm making excuses
for it, so I'm not really happy about it)

And it's just about the only area where I am aware of git doing
something "wrong". Which is why I would like to have had generation
numbers even though the dates do work.

Anyway, to get back to the actual issue of caching vs not caching: if
you think "we could compute it dynamically" means that we should, then
we damn well shouldn't cache it either - why cache it, when you could
just compute it.
And if it's worth it to waste resources on the cache in order to avoid
performance issues, then it damn well would be ok to waste (fewer)
resources on just saving the generation number in the object data base.
And make that *fundamental* fix to a hack that git has had since pretty
much day one.

And btw, git didn't have the date-based hack originally, because I
didn't think it would be problematic enough. I thought that we could do
universally efficient partial DAG traversal - not having to go all the
way to the root - based purely on the DAG. The code in
"everybody_uninteresting()" tries to be that "limit DAG traversal by
only looking at the DAG itself", and it works for many simple
situations. But it turns out that it does *not* work for many other
cases.

So the generation number really is very very fundamental. It's
absolutely not some "additional information that can be computed",
because the whole AND ONLY point of having the number is to not compute
it.

We are never interested in the generation number for its own sake. We
are only interested in it in order to avoid having to look at the rest
of the DAG.

So no, the number fundamentally isn't computable, because computing it
obviates the need for it.

Linus
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-17 19:30 ` Linus Torvalds
@ 2011-07-17 23:39 ` George Spelvin
2011-07-17 23:58 ` Linus Torvalds
0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-17 23:39 UTC (permalink / raw)
To: linux, torvalds; +Cc: git

> So the generation number really is very very fundamental. It's
> absolutely not some "additional information that can be computed",
> because the whole AND ONLY point of having the number is to not
> compute it.
>
> We are never interested in the generation number for its own sake. We
> are only interested in it in order to avoid having to look at the rest
> of the DAG.

You're making my point and somehow not seeing it. What you're describing
here is the archetypical cache.

The only reason for having a memory cache is to avoid accessing memory!
The only reason for having a TLB is to avoid walking the page tables!
The only reason for having a page cache is to avoid hitting the disk!
The only reason for having a dcache is to avoid traversing the file
system directories!

And yes, the only reason for having a generation number cache is to
avoid traversing the DAG. D'oh. Do you think this is somehow news to
anyone?

The fundamental nature of a cache is that it lets you look something up
quickly that you could compute but don't want to.

I'm slapping my forehead like Homer Simpson here. The fact that
computing the generation number is expensive is why it's worth caching.
But the fact that it *can* be computed is a reason not to clutter the
published commit object format with it.

The generation number is NOT FUNDAMENTAL. It contains no information
that's not already in the DAG. The danger of putting it into a commit is
that you'll do it wrong, and thereby screw everything up.

If we have broken code that generates a broken cache, we fix the code
and the bugs magically go away. If we have broken code that generates a
broken commit object, we have a huge problem.
Just like we don't ship pack indexes around, but recompute them on
arrival. The index is essential for performance, but it's absolutely
non-essential for correctness.

As a general design principle, the exported data structures, like the
commits, should be as simple as possible. Do not include extraneous or
redundant data, because then you have to deal with the possibility of
inconsistency. This leads to bugs. (Frequently buffer overflow bugs.)

Maybe it would have been worth violating that principle during the
initial git design. I still see a good argument for not doing that even
if we had a time machine.

But now that the commit format is established and widely used, the
argument has far more force. Changing the commit format provides zero
functionality gain, and the performance gain can be obtained a different
way. Maybe a bit more code, but nothing extraordinary.

To me, the KISS principle says "don't change the commit format!"

Now, you complain about code complexity. But this is a read-only cache.
The generation number of a commit object never changes. There's no
update operation. Like an I-cache, if there's ever any problem, throw it
away.

Arguing that "the patch to put it in the commit object is smaller" is
stupidly short-sighted. Now every version of git from now until forever
has to support both kinds of commit objects. (And browsing old git trees
will forever be slow.)

You only take on that sort of legacy support burden if you absolutely
have to.

> But "just because we could recompute it" is a bad bad reason.

Bull puckey. You're ugly and stupid and WRONG.

It's an excellent reason. I'm amazed that you're not seeing it. The
principle is "don't include redundant data in a transport format."
Because it can be recomputed, it's redundant. Therefore, it shouldn't be
included in the transport format.

It's exactly the same principle as "don't store the indexes in the
database dump" and "don't store filename hashes in file system
archives".
This is a principle, not an iron-clad rule. It can be violated for good
and sufficient reasons, notably performance. But in this case, we can
get the performance without it. Without, in fact, changing the git
transport format at all.

And "don't change a widely-used transport format" is ANOTHER important
principle. Backward-compatible is much better than incompatible, but far
better to avoid changing it at all.

Breaking two such principles without an absolutely iron-clad reason is
ugly and stupid and wrong.

(As you well know, the more general principle is "don't store redundant
data AT ALL unless you need to for performance". Redundant data is A Bad
Thing. It can get out of sync. But if you have to, a private cache is
much better than an exchange format.)

Put another way, it IS stupid, it IS expendable, and therefore it SHOULD
go.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-17 23:39 ` George Spelvin
@ 2011-07-17 23:58 ` Linus Torvalds
2011-07-18 5:13 ` George Spelvin
0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-17 23:58 UTC (permalink / raw)
To: George Spelvin; +Cc: git

On Sun, Jul 17, 2011 at 4:39 PM, George Spelvin <linux@horizon.com> wrote:
>
> I'm slapping my forehead like Homer Simpson here. The fact that computing
> the generation number is expensive is why it's worth caching. But the
> fact that it *can* be computed is a reason not to clutter the published
> commit object format with it.

And I'm slapping *my* forehead.

Nobody has *ever* given a reason why the cache would be better than just
making it explicit.

That's my issue.

Why is that so hard for people to understand? The cache is just EXTRA
WORK.

To take your TLB example: it's like having a TLB for a page table that
would be as easy to just create in a way that it's *faster* to look up
in the actual data structure than it would be to look up in the cache.

Or to take your disk cache example: wouldn't you say that a disk cache
is a F&*&ING BAD IDEA if it is slower than the disk it caches?

Seriously.

Linus
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-17 23:58 ` Linus Torvalds
@ 2011-07-18 5:13 ` George Spelvin
2011-07-18 10:28 ` Anthony Van de Gejuchte
0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-18 5:13 UTC (permalink / raw)
To: linux, torvalds; +Cc: git

> Nobody has *ever* given a reason why the cache would be better than
> just making it explicit.

I thought I listed a few. Let me be clearer.

1) It involves changing the commit format. Since the change is
   backward-compatible, it's not too bad, but this is still fundamentally
   A Bad Thing, to be avoided if possible.

2) It can't be retrofitted to help historical browsing.

3) You have to support commits without generation numbers forever.
   This is a support burden. If you can generate generation numbers for
   an entire repository, including pre-existing commits, you can *throw
   out* the commit date heuristic code entirely.

4) It can't be made to work with grafts or replace objects.

5) It includes information which is redundant, but hard to verify,
   in git objects. Leading to potentially bizarre and version-dependent
   behaviour if it's wrong. (Checking that the numbers are consistent
   is the same work as regenerating a cache.)

6) It makes git commits slightly larger. (Okay, that's reaching.)

> Why is that so hard for people to understand? The cache is just EXTRA WORK.

That's why it *might* have been a good idea to include the number in
the original design. But now that the design is widely deployed, it's
better to avoid changing the design if not necessary.

With a bit of extra work, it's not necessary.

> To take your TLB example: it's like having a TLB for a page table that
> would be as easy to just create in a way that it's *faster* to look up
> in the actual data structure than it would be to look up in the cache.

You've subtly jumped points. The original point was that it's worth
precomputing and storing the generation numbers. I was trying to
say that this is fundamentally a caching operation.
Now we're talking about *where* to store the cached generation numbers.

Your point, which is a very valid one, is that they are to be stored
on disk, exactly one per commit, can be computed when the commit is
generated, and are accessed at the same time as the commit, so it makes
all kinds of sense to store them *with* the commits. As part of them,
even.

This has the huge benefit that it does away with the need for a
*separate* data structure. (Kinda sort of like the way AMD stores
instruction boundaries in the L1 I-cache, avoiding the need for a
separate data structure.)

I'm arguing that, despite this annoying overhead, there are valid
reasons to want to store it separately. There are some practical ones,
but the basic one is an esthetic/maintainability judgement of "less
cruft in the commit objects is worth more cruft in the code".

Git has done very well partly *because* of the minimality of its basic
persistent object database format. I think we should be very reluctant
to add to that without a demonstrated need that *cannot* be met in
another way.

In this particular case, a TLB is not a transport format. It's okay
to add redundant cruft to make it faster, because it only lasts until
the next reboot. (A more apropos, software-oriented analogy might be
"struct page".)

A git commit object *is* a transport format, one specifically designed
for transporting data a very long way forward in time, so it should be
designed with considerable care, and cruft ruthlessly eradicated.

Whatever you add to it has to be supported by every git implementation,
forever. As does every implementation bug ever produced.

A cache, on the other hand, is purely a local implementation detail.
It can be changed between versions with much less effort.

I agree it's more implementation work. But the upside is a cleaner
struct commit. Which is a very good thing.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-18 5:13 ` George Spelvin
@ 2011-07-18 10:28 ` Anthony Van de Gejuchte
2011-07-18 11:48 ` George Spelvin
0 siblings, 1 reply; 89+ messages in thread
From: Anthony Van de Gejuchte @ 2011-07-18 10:28 UTC (permalink / raw)
To: George Spelvin; +Cc: torvalds, git

On 18-jul-2011, at 07:13, George Spelvin wrote:
>> Nobody has *ever* given a reason why the cache would be better than
>> just making it explicit.
>
> I thought I listed a few. Let me be clearer.
>
> 1) It involves changing the commit format. Since the change is
> backward-compatible, it's not too bad, but this is still fundamentally
> A Bad Thing, to be avoided if possible.

Git is designed to ignore data in this case afaik, so I do not see any
reason why backwards-compatibility gets broken here.

>
> 2) It can't be retrofitted to help historical browsing.

I like to see more (valid) arguments, as I do not see what you are
trying to explain.

>
> 3) You have to support commits without generation numbers forever.
> This is a support burden. If you can generate generation numbers for
> an entire repository, including pre-existing commits, you can *throw
> out* the commit date heuristic code entirely.

I'll give you a few months to rethink at this statement until this
feature does get used widely. I think there was never a moment where
we would ever think to rebuild older commits as this would break the
hash of the commits where many people are potential looking for.

>
> 4) It can't be made to work with grafts or replace objects.
>
> 5) It includes information which is redundant, but hard to verify,
> in git objects. Leading to potentially bizarre and version-dependent
> behaviour if it's wrong. (Checking that the numbers are consistent
> is the same work as regenerating a cache.)

The data is *consistent* as long as the hash doesn't change, storing the
data in the commits *can* reduce resource and makes calculations cheaper.
Therefore, I think there are enough reasons to add the generation number
in the commit. Yes, many data can be calculated or can be an overhead,
but as Torvalds already said, it can be used as consistency check.

If the data does get wrong, then its probably caused by something stupid
enough to break the rules. Yes, this is a problem but I think there are
already enough reasons given, look back to the archives of this topic.

Ok, there is one possible thing that *can* go wrong and that is when you
are changing history with generation numbers with an older git client.
(And thats a good reason to communicate with others as clear as possible
about this feature, but its still not version-dependent as it doesn't
require a client to use it)

>
> 6) It makes git commits slightly larger. (Okay, that's reaching.)
>
>> Why is that so hard for people to understand? The cache is just EXTRA WORK.
>
> That's why it *might* have been a good idea to include the number in
> the original design. But now that the design is widely deployed, it's
> better to avoid changing the design if not necessary.
>
> With a bit of extra work, it's not necessary.
>
>> To take your TLB example: it's like having a TLB for a page table that
>> would be as easy to just create in a way that it's *faster* to look up
>> in the actual data structure than it would be to look up in the cache.
>
> You've subtly jumped points. The original point was that it's worth
> precomputing and storing the generation numbers. I was trying to
> say that this is fundamentally a caching operation.
>
> Now we're talking about *where* to store the cached generation numbers.
>
> Your point, which is a very valid one, is that they are to be stored
> on disk, exactly one per commit, can be computed when the commit is
> generated, and are accessed at the same time as the commit, so it makes
> all kinds of sense to store them *with* the commits. As part of them,
> even.
>
> This has the huge benefit that it does away with the need for a *separate*
> data structure. (Kinda sorts like the way AMD stores instruction
> boundaries in the L1 I-cache, avoiding the need for a separate data
> structure.)
>
> I'm arguing that, despite this annoying overhead, there are valid reasons
> to want to store it separately. There are some practical ones, but the
> basic one is an esthetic/maintainability judgement of "less cruft in
> the commit objects is worth more cruft in the code".
>
> Git has done very well partly *because* of the minimality of its basic
> persistent object database format. I think we should be very reluctant
> to add to that without a demonstrated need that *cannot* be met in
> another way.
>
>
> In this particular case, a TLB is not a transport format. It's okay
> to add redundant cruft to make it faster, because it only lasts until
> the next reboot. (A more apropos, software-oriented analogy might be
> "struct page".)
>
> A git commit object *is* a transport format, one specifically designed
> for transporting data a very long way forward in time, so it should be
> designed with considerable care, and cruft ruthlessly eradicated.
>
> Whatever you add to it has to be supported by every git implementation,
> forever. As does every implementation bug ever produced.
>
> A cache, on the other hand, is purely a local implementation detail.
> It can be changed between versions with much less effort.
>
> I agree it's more implementation work. But the upside is a cleaner
> struct commit. Which is a very good thing.

A cache would use more resources because they can become invalid at any
point and *should* be recalculated by every client. We are processing
data that *can* be reused by everybody with a git client which has this
specific feature, but does not break anything with an older client.
So please, calculate things only once as this may save a *lot* of time :-)

I would see more advantage in a cache if the data could differs on
every client, but that still doesn't mean that you should use one.

> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Maybe I shouldn't even have responded to this as I tend not to agree
with the given opinions to use a cache, even when I think that Torvalds
starts throwing arguments as well for certain reasons, but thats
probably my wrong thinking at it.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-18 10:28 ` Anthony Van de Gejuchte
@ 2011-07-18 11:48 ` George Spelvin
2011-07-20 20:51 ` Nicolas Pitre
0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-18 11:48 UTC (permalink / raw)
To: anthonyvdgent, linux; +Cc: git, torvalds

>> 1) It involves changing the commit format. Since the change is
>> backward-compatible, it's not too bad, but this is still fundamentally
>> A Bad Thing, to be avoided if possible.

> Git is designed to ignore data in this case afaik, so I do not see any
> reason why backwards-compatibility gets broken here.

That's what I just wrote. "The change is backward-compatible" is a
simpler and shorter way of writing "it doesn't break
backwards-compatibility" (to put the generation number in the commit
object).

I just said that *any* change is still undesirable.

>> 2) It can't be retrofitted to help historical browsing.

> I like to see more (valid) arguments, as I do not see what you are
> trying to explain.

I apologize for being unclear. I meant that if you store the generation
in the commit, then you can't add generation numbers to an existing
repository ("retrofit") in order to speed up --contains and --topo-sort
operations on pre-existing git repositories. (Without recomputing all
the hashes and breaking the ability to merge with people not using the
feature.)

As Linus points out, this is not likely to be a major performance issue
in practice, as operations like finding merge bases overwhelmingly use
recent objects (which will have generation numbers once the feature goes
in), but it is a measurable disadvantage.

>> 3) You have to support commits without generation numbers forever.
>> This is a support burden. If you can generate generation numbers for
>> an entire repository, including pre-existing commits, you can *throw
>> out* the commit date heuristic code entirely.

> I'll give you a few months to rethink at this statement until this
> feature does get used widely.
> I think there was never a moment where
> we would ever think to rebuild older commits as this would break the
> hash of the commits where many people are potential looking for.

I'm afraid that your English grammar is sufficiently mangled here that
I don't understand *your* point. Which is a shame because it's one of
my more important points.

Storing the generation number inside the commit means that a commit
with a generation number has a different hash than a commit without one.
This means that people won't want to break the hashes of existing
commits by adding them. In many cases, ever.

Which means that git will have to be able to work without the generation
numbers forever.

If the generation numbers are stored in a separate data structure that
can be added to an existing repository, then a new version of git can
do that when needed. Which lets git depend on always having the
generation numbers to do all history walking and stop using commit date
based heuristics completely.

>> 4) It can't be made to work with grafts or replace objects.
>>
>> 5) It includes information which is redundant, but hard to verify,
>> in git objects. Leading to potentially bizarre and version-dependent
>> behaviour if it's wrong. (Checking that the numbers are consistent
>> is the same work as regenerating a cache.)

> The data is *consistent* as long as the hash doesn't change, storing the
> data in the commits *can* reduce resource and makes calculations cheaper.

You're mixing up two issues. Storing the generation number *anywhere*
can make calculations cheaper. Storing them in the commit is indeed the
*simplest* place, but the calculation cost point is equally true if the
numbers are stored somewhere else.

As for consistency... I'm defining "consistent" as consistency between
the generation number and the parent pointers. This is the property
that the history-walking optimizations depend on.
A commit's generation number is consistent if it is larger than the
generation number of any of its parents. (Optionally, you may require
that it be larger by exactly 1.)

A generation number is *not* consistent if it is less than or equal to
the generation number of one of its parents. If this happens, history
walking code that uses the generation numbers will not produce correct
output. Further, the nature of the incorrectness will depend on
implementation details ("potentially bizarre and version-dependent
behaviour") of the history-walking code.

By computing the generation numbers when needed, the entire "what
happens if someone makes a commit with an inconsistent generation
number" problem goes away. It goes from "not likely to happen" or
"something that has to be checked for when receiving objects" to "can't
happen".

The computation to verify that an incoming commit's generation number
is consistent is exactly the same computation needed to compute the
generation number it should have: look up all parent commit generation
numbers and take the maximum. The only question is whether we store the
result after computing it, or compare with the included generation
number and possibly print an error message.

For example, suppose I generate a commit with a generation number of
UINT_MAX. Will this crash git? That's a new error condition the code has
to worry about. If I generate the generation number locally, I know that
can't happen in any repository that I can download in a reasonable
period of time.

If we had generation numbers from day 1, we could just require that they
always be checked, and an inconsistent object could be always rejected.

But since old git versions ignore the generation number in commits, a
bad generation number could spread a long way before someone notices it.
It becomes a visible problem. Not a really big one (I'm pretty sure that
refusing to pull it introduces no security holes), but it's an error
condition that we have to actually think about.
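The point that verifying and computing are the same work can be sketched
in a few lines of Python. This is an illustration only, not git code: the
history is a hypothetical dict of commit id to parent ids, the function
names are made up, and it assumes the "larger by exactly 1" convention
with root commits at generation 1.

```python
def generation(commit, parents, cache):
    """Generation = 1 + max generation of the parents; roots get 1.

    Results are memoised in `cache`. The walk is iterative so that long
    histories do not hit the recursion limit.
    """
    stack = [commit]
    while stack:
        c = stack[-1]
        if c in cache:
            stack.pop()
            continue
        missing = [p for p in parents[c] if p not in cache]
        if missing:
            stack.extend(missing)  # compute the parents first
        else:
            cache[c] = 1 + max((cache[p] for p in parents[c]), default=0)
            stack.pop()
    return cache[commit]


def is_consistent(commit, claimed, parents, cache):
    """A claimed number is consistent if it exceeds every parent's number."""
    return all(claimed > generation(p, parents, cache) for p in parents[commit])


# Hypothetical toy history: commit id -> list of parent ids.
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cache = {}
d_generation = generation("d", parents, cache)
d_claim_ok = is_consistent("d", d_generation, parents, cache)
d_claim_bad = is_consistent("d", 2, parents, cache)  # equal to a parent's number
```

Both functions look up the parents' numbers and take the maximum; the only
difference is whether the result is stored or compared against a claim.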
> A cache would use more resources because they can become invalid at any > point and *should* be recalculated by every client. We are processing > data that *can* be reused by everybody with a git client which has this > specific feature, but does not break anything with an older client. > > So please, calculate things only once as this may save a *lot* of time :-) This is silly. The cache can't become invalid except by disk corruption, which can corrupt numbers stored in the commit object just the same. (The corruption can be detected by git-fsck, but that's also true independent of where the numbers are stored.) And the work to recalculate the numbers is far less than the work to garbage collect, or repack, or generate the index of an incoming pack, or any of a dozen operations that are normally done by all clients. (Don't get me started on rename detection!) This is a completely misplaced optimization. Walking every commit in the repository takes a few seconds and enough memory that we don't want to do it every "git log" operation, but it's barely perceptible compared to other repository maintenance operations. Do it once when you install a new git software version and then you can forget about it. > I would see more advantage in a cache if the data could differs on > every client, but that still doesn't mean that you should use one. If you use grafts or replace objects, it can be. That's my point 4) above. Supporting these makes maintaining a cache trickier, but it's simply impossible to do with in-commit generation numbers. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-18 11:48 ` George Spelvin @ 2011-07-20 20:51 ` Nicolas Pitre 2011-07-20 22:16 ` George Spelvin 0 siblings, 1 reply; 89+ messages in thread From: Nicolas Pitre @ 2011-07-20 20:51 UTC (permalink / raw) To: George Spelvin; +Cc: anthonyvdgent, git, Linus Torvalds On Mon, 18 Jul 2011, George Spelvin wrote: > Storing the generation number inside the commit means that a commit > with a generation number has a different hash than a commit without one. > This means that people won't want to break the hashes of existing commits > by adding them. In many cases, ever. > > Which means that git will have to be able to work without the generation > numbers forever. I've been diverting myself from $day_job by reading through this thread. Still, I couldn't make my mind between having the generation number stored in the commit object or in a separate cache by reading all the arguments for each until now. Admittedly I'm not as involved in the design of Git as I once was, so my comments can be considered with the same proportions. Obviously, with a perfect design, we would have had gen numbers from the beginning. But we did mistakes, and now have to regret and live with them (and yes I have my own share of responsibility for some of those regrets which are now embodied in the Git data format). > If the generation numbers are stored in a separate data structure that > can be added to an existing repository, then a new version of git can > do that when needed. Which lets git depend on always having the the > generation numbers to do all history walking and stop using commit date > based heuristics completely. To me this is the killer argument. Being able to forget about the broken date heuristics entirely and simplify the code is what makes the external cache so fundamentally better as it can be applied to any existing repositories. 
And it has no backward compatibility issues, as old Git versions won't
work any worse if they can't make any use of that cache.

The alternative of having to sometimes use the generation number,
sometimes use the possibly broken commit date, makes for much more
complicated code that has to be maintained forever. Having a solution
that starts working only after a certain point in history doesn't look
elegant to me at all. It is not like having different pack formats
where back and forth conversions can be made for the _entire_ history.

And if you don't care about graft/replace then the cached data is
immutable just like the in-commit version would be, so there are no
consistency issues. If you do care about graft/replace (or who knows
what other DAG alteration scheme might be created 5 years from now)
then a separate cache will be required _anyway_, regardless of any
in-commit gen number.

All of which is to say that if a generation number is _really_ needed,
then it should go in a separate cache. Saying that if we had done it
initially then it would have been inside the commit object is not a
good enough justification to do it today if it can't be applied to the
whole of already existing repositories and avoid special cases.

I however have not formed any opinion on that fundamental question,
i.e. whether or not gen numbers are worth it in today's conditions.
Neither did I think about the actual cache format (I don't think that
adding it to the pack index is a good idea if grafts are to be honored)
which certainly has bearing on that fundamental question too.

But I don't see the point of starting to add them now to commit
objects, even if we regret not doing it initially, simply because
having them appear randomly based on the Git version/implementation
being used is still much uglier than some ad hoc cache or even not
having them at all.

Nicolas
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-20 20:51 ` Nicolas Pitre
@ 2011-07-20 22:16 ` George Spelvin
2011-07-20 23:26 ` david
0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-20 22:16 UTC (permalink / raw)
To: linux, nico; +Cc: anthonyvdgent, git, torvalds

> The alternative of having to sometimes use the generation number,
> sometimes use the possibly broken commit date, makes for much more
> complicated code that has to be maintained forever. Having a solution
> that starts working only after a certain point in history doesn't look
> elegant to me at all. It is not like having different pack formats
> where back and forth conversions can be made for the _entire_ history.

It seemed like a pretty strong argument to me, too.

> And if you don't care about graft/replace then the cached data is
> immutable just like the in-commit version would, so there is no
> consistency issues. If you do care about graft/replace (or who knows
> what other dag alteration scheme might be created in 5 years from now)
> then a separate cache will be required _anyway_, regardless of any
> in-commit gen number.

A possible workaround would be to keep track of the largest generation
number skew introduced by any graft, and add that safety factor into
the history-walking code, but that would be painful if you replace a
single large commit with an equivalent long development history, such
as adding a historical development tree behind a recently-cut-off one.

> Neither did I think about the actual cache format (I don't think that
> adding it to the pack index is a good idea if grafts are to be honored)
> which certainly has bearing on that fundamental question too.

I was thinking of something very close to the V2 pack index format.
http://book.git-scm.com/7_the_packfile.html

A magic number, a 256-entry fanout table, a sorted list of 20-byte
hashes, followed by a matching list of 4-byte generation numbers.
Ending with a 20-byte hash of the replaces and grafts state that this
cache is valid for, and a hash of the cache itself.

A bit of code factoring should make it easy to share much of the code.

It would certainly be possible to share the SHA1 table in an existing
pack index and store the generation numbers of the base (no
replacement) case, but you'd have to store null values for all the
non-commit objects.

That takes 4 bytes per object, while a separate list of commits takes
24 bytes per commit. A separate list is smaller if commits are less
than 1/6 of all objects. Looking at git's own object database, we have:

 66125 blobs   (45.50%)
 49292 trees   (33.92%)
 29554 commits (20.33%)
   362 tags    ( 0.25%)
145333 total

So we're actually a bit over the 16.67% break-even point, but not by
enough for it to be a real efficiency problem.

^ permalink raw reply [flat|nested] 89+ messages in thread
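The size comparison in the message above can be worked through
directly. The following is only a back-of-the-envelope sketch: the
function names are made up for illustration, and the fixed overhead
(magic number, fanout table, trailing hashes) is deliberately ignored.

```python
HASH_BYTES = 20  # one SHA-1 per entry in a separate commit list
GEN_BYTES = 4    # one generation number per entry


def shared_index_bytes(total_objects: int) -> int:
    """Reuse the pack index's hash table: 4 bytes for every object."""
    return GEN_BYTES * total_objects


def separate_list_bytes(commits: int) -> int:
    """Dedicated cache: 20-byte hash plus 4-byte number per commit."""
    return (HASH_BYTES + GEN_BYTES) * commits


# Object counts quoted in the message for git's own repository.
counts = {"blobs": 66125, "trees": 49292, "commits": 29554, "tags": 362}
total = sum(counts.values())

shared = shared_index_bytes(total)
separate = separate_list_bytes(counts["commits"])
break_even = GEN_BYTES / (HASH_BYTES + GEN_BYTES)  # 1/6 of all objects

print(f"shared index:  {shared} bytes")
print(f"separate list: {separate} bytes")
print(f"commit share:  {counts['commits'] / total:.2%} "
      f"(break-even {break_even:.2%})")
```

With these counts the separate list comes out roughly 125 KiB larger
than the shared table, which is the "not a real efficiency problem"
the message refers to.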
* Re: Git commit generation numbers
2011-07-20 22:16 ` George Spelvin
@ 2011-07-20 23:26 ` david
2011-07-20 23:36 ` Nicolas Pitre
2011-07-21 12:03 ` Drew Northup
0 siblings, 2 replies; 89+ messages in thread
From: david @ 2011-07-20 23:26 UTC (permalink / raw)
To: George Spelvin; +Cc: nico, anthonyvdgent, git, torvalds

On Wed, 20 Jul 2011, George Spelvin wrote:

>> The alternative of having to sometimes use the generation number,
>> sometimes use the possibly broken commit date, makes for much more
>> complicated code that has to be maintained forever. Having a solution
>> that starts working only after a certain point in history doesn't look
>> elegant to me at all. It is not like having different pack formats
>> where back and forth conversions can be made for the _entire_ history.
>
> It seemed like a pretty strong argument to me, too.

Except that you then have different caches on different systems.

If the generation number is part of the repository then it's going to
be the same for everyone.

In either case, you still have the different heuristics depending on
what version of git someone is running.

David Lang
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-20 23:26 ` david @ 2011-07-20 23:36 ` Nicolas Pitre 2011-07-21 0:08 ` Phil Hord 2011-07-21 12:03 ` Drew Northup 1 sibling, 1 reply; 89+ messages in thread From: Nicolas Pitre @ 2011-07-20 23:36 UTC (permalink / raw) To: david; +Cc: George Spelvin, anthonyvdgent, git, torvalds On Wed, 20 Jul 2011, david@lang.hm wrote: > On Wed, 20 Jul 2011, George Spelvin wrote: > > > > The alternative of having to sometimes use the generation number, > > > sometimes use the possibly broken commit date, makes for much more > > > complicated code that has to be maintained forever. Having a solution > > > that starts working only after a certain point in history doesn't look > > > eleguant to me at all. It is not like having different pack formats > > > where back and forth conversions can be made for the _entire_ history. > > > > It seemed like a pretty strong argument to me, too. > > except that you then have different caches on different systems. So what? > If the generation number is part of the repository then it's going to > be the same for everyone. The actual generation number will be, and has to be, the same for everyone with the same repository content, regardless of the cache used. It is a well defined number with no room to interpretation. > in either case, you still have the different heristics depending on what > version of git someone is running Indeed. Nicolas ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-20 23:36 ` Nicolas Pitre @ 2011-07-21 0:08 ` Phil Hord 2011-07-21 0:18 ` david 2011-07-21 0:58 ` Nicolas Pitre 0 siblings, 2 replies; 89+ messages in thread From: Phil Hord @ 2011-07-21 0:08 UTC (permalink / raw) To: Nicolas Pitre; +Cc: david, George Spelvin, anthonyvdgent, git, torvalds On 07/20/2011 07:36 PM, Nicolas Pitre wrote: > On Wed, 20 Jul 2011, david@lang.hm wrote: > >> If the generation number is part of the repository then it's going to >> be the same for everyone. > The actual generation number will be, and has to be, the same for > everyone with the same repository content, regardless of the cache used. > It is a well defined number with no room to interpretation. Nonsense. Even if the generation number is well-defined and shared by all clients, the only quasi-essential definition is "for each A in ancestors_of(B), gen(A) < gen(B)". In practice, the actual generation number *will be the same* for everyone with the same repository content, unless and until someone develops a different calculation method. But there is no reason to require that the number *has to be* the same for everyone unless you expect (or require) everyone to share their gen-caches. Surely there will be a competent and efficient gen-cache API. But most code can just ask if B --contains A or even just use rev-list and benefit from the increased speed of the answer. Because most code doesn't really care about the gen numbers themselves, but only the speed of determining ancestry. Phil ^ permalink raw reply [flat|nested] 89+ messages in thread
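Phil's point that only the ordering property matters can be
illustrated with a small Python sketch of an ancestry test. The
function name and the toy history are hypothetical, and the generation
numbers are deliberately not consecutive; the only thing the code
relies on is gen(A) < gen(B) whenever A is an ancestor of B.

```python
from typing import Dict, List


def is_ancestor(a: str, b: str,
                parents: Dict[str, List[str]],
                gen: Dict[str, int]) -> bool:
    """Return True if commit a is a proper ancestor of commit b."""
    # An ancestor always has a strictly smaller generation number.
    if gen[a] >= gen[b]:
        return False
    seen = set()
    stack = list(parents[b])
    while stack:
        commit = stack.pop()
        if commit == a:
            return True
        # Anything numbered at or below gen[a] cannot have a as an
        # ancestor, so that part of the history is never walked.
        if commit in seen or gen[commit] <= gen[a]:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


# Toy history: R is the root, A and X branch from it, B merges them.
parents = {"R": [], "X": ["R"], "A": ["R"], "B": ["A", "X"]}
gen = {"R": 0, "X": 5, "A": 10, "B": 20}  # any consistent numbering works

print(is_ancestor("A", "B", parents, gen))
print(is_ancestor("X", "A", parents, gen))
```

Callers only ever see the boolean answer, which is why the exact
numbers need not agree between two machines for this kind of query.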
* Re: Git commit generation numbers 2011-07-21 0:08 ` Phil Hord @ 2011-07-21 0:18 ` david 2011-07-21 0:37 ` Shawn Pearce 2011-07-21 0:39 ` Phil Hord 2011-07-21 0:58 ` Nicolas Pitre 1 sibling, 2 replies; 89+ messages in thread From: david @ 2011-07-21 0:18 UTC (permalink / raw) To: Phil Hord; +Cc: Nicolas Pitre, George Spelvin, anthonyvdgent, git, torvalds On Wed, 20 Jul 2011, Phil Hord wrote: > On 07/20/2011 07:36 PM, Nicolas Pitre wrote: >> On Wed, 20 Jul 2011, david@lang.hm wrote: >> >>> If the generation number is part of the repository then it's going to >>> be the same for everyone. >> The actual generation number will be, and has to be, the same for >> everyone with the same repository content, regardless of the cache used. >> It is a well defined number with no room to interpretation. > > Nonsense. > > Even if the generation number is well-defined and shared by all clients, the > only quasi-essential definition is "for each A in ancestors_of(B), gen(A) < > gen(B)". > > In practice, the actual generation number *will be the same* for everyone > with the same repository content, unless and until someone develops a > different calculation method. But there is no reason to require that the > number *has to be* the same for everyone unless you expect (or require) > everyone to share their gen-caches. and I think this is why Linus is not happy with a cache. He is seeing this as something that has significantly more value if it is going to be consistant in a distributed manner than if it's just something calculated locally that can be different from other systems. 
If it's just locally generated, then I could easily see generation
numbers being different on different people's systems, depending on
the order in which they see commits (either locally generated or
pulled from others).

If it's part of the commit, then as that commit gets propagated the
generation number gets propagated as well, and every repository will
agree on what the generation number is for any commit that's shared.

I agree that this consistency guarantee seems to be valuable.

> Surely there will be a competent and efficient gen-cache API. But most code
> can just ask if B --contains A or even just use rev-list and benefit from the
> increased speed of the answer. Because most code doesn't really care about
> the gen numbers themselves, but only the speed of determining ancestry.

In that case, why bother with generation numbers at all? The improved
date-based heuristic seems to solve that problem.

David Lang
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 0:18 ` david @ 2011-07-21 0:37 ` Shawn Pearce 2011-07-21 0:47 ` Phil Hord 2011-07-21 4:26 ` david 2011-07-21 0:39 ` Phil Hord 1 sibling, 2 replies; 89+ messages in thread From: Shawn Pearce @ 2011-07-21 0:37 UTC (permalink / raw) To: david Cc: Phil Hord, Nicolas Pitre, George Spelvin, anthonyvdgent, git, torvalds On Wed, Jul 20, 2011 at 17:18, <david@lang.hm> wrote: > > if it's just locally generated, then I could easily see generation numbers > being different on different people's ssstems, dependin on the order that > they see commits (either locally generated or pulled from others) But this should only happen if the user fudges with their Git sources and makes Git produce a different generation number. If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A)) + 1" then it doesn't matter who merged what commits, the same commit appears at the same part of the graph relative to all of its ancestors, and therefore always has the same generation number. This is true whether or not the commit contains the generation number. > If it's part of the commit, then as that commit gets propogated the > generation number gets propogated as well, and every repository will agree > on what the generation number is for any commit that's shared. This isn't really as beneficial as you are making it out to be. We already can agree on what the generation number should be for any given commit, if you topo-sort the commit DAG, you get the same result. > I agree that this consistancy guarantee seems to be valuable. Its valuable, but its consistent either with a cache, or not. -- Shawn. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 0:37 ` Shawn Pearce @ 2011-07-21 0:47 ` Phil Hord 2011-07-21 4:26 ` david 1 sibling, 0 replies; 89+ messages in thread From: Phil Hord @ 2011-07-21 0:47 UTC (permalink / raw) To: Shawn Pearce Cc: david, Nicolas Pitre, George Spelvin, anthonyvdgent, git, torvalds On 07/20/2011 08:37 PM, Shawn Pearce wrote: > On Wed, Jul 20, 2011 at 17:18,<david@lang.hm> wrote: >> if it's just locally generated, then I could easily see generation numbers >> being different on different people's ssstems, dependin on the order that >> they see commits (either locally generated or pulled from others) > But this should only happen if the user fudges with their Git sources > and makes Git produce a different generation number. > > If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A)) > + 1" then it doesn't matter who merged what commits, the same commit > appears at the same part of the graph relative to all of its > ancestors, and therefore always has the same generation number. This > is true whether or not the commit contains the generation number. Interesting. I was going to disagree with the latter part of your statement, but then I realized you're right. And that your algorithm allows duplicate generation numbers. And that there's nothing wrong with that. Because it meets the one quasi-essential need, "for each A in ancestors_of(B), gen(A) < gen(B)". >> If it's part of the commit, then as that commit gets propogated the >> generation number gets propogated as well, and every repository will agree >> on what the generation number is for any commit that's shared. > This isn't really as beneficial as you are making it out to be. We > already can agree on what the generation number should be for any > given commit, if you topo-sort the commit DAG, you get the same > result. > >> I agree that this consistancy guarantee seems to be valuable. > Its valuable, but its consistent either with a cache, or not. 
I still fail to see the value. Phil ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-21 0:37 ` Shawn Pearce
2011-07-21 0:47 ` Phil Hord
@ 2011-07-21 4:26 ` david
2011-07-21 12:43 ` George Spelvin
1 sibling, 1 reply; 89+ messages in thread
From: david @ 2011-07-21 4:26 UTC (permalink / raw)
To: Shawn Pearce
Cc: Phil Hord, Nicolas Pitre, George Spelvin, anthonyvdgent, git, torvalds

On Wed, 20 Jul 2011, Shawn Pearce wrote:

> On Wed, Jul 20, 2011 at 17:18, <david@lang.hm> wrote:
>>
>> if it's just locally generated, then I could easily see generation numbers
>> being different on different people's systems, depending on the order that
>> they see commits (either locally generated or pulled from others)
>
> But this should only happen if the user fudges with their Git sources
> and makes Git produce a different generation number.
>
> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
> + 1" then it doesn't matter who merged what commits, the same commit
> appears at the same part of the graph relative to all of its
> ancestors, and therefore always has the same generation number. This
> is true whether or not the commit contains the generation number.

I have to think about this more, but I'm wondering about cases where
the same result is achieved via different methods, something along the
lines of one person developing something with _many_ commits (creating
a large generation number) that one person merges far sooner than
another, causing the commits that they do after the merge to have much
larger generation numbers than someone making the same changes, but
doing the merge later.

Something like

  C9
    \
C2 - C10 - C11 - C12

vs

                 C9
                   \
C2 - C3 - C4 - C5 - C10

where the C10-12 in the first set and C3-5 in the second set are
completely unrelated to what's done in C9, and C12 in the first set
and C10 in the second set are identical trees.
now I know that part of a commit is what it's parents are, so that is different (and that may be enough to say that generations don't matter and this entire issue is moot), but I haven't thought about it long enough to convince myself what would (or should) happen in these cases. David Lang >> If it's part of the commit, then as that commit gets propogated the >> generation number gets propogated as well, and every repository will agree >> on what the generation number is for any commit that's shared. > > This isn't really as beneficial as you are making it out to be. We > already can agree on what the generation number should be for any > given commit, if you topo-sort the commit DAG, you get the same > result. > >> I agree that this consistancy guarantee seems to be valuable. > > Its valuable, but its consistent either with a cache, or not. > > ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-21 4:26 ` david
@ 2011-07-21 12:43 ` George Spelvin
2011-07-21 19:19 ` Jakub Narebski
0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-21 12:43 UTC (permalink / raw)
To: david, spearce; +Cc: anthonyvdgent, git, hordp, linux, nico, torvalds

<david@lang.hm> wrote:
> On Wed, 20 Jul 2011, Shawn Pearce wrote:
>> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
>> + 1" then it doesn't matter who merged what commits, the same commit
>> appears at the same part of the graph relative to all of its
>> ancestors, and therefore always has the same generation number. This
>> is true whether or not the commit contains the generation number.

> I have to think about this more, but I'm wondering about cases where the
> same result is achieved via different methods, something along the lines
> of one person developing something with _many_ commits (creating a large
> generation number) that one person merges far sooner than another, causing
> the commits that they do after the merge to have much larger generation
> numbers than someone making the same changes, but doing the merge later

Can't happen. Using the basic algorithm as Shawn described, the
generation number is defined uniquely by the ancestor DAG.

The generation number is the length of the longest path to a root
(zero-ancestor) commit through the DAG.

If you look at past discussion, several people have thought it was
okay to bake it into the commit precisely because it can be computed
once and will never change.

However, git does have some ability to amend the history DAG after
it's been written, using grafts and replace objects. These can change
generation numbers, precisely because they change the DAG.
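The longest-path definition can be checked mechanically. Below is a
minimal Python sketch, not git code: the function name is hypothetical,
roots are assumed to be numbered 0, and David's two diagrams are read
with C10 as the commit that merges C9 (the flattened diagrams leave
that slightly open).

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Longest path to a root for every commit, via iterative DFS."""
    gen: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in gen]
            if missing:
                # Number the parents first, then come back to this commit.
                stack.extend(missing)
            else:
                gen[commit] = max((gen[p] + 1 for p in parents[commit]),
                                  default=0)
                stack.pop()
    return gen


# David's first history: C9 is merged early, into C10.
first = {"C2": [], "C9": [], "C10": ["C2", "C9"],
         "C11": ["C10"], "C12": ["C11"]}
# David's second history: C9 is merged late, again into a commit named C10.
second = {"C2": [], "C9": [], "C3": ["C2"], "C4": ["C3"],
          "C5": ["C4"], "C10": ["C5", "C9"]}

print(generation_numbers(first))
print(generation_numbers(second))

# The order in which commits are visited does not change the result.
shuffled = dict(reversed(list(second.items())))
print(generation_numbers(shuffled) == generation_numbers(second))
```

The numbers depend only on the parent links, so two repositories
holding the same commits arrive at the same values regardless of when
each commit was fetched.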
> something like
>
>   C9
>     \
> C2 - C10 - C11 - C12
>
> vs
>
>                  C9
>                    \
> C2 - C3 - C4 - C5 - C10
>
> where the C10-12 in the first set and C3-5 in the second set are
> completely unrelated to what's done in C9 and C12 in the first set
> and C10 in the second set are identical trees.

The generation numbers in the above are as follows:

First example:
C2 = C9 = 0
C10 = 1 = max(C2, C9) + 1
C11 = 2 = C10 + 1
C12 = 3 = C11 + 1

Second example:
C2 = C9 = 0
C3 = 1 = C2 + 1
C4 = 2 = C3 + 1
C5 = 3 = C4 + 1
C10 = 4 = max(C5, C9) + 1

Now, the history pruning works fine if the "+1" is replaced by any
other positive increment, but it's not clear why you'd bother.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 12:43 ` George Spelvin @ 2011-07-21 19:19 ` Jakub Narebski 2011-07-21 20:27 ` George Spelvin 0 siblings, 1 reply; 89+ messages in thread From: Jakub Narebski @ 2011-07-21 19:19 UTC (permalink / raw) To: George Spelvin; +Cc: david, spearce, anthonyvdgent, git, hordp, nico, torvalds George Spelvin, could you please try not mangle CC to include only emails, stripping names (e.g. "spearce@spearce.org" instead of "Shawn Pearce <spearce@spearce.org>")? "George Spelvin" <linux@horizon.com> writes: > On <david@lang.hm> wrote: >> On Wed, 20 Jul 2011, Shawn Pearce wrote: >>> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A)) >>> + 1" then it doesn't matter who merged what commits, the same commit >>> appears at the same part of the graph relative to all of its >>> ancestors, and therefore always has the same generation number. This >>> is true whether or not the commit contains the generation number. > >> I have to think about this more, but I'm wondering about cases where the >> same result ia achieved via different methods, something along the lines >> of one person developing something with _many_ commits (creating a large >> generation number) that one person merges far sooner than another, causing >> the commits that they do after the merge to have much larger generation >> numbers than someone making the same changes, but doing the merge later > > Can't happen. Using the basic algorithm as Shawn described, the > generation number is defined uniquely by the ancestor DAG. > > The generation number is the length of the longest path to a > root (zero-ancestor) commit through the DAG. > > If you look at past discussion, several people have thought it was > okay to bake into the commit precsiely because it can be computed > once and will never change. > > However, git does have some ability to amend the history DAG after > it's been written, using grafts and replace objects. 
These can > change generation numbers, presisely because they change the DAG. There is also another issue that I have mentioned, namely incomplete clones - which currently means shallow clone, without access to full history. Nb. grafts are so horrible hack that I would be not against turning off generation numbers if they are used. In the case of replace objects you need both non-replaced and replaced DAG generation numbers. -- Jakub Narębski ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 19:19 ` Jakub Narebski @ 2011-07-21 20:27 ` George Spelvin 2011-07-21 20:33 ` Shawn Pearce 2011-07-22 12:18 ` Jakub Narebski 0 siblings, 2 replies; 89+ messages in thread From: George Spelvin @ 2011-07-21 20:27 UTC (permalink / raw) To: jnareb, linux; +Cc: anthonyvdgent, david, git, hordp, nico, spearce, torvalds > There is also another issue that I have mentioned, namely incomplete > clones - which currently means shallow clone, without access to full > history. As far as history walking is concerned, you can just consider "missing parent" the same as "no parent" and start the generation numbers at 0. As long as you recompute > Nb. grafts are so horrible hack that I would be not against turning > off generation numbers if they are used. Yeah, but it's not too miserable to add support (the logic is very similar to replace objects), and then you would be able to have the history walking code depend on the presence of generation numbers. (The "load the cache" function would regenerate it if necessary.) Only do this if you already have support for "no generation numbers" in the history walking code for (say) loose objects. > In the case of replace objects you need both non-replaced and replaced > DAG generation numbers. Yes, the cache validity/invalidation criteria are the tricky bit. Honestly, this is where the code gets ugly, not computing and storing the generation numbers. One thought on an expanded generation number cache: There are many git operations that use ONLY the commit DAG, and do not actually use any information from the commits other than their hashes and parent pointers. The ones that come to mind are rev-parse, rev-list, describe, name-rev, and merge-base. These could be sped up if, instead of just generation numbers, we kept a complete cached copy of the commit DAG, so the commit objects didn't have to be uncompressed and parsed. This could be provided by an extended form of generation number cache. 
In addition to listing the generation number of each commit, it would list all the ancestors (by file offset rather than hash, for compactness). Then simple commit walking could load this cache and avoid unpacking commit objects from packs. A compact implementation would abuse the flexibility of generation numbers to make them serve double duty. They would be used as offsets into a table of parent pointers. By keeping the table topologically sorted, the offsets would satisfy the requirements for generation numbers, but would be unique, and there would be additional gaps when a commit had multiple parents. The parent pointers would themselves be 31-bit offsets into the table of SHA-1 hashes, with the msbit meaning "this commit has multiple parents, also look at the following table entry". (If we use offset 0 to mean "no parents", it might be more convenient to have the offset point to the *end* of the run of parents rather than the beginning, so "following" would be earlier in the file, but that's an implementation detail.) I'm assuming that 2^31 commits having (in aggregate) 2^32 parents would be enough for the time being. As a local cache, it can be extended with a software upgrade. There's no need to ever have support for two formats in any given release; just notice that the cache format is wrong, blow it away, and regenerate it. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 20:27 ` George Spelvin @ 2011-07-21 20:33 ` Shawn Pearce 2011-07-22 12:18 ` Jakub Narebski 1 sibling, 0 replies; 89+ messages in thread From: Shawn Pearce @ 2011-07-21 20:33 UTC (permalink / raw) To: George Spelvin; +Cc: jnareb, anthonyvdgent, david, git, hordp, nico, torvalds On Thu, Jul 21, 2011 at 13:27, George Spelvin <linux@horizon.com> wrote: > > be enough for the time being. As a local cache, it can be extended > with a software upgrade. There's no need to ever have support for two > formats in any given release; just notice that the cache format is wrong, > blow it away, and regenerate it. Don't assume that. Consider a repository stored on NFS that is read-only to you. The NFS server has one version of Git installed, and is using cache format A. You have a newer version of Git installed on your workstation, using cache format B. Now you cannot use this repository as a local filesystem... its only available to you over the Git protocols. This breaks a number of people's environments. :-) Its better if we can avoid having to change file formats very often, even if they are a local "cache". -- Shawn. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 20:27 ` George Spelvin 2011-07-21 20:33 ` Shawn Pearce @ 2011-07-22 12:18 ` Jakub Narebski 2011-07-22 13:09 ` Nicolas Pitre 2011-07-22 18:02 ` david 1 sibling, 2 replies; 89+ messages in thread From: Jakub Narebski @ 2011-07-22 12:18 UTC (permalink / raw) To: George Spelvin Cc: Anthony Van de Gejuchte, David Lang, git, Phil Hord, Nicolas Pitre, Shawn Pearce, Linus Torvalds On Thu, 21 Jul 2011, George Spelvin wrote: > > There is also another issue that I have mentioned, namely incomplete > > clones - which currently means shallow clone, without access to full > > history. > > As far as history walking is concerned, you can just consider "missing > parent" the same as "no parent" and start the generation numbers at 0. > As long as you recompute. Well, shallow clone case can be considered both for putting 'true' generation numbers in commit header, and against it. For, because with generation numbers in commits you can use true generation numbers. Against, because if there are commits without generation numbers in header, you cannot assign true generation number, and you can only use "shallow" generation number, in generation numbers cache. > > Nb. grafts are so horrible hack that I would be not against turning > > off generation numbers if they are used. > > Yeah, but it's not too miserable to add support (the logic is very similar > to replace objects), and then you would be able to have the history walking > code depend on the presence of generation numbers. (The "load the cache" > function would regenerate it if necessary.) > > Only do this if you already have support for "no generation numbers" in > the history walking code for (say) loose objects. Grafts are non-transferable, and if you use them to cull rather than add history they are unsafe against garbage collection... I think. > > In the case of replace objects you need both non-replaced and replaced > > DAG generation numbers. 
> > Yes, the cache validity/invalidation criteria are the tricky bit. > Honestly, this is where the code gets ugly, not computing and storing > the generation numbers. BTW. with storing generation number in commit header there is a problem what would old version of git, one which does not understand said header, do during rebase. Would it strip unknown headers, or would it copy generation number verbatim - which means that it can be incorrect? BTW2. code size comparing in-commit and external cache cases must take into account yet to be written fsck for in-commit generation numbers. -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-22 12:18 ` Jakub Narebski @ 2011-07-22 13:09 ` Nicolas Pitre 2011-07-22 18:02 ` david 2011-07-22 18:02 ` david 1 sibling, 1 reply; 89+ messages in thread From: Nicolas Pitre @ 2011-07-22 13:09 UTC (permalink / raw) To: Jakub Narebski Cc: George Spelvin, Anthony Van de Gejuchte, David Lang, git, Phil Hord, Shawn Pearce, Linus Torvalds On Fri, 22 Jul 2011, Jakub Narebski wrote: > BTW. with storing generation number in commit header there is a problem > what would old version of git, one which does not understand said header, > do during rebase. Would it strip unknown headers, or would it copy > generation number verbatim - which means that it can be incorrect? They would indeed be copied verbatim and become incorrect. Nicolas ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-22 13:09 ` Nicolas Pitre @ 2011-07-22 18:02 ` david 2011-07-22 18:34 ` Jakub Narebski 0 siblings, 1 reply; 89+ messages in thread From: david @ 2011-07-22 18:02 UTC (permalink / raw) To: Nicolas Pitre Cc: Jakub Narebski, George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce, Linus Torvalds On Fri, 22 Jul 2011, Nicolas Pitre wrote: > On Fri, 22 Jul 2011, Jakub Narebski wrote: > >> BTW. with storing generation number in commit header there is a problem >> what would old version of git, one which does not understand said header, >> do during rebase. Would it strip unknown headers, or would it copy >> generation number verbatim - which means that it can be incorrect? > > They would indeed be copied verbatim and become incorrect. how would they become incorrect? David Lang ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-22 18:02 ` david
@ 2011-07-22 18:34 ` Jakub Narebski
2011-07-22 19:06 ` Linus Torvalds
2011-07-22 19:08 ` david
0 siblings, 2 replies; 89+ messages in thread
From: Jakub Narebski @ 2011-07-22 18:34 UTC (permalink / raw)
To: david
Cc: Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce, Linus Torvalds
On Fri, 22 Jul 2011, David Lang <david@lang.hm> wrote:
> On Fri, 22 Jul 2011, Nicolas Pitre wrote:
> > On Fri, 22 Jul 2011, Jakub Narebski wrote:
> >
> > > BTW. with storing generation number in commit header there is a problem
> > > what would old version of git, one which does not understand said header,
> > > do during rebase. Would it strip unknown headers, or would it copy
> > > generation number verbatim - which means that it can be incorrect?
> >
> > They would indeed be copied verbatim and become incorrect.
>
> how would they become incorrect?

Let's assume that the following history was created with new git, one
that correctly adds generation number header to commits:

  A(1)---B(2)---C(3)---D(4)---E(5)      <-- master
           \
            \----x(3)---y(4)---z(5)     <-- foo

The numbers are generation numbers in commit object.

Let's assume that this repository is fetched into repository instance
that is managed by older git, one that doesn't understand generation
header.

Then, if we do

  [old]$ git rebase master foo

and if old git _copies_ generation number header _verbatim_, we would
get:

  A(1)---B(2)---C(3)---D(4)---E(5)                          <-- master
                                \
                                 \---x'(3)--y'(4)--z'(5)    <-- foo

Those generation numbers are *incorrect*; they should be:

  A(1)---B(2)---C(3)---D(4)---E(5)                          <-- master
                                \
                                 \---x'(6)--y'(7)--z'(8)    <-- foo

That is IF unknown headers are copied verbatim during rebase. For
"encoding" header this is a good thing, for "generation" it isn't.

--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 89+ messages in thread
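The numbers in the diagrams above follow from the rule that a commit's generation is one more than the largest generation among its parents, with root commits numbered 1. The following is a minimal sketch of that rule applied to the same history before and after the rebase. The dictionary representation and the function name `generations` are illustrative assumptions, not Git code.

```python
def generations(parents):
    """Map each commit to 1 + max(generation of its parents); roots get 1."""
    gen = {}

    def visit(commit):
        if commit not in gen:
            gen[commit] = 1 + max((visit(p) for p in parents[commit]), default=0)
        return gen[commit]

    for commit in parents:
        visit(commit)
    return gen


# History before the rebase: foo forks from B.
before = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
          "x": ["B"], "y": ["x"], "z": ["y"]}
# After `git rebase master foo`: x' now sits on top of E.
after = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
         "x'": ["E"], "y'": ["x'"], "z'": ["y'"]}

gen_before = generations(before)
gen_after = generations(after)
print(gen_before["x"], gen_before["y"], gen_before["z"])     # 3 4 5
print(gen_after["x'"], gen_after["y'"], gen_after["z'"])     # 6 7 8
```

Recomputing from the new parents yields 6, 7, 8 for the rebased commits, which is why a verbatim copy of the old values 3, 4, 5 would be wrong.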
* Re: Git commit generation numbers 2011-07-22 18:34 ` Jakub Narebski @ 2011-07-22 19:06 ` Linus Torvalds 2011-07-22 22:02 ` Jeff King 2011-07-28 15:00 ` Felipe Contreras 2011-07-22 19:08 ` david 1 sibling, 2 replies; 89+ messages in thread From: Linus Torvalds @ 2011-07-22 19:06 UTC (permalink / raw) To: Jakub Narebski Cc: david, Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce On Fri, Jul 22, 2011 at 11:34 AM, Jakub Narebski <jnareb@gmail.com> wrote: > > That is IF unknown headers are copied verbatim during rebase. For > "encoding" header this is a good thing, for "generation" it isn't. Afaik, they aren't copied verbatim, and never have been. Afaik, the only thing that has *ever* written commits is "commit_tree()" (originally "main()" in commit-tree.c). Why is this red herring even being discussed? Of course you can always generate bogus commits by writing them by hand. But that's irrelevant. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-22 19:06 ` Linus Torvalds @ 2011-07-22 22:02 ` Jeff King 2011-07-28 15:00 ` Felipe Contreras 1 sibling, 0 replies; 89+ messages in thread From: Jeff King @ 2011-07-22 22:02 UTC (permalink / raw) To: Linus Torvalds Cc: Jakub Narebski, david, Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce On Fri, Jul 22, 2011 at 12:06:08PM -0700, Linus Torvalds wrote: > On Fri, Jul 22, 2011 at 11:34 AM, Jakub Narebski <jnareb@gmail.com> wrote: > > > > That is IF unknown headers are copied verbatim during rebase. For > > "encoding" header this is a good thing, for "generation" it isn't. > > Afaik, they aren't copied verbatim, and never have been. Afaik, the > only thing that has *ever* written commits is "commit_tree()" > (originally "main()" in commit-tree.c). Why is this red herring even > being discussed? In git.git, that is the case. There are other programs that may write git commits, though. Try: http://www.google.com/codesearch#search/&q=hash-object.*commit&type=cs Many uses seem OK (they are generating a commit from scratch). This one at least (the sixth result from the search above) would actually generate buggy generation headers (it modifies parents but passes other headers through): http://www.google.com/codesearch#XUVcT9DKB_U/replace&ct=rc&cd=7&q=hash-object.*commit It may be worth saying that such code is stupid and ugly and wrong, or that it is not deployed widely enough to care about. But it's not entirely a red herring. -Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-22 19:06 ` Linus Torvalds 2011-07-22 22:02 ` Jeff King @ 2011-07-28 15:00 ` Felipe Contreras 2011-09-06 10:02 ` Ramkumar Ramachandra 1 sibling, 1 reply; 89+ messages in thread From: Felipe Contreras @ 2011-07-28 15:00 UTC (permalink / raw) To: Linus Torvalds Cc: Jakub Narebski, david, Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce On Fri, Jul 22, 2011 at 10:06 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Jul 22, 2011 at 11:34 AM, Jakub Narebski <jnareb@gmail.com> wrote: >> >> That is IF unknown headers are copied verbatim during rebase. For >> "encoding" header this is a good thing, for "generation" it isn't. > > Afaik, they aren't copied verbatim, and never have been. Afaik, the > only thing that has *ever* written commits is "commit_tree()" > (originally "main()" in commit-tree.c). Why is this red herring even > being discussed? > > Of course you can always generate bogus commits by writing them by > hand. But that's irrelevant. Let's suppose for a moment that the commits do have these wrong generation numbers, shouldn't a fetch on the newer client check these and show an error? But what if they are pushed to a central server that has old version of git? It would be messy. -- Felipe Contreras ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-28 15:00 ` Felipe Contreras @ 2011-09-06 10:02 ` Ramkumar Ramachandra 0 siblings, 0 replies; 89+ messages in thread From: Ramkumar Ramachandra @ 2011-09-06 10:02 UTC (permalink / raw) To: Git List Cc: Linus Torvalds, Jeff King, Jakub Narebski, david, Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, Phil Hord, Shawn Pearce, Felipe Contreras Hi, First, let me start out by saying that I'm a fairly new contributor to Git, and I'm far less experienced than the other people on this thread. I've read through all the discussions time and again, and thought about the problem for some time now - I can't say I understand it as fully as many of you do, but I think I may have a slightly different perspective to offer. In what way is Git fundamentally different from Subversion? It's the simplicity of the data model. From the simplest building block, a key-value store, we have been able to compose and build things on top of it. The reason we built centralized version control systems earlier is because it was *easier* to address the composition problems. We dumped all related repositories and problems into one central server. With so much information in one place, things are tightly coupled and problems are easier to solve. Still not convinced? What's the weakest component in Git today? Undoubtedly submodules. Of course, a large part of the reason is that many people don't use submodules, and hence it doesn't improve -- but it's actually a circular problem. People don't use submodules, because it's so featureless and hard to develop. Why is it so hard? Back to the fundamental problem of composition from simple building blocks. In submodules, we have to take entire DAGs and build a composite DAG. The key pieces of information are deep inside Git's fundamentals: Gitlinks. 
Other projects like Gitslave try to attack the problem on a more superficial level, but they all hit a barrier when they discover that they can't compose big blocks of data: you need simple building blocks to compose. It's the same story with C (and now, Haskell). Why does everyone like C so much? Because it only provides fundamental building blocks and gives people the freedom to compose the way they like. It doesn't provide big "template blocks" like Java, because they tend to be restrictive in the long run. Sure, Java is easier to start out with, but people soon realize that big blocks can't compose. More than arguing about backward compatibility, and about how older versions of Git commits won't have generation numbers, I think this is what we should be focusing on. Sure, it'll additionally make sense to put in a cache to speed things up now, but we need to think about what Git will be 10~15 years from now. The fundamental pieces of information required for composition must be present in the fundamental building blocks. The real question we should be asking is: "Should Git have had commit generation numbers in 2005?". If the answer is "yes", we should put them in now before it becomes even harder, bending over backwards for backward compatibility if necessary. Otherwise, we'll regret this decision 10~15 years later, when we're faced with deeper issues. If you want a concrete example, think about how you'd compose DAGs together (again, the submodules problem): where is the information required to prune each DAG and compose? I wish I could write this in myself, but I'm afraid I don't have the engineering skill yet. I'll be happy to contribute whatever little I can, and participate in the review process. Thanks. -- Ram ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-22 18:34 ` Jakub Narebski 2011-07-22 19:06 ` Linus Torvalds @ 2011-07-22 19:08 ` david 2011-07-22 19:40 ` Nicolas Pitre 1 sibling, 1 reply; 89+ messages in thread From: david @ 2011-07-22 19:08 UTC (permalink / raw) To: Jakub Narebski Cc: Nicolas Pitre, George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce, Linus Torvalds On Fri, 22 Jul 2011, Jakub Narebski wrote: > On Fri, 22 Jul 2011, David Lang <david@lang.hm> wrote: >> On Fri, 22 Jul 2011, Nicolas Pitre wrote: >>> On Fri, 22 Jul 2011, Jakub Narebski wrote: >>> >>>> BTW. with storing generation number in commit header there is a problem >>>> what would old version of git, one which does not understand said header, >>>> do during rebase. Would it strip unknown headers, or would it copy >>>> generation number verbatim - which means that it can be incorrect? >>> >>> They would indeed be copied verbatim and become incorrect. >> >> how would they become incorrect? > > Let's assume that the following history was created with new git, one > that correcly adds generation number header to commits: > > > A(1)---B(2)---C(3)---D(4)---E(5) <-- master > \ > \----x(3)---y(4)---z(5) <-- foo > > The numbers are generation numbers in commit object. > > Let's assume that this repository is fetched into repository instance > that is managed by older git, one that doesn't understand generation > header. > > Then, if we do > > [old]$ git rebase master foo > > and if old git _copies_ generation number header _verbatim_, we would > get: > > A(1)---B(2)---C(3)---D(4)---E(5) <-- master > \ > \---x'(3)--y'(4)--z'(5) <-- foo > > Those generation numbers are *incorrect*; they should be: > > A(1)---B(2)---C(3)---D(4)---E(5) <-- master > \ > \---x'(6)--y'(7)--z'(8) <-- foo > > > That is IF unknown headers are copied verbatim during rebase. For > "encoding" header this is a good thing, for "generation" it isn't. 
commit headers are _not_ copied during rebase

a rebase is not the exact same commit, it's a "logically equivalent"
commit.

so when you do a rebase, you change the commit headers (you have to change
the parent headers in any case, and you would have to change the
generation numbers as well)

this was discussed earlier in this thread.

David Lang
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-22 19:08 ` david @ 2011-07-22 19:40 ` Nicolas Pitre 0 siblings, 0 replies; 89+ messages in thread From: Nicolas Pitre @ 2011-07-22 19:40 UTC (permalink / raw) To: david Cc: Jakub Narebski, George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Shawn Pearce, Linus Torvalds On Fri, 22 Jul 2011, david@lang.hm wrote: > On Fri, 22 Jul 2011, Jakub Narebski wrote: > > > That is IF unknown headers are copied verbatim during rebase. For > > "encoding" header this is a good thing, for "generation" it isn't. > > commit headers are _not_ copied during rebase Yes, this turns out to be true as I forgot that rebase is constructed on top of format-patch+am, and format-patch doesn't preserve the ancillary headers such as the existing "encoding" header, or the hypothetical "generation" header. Nicolas ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-22 12:18 ` Jakub Narebski 2011-07-22 13:09 ` Nicolas Pitre @ 2011-07-22 18:02 ` david 1 sibling, 0 replies; 89+ messages in thread From: david @ 2011-07-22 18:02 UTC (permalink / raw) To: Jakub Narebski Cc: George Spelvin, Anthony Van de Gejuchte, git, Phil Hord, Nicolas Pitre, Shawn Pearce, Linus Torvalds On Fri, 22 Jul 2011, Jakub Narebski wrote: >> Yes, the cache validity/invalidation criteria are the tricky bit. >> Honestly, this is where the code gets ugly, not computing and storing >> the generation numbers. > > BTW. with storing generation number in commit header there is a problem > what would old version of git, one which does not understand said header, > do during rebase. Would it strip unknown headers, or would it copy > generation number verbatim - which means that it can be incorrect? Linus has already pointed out that this is safe. old versions won't create generation numbers, but they will ignore them if they exist. Since commits are not modified after they are created, the old versions don't copy or modify them. David Lang ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 0:18 ` david 2011-07-21 0:37 ` Shawn Pearce @ 2011-07-21 0:39 ` Phil Hord 1 sibling, 0 replies; 89+ messages in thread From: Phil Hord @ 2011-07-21 0:39 UTC (permalink / raw) To: david; +Cc: Nicolas Pitre, George Spelvin, anthonyvdgent, git, torvalds On 07/20/2011 08:18 PM, david@lang.hm wrote: > On Wed, 20 Jul 2011, Phil Hord wrote: > >> On 07/20/2011 07:36 PM, Nicolas Pitre wrote: >>> On Wed, 20 Jul 2011, david@lang.hm wrote: >>> >>>> If the generation number is part of the repository then it's going to >>>> be the same for everyone. >>> The actual generation number will be, and has to be, the same for >>> everyone with the same repository content, regardless of the cache >>> used. >>> It is a well defined number with no room to interpretation. >> >> Nonsense. >> >> Even if the generation number is well-defined and shared by all >> clients, the only quasi-essential definition is "for each A in >> ancestors_of(B), gen(A) < gen(B)". >> >> In practice, the actual generation number *will be the same* for >> everyone with the same repository content, unless and until someone >> develops a different calculation method. But there is no reason to >> require that the number *has to be* the same for everyone unless you >> expect (or require) everyone to share their gen-caches. > > and I think this is why Linus is not happy with a cache. He is seeing > this as something that has significantly more value if it is going to > be consistant in a distributed manner than if it's just something > calculated locally that can be different from other systems. It will only be used locally, so it needn't be consistent with anyone else's. 
> > if it's just locally generated, then I could easily see generation > numbers being different on different people's ssstems, dependin on the > order that they see commits (either locally generated or pulled from > others) > > If it's part of the commit, then as that commit gets propogated the > generation number gets propogated as well, and every repository will > agree on what the generation number is for any commit that's shared. > > I agree that this consistancy guarantee seems to be valuable. I can't see why. >> Surely there will be a competent and efficient gen-cache API. But >> most code can just ask if B --contains A or even just use rev-list >> and benefit from the increased speed of the answer. Because most >> code doesn't really care about the gen numbers themselves, but only >> the speed of determining ancestry. > > in that case, why bother with generation numbers at all? the improved > data based heristic seems to solve that problem. Does it? Surely the ruckus would've died down in that case. But I haven't been reading pu. It seems to me that the main drawback to a gen-cache is that it slows down the first operation after even a local clone (with just hardlinks). On the other hand, I see too many nails in the distributed-gen-numbers coffin: legacy commits can't catch up (and therefore suffer), and legacy clients can trash or corrupt even "new-style" commits. Phil ^ permalink raw reply [flat|nested] 89+ messages in thread
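The point that most code only wants a faster ancestry answer can be illustrated with a walk that uses generation numbers purely as a cutoff. This is a sketch under assumptions: the graph is a plain dictionary, the numbers come from a local cache, and `is_ancestor` is a hypothetical helper name, not a Git function.

```python
def is_ancestor(a, b, parents, gen):
    """Return True if a is b or an ancestor of b.

    Any commit whose generation is <= gen[a] (other than a itself) cannot
    have a as an ancestor, so the walk never descends below that level.
    """
    stack, seen = [b], set()
    while stack:
        commit = stack.pop()
        if commit == a:
            return True
        if commit in seen or gen[commit] <= gen[a]:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


# Illustrative history: master is A..E, and branch x-y-z forks from B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
           "x": ["B"], "y": ["x"], "z": ["y"]}
gen = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "x": 3, "y": 4, "z": 5}

print(is_ancestor("B", "z", parents, gen))   # True
print(is_ancestor("C", "z", parents, gen))   # False, without visiting B or A
```

Only the ordering property matters here; the walk gives the same answers for any numbering in which every commit is numbered above all of its ancestors.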
* Re: Git commit generation numbers 2011-07-21 0:08 ` Phil Hord 2011-07-21 0:18 ` david @ 2011-07-21 0:58 ` Nicolas Pitre 2011-07-21 1:09 ` Phil Hord 1 sibling, 1 reply; 89+ messages in thread From: Nicolas Pitre @ 2011-07-21 0:58 UTC (permalink / raw) To: Phil Hord; +Cc: david, George Spelvin, anthonyvdgent, git, torvalds On Wed, 20 Jul 2011, Phil Hord wrote: > On 07/20/2011 07:36 PM, Nicolas Pitre wrote: > > On Wed, 20 Jul 2011, david@lang.hm wrote: > > > > > If the generation number is part of the repository then it's going to > > > be the same for everyone. > > The actual generation number will be, and has to be, the same for > > everyone with the same repository content, regardless of the cache used. > > It is a well defined number with no room to interpretation. > > Nonsense. > > Even if the generation number is well-defined and shared by all clients, the > only quasi-essential definition is "for each A in ancestors_of(B), gen(A) < > gen(B)". Sure. But what do you gain by making holes in the sequence? > In practice, the actual generation number *will be the same* for everyone with > the same repository content, unless and until someone develops a different > calculation method. But there is no reason to require that the number *has to > be* the same for everyone unless you expect (or require) everyone to share > their gen-caches. And with the above you clearly reinforced the argument _against_ storing the generation number in the commit object. If you can imagine a different calculation method already, and if it is actually useful, then who knows if something even better could be done eventually. Nicolas ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 0:58 ` Nicolas Pitre @ 2011-07-21 1:09 ` Phil Hord 0 siblings, 0 replies; 89+ messages in thread From: Phil Hord @ 2011-07-21 1:09 UTC (permalink / raw) To: Nicolas Pitre; +Cc: david, George Spelvin, anthonyvdgent, git, torvalds On 07/20/2011 08:58 PM, Nicolas Pitre wrote: > On Wed, 20 Jul 2011, Phil Hord wrote: > >> On 07/20/2011 07:36 PM, Nicolas Pitre wrote: >>> On Wed, 20 Jul 2011, david@lang.hm wrote: >>> >>>> If the generation number is part of the repository then it's going to >>>> be the same for everyone. >>> The actual generation number will be, and has to be, the same for >>> everyone with the same repository content, regardless of the cache used. >>> It is a well defined number with no room to interpretation. >> Nonsense. >> >> Even if the generation number is well-defined and shared by all clients, the >> only quasi-essential definition is "for each A in ancestors_of(B), gen(A)< >> gen(B)". > Sure. But what do you gain by making holes in the sequence? Depends on the algorithm. Probably speed. Possibly more efficient limited-cache building (jit-style discovery in reverse, as-needed, for example). What do you gain by enforcing contiguousness? Why not require all gen numbers to be even? Or prime? ;) >> In practice, the actual generation number *will be the same* for everyone with >> the same repository content, unless and until someone develops a different >> calculation method. But there is no reason to require that the number *has to >> be* the same for everyone unless you expect (or require) everyone to share >> their gen-caches. > And with the above you clearly reinforced the argument _against_ storing > the generation number in the commit object. If you can imagine a > different calculation method already, and if it is actually useful, then > who knows if something even better could be done eventually. Good. Nice to see I'm being self-consistent, then. 
Phil ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-20 23:26 ` david 2011-07-20 23:36 ` Nicolas Pitre @ 2011-07-21 12:03 ` Drew Northup 2011-07-21 12:55 ` George Spelvin 1 sibling, 1 reply; 89+ messages in thread From: Drew Northup @ 2011-07-21 12:03 UTC (permalink / raw) To: david; +Cc: George Spelvin, nico, anthonyvdgent, git, torvalds On Wed, 2011-07-20 at 16:26 -0700, david@lang.hm wrote: > On Wed, 20 Jul 2011, George Spelvin wrote: > > >> The alternative of having to sometimes use the generation number, > >> sometimes use the possibly broken commit date, makes for much more > >> complicated code that has to be maintained forever. Having a solution > >> that starts working only after a certain point in history doesn't look > >> eleguant to me at all. It is not like having different pack formats > >> where back and forth conversions can be made for the _entire_ history. > > > > It seemed like a pretty strong argument to me, too. > > except that you then have different caches on different systems. If the > generation number is part of the repository then it's going to be the same > for everyone. I keep hearing (reading) people stating this utterly unfounded argument. The fact is that for any work not yet integrated back into a shared repository it just isn't true--and even after upstream integration the truth of such a statement may be limited. I have not read yet one discussion about how generation numbers [baked into a commit] deal with rebasing, for instance. Do we assign one more than the revision prior to the base of the rebase operation or do we start with the revision one after the highest of those original commits included in the rebase? Depending on how that is done _drastically_different_ numbers can come out of different repository instances for the same _final_ DAG. This is one major reason why, as I see it, local storage is good for generation numbers and putting them in the commit is bad. 
I have no problem with putting an _advisory_ "revision number" in the commit. It would not be expected to have a proper "1-to-1 and onto" functional association with the _final_ DAG, but it could potentially get us some nice benefits. We would still need to answer questions like the one I ask above, but it would hurt less to change if we need to. One other sane option that was mentioned at least once in passing was to store the generation number in some Git "filesystem-level" object. This could then be reconciled with each "git gc" or "git fsck" operation if not more often. This is less ad-hoc and messy than a separate cache, becomes amenable to the standard tool-set, and always gets updated (no invalid cache). If an _advisory_ revision number is available in commits that are sent along those could conceivably be used to help build up the local git-fs generation numbers more quickly. (If a "git pull" is issued to our repo, or we push to another, we don't send the generation numbers locally stored--we expect the git-fs machinery to regenerate those on the fly.) I may not be one of the "resident rocket scientists," but that's how I see it. -- -Drew Northup ________________________________________________ "As opposed to vegetable or mineral error?" -John Pescatore, SANS NewsBites Vol. 12 Num. 59 ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-21 12:03 ` Drew Northup
@ 2011-07-21 12:55 ` George Spelvin
2011-07-21 15:57 ` Drew Northup
0 siblings, 1 reply; 89+ messages in thread
From: George Spelvin @ 2011-07-21 12:55 UTC (permalink / raw)
To: david, drew.northup; +Cc: anthonyvdgent, git, linux, nico, torvalds
> I have not read yet one discussion about how generation numbers [baked
> into a commit] deal with rebasing, for instance. Do we assign one more
> than the revision prior to the base of the rebase operation or do we
> start with the revision one after the highest of those original commits
> included in the rebase? Depending on how that is done
> _drastically_different_ numbers can come out of different repository
> instances for the same _final_ DAG. This is one major reason why, as I
> see it, local storage is good for generation numbers and putting them in
> the commit is bad.

Er, no. Whenever a new commit object is generated (as the result
of a rebase or not), its commit number is computed based on its
parent commits. It is NEVER copied.

Just like the parent pointers themselves.

Remember, even though we talk about "the same commit" after rebasing,
it's really just an EQUIVALENT commit according to some higher-level
concept of similarity. As far as the core git engine is concerned,
it's always a DIFFERENT commit, with different parent hashes and a
different hash itself.

This point hasn't been mentioned explicitly precisely because it's
so obvious; the history-walking code that the generation numbers are
for requires this property to function.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-21 12:55 ` George Spelvin
@ 2011-07-21 15:57 ` Drew Northup
2011-07-21 16:24 ` Phil Hord
2011-07-21 17:36 ` George Spelvin
0 siblings, 2 replies; 89+ messages in thread
From: Drew Northup @ 2011-07-21 15:57 UTC (permalink / raw)
To: George Spelvin; +Cc: david, anthonyvdgent, git, nico, torvalds
On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:
> > I have not read yet one discussion about how generation numbers [baked
> > into a commit] deal with rebasing, for instance. Do we assign one more
> > than the revision prior to the base of the rebase operation or do we
> > start with the revision one after the highest of those original commits
> > included in the rebase? Depending on how that is done
> > _drastically_different_ numbers can come out of different repository
> > instances for the same _final_ DAG. This is one major reason why, as I
> > see it, local storage is good for generation numbers and putting them in
> > the commit is bad.
>
> Er, no. Whenever a new commit object is generated (as the result
> of a rebase or not), its commit number is computed based on its
> parent commits. It is NEVER copied.

I don't see the word "copy" in my original.

B-O1-O2-O3-O4-O5-O6
 \
  R1----R2-------R3

What's the correct generation number for R3? I would say gen(B)+3. My
reading of the posts made by some others was that they thought gen(O6)
was the correct answer. Still others seemed to indicate gen(O6)+1 was
the correct answer. I don't think everybody MEANT to be saying such
different things--that's just how they appeared on this end.

Now, did you mean something different by "commit number?"

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 15:57 ` Drew Northup @ 2011-07-21 16:24 ` Phil Hord 2011-07-21 22:40 ` Pēteris Kļaviņš 2011-07-21 17:36 ` George Spelvin 1 sibling, 1 reply; 89+ messages in thread From: Phil Hord @ 2011-07-21 16:24 UTC (permalink / raw) To: Drew Northup; +Cc: George Spelvin, david, anthonyvdgent, git, nico, torvalds On 07/21/2011 11:57 AM, Drew Northup wrote: > On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote: >>> I have not read yet one discussion about how generation numbers [baked >>> into a commit] deal with rebasing, for instance. Do we assign one more >>> than the revision prior to the base of the rebase operation or do we >>> start with the revision one after the highest of those original commits >>> included in the rebase? Depending on how that is done >>> _drastically_different_ numbers can come out of different repository >>> instances for the same _final_ DAG. This is one major reason why, as I >>> see it, local storage is good for generation numbers and putting them in >>> the commit is bad. >> Er, no. Whenever a new commit object is generated (as the result >> of a rebase or not), its commit number is computed based on its >> parent commits. It is NEVER copied. > I don't see the word "copy" in my original. > > B-O1-O2-O3-O4-O5-O6 > \ > R1----R2-------R3 > > What's the correct generation number for R3? I would say gen(B)+3. And you would be correct if you follow the SoP algorithm. > My > reading of the posts made by some others was that they thought gen(O6) > was the correct answer. Still others seemed to indicate gen(O6)+1 was > the correct answer. Maybe the confusion comes from the different storage mechanisms being discussed. If the generation numbers are in a local cache and used by a single client, the determinism of the specific numbers doesn't much matter. If they are part of the commit, it still doesn't need to be completely deterministic. 
However, interoperability requires standards, and standards favor
determinism, so dogmatic determinism may triumph in that case.

1. gen(O6) might make sense if you mean to implement --date-order using
gen-numbers, for example. But I don't think it's practical in any case.

2. gen(O6)+1 might make sense if you mean to require that gen-numbers
are unique per repo. But this is both unsupportable and unnecessary, so
it's a non-starter.

3. gen(B)+1 is what you'd get from the algorithm I saw proposed.

All three of these are provably correct by my definition of "correct":
"for each A in ancestors_of(B), gen(A) < gen(B)".

However, [1] and [2] have some extra features of dubious value. Simpler
is better for interoperability, so I like [3] for this purpose.

Even [3] has an extra feature I think is unnecessary: determinism. If
that "requirement" is dropped, I think all three of these algorithms are
(functionally) roughly equivalent.

> I don't think everybody MEANT to be saying such
> different things--that's just how they appeared on this end.
>
> Now, did you mean something different by "commit number?"

I remain unconvinced that there is value in gen-number distribution, so
to my mind, the specific algorithm and whether or not it is
deterministic are unimportant.

Phil
~ who wasn't really being asked, but felt like answering
^ permalink raw reply [flat|nested] 89+ messages in thread
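The definition of "correct" quoted above ("for each A in ancestors_of(B), gen(A) < gen(B)") can be checked mechanically. It is enough to compare each commit with its direct parents, since the inequality then holds for all ancestors by transitivity. The sketch below applies that check to Drew's B/O/R history with the three candidate values for R3 discussed in the thread, assuming gen(B) = 1 and R1 = 2, R2 = 3. The helper name `is_valid_numbering` and the "stale copy" case are illustrative assumptions.

```python
def is_valid_numbering(parents, gen):
    """True if every commit is numbered strictly above each of its parents."""
    return all(gen[p] < gen[c] for c, ps in parents.items() for p in ps)


# Drew's history: B-O1-...-O6 on one line, B-R1-R2-R3 on the other.
chain = ["B", "O1", "O2", "O3", "O4", "O5", "O6"]
parents = {"B": [], "R1": ["B"], "R2": ["R1"], "R3": ["R2"]}
for child, parent in zip(chain[1:], chain):
    parents[child] = [parent]

base = {name: i + 1 for i, name in enumerate(chain)}   # gen(B)=1 ... gen(O6)=7
base.update({"R1": 2, "R2": 3})

candidates = {"gen(B)+3": 4, "gen(O6)": 7, "gen(O6)+1": 8, "stale copy": 2}
results = {label: is_valid_numbering(parents, dict(base, R3=r3))
           for label, r3 in candidates.items()}
for label, ok in results.items():
    print(label, ok)
```

All three proposals satisfy the ordering property; only a value at or below gen(R2), such as a number copied from an unrelated position, fails it.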
* Re: Git commit generation numbers 2011-07-21 16:24 ` Phil Hord @ 2011-07-21 22:40 ` Pēteris Kļaviņš 2011-07-22 9:30 ` Christian Couder 0 siblings, 1 reply; 89+ messages in thread From: Pēteris Kļaviņš @ 2011-07-21 22:40 UTC (permalink / raw) To: git On 21/07/2011 5:24 PM, Phil Hord wrote: > Maybe the confusion comes from the different storage mechanisms being > discussed. If the generation numbers are in a local cache and used by a > single client, the determinism of the specific numbers doesn't much > matter. If they are part of the commit, it still doesn't need to be > completely deterministic. However, interoperability requires standards, > and standards favor determinism, so dogmatic determinism may triumph in > that case. > > 1. gen(06) might make sense if you mean to implement --date-order using > gen-numbers, for example. But I don't think it's practical in any case. > > 2. gen(06)+1 might make sense if you mean to require that gen-numbers > are unique per repo. But this is both unsupportable and unnecessary, so > it's a non-starter. > > 3. gen(B)+1 is what you'd get from the the algorithm I saw proposed. > > All three of these are provably correct by my definition of "correct": > "for each A in ancestors_of(B), gen(A) < gen(B)". > > However, [1] and [2] have some extra features of dubious value. Simpler > is better for interoperability, so I like [3] for this purpose. > > Even [3] has an extra feature I think is unnecessary: determinism. If > that "requirement" is dropped, I think all three of these algorithms are > (functionally) roughly equivalent. > >> I don't think everybody MEANT to be saying such >> different things--that's just how they appeared on this end. >> >> Now, did you mean something different by "commit number?" > > I remain unconvinced that there is value in gen-number distribution, so > to my mind, the specific algorithm and whether or not it is > deterministic are unimportant. 
> The beauty of Git is that no two copies of a Git repository as a whole are the same: some people make shallow copies; others prune away all branches except for the one they are interested in; yet others graft together multiple original repositories. The upshot is that two copies of the same repository may end up having different commits as their root commits, and so the generation numbers computed for their repositories would be different. Indeed, the shallow repository copy could later be filled out with additional underlying commits, and so on. Given this context, I can't see the value in fixing generation numbers within commits. In my mind generation numbers are extremely useful transient helper objects in every Git repository but they have no meaning outside that repository, sort of like GIT_WORK_TREE. Peter ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-21 22:40 ` Pēteris Kļaviņš @ 2011-07-22 9:30 ` Christian Couder 0 siblings, 0 replies; 89+ messages in thread From: Christian Couder @ 2011-07-22 9:30 UTC (permalink / raw) To: Pēteris Kļaviņš; +Cc: git On Fri, Jul 22, 2011 at 12:40 AM, Pēteris Kļaviņš <klavins@netspace.net.au> wrote: > > The beauty of Git is that no two copies of a Git repository as a whole are > the same: some people make shallow copies; others prune away all branches > except for the one they are interested in; yet others graft together > multiple original repositories. The upshot is that two copies of the same > repository may end up having different commits as their root commits, and so > the generation numbers computed for their repositories would be different. > Indeed, the shallow repository copy could later be filled out with > additional underlying commits, and so on. Not only people want different repos, but with their own repo they want different "views" (or "virtual graph") of it. > Given this context, I can't see the value in fixing generation numbers > within commits. In my mind generation numbers are extremely useful > transient helper objects in every Git repository but they have no meaning > outside that repository, sort of like GIT_WORK_TREE. It's not even per repository that they have a meaning, it's per "view" of the commit graph. Thanks, Christian. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-21 15:57 ` Drew Northup
2011-07-21 16:24 ` Phil Hord
@ 2011-07-21 17:36 ` George Spelvin
1 sibling, 0 replies; 89+ messages in thread
From: George Spelvin @ 2011-07-21 17:36 UTC (permalink / raw)
To: drew.northup, linux; +Cc: anthonyvdgent, david, git, nico, torvalds

Drew Northup wrote:
> On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:
>> I have not read yet one discussion about how generation numbers [baked
>> into a commit] deal with rebasing, for instance. Do we assign one more
>> than the revision prior to the base of the rebase operation or do we
>> start with the revision one after the highest of those original commits
>> included in the rebase? Depending on how that is done
>> _drastically_different_ numbers can come out of different repository
>> instances for the same _final_ DAG. This is one major reason why, as I
>> see it, local storage is good for generation numbers and putting them in
>> the commit is bad.
>
> Er, no. Whenever a new commit object is generated (as the result
> of a rebase or not), its commit number is computed based on its
> parent commits. It is NEVER copied.
> I don't see the word "copy" in my original.

Indeed, you didn't use it; it was my simplified mental model of your
suggestion that the rebased commits would have generation numbers that
somehow depended on the generation numbers before rebasing.

Although you suggested something different, the mistake is the same:
the rebased commits' generation numbers have simply no relationship to
those of the original pre-rebase commits. The generation numbers depend
only on the commits explicitly listed as parents in the commit objects.

That's why I went on to explain that the equivalence of the commits
produced by a rebase operation is a higher-level concept; the core git
object database just knows that they aren't identical, and therefore
are different.
Thus, they would retain the same relative order as before the rebase
(unless you permuted them with rebase -i), but start with the generation
number of the rebase target.

> B-O1-O2-O3-O4-O5-O6
>  \
>   R1----R2-------R3
> What's the correct generation number for R3? I would say gen(B)+3. My
> reading of the posts made by some others was that they thought gen(O6)
> was the correct answer. Still others seemed to indicate gen(O6)+1 was
> the correct answer. I don't think everybody MEANT to be saying such
> different things--that's just how they appeared on this end.

According to the canonical algorithm, it's gen(B)+3 = gen(R2)+1.

However, any non-decreasing series is equally permissible for optimizing
history walking, so you could add jumps to (for example) make the numbers
unique if that simplified anything. I don't think it does simplify
anything, so the issue hasn't been discussed much.

For the purpose of the optimization enabled by the generation numbers,
however, it doesn't actually matter. What matters is that if I am
listing commits down multiple branches, once I have walked back on each
branch to commits of generation N or less, I know that I have found all
possible descendants of all commits of generation N or more.

This lets me display the recent part of the commit DAG (back to
generation N) without exploring the entire commit tree or worrying that
I'll have to "back up" to insert a commit in its proper order.

Without precomputed generation numbers, the only way to be sure of this
is to explore back to generation 0 (parentless commits) or to use
date-based heuristics.

> Now, did you mean something different by "commit number?"

No, just a brain fart I didn't catch before posting. I meant "generation
number".

^ permalink raw reply [flat|nested] 89+ messages in thread
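The cutoff described in the message above can be illustrated with a small
ancestry query. This is a sketch under stated assumptions, not how git
implements its walks: the history is a made-up linear chain c0..c999 with
a two-commit side branch s1, s2 forked from c990, the generation numbers
are filled in by hand following the "one more than the parent" rule, and
is_ancestor is a hypothetical helper name. It relies on the strict
ordering gen(ancestor) < gen(descendant), so any commit whose generation
is at or below the target's cannot lead to the target.

```python
from typing import Dict, List, Tuple


def build_history() -> Tuple[Dict[str, List[str]], Dict[str, int]]:
    """Build a toy history: a 1000-commit chain plus a short side branch."""
    parents: Dict[str, List[str]] = {"c0": []}
    gen: Dict[str, int] = {"c0": 0}
    for i in range(1, 1000):
        parents[f"c{i}"] = [f"c{i - 1}"]
        gen[f"c{i}"] = i
    parents["s1"] = ["c990"]
    gen["s1"] = 991
    parents["s2"] = ["s1"]
    gen["s2"] = 992
    return parents, gen


def is_ancestor(parents: Dict[str, List[str]], gen: Dict[str, int],
                old: str, new: str, use_cutoff: bool = True) -> Tuple[bool, int]:
    """Answer "is `old` reachable from `new`?" and report commits visited.

    For simplicity a commit counts as reachable from itself.
    """
    seen = set()
    todo = [new]
    while todo:
        c = todo.pop()
        if c in seen:
            continue
        seen.add(c)
        if c == old:
            return True, len(seen)
        if use_cutoff and gen[c] <= gen[old]:
            # Every ancestor of c has a smaller generation than c, so
            # `old` cannot be found below this point.
            continue
        todo.extend(parents[c])
    return False, len(seen)


parents, gen = build_history()
print(is_ancestor(parents, gen, "s2", "c999"))                    # with cutoff
print(is_ancestor(parents, gen, "s2", "c999", use_cutoff=False))  # walks to the root
```

Both calls give the same negative answer; the difference is that the
cutoff version stops after a handful of commits, while the other one has
to walk all the way back to the parentless commit before it can be sure.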
* Git commit generation numbers
@ 2011-07-14 18:24 Linus Torvalds
2011-07-14 18:37 ` Jeff King
0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 18:24 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano, Jeff King

Ok, so I see that the old discussion about generation numbers has
resurfaced.

And I have to say, with six years of git use, I think it's not a
coincidence that the notion of generation numbers has come up several
times over the years: I think the lack of them is literally the only
real design mistake we have.

And I absolutely *detest* the generation number cache thing I see on
the list. Maybe I missed the discussion that actually added them to the
commits (I don't read the git mailing list regularly any more) but I
think it's a mistake to add an external cache to work around the fact
that I didn't add the generation numbers originally.

So I think we should just add the generation numbers now. We can make
the rule be that if a commit doesn't have a generation number, we end
up having to compute it (with no real need for caching). Yes, it's
expensive. But it's going to be a *lot* less expensive over time as
people start using a git version that adds the generation numbers to
commits.

And we can easily mix this - there are no "flag-day" issues. Old
versions of git will ignore the generation number and generate new
commits that don't have it. New versions of git will generate them, and
use them. And once the project starts having generation numbers in some
commits, the "generating them" part will get cheaper over time.

I'll send out a patch that admittedly does not have much testing as a
reply to this one. It ends up being really simple. Of course, maybe
it's simple because I did something incredibly stupid, but please take
a look.

Linus

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 18:24 Linus Torvalds @ 2011-07-14 18:37 ` Jeff King 2011-07-14 18:47 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 89+ messages in thread From: Jeff King @ 2011-07-14 18:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 11:24:27AM -0700, Linus Torvalds wrote: > And I have to say, with six years of git use, I think it's not a > coincidence that the notion of generation numbers has come up several > times over the years: I think the lack of them is literally the only > real design mistake we have. Agreed. > And I absolutely *detest* the generation number cache thing I see on > the list. I'd love to have in-commit generation numbers. I'm just not sure we can get the speeds we want without caching them for existing commits. > Maybe I missed the discussion that actually added them to the commits > (I don't read the git mailing list regularly any more) but I think > it's a mistake to add an external cache to work around the fact that I > didn't add the generation numbers originally. > > So I think we should just add the generation numbers now. We can make > the rule be that if a commit doesn't have a generation number, we end > up having to compute it (with no real need for caching). Yes, it's > expensive. But it's going to be a *lot* less expensive over time as > people start using a git version that adds the generation numbers to > commits. I'm not sure that is the best plan. Calculating generation numbers involves going to all roots. So once you have to find any generation number, it's going to be expensive, no matter how many recent commits have generation numbers already in them (but it won't get _more_ expensive as more commits are added; you'll always be traversing from the commit in question down to the roots). As we add new commits with generation numbers, we won't need to do a calculation to get their numbers. 
But if you are doing something like "tag --contains", you are going to want to know the generation number of old tags (otherwise, you can't know whether your cutoff might hit them or not). IOW, even if we add generation numbers _today_, every "tag --contains" in linux-2.6 is going to end up traversing from v3.0-rc7 down to the roots to get its generation number (v3.0-rc8 would get an embedded generation, of course). So if you aren't going to cache generation numbers, then you might as well write your traversal algorithm to assume you don't know them for old commits. Because calculating them needs to touch every ancestor, and that's probably equivalent to the worst-case for your algorithm. There's also one other issue with generation numbers. How do you handle grafts and object-replacement refs? If you graft history, your embedded generation numbers will all be junk, and you can't trust them. -Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-14 18:37 ` Jeff King
@ 2011-07-14 18:47 ` Linus Torvalds
2011-07-14 18:55 ` Linus Torvalds
2011-07-14 19:08 ` Jeff King
2011-07-14 18:52 ` Linus Torvalds
2011-07-14 20:26 ` Junio C Hamano
2 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 18:47 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote:
>
> I'd love to have in-commit generation numbers. I'm just not sure we can
> get the speeds we want without caching them for existing commits.

So my argument would be that we'd simply be much better off fixing the
fundamental data structure (which we can), and let it become the
long-term solution.

Now, it *may* turn out that we'd want to have some cache for
generation numbers in commits that don't have them, but I absolutely
think that that should be an "add-on" rather than anything fundamental.
For example, if we just merge the "add generation numbers to the
commit object" logic first, then the "cache" case never really needs
to care about us generating new commits. They simply won't need the
cache.

Also, I suspect that the cache could easily be done as a *small* and
*incomplete* cache, ie you don't need to cache all commits, it would
be sufficient to cache a few hundred spread-out commits, and just know
that "from any commit, the cached commit will be quickly reachable".

> I'm not sure that is the best plan. Calculating generation numbers
> involves going to all roots. So once you have to find any generation
> number, it's going to be expensive, no matter how many recent commits
> have generation numbers already in them (but it won't get _more_
> expensive as more commits are added; you'll always be traversing from
> the commit in question down to the roots).

It only ends up being expensive if the commit has parents that don't
have generation numbers.

That's a fairly short-term problem.
For the kernel, for example,
basically no development happens on a base that is older than one or
two releases. So if I (and Greg, with the stable tree) start using my
patch, within a couple of weeks, pretty much all development would
have a generation number in its history.

Sure, sometimes I'd merge from people who based their tree on something
old, and I'd end up calculating it all. But it would get progressively
rarer.

> As we add new commits with generation numbers, we won't need to do a
> calculation to get their numbers. But if you are doing something like
> "tag --contains", you are going to want to know the generation number of
> old tags (otherwise, you can't know whether your cutoff might hit them
> or not). IOW, even if we add generation numbers _today_, every "tag
> --contains" in linux-2.6 is going to end up traversing from v3.0-rc7
> down to the roots to get its generation number (v3.0-rc8 would get an
> embedded generation, of course).

So that could easily be handled by caching. In fact, I suspect that
you could make the cache not be associated with a commit ID, but be
associated with the tags and heads. But again, then the cache would be
a "secondary" issue, not something fundamental.

> So if you aren't going to cache generation numbers, then you might as
> well write your traversal algorithm to assume you don't know them for
> old commits.

But that's how our algorithms are *already* written. So why not have
that as the fallback? You get the advantage of generation numbers only
with modern things, but those are the ones you actually tend to use.
Merge bases are *very* seldom historical, for example.

Linus

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 18:47 ` Linus Torvalds @ 2011-07-14 18:55 ` Linus Torvalds 2011-07-14 19:12 ` Jeff King 2011-07-14 19:46 ` Ted Ts'o 2011-07-14 19:08 ` Jeff King 1 sibling, 2 replies; 89+ messages in thread From: Linus Torvalds @ 2011-07-14 18:55 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 11:47 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Also, I suspect that the cache could easily be done as a *small* and > *incomplete* cache, ie you don't need to cache all commits, it would > be sufficient to cache a few hundred spread-out commits, and just know > that "from any commit, the cached commit will be quickly reachable". Put another way: we could do the cache not as a real dynamic entity, but as something that gets generated at "git clone" time or when re-packing. I'm actually much more nervous about a cache being inconsistent than I would be about having generation numbers in the tree. The latter we can (and should - but my patch didn't) add a fsck test for, and then you would never get into some situation where there's some really subtle issue with merge base calculation due to a corrupt cache. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-14 18:55 ` Linus Torvalds
@ 2011-07-14 19:12 ` Jeff King
2011-07-14 19:46 ` Ted Ts'o
1 sibling, 0 replies; 89+ messages in thread
From: Jeff King @ 2011-07-14 19:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:55:39AM -0700, Linus Torvalds wrote:
> I'm actually much more nervous about a cache being inconsistent than I
> would be about having generation numbers in the tree. The latter we
> can (and should - but my patch didn't) add a fsck test for, and then
> you would never get into some situation where there's some really
> subtle issue with merge base calculation due to a corrupt cache.

Interesting. I'm nervous about that, too, which is why I _favor_ the
cache. Because we calculate the cache ourselves, we know it's accurate
according to the parent pointers. If we find a bug, we fix it and bump
the cache version, which forces it to regenerate.

Contrast that with a bogus generation number that makes its way into an
actual commit object. That's there for eternity, just like the commit
timestamp skew we already have. I find it much less likely to happen
than skew in the commit timestamp, if only because generations are a
dirt-simple concept. But it is a case where there is duplicated
information in the actual DAG, and if that information doesn't match up
we are screwed.

-Peff

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 18:55 ` Linus Torvalds 2011-07-14 19:12 ` Jeff King @ 2011-07-14 19:46 ` Ted Ts'o 2011-07-14 19:51 ` Linus Torvalds 1 sibling, 1 reply; 89+ messages in thread From: Ted Ts'o @ 2011-07-14 19:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 11:55:39AM -0700, Linus Torvalds wrote: > On Thu, Jul 14, 2011 at 11:47 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Also, I suspect that the cache could easily be done as a *small* and > > *incomplete* cache, ie you don't need to cache all commits, it would > > be sufficient to cache a few hundred spread-out commits, and just know > > that "from any commit, the cached commit will be quickly reachable". > > Put another way: we could do the cache not as a real dynamic entity, > but as something that gets generated at "git clone" time or when > re-packing. Would it be considered evil if we put the generation number in the pack, but not consider it part of the formal object (i.e., it would be just a cache, but one that wouldn't change once the pack was created)? - Ted ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 19:46 ` Ted Ts'o @ 2011-07-14 19:51 ` Linus Torvalds 2011-07-14 20:07 ` Jeff King 2011-07-14 20:08 ` Ted Ts'o 0 siblings, 2 replies; 89+ messages in thread From: Linus Torvalds @ 2011-07-14 19:51 UTC (permalink / raw) To: Ted Ts'o; +Cc: Jeff King, Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 12:46 PM, Ted Ts'o <tytso@mit.edu> wrote: > > Would it be considered evil if we put the generation number in the > pack, but not consider it part of the formal object (i.e., it would be > just a cache, but one that wouldn't change once the pack was created)? That would actually be a major change to data structures, and would require some serious surgery and be hard to support in a backwards-compatible way (think different git versions accessing the same repository). Much bigger patch than the one I did. So it sounds like it would work - and it would probably be a simple matter of just incrementing the pack version number if you just say "cannot access the pack with old versions" - but I think it's a really fragile approach. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 19:51 ` Linus Torvalds @ 2011-07-14 20:07 ` Jeff King 2011-07-14 20:08 ` Ted Ts'o 1 sibling, 0 replies; 89+ messages in thread From: Jeff King @ 2011-07-14 20:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: Ted Ts'o, Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 12:51:39PM -0700, Linus Torvalds wrote: > On Thu, Jul 14, 2011 at 12:46 PM, Ted Ts'o <tytso@mit.edu> wrote: > > > > Would it be considered evil if we put the generation number in the > > pack, but not consider it part of the formal object (i.e., it would be > > just a cache, but one that wouldn't change once the pack was created)? > > That would actually be a major change to data structures, and would > require some serious surgery and be hard to support in a > backwards-compatible way (think different git versions accessing the > same repository). If we put it in the index, but not the pack, then it wouldn't be any more painful than pack index v2. I don't recall there being huge fallout from that; we just gave a reasonable deprecation period before switching it on as the default. I'm not sure it is much less crappy than having the cache in a separate file. It does take less space, since the pack index already contains all of the sha1s. But if we don't like the on-the-fly writing of what was in my series, it would not be hard to generate the same cache during pack-index time. Not having it in a separate file makes it hard to invalidate the cache when the graph changes (due to grafts or replace refs). But maybe we don't care about that. Or maybe it's OK to tell the user to manually rebuild the pack index if they tweak those features. -Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 19:51 ` Linus Torvalds 2011-07-14 20:07 ` Jeff King @ 2011-07-14 20:08 ` Ted Ts'o 1 sibling, 0 replies; 89+ messages in thread From: Ted Ts'o @ 2011-07-14 20:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 12:51:39PM -0700, Linus Torvalds wrote: > > So it sounds like it would work - and it would probably be a simple > matter of just incrementing the pack version number if you just say > "cannot access the pack with old versions" - but I think it's a really > fragile approach. So if we ever change the pack format again, it's something to think about adding, but probably not worth it on its own... What if we simply have a cache file per pack, which again is generated when the pack is first received or generated, but is otherwise not dynamic? It's an extra file which is icky, but it would keep things simpler. - Ted ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 18:47 ` Linus Torvalds 2011-07-14 18:55 ` Linus Torvalds @ 2011-07-14 19:08 ` Jeff King 2011-07-14 19:23 ` Linus Torvalds 1 sibling, 1 reply; 89+ messages in thread From: Jeff King @ 2011-07-14 19:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 11:47:45AM -0700, Linus Torvalds wrote: > On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote: > > > > I'd love to have in-commit generation numbers. I'm just not sure we can > > get the speeds we want without caching them for existing commits. > > So my argument would be that we'd simply be much better off fixing the > fundamental data structure (which we can), and let it become the > long-term solution. > > Now, if *may* turn out that we'd want to have some cache for > generation numbers in commits that don't have them, but I absolutely > think that that should be a "add-on" rather than anything fundamental. > For example, if we just merge the "add generation numbers to the > commit object" logic first, then the "cache" case never really needs > to care about us generating new commits. They simply won't need the > cache. Sure, I'd be fine with that (modulo the graft issue, which you don't seem to care about). I half-toyed with making an extra "add generation numbers to commit header" on top of my series, but I wanted to first prove that generation numbers actually could yield speedups. > Also, I suspect that the cache could easily be done as a *small* and > *incomplete* cache, ie you don't need to cache all commits, it would > be sufficient to cache a few hundred spread-out commits, and just know > that "from any commit, the cached commit will be quickly reachable". Yeah, that would work. Is it worth the trouble? Your cache size is still O(n). And you still have the complexity of _having_ a cache. Yes, the size is 1/100th of what it was (dropping from 6M to 600K on linux-2.6). 
But you're also going to spend more time calculating. I think you'd have to measure to see how it performs in practice. > It only ends up being expensive if the commit has parents that don't > have generation numbers. > > That's a fairly short-term problem. For the kernel, for example, > basically no development happens on a base that is older than one or > two releases. So if I (and Greg, with the stable tree) start using my > patch, within a couple of weeks, pretty much all development would > have a generation number in its history. Sure, that makes generation during commit-time cheaper, and eventually the cost just goes away. I'm more concerned that it won't actually speed up algorithms where you look at old commits, which was the whole point in the first place. > > As we add new commits with generation numbers, we won't need to do a > > calculation to get their numbers. But if you are doing something like > > "tag --contains", you are going to want to know the generation number of > > old tags (otherwise, you can't know whether your cutoff might hit them > > or not). IOW, even if we add generation numbers _today_, every "tag > > --contains" in linux-2.6 is going to end up traversing from v3.0-rc7 > > down to the roots to get its generation number (v3.0-rc8 would get an > > embedded generation, of course). > > So that could easily be handled by caching. In fact, I suspect that > you could make the cache no associate with a commit ID, but be > associated with the tags and heads. But again, then the cache would be > a "secondary" issue, not something fundamental. Yeah, you could do that. And it would handle "tag --contains" and "branch --contains" (the latter doesn't even really need a cache; as the branch tips move, they will get new commits with generation numbers). I suspect we could get faster topo-sorting and possibly faster merge-base calculation out of generation numbers, too. 
But that won't happen if we only have generation numbers for a handful
of specific commits.

> > So if you aren't going to cache generation numbers, then you might as
> > well write your traversal algorithm to assume you don't know them for
> > old commits.
>
> But that's how our algorithms are *already* written.

Sort of. We tend to rely on commit timestamps as a proxy for generation
numbers. But in the face of clock skew, git will give wrong answers
(e.g., Ted posted some examples of name-rev giving wrong answers near
some skew in linux-2.6).

If we aren't going to go whole-hog on generation numbers, I'm much more
tempted to simply keep using commit timestamps. It's easy to build a
cache of commits with bogus timestamps (which I've already posted a
patch for) if you want better accuracy at the cost of more complexity.
And as time progresses, you tend to ask about commits near the skewed
ones less often (and hopefully lessons learned from seeing how the skew
occurred will help us prevent them from recurring in new commits).

-Peff

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-14 19:08 ` Jeff King @ 2011-07-14 19:23 ` Linus Torvalds 2011-07-14 20:01 ` Jeff King 0 siblings, 1 reply; 89+ messages in thread From: Linus Torvalds @ 2011-07-14 19:23 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List, Junio C Hamano On Thu, Jul 14, 2011 at 12:08 PM, Jeff King <peff@peff.net> wrote: > > If we aren't going to go whole-hog on generation numbers, I'm much more > tempted to simply keep using commit timestamps. Sure. I think it's entirely reasonable to say that the issue basically boils down to one git question: "can commit X be an ancestor of commit Y" (as a way to basically limit certain algorithms from having to walk all the way down). We've used commit dates for it, and realistically it really has worked very well. But it was always a broken heuristic. So yes, I personally see generation counters as a way to do the commit date comparisons right. And it would be perfectly fine to just say "if there are no generation numbers, we'll use the datestamps instead, and know that they could be incorrect". That "use the datestamps" fallback thing may well involve all the heuristics we already do (ie check for the stamps looking sane, and not trusting just one individual one). Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
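The fallback described in the message above (trust generation numbers when
both commits carry one, otherwise fall back to commit dates and accept
that they may be wrong) might look roughly like the sketch below. Nothing
here is taken from the actual patch: the Commit record, the
may_be_ancestor name, and the one-day SLOP_SECONDS tolerance are all
assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tolerance for clock skew when only dates are available.
SLOP_SECONDS = 24 * 60 * 60


@dataclass
class Commit:
    name: str
    timestamp: int                    # committer date, seconds since the epoch
    generation: Optional[int] = None  # None: written by a git without the field


def may_be_ancestor(x: Commit, y: Commit) -> bool:
    """Cheap pre-check for "can commit x be an ancestor of commit y?".

    With generation numbers on both sides a False answer is exact.
    With dates only, it is the usual heuristic and can be fooled by skew.
    """
    if x.generation is not None and y.generation is not None:
        return x.generation < y.generation
    return x.timestamp <= y.timestamp + SLOP_SECONDS


old = Commit("old", timestamp=2_000, generation=5)
new = Commit("new", timestamp=1_000, generation=7)  # clock ran backwards
print(may_be_ancestor(old, new))  # generations decide, the bad date is ignored
print(may_be_ancestor(new, old))
```

A True result only means the walk has to continue; it is the False
result that lets an algorithm stop early, which is why a wrong date
(rather than a slow walk) is the failure mode being worried about.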
* Re: Git commit generation numbers
2011-07-14 19:23 ` Linus Torvalds
@ 2011-07-14 20:01 ` Jeff King
2011-07-14 20:19 ` Linus Torvalds
0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-14 20:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 12:23:31PM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 12:08 PM, Jeff King <peff@peff.net> wrote:
> >
> > If we aren't going to go whole-hog on generation numbers, I'm much more
> > tempted to simply keep using commit timestamps.
>
> Sure. I think it's entirely reasonable to say that the issue basically
> boils down to one git question: "can commit X be an ancestor of commit
> Y" (as a way to basically limit certain algorithms from having to walk
> all the way down). We've used commit dates for it, and realistically
> it really has worked very well. But it was always a broken heuristic.

Yeah, I agree with that.

> So yes, I personally see generation counters as a way to do the commit
> date comparisons right. And it would be perfectly fine to just say "if
> there are no generation numbers, we'll use the datestamps instead, and
> know that they could be incorrect".

In that case, is it really worth adding generation numbers to the
cache? Because they _can_ be wrong, too. I suspect they will be wrong
less often than commit timestamps, if only because they're dirt simple
to calculate. But all it takes is some crappy porcelain doing:

  git cat-file commit $foo |
  munge_the_parents |
  git hash-object -t commit --stdin -w

to give us a bogus object. Sure, we can catch it via fsck. But we could
also catch commit timestamp skew via fsck just as easily.

> That "use the datestamps" fallback thing may well involve all the
> heuristics we already do (ie check for the stamps looking sane, and
> not trusting just one individual one).

Those aren't foolproof, of course. I asked people a few months ago to
run my skew-detection program on various repos, and some repos have
long runs of skew (think somebody with a bad clock or a bogus program
doing a whole series).

But they're fast and work OK in practice. We should apply them more
consistently (name-rev, for example, will tolerate a day of skew, but
will not look past a single commit). And if people really want to be
thorough, we can mark the skewed commits in a cache during "git gc" for
them (or they can just say "for this traversal, I want to be thorough;
turn off timestamp cutoffs").

Out of curiosity, what don't you like about the generation cache? The
idea of using external storage? Generating it on the fly? The
particular implementation is too slow or crappy?

-Peff
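The kind of timestamp-skew check mentioned in the message above could look roughly like the sketch below. This is not the skew-detection program referred to there (its details are not given in the thread); the `commits` mapping and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def find_skewed(commits: Dict[str, Tuple[int, List[str]]]) -> List[str]:
    """Return ids of commits dated earlier than at least one parent.

    `commits` maps a commit id to (committer timestamp, parent ids).
    """
    skewed = []
    for cid, (date, parents) in commits.items():
        # A commit is created after its parents, so an earlier date
        # indicates a bad clock or a tool that rewrote the dates.
        if any(commits[p][0] > date for p in parents if p in commits):
            skewed.append(cid)
    return sorted(skewed)


history = {
    "a": (100, []),
    "b": (200, ["a"]),
    "c": (150, ["b"]),       # dated before its parent "b"
    "d": (300, ["b", "c"]),
}
print(find_skewed(history))
```

Commits flagged this way are the ones a date-based cutoff can get wrong, which is why marking them in a cache or reporting them from fsck comes up in the discussion.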
* Re: Git commit generation numbers
2011-07-14 20:01 ` Jeff King
@ 2011-07-14 20:19 ` Linus Torvalds
2011-07-14 20:31 ` Jeff King
0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 20:19 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

Jeff King <peff@peff.net> wrote:
>
>Out of curiosity, what don't you like about the generation cache?

The thing I hate about it is very fundamental: I think it's a hack
around a basic git design mistake. And it's a mistake we have known
about for a long time.

Now, I don't think it's a *fatal* mistake, but I do find it very
broken to basically say "we made a mistake in the original commit
design, and instead of fixing it we create a separate workaround for
it".

THAT I find distasteful. My reaction is that if we're going to add
generation numbers, then we should just do it the way we should have
done them originally, rather than as some separate hack.

See? That's why I wouldn't have any problem with adding a separate
cache on top of it, if it's really required, but I would hope that it
isn't really needed.

So a cache in itself is not necessarily wrong. But leaving the
original design mistake in place IS.

And fixing it really ended up being a very tiny patch, no?

Linus
* Re: Git commit generation numbers
2011-07-14 20:19 ` Linus Torvalds
@ 2011-07-14 20:31 ` Jeff King
2011-07-15 1:19 ` Linus Torvalds
0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-14 20:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 01:19:51PM -0700, Linus Torvalds wrote:

> >Out of curiosity, what don't you like about the generation cache?
>
> The thing I hate about it is very fundamental: I think it's a hack
> around a basic git design mistake. And it's a mistake we have known
> about for a long time.
>
> Now, I don't think it's a *fatal* mistake, but I do find it very
> broken to basically say "we made a mistake in the original commit
> design, and instead of fixing it we create a separate workaround for
> it".
>
> THAT I find distasteful. My reaction is that if we're going to add
> generation numbers, then we should just do it the way we should have
> done them originally, rather than as some separate hack.
>
> See? That's why I wouldn't have any problem with adding a separate
> cache on top of it, if it's really required, but I would hope that it
> isn't really needed.
>
> So a cache in itself is not necessarily wrong. But leaving the
> original design mistake in place IS.

Thanks, that makes some sense to me.

However, I'm not 100% convinced leaving generation numbers out was a
mistake. The git philosophy seems always to have been to keep the
minimal required information in the DAG. And I think that has served us
well, because we're not saddled with cruft that seemed like a good idea
early on, but isn't.

Generation numbers are _completely_ redundant with the actual structure
of history represented by the parent pointers. Having them in there is
not about giving git more information that it doesn't have, but about
being a cheap place to stuff a value that is a little expensive to
calculate.

And so that seems a bit hack-ish to me.

I liken it somewhat to the "don't store renames" debate. We don't want
to crystallize forever in the history whatever crappy rename-detection
algorithm is done at the time of commit. We put the minimum amount of
information in the DAG, and it's the runtime's responsibility to get
the answer.

I think the decision is a little more gray with generation numbers,
because it's not about "you got this information with a wrong and
crappy algorithm" like it might be with rename detection, but rather
"we're sticking this redundant number in the commit object, and we
assume that it will always be useful enough to future algorithms to
merit being here".

> And fixing it really ended up being a very tiny patch, no?

Well, yes. But it also doesn't yield a 100-fold speedup in "git tag
--contains" for existing repositories. So it's not quite a full
solution.

-Peff
* Re: Git commit generation numbers
2011-07-14 20:31 ` Jeff King
@ 2011-07-15 1:19 ` Linus Torvalds
2011-07-15 2:41 ` Geert Bosch
` (2 more replies)
0 siblings, 3 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 1:19 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <peff@peff.net> wrote:
>
> However, I'm not 100% convinced leaving generation numbers out was a
> mistake. The git philosophy seems always to have been to keep the
> minimal required information in the DAG.

Yes.

And until I saw the patches trying to add generation numbers, I didn't
really try to push adding generation numbers to commits (although it
actually came up as early as July 2005, so the "let's use generation
numbers in commits" thing is *really* old).

In other words, I do agree that we should strive for minimal required
information.

But dammit, if you start using generation numbers, then they *are*
required information. The fact that you then hide them in some
unarchitected random file doesn't change anything! It just makes it
ugly and random, for chrissake!

I really don't understand your logic that says that the cache is
somehow cleaner. It's a random hack! It's saying "we don't have it in
the main data structure, so let's add it to some other one instead,
and now we have a consistency and cache generation problem instead".

Just look at the size of the patches in question. Your caching patches
are bigger and more complicated. Sure, part of it is that your series
adds the code to _use_ the generation number, but look purely at the
code to maintain them.

Why do you think the odd separate cache is somehow better than just
doing it right? Seriously? If we require the generation numbers, then
they have *become* that minimal information that we should save!

> And I think that has served us
> well, because we're not saddled with cruft that seemed like a good idea
> early on, but isn't.

Again - we discussed adding generation numbers about 6 years ago. We
clearly *should* have done it. Instead, we went with the hacky "let's
use commit time", that everybody really knew was technically wrong,
and was a hack, but avoided the need.

Now, six years later, you clearly are saying that we need the
generation numbers, but then you go off and try to say that they
should be in some secondary non-architected random collection of data
structures that isn't covered by the security and maintenance
guarantees that the core git objects are.

Dammit, one of the things that makes git special is that the data
structures are NOT random odd ad-hoc files. There is a design to them.

> Generation numbers are _completely_ redundant with the actual structure
> of history represented by the parent pointers.

Not true. That's only true if you add ".. if you parse the whole
history" to that statement. And we've *never* parsed the whole
history, because it's just too expensive and doesn't scale. So right
now we depend on commit dates with a few hacks.

So no, generation numbers are not at all redundant. They are
fundamental. It's why we had this discussion six years ago.

> And so that seems a bit hack-ish to me.

Um? If you feel that way, then why the hell are you pushing your EVEN
MORE HACKISH CACHE PATCHES?

That's what this really boils down to. I think that if we have a value
that we need, then it should be recorded. In the data structures. Not
in some random other location that isn't part of the real git data
structures.

We don't do caches in git, because we don't NEED to. Sure, gitk has
its hacky cache, but that's not core functionality.

I think it's a sign of good design that we can do a "find .git" and
explain every single file, and show that it's all core functionality
(again, with the exception of "gitk.cache", and I suspect that's
because gitk is a script, not because of any really fundamental data
issues), and explain it.

I think the *cache* is a hell of a lot more hacky than just doing it right.

> I liken it somewhat to the "don't store renames" debate.

That's total and utter bullshit.

Storing renames is *wrong*. I've explained a million times why it's
wrong. Doing it is a disaster. I know. I've used systems that did it.
It's crap. It's fundamentally information that is actively misleading
and WRONG. It's not even that you can do rename detection at run-time,
it's that you *HAVE* to do rename detection at run-time, because doing
it at commit time is simply utterly and fundamentally *wrong*.

Just look at "git blame -C" to remind yourself why rename information
is wrong. But even more importantly, look at git merges. Look at how
git has gotten merging right since pretty much day #1, and has
absolutely no issues with files that got generated two different ways.
Look at every SCM that tries to do rename detection, and look at how
THEY CANNOT DO MERGES RIGHT.

It's that simple. Rename detection is not about avoiding "redundant
data". It's about doing the right thing.

Linus
* Re: Git commit generation numbers
2011-07-15 1:19 ` Linus Torvalds
@ 2011-07-15 2:41 ` Geert Bosch
2011-07-15 7:46 ` Jeff King
2011-07-15 9:12 ` Jakub Narebski
2 siblings, 0 replies; 89+ messages in thread
From: Geert Bosch @ 2011-07-15 2:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Jul 14, 2011, at 21:19, Linus Torvalds wrote:

> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

Generation numbers never will be required information, because we can
always compute them. These numbers are really much more similar to
other pack index information than anything else.

<aside>
Sometimes I wish we'd have general "depth" information for each SHA1,
which would be the maximum number of steps in the DAG to reach a leaf.
This way, if we want to do something like "git log
drivers/net/slip.c", we don't have to bother reading the majority of
trees that have a depth less than two. The depth can also be used as a
limiter for "contains" operations, where we want to see if commit X
contains commit Y: depth (X) has to be at least depth (Y).

However, any such notion, whether generation or depth or whatever else
we'll think of tomorrow, is something particular to a certain
implementation of git. It does not add anything to the information we
stored.
</aside>

I don't think my commit should have a different SHA1 from yours,
because your tree has more generation numbers than mine. The beauty
and genius of GIT is that it just takes the minimum amount of data
needed to uniquely identify the information to be stored, and stores
that in a UNIQUE format. By allowing generation numbers to either be
present or absent, that's all broken.

It's like computing the SHA1 of compressed data: it doesn't depend on
the data we store, just on the particular representation we choose.
Fortunately we have done away with the first mistake.

So, if you're going to add generation numbers, there has to be a flag
day, after which generation numbers are required everywhere. Of course
it would be possible to recognize "old style" commits and convert them
on the fly, but that is true for pretty much any format change.
However, adding redundant information seems like a poor excuse for
having a flag day.

Storing generation data in pack indices on the other hand makes
perfect sense: when we generate these indices, we do complete
traversals and have all required information trivially at hand. We can
never have that many loose objects, so lack of generation information
there isn't a big deal. By storing generation information in the
index, we can be sure it is consistent with the data contained in the
pack, so there are no cache invalidation issues.

I know I must have missed some stupid and obvious reason why this is
all wrong, I just don't quite see it yet.

-Geert
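To make "we can always compute them" concrete, here is a small sketch that derives generation numbers from parent pointers alone. It assumes the convention that a root commit has generation 1 and every other commit has one more than the largest generation among its parents; that convention, the `parents` mapping and the function name are illustrative assumptions, not something the thread fixes.

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Compute a generation number for every commit in `parents`.

    `parents` maps each commit id to its parent ids; every parent must
    also appear as a key. An explicit stack is used instead of recursion
    so that long histories do not hit the recursion limit.
    """
    gen: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            c = stack[-1]
            if c in gen:
                stack.pop()
                continue
            pending = [p for p in parents[c] if p not in gen]
            if pending:
                # Parents must be numbered before the commit itself.
                stack.extend(pending)
            else:
                gen[c] = 1 + max((gen[p] for p in parents[c]), default=0)
                stack.pop()
    return gen


dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}
print(generation_numbers(dag))
```

Each commit is numbered once, so the cost is a single full traversal. That is the same traversal the message above says is already done when a pack index is generated.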
* Re: Git commit generation numbers
2011-07-15 1:19 ` Linus Torvalds
2011-07-15 2:41 ` Geert Bosch
@ 2011-07-15 7:46 ` Jeff King
2011-07-15 16:10 ` Linus Torvalds
2011-07-15 9:12 ` Jakub Narebski
2 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-15 7:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 06:19:30PM -0700, Linus Torvalds wrote:

> Yes.
>
> And until I saw the patches trying to add generation numbers, I didn't
> really try to push adding generation numbers to commits (although it
> actually came up as early as July 2005, so the "let's use generation
> numbers in commits" thing is *really* old).
>
> In other words, I do agree that we should strive for minimal required
> information.
>
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

So you don't see a difference between storing the information directly
in the commit object, where it affects the sha1 of the commit, and
calculating and storing it somewhere else? That is what seems ungit to
me. You aren't adding new information to the DAG (note I said "DAG" and
not commit) that is not already there, but you are changing the ids of
commits in the DAG.

I'm not saying that's a reason to ultimately reject the idea of putting
generation numbers in commit objects. But it is a reason to give us
pause and figure out if there are other solutions, because it will be
the first time such redundant information has been added. And that's
what I've been trying to do during this discussion with you: work out
what the options are and evaluate them.

> I really don't understand your logic that says that the cache is
> somehow cleaner. It's a random hack! It's saying "we don't have it in
> the main data structure, so let's add it to some other one instead,
> and now we have a consistency and cache generation problem instead".

Are packfiles unclean, or a random hack? How about pack indices? What
about Nico's and Shawn's ideas for a packv4 that would gain efficiency
by storing objects not in their whole format, but in a way that would
make tree examination faster (but would be able to restore the whole
objects byte for byte)?

Those things rely on the idea that the git DAG is a data model that we
present to the user, but that we're allowed to do things behind the
scenes to make things faster. We're allowed to make an index of offsets
of objects in the packfile for faster lookup. Why are we not allowed to
use an index for other object data if it will speed up our local
algorithms?

Again, I'm not saying that the patches I posted are necessarily the
answer. Maybe my cache implementation sucks. Maybe the value should go
into a pack index instead. Maybe the whole idea is stupid. But I don't
think it's worth rejecting out-of-hand the idea that the generation
number might be stored outside of the commit object. I do think it's
worth talking about what the actual downsides are, as compared to other
options.

For example, you mentioned there is a consistency problem in the
paragraph above. What is it? If you mean the problem with refs/replace,
then yes, that is an open problem to be solved (though not a hard one,
as I already mentioned a solution elsewhere). But is that problem
better or worse with this solution versus an embedded generation
number? It seems to me that an embedded generation number is even
worse.

> Just look at the size of the patches in question. Your caching patches
> are bigger and more complicated. Sure, part of it is that your series
> adds the code to _use_ the generation number, but look purely at the
> code to maintain them.

It's 300 lines of code. That can also be used to store arbitrary
meta-information for commits. I've already achieved significant
speedups in some workflows by caching patch-id calculations, which
would reuse this code. And I certainly don't think _that_ should go
into the commit object.

Yes, it's more complex than simply adding a generation number to the
commit header. But simply adding a generation number does not actually
give the 100-fold speedup I'm seeing. So again, I'm not interested in
rejecting solutions out of hand; I'm interested in things like: is the
complexity of the cache worth this speedup? What other options do we
have, and what speedup do they provide? Do we care enough about this
speedup to even bother?

> Why do you think the odd separate cache is somehow better than just
> doing it right? Seriously? If we require the generation numbers, then
> they have *become* that minimal information that we should save!

What do you do when generation numbers don't match the DAG represented
by the parent pointers? Are you proposing to just ignore it? I'm not
asking that question adversarially; ignoring may be the sane thing to
do, and we say "generation numbers are to be trusted, even if they
don't match parent pointers".

> Now, six years later, you clearly are saying that we need the
> generation numbers, but then you go off and try to say that they
> should be in some secondary non-architected random collection of data
> structures that isn't covered by the security and maintenance
> guarantees that the core git objects are.

I don't think I said we clearly need them. I said we can get speedups
by using them, and I showed some patches. I _also_ posted patches
showing how to accomplish similar speedups using timestamps. Note that
all of my patches started with "RFC". I am trying to figure out which
is the best way to proceed.

And why _would_ they need to be covered by the security and maintenance
guarantees of core objects? You can trivially calculate them from the
core objects. Are pack indices also a "secondary non-architected random
collection of data structures"?

> Dammit, one of the things that makes git special is that the data
> structures are NOT random odd ad-hoc files. There is a design to them.

There is just as much documentation and design for the new file format
I added as there is for pack indices (in fact, they're quite similar in
design). I really see them at the same level: something we calculate to
speed up some algorithms, but something we could regenerate at any time
if we felt like.

> > And so that seems a bit hack-ish to me.
>
> Um? If you feel that way, then why the hell are you pushing your EVEN
> MORE HACKISH CACHE PATCHES?

Please, there is really no need to shout. And I find it quite silly
that you would refer to me as "pushing" these patches when they have
been clearly listed as RFC, and everything I have posted in the nearby
threads has been about comparing different strategies (with patches and
timings for some of those other strategies!).

> We don't do caches in git, because we don't NEED to. Sure, gitk has
> its hacky cache, but that's not core functionality.

I'm sorry to tell you that there is already a cache for external
conversion of diffs for blobs. And that I have a patch series which
makes "git cherry" much more pleasant to use by caching patch ids.

Do we "need" those? No, of course not. Git works just fine without
them, albeit a bit slower. But is it sometimes worth making a
space-time tradeoff to make some algorithms faster? I think it
sometimes is, depending on the space and time factors, and the
complexity of the storage (e.g., consistency problems with caching).

> I think it's a sign of good design that we can do a "find .git" and
> explain every single file, and show that it's all core functionality
> (again, with the exception of "gitk.cache", and I suspect that's
> because gitk is a script, not because of any really fundamental data
> issues), and explain it.

Would it make you happier if we stored the generation data in the pack
index when we index the packs?

> I think the *cache* is a hell of a lot more hacky than just doing it right.

You still haven't explained how we would "do it right" and get the same
speedups. When I responded to your initial email, your answers were
along the lines of "we could cache fewer things". If your position is
to damn the speedup, the cache is not worth the complexity, I can buy
that. If your position is that the complexity is not worth it, and we
are better off to keep using timestamps, I can buy that. If your
position is that you can find a clever way, using only the generation
numbers in newly created commits, to get similar speedups in "git
{tag,branch} --contains", I'd love to hear it.

> > I liken it somewhat to the "don't store renames" debate.
>
> That's total and utter bullshit.
>
> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.

Yes, I am well aware that stored renames are wrong for merging. The
problem is that they are not a function of a tree state (which is what
a commit stores), but rather of the difference between two states. So
when you diff the commit's state with some other arbitrary merge-base,
any renames recorded at commit time would be worthless.

But consider another case. Each time I run "git log -M --raw", I
compute the same renames over and over. Let's say I have a case in
which this is annoyingly slow, and want to speed it up. The state of a
particular commit and the state of its parents are invariants for a
particular sha1 commit id; this is a fundamental property of git, as
you well know. So for a given rename-detection algorithm (and any
parameters it has), the set of renames between the states will also be
an invariant.

Now imagine I create a persistent cache mapping the commit sha1 for
some sane default set of rename algorithm parameters to a set of rename
pairs. My annoyingly slow "log -M" is now faster, and I'm happier.

I think you encounter a similar set of questions here as you do with
the concept of a generation header. If the information is an invariant
for a particular commit sha1, can we and should we store it in the
commit object? Is the speedup worth the complexity of a cache? What are
the circumstances under which the cache is not applicable, and how
often do they come up? Can we accurately detect when the cache is not
applicable?

And that is why I compared it to the idea of storing renames. Please
note that I did _not_ say they were exactly the same situation, or that
the answers to one set of questions were the same as the answers to
another.

-Peff
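The rename cache imagined in the message above hinges on one point: the result is an invariant only for a given commit id together with the algorithm parameters, so both have to be part of the cache key. The sketch below shows that idea with an in-memory dictionary. The class, the `detect` callback and the `similarity` parameter are hypothetical, and a real cache of this kind would be persistent rather than in memory.

```python
from typing import Callable, Dict, FrozenSet, Tuple

RenamePairs = FrozenSet[Tuple[str, str]]


class RenameCache:
    """Memoize rename detection per (commit id, parameters)."""

    def __init__(self, detect: Callable[[str, int], RenamePairs]):
        self._detect = detect  # the expensive rename-detection function
        self._store: Dict[Tuple[str, int], RenamePairs] = {}
        self.misses = 0

    def renames(self, commit_id: str, similarity: int = 50) -> RenamePairs:
        # The parameters are part of the key: a result computed with one
        # threshold says nothing about another threshold.
        key = (commit_id, similarity)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._detect(commit_id, similarity)
        return self._store[key]
```

Because a commit id pins down the commit's state and its parents' states, an entry never needs invalidating; the cache simply does not apply when the caller asks about a different pair of states, such as a diff against an arbitrary merge base.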
* Re: Git commit generation numbers
2011-07-15 7:46 ` Jeff King
@ 2011-07-15 16:10 ` Linus Torvalds
2011-07-15 16:18 ` Shawn Pearce
2011-07-15 19:48 ` Jeff King
0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 16:10 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 12:46 AM, Jeff King <peff@peff.net> wrote:
>
> So you don't see a difference between storing the information directly
> in the commit object, where it affects the sha1 of the commit, and
> calculating and storing it somewhere else?

Sure, I see the difference. And I think it's uglier to have two
different places for required information.

> That is what seems ungit to
> me. You aren't adding new information to the DAG (note I said "DAG" and
> not commit) that is not already there, but you are changing the ids of
> commits in the DAG.

Umm. It's redundant, but so what? We have tons of redundant
information in there already. Those commits are very explicitly using
a 40-byte ASCII representation of the 20-byte SHA1 names.

The very original deeper object structure is also redundant: we repeat
the object size in the object itself, even though it's part of the
implicit object format itself. We also very purposefully repeat the
type of the object there, even though the type is basically always
redundant (in fact, the core git functions require you to give the
type of the object as part of the lookup, and will error out if the
SHA1 points to the wrong type). That was one of my original design
decisions, exactly because I wanted the redundancy for verification.

Redundancy isn't a problem. It's a source of sanity checking. I'm not
seeing why you are harping on it. I think it's much worse to have the
same information in two different places where it can cause
inconsistencies that are hard to see and may not be repeatable.

If git ever finds the wrong merge base (because, say, the generation
numbers are wrong), I want it to be a *repeatable* thing. I want to be
able to repeat on the git mailing list "hey, guys, look at what
happens when I try to merge commits ABC and XYZ". If you go "yeah, it
works for me", then that is bad.

What I tried very hard to do in the git data structures is to make
them (a) immutable (so the DAG could never have two-way links, for
example) and (b) "simple".

Right now, we do *have* a "generation number". It's just that it's
very easy to corrupt even by mistake. It's called "committer date". We
could improve on it.

> Are packfiles unclean, or a random hack? How about pack indices?

No. Neither of them are unclean or random. The original git design was
very much about thinking of the object space as a "filesystem". Now,
the original object layout actually used the native OS filesystem, and
I naively thought that would be ok. Using a specialized filesystem
instead doesn't really change anything. It's not fundamentally
different from the difference between running git on ext3 or btrfs or
nfs or whatever. In fact, I think we've had more filesystem-related
bugs wrt NFS than we've had with pack-files.

The pack indices are actually kind of ugly - and I would have
preferred having them in the same file instead of having the worry of
consistency across two different files. They *are* the kind of thing
that could cause local inconsistency, but they are fairly simple, and
they have some serious protection in them (ie they aren't just SHA1'd
in themselves, they contain a SHA1 of the pack-file they index in them
to make sure that any inconsistency is findable). Again, that's
"redundancy".

But I consider the packfile/index to be just a filesystem. It really
fundamentally *is* that.

Partly for that reason, I do think that if the generation count was
embedded in the pack-file, that would not be an "ugly" decision. The
pack-files have definitely become "core git data structures", and are
more than just a local filesystem representation of the objects:
they're obviously also the data transport method, even if the rules
there are slightly different (no index, thank god, and incomplete
"thin" packs).

That said, I don't think a generation count necessarily "fits" in the
pack-file. They are designed to be incremental, so it's not very
natural there. But I do think it would be conceptually prettier to
have the "depth of commit" be part of the "filesystem" data than to
have it as a separate ad-hoc cache.

> Those things rely on the idea that the git DAG is a data model that we
> present to the user, but that we're allowed to do things behind the
> scenes to make things faster.

.. and that is relevant to this discussion exactly *how*?

It's not. It's totally irrelevant. I certainly would never walk away
from the DAG model. It's a fundamental git decision, and it's the
correct one.

But it all boils down to one simple issue: we should have added
generation counts back in 2005. It's likely the *one* data format
decision that I regret. Using commit dates was wrong. Everybody knew
it was wrong, but we ended up going with it just to keep the format
constant.

If I had realized how small the patch was to add generation counters,
and that it wouldn't have broken backwards compatibility (ie fsck
doesn't start complaining), I would have done it originally, instead
of all the crazy hacks we did for commit date verification.

And that is what this discussion fundamentally boils down to for me.

If we should have fixed it in the original specification, we damn well
should fix it today. It's been "ignorable" because it's just not been
important enough. But if git now adds a fundamental cache for them,
then that information is clearly no longer "not important enough".

Linus
* Re: Git commit generation numbers
2011-07-15 16:10 ` Linus Torvalds
@ 2011-07-15 16:18 ` Shawn Pearce
2011-07-15 16:44 ` Linus Torvalds
2011-07-15 19:48 ` Jeff King
1 sibling, 1 reply; 89+ messages in thread
From: Shawn Pearce @ 2011-07-15 16:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 09:10, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Right now, we do *have* a "generation number". It's just that it's
> very easy to corrupt even by mistake. It's called "committer date". We
> could improve on it.
...
> If I had realized how small the patch was to add generation counters,
> and that it wouldn't have broken backwards compatibility (ie fsck
> doesn't start complaining), I would have done it originally, instead
> of all the crazy hacks we did for commit date verification.

What about going forward making the requirement that a new commit must
have a committer date whose date is >= the maximum date of its
parents?

We could also add a check during fast-forward merges to refuse to
perform the merge if the incoming commit has a committer date too far
forward in the future (e.g. more than 5 minutes). If you pull from a
moron whose system clock is set such that the committer date isn't a
proxy for generation number, Git would just refuse the merge, and you
could ask them to fix their objects.

--
Shawn.
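The two rules proposed in the message above can be sketched as one small validation function. This is only an illustration of the proposal, not an existing git check; the function name, the return strings and the use of plain integer timestamps are assumptions, and the 300-second limit mirrors the "more than 5 minutes" example.

```python
from typing import Sequence


def check_commit_date(commit_date: int, parent_dates: Sequence[int],
                      now: int, max_future: int = 300) -> str:
    """Classify a commit's committer date against the proposed rules."""
    # Rule 1: the date must be >= the maximum date of its parents.
    if parent_dates and commit_date < max(parent_dates):
        return "older-than-parent"
    # Rule 2: refuse dates too far in the future of the local clock.
    if commit_date > now + max_future:
        return "too-far-in-future"
    return "ok"


print(check_commit_date(1000, [900, 950], now=1000))
print(check_commit_date(940, [900, 950], now=1000))
print(check_commit_date(2000, [900], now=1000))
```

With both rules enforced on every new commit, dates could never decrease along a parent chain, which is the property that would let the committer date stand in for a generation number.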
* Re: Git commit generation numbers 2011-07-15 16:18 ` Shawn Pearce @ 2011-07-15 16:44 ` Linus Torvalds 2011-07-15 18:42 ` Ted Ts'o 2011-07-15 18:46 ` Tony Luck 0 siblings, 2 replies; 89+ messages in thread From: Linus Torvalds @ 2011-07-15 16:44 UTC (permalink / raw) To: Shawn Pearce; +Cc: Jeff King, Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 9:18 AM, Shawn Pearce <spearce@spearce.org> wrote: > > What about going forward making the requirement that a new commit must > have a committer date whose date is >= the maximum date of its > parents? So you suggest just making commit dates be the generation number. I'd be ok with that. It's basically what we've been doing for the last six years. But in that case, we shouldn't be doing the generation count cache either. Btw, I do agree that we probably should add a warning for the case ("your clock is wrong - your commit date is before the commit date of your parents") and maybe require the use of "-f" or something to override it. That would certainly be a good thing quite independently of anything else. So regardless of generation counts, it's probably worth it. But if you think commit date is good enough for generation counts - and I'm not arguing against it - then please tell me why you would then want to have a separate generation count cache. So I would like to repeat: I think our commit-date based hack has been pretty successful. We've lived with it for years and years. Even the "let's try to fix it by adding slop" code is from three years ago (commit 7d004199d1), which means that for three years we never really saw any serious problems. I forget what problem we actually did see - I have this dim memory of it being Ted that had problems with a merge because git picked a crap merge base, but that may just be my Alzheimer's speaking. 
Obviously there are cases where we miss some merge base and it doesn't really end up mattering, so we may well have a *ton* of commits that have bad dates, but they just haven't affected us enough for us to care. That's fine too - I dislike how our algorithm isn't truly reliable, but at the same time I think we're so robust that it all works regardless. So I think it's ugly and fairly hacky, but it has worked well enough in practice. I dislike our commit dates, but I don't _hate_ them. I do think it was a mistake, but not one I'm especially ashamed of. So why do I dislike the generation count cache so much? I dislike it exactly because "if the commit date isn't good enough, then dammit, we should have just added a generation count". And if we should have added it six years ago, then we should add it today. Not say "oh, we made a mistake six years ago, let's work around the mistake instead of fixing it". That's really what it boils down to. Let's not paper over a mistake. Either we need the generation depth or we don't. And if we do need it, we should replace the date-based hackery with it (where "replace" may well be "still fall back on our traditional date-based hackery in the absence of generation counters"). But if we decide that we don't really need generation counters AT ALL, and can just continue with the commit date hack, then I'm personally ok with that too. So to me, it's an "either or" situation. Either the commit dates are good enough, or we should add generation counts to the commits. But in *neither* case is it ok to do some external cache to work around it. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 16:44 ` Linus Torvalds @ 2011-07-15 18:42 ` Ted Ts'o 2011-07-15 19:00 ` Linus Torvalds 2011-07-16 9:16 ` Christian Couder 2011-07-15 18:46 ` Tony Luck 1 sibling, 2 replies; 89+ messages in thread From: Ted Ts'o @ 2011-07-15 18:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 09:44:21AM -0700, Linus Torvalds wrote: > So I would like to repeat: I think our commit-date based hack has been > pretty successful. We've lived with it for years and years. Even the > "let's try to fix it by adding slop" code is from three years ago > (commit 7d004199d1), which means that for three years we never really > saw any serious problems. I forget what problem we actually did see - > I have this dim memory of it being Ted that had problems with a merge > because git picked a crap merge base, but that may just be my > Alzheimer's speaking. My original main issue was simply that "git tag --contains" and "git branch --contains" was either (a) incorrect, or (b) slower than popping up gitk and pulling the information out of the GUI. The reason for (b) is because of gitk.cache. Maybe the answer then is creating a command-line tool (it doesn't have to be in "core" of git) which just pulls the dammned information out of gitk.cache.... (Yes, it's gross, but I'm not worrying about the long-term architecture of git or anything high-falutin' like that. I'm just a poor dumb user who just wants git tag --contains and git branch --contains to be fast and accurate...) - Ted ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 18:42 ` Ted Ts'o @ 2011-07-15 19:00 ` Linus Torvalds 2011-07-16 9:16 ` Christian Couder 1 sibling, 0 replies; 89+ messages in thread From: Linus Torvalds @ 2011-07-15 19:00 UTC (permalink / raw) To: Ted Ts'o; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 11:42 AM, Ted Ts'o <tytso@mit.edu> wrote: > > My original main issue was simply that "git tag --contains" and "git > branch --contains" was either (a) incorrect, or (b) slower than > popping up gitk and pulling the information out of the GUI. The > reason for (b) is because of gitk.cache. With "original issue" I actually meant the case that caused us to add the "slop" commit (7d004199d1). But I was too lazy to try to find the archives from March 2008.. Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 18:42 ` Ted Ts'o 2011-07-15 19:00 ` Linus Torvalds @ 2011-07-16 9:16 ` Christian Couder 2011-07-18 3:41 ` Jeff King 1 sibling, 1 reply; 89+ messages in thread From: Christian Couder @ 2011-07-16 9:16 UTC (permalink / raw) To: Ted Ts'o Cc: Linus Torvalds, Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 8:42 PM, Ted Ts'o <tytso@mit.edu> wrote: > On Fri, Jul 15, 2011 at 09:44:21AM -0700, Linus Torvalds wrote: >> So I would like to repeat: I think our commit-date based hack has been >> pretty successful. We've lived with it for years and years. Even the >> "let's try to fix it by adding slop" code is from three years ago >> (commit 7d004199d1), which means that for three years we never really >> saw any serious problems. I forget what problem we actually did see - >> I have this dim memory of it being Ted that had problems with a merge >> because git picked a crap merge base, but that may just be my >> Alzheimer's speaking. > > My original main issue was simply that "git tag --contains" and "git > branch --contains" was either (a) incorrect, or (b) slower than > popping up gitk and pulling the information out of the GUI. The > reason for (b) is because of gitk.cache. > > Maybe the answer then is creating a command-line tool (it doesn't have to > be in "core" of git) which just pulls the dammned information out of > gitk.cache.... > > (Yes, it's gross, but I'm not worrying about the long-term > architecture of git or anything high-falutin' like that. I'm just a > poor dumb user who just wants git tag --contains and git branch > --contains to be fast and accurate...) If "git tag --contains" and "git branch --contains" give incorrect answers because the commiter date is wrong in some commits, then why not use "git replace" to "change" the commiter date in the commits that have a wrong date? 
Is it because you don't want to use "git replace", or because there is no script to do it automatically, or is there another reason? Thanks, Christian. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-16 9:16 ` Christian Couder
@ 2011-07-18 3:41 ` Jeff King
2011-07-19 4:14 ` Christian Couder
0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-18 3:41 UTC (permalink / raw)
To: Christian Couder
Cc: Ted Ts'o, Linus Torvalds, Shawn Pearce, Git Mailing List, Junio C Hamano

On Sat, Jul 16, 2011 at 11:16:45AM +0200, Christian Couder wrote:

> If "git tag --contains" and "git branch --contains" give incorrect
> answers because the commiter date is wrong in some commits, then why
> not use "git replace" to "change" the commiter date in the commits
> that have a wrong date? Is it because you don't want to use "git
> replace", or because there is no script to do it automatically, or is
> there another reason?

That would work. There are a few tricky things, though:

  1. Most repositories have fewer than 100 skewed commits. But some have
     many (e.g., thousands in the mesa repo). How well does git cope with
     large numbers of replace refs, performance-wise?

  2. Declaring which commits are skewed is actually tricky. You can find
     a commit whose timestamp is less than the timestamp of one of its
     ancestors. But you don't know whether it is skewed, or the
     ancestor.

     If you are implementing a list of commits whose timestamps
     shouldn't be used for traversal cutoff, it doesn't really matter
     who is _right_; you just care about whether the timestamps are
     strictly increasing from that point.

     But once you start replacing commits, you need to put in a
     reasonable value for the timestamp. So you may well be replacing a
     perfectly valid commit with one that has bogus, skewed information
     in the commit timestamp.

  3. Any value you put in is actually going to be a lie during things
     like "git log --pretty=raw". That may be OK. But it is letting an
     optimization meant to make traversal fast and accurate bleed into
     the actual data we show the user.

  4. Sometimes we need to do traversals on the real objects (e.g.,
     because we are doing upload-pack). To get the benefit, those
     traversals would presumably need to look at both the original
     object and the replacement, use the timestamp from the replacement
     for traversal, but otherwise use the original object.

-Peff

^ permalink raw reply [flat|nested] 89+ messages in thread
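Point 2 above can be made concrete with a small Python sketch. It flags every commit whose timestamp is lower than that of some ancestor, which is the only thing that can be detected mechanically. The dictionary-based graph representation and the function name are assumptions made for illustration; nothing here is git code.

```python
def find_skew_candidates(parents, timestamps):
    """Return commits whose timestamp is lower than that of some ancestor.

    parents:    dict mapping commit name -> list of parent names
    timestamps: dict mapping commit name -> committer time (int)
    """
    newest_ancestor = {}  # commit -> max timestamp over all of its ancestors

    def visit(start):
        # Iterative post-order walk, so deep histories do not hit
        # Python's recursion limit.
        stack = [(start, False)]
        while stack:
            commit, expanded = stack.pop()
            if commit in newest_ancestor:
                continue
            if not expanded:
                # Revisit this commit once all of its parents are done.
                stack.append((commit, True))
                stack.extend((p, False) for p in parents[commit]
                             if p not in newest_ancestor)
            else:
                best = None
                for p in parents[commit]:
                    for t in (timestamps[p], newest_ancestor[p]):
                        if t is not None and (best is None or t > best):
                            best = t
                newest_ancestor[commit] = best

    for commit in parents:
        visit(commit)
    return sorted(c for c, newest in newest_ancestor.items()
                  if newest is not None and timestamps[c] < newest)
```

For a linear history A(100), B(200), X(150), C(300), the function reports only X. As the message points out, that does not say whether X was dated too early or B too late. It only says that the timestamps stop increasing at X, which is enough for a "do not cut off the traversal here" list but not enough to choose a corrected date.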
* Re: Git commit generation numbers
2011-07-18 3:41 ` Jeff King
@ 2011-07-19 4:14 ` Christian Couder
2011-07-19 20:00 ` Jeff King
0 siblings, 1 reply; 89+ messages in thread
From: Christian Couder @ 2011-07-19 4:14 UTC (permalink / raw)
To: Jeff King
Cc: Christian Couder, Ted Ts'o, Linus Torvalds, Shawn Pearce, Git Mailing List, Junio C Hamano

On Monday 18 July 2011 05:41:06 Jeff King wrote:
> On Sat, Jul 16, 2011 at 11:16:45AM +0200, Christian Couder wrote:
> > If "git tag --contains" and "git branch --contains" give incorrect
> > answers because the commiter date is wrong in some commits, then why
> > not use "git replace" to "change" the commiter date in the commits
> > that have a wrong date? Is it because you don't want to use "git
> > replace", or because there is no script to do it automatically, or is
> > there another reason?
>
> That would work. There are a few tricky things, though:
>
>   1. Most commits have less than 100 skewed commits. But some have many
>      (e.g., thousands in the mesa repo). How well does git cope with
>      large numbers of replace refs, performance-wise?

If it did not cope well, it should be possible to improve the performance.

Anyway, another way to fix the problem with "git replace" could be to create
branches with commits that have a fixed committer date and then to use
"git replace" only to connect these branches to the graph.

For example if you have this:

A - B - X1 - X2 - X3 - C - D

where X1, X2 and X3 are skewed, then you can create this:

A - B - X1 - X2 - X3 - C - D
      \
        Y1 - Y2 - Y3

where Y1, Y2, Y3 are the same as X1, X2, X3 except they are not skewed.
Then you only need to do "git replace X3 Y3" so you create only one
replace ref.

>
>   2. Declaring which commits are skewed is actually tricky. You can find
>      a commit whose timestamp is less than the timestamp of one of its
>      ancestors. But you don't know whether it is skewed, or the
>      ancestor.
>
>      If you are implementing a list of commits whose timestamps
>      shouldn't be used for traversal cutoff, it doesn't really matter
>      who is _right_; you just care about whether the timestamps are
>      strictly increasing from that point.
>
>      But once you start replacing commits, you need to put in a
>      reasonable value for the timestamp. So you may well be replacing a
>      perfectly valid commit with one that has bogus, skewed information
>      in the commit timestamp.

Perhaps but with "git replace" you can choose to create new replace refs and
deprecate the old replace refs to fix this where you got it wrong.

It would be easier to do that if "git replace" supported sub directories like
"refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs
more easily. For example you could create new refs in
"refs/replace/clock-skew/ted-july-2011-2/" if you found a better fix. And then
use these new refs instead of those in "refs/replace/clock-skew/ted-july-2011/".

>   3. Any value you put in is actually going to be a lie during things
>      like "git log --pretty=raw". That may be OK. But it is letting an
>      optimization meant to make traversal fast and accurate bleed into
>      the actual data we show the user.

With replace refs, the user could choose the "lies" told to him/her by
selecting the replace refs or set of replace refs that are used.

As commits are immutable, when they are created with bad data, the best we
can do is let the user choose if they want to see the original or another
"fixed" version. Because the original will always be "true" in a way.

>   4. Sometimes we need to do traversals on the real objects (e.g.,
>      because we are doing upload-pack). To get the benefit, those
>      traversals would presumably need to look at both the original
>      object and the replacement, use the timestamp from the replacement
>      for traversal, but otherwise use the original object.

Yeah, or maybe when we do traversals on real objects we could afford not to
rely on committer date or some other "fragile" data.

Thanks,
Christian.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-19 4:14 ` Christian Couder @ 2011-07-19 20:00 ` Jeff King 2011-07-21 6:29 ` Christian Couder 0 siblings, 1 reply; 89+ messages in thread From: Jeff King @ 2011-07-19 20:00 UTC (permalink / raw) To: Christian Couder Cc: Christian Couder, Ted Ts'o, Linus Torvalds, Shawn Pearce, Git Mailing List, Junio C Hamano On Tue, Jul 19, 2011 at 06:14:38AM +0200, Christian Couder wrote: > > But once you start replacing commits, you need to put in a > > reasonable value for the timestamp. So you may well be replacing a > > perfectly valid commit with one that has bogus, skewed information > > in the commit timestamp. > > Perhaps but with "git replace" you can choose to create new replace refs and > deprecate the old replace refs to fix this where you got it wrong. > > It would be easier to do that if "git replace" supported sub directories like > "refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs > more easily. I think all of the arguments I cut from your email are reasonable, but the crux of the issue comes down to this point. If you are interested in actually correcting the skew, then yes, replace refs are a good solution. But doing so is going to involve somebody looking at the commits and deciding which ones are wrong, and what they should be. And maybe that's a good thing to do for people who really care about cleaning history. But for something like "speed up revision traversal by assuming commit timestamps are roughly increasing", we want something very automated, and what it needs to say is much weaker (not "this is what this commit _should_ say", but rather "this commit might be right, but it is not a good point for cutting off a traversal"). So that's a much easier problem, and it's easy to do in an automated way. So I think while you could use replace refs to handle this issue, it is not always going to be the right solution, and there is room for something simpler (and weaker). 
-Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-19 20:00 ` Jeff King @ 2011-07-21 6:29 ` Christian Couder 0 siblings, 0 replies; 89+ messages in thread From: Christian Couder @ 2011-07-21 6:29 UTC (permalink / raw) To: Jeff King Cc: Christian Couder, Ted Ts'o, Linus Torvalds, Shawn Pearce, Git Mailing List, Junio C Hamano On Tue, Jul 19, 2011 at 10:00 PM, Jeff King <peff@peff.net> wrote: > On Tue, Jul 19, 2011 at 06:14:38AM +0200, Christian Couder wrote: > >> Perhaps but with "git replace" you can choose to create new replace refs and >> deprecate the old replace refs to fix this where you got it wrong. >> >> It would be easier to do that if "git replace" supported sub directories like >> "refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs >> more easily. > > I think all of the arguments I cut from your email are reasonable, but > the crux of the issue comes down to this point. > > If you are interested in actually correcting the skew, then yes, replace > refs are a good solution. But doing so is going to involve somebody > looking at the commits and deciding which ones are wrong, and what they > should be. I think that we can help the user a lot to find the skew, and then to decide which commits are wrong, and then to fix the skew even if the fix we suggest is far from being perfect. > And maybe that's a good thing to do for people who really > care about cleaning history. Yeah, so maybe at one point we will want to help these people even if we have implemented automatic generation numbers. Then this means that automated generation numbers are useful only if: 1) there are commits with skews 2) the heuristics to deal with some skew don't work 3) the user is too lazy to use the help we (can) provide to fix the skews I think that we can probably find heuristics that will deal with at least 95% of the cases. For example we could perhaps decide that we don't cut off a traversal until the date difference is greater than 5 days. 
Then in the hopefully few cases where there are really big skews that won't be caught by our heuristics (but that we can automatically detect when fetching or committing), we can perhaps afford to ask the user to do a small analysis to properly fix the skew. I mean that at one point when things are too weird it is ok and perhaps even a good thing to involve the user. > But for something like "speed up revision traversal by assuming commit > timestamps are roughly increasing", we want something very automated, > and what is needs to say is much weaker (not "this is what this commit > _should_ say", but rather "this commit might be right, but it is not a > good point for cutting off a traversal"). So that's a much easier > problem, and it's easy to do in an automated way. Yeah, generation numbers look like an easy thing to do. And yeah, being automated is great too. But it does not mean it is the right thing to do. (Or perhaps we could have them but not save them in any cache, nor in the commit object.) > So I think while you could use replace refs to handle this issue, it is > not always going to be the right solution, and there is room for > something simpler (and weaker). You know, replace refs can be used to fix or improve a lot of things like bad authors, clock skews, bisecting on a fixed up history, working on a larger or smaller repository than the original, and so on. And of course for each of these problems you may find another solution tailored to the problem at hand that will seem simpler or easier. But in the end if you develop all these other solutions you will have developed a lot of stuff that will be harder to maintain, less generic, more complex and so on, than properly developed replace refs. Thanks, Christian. ^ permalink raw reply [flat|nested] 89+ messages in thread
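The five-day heuristic suggested above can be modelled as a reachability walk that stops descending once a commit is dated more than the allowed slop before the commit being searched for. The Python below is a simplified model, not git's revision-walking code: the graph dictionaries, the function name and the `SLOP` constant are assumptions made for illustration.

```python
# Hypothetical tolerance: the "5 days" figure from the message, in seconds.
SLOP = 5 * 24 * 60 * 60


def contains(parents, timestamps, tip, target, slop=SLOP):
    """Return True if `target` is found walking back from `tip`.

    The walk is pruned at any commit dated more than `slop` seconds
    before `target`, on the assumption that such a commit cannot have
    `target` as an ancestor.
    """
    cutoff = timestamps[target] - slop
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit == target:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        if timestamps[commit] < cutoff:
            continue  # assumed too old; do not look at its parents
        stack.extend(parents[commit])
    return False
```

In a history A, X, C where X is dated three days before its parent A, the walk from C still finds A with the five-day slop but misses it with a slop of zero. If X is dated eight days before A, the walk misses A even with the slop, which is the class of large skews the message proposes handing to the user.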
* Re: Git commit generation numbers 2011-07-15 16:44 ` Linus Torvalds 2011-07-15 18:42 ` Ted Ts'o @ 2011-07-15 18:46 ` Tony Luck 2011-07-15 18:58 ` Linus Torvalds 1 sibling, 1 reply; 89+ messages in thread From: Tony Luck @ 2011-07-15 18:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 9:44 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Btw, I do agree that we probably should add a warning for the case > ("your clock is wrong - your commit date is before the commit date of > your parents") and maybe require the use of "-f" or something to > override it. That would certainly be a good thing quite independently > of anything else. So regardless of generation counts, it's probably > worth it. What if my clock is wrong in the opposite direction - set to some time out in 2025. It would pass the check you propose and let the commit go in - but would cause problems for everyone if that tree was pulled into upstream. You'd also want a check in pull(merge) that none of the commits being added were in the future (as defined by the time on your machine). -Tony ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 18:46 ` Tony Luck @ 2011-07-15 18:58 ` Linus Torvalds 0 siblings, 0 replies; 89+ messages in thread From: Linus Torvalds @ 2011-07-15 18:58 UTC (permalink / raw) To: Tony Luck; +Cc: Shawn Pearce, Jeff King, Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 11:46 AM, Tony Luck <tony.luck@intel.com> wrote: > > What if my clock is wrong in the opposite direction - set to some time > out in 2025. > It would pass the check you propose and let the commit go in - but would > cause problems for everyone if that tree was pulled into upstream. I think Shawn suggested that we just notice it at merge time. But yes, it's why (a) I'd suggest we have a "-f" to override and (b) I do think that generation counts are a better idea. You could still screw them up, but it would be due to an outright bug or malicious behavior, rather than simple incompetence on the part of a user. Incompetent users (where "date on the machine set to the wrong century" is just _one_ sign of incompetence) are something git should pretty much take for granted. It may not be the common case, but it's certainly something we should design for and take into account. In contrast, if somebody *wants* to screw his repository up by re-writing objects with "git hash-object" etc, be my guest. We should just make sure fsck catches anything serious. So I would suggest checking the date regardless of any generation count issues, because it would possibly find badly configured machines that should be fixed. The same way we complain when we find no name. Whether it should then be a correctness issue or not is kind of separate. > You'd also want a check in pull(merge) that none of the commits being > added were in the future (as defined by the time on your machine). I don't think you need to care about "none of the commits", just making sure the tip is reasonable. 
That would not only be expensive, and not what we normally do (we show the diff against endpoints, not all changes, etc). It would also cause problems for "fixed" repositories (ie anything that has historical dates that are wrong, but are ok now). Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 16:10 ` Linus Torvalds 2011-07-15 16:18 ` Shawn Pearce @ 2011-07-15 19:48 ` Jeff King 2011-07-15 20:07 ` Jeff King 2011-07-15 21:17 ` Linus Torvalds 1 sibling, 2 replies; 89+ messages in thread From: Jeff King @ 2011-07-15 19:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 09:10:48AM -0700, Linus Torvalds wrote: > I think it's much worse to have the same information in two different > places where it can cause inconsistencies that are hard to see and may > not be repeatable. If git ever finds the wrong merge base (because, > say, the generation numbers are wrong), I want it to be a *repeatable* > thing. I want to be able to repeat on the git mailing list "hey, guys, > look at what happens when I try to merge commits ABC and XYZ". If you > go "yeah, it works for me", then that is bad. Having the information in two different places is my concern, too. And I think the fundamental difference between putting it inside or outside the commit sha1 (where outside encompasses putting it in a cache, in the pack-index, or whatever), is that I see the commit sha1 as somehow more "definitive". That is, it is the sole data we pass from repo to repo during pushes and pulls, and it is the thing that is consistency-checked by hashes. So if there is an inconsistency between what the parent pointers represent, and what the generation number in "outside" storage says, then the outside storage is wrong, and the parent pointers are the right answer. It becomes a lot more fuzzy to me if there is an inconsistency between what the parent pointers represent, and what the generation number says. How should that situation be handled? Should fsck check for it and complain? Should we just ignore it, even though it may cause our traversal algorithms to be inaccurate? Like clock skew, there's not much that can be done if the commits are published. 
Those are serious questions that I think should be considered if we are going to put a generation header into the commit object, and I haven't seen answers for them yet. > Partly for that reason, I do think that if the generation count was > embedded in the pack-file, that would not be an "ugly" decision. The > pack-files have definitely become "core git data structures", and are > more than just a local filesystem representation of the objects: > they're obviously also the data transport method, even if the rules > there are slightly different (no index, thank god, and incomplete > "thin" packs). > > That said, I don't think a generation count necessarily "fits" in the > pack-file. They are designed to be incremental, so it's not very > natural there. But I do think it would be conceptually prettier to > have the "depth of commit" be part of the "filesystem" data than to > have it as a separate ad-hoc cache. Sure, I would be fine with that. When you say "packfile", do you mean the general concept, as in it could go in the pack index as opposed to the packfile itself? Or specifically in the packfile? The latter seems a lot more problematic to me in terms of implementation. > > Those things rely on the idea that the git DAG is a data model that we > > present to the user, but that we're allowed to do things behind the > > scenes to make things faster. > > .. and that is relevant to this discussion exactly *how*? Because keeping the generation information outside of the DAG keeps the model we present to the user simple (and not just the user; the information that we present to other programs), but lets git still use the information without calculating it from scratch each time. Just like we present the data as a DAG of loose objects via things like "git cat-file", even though the underlying storage inside a packfile may be very different. I just don't see those two ideas as fundamentally different. > It's not. It's totally irrelevant. 
I certainly would never walk away > from the DAG model. It's a fundamental git decision, and it's the > correct one. Of course not. I never suggested we should. > And that is what this discussion fundamentally boils down to for me. > > If we should have fixed it in the original specification, we damn well > should fix it today. It's been "ignorable" because it's just not been > important enough. But if git now adds a fundamental cache for them, > then that information is clearly no longer "not important enough". OK, so let's say we add generation headers to each commit. What happens next? Are we going to convert algorithms that use timestamps to use commit generations? How are we going to handle performance issues when dealing with older parts of history that don't have generations? Again, those are serious questions that need answered. I respect that you think the lack of a generation header is a design decision that should be corrected. As I said before, I'm not 100% sure I agree, but nor do I completely disagree (and I think it largely boils down to a philosophical distinction, which I think you will agree should take a backseat to real, practical concerns). But it's not 2005, and we have a ton of history without generation numbers. So adding them now is only one piece of the puzzle. What's your solution for the rest of it? -Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
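For reference, the quantity under discussion can be computed from parent pointers alone. The Python sketch below assumes the usual definition, which is not spelled out in this part of the thread: a root commit has generation 1 and every other commit has one more than the largest generation among its parents. The second function models the kind of consistency check asked about above, for a hypothetical generation value stored in each commit. The function names and the dictionary-based graph are illustrative assumptions.

```python
def compute_generations(parents):
    """Map each commit to 1 + the largest generation among its parents.

    parents: dict mapping commit name -> list of parent names.
    Roots (no parents) get generation 1; this numbering is an assumption.
    """
    generation = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in generation:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in generation]
            if pending:
                # Number the parents first, then come back to this commit.
                stack.extend(pending)
            else:
                generation[commit] = 1 + max(
                    (generation[p] for p in parents[commit]), default=0)
                stack.pop()
    return generation


def inconsistent_commits(parents, stored):
    """Return commits whose stored number is not above all parents' numbers."""
    return sorted(c for c in parents
                  if any(stored[c] <= stored[p] for p in parents[c]))
```

Any numbering that passes `inconsistent_commits` is usable for traversal cutoff even if it is not the minimal one, because all a walk needs is that a commit's number exceeds the numbers of all its ancestors. A commit that fails the check is in the same position as a commit with a skewed date: once published it cannot be changed, only detected.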
* Re: Git commit generation numbers 2011-07-15 19:48 ` Jeff King @ 2011-07-15 20:07 ` Jeff King 2011-07-15 21:17 ` Linus Torvalds 1 sibling, 0 replies; 89+ messages in thread From: Jeff King @ 2011-07-15 20:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 03:48:07PM -0400, Jeff King wrote: > OK, so let's say we add generation headers to each commit. What happens > next? Are we going to convert algorithms that use timestamps to use > commit generations? How are we going to handle performance issues when > dealing with older parts of history that don't have generations? > > Again, those are serious questions that need answered. I respect that > you think the lack of a generation header is a design decision that > should be corrected. As I said before, I'm not 100% sure I agree, but > nor do I completely disagree (and I think it largely boils down to a > philosophical distinction, which I think you will agree should take a > backseat to real, practical concerns). But it's not 2005, and we have a > ton of history without generation numbers. So adding them now is only > one piece of the puzzle. > > What's your solution for the rest of it? I just read some of your later emails to others in the thread. It seems like your answer is "assume the timestamp-based limiting is good enough for old history". I'm OK with that. It obviously falls down in a few specific situations, but certainly has not been an unbearable problem for the past 5 years. -Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 19:48 ` Jeff King 2011-07-15 20:07 ` Jeff King @ 2011-07-15 21:17 ` Linus Torvalds 2011-07-15 21:54 ` Jeff King 2011-07-15 23:10 ` Linus Torvalds 1 sibling, 2 replies; 89+ messages in thread From: Linus Torvalds @ 2011-07-15 21:17 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 12:48 PM, Jeff King <peff@peff.net> wrote: > > Having the information in two different places is my concern, too. And I > think the fundamental difference between putting it inside or outside > the commit sha1 (where outside encompasses putting it in a cache, in the > pack-index, or whatever), is that I see the commit sha1 as somehow more > "definitive". That is, it is the sole data we pass from repo to repo > during pushes and pulls, and it is the thing that is consistency-checked > by hashes. Sure. That is also the data that is the same for everybody. That's a big deal, in the sense that it's the only thing we should rely on if we want consistent behavior. Immediately if core functionality starts using any other data, behavior becomes "local". And I think that's really *really* dangerous. Sure, we have "local behavior" in a lot of small details. We very much intentionally have it in the ref-logs, and since branches and tags are local we also have it in things like "--decorate", which obviously depends on exactly which local refs you have. We also have local behavior in things like .git/config etc files, so git can behave very differently for different people even with what is otherwise an identical repository. So local behavior is good and expected for some things. We *want* it for things like colorization decisions, we want it for aliases, and we want it for branch naming. But really core behavior shouldn't depend on local information. I think it would be wrong if something like a merge base decision would be based on any local information. 
For example, should it matter whether something is packed or not? I really don't think so. That's a pretty random implementation detail, and if we get different end results because some commit happens to be packed, vs not packed (because, say, we'd be hiding generation information in the pack) that would be wrong. Now, we do have things like merge resolution caches etc (which obviously do save and use local information again), but I think that's pretty well clarified. > So if there is an inconsistency between what the parent pointers > represent, and what the generation number in "outside" storage says, > then the outside storage is wrong, and the parent pointers are the right > answer. It becomes a lot more fuzzy to me if there is an inconsistency > between what the parent pointers represent, and what the generation > number says. So I really don't see why you harp on that. If the generation counters are in the objects THEY BY DEFINITION CANNOT BE INCONSISTENT. That's a big issue. Sure, they may be LYING, but that's a different thing entirely. They will be lying to everybody consistently. There would never be any question about what the generation number of a commit is. See what I'm trying to say? There's no way that they would cause different behavior for different people. Everything is 100% consistent. The exact same thing is true of commit dates, btw. They may be confused as hell, and they may cause us to do bad things when we traverse the history, but different clocks on different machines will still not cause git to act differently on different machines. There's no possibility of inconsistency. (Of course, different *versions* of git may traverse the history differently, since we've changed the heuristics over time. So we do have that kind of inconsistent behavior, where we give different results from different versions of git). And btw, having "incorrect" data in the git objects is not the end of the world. 
You can generate merge commits that simply have the wrong parents. That will be confusing as hell to the user, and it will make future merges not work very well, but it's a bug in the archive, and that's "ok". The developers may not be very happy about it. In fact, afaik we've had a few cases like that in the kernel tree, because early git had bugs where it would not properly forget parents after a failed merge. Most of them are ARM-related, because the ARM tree was one of the first users of git (outside of me, but I had fewer issues with what happens when things go wrong). So I would not be *too* shocked if we'd end up with "odd" generation counts due to some odd bug. It sounds unlikely, but my point is that that is not at all what I'd *worry* about. > How should that situation be handled? Should fsck check for it and > complain? Should we just ignore it, even though it may cause our > traversal algorithms to be inaccurate? Like clock skew, there's not much > that can be done if the commits are published. Right. I simply think it's not a big deal. IOW, if we would rely on generation counts instead of clock dates, maybe the generation counts would have occasional problems too, but I suspect they'd be *much* rarer than time-based issues, because at least the generation count is a well-defined number rather than a random thing we pick out of emails and badly maintained machines. That said, I'm not 100% sure at all that we want generation numbers at all. Their use is pretty limited. If we had had them from the beginning, I think we would simply have replaced the date-based commit list sorting with a generation-number-based one, and it should have been possible to guarantee that we never output a parent before the commit in rev-parse. As it is, I have to admit that looking at it, I shudder at changing the current date-based logic and replacing it with a "date or generation number". 
The date-based one, despite all its fuzziness and not being very well
defined ("Global clock in a distributed system? You're a moron") ends
up being a *nice* heuristic for certain human interaction. So it's not
a wonderful solution from a technical standpoint, but it does have (I
think) some nice UI advantages.

(For an example of that: using "--topo-order" for revision history may
be a very good thing technically, but even if it wasn't for the fact
that it's more expensive, I think that our largely time-based default
order for "git log" in many ways is a better interface for humans. Of
course, when mixed with actually giving a history graph, that changes,
because then you want the "related" commits to group together, rather
than by time. So I think it's just basically a fuzzy area, without any
clear hard rules - which is probably why using that fuzzy timestamp
works so well in practice)

> Those are serious questions that I think should be considered if we are
> going to put a generation header into the commit object, and I haven't
> seen answers for them yet.

I do agree that the really *big* question is "do we even need it at
all". I do like perhaps just tightening the commit timestamp rules.
Because I do think they would probably work very well for the
"contains" problem too.

With the exact same fuzzy downsides, of course. Timestamps aren't
perfect, and they need that annoying fuzz factor thing.

>> That said, I don't think a generation count necessarily "fits" in the
>> pack-file. They are designed to be incremental, so it's not very
>> natural there. But I do think it would be conceptually prettier to
>> have the "depth of commit" be part of the "filesystem" data than to
>> have it as a separate ad-hoc cache.
>
> Sure, I would be fine with that. When you say "packfile", do you mean
> the general concept, as in it could go in the pack index as opposed
> to the packfile itself? Or specifically in the packfile?
> The latter seems a lot more problematic to me in terms of implementation.

I was thinking the "general" issue - it might make most sense to put
them in the index.

>> If we should have fixed it in the original specification, we damn well
>> should fix it today. It's been "ignorable" because it's just not been
>> important enough. But if git now adds a fundamental cache for them,
>> then that information is clearly no longer "not important enough".
>
> OK, so let's say we add generation headers to each commit. What happens
> next? Are we going to convert algorithms that use timestamps to use
> commit generations? How are we going to handle performance issues when
> dealing with older parts of history that don't have generations?

So I do think the _initial_ question needs to be the other way around:
do we have to have generation numbers at all?

I think it's likely a design misfeature not to have them, but
considering that we don't, and have been able to make do without for
so long, I'm also perfectly willing to believe that we could speed up
"contains" dramatically with the same kind of (crazy and inexact)
tricks we use for merge bases.

(Looking at a profile, a third - and the top entry - of the "git tag
--contains" profile cost is just in "clear_commit_marks()" - not doing
any real work, rather *undoing* the work in order to re-do things. So
it's entirely possible that the real issue is simply that
"in_merge_bases()" is badly done, and we could speed things up a lot
independently of anything else).

For example, for the "git tag --contains" thing, what's the
performance effect of just skipping tags that are much older than the
commit we ask for?

Linus

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 21:17 ` Linus Torvalds @ 2011-07-15 21:54 ` Jeff King 2011-07-15 23:10 ` Linus Torvalds 1 sibling, 0 replies; 89+ messages in thread From: Jeff King @ 2011-07-15 21:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 02:17:26PM -0700, Linus Torvalds wrote: > > Having the information in two different places is my concern, too. And I > > think the fundamental difference between putting it inside or outside > > the commit sha1 (where outside encompasses putting it in a cache, in the > > pack-index, or whatever), is that I see the commit sha1 as somehow more > > "definitive". That is, it is the sole data we pass from repo to repo > > during pushes and pulls, and it is the thing that is consistency-checked > > by hashes. > > Sure. That is also the data that is the same for everybody. > > That's a big deal, in the sense that it's the only thing we should > rely on if we want consistent behavior. Immediately if core > functionality starts using any other data, behavior becomes "local". > And I think that's really *really* dangerous. Yes, I see your argument. I just don't think it's all that big a deal, because the information is so easily derived from data that _is_ the same for everybody (and when you _do_ want it to be different locally, because you are grafting, that is easy to do). But I think at this point we have both said all there is to say. There is no actual data to be brought forth in this argument, and we obviously disagree on this point. So I think we may have to agree to disagree. And as I said before, I am willing to concede generation numbers in the commit header. But we need the rest of the solution, too. > So I really don't see why you harp on that. If the generation counters > are in the objects THEY BY DEFINITION CANNOT BE INCONSISTENT. > > That's a big issue. > > Sure, they may be LYING, but that's a different thing entirely. 
They > will be lying to everybody consistently. There would never be any > question about what the generation number of a commit is. > > See what I'm trying to say? There's no way that they would cause > different behavior for different people. Everything is 100% > consistent. Read my email again. I am clearly talking about inconsistency between two data items in the sha1-checked DAG itself. You then proceed to yell at me that they are not inconsistent, now talking about inconsistency between different people with the same DAG, but different caches. In other words, you are talking about an entirely different type of inconsistency. And then you proceed to say that the generation numbers may be lying, which is _exactly_ what I meant when I said inconsistency. I don't mind arguing with you, even if I think you use capital letters too frequently; but when you do use them, please take care that I really am being a bonehead, and it is not you misrepresenting what I said. As to lying (aka inconsistency between items within the DAG), you say: > And btw, having "incorrect" data in the git objects is not the end of > the world. You can generate merge commits that simply have the wrong > parents. That will be confusing as hell to the user, and it will make > future merges not work very well, but it's a bug in the archive, and > that's "ok". The developers may not be very happy about it. In fact, > afaik we've had a few cases like that in the kernel tree, because > early git had bugs where it would not properly forget parents after a > failed merge. Most of them are ARM-related, because the ARM tree was > one of the first users of git (outside of me, but I had fewer issues > with what happens when things go wrong). No, it's not the end of the world. I just think it's worse than the possibility of inconsistency between two users' idea of the graph, because the bug stays with you for all of history, instead of getting fixed with a new version of git. 
> That said, I'm not 100% sure at all that we want generation numbers at > all. Their use is pretty limited. If we had had them from the > beginning, I think we would simply have replaced the date-based commit > list sorting with a generation-number-based one, and it should have > been possible to guarantee that we never output a parent before the > commit in rev-parse. > > As it is, I have to admit that looking at it, I shudder at changing > the current date-based logic and replacing it with a "date or > generation number". > > The date-based one, despite all its fuzziness and not being very well > defined ("Global clock in a distributed system? You're a moron") and > up being a *nice* heuristic for certain human interaction. So it's not > a wonderful solution from a technical standpoint, but it does have (I > think) some nice UI advantages. That is the conclusion I am coming to, also. I don't find the external cache as odious as you obviously do. But that was why I posted the patches with an RFC tag. I wanted to see how painful people found the concept. But if it's too ugly a concept, I think the path of least resistance is just making timestamps suck less (by using more consistent and robust skew avoidance[1] in our various algorithms, and by perhaps taking more care to notify the user of skew early, before commits are published). And then we don't really need generation numbers anymore. As elegant as they might have been if they were there from day one, it's just not worth the hassle of maintaining the dual solution. [1] We use "N slop commits" in some places and "allow 86400 seconds of skew" in other places. We should probably use both, and apply them consistently. > > Those are serious questions that I think should be considered if we are > > going to put a generation header into the commit object, and I haven't > > seen answers for them yet. > > I do agree that the really *big* question is "do we even need it at > all". 
> I do like perhaps just tightening the commit timestamp rules.
> Because I do think they would probably work very well for the
> "contains" problem too.
>
> With the exact same fuzzy downsides, of course. Timestamps aren't
> perfect, and they need that annoying fuzz factor thing.

Yeah. But in practice, that fuzz is really easy to implement, has worked
pretty well so far, and doesn't actually hurt performance measurably,
because skew is rare, and a constant, small timestamp fuzz tends to
equate to a constant, small number of commits.

> > Sure, I would be fine with that. When you say "packfile", do you mean
> > the general concept, as in it could go in the pack index as opposed
> > to the packfile itself? Or specifically in the packfile? The latter
> > seems a lot more problematic to me in terms of implementation.
>
> I was thinking the "general" issue - it might make most sense to put
> them in the index.

If we were to go the cache route, I think I am leaning that way, too, if
only because we don't duplicate the 20-byte sha1 per commit, which keeps
our I/O down.

> > OK, so let's say we add generation headers to each commit. What happens
> > next? Are we going to convert algorithms that use timestamps to use
> > commit generations? How are we going to handle performance issues when
> > dealing with older parts of history that don't have generations?
>
> So I do think the _initial_ question needs to be the other way around:
> do we have to have generation numbers at all?

No, we don't need them. My "contains" patches were already implemented
using timestamps, and they're pretty fast. They fall down only in the
face of lying timestamps (i.e., skew). The whole reason to switch to
generation headers was that we could assume they would be correct, and
our algorithms using them would be more likely to be correct.

And I do think a generation header would be more likely to be correct
than a timestamp, if only because timestamps are harder to get right.
> I think it's likely a design misfeature not to have them, but > considering that we don't, and have been able to make do without for > so long, I'm also perfectly willing to believe that we could speed up > "contains" dramatically with the same kind of (crazy and inexact) > tricks we use for merge bases. Already done. I can point you to the patches if you want. > For example, for the "git tag --contains" thing, what's the > performance effect of just skipping tags that are much older than the > commit we ask for? It's as fast as using generations. See these two patches: http://article.gmane.org/gmane.comp.version-control.git/150261 http://article.gmane.org/gmane.comp.version-control.git/150262 -Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-15 21:17 ` Linus Torvalds
2011-07-15 21:54 ` Jeff King
@ 2011-07-15 23:10 ` Linus Torvalds
2011-07-15 23:16 ` Linus Torvalds
2011-07-16 0:40 ` Jeff King
1 sibling, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 23:10 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

On Fri, Jul 15, 2011 at 2:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> For example, for the "git tag --contains" thing, what's the
> performance effect of just skipping tags that are much older than the
> commit we ask for?

Hmm.

Maybe there is something seriously wrong with this trivial patch, but
it gave the right results for the test-cases I threw at it, and passes
the tests.

Before:

    [torvalds@i5 linux]$ time git tag --contains v2.6.24 > correct

    real    0m7.548s
    user    0m7.344s
    sys     0m0.116s

After:

    [torvalds@i5 linux]$ time ~/git/git tag --contains v2.6.24 > date-cut-off

    real    0m0.161s
    user    0m0.140s
    sys     0m0.016s

and 'correct' and 'date-cut-off' both give the same answer.

The date-based "slop" thing is (at least *meant* to be - note the lack
of any extensive testing) "at least five consecutive commits that have
dates that are more than five days off".

Somebody should double-check my logic. Maybe I'm doing something
stupid. Because that's a *big* difference.
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1614 bytes --]

 commit.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/commit.c b/commit.c
index ac337c7d7dc1..0d33c33a6520 100644
--- a/commit.c
+++ b/commit.c
@@ -737,16 +737,56 @@ struct commit_list *get_merge_bases(struct commit *one, struct commit *two,
 	return get_merge_bases_many(one, 1, &two, cleanup);
 }
 
+#define VISITED (1 << 16)
+
+static int is_recursive_descendant(struct commit *commit, struct commit *target)
+{
+	int slop = 5;
+	parse_commit(target);
+	for (;;) {
+		struct commit_list *parents;
+		if (commit == target)
+			return 1;
+		if (commit->object.flags & VISITED)
+			return 0;
+		commit->object.flags |= VISITED;
+		parse_commit(commit);
+		if (commit->date + 5*24*60*60 < target->date) {
+			if (--slop <= 0)
+				return 0;
+		} else
+			slop = 5;
+		parents = commit->parents;
+		if (!parents)
+			return 0;
+		commit = parents->item;
+		parents = parents->next;
+		while (parents) {
+			if (is_recursive_descendant(parents->item, target))
+				return 1;
+			parents = parents->next;
+		}
+	}
+}
+
+static int is_descendant(struct commit *commit, struct commit *target)
+{
+	int ret = is_recursive_descendant(commit, target);
+	clear_commit_marks(commit, VISITED);
+	return ret;
+}
+
 int is_descendant_of(struct commit *commit, struct commit_list *with_commit)
 {
 	if (!with_commit)
 		return 1;
+
 	while (with_commit) {
 		struct commit *other;
 
 		other = with_commit->item;
 		with_commit = with_commit->next;
-		if (in_merge_bases(other, &commit, 1))
+		if (is_descendant(commit, other))
 			return 1;
 	}
 	return 0;

^ permalink raw reply related [flat|nested] 89+ messages in thread
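To make the date cut-off in the patch above easier to experiment with, here is a minimal, self-contained sketch of the same walk on a toy commit graph. It is not git code: `struct toy_commit`, `toy_is_descendant()` and `toy_clear_marks()` are hypothetical names, parents live in a fixed-size array rather than a `commit_list`, and the constants simply mirror the patch (slop of 5 commits, cut-off of five days).

```c
#include <stddef.h>

#define TOY_MAX_PARENTS 2
#define TOY_SLOP 5                      /* consecutive "too old" commits tolerated */
#define TOY_CUTOFF (5L * 24 * 60 * 60)  /* five days, in seconds */

/* Hypothetical stand-in for git's struct commit. */
struct toy_commit {
	long date;                      /* commit timestamp, in seconds */
	int visited;                    /* traversal mark, like the VISITED flag */
	int nr_parents;
	struct toy_commit *parents[TOY_MAX_PARENTS];
};

/*
 * Return 1 if target is reachable from commit. Return 0 if it is not
 * found, either because history ran out or because TOY_SLOP commits in
 * a row were more than TOY_CUTOFF older than target.
 */
static int toy_is_descendant(struct toy_commit *commit,
			     const struct toy_commit *target)
{
	int slop = TOY_SLOP;

	for (;;) {
		int i;

		if (commit == target)
			return 1;
		if (commit->visited)
			return 0;
		commit->visited = 1;

		if (commit->date + TOY_CUTOFF < target->date) {
			if (--slop <= 0)
				return 0;       /* give up: too far in the past */
		} else {
			slop = TOY_SLOP;        /* a recent commit resets the slop */
		}

		if (commit->nr_parents == 0)
			return 0;
		/* Recurse into side parents, then loop on the first parent. */
		for (i = 1; i < commit->nr_parents; i++)
			if (toy_is_descendant(commit->parents[i], target))
				return 1;
		commit = commit->parents[0];
	}
}

/* Reset the marks so the next query starts from a clean graph. */
static void toy_clear_marks(struct toy_commit *commits, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		commits[i].visited = 0;
}
```

On a linear toy history whose commits are ten days apart, asking about a target that is newer than the tip stops the walk after five commits instead of running to the root, which is the effect the timings above are measuring. Like the patch, the sketch can return a wrong "no" when timestamps are skewed by more than the slop allows.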
* Re: Git commit generation numbers
2011-07-15 23:10 ` Linus Torvalds
@ 2011-07-15 23:16 ` Linus Torvalds
2011-07-15 23:36 ` Linus Torvalds
2011-07-16 0:40 ` Jeff King
1 sibling, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 23:16 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Fri, Jul 15, 2011 at 4:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Maybe there is something seriously wrong with this trivial patch, but
> it gave the right results for the test-cases I threw at it, and passes
> the tests.
>
> Before:

I have fewer branches than tags, but I get something similar for "git
branch --contains":

    [torvalds@i5 linux]$ time git branch --contains v2.6.12 | sha1sum
    9d4224eec98ec7b0bcd5331dfa5badb9ef1fd510  -

    real    0m4.205s
    user    0m4.112s
    sys     0m0.084s

    [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.12 | sha1sum
    9d4224eec98ec7b0bcd5331dfa5badb9ef1fd510  -

    real    0m0.112s
    user    0m0.100s
    sys     0m0.008s

ie identical results, except one took 4.2s and with the patch it took 0.1s.

This is all hot-cache, of course, and on a fast machine.

Linus

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-15 23:16 ` Linus Torvalds
@ 2011-07-15 23:36 ` Linus Torvalds
2011-07-16 0:42 ` Jeff King
0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-15 23:36 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

And one last comment:

On Fri, Jul 15, 2011 at 4:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I have fewer branches than tags, but I get something similar for "git
> branch --contains":

The time-based heuristic does seem to be important. If I just remove
it, I get increasingly long times for things that aren't contained in
my branches.

And in fact, I think that is why the code used the merge-base helper
functions - not because it wanted merge bases, but because the merge
base stuff will work from either end until it decides things aren't
relevant any more. Because *without* the time-based heuristics, the
trivial "is this a descendant" algorithm ends up working very badly
for the case where the target doesn't exist in the branches. Examples
of NOT having a date-based cut-off, but just doing the straightforward
(non-merge-base) ancestry walk:

    time ~/git/git branch --contains v2.6.12
    real    0m0.113s

    [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.39
    real    0m3.691s

and what ends up happening is that in the latter case, every branch
walks all the way to the root and checks every commit (walking all the
merges too). While in the first case, it's very quick because it will
find that particular commit when it walks straight backwards (so it
doesn't even have to do a lot of recursion - the first branch that
hits that commit will be a success), so it won't have to look at all
the side ways of getting there.

Of course, the above particular difference happens to be due to the
"depth-first" implementation working well for the thing I am searching
for. But it does show that the date-based cut-off matters due to
traversal issues like that.
Linus ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 23:36 ` Linus Torvalds @ 2011-07-16 0:42 ` Jeff King 0 siblings, 0 replies; 89+ messages in thread From: Jeff King @ 2011-07-16 0:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 04:36:40PM -0700, Linus Torvalds wrote: > On Fri, Jul 15, 2011 at 4:16 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I have fewer branches than tags, but I get something similar for "git > > branch --contains": > > The time-based heuristic does seem to be important. If I just remove > it, I get increasingly long times for things that aren't contained in > my branches. > > And in fact, I think that is why the code used the merge-base helper > functions - not because it wanted merge bases, but because the merge > base stuff will work from either end until it decides things aren't > relevant any more. Because *without* the time-based heuristics, the > trivial "is this a descendant" algorithm ends up working very badly > for the case where the target doesn't exist in the branches. Examples > of NOT having a date-based cut-off, but just doing the straightforward > (non-merge-base) ancestry walk: > > time ~/git/git branch --contains v2.6.12 > real 0m0.113s > > [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.39 > real 0m3.691s Yes, exactly. That is why my first patch (which goes to a recursive search), takes about the same amount of time as "git rev-list --all" (and I suspect your 3.691s above is similar). And then the second one drops that again to .03s. I think you are simply recreating the strategy and timings I have posted several times now. -Peff ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers 2011-07-15 23:10 ` Linus Torvalds 2011-07-15 23:16 ` Linus Torvalds @ 2011-07-16 0:40 ` Jeff King 1 sibling, 0 replies; 89+ messages in thread From: Jeff King @ 2011-07-16 0:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Fri, Jul 15, 2011 at 04:10:23PM -0700, Linus Torvalds wrote: > On Fri, Jul 15, 2011 at 2:17 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > For example, for the "git tag --contains" thing, what's the > > performance effect of just skipping tags that are much older than the > > commit we ask for? > > Hmm. > > Maybe there is something seriously wrong with this trivial patch, but > it gave the right results for the test-cases I threw at it, and passes > the tests. > > Before: > > [torvalds@i5 linux]$ time git tag --contains v2.6.24 > correct > > real 0m7.548s > user 0m7.344s > sys 0m0.116s > > After: > > [torvalds@i5 linux]$ time ~/git/git tag --contains v2.6.24 > date-cut-off > > real 0m0.161s > user 0m0.140s > sys 0m0.016s > > and 'correct' and 'date-cut-off' both give the same answer. Without even looking carefully at your patches for any minor mistakes, I can tell you that the speedup you're seeing is approximately right. Because it's almost exactly the same optimization I made in my timestamp-based patches (links to which I sent you earlier today). However, you can make it even faster. The "tag --contains" code will ask "is_descendant_of" repeatedly for the same set of "want" commits. So you end up traversing some parts of the graph over and over. My patches share the marks over a set of contains traversals, so you only ever touch each commit once. And that's what my patches do. 
With yours, on my box:

    $ time git tag --contains HEAD~1000 >/dev/null
    real    0m0.113s
    user    0m0.104s
    sys     0m0.008s

and mine:

    $ time git tag --contains HEAD~1000 >/dev/null
    real    0m0.035s
    user    0m0.020s
    sys     0m0.012s

I suspect you can make the difference even more prominent by having more
tags, or by having multiple "want" commits.

-Peff

^ permalink raw reply [flat|nested] 89+ messages in thread
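The mark sharing described in this message can be sketched as follows. This is not the code from Peff's patches; it is a toy illustration under stated assumptions: a single "want" commit, a tri-state memo stored on each commit, and hypothetical names (`struct memo_commit`, `memo_contains()`, and the `examined` counter, which exists only to make the saving visible). It also leaves out the timestamp cut-off entirely.

```c
/* Per-commit answer to "can this commit reach the wanted commit?" */
enum memo_state { MEMO_UNKNOWN = 0, MEMO_NO, MEMO_YES };

/* Hypothetical stand-in for git's struct commit. */
struct memo_commit {
	enum memo_state state;          /* kept across queries, not cleared */
	int nr_parents;
	struct memo_commit *parents[2];
};

/* Number of commits actually examined, for illustration only. */
static int examined;

/*
 * Return 1 if want is reachable from c. The answer is cached on every
 * commit touched, so later queries from other tags stop as soon as they
 * meet a commit that an earlier query already classified. The cache is
 * only valid for one want; all states must be reset before asking about
 * a different commit.
 */
static int memo_contains(struct memo_commit *c, const struct memo_commit *want)
{
	int i;

	if (c->state != MEMO_UNKNOWN)
		return c->state == MEMO_YES;
	examined++;

	if (c == want) {
		c->state = MEMO_YES;
		return 1;
	}
	for (i = 0; i < c->nr_parents; i++) {
		if (memo_contains(c->parents[i], want)) {
			c->state = MEMO_YES;
			return 1;
		}
	}
	c->state = MEMO_NO;
	return 0;
}
```

Calling `memo_contains()` once per tag with the same `want` touches each commit at most once in total, instead of once per tag, which is where the extra speedup over clearing the marks after every query would come from. The recursion is fine for a toy graph; a deep real history would need an explicit stack.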
* Re: Git commit generation numbers 2011-07-15 1:19 ` Linus Torvalds 2011-07-15 2:41 ` Geert Bosch 2011-07-15 7:46 ` Jeff King @ 2011-07-15 9:12 ` Jakub Narebski 2011-07-15 9:17 ` Long, Martin 2 siblings, 1 reply; 89+ messages in thread From: Jakub Narebski @ 2011-07-15 9:12 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff King, Git Mailing List, Junio C Hamano, Jakub Narebski Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <peff@peff.net> wrote: > > > > However, I'm not 100% convinced leaving generation numbers out was a > > mistake. The git philosophy seems always to have been to keep the > > minimal required information in the DAG. > > Yes. > > And until I saw the patches trying to add generation numbers, I didn't > really try to push adding generation numbers to commits (although it > actually came up as early as July 2005, so the "let's use generation > numbers in commits" thing is *really* old). > > In other words, I do agree that we should strive for minimal required > information. > > But dammit, if you start using generation numbers, then they *are* > required information. The fact that you then hide them in some > unarchitected random file doesn't change anything! It just makes it > ugly and random, for chrissake! > > I really don't understand your logic that says that the cache is > somehow cleaner. It's a random hack! It's saying "we don't have it in > the main data structure, so let's add it to some other one instead, > and now we have a consistency and cache generation problem instead". You store redundant information, one that is used to speed up calculations, in a cache. [...] > > Generation numbers are _completely_ redundant with the actual structure > > of history represented by the parent pointers. 
What is more important, the perceived structure of history can change
by three mechanisms:

 * grafts
 * replace objects
 * shallow clone

I can understand that you don't want to worry about grafts - they are a
terrible hack. We can simply turn off using generation numbers stored in
commits if grafts are present.

The problem with shallow clones arises only at the beginning, when some
of the commits in the shallow repository do not have generation numbers.
You cannot simply calculate the generation number for a new commit in
such a case.

But what about REPLACE OBJECTS? If one, for example, uses "git replace"
on a root commit to join a contemporary repository with a historical
repository... this is not addressed in your emails.

And let's not forget the fact that we need a cache for old commits which
don't yet have a generation number in the commit.

BTW, you are not being fair in comparing the size of the code. First,
some of Peff's code is about _using_ generation numbers, which will be
needed regardless of whether generation numbers are stored in a cache or
packfile index, or whether they are embedded in commit objects. Second,
with a generation number commit header you need to write fsck code, and
have to consider the size of this yet-to-be-written code.

[...]
> > I liken it somewhat to the "don't store renames" debate.
>
> That's total and utter bullshit.

I think Peff meant here that if you make mistakes in calculating rename
info or generation number, and have incorrect information stored in the
commit object, you are f**ked.

> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.
>
> Just look at "git blame -C" to remind yourself why rename information is wrong.
Also, doing full code movement and copying detection (that is what "git
blame -C" does) rather than simplistic whole-file rename detection is
pretty much impossible at commit time.

NB: most SCMs that use path-id based rename tracking require that the
user explicitly marks renames using "scm move" or "scm rename" (well,
Mercurial has a tool for rename detection before commit, "hg
addremove"). But asking the user to mark code movements is simply
infeasible.

> But even more importantly, look at git merges. Look at how git has
> gotten merging right since pretty much day #1, and has absolutely no
> issues with files that got generated two different ways. Look at every
> SCM that tries to do rename detection, and look at how THEY CANNOT DO
> MERGES RIGHT.
>
> It's that simple. Rename detection is not about avoiding "redundant
> data". It's about doing the right thing.

Well, rename tracking supporters say that heuristic rename detection can
be wrong.

By the way, what happened to the "wholesale directory rename detection"
patches? Without them, in the situation where one side renamed a
directory and the other side created a new file in that directory, git
on merge creates the file under the re-created old name of the
directory...

--
Jakub Narebski
Poland

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-15 9:12 ` Jakub Narebski
@ 2011-07-15 9:17 ` Long, Martin
2011-07-15 15:33 ` Long, Martin
0 siblings, 1 reply; 89+ messages in thread
From: Long, Martin @ 2011-07-15 9:17 UTC (permalink / raw)
To: git

I strongly agree with Linus that the cache should not form part of the
solution to this problem, but could maybe be a later add-on which
improves performance.

There is a possible improvement which may remove the need for the
cache. It doesn't solve the issue of broken numbers, but I think the
key to that is just to ensure the traversal algorithm is deterministic,
stable, and immutable.

Firstly, I presume the generation number would not form part of the
SHA1 calculation? No? Cool.

When calculating a generation number by doing a traversal, would it not
be possible to update some, or all, of the commit objects touched with
their generation numbers? Again, this would be expensive, but the gains
would possibly come even quicker than with Linus's original proposal to
just add numbers to new commits.

A compromise might be to only update some commits - notably those with
2 or more parents, so that both parents don't need to be traversed, and
possibly every nth commit (to give regular checkpoints that can be
utilised when traversing a branch). I would suggest commits with 2
children for the latter, but with my limited knowledge of the
implementation, I understand that those are more difficult to find.

Obviously, these numbers would only be pegged locally, and wouldn't be
synced on push, as the commits already exist on the far end. However,
it could be possible to run a process on a bare repo to shoot through
and peg commits, so that at least new clones will be "well pegged".

Martin Long
UK

^ permalink raw reply [flat|nested] 89+ messages in thread
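For readers who want to see what "calculating a generation number by doing a traversal" and pegging the result could look like, here is a small sketch. The thread does not spell out a definition, so this assumes the common one: a root commit has generation 1, and every other commit has one more than the largest generation among its parents. `struct gen_commit`, `toy_generation()` and `toy_cannot_reach()` are hypothetical names, and the `generation` field plays the role of a local cache, not of a header inside the hashed commit object.

```c
/* Hypothetical stand-in for git's struct commit. */
struct gen_commit {
	unsigned long generation;       /* 0 means "not computed yet" */
	int nr_parents;
	struct gen_commit *parents[2];
};

/*
 * Compute the generation number of c, pegging the result on every
 * commit visited so that no commit is computed twice.
 * Assumed definition: roots are 1, others are 1 + max over parents.
 */
static unsigned long toy_generation(struct gen_commit *c)
{
	unsigned long max = 0;
	int i;

	if (c->generation)
		return c->generation;
	for (i = 0; i < c->nr_parents; i++) {
		unsigned long g = toy_generation(c->parents[i]);

		if (g > max)
			max = g;
	}
	c->generation = max + 1;
	return c->generation;
}

/*
 * An ancestor always has a strictly smaller generation than its
 * descendants, so "from" cannot reach "target" when its generation is
 * not larger. A return of 0 means "maybe": a real walk is still needed.
 */
static int toy_cannot_reach(struct gen_commit *from, struct gen_commit *target)
{
	return from != target &&
	       toy_generation(from) <= toy_generation(target);
}
```

With a root `r`, children `a` and `b`, a commit `c` on top of `b`, and a merge `m` of `a` and `c`, the generations under this definition come out as 1, 2, 2, 3 and 4, and `toy_cannot_reach(&a, &c)` answers without walking any history. The recursion would need to become iterative for a history as deep as the kernel's.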
* Re: Git commit generation numbers 2011-07-15 9:17 ` Long, Martin @ 2011-07-15 15:33 ` Long, Martin 2011-07-15 16:15 ` Drew Northup 0 siblings, 1 reply; 89+ messages in thread From: Long, Martin @ 2011-07-15 15:33 UTC (permalink / raw) To: git > > Firstly, I presume the generation number would not form part of the > SHA1 calculation? No? Cool. I suspect this may be where my suggestion falls down. Though I suspect there is a case for object metadata which doesn't form part of the SHA. Would generation number tampering be a concern? Caching offers the ability to store that metadata, to provide the same performance gain, but maintain the integrity of the SHA chain. However, it does still leave the generation number liable to tampering, meaning a generic non-SHA metadata solution might be better. TBH, there are few situations where historical generations are useful - finding gen numbers of tags is one of them. Most cases are going to be for new commits, and in that case, a few new commits at the tip of each branch will very quickly reduce the number of traversals. What use case would really create enough traversals that it should be a performance concern? ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-15 15:33 ` Long, Martin
@ 2011-07-15 16:15 ` Drew Northup
0 siblings, 0 replies; 89+ messages in thread
From: Drew Northup @ 2011-07-15 16:15 UTC (permalink / raw)
To: Long, Martin
Cc: git, Jeff King, Junio C Hamano, Jakub Narebski, Linus Torvalds, Geert Bosch

On Fri, 2011-07-15 at 16:33 +0100, Long, Martin wrote:
> >
> > Firstly, I presume the generation number would not form part of the
> > SHA1 calculation? No? Cool.
>
> I suspect this may be where my suggestion falls down. Though I suspect
> there is a case for object metadata which doesn't form part of the
> SHA. Would generation number tampering be a concern?

If you take Jeff's perspective on the purpose of generation numbers
(representing metadata about the DAG in a more readily-available
format) then "tampering" is not really a concern as the metadata is
merely local (to the running instance of Git) ephemera that we can
cache between runs for the sake of efficiency. Linus' perspective on
generation numbers seems to be of a more hard and fast type of data.

So, are we really talking about [corpus] generation numbers (used to
describe the state of the DAG in the way one describes his known
family tree) or are we talking about _revision_numbers_ (used to
describe the commit, as Subversion does)? I think we've got two (or
more) groups talking about different things (and aims) and trying to
use the same words to do so.

> Caching offers the ability to store that metadata, to provide the same
> performance gain, but maintain the integrity of the SHA chain.
> However, it does still leave the generation number liable to
> tampering, meaning a generic non-SHA metadata solution might be
> better.

I'm not sure where you are going with this. I wouldn't think
"tampering" with _current_DAG-based ephemera would do much other than
create a performance hit. If you are really talking about a static
_revision_number_ then that belongs in the commit, where it cannot be
changed (and may be completely meaningless when taken out of context,
as SVN revision numbers are). What such a number may entail is
probably up for discussion, but perhaps in a different thread.

> TBH, there are few situations where historical generations are useful
> - finding gen numbers of tags is one of them. Most cases are going to
> be for new commits, and in that case, a few new commits at the tip of
> each branch will very quickly reduce the number of traversals. What
> use case would really create enough traversals that it should be a
> performance concern?

The answer to this is found in a previous thread
http://article.gmane.org/gmane.comp.version-control.git/176807
(remember, generation number vs. revision number...)

Also, please don't cull the CC list! (Added Geert Bosch)

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-14 18:37 ` Jeff King
2011-07-14 18:47 ` Linus Torvalds
@ 2011-07-14 18:52 ` Linus Torvalds
2011-07-14 19:08 ` Jakub Narebski
2011-07-14 20:26 ` Junio C Hamano
2 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2011-07-14 18:52 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Junio C Hamano

On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote:
>
> There's also one other issue with generation numbers. How do you handle
> grafts and object-replacement refs? If you graft history, your embedded
> generation numbers will all be junk, and you can't trust them.

So I don't think this is a real problem in practice.

Grafts are already unreliable. You cannot sanely merge over a graft,
and it has nothing to do with generation numbers.

I'm actually sorry that we ever did grafting. It's fundamentally
broken, and can actually destroy your repository (by hiding real
parents and then causing the commits to get garbage collected). So I
don't think grafting should be used as an argument for or against
anything - it's a hack that breaks some fundamental git database
constraints.

Linus

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-14 18:52 ` Linus Torvalds
@ 2011-07-14 19:08 ` Jakub Narebski
0 siblings, 0 replies; 89+ messages in thread
From: Jakub Narebski @ 2011-07-14 19:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff King, Git Mailing List, Junio C Hamano, Jakub Narebski

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@peff.net> wrote:
> >
> > There's also one other issue with generation numbers. How do you handle
> > grafts and object-replacement refs? If you graft history, your embedded
> > generation numbers will all be junk, and you can't trust them.
>
> So I don't think this is a real problem in practice.
>
> Grafts are already unreliable. You cannot sanely merge over a graft,
> and it has nothing to do with generation numbers.
>
> I'm actually sorry that we ever did grafting. It's fundamentally
> broken, and can actually destroy your repository (by hiding real
> parents and then causing the commits to get garbage collected). So I
> don't think grafting should be used as an argument for or against
> anything - it's a hack that breaks some fundamental git database
> constraints.

What about object-replacement refs (i.e. "git replace" and
refs/replace/)? This is the modern replacement for the grafts
mechanism, which is safe against garbage collecting, and contrary to
grafts it is transferable (as a ref).

With replacement objects (e.g. to repair some fragment of history to
make it bisectable - I think that was the original idea behind
introducing git-replace, or instead of grafts to join with a
historical repository - IIRC the reason why the grafts mechanism was
created) you can also have invalid generation numbers if they are
stored in commit headers.

With a generation cache we can simply invalidate it if grafts or
replacements change...

P.S. grafts are quite useful when doing history surgery. Create
grafts, check history, use git-filter-branch to make the new DAG
permanent, remove grafts.

P.P.S. What about "grafts lite", i.e. shallow clone? With a generation
cache we can invalidate it when depth changes...

--
Jakub Narębski
Poland

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Git commit generation numbers
2011-07-14 18:37 ` Jeff King
2011-07-14 18:47 ` Linus Torvalds
2011-07-14 18:52 ` Linus Torvalds
@ 2011-07-14 20:26 ` Junio C Hamano
2011-07-14 20:41 ` Jeff King
2 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2011-07-14 20:26 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Git Mailing List

Jeff King <peff@peff.net> writes:

> There's also one other issue with generation numbers. How do you handle
> grafts and object-replacement refs? If you graft history, your embedded
> generation numbers will all be junk, and you can't trust them.

By the way, I doubt your "invalidate and recompute generation cache when
replacement changes" would really work when we consider object transfer
(which is the whole point of deprecating graft with object replacement
mechanism). For the purpose of connectivity check during object transfer,
we deliberately _ignore_ the object replacements, so you would at least
want to have an ability to show the generation number according to the
"true" history recorded in commits (which can come from Linus's in-commit
generation number once everybody migrates) and the generation number that
takes grafts and replacements into account (for which we cannot depend on
in-commit record).

^ permalink raw reply [flat|nested] 89+ messages in thread
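To make the "two generation numbers" point above concrete, here is a small Python sketch under stated assumptions. It is not how git stores or reads objects: `parents`, `replacements`, `make_view` and `generations` are hypothetical names, a replacement is modelled simply as "read this other commit's parents instead", and roots are assumed to have generation 0.

```python
def generations(tips, parents_of):
    """Number every commit reachable from `tips` using the given parent view.

    parents_of: callable returning the parent ids of a commit id.
    Assumed convention: roots get 0, others get 1 + max(parent generations).
    """
    gen = {}
    stack = list(tips)
    while stack:
        current = stack[-1]
        if current in gen:
            stack.pop()
            continue
        todo = [p for p in parents_of(current) if p not in gen]
        if todo:
            stack.extend(todo)
        else:
            gen[current] = 1 + max((gen[p] for p in parents_of(current)), default=-1)
            stack.pop()
    return gen


def make_view(parents, replacements, honor_replacements):
    """Build a parent lookup that either honors or ignores replacements."""
    def parents_of(commit):
        if honor_replacements:
            # Hypothetical model of a replace ref: read the substitute instead.
            commit = replacements.get(commit, commit)
        return parents[commit]
    return parents_of


# "a" is a root in the recorded history; the replacement "a2" joins it
# onto an older history x <- y.
parents = {"x": [], "y": ["x"], "a": [], "a2": ["y"], "b": ["a"], "c": ["b"]}
replacements = {"a": "a2"}

true_gen = generations(["c"], make_view(parents, replacements, False))
replaced_gen = generations(["c"], make_view(parents, replacements, True))
```

In this toy history the tip `c` has generation 2 in the recorded history and 4 once the replacement is honored, so a number stored in the commit itself could only ever be right for the first view.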
* Re: Git commit generation numbers
2011-07-14 20:26 ` Junio C Hamano
@ 2011-07-14 20:41 ` Jeff King
2011-07-14 21:30 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2011-07-14 20:41 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List

On Thu, Jul 14, 2011 at 01:26:32PM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > There's also one other issue with generation numbers. How do you handle
> > grafts and object-replacement refs? If you graft history, your embedded
> > generation numbers will all be junk, and you can't trust them.
>
> By the way, I doubt your "invalidate and recompute generation cache when
> replacement changes" would really work when we consider object transfer
> (which is the whole point of deprecating graft with object replacement
> mechanism). For the purpose of connectivity check during object transfer,
> we deliberately _ignore_ the object replacements, so you would at least
> want to have an ability to show the generation number according to the
> "true" history recorded in commits (which can come from Linus's in-commit
> generation number once everybody migrates) and the generation number that
> takes grafts and replacements into account (for which we cannot depend on
> in-commit record).

It should actually work in that scenario, at least with replace refs,
but the performance is suboptimal. The copy of git doing the object
transfer will turn off read_replace_refs, our validity token will not
match, we will see that our cache is no longer valid, and regenerate
it. Another run with replace-refs turned on will do the same thing in
reverse. Even two programs running simultaneously will still be
correct, because the cache is replaced atomically.

However, there are two issues:

  1. I don't think grafts have a "respect grafts" flag in the same way;
     I haven't looked at how the packing code decides not to respect
     them, but the "stir graft info into the checksum" data should use
     the same check.

  2. If you do a lot of object transfer, you will ping-pong back and
     forth between cache versions, which is inefficient. It would
     probably be better to store the cache that is valid under
     condition $SHA1 as:

       .git/cache/generations/$SHA1

     In most cases, you would have a single file (i.e., you are not
     using replace refs at all). But if you did, then you keep two
     separate caches, one for the view from replace-refs, and one for
     the standard view.

If we ignore replace refs and grafts, as Linus suggested, and always
store the true generation number, then we could generate it at pack
time (and even put it in the pack index if we want to deal with a
version bump there).

-Peff

^ permalink raw reply [flat|nested] 89+ messages in thread
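The per-view cache layout proposed in the message above could be sketched roughly as follows. This is an illustration in Python, not git's implementation: only the `cache/generations/$SHA1` path shape comes from the message, while the way the validity token is derived (hashing graft lines and replace refs) and the names `validity_token`, `cache_path` and `write_cache_atomically` are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def validity_token(grafts, replace_refs, honor_replacements=True):
    """Hash everything that changes the apparent shape of history.

    grafts:       iterable of graft lines ("commit parent1 parent2 ...").
    replace_refs: dict of ref name -> replacement object id.
    When replacements are ignored (e.g. during object transfer) nothing is
    stirred in, so that view shares a token with an unmodified repository.
    """
    h = hashlib.sha1()
    if honor_replacements:
        for line in sorted(grafts):
            h.update(b"graft " + line.encode() + b"\n")
        for name, target in sorted(replace_refs.items()):
            h.update(b"replace " + name.encode() + b" " + target.encode() + b"\n")
    return h.hexdigest()


def cache_path(git_dir, token):
    """One cache file per view, named after its validity token."""
    return os.path.join(git_dir, "cache", "generations", token)


def write_cache_atomically(path, data):
    """Write to a temporary file, then rename it over the final name."""
    directory = os.path.dirname(path)
    os.makedirs(directory, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # readers see the old file or the new one
    except BaseException:
        os.unlink(tmp)
        raise
```

With this layout a process that ignores replacements and one that honors them read and write different files, so neither invalidates the other's cache.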
* Re: Git commit generation numbers
2011-07-14 20:41 ` Jeff King
@ 2011-07-14 21:30 ` Junio C Hamano
0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2011-07-14 21:30 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Git Mailing List

Jeff King <peff@peff.net> writes:

> It should actually work in that scenario, at least with replace refs,...
> regenerate it. Another run ...

I know; that is what I called "doubt it would really work". Having to
regenerate twice does not count as working.

> However, there are two issues:
>
>   1. I don't think grafts have a "respect grafts" flag in the same way;
>      I haven't looked at how the packing code decides not to respect
>      them, but the "stir graft info into the checksum" data should use
>      the same check.

I do not think grafts and object transfer mesh well at all, so I
wouldn't worry about it.

^ permalink raw reply [flat|nested] 89+ messages in thread