* Change set based shallow clone
@ 2006-09-07 19:52 Jon Smirl
2006-09-07 20:21 ` Jakub Narebski
` (2 more replies)
0 siblings, 3 replies; 101+ messages in thread
From: Jon Smirl @ 2006-09-07 19:52 UTC (permalink / raw)
To: git, linux@horizon.com
Here's a change set based shallow clone scheme I've been thinking
about. Does it have potential?
When the client wants a shallow clone, it starts by telling the server
all of the HEADs and how many change sets down each of those HEADs it
has locally. That's a small amount of data to transmit, and it can be
easily tracked. Let's ignore merged branches for the moment.
The client then asks for at least 10 (or N) change sets for each of
the HEADs present at the server. The server starts from each HEAD and
works backwards until it encounters a change set present on the
client. At that point it can compute efficient deltas to send.
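
A sketch of the proposed exchange may make this concrete. This is an
entirely hypothetical dialogue, not anything the git protocol speaks
today; the ref name and depths are made up:

    client: have refs/heads/master, 1500 change sets deep locally
    client: want refs/heads/master, at least 10 change sets
    server: walks master back from its tip, stopping early if it
            reaches a change set the client reported having
    server: sends those change sets, delta-compressed against the
            newest state the client is known to have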
If you haven't updated for six months, the server is not going to find
anything you have locally when it walks backwards for 10 change sets.
When this situation is encountered, the server needs to generate a
delta just for you, between one of the change sets it knows you have
and one of the 10 change sets you want. This one-off delta lets you
avoid fetching all of the objects back to a common branch ancestor;
it functions as a jump over the intervening space.
In the case of an initial shallow clone the client won't have anything
to delta against. The server will be forced to send a full version of
one of the 10 change sets requested and deltas for the rest.

Getting an initial shallow clone should take about as long as a CVS
checkout.
This scheme does require the server to sometimes generate custom
deltas for the client, but in all the cases I have been working with
everything is IO bound, so it is better to spend some CPU to reduce
the IO needed.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 101+ messages in thread

* Re: Change set based shallow clone
  2006-09-07 19:52 Change set based shallow clone Jon Smirl
@ 2006-09-07 20:21 ` Jakub Narebski
  2006-09-07 20:41   ` Jon Smirl
  2006-09-07 22:07 ` Junio C Hamano
  2006-09-08  1:01 ` linux
  2 siblings, 1 reply; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 20:21 UTC (permalink / raw)
To: git

Jon Smirl wrote:

> If you haven't updated for six months, the server is not going to
> find anything you have locally when it walks backwards for 10 change
> sets. When this situation is encountered, the server needs to
> generate a delta just for you, between one of the change sets it
> knows you have and one of the 10 change sets you want. This one-off
> delta lets you avoid fetching all of the objects back to a common
> branch ancestor; it functions as a jump over the intervening space.

I don't understand. Git is _not_ patchset based (like GNU Arch, or
Mercurial, or CVS). It is snapshot based. So if you want to download
with a "skip", you only need the local part of the downloader to make
appropriate grafts, like below:

    *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)

    *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)

But the part you were talking about is the _easy_ part; the hard part
is merges, including merging a branch which was split off the trunk
before the cut-off point, history rewriting (cf. the 'pu' branch, and
rebases), etc.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
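
For reference, the grafts mechanism Jakub mentions lives in the
plain-text file .git/info/grafts: each line names a commit followed by
the parents git should pretend it has, and a commit listed alone is
treated as parentless. A minimal cauterizing entry, with a made-up
object name:

    $ cat .git/info/grafts
    a1b2c3d4e5f60718293a4b5c6d7e8f9012345678

With that single line in place, git-log stops at that commit, and a
grafts-aware fsck no longer expects the objects behind it to be
present.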
* Re: Change set based shallow clone
  2006-09-07 20:21 ` Jakub Narebski
@ 2006-09-07 20:41   ` Jon Smirl
  2006-09-07 21:33     ` Jeff King
  ` (3 more replies)
  0 siblings, 4 replies; 101+ messages in thread
From: Jon Smirl @ 2006-09-07 20:41 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git

On 9/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
> I don't understand. Git is _not_ patchset based (like GNU Arch, or

I meant change set to refer to a commit plus the trees and blobs that
make it up. These may be present in full or delta form.

> Mercurial, or CVS). It is snapshot based. So if you want to download
> with a "skip", you only need the local part of the downloader to make
> appropriate grafts, like below:
>
>     *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)
>
>     *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)
>
> But the part you were talking about is the _easy_ part; the hard part
> is merges, including merging a branch which was split off the trunk
> before the cut-off point, history rewriting (cf. the 'pu' branch, and
> rebases), etc.

Does an average user do these things? The shallow clone is there to
address the casual user who gags at a five-hour download to get an
initial checkout of Mozilla when they want to make a five-line change
or just browse the source for a few minutes.

I would expect advanced users to have a full tree present.

I was going to have the dangling references from the shallow clone
point to 'not-present' objects. When you try to do any of the more
complex operations you would hit these objects and fault down more of
the tree.

There would also be a command to bring down all of the objects to
fully populate a sparse tree. You could do the shallow clone to begin
with and then do the full tree populate overnight or in the
background.

Maybe the answer is to build a shallow clone tool for casual use, and
then if you try to run anything too complex on it git just tells you
that you have to download the entire tree.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
@ 2006-09-07 21:33     ` Jeff King
  2006-09-07 21:51       ` Jakub Narebski
  2006-09-07 21:37     ` Jakub Narebski
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 101+ messages in thread
From: Jeff King @ 2006-09-07 21:33 UTC (permalink / raw)
To: Jon Smirl; +Cc: Jakub Narebski, git

On Thu, Sep 07, 2006 at 04:41:20PM -0400, Jon Smirl wrote:

> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

Can you list which operations you'd like to do on the shallow clone?
Many operations that look at history are going to fail with a very
short history (e.g., finding a merge base). Some operations can work
with a short history, but the results are probably not useful (seeing
the last 10 commits from git-log is probably not interesting). If you
just want the latest snapshot, the remote git-archive work going on
right now will probably take care of that.

-Peff

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 21:33     ` Jeff King
@ 2006-09-07 21:51       ` Jakub Narebski
  0 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 21:51 UTC (permalink / raw)
To: git

Jeff King wrote:

> Many operations that look at history are going to fail with a very
> short history (e.g., finding a merge base). Some operations can work
> with a short history, but the results are probably not useful (seeing
> the last 10 commits from git-log is probably not interesting). If you
> just want the latest snapshot, the remote git-archive work going on
> right now will probably take care of that.

That is the source of the idea of a _sparse_ clone, which would
include e.g. the top 10 commits from each of the selected branches,
all commits pointed to by branches and tags, and all merge bases and
roots, joined using grafts.

E.g. if the full history looks like this (the | column marks the
cut-off):

                       -*.|..........                    head
                      /   |
      -*--*--*-   -*--*--x--*-|-*--*--*--                head
     /         \ /            |
  *--*--*--*--*--*--*--*--*--*--*--*--*-|-*--*--*--     head
     \...............|..........                         tag
                      |
                      \- [cut-off]

then the sparse clone history would look like that ((---) denotes
parentage, (...) denotes grafts):

                ,......x....|.*--*--*--                  head
                :           |
  *....................*--*.............|.*--*--*--     head
     \...............|..........                         tag
                      |
                      \- [cut-off]

of course assuming that we have no earlier commits (i.e. a sparse
clone, not a sparse fetch).

Of course, that solves only some of the troubles...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
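
A sparse clone's skips could be expressed with the same grafts file,
by giving a kept commit a pretend parent that is really a more distant
ancestor. A sketch, with both object names made up:

    $ cat .git/info/grafts
    0123456789abcdef0123456789abcdef01234567 fedcba9876543210fedcba9876543210fedcba98

git-log would then walk straight from the newer kept segment to the
older one, silently skipping the elided stretch of history.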
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
  2006-09-07 21:33     ` Jeff King
@ 2006-09-07 21:37     ` Jakub Narebski
  2006-09-07 22:14     ` Junio C Hamano
  2006-09-08  8:48     ` Andreas Ericsson
  3 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 21:37 UTC (permalink / raw)
To: git

Jon Smirl wrote:
> On 9/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
>> I don't understand. Git is _not_ patchset based (like GNU Arch, or
>
> I meant change set to refer to a commit plus the trees and blobs that
> make it up. These may be present in full or delta form.
>
>> Mercurial, or CVS). It is snapshot based. So if you want to download
>> with a "skip", you only need the local part of the downloader to
>> make appropriate grafts, like below:
>>
>>     *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)
>>
>>     *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)
>>
>> But the part you were talking about is the _easy_ part; the hard
>> part is merges, including merging a branch which was split off the
>> trunk before the cut-off point, history rewriting (cf. the 'pu'
>> branch, and rebases), etc.
>
> Does an average user do these things? The shallow clone is there to
> address the casual user who gags at a five-hour download to get an
> initial checkout of Mozilla when they want to make a five-line change
> or just browse the source for a few minutes.

The question is not whether the average user does these things, but
whether the average user encounters them. I think that merging
branches into the 'master' branch (trunk) is quite common, and merging
the trunk into branches for checking is also common.

As to checking out the latest revision: with git-archive --remote (or
git-tar-tree --remote as of now) you can download the newest version
as an archive, although without the commit object, I think.

> I would expect advanced users to have a full tree present.
>
> I was going to have the dangling references from the shallow clone
> point to 'not-present' objects. When you try to do any of the more
> complex operations you would hit these objects and fault down more of
> the tree.

Only if you consider git-log to be a complex operation. It would be
better, I think, to cover the gaps using grafts. In your first post
you talk about downloading a diff covering the gap/skip; that would
matter only if git were patchset based. You need to download commits
and referenced objects, _except_ cutting off some parents.

> There would also be a command to bring down all of the objects to
> fully populate a sparse tree. You could do the shallow clone to begin
> with and then do the full tree populate overnight or in the
> background.

And that should be done by marking the "shallow"/"sparse" grafts and
just deleting them (or updating them if needed).

> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

The shallow/sparse/lazy[*1*] clone proposal is one of the recurring
proposals (besides subproject support) on the git mailing list.

Footnotes:
[*1*] Lazy clone, also known as remote alternatives.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
  2006-09-07 21:33     ` Jeff King
  2006-09-07 21:37     ` Jakub Narebski
@ 2006-09-07 22:14     ` Junio C Hamano
  2006-09-07 23:09       ` Jon Smirl
  2006-09-08  8:48     ` Andreas Ericsson
  3 siblings, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2006-09-07 22:14 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

"Jon Smirl" <jonsmirl@gmail.com> writes:

> Does an average user do these things? The shallow clone is there to
> address the casual user who gags at a five-hour download to get an
> initial checkout of Mozilla when they want to make a five-line change
> or just browse the source for a few minutes.
> ...
> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

For that kind of thing, "git-tar-tree --remote" would suffice, I would
imagine. The five-line change can be tracked locally by creating an
initial commit from the tar-tree extract; such a casual user will not
be pushing or asking to pull, but sending in patches to upstream, no?

^ permalink raw reply [flat|nested] 101+ messages in thread
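
A sketch of the workflow Junio suggests; the repository URL is made
up, and the commands assume git of this era (git-init-db rather than
the later git init):

    $ mkdir mozilla && cd mozilla
    $ git-tar-tree --remote=git://example.org/mozilla.git HEAD | tar xf -
    $ git-init-db
    $ git-add .
    $ git-commit -m 'snapshot of upstream HEAD'

From there the casual user hacks on top of that single commit and
mails patches out with git-format-patch instead of pushing or pulling.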
* Re: Change set based shallow clone
  2006-09-07 22:14     ` Junio C Hamano
@ 2006-09-07 23:09       ` Jon Smirl
  2006-09-10 23:20         ` Anand Kumria
  0 siblings, 1 reply; 101+ messages in thread
From: Jon Smirl @ 2006-09-07 23:09 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On 9/7/06, Junio C Hamano <junkio@cox.net> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > Does an average user do these things? The shallow clone is there to
> > address the casual user who gags at a five-hour download to get an
> > initial checkout of Mozilla when they want to make a five-line
> > change or just browse the source for a few minutes.
> > ...
> > Maybe the answer is to build a shallow clone tool for casual use,
> > and then if you try to run anything too complex on it git just
> > tells you that you have to download the entire tree.
>
> For that kind of thing, "git-tar-tree --remote" would suffice, I
> would imagine. The five-line change can be tracked locally by
> creating an initial commit from the tar-tree extract; such a casual
> user will not be pushing or asking to pull, but sending in patches to
> upstream, no?

From my observation the casual user does something like this:

  get a shallow clone
  look at it for a while
  pull once a day to keep it up to date
  decide to make some changes
  start a local branch
  commit changes on local branch
  push these changes to someone else for review
  maybe pull changes on the branch back from the other person
  keep pulling updates from the main repository
  merge these updates to the local branch
  browse around on the recent logs to see who changed what
  finally push the branch to a maintainer
  pull it back down from the main repository after the maintainer puts it in
  abandon work on local branch

Some people can't figure out the branch step and just copy the
repository.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 23:09       ` Jon Smirl
@ 2006-09-10 23:20         ` Anand Kumria
  0 siblings, 0 replies; 101+ messages in thread
From: Anand Kumria @ 2006-09-10 23:20 UTC (permalink / raw)
To: git

On Thu, 07 Sep 2006 19:09:21 -0400, Jon Smirl wrote:
> On 9/7/06, Junio C Hamano <junkio@cox.net> wrote:
>> "Jon Smirl" <jonsmirl@gmail.com> writes:
>>
>> > Does an average user do these things? The shallow clone is there
>> > to address the casual user who gags at a five-hour download to get
>> > an initial checkout of Mozilla when they want to make a five-line
>> > change or just browse the source for a few minutes.
>> > ...
>> > Maybe the answer is to build a shallow clone tool for casual use,
>> > and then if you try to run anything too complex on it git just
>> > tells you that you have to download the entire tree.
>>
>> For that kind of thing, "git-tar-tree --remote" would suffice, I
>> would imagine. The five-line change can be tracked locally by
>> creating an initial commit from the tar-tree extract; such a casual
>> user will not be pushing or asking to pull, but sending in patches
>> to upstream, no?
>
> From my observation the casual user does something like this:
>
>   get a shallow clone

This could basically be something which looks at the remote HEAD and
pulls down a copy of that commit/tree (and associated objects), right?

>   look at it for a while
>   pull once a day to keep it up to date

Same again.

>   decide to make some changes
>   start a local branch
>   commit changes on local branch
>
>   push these changes to someone else for review
>   maybe pull changes on the branch back from the other person

[...]

At what point, if any, do you envisage a casual user pulling down a
full copy of the repository?

Cheers,
Anand

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
  ` (2 preceding siblings ...)
  2006-09-07 22:14     ` Junio C Hamano
@ 2006-09-08  8:48     ` Andreas Ericsson
  3 siblings, 0 replies; 101+ messages in thread
From: Andreas Ericsson @ 2006-09-08 8:48 UTC (permalink / raw)
To: Jon Smirl; +Cc: Jakub Narebski, git

Jon Smirl wrote:
> On 9/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
>> I don't understand. Git is _not_ patchset based (like GNU Arch, or
>
> I meant change set to refer to a commit plus the trees and blobs that
> make it up. These may be present in full or delta form.
>
>> Mercurial, or CVS). It is snapshot based. So if you want to download
>> with a "skip", you only need the local part of the downloader to
>> make appropriate grafts, like below:
>>
>>     *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)
>>
>>     *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)
>>
>> But the part you were talking about is the _easy_ part; the hard
>> part is merges, including merging a branch which was split off the
>> trunk before the cut-off point, history rewriting (cf. the 'pu'
>> branch, and rebases), etc.
>
> Does an average user do these things? The shallow clone is there to
> address the casual user who gags at a five-hour download to get an
> initial checkout of Mozilla when they want to make a five-line change
> or just browse the source for a few minutes.

A better idea would be to allow those users to download a gzipped
tarball of a pre-grafted repository. It shouldn't be terribly
difficult to set up an update-hook that creates the pre-grafted
repository for you whenever a tag (or some such) is created in the
repo you host wherever everybody does their initial clone from. As I
understand it (although I've admittedly followed the git mailing-list
sporadically the past three or so months), grafts already work as
intended, and the users can then fetch into their grafted repo to get
a bare minimum of objects.

> There would also be a command to bring down all of the objects to
> fully populate a sparse tree. You could do the shallow clone to begin
> with and then do the full tree populate overnight or in the
> background.

With the pre-grafted history this would work as follows:

  $ mkdir pregraft
  $ wget http://pre-grafts.mozilla.org/pregrafted.git.tgz
  $ cd pregraft
  $ tar xvzf ../pregrafted.git.tgz
  $ cd ..
  $ git clone mozilla-repo-url >& /dev/null &
  $ cd pregraft
  # work, work, work; full clone completes
  $ cd ../mozilla-repo
  $ git pull ../pregraft master

or something similar. iow, you get the small repo quickly and can
start hacking while the full-history clone is downloading. If I
understand grafts correctly, you could then merge the changes made in
the grafted repo into the one with full history.

> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

I believe all tools that work with history understand grafts already,
and if so they should provide sane messages when the user attempts to
access history beyond the grafts. I might have missed or misunderstood
something, but this seems to me like a simple solution to a complex
problem.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply [flat|nested] 101+ messages in thread
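
A minimal sketch of the update-hook Andreas describes; the helper
script is hypothetical, and triggering on tags is just one possible
policy:

    #!/bin/sh
    # update hook: invoked by git as <refname> <old-sha1> <new-sha1>
    refname="$1"
    case "$refname" in
    refs/tags/*)
        # rebuild the downloadable pre-grafted repository in the background
        /usr/local/bin/make-pregrafted-tarball.sh &   # hypothetical helper
        ;;
    esac
    exit 0   # never reject the update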
* Re: Change set based shallow clone
  2006-09-07 19:52 Change set based shallow clone Jon Smirl
  2006-09-07 20:21 ` Jakub Narebski
@ 2006-09-07 22:07 ` Junio C Hamano
  2006-09-07 22:40   ` Jakub Narebski
  ` (2 more replies)
  2006-09-08  1:01 ` linux
  2 siblings, 3 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-07 22:07 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

I've been postponing thinking about shallow clones with a plan to do
so when I have absolutely nothing else to do for a solid few days.
Since my time is booked pretty much full with the day job and the
ongoing pack-file topics these days, I will most likely stick to that
plan after I finish this message.

Not because I think it is an insignificant problem at the fringe (I
did not say "when I have nothing better to do"), but because I feel
this is something important (and hard) that requires "think about it
and only about it straight through for a few days, dropping everything
else" kind of thinking.

When you have a ref under .git/refs/ pointing at an object, you are
making this assertion:

    The object store associated with this repository (usually,
    .git/objects and alternates pointed at by
    .git/objects/info/alternates) holds _all_ objects that are
    recursively reachable from this object by following tag reference,
    commit ancestry chain, commit to its tree, and tree to its
    subtrees and blobs.

So by definition, a shallow clone is a broken/incomplete repository.
Unless we fix the breakage in some way.

One way to do so which has often been talked about is to use the
grafts facility. You cut off the history at some commit, and cauterize
the commit ancestry chain at that point by telling git to pretend that
the commit does not have any parent. While this would work after
setting it up, in the sense that "git log" would stop at the commit
and "git fsck-objects" would say nothing is missing, it is cumbersome
to set up, we do not have an easy way to fetch (or clone -- but clone
is a special case of fetch where the downstream starts out from
emptiness) from such a repository, unplugging to get histories further
back with a later fetch is quite involved, and it is not clear what it
means to further fetch from such a shallow clone.

Where does the difficulty in fetching come from? Mostly because the
upload side and the download side need to agree on the altered
history. Let's say the download side is empty and somehow tells the
upload side that it wants the latest 5 commits' worth of stuff
starting from the tip:

 - The upload side runs

	rev-list --objects HEAD~5..HEAD | pack-objects --stdout

   to give you the needed data. This is almost [*1*] doable by telling
   it that you have HEAD~5.

 - Also the upload side can say "pretend that this commit HEAD~4 has
   no parent, although cat-file on it would say otherwise".

 - The downloader creates a suitable grafts file to cauterize commit
   HEAD~4.

Is this enough? Hardly. Consider:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---*---*---*
                    ^           ^
                HEAD^^2~3    HEAD^^2

The 5-commit request should apply to send the '*' commits, so the
original "HEAD~5..HEAD" was bogus to begin with. You have to say
"HEAD --not HEAD~5 HEAD^^2~3" to get the above.

Coming up with this restriction needs to be done on the upload side.
Doing so on the download side means it needs to retrieve the recent
"real" history to do that computation to come up with where to cut
things off, which is crazy.

However, it may be that the above graph is not what the usual
downloader wants. If the upload side is _the_ official repository of
some sort that merges in subsystem branches, and if you are mostly
interested in the official repository's history, you might have meant
to ask this with your 5-commit request instead:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---o---o---o
                                ^
                             HEAD^^2

That is, you would want to cauterize the side branch. The pack needs
to be built with "HEAD --not HEAD~5 HEAD^^2", and the cauterize
instructions would say "HEAD~4 has no parent, HEAD^ has only one
parent, which is HEAD~2", instead of saying just "HEAD~4 has no
parent."

So whoever is computing this cauterization point has a policy decision
to make, because the meaning of "5 commits away" or "recent 10 days"
is very fuzzy when merges are involved. It obviously is easy to
declare that the policy decision is solely made on the uploader side,
from the implementation point of view, but I do not offhand know what
the ramification of that decision is.

Oh, about "recent 10 days". The side branch might be a very old
development that was merged recently into the mainline. From the
mainline's point of view, everything that happened on the side branch
is something new to him (think of Linus pulling from Jeff, who has
been cooking his stuff for quite some time). Looking at each
individual commit on the side branch, however, they have older commit
and author timestamps. If you pick "I want the mainline's point of
view", that means you need to compute (in the above picture):

 - notice that the commit timestamp of HEAD^ (the merge) is within the
   recent 10-day window.

 - using HEAD^^1 (supposedly the state of the mainline before the
   merge happened [*2*]) and HEAD^^2 (the side branch), compute:

	rev-list HEAD^^1..HEAD^^2

   The results are the "new" ones that appeared when the merge was
   made, so they are within the recent 10-day window from the
   mainline's point of view. Include them.

What that means is, assuming that HEAD~5 is also the 10-day cut-off
point on the mainline, and HEAD^ is recent, we would send this:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                *---*---*---*---*
                                ^
                             HEAD^^2

I'll leave it as an exercise to the readers to come up with the pack
generation specification and cauterizing instructions to give to the
downloader, but you see how quickly this becomes very complex, and we
still haven't talked about the above when multiple merge bases are
involved.

Thinking about it a bit more, maybe the "from the mainline's point of
view" policy would include the same entire side branch when limiting
by depth of commits. After all, the whole side branch was something
new when the HEAD^ merge was created. But that is another policy
decision. And the downloader side cannot make that decision before
seeing what the true ancestry graph looks like.

Let's say we somehow managed to make a shallow clone. The downloader
has the '*' commits:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---*---*---*
                    ^           ^
                HEAD^^2~3    HEAD^^2

In the meantime, the uploader side made some more commits:

                x---x---x---x---x---x
               /                     \
              /                       \
             /                         \
    o---o---o---F---o---*---*---*---*---H---x---X
             \                     /
              \                   /
               \                 /
                o---o---*---*---*

Now the downloader wants to update. How does that go?

 - The uploader says the tip is at X.

 - The downloader says it has the old HEAD (H), H^, ...

 - Suppose the development on the recently merged new side branch did
   not touch most of the files since it forked from the mainline at F,
   but the mainline, which is now part of the downloader's history,
   made a lot of changes that were not touched by the 'x' commits.

   The uploader, if it did not know the downloader has altered
   history, would take "have H" from the downloader to mean "the other
   side has everything reachable from H", so it would compute H..X
   (commits that are not in H but in X, and the trees and blobs needed
   to complete them). But that can leave out the trees and blobs that
   existed in F (and were part of commits on the new side branch).

 - So the uploader side needs to know how the history on the
   downloader's side is altered from reality when interpreting "have
   H". It does not mean "H and all the history behind it" to the
   uploader anymore.

One way to do so is to send grafts information from the downloader to
the uploader. If the uploader _knows_ that the altered history ends at
the leftmost '*' commit on the mainline, then it can take into account
the fact that the downloader does not have commit 'F'. That, however,
introduces another problem. Where should it stop? Obviously X needs to
be sent, and so is X^, but how do we know which 'x' commit to stop at?
Do we record the initial cut-off criteria (remember, we started the
clone with a 5-commit cutoff) somewhere in the downloader's repository
and send that to the uploader, so that the uploader can apply the same
rule to include only 5 commits from the new side branch? What happens
if there are fewer than 5 commits on that new side branch? Would we
end up including 'F', which the downloader specifically did not want
when making this shallow clone initially?

I won't go into the details, but when you think about what needs to
happen when fetching _from_ a shallow clone (i.e. the uploader side is
incomplete), your head will explode ;-). It is solvable in the same
sense that you can solve the previous "fetching to update" problem, by
trying to define a reasonable semantics for the operation by answering
the questions I posed in the above paragraph (which obviously is in no
way an exhaustive set), but the result will be quite complex. It would
involve sending graft information both from the uploader and the
downloader (after all, the downloader side can also be shallow, but
with different cut-off points from the uploader) and somehow coming up
with the intersection of both, to arrive at a shared view of the
altered history before starting the traversal to find what to send.

Also, often people would want to deepen history (i.e. fetch older
history than initially fetched) after "checking out only the tip". It
is a reasonable request, and is a somewhat easier problem to solve
than the initial shallow clone problem. If you have a working initial
shallow clone, essentially you can:

 - Find the earlier cut-off points by looking at your own grafts file;

 - Ask the uploader to clone shallowly from the parents missing from
   your repository,

and update the resulting grafts information.

Anyway, it is a useful feature, an important operation, and it
involves a lot of thinking (and hard work, obviously). I will not
think about this anymore, until I have absolutely nothing else to do
for a solid few days, but you probably haven't read this far ;-).

-jc

*1* Currently there is no way to find out what HEAD~10 is during the
    initial protocol handshake, but this message is mostly about
    potential protocol update needs.

*2* The first parent of a merge is _usually_ what the repository owner
    saw, i.e. the mainline merging the side branch into it, but that
    is not universally true when fast forwards are involved.

^ permalink raw reply [flat|nested] 101+ messages in thread
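
The two cut-off policies Junio contrasts map to two different limiting
expressions on the uploader side, using the pipeline his message
already names (the revisions refer to his example graphs):

    # keep the side branch back to HEAD^^2~3 (first graph)
    rev-list --objects HEAD --not HEAD~5 HEAD^^2~3 | pack-objects --stdout

    # cauterize the side branch entirely at HEAD^^2 (second graph)
    rev-list --objects HEAD --not HEAD~5 HEAD^^2 | pack-objects --stdout

Either way, the matching cauterizing grafts still have to be shipped
to the downloader alongside the pack.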
* Re: Change set based shallow clone
  2006-09-07 22:07 ` Junio C Hamano
@ 2006-09-07 22:40   ` Jakub Narebski
  2006-09-08  3:54   ` Martin Langhoff
  2006-09-08  5:05   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 22:40 UTC (permalink / raw)
To: git

Junio C Hamano wrote:

> One way to do so is to send grafts information from the downloader to
> the uploader. If the uploader _knows_ that the altered history ends
> at the leftmost '*' commit on the mainline, then it can take into
> account the fact that the downloader does not have commit 'F'. That,
> however, introduces another problem. Where should it stop? Obviously
> X needs to be sent, and so is X^, but how do we know which 'x' commit
> to stop at? Do we record the initial cut-off criteria (remember, we
> started the clone with a 5-commit cutoff) somewhere in the
> downloader's repository and send that to the uploader, so that the
> uploader can apply the same rule to include only 5 commits from the
> new side branch? What happens if there are fewer than 5 commits on
> that new side branch? Would we end up including 'F', which the
> downloader specifically did not want when making this shallow clone
> initially?
>
> I won't go into the details, but when you think about what needs to
> happen when fetching _from_ a shallow clone (i.e. the uploader side
> is incomplete), your head will explode ;-). It is solvable in the
> same sense that you can solve the previous "fetching to update"
> problem, by trying to define a reasonable semantics for the operation
> by answering the questions I posed in the above paragraph (which
> obviously is in no way an exhaustive set), but the result will be
> quite complex. It would involve sending graft information both from
> the uploader and the downloader (after all, the downloader side can
> also be shallow, but with different cut-off points from the uploader)
> and somehow coming up with the intersection of both, to arrive at a
> shared view of the altered history before starting the traversal to
> find what to send.

The proposed solution was to send the graft information and the cutoff
from the downloader to the uploader, coming up with the effective
graft points, which would be the intersection of the downloader's and
uploader's grafts[*1*], and then use this intersection as the graft
information when calculating what to send.

With a sparse clone we would have:

    #---o---#---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---*---*---*
                    ^           ^
                HEAD^^2~3    HEAD^^2

where # and * are the commits which the downloader has (the # are the
additional commits of a sparse rather than shallow clone).

                x---x---x---x---x---x
               /                     \
              /                       \
             /                         \
    #---o---#---F---o---*---*---*---*---H---x---X
             \                     /
              \                   /
               \                 /
                o---o---*---*---*

Now, when we are shallow fetching, we should have remembered somewhere
the cutoff used (or use the same cutoff). So we should then get:

                x---*---*---*---*---*
               /                     \
              /                       \
             /                         \
    #---o---#---#---o---*---*---*---*---*---*---*
             \                     /
              \                   /
               \                 /
                o---o---*---*---*

Footnotes:
[*1*] It would be some kind of graph intersection. Note that the
repository we clone from might be a shallow clone, shallower in parts
than our repository, or might be a repository with a historic
repository grafted on (grafts adding history, not simplifying it).

[*2*] I agree that shallow/sparse clone is hard. Lazy clone (or remote
alternatives) is also hard: when we do download objects, do we
download ahead, do we cache downloaded objects or move them to the
local repository, how often do we clean the cache, etc.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 22:07 ` Junio C Hamano
  2006-09-07 22:40   ` Jakub Narebski
@ 2006-09-08  3:54   ` Martin Langhoff
  2006-09-08  5:30     ` Junio C Hamano
  2006-09-08  5:05   ` Aneesh Kumar K.V
  2 siblings, 1 reply; 101+ messages in thread
From: Martin Langhoff @ 2006-09-08 3:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jon Smirl, git

On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
...
> Anyway, it is a useful feature, an important operation, and it
> involves a lot of thinking (and hard work, obviously).
>
> I will not think about this anymore, until I have absolutely nothing
> else to do for a solid few days, but you probably haven't read this
> far ;-).

For when you come back, then. I agree this is rather complicated, so
much so that I suspect we may be barking up the wrong tree.

People who want shallow clones are actually asking for a "light"
clone, in terms of "how much do I need to download". If everyone has
the whole commit chain (but may be missing olden blobs, and even
trees), the problem becomes a lot easier.

On the wire, all the client needs to say is "from root commit X to
later commits A, B, C, D, can I have all the relevant blobs?" (Some
may end up being resent, but that's a tradeoff.)

On the server side, keeping commits (and/or trees) in a segregated
pack makes sense as an optimization. On the client side, a similar
segregated pack for the 'known blobless' commits/trees may make sense,
to be able to treat ops on the blobless part of history differently.

{From a user interaction point of view, it is also easier to error out
saying "don't have it, but you'll get it if you ask for history from
[SHA1] onwards".}

just my .2

m

^ permalink raw reply [flat|nested] 101+ messages in thread
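
A sketch of the request shape Martin is proposing; the verbs and
syntax here are invented for illustration and exist in no version of
the git protocol:

    client: have-commits <tip SHA1>           (commit chain is complete)
    client: want-blobs X..A X..B X..C X..D    (fill in trees and blobs)
    server: sends the trees and blobs reachable from A, B, C and D but
            not from X, possibly resending a few duplicates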
* Re: Change set based shallow clone
  2006-09-08  3:54   ` Martin Langhoff
@ 2006-09-08  5:30     ` Junio C Hamano
  2006-09-08  7:15       ` Martin Langhoff
  2006-09-08 14:20       ` Jon Smirl
  0 siblings, 2 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-08 5:30 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
> ...
>> Anyway, it is a useful feature, an important operation, and it
>> involves a lot of thinking (and hard work, obviously).
>>
>> I will not think about this anymore, until I have absolutely nothing
>> else to do for a solid few days, but you probably haven't read this
>> far ;-).
>
> For when you come back, then. I agree this is rather complicated, so
> much so that I suspect we may be barking up the wrong tree.
>
> People who want shallow clones are actually asking for a "light"
> clone, in terms of "how much do I need to download". If everyone has
> the whole commit chain (but may be missing olden blobs, and even
> trees), the problem becomes a lot easier.

No, I do not think so. You are just pushing the same problem to
another layer.

Reachability through the commit ancestry chain is no more special than
reachability through commit-to-tree or tree-containment relationships.
The grafts mechanism happens to treat commit ancestry more specially
than the others, but that is not an inherent property of git [*1*]. It
is only an implementation detail (and limitation -- grafts cannot
express anything but commit ancestry) of the current system.

But let's touch a slightly different but related topic first. People
do not ask for shallow clones. They just want faster transfer of
"everything they care about". Shallow and lazy clones are
implementation techniques to achieve the goal of making everything
they care about appear available, typically taking advantage of the
fact that people care about the recent past a lot more than they do
about ancient history.

Shallow clones do so by explicitly saying "sorry, you told me earlier
you do not care about things older than X, so if you now care about
things older than X please do such and such to deepen your history".
What you are saying is a variant of shallow that says "NAK; commits I
have, but trees and blobs associated with such an old commit you have
to ask me to retrieve from elsewhere". Lazy clones do so by "faulting
missing objects in on demand" [*2*]. That is essentially automating
the "please do such and such" part.

So they are all not that different.

Now, first and foremost, while I would love to have a system that
gracefully operates with a "sparse" repository that lacks objects that
should exist from the tag/commit/tree/blob reachability point of view,
it is an absolute requirement that I can tell why objects are missing
from a repository when I find some are missing by running
fsck-objects [*3*]. If the repository is a shallow clone, not having
some object may be expected, but I want to be able to tell repository
corruption locally even in that case, so "assume a lack of an object
is perhaps because it was not downloaded by shallow cloning" is an
unacceptable attitude, and "when you find an object missing from your
repository, you can ask the server [*4*], and if the server does not
have it then you know your repository is corrupt, otherwise it is Ok"
is a wrong answer.

I talked about the need of an upload-pack protocol extension to
exchange grafts information between the uploader and downloader, to
come up with the resulting graft and update the grafts in the
downloaded repository after the object transfer is done. It is needed
because, by having such cauterizing graft entries, I can be sure that
the repository is not corrupt when an fsck-objects that does look at
grafts says "nothing is missing". Jon talked about "fault-in on demand
and leave stub objects until downloading the real thing"; those stub
objects are a natural extension of the grafts facility that extends to
trees and blobs. Either of these would allow me to validate the sparse
repository locally.

Now I'll really shut up ;-).

[*1*] For example, rev-list --objects knows that it needs to show the
tree containment when given only a tree object without any commit.

[*2*] You probably could write a FUSE module that mounts on your
.git/objects, responding to accesses to the .git/objects/??
directories and the files with 38-byte hexadecimal names under them by
talking to other repositories over the wire. No, I am not going to do
it, and I will probably not touch it even with a ten-foot pole, for
reasons I state elsewhere in this message.

[*3*] The real fsck-objects always looks at grafts. The fictitious
version of fsck-objects I am talking about here is the one that uses
the parents recorded on the "parent " lines of commit objects, that
is, the true object-level reachability.

[*4*] In git, there is no inherent server vs client or upstream vs
downstream relationship between repositories. You may even be fetching
from many people and not have a set upstream at all. Or you are _the_
upstream, and your notebook has the latest development history, and
after pushing that latest history to your mothership repository, you
may decide you do not want the ancient development history on a puny
notebook, and locally cauterize the history on your notebook
repository and prune the ancient stuff. The objects missing from the
notebook repository are retrievable from your mothership repository
again if/when needed, and you as the user would know that (and if you
are lucky you may even remember it a few months later), but git
doesn't, and there is no reason for git to want to know it. If we want
to do "fault-in on demand", we need to add a way to say "it is Ok for
this repository to be incomplete -- objects that are missing must be
completed from that repository (or those repositories)". But that's
quite a special case of having a fixed upstream.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08  5:30     ` Junio C Hamano
@ 2006-09-08  7:15       ` Martin Langhoff
  2006-09-08  8:33         ` Junio C Hamano
  2006-09-08 17:18         ` A Large Angry SCM
  1 sibling, 2 replies; 101+ messages in thread
From: Martin Langhoff @ 2006-09-08 7:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jon Smirl, git

On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
> "Martin Langhoff" <martin.langhoff@gmail.com> writes:
>> People who want shallow clones are actually asking for a "light"
>> clone, in terms of "how much do I need to download". If everyone has
>> the whole commit chain (but may be missing olden blobs, and even
>> trees), the problem becomes a lot easier.
>
> No, I do not think so. You are just pushing the same problem to
> another layer.
>
> Reachability through the commit ancestry chain is no more special
> than reachability through commit-to-tree or tree-containment
> relationships. The grafts mechanism happens to treat commit

Agreed that it is no more special. OTOH, if we focus on the fact that
people want to avoid high-cost data transfers, transferring commit
chains is cheap and allows the client to ask good questions when
talking to the server. So as far as tradeoffs go, it allows you to
keep the protocol simple, and minimise complex guessing at either end
of the wire.

> But let's touch a slightly different but related topic first. People
> do not ask for shallow clones. They just want faster transfer of
> "everything they care about". Shallow and lazy

I'd disagree a bit here. They care about the whole project, and in
time they'll find that out and end up pulling it all if they use git
much at all ;-) They want fast, cheap initial checkouts.

...
> So they are all not that different.

Earlier you were pointing out how hard it was for the client to even
know what to ask for, because it can't see the whole picture. Having
the ancestry complete means you always know what to ask for.

> Now, first and foremost, while I would love to have a system that
> gracefully operates with a "sparse" repository that lacks objects
> that should exist from the tag/commit/tree/blob reachability point of
> view, it is an absolute requirement that I can tell why objects are
> missing from a repository when I find some are missing by running
> fsck-objects [*3*].

I agree -- and you can keep those objects you know are expected to be
missing listed in a "packless" idx file somewhere.

> If the repository is a shallow clone, not having some object may be
> expected, but I want to be able to tell repository corruption locally
> even in that case,

+1

> I talked about the need of an upload-pack protocol extension

As far as I can see, this would not need any change to the upload-pack
protocol. There are some hard problems in dealing with a sparse repo
that need thinking through. My thinking is that by having the whole
commit chain around, the protocol can be kept sane, by virtue of the
local repo always having a clear "overall" picture, including knowing
what it's missing.

> [*4*] In git, there is no inherent server vs client or upstream vs
> downstream relationship between repositories.

Here an important distinction must be made. A "publishing" repo cannot
be sparse. A sparse repo probably cannot be cloned from.

> You may even be fetching from many people and not have a set upstream
> at all. Or you are _the_ upstream, and your notebook has the latest
> development history, and after pushing that latest history to your
> mothership repository, you may decide you do not want the ancient
> development history on a puny notebook, and locally cauterize the
> history on your notebook repository and prune the ancient stuff.

Well, that's easy again: "prune old blobs and list them in an idx"
should work well.

cheers,

martin

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08  7:15       ` Martin Langhoff
@ 2006-09-08  8:33         ` Junio C Hamano
  2006-09-08 17:18         ` A Large Angry SCM
  1 sibling, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-08 8:33 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> As far as I can see, this would not need any change to the
> upload-pack protocol.

That I have to disagree with. Take your earlier example of "I have X
and I learned you have A B C D". Now, the fetch that got X was a
commit-only one (but you have the full tree for X), but you got X from
somebody else, not the uploader you are talking with right now. There
is a common ancestor Y somewhere behind X, but between Y and X you
lack trees and blobs.

How would the current protocol work (I am explaining that it won't be
"not need any change")? After you express interest in A, B, C, D with
"want" messages, you start telling the uploader "have X", "have X~1",
"have X~2", ... (X, X~1 etc. are object names, not symbolic).
Eventually the uploader would recognize Y as an ancestor of X that it
has and will ACK it; you stop traversing the ancestors of an acked one
and say "done". So now we have a common ancestor, and the upload side
knows it can omit the commits behind that common commit Y, and the
trees and blobs contained in Y.

See the oops here? You do NOT have the trees and blobs associated with
commit Y.

I am not saying we should not change the protocol. I am just trying to
explain that the problem is not something you can fudge without
changing the protocol. As a first-level approximation, we could, in
addition to the commit object name, have a bit on each "have" message
that says "do I have the trees and blobs associated with the commit"
(by the way, this is _expensive_: you have to traverse down the tree
contained in each commit using has_sha1_file() recursively to see if
you have anything missing from _each_ commit you send to the other
side). Alternatively, you can say "I have this commit, but I do not
have this tree, and I do not even know if I have the blobs needed to
complete that tree, because I do not know what the tree I am missing
contains -- I may have them, I may not. I truly do not know" to convey
the same information.

Think about it a bit -- saying "I know I am missing this tree" is one
thing, but if we end up saying "I do not even know what I am missing",
it is like saying "don't care what I say, just send everything to me".
Are we gaining much by having only commit objects on our end? Once you
end up sending a full tree, it is like doing an initial checkout over
CVS when you think you are incrementally fetching, and you are already
lost. For example, a recent kernel tarball compressed with gzip is
around 50MB; the history for 34k commits since 2.6.12-rc2, fully
packed, is slightly less than 150MB.

^ permalink raw reply [flat|nested] 101+ messages in thread
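
A sketch of that first-level approximation as it might look on the
wire; the message format is hypothetical, and the object names are
made up:

    have 39c1a6f1e3a1... complete       (commit plus all trees/blobs)
    have 8f3d02ab99c0... commit-only    (commit object only)

The second form tells the uploader it cannot safely omit anything
reachable from that commit, which is exactly why Junio doubts the
commit-only scheme gains much.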
* Re: Change set based shallow clone
  2006-09-08  7:15       ` Martin Langhoff
  2006-09-08  8:33         ` Junio C Hamano
@ 2006-09-08 17:18         ` A Large Angry SCM
  1 sibling, 0 replies; 101+ messages in thread
From: A Large Angry SCM @ 2006-09-08 17:18 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Junio C Hamano, Jon Smirl, git

Martin Langhoff wrote:
> On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
...
>> [*4*] In git, there is no inherent server vs client or upstream vs
>> downstream relationship between repositories.
>
> Here an important distinction must be made. A "publishing" repo
> cannot be sparse. A sparse repo probably cannot be cloned from.

With the use of "placeholder" objects, neither one of these assertions
is true.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08  5:30     ` Junio C Hamano
  2006-09-08  7:15       ` Martin Langhoff
@ 2006-09-08 14:20       ` Jon Smirl
  2006-09-08 15:50         ` Jakub Narebski
  1 sibling, 1 reply; 101+ messages in thread
From: Jon Smirl @ 2006-09-08 14:20 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Martin Langhoff, git

On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
> [*4*] In git, there is no inherent server vs client or upstream vs
> downstream relationship between repositories. You may even be
> fetching from many people and not have a set upstream at all. Or you
> are _the_ upstream, and your notebook has the latest development
> history, and after pushing that latest history to your mothership
> repository, you may decide you do not want the ancient development
> history on a puny notebook, and locally cauterize the history on your
> notebook repository and prune the ancient stuff. The objects missing
> from the notebook repository are retrievable from your mothership
> repository again if/when needed, and you as the user would know that
> (and if you are lucky you may even remember it a few months later),
> but git doesn't, and there is no reason for git to want to know it.
> If we want to do "fault-in on demand", we need to add a way to say
> "it is Ok for this repository to be incomplete -- objects that are
> missing must be completed from that repository (or those
> repositories)". But that's quite a special case of having a fixed
> upstream.

A 'not-present' object would be a normal git object; it would contain
a list of the SHA1s that it is stubbing for. These alias SHA1s need to
end up somewhere where the git tools can find them. Maybe put all of
the 'not-present' objects into their own pack and add the aliases to
the index. If the aliases point back to the 'not-present' object,
everything can be validated even if the alias SHA1s don't match the
one of the object. This pack would be private and not sent to other
repositories.

It would be useful to maintain a list of possible remote repositories
to search for the missing objects if needed. This list could be shared
with the people that clone from you.

If you clone from a remote repository that is a partial copy, you may
not be able to get all of the objects you want. The only choice here
is to point them to a 'not-present' stub and try searching the
alternative repository list.

If you really want to build something for the future, have all of the
git repositories on the net talk to each other and use a distributed
hash scheme to spread redundant copies of the objects out over the
cloud of servers. This will have the effect of creating a global
namespace for all git projects.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08 14:20       ` Jon Smirl
@ 2006-09-08 15:50         ` Jakub Narebski
  2006-09-09  3:13           ` Petr Baudis
  0 siblings, 1 reply; 101+ messages in thread
From: Jakub Narebski @ 2006-09-08 15:50 UTC (permalink / raw)
To: git

My idea for lazy clone/fetch (lazy = on-demand) is via a remote
alternatives mechanism. We put in the URI for the repository (or
repositories) that hosts the project, and at the start we would need
to download at least heads and tags, and only heads and tags.

When there is a request for an object (commit, tree, blob or tag)
which we don't have locally, we ask the remote repository if it has
it, and download it (and perhaps some related objects). This needs an
extension, I guess, for the git downloader ('wantonly'); it should be
fairly easy for dumb downloaders, if you don't mind getting a whole
pack, i.e. perhaps more than intended. Perhaps 'wantonly' for the git
protocol should work the same: one would get a ready pack (perhaps
with a limit on pack size; otherwise one would get only the requested
objects).

One thing that needs extending for lazy clone is git-fsck... otherwise
you would end up downloading the whole repository when calling
git-fsck. Simplest would be fsck-ing only the local part, perhaps
checking whether the "missing" objects are present in the remote
repository, but not downloading them and not checking connectivity
further... and of course notifying the user that this is a lazy clone.
Unless we run git-fsck-objects on the remote side, then pass the
results to the local side.

There is probably a bunch of problems with this idea...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
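
For comparison, today's (purely local) alternates mechanism is the
plain-text file .git/objects/info/alternates, one object directory per
line; the path below is made up:

    $ cat .git/objects/info/alternates
    /srv/git/mozilla-full.git/objects

The 'remote alternatives' idea amounts to also allowing a URL such as
git://example.org/mozilla.git on such a line, with missing objects
then faulted in over the wire.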
* Re: Change set based shallow clone
  2006-09-08 15:50         ` Jakub Narebski
@ 2006-09-09  3:13           ` Petr Baudis
  2006-09-09  8:39             ` Jakub Narebski
  0 siblings, 1 reply; 101+ messages in thread
From: Petr Baudis @ 2006-09-09 3:13 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git

Dear diary, on Fri, Sep 08, 2006 at 05:50:40PM CEST, I got a letter
where Jakub Narebski <jnareb@gmail.com> said that...

> My idea for lazy clone/fetch (lazy = on-demand) is via a remote
> alternatives mechanism. We put in the URI for the repository (or
> repositories) that hosts the project, and at the start we would need
> to download at least heads and tags, and only heads and tags.

One thing to note is that you won't last very long without getting at
least basically all the commits from the history. git log, git
merge-base and such would either just suck them all in, get partially
moved to the server side, or would undergo quite a painful and
slooooooooow process: "get me commit X... thank you, sir. hmm, it
appears that its parent is commit Y. could you get me commit Y,
please...? thank you, sir. hmm, it appears...".

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Snow falling on Perl. White noise covering line noise. Hides all the
bugs too. -- J. Putnam

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-09  3:13           ` Petr Baudis
@ 2006-09-09  8:39             ` Jakub Narebski
  0 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-09 8:39 UTC (permalink / raw)
To: git

Petr Baudis wrote:
> Dear diary, on Fri, Sep 08, 2006 at 05:50:40PM CEST, I got a letter
> where Jakub Narebski <jnareb@gmail.com> said that...
>> My idea for lazy clone/fetch (lazy = on-demand) is via a remote
>> alternatives mechanism. We put in the URI for the repository (or
>> repositories) that hosts the project, and at the start we would need
>> to download at least heads and tags, and only heads and tags.
>
> One thing to note is that you won't last very long without getting at
> least basically all the commits from the history. git log, git
> merge-base and such would either just suck them all in, get partially
> moved to the server side, or would undergo quite a painful and
> slooooooooow process: "get me commit X... thank you, sir. hmm, it
> appears that its parent is commit Y. could you get me commit Y,
> please...? thank you, sir. hmm, it appears...".

As I said, there is a load of trouble with lazy clone/fetch = remote
alternatives that I didn't think about. git log/git rev-list and
git fsck-objects among them.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 22:07 ` Junio C Hamano
  2006-09-07 22:40   ` Jakub Narebski
  2006-09-08  3:54   ` Martin Langhoff
@ 2006-09-08  5:05   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 101+ messages in thread
From: Aneesh Kumar K.V @ 2006-09-08 5:05 UTC (permalink / raw)
To: git

Junio C Hamano wrote:
> I've been postponing thinking about shallow clones with a plan to do
> so when I have absolutely nothing else to do for a solid few days.
> Since my time is booked pretty much full with the day job and the
> ongoing pack-file topics these days, I will most likely stick to that
> plan after I finish this message.
>
> Not because I think it is an insignificant problem at the fringe (I
> did not say "when I have nothing better to do"), but because I feel
> this is something important (and hard) that requires "think about it
> and only about it straight through for a few days, dropping
> everything else" kind of thinking.
>
> When you have a ref under .git/refs/ pointing at an object, you are
> making this assertion:
>
>     The object store associated with this repository (usually,
>     .git/objects and alternates pointed at by
>     .git/objects/info/alternates) holds _all_ objects that are
>     recursively reachable from this object by following tag
>     reference, commit ancestry chain, commit to its tree, and tree to
>     its subtrees and blobs.
>
> So by definition, a shallow clone is a broken/incomplete repository.
> Unless we fix the breakage in some way.
>
> One way to do so which has often been talked about is to use the
> grafts facility. You cut off the history at some commit, and
> cauterize the commit ancestry chain at that point by telling git to
> pretend that the commit does not have any parent. While this would
> work after setting it up, in the sense that "git log" would stop at
> the commit and "git fsck-objects" would say nothing is missing, it is
> cumbersome to set up, we do not have an easy way to fetch (or clone
> -- but clone is a special case of fetch where the downstream starts
> out from emptiness) from such a repository, unplugging to get
> histories further back with a later fetch is quite involved, and it
> is not clear what it means to further fetch from such a shallow
> clone.

Pardon my ignorance, but wouldn't having http:// or ssh-based URL
support in objects/info/alternates support such shallow clones? What I
am wondering is: can't we make the code that follows
objects/info/alternates traverse the network transport, and make git
log work like cvs log in case things are not in the local repository?
This would make all the git operations slow, but if you are interested
in speed you can clone the repository fully.

-aneesh

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-07 19:52 Change set based shallow clone Jon Smirl 2006-09-07 20:21 ` Jakub Narebski 2006-09-07 22:07 ` Junio C Hamano @ 2006-09-08 1:01 ` linux 2006-09-08 2:23 ` Jon Smirl 2 siblings, 1 reply; 101+ messages in thread From: linux @ 2006-09-08 1:01 UTC (permalink / raw) To: git, jonsmirl, linux > Here's a change set based shallow clone scheme I've been thinking > about, does it have potential? Well, let's look at the problems... You might want to look at the explanation of the current git network protocol at http://www.spinics.net/lists/git/msg08899.html I'm just understanding it myself, but as best I understant it, it has four phases: - First, the server lists all of the heads it has, by SHA1 and name. - Second, the client tells it which ones it wants, by name. - Third, an interactive protocol is entered in which the client tells the server which commits it has in terms the server can understand. This proceeds as follows: - For each head the client has, it lists commit objects going backwards in history. - As soon as the server sees one that it has heard of, it sends back "ACK <hash> continue". The client stops tracing that branch and starts in on the next one. - (Periodically, the client sends a ping to the server, which responds "NAK" if it's still alive.) - This continues until the client says "done". At this point, the server knows a set of commits which the client has, and a set of commits which the client wants. Then it invokes git-rev-list --objects-edge <want1> <want2> ^<have1> ^<have2> (In practice, there will be a lot more than two of each, but forgive me if I simplify.) This builds a list of commits that are ancestors of <want1> and <want2> but not <have1> or <have2>. (A commit is considered an ancestor of itself, so <want1> and <want2> are in the set, while <have1> and <have2> are not.) Then it enumerates the complete trees of each of those commits. Then it subtracts from that object set all of the trees and blobs in the <have1> and <have2> commits. This is a good approximation to the set of objects that the client needs to be sent. It does not search the ancestors of <have1> and <have2>, so it is possible that a copy of some object exists in an older commit, and the state in <want1> is a reversion to a pre-<have1> state, but it's unlikely and not worth looking for. Sending it redundantly is harmless. Then, this set of objects is piped to git-pack-objects, which finds them and builds a pack file. If the objects are stored locally as deltas relative to other objects which are either to be sent, or are in the <have1> or <have2> commits (which information is included in the data output by git-rev-list), the delta is copied to the pack file verbatim. Otherwise, more CPU is spent to pack it. Again, the server is allowed to check for deltas against ancestors of <have1> or <have2>, but doesn't bother for efficiency. Fourth, the pack file is sent to the client, which unpacks it (and discards any duplicates). > When the client wants a shallow clone it starts by telling the server > all of the HEADs and how many change sets down each of those HEADs it > has locally. That's a small amout of data to transmit and it can be > easily tracked. Let's ignore merged branches for the moment. When you say "change set", I'm going to assume you mean "commit object". Okay. Now, the server hasn't heard of one or more of those commit objects, because they're local changes. What then? 
Another issue is that a client with a nearly-full copy has to do a full walk of its history to determine the depth count that it has. That can be more than 2500 commits down in the git repository, and worse in the mozilla one. It's actually pretty quick (git-show-branch --more=99999 will do it), but git normally tries to avoid full traversals like the plague Oh, and was "for the moment" supposed to last past the end of your e-mail? I don't see what to do if there's a merge in the history and the depth on different sides is not equal. E.g. the history looks like: ...a---b---c---d---e---f / \ ...w---x---y HEAD / ...p---q---r---s Where "..." means that there are ancestors, but they're missing. > If you haven't updated for six months when the server walks backwards > for 10 change sets it's not going to find anything you have locally. > When this situation is encountered the server needs to generate a > delta just for you between one of the change sets it knows you have > and one of the 10 change sets you want. By generating this one-off > delta it lets you avoid the need to fetch all of the objects back to a > common branch ancestor. The delta functions as a jump over the > intervening space. Your choice of words keeps giving me the impression that you believe that a "change set" is a monolithic object that includes all the changes made to all the files. It's neither monolithic nor composed of changes. A commit objects consists soley of metadata, and contains a pointer to a tree object, which points recursively to the entire project state at the time of the commit. There is massive sharing of component objects between successive commits, but they are NOT stored as deltas relative to one another. The pack-forming heuristics tend to achieve that effect, but it is not guaranteed or required by design. Please understand that, deep in your bones: git is based on snapshots, not deltas. But okay, so we've sent the client the latest 10 commits, with a dangling tail at the bottom. (The files may have been sent as deltas against the old state, or just fresh compressed copies, but that doesn't matter.) Then the heads like "origin" have been advanced. So the old commit history is now unreferenced garbage; nothing points to it, and it will be deleted next time git-prune is run. Is that the intended behavior? Or should updates to an existing clone always complete the connections? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 1:01 ` linux @ 2006-09-08 2:23 ` Jon Smirl 2006-09-08 8:36 ` Jakub Narebski 2006-09-08 18:42 ` linux 0 siblings, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-08 2:23 UTC (permalink / raw) To: linux@horizon.com; +Cc: git On 7 Sep 2006 21:01:12 -0400, linux@horizon.com <linux@horizon.com> wrote: > > When the client wants a shallow clone it starts by telling the server > > all of the HEADs and how many change sets down each of those HEADs it > > has locally. That's a small amout of data to transmit and it can be > > easily tracked. Let's ignore merged branches for the moment. > > When you say "change set", I'm going to assume you mean "commit object". > > Okay. Now, the server hasn't heard of one or more of those commit > objects, because they're local changes. What then? Toss them, if they don't exist on the server the server is going to be able to send any objects for them. > Another issue is that a client with a nearly-full copy has to do a full > walk of its history to determine the depth count that it has. That can > be more than 2500 commits down in the git repository, and worse in the > mozilla one. It's actually pretty quick (git-show-branch --more=99999 > will do it), but git normally tries to avoid full traversals like the > plague Client would track this incrementally and not recompute it each time. > Oh, and was "for the moment" supposed to last past the end of your e-mail? > I don't see what to do if there's a merge in the history and the depth > on different sides is not equal. E.g. the history looks like: > > ...a---b---c---d---e---f > / \ > ...w---x---y HEAD > / > ...p---q---r---s > > Where "..." means that there are ancestors, but they're missing. > > > If you haven't updated for six months when the server walks backwards > > for 10 change sets it's not going to find anything you have locally. > > When this situation is encountered the server needs to generate a > > delta just for you between one of the change sets it knows you have > > and one of the 10 change sets you want. By generating this one-off > > delta it lets you avoid the need to fetch all of the objects back to a > > common branch ancestor. The delta functions as a jump over the > > intervening space. > > Your choice of words keeps giving me the impression that you believe > that a "change set" is a monolithic object that includes all the changes > made to all the files. It's neither monolithic nor composed of changes. > A commit objects consists soley of metadata, and contains a pointer to > a tree object, which points recursively to the entire project state at > the time of the commit. I was using change set to refer to snapshot. > There is massive sharing of component objects between successive > commits, but they are NOT stored as deltas relative to one another. Yes, most of the sharing occurs via the tree structures. > The pack-forming heuristics tend to achieve that effect, but it is not > guaranteed or required by design. > > Please understand that, deep in your bones: git is based on snapshots, > not deltas. > > > But okay, so we've sent the client the latest 10 commits, with a dangling > tail at the bottom. (The files may have been sent as deltas against the > old state, or just fresh compressed copies, but that doesn't matter.) > Then the heads like "origin" have been advanced. > > So the old commit history is now unreferenced garbage; nothing points > to it, and it will be deleted next time git-prune is run. 
Is that > the intended behavior? Or should updates to an existing clone always > complete the connections? If you follow the links in what looks to be a dangling object sooner or latter you will run into the root object or a 'not present' object. If you hit one of those the objects are not dangling and should be preserved. Here is another way to look at the shallow clone problem. The only public ids in a git tree are the head and tag pointers. Send these to the client. Now let's modify the git tools to fault the full objects in one by one from the server whenever a git operation needs the object. Dangling references would point to 'not-present' objects. For a typical user using a model like this, how much of the Mozilla repository would end up being faulted into their machine? Mozilla has 2M objects and 250K commits in a 450MB pack. My estimate is that a typical user is going to touch less than 200K of the objects and maybe less than 100K. Of course always faulting in full objects is wasteful. A smart scheme would be to try and anticipate with some read ahead and figure out ways to send deltas. Tools like gitk would need to only touch the objects needed to draw the screen and not run the full commit chain at startup. This experiment can be done fairly easily. Put all of the kernel source into a single pack file. Modify the git tools to set a bit in the index file if an object is accessed. Use the pack for a few days and then dump out the results. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 2:23 ` Jon Smirl @ 2006-09-08 8:36 ` Jakub Narebski 2006-09-08 8:39 ` Junio C Hamano 2006-09-08 18:42 ` linux 1 sibling, 1 reply; 101+ messages in thread From: Jakub Narebski @ 2006-09-08 8:36 UTC (permalink / raw) To: git Jon Smirl wrote: > Here is another way to look at the shallow clone problem. The only > public ids in a git tree are the head and tag pointers. Send these to > the client. Now let's modify the git tools to fault the full objects > in one by one from the server whenever a git operation needs the > object. Dangling references would point to 'not-present' objects. > > For a typical user using a model like this, how much of the Mozilla > repository would end up being faulted into their machine? Mozilla has > 2M objects and 250K commits in a 450MB pack. My estimate is that a > typical user is going to touch less than 200K of the objects and maybe > less than 100K. > > Of course always faulting in full objects is wasteful. A smart scheme > would be to try and anticipate with some read ahead and figure out > ways to send deltas. Tools like gitk would need to only touch the > objects needed to draw the screen and not run the full commit chain at > startup. Yes, that is also recurring _lazy_ clone idea (or remote alternatives[*1*] idea). With it's own set of difficulties, like cache management, readahead and such, and of course differentiating between not present, and present on the remote server. And what to do when network fails... wouldn't it be just easier to use networking system like Coda or InterMezzo? [*1*] From Documentation/repository-layout.txt or http://www.kernel.org/pub/software/scm/git/docs/repository-layout.html You can be using `objects/info/alternates` mechanism, or `$GIT_ALTERNATE_OBJECT_DIRECTORIES` mechanism to 'borrow' objects from other object stores. A repository with this kind of incomplete object store is not suitable to be published for use with dumb transports but otherwise is OK as long as `objects/info/alternates` points at the right object stores it borrows from. [...] objects/info/alternates:: This file records absolute filesystem paths of alternate object stores that this object store borrows objects from, one pathname per line. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 8:36 ` Jakub Narebski @ 2006-09-08 8:39 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-08 8:39 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> writes: > [*1*] From Documentation/repository-layout.txt or > http://www.kernel.org/pub/software/scm/git/docs/repository-layout.html > > You can be using `objects/info/alternates` mechanism, or > `$GIT_ALTERNATE_OBJECT_DIRECTORIES` mechanism to 'borrow' > objects from other object stores. A repository with this kind > of incomplete object store is not suitable to be published for > use with dumb transports but otherwise is OK as long as > `objects/info/alternates` points at the right object stores > it borrows from. Ah, you spotted an obsolete documentation again ;-). ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 2:23 ` Jon Smirl 2006-09-08 8:36 ` Jakub Narebski @ 2006-09-08 18:42 ` linux 2006-09-08 21:13 ` Jon Smirl 1 sibling, 1 reply; 101+ messages in thread From: linux @ 2006-09-08 18:42 UTC (permalink / raw) To: jonsmirl, linux; +Cc: git Thanks for making suggestions. It's easier to knock straw men down than make them. But forgive me if I have fun pointing out the holes... >> Okay. Now, the server hasn't heard of one or more of those commit >> objects, because they're local changes. What then? > > Toss them, if they don't exist on the server the server is going to be > able to send any objects for them. Er... that makes "git pull git://git.other-hacker.personal/dir" impossible. There's a reason that git can handle this as it presently exists: a-b-c-d-e-f <-- HEAD / Hacker 1: o-o-o-o <-- origin Hacker 2: o-o-o-o-o-o <-- origin \ v-w-x-y <-- HEAD Suppose hacker 2, who happened to sync with upstream more recently, wants to pull from hacker 1, in the hopes of building a-b-c-d-e-f / \ Hacker 2: o-o-o-o-o-o z <-- HEAD \ / v-w-x-y This works now, and should continue to work. > Client would track this incrementally and not recompute it each time. Yes, this is probably possible. I haven't worked it out, but given a cache of precomputed (commit,depth) numbers, you can trace back from the new heads until you hit a cache entry. > If you follow the links in what looks to be a dangling object sooner > or latter you will run into the root object or a 'not present' object. > If you hit one of those the objects are not dangling and should be > preserved. I don't understand. It seems like you're saying that any commit without an ancestor, and all objects recursively pointing to it, are not garbage. How would anything ever get declared garbage in this case? > Here is another way to look at the shallow clone problem. The only > public ids in a git tree are the head and tag pointers. Send these to > the client. Now let's modify the git tools to fault the full objects > in one by one from the server whenever a git operation needs the > object. Dangling references would point to 'not-present' objects. Er... that would fault in a gigabyte the first time someone ran gitk, or several other history-browsing commands. Don't you need a way to say "tell the user this isn't present and will take an hour to download"? Well, thinking, there are actually two shallow clone possibilities: 1) Don't load the unwanted commit objects 2) Clone the commit objects, but not their trees. The latter would let you browse the commit history, at least. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 18:42 ` linux @ 2006-09-08 21:13 ` Jon Smirl 2006-09-08 22:27 ` Jakub Narebski 2006-09-08 23:09 ` Linus Torvalds 0 siblings, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-08 21:13 UTC (permalink / raw) To: linux@horizon.com; +Cc: git > > Here is another way to look at the shallow clone problem. The only > > public ids in a git tree are the head and tag pointers. Send these to > > the client. Now let's modify the git tools to fault the full objects > > in one by one from the server whenever a git operation needs the > > object. Dangling references would point to 'not-present' objects. > > Er... that would fault in a gigabyte the first time someone ran gitk, > or several other history-browsing commands. Don't you need a way to say > "tell the user this isn't present and will take an hour to download"? gitk would need to be modified to only run enough of the commit tree to draw what is displayed in the window. As you page down it would retrive more commits if needed. There is no need for gitk to run 250K commits when I'm usually never going to look at them all. Of course this may mean some rework for gitk. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 21:13 ` Jon Smirl @ 2006-09-08 22:27 ` Jakub Narebski 2006-09-08 23:09 ` Linus Torvalds 1 sibling, 0 replies; 101+ messages in thread From: Jakub Narebski @ 2006-09-08 22:27 UTC (permalink / raw) To: git Jon Smirl wrote: >> > Here is another way to look at the shallow clone problem. The only >> > public ids in a git tree are the head and tag pointers. Send these to >> > the client. Now let's modify the git tools to fault the full objects >> > in one by one from the server whenever a git operation needs the >> > object. Dangling references would point to 'not-present' objects. >> >> Er... that would fault in a gigabyte the first time someone ran gitk, >> or several other history-browsing commands. Don't you need a way to say >> "tell the user this isn't present and will take an hour to download"? > > gitk would need to be modified to only run enough of the commit tree > to draw what is displayed in the window. As you page down it would > retrive more commits if needed. There is no need for gitk to run 250K > commits when I'm usually never going to look at them all. Of course > this may mean some rework for gitk. Or remembering to set --max-count or --max-age parameters to gitk (which are then passed to git-rev-list). -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 21:13 ` Jon Smirl 2006-09-08 22:27 ` Jakub Narebski @ 2006-09-08 23:09 ` Linus Torvalds 2006-09-08 23:28 ` Jon Smirl 2006-09-09 1:05 ` Paul Mackerras 1 sibling, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-08 23:09 UTC (permalink / raw) To: Jon Smirl; +Cc: linux@horizon.com, Git Mailing List, Paul Mackerras On Fri, 8 Sep 2006, Jon Smirl wrote: > > gitk would need to be modified to only run enough of the commit tree > to draw what is displayed in the window. As you page down it would > retrive more commits if needed. There is no need for gitk to run 250K > commits when I'm usually never going to look at them all. Of course > this may mean some rework for gitk. Actually, it's more than a little rework. The _real_ problem is that gitk uses "--topo-order" to git-rev-list, which basically means that git-rev-list needs to parse EVERY SINGLE COMMIT. So far, nobody has written a topological sort that can reasonably efficiently handle partial trees and automatically notice when there is no way that a DAG cannot have any more references to a commit any more (if you can show that the result would cause a cycle, you could output a partial topological tree early). It's possible that no such topological sorting (ie the efficient kind, that can work with partial DAG's) even exists. So if you want to be able to visualize a partial repository, you need to teach git to not need "--topo-order" first. That's likely the _big_ change. After that, the rest should be easy. [ One way to do it might be: the normal ordering of revisions without "--topo-order) is "_close_ to topological", and gitk could just decide to re-compute the whole graph whenever it gets a commit that has a parent that it has already graphed. Done right, it would probably almost never actually generate re-computed graphs (if you only actually generate the graph when the user scrolls down to it). Getting rid of the --topo-order requirement would speed up gitk absolutely immensely, especially for unpacked cold-cache archives. So it would probably be a good thing to do, regardless of any shallow clone issues ] Hmm? Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:09 ` Linus Torvalds @ 2006-09-08 23:28 ` Jon Smirl 2006-09-08 23:45 ` Paul Mackerras 2006-09-10 12:41 ` Paul Mackerras 2006-09-09 1:05 ` Paul Mackerras 1 sibling, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-08 23:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux@horizon.com, Git Mailing List, Paul Mackerras On 9/8/06, Linus Torvalds <torvalds@osdl.org> wrote: > Getting rid of the --topo-order requirement would speed up gitk > absolutely immensely, especially for unpacked cold-cache archives. So it > would probably be a good thing to do, regardless of any shallow clone > issues ] > > Hmm? gitk takes about a minute to come up on the Mozilla repo when everything is in cache. It takes about twice as long when things are cold. It's enough of delay that I don't use the tool. > > Linus > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:28 ` Jon Smirl @ 2006-09-08 23:45 ` Paul Mackerras 2006-09-09 1:45 ` Jon Smirl 2006-09-10 12:41 ` Paul Mackerras 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-08 23:45 UTC (permalink / raw) To: Jon Smirl; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List Jon Smirl writes: > gitk takes about a minute to come up on the Mozilla repo when > everything is in cache. It takes about twice as long when things are > cold. It's enough of delay that I don't use the tool. How long does it take for git rev-list --topo-order to produce its first line of output, in the cache-hot case? That is, is it just gitk's graph layout that is taking the time, or is it git rev-list (meaning that gitk needs to stop using --topo-order)? Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:45 ` Paul Mackerras @ 2006-09-09 1:45 ` Jon Smirl 0 siblings, 0 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-09 1:45 UTC (permalink / raw) To: Paul Mackerras; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List On 9/8/06, Paul Mackerras <paulus@samba.org> wrote: > Jon Smirl writes: > > > gitk takes about a minute to come up on the Mozilla repo when > > everything is in cache. It takes about twice as long when things are > > cold. It's enough of delay that I don't use the tool. > > How long does it take for git rev-list --topo-order to produce its > first line of output, in the cache-hot case? That is, is it just > gitk's graph layout that is taking the time, or is it git rev-list > (meaning that gitk needs to stop using --topo-order)? Cold cache: jonsmirl@jonsmirl:/opt/t1$ time git rev-list --all --topo-order --parents >/dev/null real 0m27.687s user 0m5.672s sys 0m0.560s Hot cache: jonsmirl@jonsmirl:/opt/t1$ time git rev-list --all --topo-order --parents >/dev/null real 0m6.041s user 0m5.768s sys 0m0.276s jonsmirl@jonsmirl:/opt/t1$ I have enough RAM (3GB) that everything fits. Hot cache, 27 seconds to get first line of display in gitk It takes about two minutes get everything loaded into the window and scroll to the head This a 450MB pack, 250K commits, 2M objects. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:28 ` Jon Smirl 2006-09-08 23:45 ` Paul Mackerras @ 2006-09-10 12:41 ` Paul Mackerras 2006-09-10 14:56 ` Jon Smirl 2006-09-11 0:03 ` Shawn Pearce 1 sibling, 2 replies; 101+ messages in thread From: Paul Mackerras @ 2006-09-10 12:41 UTC (permalink / raw) To: Jon Smirl; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List Jon Smirl writes: > gitk takes about a minute to come up on the Mozilla repo when > everything is in cache. It takes about twice as long when things are > cold. It's enough of delay that I don't use the tool. I've been doing some timing measurements with Jon's repo. The results are interesting. It turns out that a lot of the initial delay is in reading all the references. With ~1600 heads and almost as many tags, readrefs was taking about 30 seconds (on my 2.5GHz quad G5), largely because it was doing two execs for each tag, a git rev-parse and a git cat-file, which was a bit stupid. With that fixed, it's about 5 or 6 seconds from starting gitk to seeing a window with commits listed in it in the hot cache case, even still using --topo-order. Without --topo-order it's about 2-3 seconds to seeing the window with commits listed, but the overall process takes longer (I still need to figure out why). In the cold cache case, it takes about 32 seconds to read all the references, even with the fixed version of readrefs, since that's about how long git ls-remote .git takes. Also, if you do gitk --all (as I often do), then gitk does a git rev-parse, which takes about 20 seconds (but then readrefs only takes 10 seconds since the heads are in cache). The bottom line is that I can speed up the startup for the hot-cache case quite a lot. The cold-cache case is going to take about 20-30 seconds whatever I do unless Linus or Junio can come up with a way to pack the heads and tags. I could read the refs asynchronously but there will still be a long delay in git rev-parse if you give arguments such as --all. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 12:41 ` Paul Mackerras @ 2006-09-10 14:56 ` Jon Smirl 2006-09-10 16:10 ` linux 2006-09-10 18:51 ` Junio C Hamano 2006-09-11 0:03 ` Shawn Pearce 1 sibling, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-10 14:56 UTC (permalink / raw) To: Paul Mackerras; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List On 9/10/06, Paul Mackerras <paulus@samba.org> wrote: > Jon Smirl writes: > > > gitk takes about a minute to come up on the Mozilla repo when > > everything is in cache. It takes about twice as long when things are > > cold. It's enough of delay that I don't use the tool. > > I've been doing some timing measurements with Jon's repo. The results > are interesting. Using the Mozilla repo you downloaded is not a normal situation since it is 100% packed. Most people are going to have a few thousand loose objects floating around too. Loose objects really slow things down. You noticed too that forks of small apps are relatively slow. The first pass of the import tools used fork for everything and took a week to run with 60% of the time spent in the kernel. There may be some work to do on fork in the kernel. Does mapping the kernel into the top 1G slow down fork of these small apps? Or are they dynamically linked to something that is bringing in millions of pages? When I was doing oprofile all of the time was in the actual fork call and page table copying. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 14:56 ` Jon Smirl @ 2006-09-10 16:10 ` linux 2006-09-10 18:00 ` Jon Smirl 2006-09-10 18:51 ` Junio C Hamano 1 sibling, 1 reply; 101+ messages in thread From: linux @ 2006-09-10 16:10 UTC (permalink / raw) To: jonsmirl; +Cc: git, linux, paulus, torvalds > You noticed too that forks of small apps are relatively slow. The > first pass of the import tools used fork for everything and took a > week to run with 60% of the time spent in the kernel. There may be > some work to do on fork in the kernel. Does mapping the kernel into > the top 1G slow down fork of these small apps? Or are they dynamically > linked to something that is bringing in millions of pages? When I was > doing oprofile all of the time was in the actual fork call and page > table copying. Actually, Linux has one of the fastest forks around, 100-200 us on modern x86. For large executables, the shared page tables patch (is it merged yet?) might help. No, mapping the kernel does NOT slow anything down. Those page tables are shared between all processes, and use the x86's global bit to avoid being forced out of the TLB during context switch. Attempting to make the kernel mapping vary between processes would slow things down enormously! I'm not finding reasonably recent Linux/FreeBSD comparisons. There are plenty of Max OS X comparisons, but it appears to particularly suck, so that's a bit of a straw man. See, e.g. http://www.anandtech.com/mac/showdoc.aspx?i=2520&p=7 or some figures from 2002: http://lists.apple.com/archives/darwin-kernel/2002/Oct/msg00009.html ---------------------------------------------------------------- Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos TCP inst hndl proc proc proc --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ---- g4 Darwin 1.4 799 2.21 3.19 12.6 16.4 30.4 3.78 9.88 1576 4095 13.K g4 Darwin 5.5 797 2.26 3.15 14.5 17.8 30.4 3.82 10.2 5685 12.K 40.K g4.local. Darwin 6.0 797 1.47 2.73 17.2 20.7 27.2 3.00 10.5 7517 17.K 41.K g4.local. Darwin 6.0 (after misconfiguration fixed) 1575 g4 Linux 2.4.18- 799 0.42 0.69 2.52 3.79 33.6 1.23 3.08 659. 2642 12.K --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ---- jim Darwin 1.4 448 2.85 4.23 9.53 17 4.83 14 2402 6817 19K jim.local Darwin 6.1 448 1.89 3.71 21 29 43 3.85 15 2832 8955 19K For small processes such as lmbench uses, you can see from the above that exec is much more expensive than fork. Also http://www.opersys.com/lrtbf/index.html http://marc.theaimsgroup.com/?l=linux-kernel&m=112086443319815 http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Administration/hyper-threading-linux/page4.html As for small apps and dynamic linking, glibc isn't tiny (prelink can help), but that's post-exec. If it's in the fork() call, then it's just a case that your app's memory space is big and even copying all of its page tables is expensive. If you're invoking one of the git programs that's a shell script, that adds its own startup overhead, of course. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 16:10 ` linux @ 2006-09-10 18:00 ` Jon Smirl 2006-09-10 19:03 ` linux 0 siblings, 1 reply; 101+ messages in thread From: Jon Smirl @ 2006-09-10 18:00 UTC (permalink / raw) To: linux@horizon.com; +Cc: git, paulus, torvalds On 10 Sep 2006 12:10:07 -0400, linux@horizon.com <linux@horizon.com> wrote: > Actually, Linux has one of the fastest forks around, 100-200 us on > modern x86. For large executables, the shared page tables patch (is it > merged yet?) might help. Here is the opfile of the kernel 3467889 18.9893 copy_page_range 2190416 11.9941 unmap_vmas 1156011 6.3300 page_fault 887794 4.8613 release_pages 860853 4.7138 page_remove_rmap 633243 3.4675 get_page_from_freelist 398773 2.1836 do_wp_page 344422 1.8860 __mutex_lock_slowpath 280070 1.5336 __handle_mm_fault 241713 1.3236 do_page_fault 238398 1.3054 __d_lookup 236654 1.2959 vm_normal_page At the time this was measured parsecvs was executing millions of git command using system(command). 40% of the CPU was in the kernel, it stayed that way for hours. 18262372 41.0441 /home/good/vmlinux 5465741 12.2841 /usr/bin/cvs 4374336 9.8312 /lib/libc-2.4.so 3627709 8.1532 /lib/libcrypto.so.0.9.8a 2494610 5.6066 /usr/bin/oprofiled 2471238 5.5540 /usr/lib/libz.so.1.2.3 945349 2.1246 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so 933646 2.0983 /usr/local/bin/git-read-tree 758776 1.7053 /usr/local/bin/git-write-tree 642502 1.4440 /lib/ld-2.4.so 472903 1.0628 /nvidia 379254 0.8524 /usr/local/bin/git-pack-objects Maybe we are looking at the wrong thing, it may be fast to fork a process, is it fast for the process to exit? -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 18:00 ` Jon Smirl @ 2006-09-10 19:03 ` linux 2006-09-10 20:00 ` Linus Torvalds 0 siblings, 1 reply; 101+ messages in thread From: linux @ 2006-09-10 19:03 UTC (permalink / raw) To: jonsmirl, linux; +Cc: git, paulus, torvalds > Maybe we are looking at the wrong thing, it may be fast to fork a > process, is it fast for the process to exit? The lmbench benchmark figures I cited are for fork+exit. The other is fork+exec+exit. > At the time this was measured parsecvs was executing millions of git > command using system(command). 40% of the CPU was in the kernel, it > stayed that way for hours. Ah! You are aware, aren't you, that system(string) invokes /bin/sh to parse the string, right? So you're starting bash several million times. ("man system" explains.) A direct fork() (or even faster, vfork) and exec() is going to have a lot less overhead, although it's more work to code. See Stevens for excellent examples. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 19:03 ` linux @ 2006-09-10 20:00 ` Linus Torvalds 2006-09-10 21:00 ` Jon Smirl 2006-09-10 22:41 ` Paul Mackerras 0 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-10 20:00 UTC (permalink / raw) To: linux; +Cc: jonsmirl, git, paulus On Sun, 10 Sep 2006, linux@horizon.com wrote: > > A direct fork() (or even faster, vfork) and exec() is going to have a > lot less overhead, although it's more work to code. See Stevens for > excellent examples. Well, that said, the Linux fork/exec/exit is certainly fairly efficient, but nothing can hide the fact that it _is_ a very expensive operation. So we may do a fork/exit in a millisecond, but on the other hand, we can generally open a file in a microsecond. So we're definitely talking about several orders of magnitude difference.. And a lot of the trivial git helpers are literally not doing a lot more than opening a file, reading it, and parsing it. That's for things like looking up the SHA1 of a branch head, for example. Doing it in-line might be a microsecond or two, while spawning a shell to execute git-rev-parse will take you several microseconds. Do it a million times, and the difference is now a second vs an hour. Or a few minutes vs a week. So in that sense, it is absolutely undeniable that fork/exit is slow. Everything is relative. The Linux fork/exit is fast when compared to most other systems fork/exit, but it's slow as hell when compared to just doing some small op directly in-line. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 20:00 ` Linus Torvalds @ 2006-09-10 21:00 ` Jon Smirl 2006-09-11 2:49 ` Linus Torvalds 2006-09-10 22:41 ` Paul Mackerras 1 sibling, 1 reply; 101+ messages in thread From: Jon Smirl @ 2006-09-10 21:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux, git, paulus On 9/10/06, Linus Torvalds <torvalds@osdl.org> wrote: > On Sun, 10 Sep 2006, linux@horizon.com wrote: > > > > A direct fork() (or even faster, vfork) and exec() is going to have a > > lot less overhead, although it's more work to code. See Stevens for > > excellent examples. > > Well, that said, the Linux fork/exec/exit is certainly fairly efficient, > but nothing can hide the fact that it _is_ a very expensive operation. cvs2svn + fastimport can import the same Mozilla repo in about 2hrs that was taking parsecvs about five days to do. The algorithms are not that different, cvs2svn is probably slower than parsecvs for detecting changesets. The difference is mostly due to the removal of forks. Is the kernel mapped into user processes using huge pages? That would reduce some of the pressure on TLBs. Another thing that should be mapped with huge pages is the video RAM. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 21:00 ` Jon Smirl @ 2006-09-11 2:49 ` Linus Torvalds 0 siblings, 0 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-11 2:49 UTC (permalink / raw) To: Jon Smirl; +Cc: linux, git, paulus On Sun, 10 Sep 2006, Jon Smirl wrote: > > Is the kernel mapped into user processes using huge pages? Yup. Huge pages and global. So the kernel should have fairly low TLB impact (depending on CPU - the kernel also has less locality than user space, since the data structures are all over the place, and some CPU's have very few hugepage entries. System code generally tends to have much worse I$ and TLB behaviour than most user apps, but at least the hugepage mappings tend to mitigate the TLB side somewhat). > Another thing that should be mapped with huge pages is the video RAM. I doubt it ends up being a huge deal in most cases. If you don't use the graphics chip to access video RAM with accelerated methods, the TLB is the _least_ of your worries. Going over the PCI bus (even with PCI-X etc) is the real problem. There's a reason you want to do acceleration even in 2D.. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 20:00 ` Linus Torvalds 2006-09-10 21:00 ` Jon Smirl @ 2006-09-10 22:41 ` Paul Mackerras 2006-09-11 2:55 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-10 22:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux, jonsmirl, git Linus Torvalds writes: > So we may do a fork/exit in a millisecond, but on the other hand, we can > generally open a file in a microsecond. So we're definitely talking about > several orders of magnitude difference.. Unless the file isn't in memory, in which case it's probably around a millisecond or more... Do you think there is any way to speed up the cold-cache case for git-ls-remote and git-rev-parse with thousands of heads and tags? Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 22:41 ` Paul Mackerras @ 2006-09-11 2:55 ` Linus Torvalds 2006-09-11 3:18 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-11 2:55 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux, jonsmirl, git On Mon, 11 Sep 2006, Paul Mackerras wrote: > > Do you think there is any way to speed up the cold-cache case for > git-ls-remote and git-rev-parse with thousands of heads and tags? Nothing obvious comes to mind. If we did the same pack-file approach that we do for objects, the problem ends up being that _updating_ things is really hard. What we could do (and might work) is that a "git repack" would create a "packed representation of the heads too". The issue is that once you create a pack, you _really_ don't want to use read-modify-write to modify it ever afterwards. That's how you get nasty corruption. The "append-only" side of the git object database is really what makes things so stable, and packing multiple heads in the same file automatically means that if something goes wrong, it's _disastrously_ wrong, in a way it isn't when you have separate files. So we could generate a "pack of references", but then any modifications would be done with the current loose "file objects" approach, and just have the filesystem override the pack-files. The problem then actually becomes one of _deleting_ branches, because then we'd have to add a "negative branch" loose object. Ugly. An alternate approach might be to keep the current filesystem layour, but simply do the readdir over the whole reference situation in one pass, then sort them by inode number, and look them up in order. That would help on some filesystems, at least. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:55 ` Linus Torvalds @ 2006-09-11 3:18 ` Linus Torvalds 2006-09-11 6:35 ` Junio C Hamano 2006-09-11 18:54 ` Junio C Hamano 2006-09-11 8:36 ` Paul Mackerras 2006-09-11 9:04 ` Jakub Narebski 2 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-11 3:18 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux, jonsmirl, git On Sun, 10 Sep 2006, Linus Torvalds wrote: > > If we did the same pack-file approach that we do for objects, the problem > ends up being that _updating_ things is really hard. What we could do (and > might work) is that a "git repack" would create a "packed representation > of the heads too". To clarify: I'm _not_ suggesting actually using the "pack-file" representation itself for the references. I'm saying that we could have something like a .git/refs-packed file, which could be (for example) just a plain linear text-file, of the form 19ed368b17abfb9ad5c7467ea74fd8a045d96b43 refs/heads/html 60a6bf5f53635005f4f68d8b8a33172309193623 refs/heads/maint 2d32e76f893b2ac432201ce7596c8bab691691e6 refs/heads/man a41fae9c46a4cb5e59cc1a17d19f6a3a6cbfbb2f refs/heads/master 61af0aaf26e003a61848e551bbd57e78e94eacdc refs/heads/next 585729203fef6ade64923277e7151b2e3a4ca330 refs/heads/pu 997283a8e87179b5b87a909686869d7843c8e19a refs/heads/todo a0e7d36193b96f552073558acf5fcc1f10528917 refs/tags/junio-gpg-pub d6602ec5194c87b0fc87103ca4d67251c76f233a refs/tags/v0.99 f25a265a342aed6041ab0cc484224d9ca54b6f41 refs/tags/v0.99.1 c5db5456ae3b0873fc659c19fafdde22313cc441 refs/tags/v0.99.2 7ceca275d047c90c0c7d5afb13ab97efdf51bd6e refs/tags/v0.99.3 b3e9704ecdf48869f635f0aa99ddfb513f885aff refs/tags/v0.99.4 07e38db6a5a03690034d27104401f6c8ea40f1fc refs/tags/v0.99.5 f12e22d4c12c3d0263fa681f25c06569f643da0f refs/tags/v0.99.6 f8696fcd2abc446a5ccda3e414b731eff2a7e981 refs/tags/v0.99.7 1094cf40f7029f803421c1dcc971238507c830c5 refs/tags/v0.99.7a da30c6c39cd3b048952a15929c5440acfd71b912 refs/tags/v0.99.7b 9165ec17fde255a1770886189359897dbb541012 refs/tags/v0.99.7c 02b2acff8bafb6d73c6513469cdda0c6c18c4138 refs/tags/v0.99.7d ... ie it would contain just a linear file with the "<hex></tab><refname>" format. Then, the way to look up a reference would be: - look it up in the traditional loose file - if it exists, and contains zeros (or not a hex value), it's considered a "negative entry", and the branch doesn't exist - otherwise, if it's a good SHA1, that's the result - if it's not there, look it up in the ".git/refs-packed" file by just doing a simple linear scan (trivial, and actually efficient - we're talking about just a few kB of memory after all, and the _cost_ is actually the IO, where "simple linear scan" is actually very good for performance). The end result would be that we'd probably have very few loose references (we'd get them whenever we change a ref, or delete one), making the lookup scale better. The big _bulk_ of the references tend to be very stable, notably they are tags that seldom - if ever - change, and would thus stay just in the packed refs file. So the normal situation would be that you'd have a few hundred (maybe, for a bigger project) refs in the single .git/refs-packed file, totalling a few kB of disk-space, and then you might have a handful of "active" heads that are in the traditional single-file format .git/refs/<filename> because they are beign actively modified and have changed since the last repack. I bet it would work fairly well. But somebody would need to implement it. 
The good news is that the refs-handling code tends to be _fairly_ well abstracted out, because we already wanted that for the logging thing. So we hopefully already don't actually access the loose file objects by hand from shell scripts any more - we use git-rev-parse and git-update-ref etc to look up refnames, and that all goes back to git/refs.c. So _most_ of the bulk of it would probably be in refs.c, but there's obviously also things like git-branch.sh that needs to be taught the new rules about deleting branches etc. Anybody want to try the above rules out? I bet the _only_ real issue would be "for_each_ref()", where it's important to _first_ do the loose objects, and remember them all, so that you do _not_ show the refs that are in the .git/refs-packed file and already got shown because they were loose. NOTE! It's important that whatever sceme used gets locking right. The above suggestion gets it right simply because it doesn't really _change_ anything. Any new or modified ref ends up using the old code, and using a ".lock" file and renaming it automatically does the same thing it ever did. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 3:18 ` Linus Torvalds @ 2006-09-11 6:35 ` Junio C Hamano 2006-09-11 18:54 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 6:35 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Paul Mackerras Linus Torvalds <torvalds@osdl.org> writes: > On Sun, 10 Sep 2006, Linus Torvalds wrote: >> >> If we did the same pack-file approach that we do for objects, the problem >> ends up being that _updating_ things is really hard. What we could do (and >> might work) is that a "git repack" would create a "packed representation >> of the heads too". > > To clarify: I'm _not_ suggesting actually using the "pack-file" > representation itself for the references. > > I'm saying that we could have something like a > > .git/refs-packed > > file, which could be (for example) just a plain linear text-file, of the > form > > 19ed368b17abfb9ad5c7467ea74fd8a045d96b43 refs/heads/html > 60a6bf5f53635005f4f68d8b8a33172309193623 refs/heads/maint > ... > > ie it would contain just a linear file with the "<hex></tab><refname>" > format. Then, the way to look up a reference would be: >... > NOTE! It's important that whatever sceme used gets locking right. The > above suggestion gets it right simply because it doesn't really _change_ > anything. Any new or modified ref ends up using the old code, and using a > ".lock" file and renaming it automatically does the same thing it ever > did. This is all interesting and can be done by somebody interested in the really core part of the system. I was just looking at Paul's "new" branch, and realized that there is something else that somebody _else_ can do in parallel to make life easier for gitk and gitweb, and it is not on so bare-metal-core side. The latest commit in Paul's "new" branch says: We were doing two execs for each tag - one to map the tag ID to a commit ID (which git ls-remote was already giving us as refs/tags/foo^{}) and one to read the contents of the tag for later display.... I think this is a typical problem any Porcelain layer faces. To grab summary information for all refs in a concise form in one go. We discussed such a potential command about a month ago with Jakub, but the suggestion did not go anywhere. http://thread.gmane.org/gmane.comp.version-control.git/25013/focus=25055 In addition to what I described there, such a command should at least allow specifying some special formatting for tag objects, such as dereferencing it once (i.e. what the object name and type of the referenced object is) and dereferencing it repeatedly until a non-tag appears (i.e. "$that_tag^{}" along with its type). The show-ref command will alleviate the fork+exec problem; the core side update you suggested for ref handling will improve the performance of the implementation of such a command and can be done in parallel. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 3:18 ` Linus Torvalds 2006-09-11 6:35 ` Junio C Hamano @ 2006-09-11 18:54 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 18:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: Paul Mackerras, linux, jonsmirl, git Linus Torvalds <torvalds@osdl.org> writes: > On Sun, 10 Sep 2006, Linus Torvalds wrote: >> >> If we did the same pack-file approach that we do for objects, the problem >> ends up being that _updating_ things is really hard. What we could do (and >> might work) is that a "git repack" would create a "packed representation >> of the heads too". > > To clarify: I'm _not_ suggesting actually using the "pack-file" > representation itself for the references. > > I'm saying that we could have something like a > > .git/refs-packed > > file, which could be (for example) just a plain linear text-file, of the > form > > 19ed368b17abfb9ad5c7467ea74fd8a045d96b43 refs/heads/html > 60a6bf5f53635005f4f68d8b8a33172309193623 refs/heads/maint > 2d32e76f893b2ac432201ce7596c8bab691691e6 refs/heads/man >... > 9165ec17fde255a1770886189359897dbb541012 refs/tags/v0.99.7c > 02b2acff8bafb6d73c6513469cdda0c6c18c4138 refs/tags/v0.99.7d > ... > > ie it would contain just a linear file with the "<hex></tab><refname>" > format. This depends on how expensive it is to read a file backwards, but it might make sense to have a single append-only file. ${GIT_DIR}/refs-packed file would be a sequence of records, each record is of form: type SP timestamp LF payload LF length LF where length and timestamp are ASCII integers, the former is the length of this record and you would use to compute the file offset you give lseek() to seek back to the beginning of this record. The latter is the creation time of this record. The type of the record is either full or incremental (perhaps 'F' and 'I'). A full record describes in the above "object-name refname" tuple format. An incremental record has entries only for the subset of refs, with 0{40} object name denoting deletion of a ref as in your suggestion. A few nice things about this representation: - It gives reflog for free (we need to add 'author' and 'reason' encoded after timestamp, though); it also gives reflog to tags which I sometimes miss. - Garbage collection is straightforward, although it requires locking the whole refs. - You can now have 'foo' and 'foo/a', 'foo/b', branches at the same time. It is natural to organize a topic branch 'foo' which consists of a merge of subtopics 'foo/a', 'foo/b', ..., but this cannot be expressed with the current directory-file based scheme. Some minor details: - After every N (I do not think we would need to make this configurable, but we need to decide on a sane value) updates, we would want to write a full record so that we do not have to traverse too long a chain. - An incremental record could be delta since the last full, instead of delta since the last record, as the above introduction hints. If we start from A, B, and C, and then A and B are modified in turn, we might have: Full: A B C Incr: A' Incr: B' but we could have instead: Full: A B C Incr: A' Incr: A' B' For this to make sense, the record header needs to record a back-pointer to the last full record before the current incremental record. With that, we would need to read at most two records to find out the current status. Locking between writers and readers can become problematic. 
The reader would read the last line to find the record length, seek that much back to find the beginning of the "then-latest" record, but when there is a simultanous writer, there is no guarantee that what at the last line of the file _is_ the length of the last record. The writer may be in the middle of appending something. We may be able to work it around by: - Add hash of the record at the end; - The reader assumes that there is usually no collision, reads the hash and the length, seeks back and reads the record and validate the hash -- if the hash does not match, it is likely that it detected the writer-reader collision so it should wait until the file grows and stops growing and then retry. But writer-writer contention obviously needs a file lock. Maybe we can use flock/lockf/fcntl(F_SETLKW)? If we are going to do that then protecting the reader-writer side with the same mechanism may be simpler and cleaner. > NOTE! It's important that whatever sceme used gets locking right. The > above suggestion gets it right simply because it doesn't really _change_ > anything. Any new or modified ref ends up using the old code, and using a > ".lock" file and renaming it automatically does the same thing it ever > did. The above changes things quite a bit, so locking aspect of it needs to be thought through thouroughly. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:55 ` Linus Torvalds 2006-09-11 3:18 ` Linus Torvalds @ 2006-09-11 8:36 ` Paul Mackerras 2006-09-11 14:26 ` linux 2006-09-11 9:04 ` Jakub Narebski 2 siblings, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-11 8:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux, jonsmirl, git Linus Torvalds writes: > So we could generate a "pack of references", but then any modifications > would be done with the current loose "file objects" approach, and just > have the filesystem override the pack-files. The problem then actually > becomes one of _deleting_ branches, because then we'd have to add a > "negative branch" loose object. Ugly. Could we do a cache of the refs that stores the stat information for each of the files under .git/refs plus the sha1 that the ref points to? In other words this cache would do for the refs what the index does for the working directory. Reading all the refs would mean we still had to stat each of the files, but that's much quicker than reading them in the cold-cache case. In the common case when most of the stat information matches, we don't have to read the file because we have the sha1 that the file contains right there in the cache. Ideally we would have two sha1 values in the cache - the sha1 in the file, and if that is the ID of a tag object, we would also put the sha1 of the commit that the tag points to in the cache. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 8:36 ` Paul Mackerras @ 2006-09-11 14:26 ` linux 2006-09-11 15:01 ` Jon Smirl 2006-09-11 16:47 ` Junio C Hamano 0 siblings, 2 replies; 101+ messages in thread From: linux @ 2006-09-11 14:26 UTC (permalink / raw) To: paulus, torvalds; +Cc: git, jonsmirl, linux > Could we do a cache of the refs that stores the stat information for > each of the files under .git/refs plus the sha1 that the ref points > to? In other words this cache would do for the refs what the index > does for the working directory. Reading all the refs would mean we > still had to stat each of the files, but that's much quicker than > reading them in the cold-cache case. In the common case when most of > the stat information matches, we don't have to read the file because > we have the sha1 that the file contains right there in the cache. Well, that could save one of two seeks, but that's not *much* quicker. (Indeed, a git ref would fit into the 60 bytes of block pointer space in an ext2/3 inode if regular files were stuffed there as well as symlinks.) > Ideally we would have two sha1 values in the cache - the sha1 in the > file, and if that is the ID of a tag object, we would also put the > sha1 of the commit that the tag points to in the cache. Now that's not a bad idea. Hacking it in to Linus's scheme, that's <foo sha>\t<foo^{} sha>\tfoo A couple of thoughts: 1) I bet Hans Reiser is enjoying this; he's been agitating for better lots-of-small-files support for years. 2) Since I've written about two caches in a few minutes (here and in git-rev-list), a standardized cache validation hook for git-fsck-objects and git-prune's use might be useful. 3) If we use Linus's idea of a flat "static refs" file overridden by loose refs (presumably, refs would be stuffed in if their mod times got old enough, and on initial import you'd use the timestamp of the commit they point to), we'll have to do a bit of a dance to move refs to and from it. Basically, to move refs into the refs file, it's - Read all the old refs and loose refs and write the new refs file. - Rename the new refs file into place. - For each loose ref moved in, lock it, verify it hasn'd changed, and delete it. with some more locking to prevent two people from doing this at once. Folks looking up tags will do an FS search, then validate their refs file cache, then if necessary, suck in the refs file. Now, exploding a refs file into loose refs is tricky. There's the possible race condition with a reader: A: Looks for loose ref "foo", doesn't find it. B: Write out loose ref "foo" B: Deletes now-unpacked refs file A: Looks for refs file, doesn't find it. A: Concludes that ref "foo" doesn't exist. The only solution I can think of is to stat the refs file at the start of the operation and restart from the beginning if it changes by the time it actually opens and read it. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 14:26 ` linux @ 2006-09-11 15:01 ` Jon Smirl 2006-09-11 16:47 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-11 15:01 UTC (permalink / raw) To: linux@horizon.com; +Cc: paulus, torvalds, git On 11 Sep 2006 10:26:44 -0400, linux@horizon.com <linux@horizon.com> wrote: > > Could we do a cache of the refs that stores the stat information for > > each of the files under .git/refs plus the sha1 that the ref points > > to? In other words this cache would do for the refs what the index > > does for the working directory. Reading all the refs would mean we > > still had to stat each of the files, but that's much quicker than > > reading them in the cold-cache case. In the common case when most of > > the stat information matches, we don't have to read the file because > > we have the sha1 that the file contains right there in the cache. > > Well, that could save one of two seeks, but that's not *much* quicker. > (Indeed, a git ref would fit into the 60 bytes of block pointer space > in an ext2/3 inode if regular files were stuffed there as well as symlinks.) Does everyone hate my idea of putting the sha in the file name and using the directory structure as a simple database with locking? No one made any comments. ------------------------------------------------ Here's a hack, instead of of putting the sha inside the file, put the sha into the filename. master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 Now you can just do a directory listing and get all of the data quickly. To keep the existing porcelain working add a symlink. ln -s master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 master You might want the sha1 encoded names in a new director -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 14:26 ` linux 2006-09-11 15:01 ` Jon Smirl @ 2006-09-11 16:47 ` Junio C Hamano 2006-09-11 21:52 ` Paul Mackerras 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 16:47 UTC (permalink / raw) To: linux; +Cc: git, paulus, torvalds linux@horizon.com writes: >> Ideally we would have two sha1 values in the cache - the sha1 in the >> file, and if that is the ID of a tag object, we would also put the >> sha1 of the commit that the tag points to in the cache. > > Now that's not a bad idea. Hacking it in to Linus's scheme, that's > > <foo sha>\t<foo^{} sha>\tfoo That's a dubious idea. - Why assume a tag points directly at a commit, or if it is not, why assume "foo^{}" (dereferencing repeatedly until we get a non-tag) is special? - Why assume the user wants access to only the object name of what the tag points at? Perhaps most users would want to have its type, dates (committer and author), and probably the first line of the commit message if it is (and most likely it is) a commit? -- at least gitweb and gitk would want these. You should separate the issue of the internal data structure implementation and programming interface for Porcelains. From the internal data structure point-of-view, the second one is a redundant piece of information. Caching it _would_ speed up the access to it, but then the issue becomes where we draw the line to stop. It is probably more useful to think about what kind of information formatted in what way is often wanted by Porcelains who want to grab many refs in one-go. If you can come up with a set that can satisfy everybody using that as the cache file format would be fine, but I strongly doubt you can satisfy everybody. In which case, thinking about the ways for the Porcelain to express flexibly what information is wanted and formatted in what way, and have a command to access that (git-show-refs, anybody?) would be more fruitful, and at that point, the internal representation of the cached data becomes the implementation detail. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 16:47 ` Junio C Hamano @ 2006-09-11 21:52 ` Paul Mackerras 2006-09-11 23:47 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-11 21:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: linux, git, torvalds Junio C Hamano writes: > That's a dubious idea. > > - Why assume a tag points directly at a commit, or if it is > not, why assume "foo^{}" (dereferencing repeatedly until we > get a non-tag) is special? Umm, I'm not sure what you're getting at here - if one shouldn't make those assumptions, why does git ls-remote output both the tag and tag^{} lines? > - Why assume the user wants access to only the object name of > what the tag points at? Perhaps most users would want to > have its type, dates (committer and author), and probably the > first line of the commit message if it is (and most likely it > is) a commit? -- at least gitweb and gitk would want these. There are two things here. Gitk needs to know which IDs have tags when displaying the graph, and their names. It doesn't need to know the other details until someone clicks on the commit or the tag. Thus the information that needs to be collected for every tag at startup is just the name and the sha1 id of the commit it eventually points to (if it doesn't eventually point to a commit it's basically not interesting). Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 21:52 ` Paul Mackerras @ 2006-09-11 23:47 ` Junio C Hamano 2006-09-12 0:06 ` Jakub Narebski 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 23:47 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux, git, torvalds Paul Mackerras <paulus@samba.org> writes: > Junio C Hamano writes: > >> That's a dubious idea. >> >> - Why assume a tag points directly at a commit, or if it is >> not, why assume "foo^{}" (dereferencing repeatedly until we >> get a non-tag) is special? > > Umm, I'm not sure what you're getting at here - if one shouldn't make > those assumptions, why does git ls-remote output both the tag and > tag^{} lines? This was originally done to support Cogito's tag following which was in its infancy. So in that sense it is already special (iow we know one user that can take advantage of it), but my point was that its usefulness for a commit chain fetching application (i.e. Cogito) does not automatically mean it is also useful for visualizers like gitk and gitweb. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 23:47 ` Junio C Hamano @ 2006-09-12 0:06 ` Jakub Narebski 2006-09-12 0:18 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Jakub Narebski @ 2006-09-12 0:06 UTC (permalink / raw) To: git Junio C Hamano wrote: > Paul Mackerras <paulus@samba.org> writes: > >> Junio C Hamano writes: >> >>> That's a dubious idea. >>> >>> - Why assume a tag points directly at a commit, or if it is >>> not, why assume "foo^{}" (dereferencing repeatedly until we >>> get a non-tag) is special? >> >> Umm, I'm not sure what you're getting at here - if one shouldn't make >> those assumptions, why does git ls-remote output both the tag and >> tag^{} lines? > > This was originally done to support Cogito's tag following which > was in its infancy. So in that sense it is already special (iow > we know one user that can take advantage of it), but my point > was that its usefulness for a commit chain fetching application > (i.e. Cogito) does not automatically mean it is also useful for > visualizers like gitk and gitweb. Actually 'git ls-remote .' _is_ useful for gitweb, see the new git_get_references code. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-12 0:06 ` Jakub Narebski @ 2006-09-12 0:18 ` Junio C Hamano 2006-09-12 0:25 ` Jakub Narebski 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-12 0:18 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> writes: > Actually 'git ls-remote .' _is_ useful for gitweb, see the new > git_get_references code. You are missing the point. We are not discussing if having foo^{} is useful. There is no argument about it. Without it the user is forced to ask rev-parse. The point is if it is Ok to assume foo and foo^{} (and nothing else) are enough to make Porcelains and visualizers happy, and I suspected the answer was no (remember show-ref discussion?). ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-12 0:18 ` Junio C Hamano @ 2006-09-12 0:25 ` Jakub Narebski 0 siblings, 0 replies; 101+ messages in thread From: Jakub Narebski @ 2006-09-12 0:25 UTC (permalink / raw) To: git Junio C Hamano wrote: > Jakub Narebski <jnareb@gmail.com> writes: > >> Actually 'git ls-remote .' _is_ useful for gitweb, see the new >> git_get_references code. > > You are missing the point. > > We are not discussing if having foo^{} is useful. There is no > argument about it. Without it the user is forced to ask > rev-parse. > > The point is if it is Ok to assume foo and foo^{} (and nothing > else) are enough to make Porcelains and visualizers happy, and I > suspected the answer was no (remember show-ref discussion?). Well, it is enough for ref markers (this commit is this head or tag), for dumb http transport (indo/refs output is the same as ls-remote), and not much else... -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:55 ` Linus Torvalds 2006-09-11 3:18 ` Linus Torvalds 2006-09-11 8:36 ` Paul Mackerras @ 2006-09-11 9:04 ` Jakub Narebski 2 siblings, 0 replies; 101+ messages in thread From: Jakub Narebski @ 2006-09-11 9:04 UTC (permalink / raw) To: git Linus Torvalds wrote: > On Mon, 11 Sep 2006, Paul Mackerras wrote: >> >> Do you think there is any way to speed up the cold-cache case for >> git-ls-remote and git-rev-parse with thousands of heads and tags? > > Nothing obvious comes to mind. > > If we did the same pack-file approach that we do for objects, the problem > ends up being that _updating_ things is really hard. What we could do (and > might work) is that a "git repack" would create a "packed representation > of the heads too". Because usually there is reasonable number of heads, and that is the tags that might be problem, because the number of tags likes to grow with the history, perhaps we could pack only tags heads into the pack? This has additional advantage that tags, especially annotated tags (i.e. true tags objects) changes rarely. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 14:56 ` Jon Smirl 2006-09-10 16:10 ` linux @ 2006-09-10 18:51 ` Junio C Hamano 2006-09-11 0:04 ` Shawn Pearce 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-10 18:51 UTC (permalink / raw) To: Jon Smirl; +Cc: git, Shawn Pearce "Jon Smirl" <jonsmirl@gmail.com> writes: > Using the Mozilla repo you downloaded is not a normal situation since > it is 100% packed. Most people are going to have a few thousand loose > objects floating around too. Loose objects really slow things down. When you have a few thousand loose objects you definitely should consider repacking (not "repack -a" for Mozilla case, perhaps). We could benefit from the suggested organization of one base archive pack plus a current active pack. The core side code to help doing so was posted here which followed a discussion on how to have repack make use of it last week. http://thread.gmane.org/gmane.comp.version-control.git/26218/focus=26326 Any takers? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 18:51 ` Junio C Hamano @ 2006-09-11 0:04 ` Shawn Pearce 2006-09-11 0:42 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 0:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jon Smirl, git Junio C Hamano <junkio@cox.net> wrote: > We could benefit from the suggested organization of one base > archive pack plus a current active pack. The core side code to > help doing so was posted here which followed a discussion on how > to have repack make use of it last week. > > http://thread.gmane.org/gmane.comp.version-control.git/26218/focus=26326 > > Any takers? I thought this was settled and you had it implemented? Is this still an open topic? -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:04 ` Shawn Pearce @ 2006-09-11 0:42 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 0:42 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: >> We could benefit from the suggested organization of one base >> archive pack plus a current active pack. The core side code to >> help doing so was posted here which followed a discussion on how >> to have repack make use of it last week. >> >> http://thread.gmane.org/gmane.comp.version-control.git/26218/focus=26326 >> >> Any takers? > > I thought this was settled and you had it implemented? Is this > still an open topic? I've implemented the core side and I think it is ready. People who actually have need for such a repack are welcome to take a crack at it at the Porcelain-ish level. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 12:41 ` Paul Mackerras 2006-09-10 14:56 ` Jon Smirl @ 2006-09-11 0:03 ` Shawn Pearce 2006-09-11 0:41 ` Junio C Hamano 1 sibling, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 0:03 UTC (permalink / raw) To: Paul Mackerras Cc: Jon Smirl, Linus Torvalds, linux@horizon.com, Git Mailing List Paul Mackerras <paulus@samba.org> wrote: [snip] > The bottom line is that I can speed up the startup for the hot-cache > case quite a lot. The cold-cache case is going to take about 20-30 > seconds whatever I do unless Linus or Junio can come up with a way to > pack the heads and tags. I could read the refs asynchronously but > there will still be a long delay in git rev-parse if you give > arguments such as --all. I've been thinking about implementing ref storage within a Git tree object. Just store the commit/tag/object IDs in a tree (or graph of trees) with a mode of '0'. Anchor that under '.git/refs-tree'. Any edit of a ref would "lock" .git/refs-tree, create a new tree containing the update, then replace .git/refs-tree. But it would put additional stress on the objects directory by creating a lot of trees which would never get pulled into pack files and thus would need to be pruned away on a regular basis. It also would make parallel updates more difficult on the server side as everyone would need to wait for the lock to .git/refs-tree before they can change any ref; today users only need to wait for the ref they are trying to change. It also doesn't help looking up a ref quickly; although trees are sorted they are variable length entries which forces the application to read the entire tree to find its entry. Given those three downsides I haven't put anything to code yet. -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:03 ` Shawn Pearce @ 2006-09-11 0:41 ` Junio C Hamano 2006-09-11 1:04 ` Jakub Narebski 2006-09-11 2:11 ` Jon Smirl 0 siblings, 2 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 0:41 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > I've been thinking about implementing ref storage within a Git tree > object. Just store the commit/tag/object IDs in a tree (or graph > of trees) with a mode of '0'. Anchor that under '.git/refs-tree'. > Any edit of a ref would "lock" .git/refs-tree, create a new tree > containing the update, then replace .git/refs-tree. > > But it would put additional stress on the objects directory by > creating a lot of trees which would never get pulled into pack > files and thus would need to be pruned away on a regular basis. I do not see any advantage of making such a phoney object at all, but I do agree that the current one file per ref can be less than optimal when your repository has tons of tags or refs. We've designed the ref interface well enough so that only very narrow core functions know the implementation (and properly written Porcelain would access refs via update-ref and symbolic-ref interfaces), so the impact of changing the one-file-per-ref implementation to something based on a single simple databasy file (e.g. gdbm, bdb, sqlite, ...) would be reasonably limited. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:41 ` Junio C Hamano @ 2006-09-11 1:04 ` Jakub Narebski 2006-09-11 2:44 ` Shawn Pearce 2006-09-11 2:11 ` Jon Smirl 1 sibling, 1 reply; 101+ messages in thread From: Jakub Narebski @ 2006-09-11 1:04 UTC (permalink / raw) To: git Junio C Hamano wrote: > the impact of changing the > one-file-per-ref implementation to something based on a single > simple databasy file (e.g. gdbm, bdb, sqlite, ...) One of the complaints against Subversion was that it use BerkeleyDB (bdb) backend... but it was before it acquired fsfs interface. Perhaps we could use it too. Or perhaps it is for something like Electra, or ReiserFS file properites access... -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 1:04 ` Jakub Narebski @ 2006-09-11 2:44 ` Shawn Pearce 2006-09-11 5:27 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 2:44 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> wrote: > Junio C Hamano wrote: > > > the impact of changing the > > one-file-per-ref implementation to something based on a single > > simple databasy file (e.g. gdbm, bdb, sqlite, ...) > > One of the complaints against Subversion was that it use BerkeleyDB > (bdb) backend... but it was before it acquired fsfs interface. Perhaps > we could use it too. I'm against the latest Berkely DB (Sleepy Cat) implementations. Every time I've stored data in them or in applications which use them I've lost everything. GNU dbm might not be too bad. SQL Lite is overkill. I was thinking of using a tree object only because its a well defined Git file format that's already in use. Plus its "easy" to repair by loading it into an index, twiddline it and invoking git-write-tree to convert it back. But there's a lot of downsides. This is probably something that is easily solved by a simple fixed record format holding a 20 byte SHA1 (binary) and a fixed width null terminated string holding the ref name, with the records sorted by ref name. Its yet another file format with yet another set of utilities needed but we pretty much have those (update-ref). > Or perhaps it is for something like Electra, or ReiserFS file properites > access... Except not everyone has those filesystems. :-) -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:44 ` Shawn Pearce @ 2006-09-11 5:27 ` Junio C Hamano 2006-09-11 6:08 ` Shawn Pearce 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 5:27 UTC (permalink / raw) To: Shawn Pearce; +Cc: git, Jakub Narebski Shawn Pearce <spearce@spearce.org> writes: > Jakub Narebski <jnareb@gmail.com> wrote: >> Junio C Hamano wrote: >> >> > the impact of changing the >> > one-file-per-ref implementation to something based on a single >> > simple databasy file (e.g. gdbm, bdb, sqlite, ...) >> >> One of the complaints against Subversion was that it use BerkeleyDB >> (bdb) backend... but it was before it acquired fsfs interface. Perhaps >> we could use it too. > > I'm against the latest Berkely DB (Sleepy Cat) implementations. > Every time I've stored data in them or in applications which use > them I've lost everything. GNU dbm might not be too bad. SQL Lite > is overkill. >... > This is probably something that is easily solved by a simple fixed > record format holding a 20 byte SHA1 (binary) and a fixed width null > terminated string holding the ref name, with the records sorted > by ref name. Its yet another file format with yet another set of > utilities needed but we pretty much have those (update-ref). Yup. That is one reasonable implementation of "single simple databasy file" I suggested. Or we could just borrow .git/info/refs format. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 5:27 ` Junio C Hamano @ 2006-09-11 6:08 ` Shawn Pearce 2006-09-11 7:11 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 6:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Jakub Narebski Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > This is probably something that is easily solved by a simple fixed > > record format holding a 20 byte SHA1 (binary) and a fixed width null > > terminated string holding the ref name, with the records sorted > > by ref name. Its yet another file format with yet another set of > > utilities needed but we pretty much have those (update-ref). > > Yup. That is one reasonable implementation of "single simple > databasy file" I suggested. Or we could just borrow .git/info/refs > format. I think Linus was suggesting the same thing, just not nearly as obvious. He also was looking for ways to not update it frequently. But I can't say I'm a fan of negative entries. :-) I'd probably rather code this as a fixed width binary heap file. It should scale reasonable well to larger sets of refs. I know someone on the list wanted to store Perforce change identifiers as tags in Git but it was taking up more disk in ref files than in pack-*.pack. That could be a very larger number of refs and having O(log N) access there might be good. But at this point I have more on my plate than I have time for. :-) I've got to get that map window code merged against master or next (preference Junio? next touches code I'm touching too but those aren't in master yet) for 32 bit offset. Should be slightly easier to do but the pack verification change really threw a wrench into my merge. Other things have also kept me from really working on it. I probably won't be able to look at it again until Wednesday. I also want to spend more time on the dictionary packing prototype to see if its worth spending even more time on. My amd64 box is happily installing itself as write this (finally!) so I now have some more reasonable hardware to deal with transforming the Mozilla pack. :) -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 6:08 ` Shawn Pearce @ 2006-09-11 7:11 ` Junio C Hamano 2006-09-11 17:52 ` Shawn Pearce 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 7:11 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > I think Linus was suggesting the same thing, just not nearly as > obvious. He also was looking for ways to not update it frequently. > But I can't say I'm a fan of negative entries. :-) I like Linus's suggestion quite a lot actually. It would be a nice little project that needs to touch reasonably well-isolated area of the code, but I suspect it would need a bit more exposure and familiarity to the existing code than somebody who looks at the core code for the first time. > But at this point I have more on my plate than I have time for. :-) You've done quite a lot of work on refs area yourself, so you might want to give somebody else the chance to try first ;-). > I've got to get that map window code merged against master or next > (preference Junio? next touches code I'm touching too but those > aren't in master yet) for 32 bit offset. As you know I've shelved the 64-bit offset stuff. My preference is to base it on 106d710b but basing it on 'next' would be fine. Between 106d710b and next there is only one unrelated change to sha1_file.c, which is the hexval[] change that is in 'master'. I've sent out "What's in" tonight but forgot to talk about plans for topics in "next", so here is an update. - These things are probably Ok, and after a bit more testing I'd like to push them out to 'master' by my Wednesday: - Andy Whitcroft's "send-pack to use git-rev-list --stdin" - "git apply" to apply binary without explicit --binary flag; - "git diff --binary" to do full-index only for binary patch; - "git unpack-objects -r", except it needs a better name (perhaps --recover). - And these by the end of week: - Jeff King's git-runstatus; - Frank and Rene's git-archive along with git-daemon change; - I am not _so_ comfortable with these quite yet, but plan to test it a bit more to gain confidence and decide when to push out to 'master' perhaps by the middle of the week: - pack-objects --unpacked=this-pack --unpacked=that-pack - pack-objects --stdin > I also want to spend more time on the dictionary packing prototype to > see if its worth spending even more time on. So far I liked the per-object-type dictionary conjecture the most, primarily because my gut feeling tells me that it would involve the least amount of hassle in the interoperability area. But honestly speaking I am not looking forward to a packfile format update at this moment. I'd like things to settle down a bit now, after merging good bits from 'next' to 'master' and declare 1.4.3 first, perhaps by the end of the month. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 7:11 ` Junio C Hamano @ 2006-09-11 17:52 ` Shawn Pearce 0 siblings, 0 replies; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 17:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Junio C Hamano <junkio@cox.net> wrote: > As you know I've shelved the 64-bit offset stuff. My preference > is to base it on 106d710b but basing it on 'next' would be fine. > Between 106d710b and next there is only one unrelated change to > sha1_file.c, which is the hexval[] change that is in 'master'. OK. I'll rebase on 106d710b and try to get the series out in a day or two. > So far I liked the per-object-type dictionary conjecture the > most, primarily because my gut feeling tells me that it would > involve the least amount of hassle in the interoperability area. Agreed, except I don't think its going to save us very much (~4%) and its enough of a change still to make things a little ugly. If the "true" dictionary compression idea has any value it may save us even more and make it easier to implement fast full text searching across files. But its definately more complex. So I'm going to shut up now. > But honestly speaking I am not looking forward to a packfile > format update at this moment. I'd like things to settle down a > bit now, after merging good bits from 'next' to 'master' and > declare 1.4.3 first, perhaps by the end of the month. No worry there. I won't have enough time between now and the end of this month to create a dictionary pack that's even ready for pu, let alone next. I hope we see 1.4.3 before then. :-) -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:41 ` Junio C Hamano 2006-09-11 1:04 ` Jakub Narebski @ 2006-09-11 2:11 ` Jon Smirl 1 sibling, 0 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-11 2:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, git On 9/10/06, Junio C Hamano <junkio@cox.net> wrote: > I do not see any advantage of making such a phoney object at > all, but I do agree that the current one file per ref can be > less than optimal when your repository has tons of tags or > refs. Here's a hack, instead of of putting the sha inside the file, put the sha into the filename. master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 Now you can just do a directory listing and get all of the data quickly. To keep the existing porcelain working add a symlink. ln -s master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 master You might want the sha1 encoded names in a new directory -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:09 ` Linus Torvalds 2006-09-08 23:28 ` Jon Smirl @ 2006-09-09 1:05 ` Paul Mackerras 2006-09-09 2:56 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-09 1:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List Linus Torvalds writes: > [ One way to do it might be: the normal ordering of revisions without > "--topo-order) is "_close_ to topological", and gitk could just decide > to re-compute the whole graph whenever it gets a commit that has a > parent that it has already graphed. Done right, it would probably almost > never actually generate re-computed graphs (if you only actually > generate the graph when the user scrolls down to it). > > Getting rid of the --topo-order requirement would speed up gitk > absolutely immensely, especially for unpacked cold-cache archives. So it > would probably be a good thing to do, regardless of any shallow clone > issues ] > > Hmm? I recently added code to gitk to add extra commits after the graph has been laid out, provided that the extra commits have one parent and no children. If I generalized that to handle arbitrary numbers of parents and children, it might go close to what you're suggesting. It might get nasty if we have laid out A and then later B, and then C comes along and turns out to be a child of A but a parent of B, meaning that both B and C have to be put above A. Another thing I have been thinking of is that gitk probably should impose a time limit of say 3 months by default, unless the user specifies some other ending condition (such as commitid.., ^commitid or --since=something-else). Together with a menu to select the time limit, I think that would be quite usable and would make gitk start up *much* faster. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 1:05 ` Paul Mackerras @ 2006-09-09 2:56 ` Linus Torvalds 2006-09-09 3:23 ` Junio C Hamano 2006-09-09 3:31 ` Paul Mackerras 0 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 2:56 UTC (permalink / raw) To: Paul Mackerras; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Paul Mackerras wrote: > > It might get nasty if we have laid out A and then later B, and then C > comes along and turns out to be a child of A but a parent of B, > meaning that both B and C have to be put above A. Right. This is why I would suggest just recomputing the thing entirely, instead of trying to make it incremental. It would definitely cause re-organization of the tree when you find a new relationship between two old commits that basically creates a new orderign between them. Trying to do that incrementally sounds really really hard. But just _detecting_ the situation that you are getting a new commit that has a parent that you have already shown (and that thus must go _before_ one of the things you've shown already, and implies a new line reacing it) is very easy. And then re-generating the graph might not be too bad. Running "gitk" on the kernel with a hot cache and fully packed (and enough memory) isn't too bad, because git rev-list literally takes under a second for that case, so the advantage of avoiding --topo-order isn't _that_ noticeable. And the "fully packed" part is probably the most important part. A packed Linux historic tree takes just under six seconds cold-cache and under two seconds hot-cache, but that's because pack-files are _really_ good at mapping all the commits in just one go, and at the beginning of the pack-file. But try the same thing with a fully unpacked kernel, and you'll see the real pain of having to traverse all of history. We're talking minutes, even when hot in the cache. > Another thing I have been thinking of is that gitk probably should > impose a time limit of say 3 months by default I don't think time is good, because for an old project (like the Linux historic repo), three months is literally nothing. So it would be much better to go simply by numbers, but sadly, that won't help - simply because if we want to know the 100 first commits in --topo-order, we'll do the whole history and sort it _first_, and then do the "pick top 100 commits" only after that. (Which may not be sensible, of course, but it's what we do.. It's almost impossible to do it the other way, because you won't know until the point where you do "get_revision()" if you are _actually_ going to use a commit or not, so counting them before the end is fundamentally very hard). > Together with a menu to select the time limit, I think that would be > quite usable and would make gitk start up *much* faster. The menu would help, of course. But it would be even nicer if you'd be able to make do without the --topo-order. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 2:56 ` Linus Torvalds @ 2006-09-09 3:23 ` Junio C Hamano 2006-09-09 3:31 ` Paul Mackerras 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-09 3:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > And the "fully packed" part is probably the most important part. A packed > Linux historic tree takes just under six seconds cold-cache and under two > seconds hot-cache, but that's because pack-files are _really_ good at > mapping all the commits in just one go, and at the beginning of the > pack-file. > > But try the same thing with a fully unpacked kernel, and you'll see the > real pain of having to traverse all of history. We're talking minutes, > even when hot in the cache. Just a random thought... Does that suggest the general purpose filesystem you would use for "a fully unpacked" case have room for improvements? Would a special purpose filesystem that is designed to host .git/objects/??/ directory help? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 2:56 ` Linus Torvalds 2006-09-09 3:23 ` Junio C Hamano @ 2006-09-09 3:31 ` Paul Mackerras 2006-09-09 4:04 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-09 3:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List Linus Torvalds writes: > Right. This is why I would suggest just recomputing the thing entirely, > instead of trying to make it incremental. It would definitely cause > re-organization of the tree when you find a new relationship between two > old commits that basically creates a new orderign between them. I think it may be possible to back up the layout algorithm to the row where the parent is and redo it from there down. Interestingly, I only see two commits out of order on the linux-2.6 repository, but on the full history (up to linux-2.6.12-rc2) I see 520. > The menu would help, of course. But it would be even nicer if you'd be > able to make do without the --topo-order. The graph does seem to look a little nicer with --topo-order, in that it tends to do a whole string of commits on a single branch in a bunch, whereas without --topo-order it seems to hop from branch to branch more. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 3:31 ` Paul Mackerras @ 2006-09-09 4:04 ` Linus Torvalds 2006-09-09 8:47 ` Marco Costalba 0 siblings, 1 reply; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 4:04 UTC (permalink / raw) To: Paul Mackerras; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Paul Mackerras wrote: > > > The menu would help, of course. But it would be even nicer if you'd be > > able to make do without the --topo-order. > > The graph does seem to look a little nicer with --topo-order, in that > it tends to do a whole string of commits on a single branch in a > bunch, whereas without --topo-order it seems to hop from branch to > branch more. Yeah, see "--date-order". It's topologically sorted, but within that topological sort it is in the "date order", which approximates the normal non-sorted output. Without sorting, you also won't see any sorting by branch, although it could be done by the consumer of the rev-list data (ie you could do it in gitk, similarly to how you'd effectively do a topological sort of your own). So it actually boils down to the exact same thing: - git-rev-list doing whatever sorting is "convenient", in that it's the one common interface to git revisions, so once you do it there, you do it for everybody. - however, for much the same reason, git-rev-list by definition doesn't know _who_ it is sorting things for, or how it's going to be visualized (exactly because it's the "one convenient central place", of course), and as a result it needs to not only get the right answer, it needs to get it on the first try and can't go back incrementally to fix up the list of commits. So the convenience of doing any kind of sorting in git-rev-list is by definition also the thing that causes it to have bad latency, in that it needs to parse the _whole_ history. In contrast, doing some of the same sorting that git-rev-list already does in the consumer instead, is obviously duplicating the same basic notions, but once you do it in the consumer, you can suddenly do things that simply weren't possible in git-rev-list - do things incrementally, and then if you notice half-way that you did something wrong, you can go back and fix it up (which can be quite acceptable). That just isn't an option in a linear environment like the git-rev-list output. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 4:04 ` Linus Torvalds @ 2006-09-09 8:47 ` Marco Costalba 2006-09-09 17:33 ` Linus Torvalds 0 siblings, 1 reply; 101+ messages in thread From: Marco Costalba @ 2006-09-09 8:47 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > > In contrast, doing some of the same sorting that git-rev-list already does > in the consumer instead, is obviously duplicating the same basic notions, > but once you do it in the consumer, you can suddenly do things that simply > weren't possible in git-rev-list - do things incrementally, and then if > you notice half-way that you did something wrong, you can go back and fix > it up (which can be quite acceptable). That just isn't an option in a > linear environment like the git-rev-list output. > Perhaps is total idiocy but why do not implement the fix-up logic directly in git-rev-list? If the out of order revisions are a small amount of the total then could be possible to have something like git rev-list --topo-order --with-appended-fixups HEAD Where, while git-rev-list is working _whithout sorting the whole tree first_, when finds an out of order revision stores it in a fixup-list buffer and *at the end* of normal git-rev-lsit the buffer is flushed to receiver, so that the drawing logic does not change and the out of order revisions arrive at the end, already packed, sorted and prepared by git-rev-list. Instead of changing drwing logic is then enough to add a fixup routine that cycles through out of order commits and updates the graph. This also avoids recursive reordering where perhaps the same part of a graph should be redrawn more then one time due to differtent new parents to the same child arrives at different times. Marco ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 8:47 ` Marco Costalba @ 2006-09-09 17:33 ` Linus Torvalds 2006-09-09 18:04 ` Marco Costalba 0 siblings, 1 reply; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 17:33 UTC (permalink / raw) To: Marco Costalba Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Marco Costalba wrote: > > Perhaps is total idiocy but why do not implement the fix-up logic > directly in git-rev-list? It's possible in theory, but in practice it's not worth it. Why? Because you really want the _asynchronous_ nature of having a separate user, that only shows _partial_ results. In other words, we could reasonably easily make git-rev-list do something like - output all revisions in the normal non-topological ordering - when git-rev-list notices a topological sort error, it outputs a "FIXME" line, and restarts the whole thing - printing out the commits in the newly fixed ordering - and does it right the next time around - it then continues doing this until it's totally exhausted the whole commit list and has done one final output in the proper topological ordering. Possible? Yes. BUT: - as long as git-rev-list is entirely single-threaded (and trust me, it's going to be that, because otherwise we'd be better off doing it in a separate process - like gitk), this means that it will be _entirely_ unaware of what has actually been _shown_, so it will restart a LOT more than the external interactive process would do. So it would be much worse than doing it externally and knowing what you've actually shown to the user (if you haven't shown the bad thing yet, there's no reason to restart). - Again, as long as it's single-threaded, git-rev-list will block once it has filled up the pipeline between the processes, so instead of parsing everything in parallel with the "show it all", if will synchronize with the showing process all the time, and especially so when it needs to re-show the stuff that it already sent once. So it's also fairly inefficient. However, what you seem to imply is something different: > Where, while git-rev-list is working _whithout sorting the whole tree > first_, when finds an out of order revision stores it in a fixup-list > buffer and *at the end* of normal git-rev-lsit the buffer is flushed > to receiver, so that the drawing logic does not change and the out of > order revisions arrive at the end, already packed, sorted and prepared > by git-rev-list. But this is exactly what we already do. We flush things *at the end* because that's when we actually know the ordering. And that's exactly why "git-rev-list --topo-ordering" has a latency ranging from a few seconds to a few minutes for large projects (depending on whether they are packed or not). The "wait for the end" is _not_ good, exactly because the end will take some time to arrive. The whole point is to start outputting the data early, and thet BY DEFINITION means that the order of revisions isn't guaranteed to be in topological order. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 17:33 ` Linus Torvalds @ 2006-09-09 18:04 ` Marco Costalba 2006-09-09 18:44 ` linux 2006-09-09 20:05 ` Linus Torvalds 0 siblings, 2 replies; 101+ messages in thread From: Marco Costalba @ 2006-09-09 18:04 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > - output all revisions in the normal non-topological ordering > > - when git-rev-list notices a topological sort error, it outputs a > "FIXME" line, and restarts the whole thing - printing out the commits > in the newly fixed ordering - and does it right the next time around > Sorry, but I don't understand why you should restart the whole thing instead of store away the FIXME commit and continue. If you read my previous e-mail in this thread perhaps it is better explained the whole idea. Anyhow the basic is: -git-rev-list starts outputting the data early (order is not guaranteed) -before to mark for output a revision check if it breaks --topo-order -if the above is true store the revision away and *do not send* - at the end you get an early started steram of topological corrects revisions without preordering, just because you filter away the (few?) revisions that are not in order. The list you get is guaranteed to be in topo order although my not be complete. - _then_ you send the missing revisions that where previously filtered out. At this stage the receiver has already drwan the graph, indeed it has start drwaing as soon as the first revisons arrived and *very important* receiver used old and fast topo-order parsing code. - finally the fixup routine at receiver's end updates the graph with the info from the small set of out of order revisions filtered out before and sent at the end (only this small set is sent at the end). Sorry if it is not still clear, in the previous my e-mail in this thread there is also a small example that could be useful. Marco ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 18:04 ` Marco Costalba @ 2006-09-09 18:44 ` linux 2006-09-09 19:17 ` Marco Costalba 2006-09-09 20:05 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: linux @ 2006-09-09 18:44 UTC (permalink / raw) To: mcostalba, torvalds; +Cc: git, jonsmirl, linux, paulus > Anyhow the basic is: > > -git-rev-list starts outputting the data early (order is not guaranteed) > > -before to mark for output a revision check if it breaks --topo-order > > -if the above is true store the revision away and *do not send* > > - at the end you get an early started steram of topological corrects > revisions without > preordering, just because you filter away the (few?) revisions that > are not in order. > The list you get is guaranteed to be in topo order although my not be complete. > > - _then_ you send the missing revisions that where previously > filtered out. At this stage the receiver has already drwan the graph, > indeed it has start drwaing as soon as the first revisons arrived and > *very important* receiver used old and fast topo-order parsing code. > > - finally the fixup routine at receiver's end updates the graph with > the info from the small set of out of order revisions filtered out > before and sent at the end (only this small set is sent at the end). The problem is that the gitk display code doesn't *like* input like this; it's only designed to append to a list. Handling insertions would be hard work for a rare corner case, and a perpetual source of bugs. Unless gitk does a complete second pass, or course, which would guarantee an annoying flicker a few seconds after startup. And Twice the work. The original alternative was to note that out-of-order commits usually aren't *very* out of order. So if gitk could checkpoint the state of its display algorithm periodically, it could just, when seeing an out-of-order commit, rewind to the last checkpoint before the problem and replay while merging in the corrections. While that is potentially O(n^2) (if you feed it all the commits in reverse order), in practice it - Is less than 2*n work, and - Always maintains the longest possible correct prefix of recent commits. It's the old stuff that takes a while to fill in. Either way, gitk has to have topo-sorting code added to it. I think Linus' idea was to avoid that. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 18:44 ` linux @ 2006-09-09 19:17 ` Marco Costalba 0 siblings, 0 replies; 101+ messages in thread From: Marco Costalba @ 2006-09-09 19:17 UTC (permalink / raw) To: linux@horizon.com; +Cc: torvalds, git, jonsmirl, paulus On 9 Sep 2006 14:44:41 -0400, linux@horizon.com <linux@horizon.com> wrote: > > Anyhow the basic is: > > > > -git-rev-list starts outputting the data early (order is not guaranteed) > > > > -before to mark for output a revision check if it breaks --topo-order > > > > -if the above is true store the revision away and *do not send* > > > > - at the end you get an early started steram of topological corrects > > revisions without > > preordering, just because you filter away the (few?) revisions that > > are not in order. > > The list you get is guaranteed to be in topo order although my not be complete. > > > > - _then_ you send the missing revisions that where previously > > filtered out. At this stage the receiver has already drwan the graph, > > indeed it has start drwaing as soon as the first revisons arrived and > > *very important* receiver used old and fast topo-order parsing code. > > > > - finally the fixup routine at receiver's end updates the graph with > > the info from the small set of out of order revisions filtered out > > before and sent at the end (only this small set is sent at the end). > > The problem is that the gitk display code doesn't *like* input like this; > it's only designed to append to a list. Handling insertions would be > hard work for a rare corner case, and a perpetual source of bugs. > > Unless gitk does a complete second pass, or course, which would > guarantee an annoying flicker a few seconds after startup. > And Twice the work. > I think I need to add another argument here, I didn't mention before for clarity (indeed I'm not very good at this it seems ;-) ) I don't know for gitk, perhaps Paul can better explain, but the usual optimization of a git visualizer is simply to not draw any graph until that part does became visibile on the screen. So your arguments are true but the fact is that there is no graph insertion handling at all in almost all cases, but only insertion in loaded revs list ; when the user scrolls to the inserted commit only _then_ the graph will be calculated and dispalyed (I'm not sure in gitk, but in qgit it works that way). So there's no flickering too, and not double work. Indeed lazy/on demand graph drawing policy is very well supported by the above schema, and the above schema is good also because of the lazy graph drawing. Marco ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 18:04 ` Marco Costalba 2006-09-09 18:44 ` linux @ 2006-09-09 20:05 ` Linus Torvalds 2006-09-09 20:43 ` Jeff King 2006-09-10 3:49 ` Marco Costalba 1 sibling, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 20:05 UTC (permalink / raw) To: Marco Costalba Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Marco Costalba wrote: > > Sorry, but I don't understand why you should restart the whole thing > instead of store away the FIXME commit and continue. > > If you read my previous e-mail in this thread perhaps it is better > explained the whole idea. > > Anyhow the basic is: > > -git-rev-list starts outputting the data early (order is not guaranteed) > > -before to mark for output a revision check if it breaks --topo-order > > -if the above is true store the revision away and *do not send* You don't seem to understand. It's not that individual new commits might be "out of order", and you can suppress them. It's that individual commits end up showing that OTHER COMMITS, THAT YOU HAVE ALREADY SEEN AND SHOWN, can now be shown to be "out of order". So you're _not_ just suppressing and delaying the few commits that didn't conform to the topological sorting. What you're asking for is impossible. You can't just delay those to the end, because they implicitly are telling you that the earlier commits were already shown in the wrong order. The example is A <--- tip of branch / \ B E | | | F | / C | D ... where the lettering is in "date order" (ie "A" is more recent than "B" etc). In this situation, we'd start following the branch A->B->C->D->.. before we even start looking at E and F, because they all _look_ more recent. But then, at some point, we see E and F, and suddenly when looking at F we realize that C was printed out much much earlier, and was already shown as having only one child. We can't just say "delay F to the end", because we'd then have to draw the line _backwards_ to C, which we haven't even left room for in the graph, not to mention that the end result would look absolutely disgusting even if we had. So we started out painting this as A / \ B | | | C | | | D | | | | E | | | F | | before we notice how wrong it was, and now we have to go _back_, and repaint both E and F as being above C, because only once we hit F do we suddenly realize that yeah, it really _was_ younger than C, despite the timestamp (which was off, because somebody had set his clock to be a week in the past by mistake). So we can't just delay F and "fix it up" at the end. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 20:05 ` Linus Torvalds @ 2006-09-09 20:43 ` Jeff King 2006-09-09 21:11 ` Junio C Hamano 2006-09-09 21:40 ` Linus Torvalds 2006-09-10 3:49 ` Marco Costalba 1 sibling, 2 replies; 101+ messages in thread From: Jeff King @ 2006-09-09 20:43 UTC (permalink / raw) To: Linus Torvalds Cc: Marco Costalba, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On Sat, Sep 09, 2006 at 01:05:42PM -0700, Linus Torvalds wrote: > The example is > > A <--- tip of branch > / \ > B E > | | > | F > | / > C > | > D > ... > > where the lettering is in "date order" (ie "A" is more recent than "B" > etc). In this situation, we'd start following the branch A->B->C->D->.. > before we even start looking at E and F, because they all _look_ more > recent. I'm just coming into this discussion in the middle and know very little about the rev-list code, so please humor me and tell me why my suggestion is completely irrelevant. The problem you describe seems to come from doing a depth-first display of each branch. Why not look at the tip of each "active" branch simultaneously and pick the one with the most recent date? Something like: void show_all(commit_array start) { commit_array active; copy_array(active, start); sort_array(active); while (active.size > 0) { show_commit(active.data[0]); push_back(active.data[0].parents); pop_front(active); sort_array(active); } } Obviously you could use a specialized data structure to add nodes into the right place instead of resorting constantly. -Peff ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 20:43 ` Jeff King @ 2006-09-09 21:11 ` Junio C Hamano 2006-09-09 21:14 ` Jeff King 2006-09-09 21:40 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-09 21:11 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: > I'm just coming into this discussion in the middle and know very little > about the rev-list code, so please humor me and tell me why my > suggestion is completely irrelevant. Not irrelevant. > The problem you describe seems to come from doing a depth-first display > of each branch. Why not look at the tip of each "active" branch > simultaneously and pick the one with the most recent date? Something > like: That's what we have been doing from day one. The trouble Linus illustrated is that in a global project you cannot rely on timestamps always being correct. You can use them as HINT, but you need to be prepared to do sensible things when some people screw up the time. > On Sat, Sep 09, 2006 at 01:05:42PM -0700, Linus Torvalds wrote: > >> The example is >> >> A <--- tip of branch >> / \ >> B E >> | | >> | F >> | / >> C >> | >> D >> ... >> >> where the lettering is in "date order" (ie "A" is more recent than "B" >> etc). In this situation, we'd start following the branch A->B->C->D->.. >> before we even start looking at E and F, because they all _look_ more >> recent. The ancestry graph, topologically, flows from bottom to top but the timestamps are in F E D C B A order (A is closer to current, F is in the most distant past). Somebody forked from C on a machine with slow clock, made two commits with wrong (from the point of view of the person who made commit C anyway) timestamps, and that side branch ended up merged with B into A. You start following A, read A and find B and E (now B and E are "active" in your lingo), pop B because it is the most recent. We look at B, find C is the parent, push C into the active list (which is always sorted by timestamp order). Now "active" are C and E, and C is most recent so we pop C. In the past people suggested workarounds such as making commit-tree to ensure that a child commit has timestamp no older than any of its parents by either refusing to make such or automatically adjusting. That would surely work around the problem, but if somebody made a commit with a wrong timestamp far into the future, every commit that is made after that will have "adjusted" timestamp that pretty much is meaningless, so that would not work well in practice. If we had a commit generation number field in the commit object, we could have used something like that. Each commit gets generation number that is maximum of its parents' generation number plus one, and we prime the recursive definition by giving root commits generation number of 0, and rev-list would have used the generation number not timestamp to do the above "date-order" sort and we will always get topology right. Of course fsck-objects needs to be taught to check and complain if the generation numbers between parent and child are inverted. But it's too late in the game now -- maybe in git version 47. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
2006-09-09 21:11 ` Junio C Hamano
@ 2006-09-09 21:14 ` Jeff King
0 siblings, 0 replies; 101+ messages in thread
From: Jeff King @ 2006-09-09 21:14 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Sat, Sep 09, 2006 at 02:11:51PM -0700, Junio C Hamano wrote:

> The trouble Linus illustrated is that in a global project you
> cannot rely on timestamps always being correct. You can use
> them as a HINT, but you need to be prepared to do sensible things
> when some people screw up the time.

Right, I forgot about the unreliability of the timestamps. Thanks for
the explanation!

-Peff
* Re: Change set based shallow clone
2006-09-09 20:43 ` Jeff King
2006-09-09 21:11 ` Junio C Hamano
@ 2006-09-09 21:40 ` Linus Torvalds
2006-09-09 22:54 ` Jon Smirl
1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2006-09-09 21:40 UTC (permalink / raw)
To: Jeff King
Cc: Marco Costalba, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On Sat, 9 Sep 2006, Jeff King wrote:
>
> The problem you describe seems to come from doing a depth-first display
> of each branch

No, not at all. We _don't_ do depth-first, or breadth-first. The thing
is, neither would really work, since the "distance" between commits is
very variable.

Instead, the rule is always to try to traverse the tree in date order.
That's most likely the closest order to what you want - it ends up
being breadth-first if there is a lot of concurrent work, and if there
is one old branch and one new branch, we'll look at the new one first,
and approximate depth-first there.

> Why not look at the tip of each "active" branch
> simultaneously and pick the one with the most recent date?

Which is _exactly_ what we do.

However, it's just a heuristic. "Most recent date" is not well-defined
in a distributed environment that doesn't have a global clock. If
somebody does commits on a machine that has the clock just set wrong,
"most recent" may well not be _really_ recent.

So all the algorithms are actually very careful to just use the date as
a heuristic. They have real guarantees, but they aren't about the
ordering. The ordering is just used as a way to have _some_ non-random
and likely good order to traverse the DAG in. Doing it depth-first
would suck (you'd always reach the root before you reach a lot of much
more interesting commits), and breadth-first is not well-defined
either, since one "level" can be a year old for one parent, and a day
old for another.

So you could kind of think of the ordering as a "breadth-first,
modified by date" kind of thing.

		Linus
* Re: Change set based shallow clone
2006-09-09 21:40 ` Linus Torvalds
@ 2006-09-09 22:54 ` Jon Smirl
2006-09-10 0:18 ` Linus Torvalds
0 siblings, 1 reply; 101+ messages in thread
From: Jon Smirl @ 2006-09-09 22:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff King, Marco Costalba, Paul Mackerras, linux@horizon.com, Git Mailing List

On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> However, it's just a heuristic. "Most recent date" is not well-defined
> in a distributed environment that doesn't have a global clock. If
> somebody does commits on a machine that has the clock just set wrong,
> "most recent" may well not be _really_ recent.

When a merge happens could git fix things up in the database by adding
a corrected, hidden time stamp that would keep things from having an
out of order time sequence? That way you wouldn't need to rediscover
the out of order commit each time the tree is generated.

--
Jon Smirl
jonsmirl@gmail.com
* Re: Change set based shallow clone
2006-09-09 22:54 ` Jon Smirl
@ 2006-09-10 0:18 ` Linus Torvalds
2006-09-10 1:22 ` Junio C Hamano
0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2006-09-10 0:18 UTC (permalink / raw)
To: Jon Smirl
Cc: Jeff King, Marco Costalba, Paul Mackerras, linux@horizon.com, Git Mailing List

On Sat, 9 Sep 2006, Jon Smirl wrote:
>
> When a merge happens could git fix things up in the database by adding
> a corrected, hidden time stamp that would keep things from having an
> out of order time sequence? That way you wouldn't need to rediscover
> the out of order commit each time the tree is generated.

I don't think any such algorithm exists, and it's too late now ;)

I was thinking of some "sequence number" algorithm, but the thing is,
it's not _merges_ that are the problem per se, it's the "branch
points". And you don't actually know that a commit is a branch point in
advance: it's a branch point not because you branched, but because
somebody else happened to use the same parent as the basis of their
independent work.

So we could make the rule be something like "when we commit, the new
commit must have a datestamp that is equal to or in the future from the
parent commits". However, that results in some really strange problems
in a distributed environment, where _if_ somebody has their clock set
wrong, you'll end up either unable to merge with them, or you have to
generate a totally bogus time yourself (which then moves up the chain).

It might be simpler to then not think of the thing as a "date" at all,
but as a pure sequence number. Then the rule would be: a new commit has
to have a sequence number that is "max(of-the-parent-sequence-numbers)
+ 1".

That _may_ or may not actually help. It was discussed (a bit) early on
- summer-05 - but I wasn't convinced enough that it would help that I
was willing to change the format and add that sequence number. I still
am not. I'm not even sure it would really work.

In other words, it's probably just not worth it. It's better to just
say "dates don't really _matter_, and all that matters is the
parenthood information". Which is really what git does - the dates
aren't _important_ for any real operation, they are literally just a
heuristic.

But hey, maybe somebody can come up with some operation that becomes
_much_ cheaper with the above kind of sequence number. It shouldn't be
hard to test..

		Linus
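The clamping rule Linus describes (and rejects) is easy to state in
code. A hypothetical sketch, not anything git's commit-tree actually
does; note how a single parent with a far-future clock forces every
descendant to inherit the bogus time:

    #include <time.h>

    static long clamp_commit_time(time_t now, const long *parent_dates,
                                  int nparents)
    {
            long t = (long)now;
            int i;
            for (i = 0; i < nparents; i++)
                    if (parent_dates[i] > t)
                            t = parent_dates[i]; /* inherit future clock */
            return t;  /* recorded timestamp, never older than a parent */
    }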
* Re: Change set based shallow clone
2006-09-10 0:18 ` Linus Torvalds
@ 2006-09-10 1:22 ` Junio C Hamano
0 siblings, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 1:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> But hey, maybe somebody can come up with some operation that becomes
> _much_ cheaper with the above kind of sequence number. It shouldn't be
> hard to test..

Sometimes I get a feeling that we are saying the same thing in the same
thread but in parallel, without reading what the other is saying. This
is such a thread ;-).
* Re: Change set based shallow clone
2006-09-09 20:05 ` Linus Torvalds
2006-09-09 20:43 ` Jeff King
@ 2006-09-10 3:49 ` Marco Costalba
2006-09-10 4:13 ` Junio C Hamano
1 sibling, 1 reply; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 3:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> The example is
>
>       A   <--- tip of branch
>      / \
>     B   E
>     |   |
>     |   F
>     |  /
>     C
>     |
>     D
>     ...

Ok, now it's clear, thanks. But I still think that it should be
possible to avoid the checking and reordering on the receiver side.

Suppose for a moment that we split graph drawing from the sequence
reordering problem, and that the receiver does not draw the graph
immediately.

As you described, in our case git-rev-list sends the following
sequence:

    A, B, C, D, E, F

whereas git-rev-list --topo-order would have sent something like:

    A, E, F, B, C, D

Now I ask: is it possible to have a sequence (without latency) like

    A, B, C, D, (-3)E, (-3)F

where, for revisions that are not topologically ordered, git-rev-list
gives a hint about the correct position in the sequence (3 revisions
earlier, in our case), i.e. where the revision would have been if the
sequence had been --topo-order?

This saves all the checking and reordering on the receiver side and
guarantees consistent results across different implementations of git
visualizers, because the --topo-order algorithm logic is computed in
git-rev-list only.

The visualizers could be sequence-order agnostic, i.e. receivers could
handle --topo-order or --date-order sequences, or any other kind of
sequence, with the same code; they simply recreate the sequence in the
way git-rev-list tells them.

Marco
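As a rough sketch of the receiving side of this proposal (the option
and the hint syntax are hypothetical; nothing like them exists in
rev-list): a hint "(-k)X" would mean "splice X in k positions back from
the end of what has been received so far" instead of appending.

    #include <string.h>

    /* rows[] holds revision names already received, newest first.
     * back == 0 is a plain append; back == 3 implements "(-3)X". */
    static void insert_with_hint(const char **rows, int *nrows,
                                 const char *rev, int back)
    {
            int i, pos = *nrows - back;
            if (pos < 0)
                    pos = 0;                /* clamp a bad hint */
            for (i = *nrows; i > pos; i--)
                    rows[i] = rows[i - 1];  /* shift the tail down */
            rows[pos] = rev;
            (*nrows)++;
    }

With the stream above, appending A, B, C, D and then applying (-3)E and
(-3)F yields A, E, F, B, C, D, which is exactly the --topo-order
sequence.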
* Re: Change set based shallow clone
2006-09-10 3:49 ` Marco Costalba
@ 2006-09-10 4:13 ` Junio C Hamano
2006-09-10 4:23 ` Marco Costalba
0 siblings, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 4:13 UTC (permalink / raw)
To: Marco Costalba
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

"Marco Costalba" <mcostalba@gmail.com> writes:

> On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>>
>> The example is
>>
>>       A   <--- tip of branch
>>      / \
>>     B   E
>>     |   |
>>     |   F
>>     |  /
>>     C
>>     |
>>     D
>>     ...
>
> Ok, now it's clear, thanks. But I still think that it should be
> possible to avoid the checking and reordering on the receiver side.
>
> Suppose for a moment that we split graph drawing from the sequence
> reordering problem, and that the receiver does not draw the graph
> immediately.
>
> As you described, in our case git-rev-list sends the following
> sequence:
>     A, B, C, D, E, F
>
> whereas git-rev-list --topo-order would have sent something like:
>     A, E, F, B, C, D
>
> Now I ask: is it possible to have a sequence (without latency) like
>     A, B, C, D, (-3)E, (-3)F
>
> where, for revisions that are not topologically ordered, git-rev-list
> gives a hint about the correct position in the sequence (3 revisions
> earlier, in our case), i.e. where the revision would have been if the
> sequence had been --topo-order?

When rev-list is writing E out, it does not know it is a descendant of
something it already emitted (i.e. C) because it hasn't looked at F nor
checked its parent yet. So asking for (-3)F may be fine, but I think
(-3)E is just a fantasy.
* Re: Change set based shallow clone
2006-09-10 4:13 ` Junio C Hamano
@ 2006-09-10 4:23 ` Marco Costalba
2006-09-10 4:46 ` Marco Costalba
2006-09-10 4:54 ` Junio C Hamano
1 sibling, 2 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 4:23 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>
> > On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> >>
> >> The example is
> >>
> >>       A   <--- tip of branch
> >>      / \
> >>     B   E
> >>     |   |
> >>     |   F
> >>     |  /
> >>     C
> >>     |
> >>     D
> >>     ...
> >
> > Ok, now it's clear, thanks. But I still think that it should be
> > possible to avoid the checking and reordering on the receiver side.
> >
> > Suppose for a moment that we split graph drawing from the sequence
> > reordering problem, and that the receiver does not draw the graph
> > immediately.
> >
> > As you described, in our case git-rev-list sends the following
> > sequence:
> >     A, B, C, D, E, F
> >
> > whereas git-rev-list --topo-order would have sent something like:
> >     A, E, F, B, C, D
> >
> > Now I ask: is it possible to have a sequence (without latency) like
> >     A, B, C, D, (-3)E, (-3)F
> >
> > where, for revisions that are not topologically ordered,
> > git-rev-list gives a hint about the correct position in the sequence
> > (3 revisions earlier, in our case), i.e. where the revision would
> > have been if the sequence had been --topo-order?
>
> When rev-list is writing E out, it does not know it is a descendant of
> something it already emitted (i.e. C) because it hasn't looked at F
> nor checked its parent yet. So asking for (-3)F may be fine, but I
> think (-3)E is just a fantasy.

Ok. What about something like this?

    A, B, C, D, E, (-3, 1)F

where -3 is the correct position in the sequence and 1 is the number of
revisions before F to which the -3 rule also applies.
* Re: Change set based shallow clone
2006-09-10 4:23 ` Marco Costalba
@ 2006-09-10 4:46 ` Marco Costalba
0 siblings, 0 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 4:46 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/10/06, Marco Costalba <mcostalba@gmail.com> wrote:
> On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
> > "Marco Costalba" <mcostalba@gmail.com> writes:
> >
> > > On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > >>
> > >> The example is
> > >>
> > >>       A   <--- tip of branch
> > >>      / \
> > >>     B   E
> > >>     |   |
> > >>     |   F
> > >>     |  /
> > >>     C
> > >>     |
> > >>     D
> > >>     ...
> > >
> > > Now I ask: is it possible to have a sequence (without latency)
> > > like
> > >     A, B, C, D, (-3)E, (-3)F
> > >
> > > where, for revisions that are not topologically ordered,
> > > git-rev-list gives a hint about the correct position in the
> > > sequence (3 revisions earlier, in our case), i.e. where the
> > > revision would have been if the sequence had been --topo-order?
> >
> > When rev-list is writing E out, it does not know it is a descendant
> > of something it already emitted (i.e. C) because it hasn't looked at
> > F nor checked its parent yet. So asking for (-3)F may be fine, but I
> > think (-3)E is just a fantasy.
>
> Ok. What about something like this?
>
>     A, B, C, D, E, (-3, 1)F
>
> where -3 is the correct position in the sequence and 1 is the number
> of revisions before F to which the -3 rule also applies.

Suppose git-rev-list gives:

    A, B, H, C, D, E, F, G, I, L, M, N

and suppose git-rev-list --topo-order would have sent:

    A, B, C, D, E, F, G, H, I, L, M, N

Perhaps a more general fixup syntax could be:

    A, B, H, C, D, E, F, G, I(6, C, D, E, F, G, H), L, M, N

where 6 is the number of previous revisions to rearrange and what
follows is the corresponding correct sequence (C, D, E, F, G, H).

Note that the same subset could be rearranged multiple times if that
helps git-rev-list parsing; for example, it should be possible to have
something like:

    A, B, H, C, D, E, F, G, I(6, C, D, E, F, G, H), L, M(3, L, G, H), N

Marco
* Re: Change set based shallow clone
2006-09-10 4:23 ` Marco Costalba
2006-09-10 4:46 ` Marco Costalba
@ 2006-09-10 4:54 ` Junio C Hamano
2006-09-10 5:14 ` Marco Costalba
` (2 more replies)
1 sibling, 3 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 4:54 UTC (permalink / raw)
To: Marco Costalba
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

"Marco Costalba" <mcostalba@gmail.com> writes:

> >>
> >>       A   <--- tip of branch
> >>      / \
> >>     B   E
> >>     |   |
> >>     |   F
> >>     |  /
> >>     C
> >>     |
> >>     D
> >>     ...
>
> Ok. What about something like this?
>
>     A, B, C, D, E, (-3, 1)F
>
> where -3 is the correct position in the sequence and 1 is the number
> of revisions before F to which the -3 rule also applies.

That means F knows who its descendants are, and commit objects do not
have that information, so while traversing you need to keep track of
all the descendants yourself, don't you?

And how does that fix-up information help the user of the stream
anyway? If I understand your model correctly, the model does not
synchronously draw nodes as they are received, so it keeps track of
what it has seen so far. When it sees F it can notice that its parent,
C, is something it has seen, so it can tell F is wrong. It also knows
that it has seen E and E's parent is F (which turned out to be at a
wrong spot), so it can suspect E is also in the wrong spot (now, it is
fuzzy to me how that chain of suspicion ends at E and does not
propagate up to A, but let's think about that later).

There is one thing that the user knows a lot better than the generic
rev-list output part. It is the size of the output window (how tall
the window is). I wonder if there is a way to exploit it, either in
the user, or by telling the size of the window (and perhaps where the
display region at the top begins) to rev-list...

If we are dealing with a 3-item-tall window, we will traverse A B C D,
notice they form a single strand of pearls, and can make a mental note
that we do not have to worry about ancestors of D for now, because D's
ancestors will be drawn below C, which is already the putative bottom
edge of the window (when oddballs like E and F come later, they can
only push things down at the bottom edge of the window). Then rev-list
can postpone traversing D and go on to traverse other nodes that are
in the "active" list (in this case, E).

I am starting to suspect that introducing a "generation" header to the
commit objects might actually be a very good thing. For one thing,
rev-list would automatically always get the topology right if we did
so.

We would obviously need to update 'convert-objects' and tell everybody
that they need to rewrite their history if we take this route. That
kind of flag-day conversion used to be an Ok thing to do, but it is
getting harder and harder these days, for obvious reasons.
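Junio's windowing observation can be reduced to a small predicate --
hypothetical, of course, since rev-list knows nothing about windows
today: once the visible window is full and the traversal frontier has
collapsed to a single tip, every remaining commit is an ancestor of
that tip and can only appear below the bottom edge, so the producer may
safely pause.

    /* rows_emitted: commits already handed to the visualizer
     * window_rows:  how tall the display window is
     * nactive:      size of the traversal frontier ("active" list) */
    static int traversal_can_pause(int rows_emitted, int window_rows,
                                   int nactive)
    {
            return rows_emitted >= window_rows && nactive == 1;
    }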
* Re: Change set based shallow clone
2006-09-10 4:54 ` Junio C Hamano
@ 2006-09-10 5:14 ` Marco Costalba
2006-09-10 5:46 ` Junio C Hamano
2006-09-10 15:21 ` linux
2006-09-10 9:49 ` Jakub Narebski
2 siblings, 2 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 5:14 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>
> > Ok. What about something like this?
> >
> >     A, B, C, D, E, (-3, 1)F
> >
> > where -3 is the correct position in the sequence and 1 is the number
> > of revisions before F to which the -3 rule also applies.
>
> That means F knows who its descendants are, and commit objects do not
> have that information, so while traversing you need to keep track of
> all the descendants yourself, don't you?

Yes, you do! But you do it in git instead of in the visualizer ;-)
because I think that in any case someone definitely needs to keep
track of it.

> And how does that fix-up information help the user of the stream
> anyway? If I understand your model correctly, the model does not
> synchronously draw nodes as they are received,

Visualizers draw only what is on screen, so when you start them they
draw the first 20/30 lines only. And normally git's output is faster
than the user's scrolling, so by the time the user scrolls down the
revisions have already arrived.

> track of what it has seen so far. When it sees F it can notice
> that its parent, C, is something it has seen, so it can tell F
> is wrong. It also knows that it has seen E and E's parent is F
> (which turned out to be at a wrong spot), so it can suspect E is
> also in the wrong spot (now, it is fuzzy to me how that chain
> of suspicion ends at E and does not propagate up to A, but let's
> think about that later).

It is possible; in that case flicker will be involved, but it is very
uncommon, so in the large majority of all other cases we have no
flickering and early graph drawing.

Because the worst that can happen on the visualizer side is flickering,
we can accept a worse rare case in exchange for a big advantage (no
latency and no flickering) in the (very) common case: "the user scrolls
to a given page when the corresponding revisions have already arrived
and, if needed, been fixed up".

Marco
* Re: Change set based shallow clone
2006-09-10 5:14 ` Marco Costalba
@ 2006-09-10 5:46 ` Junio C Hamano
2006-09-10 15:21 ` linux
1 sibling, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 5:46 UTC (permalink / raw)
To: git

"Marco Costalba" <mcostalba@gmail.com> writes:

> On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
>> "Marco Costalba" <mcostalba@gmail.com> writes:
>>
>> > >>
>> > >>       A   <--- tip of branch
>> > >>      / \
>> > >>     B   E
>> > >>     |   |
>> > >>     |   F
>> > >>     |  /
>> > >>     C
>> > >>     |
>> > >>     D
>> > >>     ...
>> >
>> > Ok. What about something like this?
>> >
>> >     A, B, C, D, E, (-3, 1)F
>> >
>> > where -3 is the correct position in the sequence and 1 is the
>> > number of revisions before F to which the -3 rule also applies.
>>
>> That means F knows who its descendants are, and commit objects
>> do not have that information, so while traversing you need to
>> keep track of all the descendants yourself, don't you?
>
> Yes, you do! But you do it in git instead of in the visualizer ;-)
> because I think that in any case someone definitely needs to keep
> track of it.

Not really. To git-rev-list, which does not know how the visualizer
uses the data, what you are saying essentially means that it needs to
hold onto the data it parsed forever. That's where my earlier
suggestion for visualizers to take advantage of the "windowing"
information comes from. Since the chain depicted above between E..F can
be arbitrarily long (and you would need to keep track of information
for B and C too -- well, essentially everything), that is unacceptable
for rev-list unless you give it an additional clue as to when it can
discard that information.
* Re: Change set based shallow clone
2006-09-10 5:14 ` Marco Costalba
2006-09-10 5:46 ` Junio C Hamano
@ 2006-09-10 15:21 ` linux
2006-09-10 18:32 ` Marco Costalba
2006-09-11 9:56 ` Paul Mackerras
1 sibling, 2 replies; 101+ messages in thread
From: linux @ 2006-09-10 15:21 UTC (permalink / raw)
To: junkio, mcostalba; +Cc: git, jonsmirl, linux, paulus, torvalds

This conversation is going in so many directions at once that it's
getting annoying. The first-level decision requires input from the
point of view of gitk and qgit internals as to what's easy to
implement.

I'm trying to figure out the gitk code, but I'm not fluent in tcl, and
it has 39 non-boilerplate comment lines in 229 functions and 6308 lines
of source, so it requires fairly intensive grokking.

Still, while it obviously doesn't render bitmaps until the data appears
in the window, it appears as though at least part of the layout code
(the "layoutmore" function) is performed eagerly as soon as new data
arrives via the getcommitlines callback. (Indeed, it appears that Tk
does not inform it of window scroll events, so it can't wait any longer
to decide on the layout.)

Case 1: The visualizer is NOT CAPABLE of accepting out-of-order input.

Without very sophisticated caching in git-rev-list, it must process all
of the data before outputting any, in order to make an absolute
guarantee of no out-of-order data despite possibly messed-up
timestamps. It is possible to notice that the date-first traversal only
has one active chain and flush the queued data at that point, but that
situation is unlikely to arise in repositories of non-trivial size,
which are exactly the ones for which the batch-sorting delay is
annoying.

Case 2: The visualizer IS CAPABLE of accepting out-of-order input.

Regardless of whether the layout is done eagerly or lazily, this
requires the visualizer to potentially undo part of its layout and
re-do it, so it has a UI implementation cost. The re-doing does not
need to be highly efficient; any number of heuristics and exception
caches can reduce the occurrence of this in git-rev-list output to very
low levels. It's just absolutely excluding it, without losing
incremental output, that is difficult.

Heuristic: I suspect the most common wrong-timestamp case is a time
zone misconfiguration, so holding back output until the tree traversal
has advanced 24 hours will eliminate most of the problems. Atypically
large time jumps (like more than a year) could also trigger special
"gross clock error" handling.

Cache: whenever a child timestamped older than an ancestor is
encountered, this can be entered in a persistent cache that can be used
to give the child a different sorting priority next time. The simplest
implementation would propagate this information up a chain of bad
timestamps by one commit per git-rev-list invocation, but even that's
probably okay. (A study of timestamp-ordering problems in existing
repositories would be helpful for tuning these.)

In case 2, I utterly fail to see how delaying emitting the out-of-order
commit is of the slightest help to the UI. The simplest way to merge
out-of-order data is with an insertion sort (a.k.a. roll back and
reprocess forward), and the cost of that is minimized if the distance
to back up is minimized. Some "oops!" annotation on the git-rev-list
output may be helpful to tell the UI that it needs to search back, but
it already has an internal index of commits, so perhaps even that isn't
worth bothering with. Fancier output formats are also more work to
process.
With sufficient caching of exceptions in git-rev-list, it may be
practical to just have a single "oops, I screwed up; let's start again"
output line, which very rarely triggers.

But can we stop designing git-rev-list output formats until we've
figured out if and how to implement it in the visualizer? Or, more to
the point, visualizers plural. That's the hard part. Then we can see
what sort of git-rev-list output would be most convenient. For example,
is fixing a small number of out-of-place commits practical, or is it
better to purge and restart? The former avoids deleting
already-existing objects, while the latter avoids moving them.

The original problem is that the long delay between starting a git
history browser and being able to browse is annoying. The visualizer
UIs already support browsing while history is flowing in from
git-rev-list, but git-rev-list is reading and sorting the entire
history before outputting the first line.
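A minimal sketch of the 24-hour hold-back heuristic described above,
with invented names: generated commits sit in a buffer kept sorted by
date, and a row is released only once the traversal frontier has moved
at least a day past it, so a commit whose clock was off by a time zone
still gets sorted into place before being printed.

    #include <stdio.h>

    #define SLACK (24 * 3600)    /* hold back one day of history */

    struct row { const char *name; long date; };

    /* pending[] is kept sorted oldest-first, newest at pending[*n - 1].
     * frontier_date is the timestamp of the commit the traversal just
     * reached; a buffered row more than SLACK newer than that can no
     * longer be displaced by a misdated straggler, so emit it. */
    static void maybe_flush(struct row *pending, int *n,
                            long frontier_date)
    {
            while (*n > 0 &&
                   pending[*n - 1].date > frontier_date + SLACK) {
                    printf("%s\n", pending[*n - 1].name);
                    (*n)--;
            }
    }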
* Re: Change set based shallow clone
2006-09-10 15:21 ` linux
@ 2006-09-10 18:32 ` Marco Costalba
2006-09-11 9:56 ` Paul Mackerras
0 siblings, 0 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 18:32 UTC (permalink / raw)
To: linux@horizon.com; +Cc: junkio, git, jonsmirl, paulus, torvalds

>
> But can we stop designing git-rev-list output formats until we've
> figured out if and how to implement it in the visualizer? Or, more to
> the point, visualizers plural. That's the hard part. Then we can see
> what sort of git-rev-list output would be most convenient.
>

Regarding qgit we have two cases:

- Arrival of a fixup set for out-of-order commits _not_ currently
displayed on the screen.

This is by far the most common case, especially for big archives. In
this case the fixup speed depends on how far the fixed commits are from
the last loaded commit (it's an insertion in a vector). In any case it
is far quicker than redrawing the graph. There is no flickering side
effect here. Implementation is easy.

- Arrival of a fixup set for out-of-order commits currently displayed
on the screen.

Here the implementation _could_ be more difficult, but given how rarely
this case occurs (a rare case among rare cases) we could accept the
brutal approach of resetting the complete graph, *but not the loaded
revision data*, and redrawing the graph. With this approach the
implementation is almost as easy as before, but flickering is involved.

I agree with you that we should have fixup information as soon as
git-rev-list discovers out-of-order commits.

Marco
* Re: Change set based shallow clone
2006-09-10 15:21 ` linux
2006-09-10 18:32 ` Marco Costalba
@ 2006-09-11 9:56 ` Paul Mackerras
2006-09-11 12:39 ` linux
1 sibling, 1 reply; 101+ messages in thread
From: Paul Mackerras @ 2006-09-11 9:56 UTC (permalink / raw)
To: linux; +Cc: junkio, mcostalba, git, jonsmirl, torvalds

linux@horizon.com writes:

> I'm trying to figure out the gitk code, but I'm not fluent in tcl, and
> it has 39 non-boilerplate comment lines in 229 functions and 6308
> lines of source, so it requires fairly intensive grokking.

Sorry :)

> Still, while it obviously doesn't render bitmaps until the data
> appears in the window, it appears as though at least part of the
> layout (the "layoutmore" function) code is performed eagerly as
> soon as new data arrives via the getcommitlines callback.

It doesn't render bitmaps as such at the Tcl level; what it does is
create items on canvases, add text to text widgets, etc. Tk takes care
of turning it into bitmaps.

The work of drawing the graph is done in three stages - layout,
optimization and drawing. The layout and optimization stages are done
when we have enough data because, at least for the layout stage,
earlier rows can affect later rows arbitrarily far away. For the
optimization stage, I don't think the changes it makes can propagate
arbitrarily far, so it would be possible to defer that stage, and I am
looking into implementing that. The drawing stage is done on demand,
just for the part that's actually visible.

> (Indeed, it appears that Tk does not inform it of window scroll
> events, so it can't wait any longer to decide on the layout.)

The Tcl code does get to know when the canvas has been scrolled - see
the scrollcanv procedure.

> In case 2, I utterly fail to see how delaying emitting the
> out-of-order commit is of the slightest help to the UI. The simplest
> way to merge

Indeed. I would want to see it as early as possible.

> out-of-order data is with an insertion sort (a.k.a. roll back and
> reprocess forward), and the cost of that is minimized if the distance
> to back up is minimized.

That part isn't too hard; I already have modifications to gitk to
handle that much of it. The harder part is backing up the graph drawing
algorithm.

> For example, is fixing a small number of out-of-place commits
> practical, or is it better to purge and restart? The former avoids
> deleting already-existing objects, while the latter avoids moving
> them.

For gitk, I'm thinking of a reorder buffer of say 10 entries at the
front end to cope with minor misorderings; then if misordering occurs
that is outside the scope of the reorder buffer to fix, freeze the
layout algorithms at that point, read in the rest of the commits and
reorder as necessary, then at the end restart the layout algorithm from
scratch, probably with a popup to inform the user what happened. If the
user could set the size of the reorder buffer then they could avoid
having that happen in future, at the cost of it taking longer to show
the first screenful of commits.

Paul.
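Transposed out of Tcl into C for the sake of illustration, the front
end Paul describes might look roughly like this (the names and the
fixed size are invented): a late arrival is insertion-sorted into the
last ten rows if it fits, and anything worse makes the function report
failure so the caller can freeze the layout and re-sort everything.

    #include <stdio.h>
    #include <string.h>

    #define REORDER_N 10

    struct row { const char *name; long date; };

    static struct row buf[REORDER_N + 1];       /* newest first */
    static int nbuf;
    static long newest_released = 0x7fffffffL;  /* nothing released yet */

    static void release(struct row r)
    {
            printf("%s\n", r.name); /* hand the row to the layout code */
    }

    /* Returns 0 on success, -1 when the misordering is beyond the
     * buffer's reach and the caller must freeze and re-sort. */
    static int add_row(struct row r)
    {
            int i = nbuf;
            if (r.date > newest_released)
                    return -1;      /* we already emitted past this row */
            while (i > 0 && buf[i - 1].date < r.date) {
                    buf[i] = buf[i - 1];  /* insertion sort, newest first */
                    i--;
            }
            buf[i] = r;
            if (++nbuf > REORDER_N) {     /* full: release the newest */
                    release(buf[0]);
                    newest_released = buf[0].date;
                    memmove(buf, buf + 1, --nbuf * sizeof(*buf));
            }
            return 0;
    }

(A real front end would also drain buf[] in order when the stream
ends.)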
* Re: Change set based shallow clone
2006-09-11 9:56 ` Paul Mackerras
@ 2006-09-11 12:39 ` linux
0 siblings, 0 replies; 101+ messages in thread
From: linux @ 2006-09-11 12:39 UTC (permalink / raw)
To: linux, paulus; +Cc: git, jonsmirl, junkio, mcostalba, torvalds

>> I'm trying to figure out the gitk code, but I'm not fluent in tcl,
>> and it has 39 non-boilerplate comment lines in 229 functions and 6308
>> lines of source, so it requires fairly intensive grokking.
>
> Sorry :)

I was more making excuses for my failure to understand its properties.
Still, if you get the chance, a couple of paragraphs of comments on the
primary data structures would be an immense help.

Anyway, thanks for the summary.

>> For example, is fixing a small number of out-of-place commits
>> practical, or is it better to purge and restart? The former avoids
>> deleting already-existing objects, while the latter avoids moving
>> them.

> For gitk, I'm thinking of a reorder buffer of say 10 entries at the
> front end to cope with minor misorderings; then if misordering occurs
> that is outside the scope of the reorder buffer to fix, freeze the
> layout algorithms at that point, read in the rest of the commits and
> reorder as necessary, then at the end restart the layout algorithm
> from scratch, probably with a popup to inform the user what happened.
> If the user could set the size of the reorder buffer then they could
> avoid having that happen in future, at the cost of it taking longer to
> show the first screenful of commits.

My thoughts were that anything you could do, git-rev-list could do
better, including the re-order buffer, so don't even bother. (For
example, I was thinking of a time- or depth-based one rather than a
count-based one.)

So I read your suggestion as a "purge and restart" preference, with
git-rev-list keeping a cache of problems encountered so that on most
runs, it doesn't have to do it. Putting that logic in git-rev-list
means that you don't have to detect ordering problems or do any
sorting, just handle an "oops!" line in the output.

Basically, given the example graph (with letters indicating timestamps)

    ...G--D--C--B---A
             \     /
              F---E

You'd get something like

    A B E
    B C
    C D
    D G
    E F
    *** ERROR - restarting - please wait
    A B E
    B C
    E F
    F C
    C D
    D G
    G ...
* Re: Change set based shallow clone
2006-09-10 4:54 ` Junio C Hamano
2006-09-10 5:14 ` Marco Costalba
@ 2006-09-10 9:49 ` Jakub Narebski
2006-09-10 10:28 ` Josef Weidendorfer
2 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-10 9:49 UTC (permalink / raw)
To: git

Junio C Hamano wrote:

> I am starting to suspect that introducing a "generation" header to
> the commit objects might actually be a very good thing. For
> one thing, rev-list would automatically always get the topology
> right if we did so.
>
> We would obviously need to update 'convert-objects' and tell everybody
> that they need to rewrite their history if we take this route.
> That kind of flag-day conversion used to be an Ok thing to do,
> but it is getting harder and harder these days, for obvious
> reasons.

Wouldn't it be possible to keep the code not much more complex if some
(newer) commit objects had the "generation" header and some (older)
ones did not? Git would use the generation header when it is present,
and the current timestamp-based heuristic code when it is not.

It would of course be better if the new commits had correct
"generation" headers, so insertion of a "new" commit after an "old"
commit would carry some extra overhead...

--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: Change set based shallow clone
2006-09-10 4:54 ` Junio C Hamano
2006-09-10 5:14 ` Marco Costalba
2006-09-10 9:49 ` Jakub Narebski
@ 2006-09-10 10:28 ` Josef Weidendorfer
2 siblings, 0 replies; 101+ messages in thread
From: Josef Weidendorfer @ 2006-09-10 10:28 UTC (permalink / raw)
To: Junio C Hamano
Cc: Marco Costalba, Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On Sunday 10 September 2006 06:54, Junio C Hamano wrote:
> I am starting to suspect that introducing a "generation" header to
> the commit objects might actually be a very good thing. For
> one thing, rev-list would automatically always get the topology
> right if we did so.

I do not think that any commits need to be changed; only "rev-list
--topo-order" needs to build its own private cache of sequence numbers
for commits.

AFAICS, a generation number is purely redundant information, as a
single run of "rev-list --topo-order" can recalculate it. If rev-list
cannot find the number in its private per-repository cache of sequence
numbers, it calculates it the hard way (same as currently) and puts the
numbers it found into the cache afterwards.

The cache should be similar in size to a pack index of a fully packed
repository, but it could probably be made much smaller by storing only
one sequence number for each linear run of commits with a single parent
each.

Josef
* Re: Change set based shallow clone
@ 2006-09-09 10:31 linux
2006-09-09 13:00 ` Marco Costalba
0 siblings, 1 reply; 101+ messages in thread
From: linux @ 2006-09-09 10:31 UTC (permalink / raw)
To: mcostalba; +Cc: git, linux
> If the out of order revisions are a small fraction of the total, then
> it could be possible to have something like
>
> git rev-list --topo-order --with-appended-fixups HEAD
>
> where, while git-rev-list is working _without sorting the whole tree
> first_, when it finds an out of order revision it stores it in a
> fixup-list buffer, and *at the end* of the normal git-rev-list run the
> buffer is flushed to the receiver, so that the drawing logic does not
> change and the out of order revisions arrive at the end, already
> packed, sorted and prepared by git-rev-list.
I don't think I understand your proposal. The problem arises when
git-rev-list finds a commit that it should have listed before something
that it has already output.
Just for example:
Commit D: Ancestor B
Commit B: Ancestor A
Commit C: Ancestor B
Commit C is the problem, because if git-rev-list has already output B,
there's no way to back up and insert it in the right place.
How is waiting to output the already-behind-schedule commit C going
to help anything? The idea in gitk is to rewind the layout algorithm
to before B was added, insert C, and replay from there. This is
most efficient if C is encountered as soon after its correct place
as possible.
* Re: Change set based shallow clone
2006-09-09 10:31 linux
@ 2006-09-09 13:00 ` Marco Costalba
0 siblings, 0 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-09 13:00 UTC (permalink / raw)
To: linux@horizon.com; +Cc: git

On 9 Sep 2006 06:31:57 -0400, linux@horizon.com <linux@horizon.com> wrote:
> > If the out of order revisions are a small fraction of the total,
> > then it could be possible to have something like
> >
> > git rev-list --topo-order --with-appended-fixups HEAD
> >
> > where, while git-rev-list is working _without sorting the whole tree
> > first_, when it finds an out of order revision it stores it in a
> > fixup-list buffer, and *at the end* of the normal git-rev-list run
> > the buffer is flushed to the receiver, so that the drawing logic
> > does not change and the out of order revisions arrive at the end,
> > already packed, sorted and prepared by git-rev-list.
>
> I don't think I understand your proposal. The problem arises when
> git-rev-list finds a commit that it should have listed before
> something that it has already output.
>
> Just for example:
>
> Commit D: Ancestor B
> Commit B: Ancestor A
> Commit C: Ancestor B
>
> Commit C is the problem, because if git-rev-list has already output B,
> there's no way to back up and insert it in the right place.
>
> How is waiting to output the already-behind-schedule commit C going
> to help anything?

It helps in a number of ways from the receiver's point of view:

- The receiver doesn't need to keep a sorted list of loaded revisions.

- The receiver doesn't need to stop parsing input and redraw the graph
in the middle of the loading.

- The receiver doesn't need to check for possible out of order commits
when loading data, but can be assured that the data that arrives is
--topo-order consistent (although it may not be complete).

- It splits the in-order parsing code (performance critical and the
common case) from the fixup code (run at the end and on a small set of
commits).

- It allows a much faster implementation, because the sorting is done
only in git-rev-list and only once.

Following your example, my suggestion was something like this. Instead
of

    git-rev-list HEAD

    Commit D: Ancestor B
    Commit B: Ancestor A
    Commit C: Ancestor B
    Commit A: Ancestor E
    Commit E: Ancestor F
    Commit F: Ancestor G
    ....
    Commit V: Ancestor Z

git-rev-list --topo-order --with-appended-fixups HEAD sends something
like:

    Commit D: Ancestor B
    Commit B: Ancestor A
    Commit A: Ancestor E
    Commit E: Ancestor F
    Commit F: Ancestor G
    ....
    Commit V: Ancestor Z
    (* end of correctly sorted commits, start of out of order commits *)
    Commit C: Ancestor B
    ......
    Commit N: Ancestor M

So the receiver's drawing code, designed to work with --topo-order
style output, could as usual draw the graph down to "Commit V: Ancestor
Z" (but without waiting for git-rev-list's latency!!), and the new
fixup code could *at the end* loop across the *already sorted* fixup
set:

    Commit C: Ancestor B
    ......
    Commit N: Ancestor M

and update the graph in one go.

Of course some flag/syntax should be introduced to mark, in the
git-rev-list output stream, the end of the sorted commits and the
beginning of the fixups.

Marco
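To sketch what the receiving end of this two-phase stream could look
like (everything here is hypothetical; no --with-appended-fixups option
exists in git): rows from the sorted run are appended as they arrive,
and each trailing fixup is spliced in directly above its ancestor,
which is where a child belongs in --topo-order output, assuming -- as
in the example above -- that the fixup's own descendants were already
drawn higher up.

    #include <stdio.h>
    #include <string.h>

    static const char *rows[1024];  /* revision names, newest first */
    static int nrows;

    /* Phase 1: the sorted run is appended as it arrives. */
    static void append_row(const char *rev)
    {
            rows[nrows++] = rev;
    }

    /* Phase 2: splice "Commit rev: Ancestor anc" in above the
     * ancestor, since a commit must precede its parents. */
    static void apply_fixup(const char *rev, const char *anc)
    {
            int i, pos = nrows;             /* fallback: append at end */
            for (i = 0; i < nrows; i++)
                    if (!strcmp(rows[i], anc)) {
                            pos = i;        /* child goes above parent */
                            break;
                    }
            memmove(rows + pos + 1, rows + pos,
                    (nrows - pos) * sizeof(*rows));
            rows[pos] = rev;
            nrows++;
    }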