* Change set based shallow clone
@ 2006-09-07 19:52 Jon Smirl
2006-09-07 20:21 ` Jakub Narebski
` (2 more replies)
0 siblings, 3 replies; 101+ messages in thread
From: Jon Smirl @ 2006-09-07 19:52 UTC (permalink / raw)
To: git, linux@horizon.com
Here's a change set based shallow clone scheme I've been thinking
about. Does it have potential?
When the client wants a shallow clone, it starts by telling the server
all of the HEADs and how many change sets down each of those HEADs it
has locally. That's a small amount of data to transmit, and it can be
easily tracked. Let's ignore merged branches for the moment.
The client then asks for at least 10 (or N) change sets for each of
the HEADs present at the server. The server starts from each HEAD and
works backwards until it encounters a change set present on the
client. At that point it can compute efficient deltas to send.
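
A sketch of the proposed exchange may make this concrete. This is an
entirely hypothetical dialogue, not anything the git protocol speaks
today; the ref name and depths are made up:

    client: have refs/heads/master, 1500 change sets deep locally
    client: want refs/heads/master, at least 10 change sets
    server: walks master back from its tip, stopping early if it
            reaches a change set the client reported having
    server: sends those change sets, delta-compressed against the
            newest state the client is known to have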
If you haven't updated for six months, the server is not going to find
anything you have locally when it walks backwards for 10 change sets.
When this situation is encountered, the server needs to generate a
delta just for you, between one of the change sets it knows you have
and one of the 10 change sets you want. This one-off delta lets you
avoid fetching all of the objects back to a common branch ancestor;
it functions as a jump over the intervening space.
In the case of an initial shallow clone the client won't have anything
to delta against. The server will be forced to send a full version of
one of the 10 change sets requested and deltas for the rest.

Getting an initial shallow clone should take about as long as a CVS
checkout.
This scheme does require the server to sometimes generate custom
deltas for the client, but in all the cases I have been working with
everything is IO bound, so it is better to spend some CPU to reduce
the IO needed.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 101+ messages in thread

* Re: Change set based shallow clone
  2006-09-07 19:52 Change set based shallow clone Jon Smirl
@ 2006-09-07 20:21 ` Jakub Narebski
  2006-09-07 20:41   ` Jon Smirl
  2006-09-07 22:07 ` Junio C Hamano
  2006-09-08  1:01 ` linux
  2 siblings, 1 reply; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 20:21 UTC (permalink / raw)
To: git

Jon Smirl wrote:

> If you haven't updated for six months, the server is not going to
> find anything you have locally when it walks backwards for 10 change
> sets. When this situation is encountered, the server needs to
> generate a delta just for you, between one of the change sets it
> knows you have and one of the 10 change sets you want. This one-off
> delta lets you avoid fetching all of the objects back to a common
> branch ancestor; it functions as a jump over the intervening space.

I don't understand. Git is _not_ patchset based (like GNU Arch, or
Mercurial, or CVS). It is snapshot based. So if you want to download
with a "skip", you only need the local part of the downloader to make
appropriate grafts, like below:

    *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)

    *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)

But the part you were talking about is the _easy_ part; the hard part
is merges, including merging a branch which was split off the trunk
before the cut-off point, history rewriting (cf. the 'pu' branch, and
rebases), etc.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
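
For reference, the grafts mechanism Jakub mentions lives in the
plain-text file .git/info/grafts: each line names a commit followed by
the parents git should pretend it has, and a commit listed alone is
treated as parentless. A minimal cauterizing entry, with a made-up
object name:

    $ cat .git/info/grafts
    a1b2c3d4e5f60718293a4b5c6d7e8f9012345678

With that single line in place, git-log stops at that commit, and a
grafts-aware fsck no longer expects the objects behind it to be
present.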
* Re: Change set based shallow clone
  2006-09-07 20:21 ` Jakub Narebski
@ 2006-09-07 20:41   ` Jon Smirl
  2006-09-07 21:33     ` Jeff King
  ` (3 more replies)
  0 siblings, 4 replies; 101+ messages in thread
From: Jon Smirl @ 2006-09-07 20:41 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git

On 9/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
> I don't understand. Git is _not_ patchset based (like GNU Arch, or

I meant change set to refer to a commit plus the trees and blobs that
make it up. These may be present in full or delta form.

> Mercurial, or CVS). It is snapshot based. So if you want to download
> with a "skip", you only need the local part of the downloader to make
> appropriate grafts, like below:
>
>     *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)
>
>     *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)
>
> But the part you were talking about is the _easy_ part; the hard part
> is merges, including merging a branch which was split off the trunk
> before the cut-off point, history rewriting (cf. the 'pu' branch, and
> rebases), etc.

Does an average user do these things? The shallow clone is there to
address the casual user who gags at a five-hour download to get an
initial checkout of Mozilla when they want to make a five-line change
or just browse the source for a few minutes.

I would expect advanced users to have a full tree present.

I was going to have the dangling references from the shallow clone
point to 'not-present' objects. When you try to do any of the more
complex operations you would hit these objects and fault down more of
the tree.

There would also be a command to bring down all of the objects to
fully populate a sparse tree. You could do the shallow clone to begin
with and then do the full tree populate overnight or in the
background.

Maybe the answer is to build a shallow clone tool for casual use, and
then if you try to run anything too complex on it git just tells you
that you have to download the entire tree.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
@ 2006-09-07 21:33     ` Jeff King
  2006-09-07 21:51       ` Jakub Narebski
  2006-09-07 21:37     ` Jakub Narebski
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 101+ messages in thread
From: Jeff King @ 2006-09-07 21:33 UTC (permalink / raw)
To: Jon Smirl; +Cc: Jakub Narebski, git

On Thu, Sep 07, 2006 at 04:41:20PM -0400, Jon Smirl wrote:

> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

Can you list which operations you'd like to do on the shallow clone?
Many operations that look at history are going to fail with a very
short history (e.g., finding a merge base). Some operations can work
with a short history, but the results are probably not useful (seeing
the last 10 commits from git-log is probably not interesting). If you
just want the latest snapshot, the remote git-archive work going on
right now will probably take care of that.

-Peff

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 21:33     ` Jeff King
@ 2006-09-07 21:51       ` Jakub Narebski
  0 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 21:51 UTC (permalink / raw)
To: git

Jeff King wrote:

> Many operations that look at history are going to fail with a very
> short history (e.g., finding a merge base). Some operations can work
> with a short history, but the results are probably not useful (seeing
> the last 10 commits from git-log is probably not interesting). If you
> just want the latest snapshot, the remote git-archive work going on
> right now will probably take care of that.

That is the source of the idea of a _sparse_ clone, which would
include e.g. the top 10 commits from each of the selected branches,
all commits pointed to by branches and tags, and all merge bases and
roots, joined using grafts.

E.g. if the full history looks like this (the | column marks the
cut-off):

                       -*.|..........                    head
                      /   |
      -*--*--*-   -*--*--x--*-|-*--*--*--                head
     /         \ /            |
  *--*--*--*--*--*--*--*--*--*--*--*--*-|-*--*--*--     head
     \...............|..........                         tag
                      |
                      \- [cut-off]

then the sparse clone history would look like that ((---) denotes
parentage, (...) denotes grafts):

                ,......x....|.*--*--*--                  head
                :           |
  *....................*--*.............|.*--*--*--     head
     \...............|..........                         tag
                      |
                      \- [cut-off]

of course assuming that we have no earlier commits (i.e. a sparse
clone, not a sparse fetch).

Of course, that solves only some of the troubles...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
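
A sparse clone's skips could be expressed with the same grafts file,
by giving a kept commit a pretend parent that is really a more distant
ancestor. A sketch, with both object names made up:

    $ cat .git/info/grafts
    0123456789abcdef0123456789abcdef01234567 fedcba9876543210fedcba9876543210fedcba98

git-log would then walk straight from the newer kept segment to the
older one, silently skipping the elided stretch of history.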
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
  2006-09-07 21:33     ` Jeff King
@ 2006-09-07 21:37     ` Jakub Narebski
  2006-09-07 22:14     ` Junio C Hamano
  2006-09-08  8:48     ` Andreas Ericsson
  3 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 21:37 UTC (permalink / raw)
To: git

Jon Smirl wrote:
> On 9/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
>> I don't understand. Git is _not_ patchset based (like GNU Arch, or
>
> I meant change set to refer to a commit plus the trees and blobs that
> make it up. These may be present in full or delta form.
>
>> Mercurial, or CVS). It is snapshot based. So if you want to download
>> with a "skip", you only need the local part of the downloader to
>> make appropriate grafts, like below:
>>
>>     *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)
>>
>>     *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)
>>
>> But the part you were talking about is the _easy_ part; the hard
>> part is merges, including merging a branch which was split off the
>> trunk before the cut-off point, history rewriting (cf. the 'pu'
>> branch, and rebases), etc.
>
> Does an average user do these things? The shallow clone is there to
> address the casual user who gags at a five-hour download to get an
> initial checkout of Mozilla when they want to make a five-line change
> or just browse the source for a few minutes.

The question is not whether the average user does these things, but
whether the average user encounters them. I think that merging
branches into the 'master' branch (trunk) is quite common, and merging
the trunk into branches for checking is also common.

As to checking out the latest revision: with git-archive --remote (or
git-tar-tree --remote as of now) you can download the newest version
as an archive, although without the commit object, I think.

> I would expect advanced users to have a full tree present.
>
> I was going to have the dangling references from the shallow clone
> point to 'not-present' objects. When you try to do any of the more
> complex operations you would hit these objects and fault down more of
> the tree.

Only if you consider git-log to be a complex operation. It would be
better, I think, to cover the gaps using grafts. In your first post
you talk about downloading a diff covering the gap/skip; that would
matter only if git were patchset based. You need to download commits
and referenced objects, _except_ cutting off some parents.

> There would also be a command to bring down all of the objects to
> fully populate a sparse tree. You could do the shallow clone to begin
> with and then do the full tree populate overnight or in the
> background.

And that should be done by marking the "shallow"/"sparse" grafts and
just deleting them (or updating them if needed).

> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

The shallow/sparse/lazy[*1*] clone proposal is one of the recurring
proposals (besides subproject support) on the git mailing list.

Footnotes:
[*1*] Lazy clone, also known as remote alternatives.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
  2006-09-07 21:33     ` Jeff King
  2006-09-07 21:37     ` Jakub Narebski
@ 2006-09-07 22:14     ` Junio C Hamano
  2006-09-07 23:09       ` Jon Smirl
  2006-09-08  8:48     ` Andreas Ericsson
  3 siblings, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2006-09-07 22:14 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

"Jon Smirl" <jonsmirl@gmail.com> writes:

> Does an average user do these things? The shallow clone is there to
> address the casual user who gags at a five-hour download to get an
> initial checkout of Mozilla when they want to make a five-line change
> or just browse the source for a few minutes.
> ...
> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

For that kind of thing, "git-tar-tree --remote" would suffice, I would
imagine. The five-line change can be tracked locally by creating an
initial commit from the tar-tree extract; such a casual user will not
be pushing or asking to pull, but sending in patches to upstream, no?

^ permalink raw reply [flat|nested] 101+ messages in thread
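
A sketch of the workflow Junio suggests; the repository URL is made
up, and the commands assume git of this era (git-init-db rather than
the later git init):

    $ mkdir mozilla && cd mozilla
    $ git-tar-tree --remote=git://example.org/mozilla.git HEAD | tar xf -
    $ git-init-db
    $ git-add .
    $ git-commit -m 'snapshot of upstream HEAD'

From there the casual user hacks on top of that single commit and
mails patches out with git-format-patch instead of pushing or pulling.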
* Re: Change set based shallow clone
  2006-09-07 22:14     ` Junio C Hamano
@ 2006-09-07 23:09       ` Jon Smirl
  2006-09-10 23:20         ` Anand Kumria
  0 siblings, 1 reply; 101+ messages in thread
From: Jon Smirl @ 2006-09-07 23:09 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On 9/7/06, Junio C Hamano <junkio@cox.net> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > Does an average user do these things? The shallow clone is there to
> > address the casual user who gags at a five-hour download to get an
> > initial checkout of Mozilla when they want to make a five-line
> > change or just browse the source for a few minutes.
> > ...
> > Maybe the answer is to build a shallow clone tool for casual use,
> > and then if you try to run anything too complex on it git just
> > tells you that you have to download the entire tree.
>
> For that kind of thing, "git-tar-tree --remote" would suffice, I
> would imagine. The five-line change can be tracked locally by
> creating an initial commit from the tar-tree extract; such a casual
> user will not be pushing or asking to pull, but sending in patches to
> upstream, no?

From my observation the casual user does something like this:

  get a shallow clone
  look at it for a while
  pull once a day to keep it up to date
  decide to make some changes
  start a local branch
  commit changes on local branch
  push these changes to someone else for review
  maybe pull changes on the branch back from the other person
  keep pulling updates from the main repository
  merge these updates to the local branch
  browse around on the recent logs to see who changed what
  finally push the branch to a maintainer
  pull it back down from the main repository after the maintainer puts it in
  abandon work on local branch

Some people can't figure out the branch step and just copy the
repository.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 23:09       ` Jon Smirl
@ 2006-09-10 23:20         ` Anand Kumria
  0 siblings, 0 replies; 101+ messages in thread
From: Anand Kumria @ 2006-09-10 23:20 UTC (permalink / raw)
To: git

On Thu, 07 Sep 2006 19:09:21 -0400, Jon Smirl wrote:
> On 9/7/06, Junio C Hamano <junkio@cox.net> wrote:
>> "Jon Smirl" <jonsmirl@gmail.com> writes:
>>
>> > Does an average user do these things? The shallow clone is there
>> > to address the casual user who gags at a five-hour download to get
>> > an initial checkout of Mozilla when they want to make a five-line
>> > change or just browse the source for a few minutes.
>> > ...
>> > Maybe the answer is to build a shallow clone tool for casual use,
>> > and then if you try to run anything too complex on it git just
>> > tells you that you have to download the entire tree.
>>
>> For that kind of thing, "git-tar-tree --remote" would suffice, I
>> would imagine. The five-line change can be tracked locally by
>> creating an initial commit from the tar-tree extract; such a casual
>> user will not be pushing or asking to pull, but sending in patches
>> to upstream, no?
>
> From my observation the casual user does something like this:
>
>   get a shallow clone

This could basically be something which looks at the remote HEAD and
pulls down a copy of that commit/tree (and associated objects), right?

>   look at it for a while
>   pull once a day to keep it up to date

Same again.

>   decide to make some changes
>   start a local branch
>   commit changes on local branch
>
>   push these changes to someone else for review
>   maybe pull changes on the branch back from the other person

[...]

At what point, if any, do you envisage a casual user pulling down a
full copy of the repository?

Cheers,
Anand

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 20:41   ` Jon Smirl
  ` (2 preceding siblings ...)
  2006-09-07 22:14     ` Junio C Hamano
@ 2006-09-08  8:48     ` Andreas Ericsson
  3 siblings, 0 replies; 101+ messages in thread
From: Andreas Ericsson @ 2006-09-08 8:48 UTC (permalink / raw)
To: Jon Smirl; +Cc: Jakub Narebski, git

Jon Smirl wrote:
> On 9/7/06, Jakub Narebski <jnareb@gmail.com> wrote:
>> I don't understand. Git is _not_ patchset based (like GNU Arch, or
>
> I meant change set to refer to a commit plus the trees and blobs that
> make it up. These may be present in full or delta form.
>
>> Mercurial, or CVS). It is snapshot based. So if you want to download
>> with a "skip", you only need the local part of the downloader to
>> make appropriate grafts, like below:
>>
>>     *--*--*--*--*--*--*--*--*--*--*--HEAD   (server)
>>
>>     *--*--*...........*--*--*--*--*--HEAD   (shallow/sparse clone)
>>
>> But the part you were talking about is the _easy_ part; the hard
>> part is merges, including merging a branch which was split off the
>> trunk before the cut-off point, history rewriting (cf. the 'pu'
>> branch, and rebases), etc.
>
> Does an average user do these things? The shallow clone is there to
> address the casual user who gags at a five-hour download to get an
> initial checkout of Mozilla when they want to make a five-line change
> or just browse the source for a few minutes.

A better idea would be to allow those users to download a gzipped
tarball of a pre-grafted repository. It shouldn't be terribly
difficult to set up an update-hook that creates the pre-grafted
repository for you whenever a tag (or some such) is created in the
repo you host wherever everybody does their initial clone from. As I
understand it (although I've admittedly followed the git mailing-list
sporadically the past three or so months), grafts already work as
intended, and the users can then fetch into their grafted repo to get
a bare minimum of objects.

> There would also be a command to bring down all of the objects to
> fully populate a sparse tree. You could do the shallow clone to begin
> with and then do the full tree populate overnight or in the
> background.

With the pre-grafted history this would work as follows:

  $ mkdir pregraft
  $ wget http://pre-grafts.mozilla.org/pregrafted.git.tgz
  $ cd pregraft
  $ tar xvzf ../pregrafted.git.tgz
  $ cd ..
  $ git clone mozilla-repo-url >& /dev/null &
  $ cd pregraft
  # work, work, work; full clone completes
  $ cd ../mozilla-repo
  $ git pull ../pregraft master

or something similar. iow, you get the small repo quickly and can
start hacking while the full-history clone is downloading. If I
understand grafts correctly, you could then merge the changes made in
the grafted repo into the one with full history.

> Maybe the answer is to build a shallow clone tool for casual use, and
> then if you try to run anything too complex on it git just tells you
> that you have to download the entire tree.

I believe all tools that work with history understand grafts already,
and if so they should provide sane messages when the user attempts to
access history beyond the grafts. I might have missed or misunderstood
something, but this seems to me like a simple solution to a complex
problem.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply [flat|nested] 101+ messages in thread
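
A minimal sketch of the update-hook Andreas describes; the helper
script is hypothetical, and triggering on tags is just one possible
policy:

    #!/bin/sh
    # update hook: invoked by git as <refname> <old-sha1> <new-sha1>
    refname="$1"
    case "$refname" in
    refs/tags/*)
        # rebuild the downloadable pre-grafted repository in the background
        /usr/local/bin/make-pregrafted-tarball.sh &   # hypothetical helper
        ;;
    esac
    exit 0   # never reject the update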
* Re: Change set based shallow clone
  2006-09-07 19:52 Change set based shallow clone Jon Smirl
  2006-09-07 20:21 ` Jakub Narebski
@ 2006-09-07 22:07 ` Junio C Hamano
  2006-09-07 22:40   ` Jakub Narebski
  ` (2 more replies)
  2006-09-08  1:01 ` linux
  2 siblings, 3 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-07 22:07 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

I've been postponing thinking about shallow clones with a plan to do
so when I have absolutely nothing else to do for a solid few days.
Since my time is booked pretty much full with the day job and the
ongoing pack-file topics these days, I will most likely stick to that
plan after I finish this message.

Not because I think it is an insignificant problem at the fringe (I
did not say "when I have nothing better to do"), but because I feel
this is something important (and hard) that requires "think about it
and only about it straight through for a few days, dropping everything
else" kind of thinking.

When you have a ref under .git/refs/ pointing at an object, you are
making this assertion:

    The object store associated with this repository (usually,
    .git/objects and alternates pointed at by
    .git/objects/info/alternates) holds _all_ objects that are
    recursively reachable from this object by following tag reference,
    commit ancestry chain, commit to its tree, and tree to its
    subtrees and blobs.

So by definition, a shallow clone is a broken/incomplete repository.
Unless we fix the breakage in some way.

One way to do so which has often been talked about is to use the
grafts facility. You cut off the history at some commit, and cauterize
the commit ancestry chain at that point by telling git to pretend that
the commit does not have any parent. While this would work after
setting it up, in the sense that "git log" would stop at the commit
and "git fsck-objects" would say nothing is missing, it is cumbersome
to set up, we do not have an easy way to fetch (or clone -- but clone
is a special case of fetch where the downstream starts out from
emptiness) from such a repository, unplugging to get histories further
back with a later fetch is quite involved, and it is not clear what it
means to further fetch from such a shallow clone.

Where does the difficulty in fetching come from? Mostly because the
upload side and the download side need to agree on the altered
history. Let's say the download side is empty and somehow tells the
upload side that it wants the latest 5 commits' worth of stuff
starting from the tip:

 - The upload side runs

	rev-list --objects HEAD~5..HEAD | pack-objects --stdout

   to give you the needed data. This is almost [*1*] doable by telling
   it that you have HEAD~5.

 - Also the upload side can say "pretend that this commit HEAD~4 has
   no parent, although cat-file on it would say otherwise".

 - The downloader creates a suitable grafts file to cauterize commit
   HEAD~4.

Is this enough? Hardly. Consider:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---*---*---*
                    ^           ^
                HEAD^^2~3    HEAD^^2

The 5-commit request should apply to send the '*' commits, so the
original "HEAD~5..HEAD" was bogus to begin with. You have to say
"HEAD --not HEAD~5 HEAD^^2~3" to get the above.

Coming up with this restriction needs to be done on the upload side.
Doing so on the download side means it needs to retrieve the recent
"real" history to do that computation to come up with where to cut
things off, which is crazy.

However, it may be that the above graph is not what the usual
downloader wants. If the upload side is _the_ official repository of
some sort that merges in subsystem branches, and if you are mostly
interested in the official repository's history, you might have meant
to ask this with your 5-commit request instead:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---o---o---o
                                ^
                             HEAD^^2

That is, you would want to cauterize the side branch. The pack needs
to be built with "HEAD --not HEAD~5 HEAD^^2", and the cauterize
instructions would say "HEAD~4 has no parent, HEAD^ has only one
parent, which is HEAD~2", instead of saying just "HEAD~4 has no
parent."

So whoever is computing this cauterization point has a policy decision
to make, because the meaning of "5 commits away" or "recent 10 days"
is very fuzzy when merges are involved. It obviously is easy to
declare that the policy decision is solely made on the uploader side,
from the implementation point of view, but I do not offhand know what
the ramification of that decision is.

Oh, about "recent 10 days". The side branch might be a very old
development that was merged recently into the mainline. From the
mainline's point of view, everything that happened on the side branch
is something new to him (think of Linus pulling from Jeff, who has
been cooking his stuff for quite some time). Looking at each
individual commit on the side branch, however, they have older commit
and author timestamps. If you pick "I want the mainline's point of
view", that means you need to compute (in the above picture):

 - notice that the commit timestamp of HEAD^ (the merge) is within the
   recent 10-day window.

 - using HEAD^^1 (supposedly the state of the mainline before the
   merge happened [*2*]) and HEAD^^2 (the side branch), compute:

	rev-list HEAD^^1..HEAD^^2

   The results are the "new" ones that appeared when the merge was
   made, so they are within the recent 10-day window from the
   mainline's point of view. Include them.

What that means is, assuming that HEAD~5 is also the 10-day cut-off
point on the mainline, and HEAD^ is recent, we would send this:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                *---*---*---*---*
                                ^
                             HEAD^^2

I'll leave it as an exercise to the readers to come up with the pack
generation specification and cauterizing instructions to give to the
downloader, but you see how quickly this becomes very complex, and we
still haven't talked about the above when multiple merge bases are
involved.

Thinking about it a bit more, maybe the "from the mainline's point of
view" policy would include the same entire side branch when limiting
by depth of commits. After all, the whole side branch was something
new when the HEAD^ merge was created. But that is another policy
decision. And the downloader side cannot make that decision before
seeing what the true ancestry graph looks like.

Let's say we somehow managed to make a shallow clone. The downloader
has the '*' commits:

    o---o---o---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---*---*---*
                    ^           ^
                HEAD^^2~3    HEAD^^2

In the meantime, the uploader side made some more commits:

                x---x---x---x---x---x
               /                     \
              /                       \
             /                         \
    o---o---o---F---o---*---*---*---*---H---x---X
             \                     /
              \                   /
               \                 /
                o---o---*---*---*

Now the downloader wants to update. How does that go?

 - The uploader says the tip is at X.

 - The downloader says it has the old HEAD (H), H^, ...

 - Suppose the development on the recently merged new side branch did
   not touch most of the files since it forked from the mainline at F,
   but the mainline, which is now part of the downloader's history,
   made a lot of changes that were not touched by the 'x' commits.

   The uploader, if it did not know the downloader has altered
   history, would take "have H" from the downloader to mean "the other
   side has everything reachable from H", so it would compute H..X
   (commits that are not in H but in X, and the trees and blobs needed
   to complete them). But that can leave out the trees and blobs that
   existed in F (and were part of commits on the new side branch).

 - So the uploader side needs to know how the history on the
   downloader's side is altered from reality when interpreting "have
   H". It does not mean "H and all the history behind it" to the
   uploader anymore.

One way to do so is to send grafts information from the downloader to
the uploader. If the uploader _knows_ that the altered history ends at
the leftmost '*' commit on the mainline, then it can take into account
the fact that the downloader does not have commit 'F'. That, however,
introduces another problem. Where should it stop? Obviously X needs to
be sent, and so is X^, but how do we know which 'x' commit to stop at?
Do we record the initial cut-off criteria (remember, we started the
clone with a 5-commit cutoff) somewhere in the downloader's repository
and send that to the uploader, so that the uploader can apply the same
rule to include only 5 commits from the new side branch? What happens
if there are fewer than 5 commits on that new side branch? Would we
end up including 'F', which the downloader specifically did not want
when making this shallow clone initially?

I won't go into the details, but when you think about what needs to
happen when fetching _from_ a shallow clone (i.e. the uploader side is
incomplete), your head will explode ;-). It is solvable in the same
sense that you can solve the previous "fetching to update" problem, by
trying to define a reasonable semantics for the operation by answering
the questions I posed in the above paragraph (which obviously is in no
way an exhaustive set), but the result will be quite complex. It would
involve sending graft information both from the uploader and the
downloader (after all, the downloader side can also be shallow, but
with different cut-off points from the uploader) and somehow coming up
with the intersection of both, to arrive at a shared view of the
altered history before starting the traversal to find what to send.

Also, often people would want to deepen history (i.e. fetch older
history than initially fetched) after "checking out only the tip". It
is a reasonable request, and is a somewhat easier problem to solve
than the initial shallow clone problem. If you have a working initial
shallow clone, essentially you can:

 - Find the earlier cut-off points by looking at your own grafts file;

 - Ask the uploader to clone shallowly from the parents missing from
   your repository,

and update the resulting grafts information.

Anyway, it is a useful feature, an important operation, and it
involves a lot of thinking (and hard work, obviously). I will not
think about this anymore, until I have absolutely nothing else to do
for a solid few days, but you probably haven't read this far ;-).

-jc

*1* Currently there is no way to find out what HEAD~10 is during the
    initial protocol handshake, but this message is mostly about
    potential protocol update needs.

*2* The first parent of a merge is _usually_ what the repository owner
    saw, i.e. the mainline merging the side branch into it, but that
    is not universally true when fast forwards are involved.

^ permalink raw reply [flat|nested] 101+ messages in thread
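
The two cut-off policies Junio contrasts map to two different limiting
expressions on the uploader side, using the pipeline his message
already names (the revisions refer to his example graphs):

    # keep the side branch back to HEAD^^2~3 (first graph)
    rev-list --objects HEAD --not HEAD~5 HEAD^^2~3 | pack-objects --stdout

    # cauterize the side branch entirely at HEAD^^2 (second graph)
    rev-list --objects HEAD --not HEAD~5 HEAD^^2 | pack-objects --stdout

Either way, the matching cauterizing grafts still have to be shipped
to the downloader alongside the pack.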
* Re: Change set based shallow clone
  2006-09-07 22:07 ` Junio C Hamano
@ 2006-09-07 22:40   ` Jakub Narebski
  2006-09-08  3:54   ` Martin Langhoff
  2006-09-08  5:05   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-07 22:40 UTC (permalink / raw)
To: git

Junio C Hamano wrote:

> One way to do so is to send grafts information from the downloader to
> the uploader. If the uploader _knows_ that the altered history ends
> at the leftmost '*' commit on the mainline, then it can take into
> account the fact that the downloader does not have commit 'F'. That,
> however, introduces another problem. Where should it stop? Obviously
> X needs to be sent, and so is X^, but how do we know which 'x' commit
> to stop at? Do we record the initial cut-off criteria (remember, we
> started the clone with a 5-commit cutoff) somewhere in the
> downloader's repository and send that to the uploader, so that the
> uploader can apply the same rule to include only 5 commits from the
> new side branch? What happens if there are fewer than 5 commits on
> that new side branch? Would we end up including 'F', which the
> downloader specifically did not want when making this shallow clone
> initially?
>
> I won't go into the details, but when you think about what needs to
> happen when fetching _from_ a shallow clone (i.e. the uploader side
> is incomplete), your head will explode ;-). It is solvable in the
> same sense that you can solve the previous "fetching to update"
> problem, by trying to define a reasonable semantics for the operation
> by answering the questions I posed in the above paragraph (which
> obviously is in no way an exhaustive set), but the result will be
> quite complex. It would involve sending graft information both from
> the uploader and the downloader (after all, the downloader side can
> also be shallow, but with different cut-off points from the uploader)
> and somehow coming up with the intersection of both, to arrive at a
> shared view of the altered history before starting the traversal to
> find what to send.

The proposed solution was to send the graft information and the cutoff
from the downloader to the uploader, coming up with the effective
graft points, which would be the intersection of the downloader's and
uploader's grafts[*1*], and then use this intersection as the graft
information when calculating what to send.

With a sparse clone we would have:

    #---o---#---o---o---*---*---*---*---*
             \      ^              /    ^
              \   HEAD~5          /   HEAD
               \                 /
                o---o---*---*---*
                    ^           ^
                HEAD^^2~3    HEAD^^2

where # and * are the commits which the downloader has (the # are the
additional commits of a sparse rather than shallow clone).

                x---x---x---x---x---x
               /                     \
              /                       \
             /                         \
    #---o---#---F---o---*---*---*---*---H---x---X
             \                     /
              \                   /
               \                 /
                o---o---*---*---*

Now, when we are shallow fetching, we should have remembered somewhere
the cutoff used (or use the same cutoff). So we should then get:

                x---*---*---*---*---*
               /                     \
              /                       \
             /                         \
    #---o---#---#---o---*---*---*---*---*---*---*
             \                     /
              \                   /
               \                 /
                o---o---*---*---*

Footnotes:
[*1*] It would be some kind of graph intersection. Note that the
repository we clone from might be a shallow clone, shallower in parts
than our repository, or might be a repository with a historic
repository grafted on (grafts adding history, not simplifying it).

[*2*] I agree that shallow/sparse clone is hard. Lazy clone (or remote
alternatives) is also hard: when we do download objects, do we
download ahead, do we cache downloaded objects or move them to the
local repository, how often do we clean the cache, etc.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 22:07 ` Junio C Hamano
  2006-09-07 22:40   ` Jakub Narebski
@ 2006-09-08  3:54   ` Martin Langhoff
  2006-09-08  5:30     ` Junio C Hamano
  2006-09-08  5:05   ` Aneesh Kumar K.V
  2 siblings, 1 reply; 101+ messages in thread
From: Martin Langhoff @ 2006-09-08 3:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jon Smirl, git

On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
...
> Anyway, it is a useful feature, an important operation, and it
> involves a lot of thinking (and hard work, obviously).
>
> I will not think about this anymore, until I have absolutely nothing
> else to do for a solid few days, but you probably haven't read this
> far ;-).

For when you come back, then. I agree this is rather complicated, so
much so that I suspect we may be barking up the wrong tree.

People who want shallow clones are actually asking for a "light"
clone, in terms of "how much do I need to download". If everyone has
the whole commit chain (but may be missing olden blobs, and even
trees), the problem becomes a lot easier.

On the wire, all the client needs to say is "from root commit X to
later commits A, B, C, D, can I have all the relevant blobs?" (Some
may end up being resent, but that's a tradeoff.)

On the server side, keeping commits (and/or trees) in a segregated
pack makes sense as an optimization. On the client side, a similar
segregated pack for the 'known blobless' commits/trees may make sense,
to be able to treat ops on the blobless part of history differently.

{From a user interaction point of view, it is also easier to error out
saying "don't have it, but you'll get it if you ask for history from
[SHA1] onwards".}

just my .2

m

^ permalink raw reply [flat|nested] 101+ messages in thread
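
A sketch of the request shape Martin is proposing; the verbs and
syntax here are invented for illustration and exist in no version of
the git protocol:

    client: have-commits <tip SHA1>           (commit chain is complete)
    client: want-blobs X..A X..B X..C X..D    (fill in trees and blobs)
    server: sends the trees and blobs reachable from A, B, C and D but
            not from X, possibly resending a few duplicates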
* Re: Change set based shallow clone
  2006-09-08  3:54   ` Martin Langhoff
@ 2006-09-08  5:30     ` Junio C Hamano
  2006-09-08  7:15       ` Martin Langhoff
  2006-09-08 14:20       ` Jon Smirl
  0 siblings, 2 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-08 5:30 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jon Smirl, git

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
> ...
>> Anyway, it is a useful feature, an important operation, and it
>> involves a lot of thinking (and hard work, obviously).
>>
>> I will not think about this anymore, until I have absolutely nothing
>> else to do for a solid few days, but you probably haven't read this
>> far ;-).
>
> For when you come back, then. I agree this is rather complicated, so
> much so that I suspect we may be barking up the wrong tree.
>
> People who want shallow clones are actually asking for a "light"
> clone, in terms of "how much do I need to download". If everyone has
> the whole commit chain (but may be missing olden blobs, and even
> trees), the problem becomes a lot easier.

No, I do not think so. You are just pushing the same problem to
another layer.

Reachability through the commit ancestry chain is no more special than
reachability through commit-to-tree or tree-containment relationships.
The grafts mechanism happens to treat commit ancestry more specially
than the others, but that is not an inherent property of git [*1*]. It
is only an implementation detail (and limitation -- grafts cannot
express anything but commit ancestry) of the current system.

But let's touch a slightly different but related topic first. People
do not ask for shallow clones. They just want faster transfer of
"everything they care about". Shallow and lazy clones are
implementation techniques to achieve the goal of making everything
they care about appear available, typically taking advantage of the
fact that people care about the recent past a lot more than they do
about ancient history.

Shallow clones do so by explicitly saying "sorry, you told me earlier
you do not care about things older than X, so if you now care about
things older than X please do such and such to deepen your history".
What you are saying is a variant of shallow that says "NAK; commits I
have, but trees and blobs associated with such an old commit you have
to ask me to retrieve from elsewhere". Lazy clones do so by "faulting
missing objects in on demand" [*2*]. That is essentially automating
the "please do such and such" part.

So they are all not that different.

Now, first and foremost, while I would love to have a system that
gracefully operates with a "sparse" repository that lacks objects that
should exist from the tag/commit/tree/blob reachability point of view,
it is an absolute requirement that I can tell why objects are missing
from a repository when I find some are missing by running
fsck-objects [*3*]. If the repository is a shallow clone, not having
some object may be expected, but I want to be able to tell repository
corruption locally even in that case, so "assume a lack of an object
is perhaps because it was not downloaded by shallow cloning" is an
unacceptable attitude, and "when you find an object missing from your
repository, you can ask the server [*4*], and if the server does not
have it then you know your repository is corrupt, otherwise it is Ok"
is a wrong answer.

I talked about the need of an upload-pack protocol extension to
exchange grafts information between the uploader and downloader, to
come up with the resulting graft and update the grafts in the
downloaded repository after the object transfer is done. It is needed
because, by having such cauterizing graft entries, I can be sure that
the repository is not corrupt when an fsck-objects that does look at
grafts says "nothing is missing". Jon talked about "fault-in on demand
and leave stub objects until downloading the real thing"; those stub
objects are a natural extension of the grafts facility that extends to
trees and blobs. Either of these would allow me to validate the sparse
repository locally.

Now I'll really shut up ;-).

[*1*] For example, rev-list --objects knows that it needs to show the
tree containment when given only a tree object without any commit.

[*2*] You probably could write a FUSE module that mounts on your
.git/objects, responding to accesses to the .git/objects/??
directories and the files with 38-byte hexadecimal names under them by
talking to other repositories over the wire. No, I am not going to do
it, and I will probably not touch it even with a ten-foot pole, for
reasons I state elsewhere in this message.

[*3*] The real fsck-objects always looks at grafts. The fictitious
version of fsck-objects I am talking about here is the one that uses
the parents recorded on the "parent " lines of commit objects, that
is, the true object-level reachability.

[*4*] In git, there is no inherent server vs client or upstream vs
downstream relationship between repositories. You may even be fetching
from many people and not have a set upstream at all. Or you are _the_
upstream, and your notebook has the latest development history, and
after pushing that latest history to your mothership repository, you
may decide you do not want the ancient development history on a puny
notebook, and locally cauterize the history on your notebook
repository and prune the ancient stuff. The objects missing from the
notebook repository are retrievable from your mothership repository
again if/when needed, and you as the user would know that (and if you
are lucky you may even remember it a few months later), but git
doesn't, and there is no reason for git to want to know it. If we want
to do "fault-in on demand", we need to add a way to say "it is Ok for
this repository to be incomplete -- objects that are missing must be
completed from that repository (or those repositories)". But that's
quite a special case of having a fixed upstream.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08  5:30     ` Junio C Hamano
@ 2006-09-08  7:15       ` Martin Langhoff
  2006-09-08  8:33         ` Junio C Hamano
  2006-09-08 17:18         ` A Large Angry SCM
  1 sibling, 2 replies; 101+ messages in thread
From: Martin Langhoff @ 2006-09-08 7:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jon Smirl, git

On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
> "Martin Langhoff" <martin.langhoff@gmail.com> writes:
>> People who want shallow clones are actually asking for a "light"
>> clone, in terms of "how much do I need to download". If everyone has
>> the whole commit chain (but may be missing olden blobs, and even
>> trees), the problem becomes a lot easier.
>
> No, I do not think so. You are just pushing the same problem to
> another layer.
>
> Reachability through the commit ancestry chain is no more special
> than reachability through commit-to-tree or tree-containment
> relationships. The grafts mechanism happens to treat commit

Agreed that it is no more special. OTOH, if we focus on the fact that
people want to avoid high-cost data transfers, transferring commit
chains is cheap and allows the client to ask good questions when
talking to the server. So as far as tradeoffs go, it allows you to
keep the protocol simple, and minimise complex guessing at either end
of the wire.

> But let's touch a slightly different but related topic first. People
> do not ask for shallow clones. They just want faster transfer of
> "everything they care about". Shallow and lazy

I'd disagree a bit here. They care about the whole project, and in
time they'll find that out and end up pulling it all if they use git
much at all ;-) They want fast, cheap initial checkouts.

...
> So they are all not that different.

Earlier you were pointing out how hard it was for the client to even
know what to ask for, because it can't see the whole picture. Having
the ancestry complete means you always know what to ask for.

> Now, first and foremost, while I would love to have a system that
> gracefully operates with a "sparse" repository that lacks objects
> that should exist from the tag/commit/tree/blob reachability point of
> view, it is an absolute requirement that I can tell why objects are
> missing from a repository when I find some are missing by running
> fsck-objects [*3*].

I agree -- and you can keep those objects you know are expected to be
missing listed in a "packless" idx file somewhere.

> If the repository is a shallow clone, not having some object may be
> expected, but I want to be able to tell repository corruption locally
> even in that case,

+1

> I talked about the need of an upload-pack protocol extension

As far as I can see, this would not need any change to the upload-pack
protocol. There are some hard problems in dealing with a sparse repo
that need thinking through. My thinking is that by having the whole
commit chain around, the protocol can be kept sane, by virtue of the
local repo always having a clear "overall" picture, including knowing
what it's missing.

> [*4*] In git, there is no inherent server vs client or upstream vs
> downstream relationship between repositories.

Here an important distinction must be made. A "publishing" repo cannot
be sparse. A sparse repo probably cannot be cloned from.

> You may even be fetching from many people and not have a set upstream
> at all. Or you are _the_ upstream, and your notebook has the latest
> development history, and after pushing that latest history to your
> mothership repository, you may decide you do not want the ancient
> development history on a puny notebook, and locally cauterize the
> history on your notebook repository and prune the ancient stuff.

Well, that's easy again: "prune old blobs and list them in an idx"
should work well.

cheers,

martin

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08  7:15       ` Martin Langhoff
@ 2006-09-08  8:33         ` Junio C Hamano
  2006-09-08 17:18         ` A Large Angry SCM
  1 sibling, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-08 8:33 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> As far as I can see, this would not need any change to the
> upload-pack protocol.

That I have to disagree with. Take your earlier example of "I have X
and I learned you have A B C D". Now, the fetch that got X was a
commit-only one (but you have the full tree for X), but you got X from
somebody else, not the uploader you are talking with right now. There
is a common ancestor Y somewhere behind X, but between Y and X you
lack trees and blobs.

How would the current protocol work (I am explaining that it won't be
"not need any change")? After you express interest in A, B, C, D with
"want" messages, you start telling the uploader "have X", "have X~1",
"have X~2", ... (X, X~1 etc. are object names, not symbolic).
Eventually the uploader would recognize Y as an ancestor of X that it
has and will ACK it; you stop traversing the ancestors of an acked one
and say "done". So now we have a common ancestor, and the upload side
knows it can omit the commits behind that common commit Y, and the
trees and blobs contained in Y.

See the oops here? You do NOT have the trees and blobs associated with
commit Y.

I am not saying we should not change the protocol. I am just trying to
explain that the problem is not something you can fudge without
changing the protocol. As a first-level approximation, we could, in
addition to the commit object name, have a bit on each "have" message
that says "do I have the trees and blobs associated with the commit"
(by the way, this is _expensive_: you have to traverse down the tree
contained in each commit using has_sha1_file() recursively to see if
you have anything missing from _each_ commit you send to the other
side). Alternatively, you can say "I have this commit, but I do not
have this tree, and I do not even know if I have the blobs needed to
complete that tree, because I do not know what the tree I am missing
contains -- I may have them, I may not. I truly do not know" to convey
the same information.

Think about it a bit -- saying "I know I am missing this tree" is one
thing, but if we end up saying "I do not even know what I am missing",
it is like saying "don't care what I say, just send everything to me".
Are we gaining much by having only commit objects on our end? Once you
end up sending a full tree, it is like doing an initial checkout over
CVS when you think you are incrementally fetching, and you are already
lost. For example, a recent kernel tarball compressed with gzip is
around 50MB; the history for 34k commits since 2.6.12-rc2, fully
packed, is slightly less than 150MB.

^ permalink raw reply [flat|nested] 101+ messages in thread
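
A sketch of that first-level approximation as it might look on the
wire; the message format is hypothetical, and the object names are
made up:

    have 39c1a6f1e3a1... complete       (commit plus all trees/blobs)
    have 8f3d02ab99c0... commit-only    (commit object only)

The second form tells the uploader it cannot safely omit anything
reachable from that commit, which is exactly why Junio doubts the
commit-only scheme gains much.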
* Re: Change set based shallow clone
  2006-09-08  7:15       ` Martin Langhoff
  2006-09-08  8:33         ` Junio C Hamano
@ 2006-09-08 17:18         ` A Large Angry SCM
  1 sibling, 0 replies; 101+ messages in thread
From: A Large Angry SCM @ 2006-09-08 17:18 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Junio C Hamano, Jon Smirl, git

Martin Langhoff wrote:
> On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
...
>> [*4*] In git, there is no inherent server vs client or upstream vs
>> downstream relationship between repositories.
>
> Here an important distinction must be made. A "publishing" repo
> cannot be sparse. A sparse repo probably cannot be cloned from.

With the use of "placeholder" objects, neither one of these assertions
is true.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08  5:30     ` Junio C Hamano
  2006-09-08  7:15       ` Martin Langhoff
@ 2006-09-08 14:20       ` Jon Smirl
  2006-09-08 15:50         ` Jakub Narebski
  1 sibling, 1 reply; 101+ messages in thread
From: Jon Smirl @ 2006-09-08 14:20 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Martin Langhoff, git

On 9/8/06, Junio C Hamano <junkio@cox.net> wrote:
> [*4*] In git, there is no inherent server vs client or upstream vs
> downstream relationship between repositories. You may even be
> fetching from many people and not have a set upstream at all. Or you
> are _the_ upstream, and your notebook has the latest development
> history, and after pushing that latest history to your mothership
> repository, you may decide you do not want the ancient development
> history on a puny notebook, and locally cauterize the history on your
> notebook repository and prune the ancient stuff. The objects missing
> from the notebook repository are retrievable from your mothership
> repository again if/when needed, and you as the user would know that
> (and if you are lucky you may even remember it a few months later),
> but git doesn't, and there is no reason for git to want to know it.
> If we want to do "fault-in on demand", we need to add a way to say
> "it is Ok for this repository to be incomplete -- objects that are
> missing must be completed from that repository (or those
> repositories)". But that's quite a special case of having a fixed
> upstream.

A 'not-present' object would be a normal git object; it would contain
a list of the SHA1s that it is stubbing for. These alias SHA1s need to
end up somewhere where the git tools can find them. Maybe put all of
the 'not-present' objects into their own pack and add the aliases to
the index. If the aliases point back to the 'not-present' object,
everything can be validated even if the alias SHA1s don't match the
one of the object. This pack would be private and not sent to other
repositories.

It would be useful to maintain a list of possible remote repositories
to search for the missing objects if needed. This list could be shared
with the people that clone from you.

If you clone from a remote repository that is a partial copy, you may
not be able to get all of the objects you want. The only choice here
is to point them to a 'not-present' stub and try searching the
alternative repository list.

If you really want to build something for the future, have all of the
git repositories on the net talk to each other and use a distributed
hash scheme to spread redundant copies of the objects out over the
cloud of servers. This will have the effect of creating a global
namespace for all git projects.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-08 14:20       ` Jon Smirl
@ 2006-09-08 15:50         ` Jakub Narebski
  2006-09-09  3:13           ` Petr Baudis
  0 siblings, 1 reply; 101+ messages in thread
From: Jakub Narebski @ 2006-09-08 15:50 UTC (permalink / raw)
To: git

My idea for lazy clone/fetch (lazy = on-demand) is via a remote
alternatives mechanism. We put in the URI for the repository (or
repositories) that hosts the project, and at the start we would need
to download at least heads and tags, and only heads and tags.

When there is a request for an object (commit, tree, blob or tag)
which we don't have locally, we ask the remote repository if it has
it, and download it (and perhaps some related objects). This needs an
extension, I guess, for the git downloader ('wantonly'); it should be
fairly easy for dumb downloaders, if you don't mind getting a whole
pack, i.e. perhaps more than intended. Perhaps 'wantonly' for the git
protocol should work the same: one would get a ready pack (perhaps
with a limit on pack size; otherwise one would get only the requested
objects).

One thing that needs extending for lazy clone is git-fsck... otherwise
you would end up downloading the whole repository when calling
git-fsck. Simplest would be fsck-ing only the local part, perhaps
checking whether the "missing" objects are present in the remote
repository, but not downloading them and not checking connectivity
further... and of course notifying the user that this is a lazy clone.
Unless we run git-fsck-objects on the remote side, then pass the
results to the local side.

There is probably a bunch of problems with this idea...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
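
For comparison, today's (purely local) alternates mechanism is the
plain-text file .git/objects/info/alternates, one object directory per
line; the path below is made up:

    $ cat .git/objects/info/alternates
    /srv/git/mozilla-full.git/objects

The 'remote alternatives' idea amounts to also allowing a URL such as
git://example.org/mozilla.git on such a line, with missing objects
then faulted in over the wire.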
* Re: Change set based shallow clone
  2006-09-08 15:50         ` Jakub Narebski
@ 2006-09-09  3:13           ` Petr Baudis
  2006-09-09  8:39             ` Jakub Narebski
  0 siblings, 1 reply; 101+ messages in thread
From: Petr Baudis @ 2006-09-09 3:13 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git

Dear diary, on Fri, Sep 08, 2006 at 05:50:40PM CEST, I got a letter
where Jakub Narebski <jnareb@gmail.com> said that...

> My idea for lazy clone/fetch (lazy = on-demand) is via a remote
> alternatives mechanism. We put in the URI for the repository (or
> repositories) that hosts the project, and at the start we would need
> to download at least heads and tags, and only heads and tags.

One thing to note is that you won't last very long without getting at
least basically all the commits from the history. git log, git
merge-base and such would either just suck them all in, get partially
moved to the server side, or would undergo quite a painful and
slooooooooow process: "get me commit X... thank you, sir. hmm, it
appears that its parent is commit Y. could you get me commit Y,
please...? thank you, sir. hmm, it appears...".

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Snow falling on Perl. White noise covering line noise. Hides all the
bugs too. -- J. Putnam

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-09  3:13           ` Petr Baudis
@ 2006-09-09  8:39             ` Jakub Narebski
  0 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-09 8:39 UTC (permalink / raw)
To: git

Petr Baudis wrote:
> Dear diary, on Fri, Sep 08, 2006 at 05:50:40PM CEST, I got a letter
> where Jakub Narebski <jnareb@gmail.com> said that...
>> My idea for lazy clone/fetch (lazy = on-demand) is via a remote
>> alternatives mechanism. We put in the URI for the repository (or
>> repositories) that hosts the project, and at the start we would need
>> to download at least heads and tags, and only heads and tags.
>
> One thing to note is that you won't last very long without getting at
> least basically all the commits from the history. git log, git
> merge-base and such would either just suck them all in, get partially
> moved to the server side, or would undergo quite a painful and
> slooooooooow process: "get me commit X... thank you, sir. hmm, it
> appears that its parent is commit Y. could you get me commit Y,
> please...? thank you, sir. hmm, it appears...".

As I said, there is a load of trouble with lazy clone/fetch = remote
alternatives that I didn't think about. git log/git rev-list and
git fsck-objects among them.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
  2006-09-07 22:07 ` Junio C Hamano
  2006-09-07 22:40   ` Jakub Narebski
  2006-09-08  3:54   ` Martin Langhoff
@ 2006-09-08  5:05   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 101+ messages in thread
From: Aneesh Kumar K.V @ 2006-09-08 5:05 UTC (permalink / raw)
To: git

Junio C Hamano wrote:
> I've been postponing thinking about shallow clones with a plan to do
> so when I have absolutely nothing else to do for a solid few days.
> Since my time is booked pretty much full with the day job and the
> ongoing pack-file topics these days, I will most likely stick to that
> plan after I finish this message.
>
> Not because I think it is an insignificant problem at the fringe (I
> did not say "when I have nothing better to do"), but because I feel
> this is something important (and hard) that requires "think about it
> and only about it straight through for a few days, dropping
> everything else" kind of thinking.
>
> When you have a ref under .git/refs/ pointing at an object, you are
> making this assertion:
>
>     The object store associated with this repository (usually,
>     .git/objects and alternates pointed at by
>     .git/objects/info/alternates) holds _all_ objects that are
>     recursively reachable from this object by following tag
>     reference, commit ancestry chain, commit to its tree, and tree to
>     its subtrees and blobs.
>
> So by definition, a shallow clone is a broken/incomplete repository.
> Unless we fix the breakage in some way.
>
> One way to do so which has often been talked about is to use the
> grafts facility. You cut off the history at some commit, and
> cauterize the commit ancestry chain at that point by telling git to
> pretend that the commit does not have any parent. While this would
> work after setting it up, in the sense that "git log" would stop at
> the commit and "git fsck-objects" would say nothing is missing, it is
> cumbersome to set up, we do not have an easy way to fetch (or clone
> -- but clone is a special case of fetch where the downstream starts
> out from emptiness) from such a repository, unplugging to get
> histories further back with a later fetch is quite involved, and it
> is not clear what it means to further fetch from such a shallow
> clone.

Pardon my ignorance, but wouldn't having http:// or ssh-based URL
support in objects/info/alternates support such shallow clones? What I
am wondering is: can't we make the code that follows
objects/info/alternates traverse the network transport, and make git
log work like cvs log in case things are not in the local repository?
This would make all the git operations slow, but if you are interested
in speed you can clone the repository fully.

-aneesh

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-07 19:52 Change set based shallow clone Jon Smirl 2006-09-07 20:21 ` Jakub Narebski 2006-09-07 22:07 ` Junio C Hamano @ 2006-09-08 1:01 ` linux 2006-09-08 2:23 ` Jon Smirl 2 siblings, 1 reply; 101+ messages in thread From: linux @ 2006-09-08 1:01 UTC (permalink / raw) To: git, jonsmirl, linux > Here's a change set based shallow clone scheme I've been thinking > about, does it have potential? Well, let's look at the problems... You might want to look at the explanation of the current git network protocol at http://www.spinics.net/lists/git/msg08899.html I'm just understanding it myself, but as best I understant it, it has four phases: - First, the server lists all of the heads it has, by SHA1 and name. - Second, the client tells it which ones it wants, by name. - Third, an interactive protocol is entered in which the client tells the server which commits it has in terms the server can understand. This proceeds as follows: - For each head the client has, it lists commit objects going backwards in history. - As soon as the server sees one that it has heard of, it sends back "ACK <hash> continue". The client stops tracing that branch and starts in on the next one. - (Periodically, the client sends a ping to the server, which responds "NAK" if it's still alive.) - This continues until the client says "done". At this point, the server knows a set of commits which the client has, and a set of commits which the client wants. Then it invokes git-rev-list --objects-edge <want1> <want2> ^<have1> ^<have2> (In practice, there will be a lot more than two of each, but forgive me if I simplify.) This builds a list of commits that are ancestors of <want1> and <want2> but not <have1> or <have2>. (A commit is considered an ancestor of itself, so <want1> and <want2> are in the set, while <have1> and <have2> are not.) Then it enumerates the complete trees of each of those commits. Then it subtracts from that object set all of the trees and blobs in the <have1> and <have2> commits. This is a good approximation to the set of objects that the client needs to be sent. It does not search the ancestors of <have1> and <have2>, so it is possible that a copy of some object exists in an older commit, and the state in <want1> is a reversion to a pre-<have1> state, but it's unlikely and not worth looking for. Sending it redundantly is harmless. Then, this set of objects is piped to git-pack-objects, which finds them and builds a pack file. If the objects are stored locally as deltas relative to other objects which are either to be sent, or are in the <have1> or <have2> commits (which information is included in the data output by git-rev-list), the delta is copied to the pack file verbatim. Otherwise, more CPU is spent to pack it. Again, the server is allowed to check for deltas against ancestors of <have1> or <have2>, but doesn't bother for efficiency. Fourth, the pack file is sent to the client, which unpacks it (and discards any duplicates). > When the client wants a shallow clone it starts by telling the server > all of the HEADs and how many change sets down each of those HEADs it > has locally. That's a small amout of data to transmit and it can be > easily tracked. Let's ignore merged branches for the moment. When you say "change set", I'm going to assume you mean "commit object". Okay. Now, the server hasn't heard of one or more of those commit objects, because they're local changes. What then? 
Another issue is that a client with a nearly-full copy has to do a full walk of its history to determine the depth count that it has. That can be more than 2500 commits down in the git repository, and worse in the mozilla one. It's actually pretty quick (git-show-branch --more=99999 will do it), but git normally tries to avoid full traversals like the plague Oh, and was "for the moment" supposed to last past the end of your e-mail? I don't see what to do if there's a merge in the history and the depth on different sides is not equal. E.g. the history looks like: ...a---b---c---d---e---f / \ ...w---x---y HEAD / ...p---q---r---s Where "..." means that there are ancestors, but they're missing. > If you haven't updated for six months when the server walks backwards > for 10 change sets it's not going to find anything you have locally. > When this situation is encountered the server needs to generate a > delta just for you between one of the change sets it knows you have > and one of the 10 change sets you want. By generating this one-off > delta it lets you avoid the need to fetch all of the objects back to a > common branch ancestor. The delta functions as a jump over the > intervening space. Your choice of words keeps giving me the impression that you believe that a "change set" is a monolithic object that includes all the changes made to all the files. It's neither monolithic nor composed of changes. A commit objects consists soley of metadata, and contains a pointer to a tree object, which points recursively to the entire project state at the time of the commit. There is massive sharing of component objects between successive commits, but they are NOT stored as deltas relative to one another. The pack-forming heuristics tend to achieve that effect, but it is not guaranteed or required by design. Please understand that, deep in your bones: git is based on snapshots, not deltas. But okay, so we've sent the client the latest 10 commits, with a dangling tail at the bottom. (The files may have been sent as deltas against the old state, or just fresh compressed copies, but that doesn't matter.) Then the heads like "origin" have been advanced. So the old commit history is now unreferenced garbage; nothing points to it, and it will be deleted next time git-prune is run. Is that the intended behavior? Or should updates to an existing clone always complete the connections? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 1:01 ` linux @ 2006-09-08 2:23 ` Jon Smirl 2006-09-08 8:36 ` Jakub Narebski 2006-09-08 18:42 ` linux 0 siblings, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-08 2:23 UTC (permalink / raw) To: linux@horizon.com; +Cc: git On 7 Sep 2006 21:01:12 -0400, linux@horizon.com <linux@horizon.com> wrote: > > When the client wants a shallow clone it starts by telling the server > > all of the HEADs and how many change sets down each of those HEADs it > > has locally. That's a small amout of data to transmit and it can be > > easily tracked. Let's ignore merged branches for the moment. > > When you say "change set", I'm going to assume you mean "commit object". > > Okay. Now, the server hasn't heard of one or more of those commit > objects, because they're local changes. What then? Toss them, if they don't exist on the server the server is going to be able to send any objects for them. > Another issue is that a client with a nearly-full copy has to do a full > walk of its history to determine the depth count that it has. That can > be more than 2500 commits down in the git repository, and worse in the > mozilla one. It's actually pretty quick (git-show-branch --more=99999 > will do it), but git normally tries to avoid full traversals like the > plague Client would track this incrementally and not recompute it each time. > Oh, and was "for the moment" supposed to last past the end of your e-mail? > I don't see what to do if there's a merge in the history and the depth > on different sides is not equal. E.g. the history looks like: > > ...a---b---c---d---e---f > / \ > ...w---x---y HEAD > / > ...p---q---r---s > > Where "..." means that there are ancestors, but they're missing. > > > If you haven't updated for six months when the server walks backwards > > for 10 change sets it's not going to find anything you have locally. > > When this situation is encountered the server needs to generate a > > delta just for you between one of the change sets it knows you have > > and one of the 10 change sets you want. By generating this one-off > > delta it lets you avoid the need to fetch all of the objects back to a > > common branch ancestor. The delta functions as a jump over the > > intervening space. > > Your choice of words keeps giving me the impression that you believe > that a "change set" is a monolithic object that includes all the changes > made to all the files. It's neither monolithic nor composed of changes. > A commit objects consists soley of metadata, and contains a pointer to > a tree object, which points recursively to the entire project state at > the time of the commit. I was using change set to refer to snapshot. > There is massive sharing of component objects between successive > commits, but they are NOT stored as deltas relative to one another. Yes, most of the sharing occurs via the tree structures. > The pack-forming heuristics tend to achieve that effect, but it is not > guaranteed or required by design. > > Please understand that, deep in your bones: git is based on snapshots, > not deltas. > > > But okay, so we've sent the client the latest 10 commits, with a dangling > tail at the bottom. (The files may have been sent as deltas against the > old state, or just fresh compressed copies, but that doesn't matter.) > Then the heads like "origin" have been advanced. > > So the old commit history is now unreferenced garbage; nothing points > to it, and it will be deleted next time git-prune is run. 
Is that > the intended behavior? Or should updates to an existing clone always > complete the connections? If you follow the links in what looks to be a dangling object sooner or latter you will run into the root object or a 'not present' object. If you hit one of those the objects are not dangling and should be preserved. Here is another way to look at the shallow clone problem. The only public ids in a git tree are the head and tag pointers. Send these to the client. Now let's modify the git tools to fault the full objects in one by one from the server whenever a git operation needs the object. Dangling references would point to 'not-present' objects. For a typical user using a model like this, how much of the Mozilla repository would end up being faulted into their machine? Mozilla has 2M objects and 250K commits in a 450MB pack. My estimate is that a typical user is going to touch less than 200K of the objects and maybe less than 100K. Of course always faulting in full objects is wasteful. A smart scheme would be to try and anticipate with some read ahead and figure out ways to send deltas. Tools like gitk would need to only touch the objects needed to draw the screen and not run the full commit chain at startup. This experiment can be done fairly easily. Put all of the kernel source into a single pack file. Modify the git tools to set a bit in the index file if an object is accessed. Use the pack for a few days and then dump out the results. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 2:23 ` Jon Smirl @ 2006-09-08 8:36 ` Jakub Narebski 2006-09-08 8:39 ` Junio C Hamano 2006-09-08 18:42 ` linux 1 sibling, 1 reply; 101+ messages in thread From: Jakub Narebski @ 2006-09-08 8:36 UTC (permalink / raw) To: git Jon Smirl wrote: > Here is another way to look at the shallow clone problem. The only > public ids in a git tree are the head and tag pointers. Send these to > the client. Now let's modify the git tools to fault the full objects > in one by one from the server whenever a git operation needs the > object. Dangling references would point to 'not-present' objects. > > For a typical user using a model like this, how much of the Mozilla > repository would end up being faulted into their machine? Mozilla has > 2M objects and 250K commits in a 450MB pack. My estimate is that a > typical user is going to touch less than 200K of the objects and maybe > less than 100K. > > Of course always faulting in full objects is wasteful. A smart scheme > would be to try and anticipate with some read ahead and figure out > ways to send deltas. Tools like gitk would need to only touch the > objects needed to draw the screen and not run the full commit chain at > startup. Yes, that is also recurring _lazy_ clone idea (or remote alternatives[*1*] idea). With it's own set of difficulties, like cache management, readahead and such, and of course differentiating between not present, and present on the remote server. And what to do when network fails... wouldn't it be just easier to use networking system like Coda or InterMezzo? [*1*] From Documentation/repository-layout.txt or http://www.kernel.org/pub/software/scm/git/docs/repository-layout.html You can be using `objects/info/alternates` mechanism, or `$GIT_ALTERNATE_OBJECT_DIRECTORIES` mechanism to 'borrow' objects from other object stores. A repository with this kind of incomplete object store is not suitable to be published for use with dumb transports but otherwise is OK as long as `objects/info/alternates` points at the right object stores it borrows from. [...] objects/info/alternates:: This file records absolute filesystem paths of alternate object stores that this object store borrows objects from, one pathname per line. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 8:36 ` Jakub Narebski @ 2006-09-08 8:39 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-08 8:39 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> writes: > [*1*] From Documentation/repository-layout.txt or > http://www.kernel.org/pub/software/scm/git/docs/repository-layout.html > > You can be using `objects/info/alternates` mechanism, or > `$GIT_ALTERNATE_OBJECT_DIRECTORIES` mechanism to 'borrow' > objects from other object stores. A repository with this kind > of incomplete object store is not suitable to be published for > use with dumb transports but otherwise is OK as long as > `objects/info/alternates` points at the right object stores > it borrows from. Ah, you spotted an obsolete documentation again ;-). ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 2:23 ` Jon Smirl 2006-09-08 8:36 ` Jakub Narebski @ 2006-09-08 18:42 ` linux 2006-09-08 21:13 ` Jon Smirl 1 sibling, 1 reply; 101+ messages in thread From: linux @ 2006-09-08 18:42 UTC (permalink / raw) To: jonsmirl, linux; +Cc: git Thanks for making suggestions. It's easier to knock straw men down than make them. But forgive me if I have fun pointing out the holes... >> Okay. Now, the server hasn't heard of one or more of those commit >> objects, because they're local changes. What then? > > Toss them, if they don't exist on the server the server is going to be > able to send any objects for them. Er... that makes "git pull git://git.other-hacker.personal/dir" impossible. There's a reason that git can handle this as it presently exists: a-b-c-d-e-f <-- HEAD / Hacker 1: o-o-o-o <-- origin Hacker 2: o-o-o-o-o-o <-- origin \ v-w-x-y <-- HEAD Suppose hacker 2, who happened to sync with upstream more recently, wants to pull from hacker 1, in the hopes of building a-b-c-d-e-f / \ Hacker 2: o-o-o-o-o-o z <-- HEAD \ / v-w-x-y This works now, and should continue to work. > Client would track this incrementally and not recompute it each time. Yes, this is probably possible. I haven't worked it out, but given a cache of precomputed (commit,depth) numbers, you can trace back from the new heads until you hit a cache entry. > If you follow the links in what looks to be a dangling object sooner > or latter you will run into the root object or a 'not present' object. > If you hit one of those the objects are not dangling and should be > preserved. I don't understand. It seems like you're saying that any commit without an ancestor, and all objects recursively pointing to it, are not garbage. How would anything ever get declared garbage in this case? > Here is another way to look at the shallow clone problem. The only > public ids in a git tree are the head and tag pointers. Send these to > the client. Now let's modify the git tools to fault the full objects > in one by one from the server whenever a git operation needs the > object. Dangling references would point to 'not-present' objects. Er... that would fault in a gigabyte the first time someone ran gitk, or several other history-browsing commands. Don't you need a way to say "tell the user this isn't present and will take an hour to download"? Well, thinking, there are actually two shallow clone possibilities: 1) Don't load the unwanted commit objects 2) Clone the commit objects, but not their trees. The latter would let you browse the commit history, at least. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 18:42 ` linux @ 2006-09-08 21:13 ` Jon Smirl 2006-09-08 22:27 ` Jakub Narebski 2006-09-08 23:09 ` Linus Torvalds 0 siblings, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-08 21:13 UTC (permalink / raw) To: linux@horizon.com; +Cc: git > > Here is another way to look at the shallow clone problem. The only > > public ids in a git tree are the head and tag pointers. Send these to > > the client. Now let's modify the git tools to fault the full objects > > in one by one from the server whenever a git operation needs the > > object. Dangling references would point to 'not-present' objects. > > Er... that would fault in a gigabyte the first time someone ran gitk, > or several other history-browsing commands. Don't you need a way to say > "tell the user this isn't present and will take an hour to download"? gitk would need to be modified to only run enough of the commit tree to draw what is displayed in the window. As you page down it would retrive more commits if needed. There is no need for gitk to run 250K commits when I'm usually never going to look at them all. Of course this may mean some rework for gitk. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 21:13 ` Jon Smirl @ 2006-09-08 22:27 ` Jakub Narebski 2006-09-08 23:09 ` Linus Torvalds 1 sibling, 0 replies; 101+ messages in thread From: Jakub Narebski @ 2006-09-08 22:27 UTC (permalink / raw) To: git Jon Smirl wrote: >> > Here is another way to look at the shallow clone problem. The only >> > public ids in a git tree are the head and tag pointers. Send these to >> > the client. Now let's modify the git tools to fault the full objects >> > in one by one from the server whenever a git operation needs the >> > object. Dangling references would point to 'not-present' objects. >> >> Er... that would fault in a gigabyte the first time someone ran gitk, >> or several other history-browsing commands. Don't you need a way to say >> "tell the user this isn't present and will take an hour to download"? > > gitk would need to be modified to only run enough of the commit tree > to draw what is displayed in the window. As you page down it would > retrive more commits if needed. There is no need for gitk to run 250K > commits when I'm usually never going to look at them all. Of course > this may mean some rework for gitk. Or remembering to set --max-count or --max-age parameters to gitk (which are then passed to git-rev-list). -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 21:13 ` Jon Smirl 2006-09-08 22:27 ` Jakub Narebski @ 2006-09-08 23:09 ` Linus Torvalds 2006-09-08 23:28 ` Jon Smirl 2006-09-09 1:05 ` Paul Mackerras 1 sibling, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-08 23:09 UTC (permalink / raw) To: Jon Smirl; +Cc: linux@horizon.com, Git Mailing List, Paul Mackerras On Fri, 8 Sep 2006, Jon Smirl wrote: > > gitk would need to be modified to only run enough of the commit tree > to draw what is displayed in the window. As you page down it would > retrive more commits if needed. There is no need for gitk to run 250K > commits when I'm usually never going to look at them all. Of course > this may mean some rework for gitk. Actually, it's more than a little rework. The _real_ problem is that gitk uses "--topo-order" to git-rev-list, which basically means that git-rev-list needs to parse EVERY SINGLE COMMIT. So far, nobody has written a topological sort that can reasonably efficiently handle partial trees and automatically notice when there is no way that a DAG cannot have any more references to a commit any more (if you can show that the result would cause a cycle, you could output a partial topological tree early). It's possible that no such topological sorting (ie the efficient kind, that can work with partial DAG's) even exists. So if you want to be able to visualize a partial repository, you need to teach git to not need "--topo-order" first. That's likely the _big_ change. After that, the rest should be easy. [ One way to do it might be: the normal ordering of revisions without "--topo-order) is "_close_ to topological", and gitk could just decide to re-compute the whole graph whenever it gets a commit that has a parent that it has already graphed. Done right, it would probably almost never actually generate re-computed graphs (if you only actually generate the graph when the user scrolls down to it). Getting rid of the --topo-order requirement would speed up gitk absolutely immensely, especially for unpacked cold-cache archives. So it would probably be a good thing to do, regardless of any shallow clone issues ] Hmm? Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:09 ` Linus Torvalds @ 2006-09-08 23:28 ` Jon Smirl 2006-09-08 23:45 ` Paul Mackerras 2006-09-10 12:41 ` Paul Mackerras 2006-09-09 1:05 ` Paul Mackerras 1 sibling, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-08 23:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux@horizon.com, Git Mailing List, Paul Mackerras On 9/8/06, Linus Torvalds <torvalds@osdl.org> wrote: > Getting rid of the --topo-order requirement would speed up gitk > absolutely immensely, especially for unpacked cold-cache archives. So it > would probably be a good thing to do, regardless of any shallow clone > issues ] > > Hmm? gitk takes about a minute to come up on the Mozilla repo when everything is in cache. It takes about twice as long when things are cold. It's enough of delay that I don't use the tool. > > Linus > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:28 ` Jon Smirl @ 2006-09-08 23:45 ` Paul Mackerras 2006-09-09 1:45 ` Jon Smirl 2006-09-10 12:41 ` Paul Mackerras 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-08 23:45 UTC (permalink / raw) To: Jon Smirl; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List Jon Smirl writes: > gitk takes about a minute to come up on the Mozilla repo when > everything is in cache. It takes about twice as long when things are > cold. It's enough of delay that I don't use the tool. How long does it take for git rev-list --topo-order to produce its first line of output, in the cache-hot case? That is, is it just gitk's graph layout that is taking the time, or is it git rev-list (meaning that gitk needs to stop using --topo-order)? Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:45 ` Paul Mackerras @ 2006-09-09 1:45 ` Jon Smirl 0 siblings, 0 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-09 1:45 UTC (permalink / raw) To: Paul Mackerras; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List On 9/8/06, Paul Mackerras <paulus@samba.org> wrote: > Jon Smirl writes: > > > gitk takes about a minute to come up on the Mozilla repo when > > everything is in cache. It takes about twice as long when things are > > cold. It's enough of delay that I don't use the tool. > > How long does it take for git rev-list --topo-order to produce its > first line of output, in the cache-hot case? That is, is it just > gitk's graph layout that is taking the time, or is it git rev-list > (meaning that gitk needs to stop using --topo-order)? Cold cache: jonsmirl@jonsmirl:/opt/t1$ time git rev-list --all --topo-order --parents >/dev/null real 0m27.687s user 0m5.672s sys 0m0.560s Hot cache: jonsmirl@jonsmirl:/opt/t1$ time git rev-list --all --topo-order --parents >/dev/null real 0m6.041s user 0m5.768s sys 0m0.276s jonsmirl@jonsmirl:/opt/t1$ I have enough RAM (3GB) that everything fits. Hot cache, 27 seconds to get first line of display in gitk It takes about two minutes get everything loaded into the window and scroll to the head This a 450MB pack, 250K commits, 2M objects. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:28 ` Jon Smirl 2006-09-08 23:45 ` Paul Mackerras @ 2006-09-10 12:41 ` Paul Mackerras 2006-09-10 14:56 ` Jon Smirl 2006-09-11 0:03 ` Shawn Pearce 1 sibling, 2 replies; 101+ messages in thread From: Paul Mackerras @ 2006-09-10 12:41 UTC (permalink / raw) To: Jon Smirl; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List Jon Smirl writes: > gitk takes about a minute to come up on the Mozilla repo when > everything is in cache. It takes about twice as long when things are > cold. It's enough of delay that I don't use the tool. I've been doing some timing measurements with Jon's repo. The results are interesting. It turns out that a lot of the initial delay is in reading all the references. With ~1600 heads and almost as many tags, readrefs was taking about 30 seconds (on my 2.5GHz quad G5), largely because it was doing two execs for each tag, a git rev-parse and a git cat-file, which was a bit stupid. With that fixed, it's about 5 or 6 seconds from starting gitk to seeing a window with commits listed in it in the hot cache case, even still using --topo-order. Without --topo-order it's about 2-3 seconds to seeing the window with commits listed, but the overall process takes longer (I still need to figure out why). In the cold cache case, it takes about 32 seconds to read all the references, even with the fixed version of readrefs, since that's about how long git ls-remote .git takes. Also, if you do gitk --all (as I often do), then gitk does a git rev-parse, which takes about 20 seconds (but then readrefs only takes 10 seconds since the heads are in cache). The bottom line is that I can speed up the startup for the hot-cache case quite a lot. The cold-cache case is going to take about 20-30 seconds whatever I do unless Linus or Junio can come up with a way to pack the heads and tags. I could read the refs asynchronously but there will still be a long delay in git rev-parse if you give arguments such as --all. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 12:41 ` Paul Mackerras @ 2006-09-10 14:56 ` Jon Smirl 2006-09-10 16:10 ` linux 2006-09-10 18:51 ` Junio C Hamano 2006-09-11 0:03 ` Shawn Pearce 1 sibling, 2 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-10 14:56 UTC (permalink / raw) To: Paul Mackerras; +Cc: Linus Torvalds, linux@horizon.com, Git Mailing List On 9/10/06, Paul Mackerras <paulus@samba.org> wrote: > Jon Smirl writes: > > > gitk takes about a minute to come up on the Mozilla repo when > > everything is in cache. It takes about twice as long when things are > > cold. It's enough of delay that I don't use the tool. > > I've been doing some timing measurements with Jon's repo. The results > are interesting. Using the Mozilla repo you downloaded is not a normal situation since it is 100% packed. Most people are going to have a few thousand loose objects floating around too. Loose objects really slow things down. You noticed too that forks of small apps are relatively slow. The first pass of the import tools used fork for everything and took a week to run with 60% of the time spent in the kernel. There may be some work to do on fork in the kernel. Does mapping the kernel into the top 1G slow down fork of these small apps? Or are they dynamically linked to something that is bringing in millions of pages? When I was doing oprofile all of the time was in the actual fork call and page table copying. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 14:56 ` Jon Smirl @ 2006-09-10 16:10 ` linux 2006-09-10 18:00 ` Jon Smirl 2006-09-10 18:51 ` Junio C Hamano 1 sibling, 1 reply; 101+ messages in thread From: linux @ 2006-09-10 16:10 UTC (permalink / raw) To: jonsmirl; +Cc: git, linux, paulus, torvalds > You noticed too that forks of small apps are relatively slow. The > first pass of the import tools used fork for everything and took a > week to run with 60% of the time spent in the kernel. There may be > some work to do on fork in the kernel. Does mapping the kernel into > the top 1G slow down fork of these small apps? Or are they dynamically > linked to something that is bringing in millions of pages? When I was > doing oprofile all of the time was in the actual fork call and page > table copying. Actually, Linux has one of the fastest forks around, 100-200 us on modern x86. For large executables, the shared page tables patch (is it merged yet?) might help. No, mapping the kernel does NOT slow anything down. Those page tables are shared between all processes, and use the x86's global bit to avoid being forced out of the TLB during context switch. Attempting to make the kernel mapping vary between processes would slow things down enormously! I'm not finding reasonably recent Linux/FreeBSD comparisons. There are plenty of Max OS X comparisons, but it appears to particularly suck, so that's a bit of a straw man. See, e.g. http://www.anandtech.com/mac/showdoc.aspx?i=2520&p=7 or some figures from 2002: http://lists.apple.com/archives/darwin-kernel/2002/Oct/msg00009.html ---------------------------------------------------------------- Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos TCP inst hndl proc proc proc --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ---- g4 Darwin 1.4 799 2.21 3.19 12.6 16.4 30.4 3.78 9.88 1576 4095 13.K g4 Darwin 5.5 797 2.26 3.15 14.5 17.8 30.4 3.82 10.2 5685 12.K 40.K g4.local. Darwin 6.0 797 1.47 2.73 17.2 20.7 27.2 3.00 10.5 7517 17.K 41.K g4.local. Darwin 6.0 (after misconfiguration fixed) 1575 g4 Linux 2.4.18- 799 0.42 0.69 2.52 3.79 33.6 1.23 3.08 659. 2642 12.K --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ---- jim Darwin 1.4 448 2.85 4.23 9.53 17 4.83 14 2402 6817 19K jim.local Darwin 6.1 448 1.89 3.71 21 29 43 3.85 15 2832 8955 19K For small processes such as lmbench uses, you can see from the above that exec is much more expensive than fork. Also http://www.opersys.com/lrtbf/index.html http://marc.theaimsgroup.com/?l=linux-kernel&m=112086443319815 http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Administration/hyper-threading-linux/page4.html As for small apps and dynamic linking, glibc isn't tiny (prelink can help), but that's post-exec. If it's in the fork() call, then it's just a case that your app's memory space is big and even copying all of its page tables is expensive. If you're invoking one of the git programs that's a shell script, that adds its own startup overhead, of course. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 16:10 ` linux @ 2006-09-10 18:00 ` Jon Smirl 2006-09-10 19:03 ` linux 0 siblings, 1 reply; 101+ messages in thread From: Jon Smirl @ 2006-09-10 18:00 UTC (permalink / raw) To: linux@horizon.com; +Cc: git, paulus, torvalds On 10 Sep 2006 12:10:07 -0400, linux@horizon.com <linux@horizon.com> wrote: > Actually, Linux has one of the fastest forks around, 100-200 us on > modern x86. For large executables, the shared page tables patch (is it > merged yet?) might help. Here is the opfile of the kernel 3467889 18.9893 copy_page_range 2190416 11.9941 unmap_vmas 1156011 6.3300 page_fault 887794 4.8613 release_pages 860853 4.7138 page_remove_rmap 633243 3.4675 get_page_from_freelist 398773 2.1836 do_wp_page 344422 1.8860 __mutex_lock_slowpath 280070 1.5336 __handle_mm_fault 241713 1.3236 do_page_fault 238398 1.3054 __d_lookup 236654 1.2959 vm_normal_page At the time this was measured parsecvs was executing millions of git command using system(command). 40% of the CPU was in the kernel, it stayed that way for hours. 18262372 41.0441 /home/good/vmlinux 5465741 12.2841 /usr/bin/cvs 4374336 9.8312 /lib/libc-2.4.so 3627709 8.1532 /lib/libcrypto.so.0.9.8a 2494610 5.6066 /usr/bin/oprofiled 2471238 5.5540 /usr/lib/libz.so.1.2.3 945349 2.1246 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so 933646 2.0983 /usr/local/bin/git-read-tree 758776 1.7053 /usr/local/bin/git-write-tree 642502 1.4440 /lib/ld-2.4.so 472903 1.0628 /nvidia 379254 0.8524 /usr/local/bin/git-pack-objects Maybe we are looking at the wrong thing, it may be fast to fork a process, is it fast for the process to exit? -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 18:00 ` Jon Smirl @ 2006-09-10 19:03 ` linux 2006-09-10 20:00 ` Linus Torvalds 0 siblings, 1 reply; 101+ messages in thread From: linux @ 2006-09-10 19:03 UTC (permalink / raw) To: jonsmirl, linux; +Cc: git, paulus, torvalds > Maybe we are looking at the wrong thing, it may be fast to fork a > process, is it fast for the process to exit? The lmbench benchmark figures I cited are for fork+exit. The other is fork+exec+exit. > At the time this was measured parsecvs was executing millions of git > command using system(command). 40% of the CPU was in the kernel, it > stayed that way for hours. Ah! You are aware, aren't you, that system(string) invokes /bin/sh to parse the string, right? So you're starting bash several million times. ("man system" explains.) A direct fork() (or even faster, vfork) and exec() is going to have a lot less overhead, although it's more work to code. See Stevens for excellent examples. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 19:03 ` linux @ 2006-09-10 20:00 ` Linus Torvalds 2006-09-10 21:00 ` Jon Smirl 2006-09-10 22:41 ` Paul Mackerras 0 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-10 20:00 UTC (permalink / raw) To: linux; +Cc: jonsmirl, git, paulus On Sun, 10 Sep 2006, linux@horizon.com wrote: > > A direct fork() (or even faster, vfork) and exec() is going to have a > lot less overhead, although it's more work to code. See Stevens for > excellent examples. Well, that said, the Linux fork/exec/exit is certainly fairly efficient, but nothing can hide the fact that it _is_ a very expensive operation. So we may do a fork/exit in a millisecond, but on the other hand, we can generally open a file in a microsecond. So we're definitely talking about several orders of magnitude difference.. And a lot of the trivial git helpers are literally not doing a lot more than opening a file, reading it, and parsing it. That's for things like looking up the SHA1 of a branch head, for example. Doing it in-line might be a microsecond or two, while spawning a shell to execute git-rev-parse will take you several microseconds. Do it a million times, and the difference is now a second vs an hour. Or a few minutes vs a week. So in that sense, it is absolutely undeniable that fork/exit is slow. Everything is relative. The Linux fork/exit is fast when compared to most other systems fork/exit, but it's slow as hell when compared to just doing some small op directly in-line. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 20:00 ` Linus Torvalds @ 2006-09-10 21:00 ` Jon Smirl 2006-09-11 2:49 ` Linus Torvalds 2006-09-10 22:41 ` Paul Mackerras 1 sibling, 1 reply; 101+ messages in thread From: Jon Smirl @ 2006-09-10 21:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux, git, paulus On 9/10/06, Linus Torvalds <torvalds@osdl.org> wrote: > On Sun, 10 Sep 2006, linux@horizon.com wrote: > > > > A direct fork() (or even faster, vfork) and exec() is going to have a > > lot less overhead, although it's more work to code. See Stevens for > > excellent examples. > > Well, that said, the Linux fork/exec/exit is certainly fairly efficient, > but nothing can hide the fact that it _is_ a very expensive operation. cvs2svn + fastimport can import the same Mozilla repo in about 2hrs that was taking parsecvs about five days to do. The algorithms are not that different, cvs2svn is probably slower than parsecvs for detecting changesets. The difference is mostly due to the removal of forks. Is the kernel mapped into user processes using huge pages? That would reduce some of the pressure on TLBs. Another thing that should be mapped with huge pages is the video RAM. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 21:00 ` Jon Smirl @ 2006-09-11 2:49 ` Linus Torvalds 0 siblings, 0 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-11 2:49 UTC (permalink / raw) To: Jon Smirl; +Cc: linux, git, paulus On Sun, 10 Sep 2006, Jon Smirl wrote: > > Is the kernel mapped into user processes using huge pages? Yup. Huge pages and global. So the kernel should have fairly low TLB impact (depending on CPU - the kernel also has less locality than user space, since the data structures are all over the place, and some CPU's have very few hugepage entries. System code generally tends to have much worse I$ and TLB behaviour than most user apps, but at least the hugepage mappings tend to mitigate the TLB side somewhat). > Another thing that should be mapped with huge pages is the video RAM. I doubt it ends up being a huge deal in most cases. If you don't use the graphics chip to access video RAM with accelerated methods, the TLB is the _least_ of your worries. Going over the PCI bus (even with PCI-X etc) is the real problem. There's a reason you want to do acceleration even in 2D.. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 20:00 ` Linus Torvalds 2006-09-10 21:00 ` Jon Smirl @ 2006-09-10 22:41 ` Paul Mackerras 2006-09-11 2:55 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-10 22:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux, jonsmirl, git Linus Torvalds writes: > So we may do a fork/exit in a millisecond, but on the other hand, we can > generally open a file in a microsecond. So we're definitely talking about > several orders of magnitude difference.. Unless the file isn't in memory, in which case it's probably around a millisecond or more... Do you think there is any way to speed up the cold-cache case for git-ls-remote and git-rev-parse with thousands of heads and tags? Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 22:41 ` Paul Mackerras @ 2006-09-11 2:55 ` Linus Torvalds 2006-09-11 3:18 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-11 2:55 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux, jonsmirl, git On Mon, 11 Sep 2006, Paul Mackerras wrote: > > Do you think there is any way to speed up the cold-cache case for > git-ls-remote and git-rev-parse with thousands of heads and tags? Nothing obvious comes to mind. If we did the same pack-file approach that we do for objects, the problem ends up being that _updating_ things is really hard. What we could do (and might work) is that a "git repack" would create a "packed representation of the heads too". The issue is that once you create a pack, you _really_ don't want to use read-modify-write to modify it ever afterwards. That's how you get nasty corruption. The "append-only" side of the git object database is really what makes things so stable, and packing multiple heads in the same file automatically means that if something goes wrong, it's _disastrously_ wrong, in a way it isn't when you have separate files. So we could generate a "pack of references", but then any modifications would be done with the current loose "file objects" approach, and just have the filesystem override the pack-files. The problem then actually becomes one of _deleting_ branches, because then we'd have to add a "negative branch" loose object. Ugly. An alternate approach might be to keep the current filesystem layour, but simply do the readdir over the whole reference situation in one pass, then sort them by inode number, and look them up in order. That would help on some filesystems, at least. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:55 ` Linus Torvalds @ 2006-09-11 3:18 ` Linus Torvalds 2006-09-11 6:35 ` Junio C Hamano 2006-09-11 18:54 ` Junio C Hamano 2006-09-11 8:36 ` Paul Mackerras 2006-09-11 9:04 ` Jakub Narebski 2 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-11 3:18 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux, jonsmirl, git On Sun, 10 Sep 2006, Linus Torvalds wrote: > > If we did the same pack-file approach that we do for objects, the problem > ends up being that _updating_ things is really hard. What we could do (and > might work) is that a "git repack" would create a "packed representation > of the heads too". To clarify: I'm _not_ suggesting actually using the "pack-file" representation itself for the references. I'm saying that we could have something like a .git/refs-packed file, which could be (for example) just a plain linear text-file, of the form 19ed368b17abfb9ad5c7467ea74fd8a045d96b43 refs/heads/html 60a6bf5f53635005f4f68d8b8a33172309193623 refs/heads/maint 2d32e76f893b2ac432201ce7596c8bab691691e6 refs/heads/man a41fae9c46a4cb5e59cc1a17d19f6a3a6cbfbb2f refs/heads/master 61af0aaf26e003a61848e551bbd57e78e94eacdc refs/heads/next 585729203fef6ade64923277e7151b2e3a4ca330 refs/heads/pu 997283a8e87179b5b87a909686869d7843c8e19a refs/heads/todo a0e7d36193b96f552073558acf5fcc1f10528917 refs/tags/junio-gpg-pub d6602ec5194c87b0fc87103ca4d67251c76f233a refs/tags/v0.99 f25a265a342aed6041ab0cc484224d9ca54b6f41 refs/tags/v0.99.1 c5db5456ae3b0873fc659c19fafdde22313cc441 refs/tags/v0.99.2 7ceca275d047c90c0c7d5afb13ab97efdf51bd6e refs/tags/v0.99.3 b3e9704ecdf48869f635f0aa99ddfb513f885aff refs/tags/v0.99.4 07e38db6a5a03690034d27104401f6c8ea40f1fc refs/tags/v0.99.5 f12e22d4c12c3d0263fa681f25c06569f643da0f refs/tags/v0.99.6 f8696fcd2abc446a5ccda3e414b731eff2a7e981 refs/tags/v0.99.7 1094cf40f7029f803421c1dcc971238507c830c5 refs/tags/v0.99.7a da30c6c39cd3b048952a15929c5440acfd71b912 refs/tags/v0.99.7b 9165ec17fde255a1770886189359897dbb541012 refs/tags/v0.99.7c 02b2acff8bafb6d73c6513469cdda0c6c18c4138 refs/tags/v0.99.7d ... ie it would contain just a linear file with the "<hex></tab><refname>" format. Then, the way to look up a reference would be: - look it up in the traditional loose file - if it exists, and contains zeros (or not a hex value), it's considered a "negative entry", and the branch doesn't exist - otherwise, if it's a good SHA1, that's the result - if it's not there, look it up in the ".git/refs-packed" file by just doing a simple linear scan (trivial, and actually efficient - we're talking about just a few kB of memory after all, and the _cost_ is actually the IO, where "simple linear scan" is actually very good for performance). The end result would be that we'd probably have very few loose references (we'd get them whenever we change a ref, or delete one), making the lookup scale better. The big _bulk_ of the references tend to be very stable, notably they are tags that seldom - if ever - change, and would thus stay just in the packed refs file. So the normal situation would be that you'd have a few hundred (maybe, for a bigger project) refs in the single .git/refs-packed file, totalling a few kB of disk-space, and then you might have a handful of "active" heads that are in the traditional single-file format .git/refs/<filename> because they are beign actively modified and have changed since the last repack. I bet it would work fairly well. But somebody would need to implement it. 
The good news is that the refs-handling code tends to be _fairly_ well abstracted out, because we already wanted that for the logging thing. So we hopefully already don't actually access the loose file objects by hand from shell scripts any more - we use git-rev-parse and git-update-ref etc to look up refnames, and that all goes back to git/refs.c. So _most_ of the bulk of it would probably be in refs.c, but there's obviously also things like git-branch.sh that needs to be taught the new rules about deleting branches etc. Anybody want to try the above rules out? I bet the _only_ real issue would be "for_each_ref()", where it's important to _first_ do the loose objects, and remember them all, so that you do _not_ show the refs that are in the .git/refs-packed file and already got shown because they were loose. NOTE! It's important that whatever sceme used gets locking right. The above suggestion gets it right simply because it doesn't really _change_ anything. Any new or modified ref ends up using the old code, and using a ".lock" file and renaming it automatically does the same thing it ever did. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 3:18 ` Linus Torvalds @ 2006-09-11 6:35 ` Junio C Hamano 2006-09-11 18:54 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 6:35 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Paul Mackerras Linus Torvalds <torvalds@osdl.org> writes: > On Sun, 10 Sep 2006, Linus Torvalds wrote: >> >> If we did the same pack-file approach that we do for objects, the problem >> ends up being that _updating_ things is really hard. What we could do (and >> might work) is that a "git repack" would create a "packed representation >> of the heads too". > > To clarify: I'm _not_ suggesting actually using the "pack-file" > representation itself for the references. > > I'm saying that we could have something like a > > .git/refs-packed > > file, which could be (for example) just a plain linear text-file, of the > form > > 19ed368b17abfb9ad5c7467ea74fd8a045d96b43 refs/heads/html > 60a6bf5f53635005f4f68d8b8a33172309193623 refs/heads/maint > ... > > ie it would contain just a linear file with the "<hex></tab><refname>" > format. Then, the way to look up a reference would be: >... > NOTE! It's important that whatever sceme used gets locking right. The > above suggestion gets it right simply because it doesn't really _change_ > anything. Any new or modified ref ends up using the old code, and using a > ".lock" file and renaming it automatically does the same thing it ever > did. This is all interesting and can be done by somebody interested in the really core part of the system. I was just looking at Paul's "new" branch, and realized that there is something else that somebody _else_ can do in parallel to make life easier for gitk and gitweb, and it is not on so bare-metal-core side. The latest commit in Paul's "new" branch says: We were doing two execs for each tag - one to map the tag ID to a commit ID (which git ls-remote was already giving us as refs/tags/foo^{}) and one to read the contents of the tag for later display.... I think this is a typical problem any Porcelain layer faces. To grab summary information for all refs in a concise form in one go. We discussed such a potential command about a month ago with Jakub, but the suggestion did not go anywhere. http://thread.gmane.org/gmane.comp.version-control.git/25013/focus=25055 In addition to what I described there, such a command should at least allow specifying some special formatting for tag objects, such as dereferencing it once (i.e. what the object name and type of the referenced object is) and dereferencing it repeatedly until a non-tag appears (i.e. "$that_tag^{}" along with its type). The show-ref command will alleviate the fork+exec problem; the core side update you suggested for ref handling will improve the performance of the implementation of such a command and can be done in parallel. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 3:18 ` Linus Torvalds 2006-09-11 6:35 ` Junio C Hamano @ 2006-09-11 18:54 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 18:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: Paul Mackerras, linux, jonsmirl, git Linus Torvalds <torvalds@osdl.org> writes: > On Sun, 10 Sep 2006, Linus Torvalds wrote: >> >> If we did the same pack-file approach that we do for objects, the problem >> ends up being that _updating_ things is really hard. What we could do (and >> might work) is that a "git repack" would create a "packed representation >> of the heads too". > > To clarify: I'm _not_ suggesting actually using the "pack-file" > representation itself for the references. > > I'm saying that we could have something like a > > .git/refs-packed > > file, which could be (for example) just a plain linear text-file, of the > form > > 19ed368b17abfb9ad5c7467ea74fd8a045d96b43 refs/heads/html > 60a6bf5f53635005f4f68d8b8a33172309193623 refs/heads/maint > 2d32e76f893b2ac432201ce7596c8bab691691e6 refs/heads/man >... > 9165ec17fde255a1770886189359897dbb541012 refs/tags/v0.99.7c > 02b2acff8bafb6d73c6513469cdda0c6c18c4138 refs/tags/v0.99.7d > ... > > ie it would contain just a linear file with the "<hex></tab><refname>" > format. This depends on how expensive it is to read a file backwards, but it might make sense to have a single append-only file. ${GIT_DIR}/refs-packed file would be a sequence of records, each record is of form: type SP timestamp LF payload LF length LF where length and timestamp are ASCII integers, the former is the length of this record and you would use to compute the file offset you give lseek() to seek back to the beginning of this record. The latter is the creation time of this record. The type of the record is either full or incremental (perhaps 'F' and 'I'). A full record describes in the above "object-name refname" tuple format. An incremental record has entries only for the subset of refs, with 0{40} object name denoting deletion of a ref as in your suggestion. A few nice things about this representation: - It gives reflog for free (we need to add 'author' and 'reason' encoded after timestamp, though); it also gives reflog to tags which I sometimes miss. - Garbage collection is straightforward, although it requires locking the whole refs. - You can now have 'foo' and 'foo/a', 'foo/b', branches at the same time. It is natural to organize a topic branch 'foo' which consists of a merge of subtopics 'foo/a', 'foo/b', ..., but this cannot be expressed with the current directory-file based scheme. Some minor details: - After every N (I do not think we would need to make this configurable, but we need to decide on a sane value) updates, we would want to write a full record so that we do not have to traverse too long a chain. - An incremental record could be delta since the last full, instead of delta since the last record, as the above introduction hints. If we start from A, B, and C, and then A and B are modified in turn, we might have: Full: A B C Incr: A' Incr: B' but we could have instead: Full: A B C Incr: A' Incr: A' B' For this to make sense, the record header needs to record a back-pointer to the last full record before the current incremental record. With that, we would need to read at most two records to find out the current status. Locking between writers and readers can become problematic. 
The reader would read the last line to find the record length, seek that much back to find the beginning of the "then-latest" record, but when there is a simultanous writer, there is no guarantee that what at the last line of the file _is_ the length of the last record. The writer may be in the middle of appending something. We may be able to work it around by: - Add hash of the record at the end; - The reader assumes that there is usually no collision, reads the hash and the length, seeks back and reads the record and validate the hash -- if the hash does not match, it is likely that it detected the writer-reader collision so it should wait until the file grows and stops growing and then retry. But writer-writer contention obviously needs a file lock. Maybe we can use flock/lockf/fcntl(F_SETLKW)? If we are going to do that then protecting the reader-writer side with the same mechanism may be simpler and cleaner. > NOTE! It's important that whatever sceme used gets locking right. The > above suggestion gets it right simply because it doesn't really _change_ > anything. Any new or modified ref ends up using the old code, and using a > ".lock" file and renaming it automatically does the same thing it ever > did. The above changes things quite a bit, so locking aspect of it needs to be thought through thouroughly. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:55 ` Linus Torvalds 2006-09-11 3:18 ` Linus Torvalds @ 2006-09-11 8:36 ` Paul Mackerras 2006-09-11 14:26 ` linux 2006-09-11 9:04 ` Jakub Narebski 2 siblings, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-11 8:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux, jonsmirl, git Linus Torvalds writes: > So we could generate a "pack of references", but then any modifications > would be done with the current loose "file objects" approach, and just > have the filesystem override the pack-files. The problem then actually > becomes one of _deleting_ branches, because then we'd have to add a > "negative branch" loose object. Ugly. Could we do a cache of the refs that stores the stat information for each of the files under .git/refs plus the sha1 that the ref points to? In other words this cache would do for the refs what the index does for the working directory. Reading all the refs would mean we still had to stat each of the files, but that's much quicker than reading them in the cold-cache case. In the common case when most of the stat information matches, we don't have to read the file because we have the sha1 that the file contains right there in the cache. Ideally we would have two sha1 values in the cache - the sha1 in the file, and if that is the ID of a tag object, we would also put the sha1 of the commit that the tag points to in the cache. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 8:36 ` Paul Mackerras @ 2006-09-11 14:26 ` linux 2006-09-11 15:01 ` Jon Smirl 2006-09-11 16:47 ` Junio C Hamano 0 siblings, 2 replies; 101+ messages in thread From: linux @ 2006-09-11 14:26 UTC (permalink / raw) To: paulus, torvalds; +Cc: git, jonsmirl, linux > Could we do a cache of the refs that stores the stat information for > each of the files under .git/refs plus the sha1 that the ref points > to? In other words this cache would do for the refs what the index > does for the working directory. Reading all the refs would mean we > still had to stat each of the files, but that's much quicker than > reading them in the cold-cache case. In the common case when most of > the stat information matches, we don't have to read the file because > we have the sha1 that the file contains right there in the cache. Well, that could save one of two seeks, but that's not *much* quicker. (Indeed, a git ref would fit into the 60 bytes of block pointer space in an ext2/3 inode if regular files were stuffed there as well as symlinks.) > Ideally we would have two sha1 values in the cache - the sha1 in the > file, and if that is the ID of a tag object, we would also put the > sha1 of the commit that the tag points to in the cache. Now that's not a bad idea. Hacking it in to Linus's scheme, that's <foo sha>\t<foo^{} sha>\tfoo A couple of thoughts: 1) I bet Hans Reiser is enjoying this; he's been agitating for better lots-of-small-files support for years. 2) Since I've written about two caches in a few minutes (here and in git-rev-list), a standardized cache validation hook for git-fsck-objects and git-prune's use might be useful. 3) If we use Linus's idea of a flat "static refs" file overridden by loose refs (presumably, refs would be stuffed in if their mod times got old enough, and on initial import you'd use the timestamp of the commit they point to), we'll have to do a bit of a dance to move refs to and from it. Basically, to move refs into the refs file, it's - Read all the old refs and loose refs and write the new refs file. - Rename the new refs file into place. - For each loose ref moved in, lock it, verify it hasn'd changed, and delete it. with some more locking to prevent two people from doing this at once. Folks looking up tags will do an FS search, then validate their refs file cache, then if necessary, suck in the refs file. Now, exploding a refs file into loose refs is tricky. There's the possible race condition with a reader: A: Looks for loose ref "foo", doesn't find it. B: Write out loose ref "foo" B: Deletes now-unpacked refs file A: Looks for refs file, doesn't find it. A: Concludes that ref "foo" doesn't exist. The only solution I can think of is to stat the refs file at the start of the operation and restart from the beginning if it changes by the time it actually opens and read it. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 14:26 ` linux @ 2006-09-11 15:01 ` Jon Smirl 2006-09-11 16:47 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-11 15:01 UTC (permalink / raw) To: linux@horizon.com; +Cc: paulus, torvalds, git On 11 Sep 2006 10:26:44 -0400, linux@horizon.com <linux@horizon.com> wrote: > > Could we do a cache of the refs that stores the stat information for > > each of the files under .git/refs plus the sha1 that the ref points > > to? In other words this cache would do for the refs what the index > > does for the working directory. Reading all the refs would mean we > > still had to stat each of the files, but that's much quicker than > > reading them in the cold-cache case. In the common case when most of > > the stat information matches, we don't have to read the file because > > we have the sha1 that the file contains right there in the cache. > > Well, that could save one of two seeks, but that's not *much* quicker. > (Indeed, a git ref would fit into the 60 bytes of block pointer space > in an ext2/3 inode if regular files were stuffed there as well as symlinks.) Does everyone hate my idea of putting the sha in the file name and using the directory structure as a simple database with locking? No one made any comments. ------------------------------------------------ Here's a hack, instead of of putting the sha inside the file, put the sha into the filename. master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 Now you can just do a directory listing and get all of the data quickly. To keep the existing porcelain working add a symlink. ln -s master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 master You might want the sha1 encoded names in a new director -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 14:26 ` linux 2006-09-11 15:01 ` Jon Smirl @ 2006-09-11 16:47 ` Junio C Hamano 2006-09-11 21:52 ` Paul Mackerras 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 16:47 UTC (permalink / raw) To: linux; +Cc: git, paulus, torvalds linux@horizon.com writes: >> Ideally we would have two sha1 values in the cache - the sha1 in the >> file, and if that is the ID of a tag object, we would also put the >> sha1 of the commit that the tag points to in the cache. > > Now that's not a bad idea. Hacking it in to Linus's scheme, that's > > <foo sha>\t<foo^{} sha>\tfoo That's a dubious idea. - Why assume a tag points directly at a commit, or if it is not, why assume "foo^{}" (dereferencing repeatedly until we get a non-tag) is special? - Why assume the user wants access to only the object name of what the tag points at? Perhaps most users would want to have its type, dates (committer and author), and probably the first line of the commit message if it is (and most likely it is) a commit? -- at least gitweb and gitk would want these. You should separate the issue of the internal data structure implementation and programming interface for Porcelains. From the internal data structure point-of-view, the second one is a redundant piece of information. Caching it _would_ speed up the access to it, but then the issue becomes where we draw the line to stop. It is probably more useful to think about what kind of information formatted in what way is often wanted by Porcelains who want to grab many refs in one-go. If you can come up with a set that can satisfy everybody using that as the cache file format would be fine, but I strongly doubt you can satisfy everybody. In which case, thinking about the ways for the Porcelain to express flexibly what information is wanted and formatted in what way, and have a command to access that (git-show-refs, anybody?) would be more fruitful, and at that point, the internal representation of the cached data becomes the implementation detail. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 16:47 ` Junio C Hamano @ 2006-09-11 21:52 ` Paul Mackerras 2006-09-11 23:47 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-11 21:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: linux, git, torvalds Junio C Hamano writes: > That's a dubious idea. > > - Why assume a tag points directly at a commit, or if it is > not, why assume "foo^{}" (dereferencing repeatedly until we > get a non-tag) is special? Umm, I'm not sure what you're getting at here - if one shouldn't make those assumptions, why does git ls-remote output both the tag and tag^{} lines? > - Why assume the user wants access to only the object name of > what the tag points at? Perhaps most users would want to > have its type, dates (committer and author), and probably the > first line of the commit message if it is (and most likely it > is) a commit? -- at least gitweb and gitk would want these. There are two things here. Gitk needs to know which IDs have tags when displaying the graph, and their names. It doesn't need to know the other details until someone clicks on the commit or the tag. Thus the information that needs to be collected for every tag at startup is just the name and the sha1 id of the commit it eventually points to (if it doesn't eventually point to a commit it's basically not interesting). Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 21:52 ` Paul Mackerras @ 2006-09-11 23:47 ` Junio C Hamano 2006-09-12 0:06 ` Jakub Narebski 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 23:47 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux, git, torvalds Paul Mackerras <paulus@samba.org> writes: > Junio C Hamano writes: > >> That's a dubious idea. >> >> - Why assume a tag points directly at a commit, or if it is >> not, why assume "foo^{}" (dereferencing repeatedly until we >> get a non-tag) is special? > > Umm, I'm not sure what you're getting at here - if one shouldn't make > those assumptions, why does git ls-remote output both the tag and > tag^{} lines? This was originally done to support Cogito's tag following which was in its infancy. So in that sense it is already special (iow we know one user that can take advantage of it), but my point was that its usefulness for a commit chain fetching application (i.e. Cogito) does not automatically mean it is also useful for visualizers like gitk and gitweb. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 23:47 ` Junio C Hamano @ 2006-09-12 0:06 ` Jakub Narebski 2006-09-12 0:18 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Jakub Narebski @ 2006-09-12 0:06 UTC (permalink / raw) To: git Junio C Hamano wrote: > Paul Mackerras <paulus@samba.org> writes: > >> Junio C Hamano writes: >> >>> That's a dubious idea. >>> >>> - Why assume a tag points directly at a commit, or if it is >>> not, why assume "foo^{}" (dereferencing repeatedly until we >>> get a non-tag) is special? >> >> Umm, I'm not sure what you're getting at here - if one shouldn't make >> those assumptions, why does git ls-remote output both the tag and >> tag^{} lines? > > This was originally done to support Cogito's tag following which > was in its infancy. So in that sense it is already special (iow > we know one user that can take advantage of it), but my point > was that its usefulness for a commit chain fetching application > (i.e. Cogito) does not automatically mean it is also useful for > visualizers like gitk and gitweb. Actually 'git ls-remote .' _is_ useful for gitweb, see the new git_get_references code. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-12 0:06 ` Jakub Narebski @ 2006-09-12 0:18 ` Junio C Hamano 2006-09-12 0:25 ` Jakub Narebski 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-12 0:18 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> writes: > Actually 'git ls-remote .' _is_ useful for gitweb, see the new > git_get_references code. You are missing the point. We are not discussing if having foo^{} is useful. There is no argument about it. Without it the user is forced to ask rev-parse. The point is if it is Ok to assume foo and foo^{} (and nothing else) are enough to make Porcelains and visualizers happy, and I suspected the answer was no (remember show-ref discussion?). ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-12 0:18 ` Junio C Hamano @ 2006-09-12 0:25 ` Jakub Narebski 0 siblings, 0 replies; 101+ messages in thread From: Jakub Narebski @ 2006-09-12 0:25 UTC (permalink / raw) To: git Junio C Hamano wrote: > Jakub Narebski <jnareb@gmail.com> writes: > >> Actually 'git ls-remote .' _is_ useful for gitweb, see the new >> git_get_references code. > > You are missing the point. > > We are not discussing if having foo^{} is useful. There is no > argument about it. Without it the user is forced to ask > rev-parse. > > The point is if it is Ok to assume foo and foo^{} (and nothing > else) are enough to make Porcelains and visualizers happy, and I > suspected the answer was no (remember show-ref discussion?). Well, it is enough for ref markers (this commit is this head or tag), for dumb http transport (indo/refs output is the same as ls-remote), and not much else... -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:55 ` Linus Torvalds 2006-09-11 3:18 ` Linus Torvalds 2006-09-11 8:36 ` Paul Mackerras @ 2006-09-11 9:04 ` Jakub Narebski 2 siblings, 0 replies; 101+ messages in thread From: Jakub Narebski @ 2006-09-11 9:04 UTC (permalink / raw) To: git Linus Torvalds wrote: > On Mon, 11 Sep 2006, Paul Mackerras wrote: >> >> Do you think there is any way to speed up the cold-cache case for >> git-ls-remote and git-rev-parse with thousands of heads and tags? > > Nothing obvious comes to mind. > > If we did the same pack-file approach that we do for objects, the problem > ends up being that _updating_ things is really hard. What we could do (and > might work) is that a "git repack" would create a "packed representation > of the heads too". Because usually there is reasonable number of heads, and that is the tags that might be problem, because the number of tags likes to grow with the history, perhaps we could pack only tags heads into the pack? This has additional advantage that tags, especially annotated tags (i.e. true tags objects) changes rarely. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 14:56 ` Jon Smirl 2006-09-10 16:10 ` linux @ 2006-09-10 18:51 ` Junio C Hamano 2006-09-11 0:04 ` Shawn Pearce 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-10 18:51 UTC (permalink / raw) To: Jon Smirl; +Cc: git, Shawn Pearce "Jon Smirl" <jonsmirl@gmail.com> writes: > Using the Mozilla repo you downloaded is not a normal situation since > it is 100% packed. Most people are going to have a few thousand loose > objects floating around too. Loose objects really slow things down. When you have a few thousand loose objects you definitely should consider repacking (not "repack -a" for Mozilla case, perhaps). We could benefit from the suggested organization of one base archive pack plus a current active pack. The core side code to help doing so was posted here which followed a discussion on how to have repack make use of it last week. http://thread.gmane.org/gmane.comp.version-control.git/26218/focus=26326 Any takers? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 18:51 ` Junio C Hamano @ 2006-09-11 0:04 ` Shawn Pearce 2006-09-11 0:42 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 0:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jon Smirl, git Junio C Hamano <junkio@cox.net> wrote: > We could benefit from the suggested organization of one base > archive pack plus a current active pack. The core side code to > help doing so was posted here which followed a discussion on how > to have repack make use of it last week. > > http://thread.gmane.org/gmane.comp.version-control.git/26218/focus=26326 > > Any takers? I thought this was settled and you had it implemented? Is this still an open topic? -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:04 ` Shawn Pearce @ 2006-09-11 0:42 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 0:42 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: >> We could benefit from the suggested organization of one base >> archive pack plus a current active pack. The core side code to >> help doing so was posted here which followed a discussion on how >> to have repack make use of it last week. >> >> http://thread.gmane.org/gmane.comp.version-control.git/26218/focus=26326 >> >> Any takers? > > I thought this was settled and you had it implemented? Is this > still an open topic? I've implemented the core side and I think it is ready. People who actually have need for such a repack are welcome to take a crack at it at the Porcelain-ish level. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-10 12:41 ` Paul Mackerras 2006-09-10 14:56 ` Jon Smirl @ 2006-09-11 0:03 ` Shawn Pearce 2006-09-11 0:41 ` Junio C Hamano 1 sibling, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 0:03 UTC (permalink / raw) To: Paul Mackerras Cc: Jon Smirl, Linus Torvalds, linux@horizon.com, Git Mailing List Paul Mackerras <paulus@samba.org> wrote: [snip] > The bottom line is that I can speed up the startup for the hot-cache > case quite a lot. The cold-cache case is going to take about 20-30 > seconds whatever I do unless Linus or Junio can come up with a way to > pack the heads and tags. I could read the refs asynchronously but > there will still be a long delay in git rev-parse if you give > arguments such as --all. I've been thinking about implementing ref storage within a Git tree object. Just store the commit/tag/object IDs in a tree (or graph of trees) with a mode of '0'. Anchor that under '.git/refs-tree'. Any edit of a ref would "lock" .git/refs-tree, create a new tree containing the update, then replace .git/refs-tree. But it would put additional stress on the objects directory by creating a lot of trees which would never get pulled into pack files and thus would need to be pruned away on a regular basis. It also would make parallel updates more difficult on the server side as everyone would need to wait for the lock to .git/refs-tree before they can change any ref; today users only need to wait for the ref they are trying to change. It also doesn't help looking up a ref quickly; although trees are sorted they are variable length entries which forces the application to read the entire tree to find its entry. Given those three downsides I haven't put anything to code yet. -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:03 ` Shawn Pearce @ 2006-09-11 0:41 ` Junio C Hamano 2006-09-11 1:04 ` Jakub Narebski 2006-09-11 2:11 ` Jon Smirl 0 siblings, 2 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 0:41 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > I've been thinking about implementing ref storage within a Git tree > object. Just store the commit/tag/object IDs in a tree (or graph > of trees) with a mode of '0'. Anchor that under '.git/refs-tree'. > Any edit of a ref would "lock" .git/refs-tree, create a new tree > containing the update, then replace .git/refs-tree. > > But it would put additional stress on the objects directory by > creating a lot of trees which would never get pulled into pack > files and thus would need to be pruned away on a regular basis. I do not see any advantage of making such a phoney object at all, but I do agree that the current one file per ref can be less than optimal when your repository has tons of tags or refs. We've designed the ref interface well enough so that only very narrow core functions know the implementation (and properly written Porcelain would access refs via update-ref and symbolic-ref interfaces), so the impact of changing the one-file-per-ref implementation to something based on a single simple databasy file (e.g. gdbm, bdb, sqlite, ...) would be reasonably limited. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:41 ` Junio C Hamano @ 2006-09-11 1:04 ` Jakub Narebski 2006-09-11 2:44 ` Shawn Pearce 2006-09-11 2:11 ` Jon Smirl 1 sibling, 1 reply; 101+ messages in thread From: Jakub Narebski @ 2006-09-11 1:04 UTC (permalink / raw) To: git Junio C Hamano wrote: > the impact of changing the > one-file-per-ref implementation to something based on a single > simple databasy file (e.g. gdbm, bdb, sqlite, ...) One of the complaints against Subversion was that it use BerkeleyDB (bdb) backend... but it was before it acquired fsfs interface. Perhaps we could use it too. Or perhaps it is for something like Electra, or ReiserFS file properites access... -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 1:04 ` Jakub Narebski @ 2006-09-11 2:44 ` Shawn Pearce 2006-09-11 5:27 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 2:44 UTC (permalink / raw) To: Jakub Narebski; +Cc: git Jakub Narebski <jnareb@gmail.com> wrote: > Junio C Hamano wrote: > > > the impact of changing the > > one-file-per-ref implementation to something based on a single > > simple databasy file (e.g. gdbm, bdb, sqlite, ...) > > One of the complaints against Subversion was that it use BerkeleyDB > (bdb) backend... but it was before it acquired fsfs interface. Perhaps > we could use it too. I'm against the latest Berkely DB (Sleepy Cat) implementations. Every time I've stored data in them or in applications which use them I've lost everything. GNU dbm might not be too bad. SQL Lite is overkill. I was thinking of using a tree object only because its a well defined Git file format that's already in use. Plus its "easy" to repair by loading it into an index, twiddline it and invoking git-write-tree to convert it back. But there's a lot of downsides. This is probably something that is easily solved by a simple fixed record format holding a 20 byte SHA1 (binary) and a fixed width null terminated string holding the ref name, with the records sorted by ref name. Its yet another file format with yet another set of utilities needed but we pretty much have those (update-ref). > Or perhaps it is for something like Electra, or ReiserFS file properites > access... Except not everyone has those filesystems. :-) -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 2:44 ` Shawn Pearce @ 2006-09-11 5:27 ` Junio C Hamano 2006-09-11 6:08 ` Shawn Pearce 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 5:27 UTC (permalink / raw) To: Shawn Pearce; +Cc: git, Jakub Narebski Shawn Pearce <spearce@spearce.org> writes: > Jakub Narebski <jnareb@gmail.com> wrote: >> Junio C Hamano wrote: >> >> > the impact of changing the >> > one-file-per-ref implementation to something based on a single >> > simple databasy file (e.g. gdbm, bdb, sqlite, ...) >> >> One of the complaints against Subversion was that it use BerkeleyDB >> (bdb) backend... but it was before it acquired fsfs interface. Perhaps >> we could use it too. > > I'm against the latest Berkely DB (Sleepy Cat) implementations. > Every time I've stored data in them or in applications which use > them I've lost everything. GNU dbm might not be too bad. SQL Lite > is overkill. >... > This is probably something that is easily solved by a simple fixed > record format holding a 20 byte SHA1 (binary) and a fixed width null > terminated string holding the ref name, with the records sorted > by ref name. Its yet another file format with yet another set of > utilities needed but we pretty much have those (update-ref). Yup. That is one reasonable implementation of "single simple databasy file" I suggested. Or we could just borrow .git/info/refs format. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 5:27 ` Junio C Hamano @ 2006-09-11 6:08 ` Shawn Pearce 2006-09-11 7:11 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 6:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Jakub Narebski Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > This is probably something that is easily solved by a simple fixed > > record format holding a 20 byte SHA1 (binary) and a fixed width null > > terminated string holding the ref name, with the records sorted > > by ref name. Its yet another file format with yet another set of > > utilities needed but we pretty much have those (update-ref). > > Yup. That is one reasonable implementation of "single simple > databasy file" I suggested. Or we could just borrow .git/info/refs > format. I think Linus was suggesting the same thing, just not nearly as obvious. He also was looking for ways to not update it frequently. But I can't say I'm a fan of negative entries. :-) I'd probably rather code this as a fixed width binary heap file. It should scale reasonable well to larger sets of refs. I know someone on the list wanted to store Perforce change identifiers as tags in Git but it was taking up more disk in ref files than in pack-*.pack. That could be a very larger number of refs and having O(log N) access there might be good. But at this point I have more on my plate than I have time for. :-) I've got to get that map window code merged against master or next (preference Junio? next touches code I'm touching too but those aren't in master yet) for 32 bit offset. Should be slightly easier to do but the pack verification change really threw a wrench into my merge. Other things have also kept me from really working on it. I probably won't be able to look at it again until Wednesday. I also want to spend more time on the dictionary packing prototype to see if its worth spending even more time on. My amd64 box is happily installing itself as write this (finally!) so I now have some more reasonable hardware to deal with transforming the Mozilla pack. :) -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 6:08 ` Shawn Pearce @ 2006-09-11 7:11 ` Junio C Hamano 2006-09-11 17:52 ` Shawn Pearce 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-11 7:11 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > I think Linus was suggesting the same thing, just not nearly as > obvious. He also was looking for ways to not update it frequently. > But I can't say I'm a fan of negative entries. :-) I like Linus's suggestion quite a lot actually. It would be a nice little project that needs to touch reasonably well-isolated area of the code, but I suspect it would need a bit more exposure and familiarity to the existing code than somebody who looks at the core code for the first time. > But at this point I have more on my plate than I have time for. :-) You've done quite a lot of work on refs area yourself, so you might want to give somebody else the chance to try first ;-). > I've got to get that map window code merged against master or next > (preference Junio? next touches code I'm touching too but those > aren't in master yet) for 32 bit offset. As you know I've shelved the 64-bit offset stuff. My preference is to base it on 106d710b but basing it on 'next' would be fine. Between 106d710b and next there is only one unrelated change to sha1_file.c, which is the hexval[] change that is in 'master'. I've sent out "What's in" tonight but forgot to talk about plans for topics in "next", so here is an update. - These things are probably Ok, and after a bit more testing I'd like to push them out to 'master' by my Wednesday: - Andy Whitcroft's "send-pack to use git-rev-list --stdin" - "git apply" to apply binary without explicit --binary flag; - "git diff --binary" to do full-index only for binary patch; - "git unpack-objects -r", except it needs a better name (perhaps --recover). - And these by the end of week: - Jeff King's git-runstatus; - Frank and Rene's git-archive along with git-daemon change; - I am not _so_ comfortable with these quite yet, but plan to test it a bit more to gain confidence and decide when to push out to 'master' perhaps by the middle of the week: - pack-objects --unpacked=this-pack --unpacked=that-pack - pack-objects --stdin > I also want to spend more time on the dictionary packing prototype to > see if its worth spending even more time on. So far I liked the per-object-type dictionary conjecture the most, primarily because my gut feeling tells me that it would involve the least amount of hassle in the interoperability area. But honestly speaking I am not looking forward to a packfile format update at this moment. I'd like things to settle down a bit now, after merging good bits from 'next' to 'master' and declare 1.4.3 first, perhaps by the end of the month. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 7:11 ` Junio C Hamano @ 2006-09-11 17:52 ` Shawn Pearce 0 siblings, 0 replies; 101+ messages in thread From: Shawn Pearce @ 2006-09-11 17:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Junio C Hamano <junkio@cox.net> wrote: > As you know I've shelved the 64-bit offset stuff. My preference > is to base it on 106d710b but basing it on 'next' would be fine. > Between 106d710b and next there is only one unrelated change to > sha1_file.c, which is the hexval[] change that is in 'master'. OK. I'll rebase on 106d710b and try to get the series out in a day or two. > So far I liked the per-object-type dictionary conjecture the > most, primarily because my gut feeling tells me that it would > involve the least amount of hassle in the interoperability area. Agreed, except I don't think its going to save us very much (~4%) and its enough of a change still to make things a little ugly. If the "true" dictionary compression idea has any value it may save us even more and make it easier to implement fast full text searching across files. But its definately more complex. So I'm going to shut up now. > But honestly speaking I am not looking forward to a packfile > format update at this moment. I'd like things to settle down a > bit now, after merging good bits from 'next' to 'master' and > declare 1.4.3 first, perhaps by the end of the month. No worry there. I won't have enough time between now and the end of this month to create a dictionary pack that's even ready for pu, let alone next. I hope we see 1.4.3 before then. :-) -- Shawn. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-11 0:41 ` Junio C Hamano 2006-09-11 1:04 ` Jakub Narebski @ 2006-09-11 2:11 ` Jon Smirl 1 sibling, 0 replies; 101+ messages in thread From: Jon Smirl @ 2006-09-11 2:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, git On 9/10/06, Junio C Hamano <junkio@cox.net> wrote: > I do not see any advantage of making such a phoney object at > all, but I do agree that the current one file per ref can be > less than optimal when your repository has tons of tags or > refs. Here's a hack, instead of of putting the sha inside the file, put the sha into the filename. master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 Now you can just do a directory listing and get all of the data quickly. To keep the existing porcelain working add a symlink. ln -s master_86a8534ba23a5532f6d0ddd01ecd8f02f662cf78 master You might want the sha1 encoded names in a new directory -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-08 23:09 ` Linus Torvalds 2006-09-08 23:28 ` Jon Smirl @ 2006-09-09 1:05 ` Paul Mackerras 2006-09-09 2:56 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-09 1:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List Linus Torvalds writes: > [ One way to do it might be: the normal ordering of revisions without > "--topo-order) is "_close_ to topological", and gitk could just decide > to re-compute the whole graph whenever it gets a commit that has a > parent that it has already graphed. Done right, it would probably almost > never actually generate re-computed graphs (if you only actually > generate the graph when the user scrolls down to it). > > Getting rid of the --topo-order requirement would speed up gitk > absolutely immensely, especially for unpacked cold-cache archives. So it > would probably be a good thing to do, regardless of any shallow clone > issues ] > > Hmm? I recently added code to gitk to add extra commits after the graph has been laid out, provided that the extra commits have one parent and no children. If I generalized that to handle arbitrary numbers of parents and children, it might go close to what you're suggesting. It might get nasty if we have laid out A and then later B, and then C comes along and turns out to be a child of A but a parent of B, meaning that both B and C have to be put above A. Another thing I have been thinking of is that gitk probably should impose a time limit of say 3 months by default, unless the user specifies some other ending condition (such as commitid.., ^commitid or --since=something-else). Together with a menu to select the time limit, I think that would be quite usable and would make gitk start up *much* faster. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 1:05 ` Paul Mackerras @ 2006-09-09 2:56 ` Linus Torvalds 2006-09-09 3:23 ` Junio C Hamano 2006-09-09 3:31 ` Paul Mackerras 0 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 2:56 UTC (permalink / raw) To: Paul Mackerras; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Paul Mackerras wrote: > > It might get nasty if we have laid out A and then later B, and then C > comes along and turns out to be a child of A but a parent of B, > meaning that both B and C have to be put above A. Right. This is why I would suggest just recomputing the thing entirely, instead of trying to make it incremental. It would definitely cause re-organization of the tree when you find a new relationship between two old commits that basically creates a new orderign between them. Trying to do that incrementally sounds really really hard. But just _detecting_ the situation that you are getting a new commit that has a parent that you have already shown (and that thus must go _before_ one of the things you've shown already, and implies a new line reacing it) is very easy. And then re-generating the graph might not be too bad. Running "gitk" on the kernel with a hot cache and fully packed (and enough memory) isn't too bad, because git rev-list literally takes under a second for that case, so the advantage of avoiding --topo-order isn't _that_ noticeable. And the "fully packed" part is probably the most important part. A packed Linux historic tree takes just under six seconds cold-cache and under two seconds hot-cache, but that's because pack-files are _really_ good at mapping all the commits in just one go, and at the beginning of the pack-file. But try the same thing with a fully unpacked kernel, and you'll see the real pain of having to traverse all of history. We're talking minutes, even when hot in the cache. > Another thing I have been thinking of is that gitk probably should > impose a time limit of say 3 months by default I don't think time is good, because for an old project (like the Linux historic repo), three months is literally nothing. So it would be much better to go simply by numbers, but sadly, that won't help - simply because if we want to know the 100 first commits in --topo-order, we'll do the whole history and sort it _first_, and then do the "pick top 100 commits" only after that. (Which may not be sensible, of course, but it's what we do.. It's almost impossible to do it the other way, because you won't know until the point where you do "get_revision()" if you are _actually_ going to use a commit or not, so counting them before the end is fundamentally very hard). > Together with a menu to select the time limit, I think that would be > quite usable and would make gitk start up *much* faster. The menu would help, of course. But it would be even nicer if you'd be able to make do without the --topo-order. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 2:56 ` Linus Torvalds @ 2006-09-09 3:23 ` Junio C Hamano 2006-09-09 3:31 ` Paul Mackerras 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2006-09-09 3:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > And the "fully packed" part is probably the most important part. A packed > Linux historic tree takes just under six seconds cold-cache and under two > seconds hot-cache, but that's because pack-files are _really_ good at > mapping all the commits in just one go, and at the beginning of the > pack-file. > > But try the same thing with a fully unpacked kernel, and you'll see the > real pain of having to traverse all of history. We're talking minutes, > even when hot in the cache. Just a random thought... Does that suggest the general purpose filesystem you would use for "a fully unpacked" case have room for improvements? Would a special purpose filesystem that is designed to host .git/objects/??/ directory help? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 2:56 ` Linus Torvalds 2006-09-09 3:23 ` Junio C Hamano @ 2006-09-09 3:31 ` Paul Mackerras 2006-09-09 4:04 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Paul Mackerras @ 2006-09-09 3:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List Linus Torvalds writes: > Right. This is why I would suggest just recomputing the thing entirely, > instead of trying to make it incremental. It would definitely cause > re-organization of the tree when you find a new relationship between two > old commits that basically creates a new orderign between them. I think it may be possible to back up the layout algorithm to the row where the parent is and redo it from there down. Interestingly, I only see two commits out of order on the linux-2.6 repository, but on the full history (up to linux-2.6.12-rc2) I see 520. > The menu would help, of course. But it would be even nicer if you'd be > able to make do without the --topo-order. The graph does seem to look a little nicer with --topo-order, in that it tends to do a whole string of commits on a single branch in a bunch, whereas without --topo-order it seems to hop from branch to branch more. Paul. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 3:31 ` Paul Mackerras @ 2006-09-09 4:04 ` Linus Torvalds 2006-09-09 8:47 ` Marco Costalba 0 siblings, 1 reply; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 4:04 UTC (permalink / raw) To: Paul Mackerras; +Cc: Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Paul Mackerras wrote: > > > The menu would help, of course. But it would be even nicer if you'd be > > able to make do without the --topo-order. > > The graph does seem to look a little nicer with --topo-order, in that > it tends to do a whole string of commits on a single branch in a > bunch, whereas without --topo-order it seems to hop from branch to > branch more. Yeah, see "--date-order". It's topologically sorted, but within that topological sort it is in the "date order", which approximates the normal non-sorted output. Without sorting, you also won't see any sorting by branch, although it could be done by the consumer of the rev-list data (ie you could do it in gitk, similarly to how you'd effectively do a topological sort of your own). So it actually boils down to the exact same thing: - git-rev-list doing whatever sorting is "convenient", in that it's the one common interface to git revisions, so once you do it there, you do it for everybody. - however, for much the same reason, git-rev-list by definition doesn't know _who_ it is sorting things for, or how it's going to be visualized (exactly because it's the "one convenient central place", of course), and as a result it needs to not only get the right answer, it needs to get it on the first try and can't go back incrementally to fix up the list of commits. So the convenience of doing any kind of sorting in git-rev-list is by definition also the thing that causes it to have bad latency, in that it needs to parse the _whole_ history. In contrast, doing some of the same sorting that git-rev-list already does in the consumer instead, is obviously duplicating the same basic notions, but once you do it in the consumer, you can suddenly do things that simply weren't possible in git-rev-list - do things incrementally, and then if you notice half-way that you did something wrong, you can go back and fix it up (which can be quite acceptable). That just isn't an option in a linear environment like the git-rev-list output. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 4:04 ` Linus Torvalds @ 2006-09-09 8:47 ` Marco Costalba 2006-09-09 17:33 ` Linus Torvalds 0 siblings, 1 reply; 101+ messages in thread From: Marco Costalba @ 2006-09-09 8:47 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > > In contrast, doing some of the same sorting that git-rev-list already does > in the consumer instead, is obviously duplicating the same basic notions, > but once you do it in the consumer, you can suddenly do things that simply > weren't possible in git-rev-list - do things incrementally, and then if > you notice half-way that you did something wrong, you can go back and fix > it up (which can be quite acceptable). That just isn't an option in a > linear environment like the git-rev-list output. > Perhaps is total idiocy but why do not implement the fix-up logic directly in git-rev-list? If the out of order revisions are a small amount of the total then could be possible to have something like git rev-list --topo-order --with-appended-fixups HEAD Where, while git-rev-list is working _whithout sorting the whole tree first_, when finds an out of order revision stores it in a fixup-list buffer and *at the end* of normal git-rev-lsit the buffer is flushed to receiver, so that the drawing logic does not change and the out of order revisions arrive at the end, already packed, sorted and prepared by git-rev-list. Instead of changing drwing logic is then enough to add a fixup routine that cycles through out of order commits and updates the graph. This also avoids recursive reordering where perhaps the same part of a graph should be redrawn more then one time due to differtent new parents to the same child arrives at different times. Marco ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 8:47 ` Marco Costalba @ 2006-09-09 17:33 ` Linus Torvalds 2006-09-09 18:04 ` Marco Costalba 0 siblings, 1 reply; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 17:33 UTC (permalink / raw) To: Marco Costalba Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Marco Costalba wrote: > > Perhaps is total idiocy but why do not implement the fix-up logic > directly in git-rev-list? It's possible in theory, but in practice it's not worth it. Why? Because you really want the _asynchronous_ nature of having a separate user, that only shows _partial_ results. In other words, we could reasonably easily make git-rev-list do something like - output all revisions in the normal non-topological ordering - when git-rev-list notices a topological sort error, it outputs a "FIXME" line, and restarts the whole thing - printing out the commits in the newly fixed ordering - and does it right the next time around - it then continues doing this until it's totally exhausted the whole commit list and has done one final output in the proper topological ordering. Possible? Yes. BUT: - as long as git-rev-list is entirely single-threaded (and trust me, it's going to be that, because otherwise we'd be better off doing it in a separate process - like gitk), this means that it will be _entirely_ unaware of what has actually been _shown_, so it will restart a LOT more than the external interactive process would do. So it would be much worse than doing it externally and knowing what you've actually shown to the user (if you haven't shown the bad thing yet, there's no reason to restart). - Again, as long as it's single-threaded, git-rev-list will block once it has filled up the pipeline between the processes, so instead of parsing everything in parallel with the "show it all", if will synchronize with the showing process all the time, and especially so when it needs to re-show the stuff that it already sent once. So it's also fairly inefficient. However, what you seem to imply is something different: > Where, while git-rev-list is working _whithout sorting the whole tree > first_, when finds an out of order revision stores it in a fixup-list > buffer and *at the end* of normal git-rev-lsit the buffer is flushed > to receiver, so that the drawing logic does not change and the out of > order revisions arrive at the end, already packed, sorted and prepared > by git-rev-list. But this is exactly what we already do. We flush things *at the end* because that's when we actually know the ordering. And that's exactly why "git-rev-list --topo-ordering" has a latency ranging from a few seconds to a few minutes for large projects (depending on whether they are packed or not). The "wait for the end" is _not_ good, exactly because the end will take some time to arrive. The whole point is to start outputting the data early, and thet BY DEFINITION means that the order of revisions isn't guaranteed to be in topological order. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 17:33 ` Linus Torvalds @ 2006-09-09 18:04 ` Marco Costalba 2006-09-09 18:44 ` linux 2006-09-09 20:05 ` Linus Torvalds 0 siblings, 2 replies; 101+ messages in thread From: Marco Costalba @ 2006-09-09 18:04 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote: > > > - output all revisions in the normal non-topological ordering > > - when git-rev-list notices a topological sort error, it outputs a > "FIXME" line, and restarts the whole thing - printing out the commits > in the newly fixed ordering - and does it right the next time around > Sorry, but I don't understand why you should restart the whole thing instead of store away the FIXME commit and continue. If you read my previous e-mail in this thread perhaps it is better explained the whole idea. Anyhow the basic is: -git-rev-list starts outputting the data early (order is not guaranteed) -before to mark for output a revision check if it breaks --topo-order -if the above is true store the revision away and *do not send* - at the end you get an early started steram of topological corrects revisions without preordering, just because you filter away the (few?) revisions that are not in order. The list you get is guaranteed to be in topo order although my not be complete. - _then_ you send the missing revisions that where previously filtered out. At this stage the receiver has already drwan the graph, indeed it has start drwaing as soon as the first revisons arrived and *very important* receiver used old and fast topo-order parsing code. - finally the fixup routine at receiver's end updates the graph with the info from the small set of out of order revisions filtered out before and sent at the end (only this small set is sent at the end). Sorry if it is not still clear, in the previous my e-mail in this thread there is also a small example that could be useful. Marco ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 18:04 ` Marco Costalba @ 2006-09-09 18:44 ` linux 2006-09-09 19:17 ` Marco Costalba 2006-09-09 20:05 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: linux @ 2006-09-09 18:44 UTC (permalink / raw) To: mcostalba, torvalds; +Cc: git, jonsmirl, linux, paulus > Anyhow the basic is: > > -git-rev-list starts outputting the data early (order is not guaranteed) > > -before to mark for output a revision check if it breaks --topo-order > > -if the above is true store the revision away and *do not send* > > - at the end you get an early started steram of topological corrects > revisions without > preordering, just because you filter away the (few?) revisions that > are not in order. > The list you get is guaranteed to be in topo order although my not be complete. > > - _then_ you send the missing revisions that where previously > filtered out. At this stage the receiver has already drwan the graph, > indeed it has start drwaing as soon as the first revisons arrived and > *very important* receiver used old and fast topo-order parsing code. > > - finally the fixup routine at receiver's end updates the graph with > the info from the small set of out of order revisions filtered out > before and sent at the end (only this small set is sent at the end). The problem is that the gitk display code doesn't *like* input like this; it's only designed to append to a list. Handling insertions would be hard work for a rare corner case, and a perpetual source of bugs. Unless gitk does a complete second pass, or course, which would guarantee an annoying flicker a few seconds after startup. And Twice the work. The original alternative was to note that out-of-order commits usually aren't *very* out of order. So if gitk could checkpoint the state of its display algorithm periodically, it could just, when seeing an out-of-order commit, rewind to the last checkpoint before the problem and replay while merging in the corrections. While that is potentially O(n^2) (if you feed it all the commits in reverse order), in practice it - Is less than 2*n work, and - Always maintains the longest possible correct prefix of recent commits. It's the old stuff that takes a while to fill in. Either way, gitk has to have topo-sorting code added to it. I think Linus' idea was to avoid that. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 18:44 ` linux @ 2006-09-09 19:17 ` Marco Costalba 0 siblings, 0 replies; 101+ messages in thread From: Marco Costalba @ 2006-09-09 19:17 UTC (permalink / raw) To: linux@horizon.com; +Cc: torvalds, git, jonsmirl, paulus On 9 Sep 2006 14:44:41 -0400, linux@horizon.com <linux@horizon.com> wrote: > > Anyhow the basic is: > > > > -git-rev-list starts outputting the data early (order is not guaranteed) > > > > -before to mark for output a revision check if it breaks --topo-order > > > > -if the above is true store the revision away and *do not send* > > > > - at the end you get an early started steram of topological corrects > > revisions without > > preordering, just because you filter away the (few?) revisions that > > are not in order. > > The list you get is guaranteed to be in topo order although my not be complete. > > > > - _then_ you send the missing revisions that where previously > > filtered out. At this stage the receiver has already drwan the graph, > > indeed it has start drwaing as soon as the first revisons arrived and > > *very important* receiver used old and fast topo-order parsing code. > > > > - finally the fixup routine at receiver's end updates the graph with > > the info from the small set of out of order revisions filtered out > > before and sent at the end (only this small set is sent at the end). > > The problem is that the gitk display code doesn't *like* input like this; > it's only designed to append to a list. Handling insertions would be > hard work for a rare corner case, and a perpetual source of bugs. > > Unless gitk does a complete second pass, or course, which would > guarantee an annoying flicker a few seconds after startup. > And Twice the work. > I think I need to add another argument here, I didn't mention before for clarity (indeed I'm not very good at this it seems ;-) ) I don't know for gitk, perhaps Paul can better explain, but the usual optimization of a git visualizer is simply to not draw any graph until that part does became visibile on the screen. So your arguments are true but the fact is that there is no graph insertion handling at all in almost all cases, but only insertion in loaded revs list ; when the user scrolls to the inserted commit only _then_ the graph will be calculated and dispalyed (I'm not sure in gitk, but in qgit it works that way). So there's no flickering too, and not double work. Indeed lazy/on demand graph drawing policy is very well supported by the above schema, and the above schema is good also because of the lazy graph drawing. Marco ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 18:04 ` Marco Costalba 2006-09-09 18:44 ` linux @ 2006-09-09 20:05 ` Linus Torvalds 2006-09-09 20:43 ` Jeff King 2006-09-10 3:49 ` Marco Costalba 1 sibling, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2006-09-09 20:05 UTC (permalink / raw) To: Marco Costalba Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On Sat, 9 Sep 2006, Marco Costalba wrote: > > Sorry, but I don't understand why you should restart the whole thing > instead of store away the FIXME commit and continue. > > If you read my previous e-mail in this thread perhaps it is better > explained the whole idea. > > Anyhow the basic is: > > -git-rev-list starts outputting the data early (order is not guaranteed) > > -before to mark for output a revision check if it breaks --topo-order > > -if the above is true store the revision away and *do not send* You don't seem to understand. It's not that individual new commits might be "out of order", and you can suppress them. It's that individual commits end up showing that OTHER COMMITS, THAT YOU HAVE ALREADY SEEN AND SHOWN, can now be shown to be "out of order". So you're _not_ just suppressing and delaying the few commits that didn't conform to the topological sorting. What you're asking for is impossible. You can't just delay those to the end, because they implicitly are telling you that the earlier commits were already shown in the wrong order. The example is A <--- tip of branch / \ B E | | | F | / C | D ... where the lettering is in "date order" (ie "A" is more recent than "B" etc). In this situation, we'd start following the branch A->B->C->D->.. before we even start looking at E and F, because they all _look_ more recent. But then, at some point, we see E and F, and suddenly when looking at F we realize that C was printed out much much earlier, and was already shown as having only one child. We can't just say "delay F to the end", because we'd then have to draw the line _backwards_ to C, which we haven't even left room for in the graph, not to mention that the end result would look absolutely disgusting even if we had. So we started out painting this as A / \ B | | | C | | | D | | | | E | | | F | | before we notice how wrong it was, and now we have to go _back_, and repaint both E and F as being above C, because only once we hit F do we suddenly realize that yeah, it really _was_ younger than C, despite the timestamp (which was off, because somebody had set his clock to be a week in the past by mistake). So we can't just delay F and "fix it up" at the end. Linus ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 20:05 ` Linus Torvalds @ 2006-09-09 20:43 ` Jeff King 2006-09-09 21:11 ` Junio C Hamano 2006-09-09 21:40 ` Linus Torvalds 2006-09-10 3:49 ` Marco Costalba 1 sibling, 2 replies; 101+ messages in thread From: Jeff King @ 2006-09-09 20:43 UTC (permalink / raw) To: Linus Torvalds Cc: Marco Costalba, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List On Sat, Sep 09, 2006 at 01:05:42PM -0700, Linus Torvalds wrote: > The example is > > A <--- tip of branch > / \ > B E > | | > | F > | / > C > | > D > ... > > where the lettering is in "date order" (ie "A" is more recent than "B" > etc). In this situation, we'd start following the branch A->B->C->D->.. > before we even start looking at E and F, because they all _look_ more > recent. I'm just coming into this discussion in the middle and know very little about the rev-list code, so please humor me and tell me why my suggestion is completely irrelevant. The problem you describe seems to come from doing a depth-first display of each branch. Why not look at the tip of each "active" branch simultaneously and pick the one with the most recent date? Something like: void show_all(commit_array start) { commit_array active; copy_array(active, start); sort_array(active); while (active.size > 0) { show_commit(active.data[0]); push_back(active.data[0].parents); pop_front(active); sort_array(active); } } Obviously you could use a specialized data structure to add nodes into the right place instead of resorting constantly. -Peff ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone 2006-09-09 20:43 ` Jeff King @ 2006-09-09 21:11 ` Junio C Hamano 2006-09-09 21:14 ` Jeff King 2006-09-09 21:40 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2006-09-09 21:11 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> writes: > I'm just coming into this discussion in the middle and know very little > about the rev-list code, so please humor me and tell me why my > suggestion is completely irrelevant. Not irrelevant. > The problem you describe seems to come from doing a depth-first display > of each branch. Why not look at the tip of each "active" branch > simultaneously and pick the one with the most recent date? Something > like: That's what we have been doing from day one. The trouble Linus illustrated is that in a global project you cannot rely on timestamps always being correct. You can use them as HINT, but you need to be prepared to do sensible things when some people screw up the time. > On Sat, Sep 09, 2006 at 01:05:42PM -0700, Linus Torvalds wrote: > >> The example is >> >> A <--- tip of branch >> / \ >> B E >> | | >> | F >> | / >> C >> | >> D >> ... >> >> where the lettering is in "date order" (ie "A" is more recent than "B" >> etc). In this situation, we'd start following the branch A->B->C->D->.. >> before we even start looking at E and F, because they all _look_ more >> recent. The ancestry graph, topologically, flows from bottom to top but the timestamps are in F E D C B A order (A is closer to current, F is in the most distant past). Somebody forked from C on a machine with slow clock, made two commits with wrong (from the point of view of the person who made commit C anyway) timestamps, and that side branch ended up merged with B into A. You start following A, read A and find B and E (now B and E are "active" in your lingo), pop B because it is the most recent. We look at B, find C is the parent, push C into the active list (which is always sorted by timestamp order). Now "active" are C and E, and C is most recent so we pop C. In the past people suggested workarounds such as making commit-tree to ensure that a child commit has timestamp no older than any of its parents by either refusing to make such or automatically adjusting. That would surely work around the problem, but if somebody made a commit with a wrong timestamp far into the future, every commit that is made after that will have "adjusted" timestamp that pretty much is meaningless, so that would not work well in practice. If we had a commit generation number field in the commit object, we could have used something like that. Each commit gets generation number that is maximum of its parents' generation number plus one, and we prime the recursive definition by giving root commits generation number of 0, and rev-list would have used the generation number not timestamp to do the above "date-order" sort and we will always get topology right. Of course fsck-objects needs to be taught to check and complain if the generation numbers between parent and child are inverted. But it's too late in the game now -- maybe in git version 47. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: Change set based shallow clone
2006-09-09 21:11 ` Junio C Hamano
@ 2006-09-09 21:14 ` Jeff King
0 siblings, 0 replies; 101+ messages in thread
From: Jeff King @ 2006-09-09 21:14 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Sat, Sep 09, 2006 at 02:11:51PM -0700, Junio C Hamano wrote:

> The trouble Linus illustrated is that in a global project you
> cannot rely on timestamps always being correct. You can use
> them as a HINT, but you need to be prepared to do sensible things
> when some people screw up the time.

Right, I forgot about the unreliability of the timestamps. Thanks for
the explanation!

-Peff
* Re: Change set based shallow clone
2006-09-09 20:43 ` Jeff King
2006-09-09 21:11 ` Junio C Hamano
@ 2006-09-09 21:40 ` Linus Torvalds
2006-09-09 22:54 ` Jon Smirl
1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2006-09-09 21:40 UTC (permalink / raw)
To: Jeff King
Cc: Marco Costalba, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On Sat, 9 Sep 2006, Jeff King wrote:
>
> The problem you describe seems to come from doing a depth-first display
> of each branch

No, not at all. We _don't_ do depth-first, or breadth-first. The thing
is, neither would really work, since the "distance" between commits is
very variable.

Instead, the rule is always to try to traverse the tree in date order.
That's most likely the closest order to what you want - it ends up
being breadth-first if there is a lot of concurrent work, and if there
is one old branch and one new branch, we'll look at the new one first,
and approximate depth-first there.

> Why not look at the tip of each "active" branch
> simultaneously and pick the one with the most recent date?

Which is _exactly_ what we do.

However, it's just a heuristic. "Most recent date" is not well-defined
in a distributed environment that doesn't have a global clock. If
somebody does commits on a machine that has the clock just set wrong,
"most recent" may well not be _really_ recent.

So all the algorithms are actually very careful to just use the date as
a heuristic. They have real guarantees, but they aren't about the
ordering. The ordering is just used as a way to have _some_ non-random
and likely good order to traverse the DAG in. Doing it depth-first
would suck (you'd always reach the root before you reach a lot of much
more interesting commits), and breadth-first is not well-defined
either, since one "level" can be a year old for one parent, and a day
old for another.

So you could kind of think of the ordering as a "breadth-first,
modified by date" kind of thing.

		Linus
* Re: Change set based shallow clone
2006-09-09 21:40 ` Linus Torvalds
@ 2006-09-09 22:54 ` Jon Smirl
2006-09-10 0:18 ` Linus Torvalds
0 siblings, 1 reply; 101+ messages in thread
From: Jon Smirl @ 2006-09-09 22:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff King, Marco Costalba, Paul Mackerras, linux@horizon.com, Git Mailing List

On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> However, it's just a heuristic. "Most recent date" is not well-defined
> in a distributed environment that doesn't have a global clock. If
> somebody does commits on a machine that has the clock just set wrong,
> "most recent" may well not be _really_ recent.

When a merge happens could git fix things up in the database by adding
a corrected, hidden time stamp that would keep things from having an
out of order time sequence? That way you wouldn't need to rediscover
the out of order commit each time the tree is generated.

--
Jon Smirl
jonsmirl@gmail.com
* Re: Change set based shallow clone
2006-09-09 22:54 ` Jon Smirl
@ 2006-09-10 0:18 ` Linus Torvalds
2006-09-10 1:22 ` Junio C Hamano
0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2006-09-10 0:18 UTC (permalink / raw)
To: Jon Smirl
Cc: Jeff King, Marco Costalba, Paul Mackerras, linux@horizon.com, Git Mailing List

On Sat, 9 Sep 2006, Jon Smirl wrote:
>
> When a merge happens could git fix things up in the database by adding
> a corrected, hidden time stamp that would keep things from having an
> out of order time sequence? That way you wouldn't need to rediscover
> the out of order commit each time the tree is generated.

I don't think any such algorithm exists, and it's too late now ;)

I was thinking of some "sequence number" algorithm, but the thing is,
it's not _merges_ that are the problem per se, it's the "branch
points". And you don't actually know that a commit is a branch point in
advance: it's a branch point not because you branched, but because
somebody else happened to use the same parent as the basis of their
independent work.

So we could make the rule be something like "when we commit, the new
commit must have a datestamp that is equal to or in the future from the
parent commits". However, that results in some really strange problems
in a distributed environment, where _if_ somebody has their clock set
wrong, you'll end up either unable to merge with them, or you have to
generate a totally bogus time yourself (which then moves up the chain).

It might be simpler to then not think of the thing as a "date" at all,
but as a pure sequence number. Then the rule would be: a new commit has
to have a sequence number that is "max(of-the-parent-sequence-numbers)
+ 1".

That _may_ or may not actually help. It was discussed (a bit) early on
- summer-05 - but I wasn't convinced enough that it would help that I
was willing to change the format and add that sequence number. I still
am not. I'm not even sure it would really work.

In other words, it's probably just not worth it. It's better to just
say "dates don't really _matter_, and all that matters is the
parenthood information". Which is really what git does - the dates
aren't _important_ for any real operation, they are literally just a
heuristic.

But hey, maybe somebody can come up with some operation that becomes
_much_ cheaper with the above kind of sequence number. It shouldn't be
hard to test..

		Linus
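The clamping rule Linus describes (and rejects) is easy to state in
code. A hypothetical sketch, not anything git's commit-tree actually
does; note how a single parent with a far-future clock forces every
descendant to inherit the bogus time:

    #include <time.h>

    static long clamp_commit_time(time_t now, const long *parent_dates,
                                  int nparents)
    {
            long t = (long)now;
            int i;
            for (i = 0; i < nparents; i++)
                    if (parent_dates[i] > t)
                            t = parent_dates[i]; /* inherit future clock */
            return t;  /* recorded timestamp, never older than a parent */
    }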
* Re: Change set based shallow clone
2006-09-10 0:18 ` Linus Torvalds
@ 2006-09-10 1:22 ` Junio C Hamano
0 siblings, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 1:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> But hey, maybe somebody can come up with some operation that becomes
> _much_ cheaper with the above kind of sequence number. It shouldn't be
> hard to test..

Sometimes I get a feeling that we are saying the same thing in the same
thread but in parallel, without reading what the other is saying. This
is such a thread ;-).
* Re: Change set based shallow clone
2006-09-09 20:05 ` Linus Torvalds
2006-09-09 20:43 ` Jeff King
@ 2006-09-10 3:49 ` Marco Costalba
2006-09-10 4:13 ` Junio C Hamano
1 sibling, 1 reply; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 3:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> The example is
>
>       A   <--- tip of branch
>      / \
>     B   E
>     |   |
>     |   F
>     |  /
>     C
>     |
>     D
>     ...

Ok, now it's clear, thanks. But I still think that it should be
possible to avoid the checking and reordering on the receiver side.

Suppose for a moment that we split graph drawing from the sequence
reordering problem, and that the receiver does not draw the graph
immediately.

As you described, in our case git-rev-list sends the following
sequence:

    A, B, C, D, E, F

whereas git-rev-list --topo-order would have sent something like:

    A, E, F, B, C, D

Now I ask: is it possible to have a sequence (without latency) like

    A, B, C, D, (-3)E, (-3)F

where, for revisions that are not topologically ordered, git-rev-list
gives a hint about the correct position in the sequence (3 revisions
earlier, in our case), i.e. where the revision would have been if the
sequence had been --topo-order?

This saves all the checking and reordering on the receiver side and
guarantees consistent results across different implementations of git
visualizers, because the --topo-order algorithm logic is computed in
git-rev-list only.

The visualizers could be sequence-order agnostic, i.e. receivers could
handle --topo-order or --date-order sequences, or any other kind of
sequence, with the same code; they simply recreate the sequence in the
way git-rev-list tells them.

Marco
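As a rough sketch of the receiving side of this proposal (the option
and the hint syntax are hypothetical; nothing like them exists in
rev-list): a hint "(-k)X" would mean "splice X in k positions back from
the end of what has been received so far" instead of appending.

    #include <string.h>

    /* rows[] holds revision names already received, newest first.
     * back == 0 is a plain append; back == 3 implements "(-3)X". */
    static void insert_with_hint(const char **rows, int *nrows,
                                 const char *rev, int back)
    {
            int i, pos = *nrows - back;
            if (pos < 0)
                    pos = 0;                /* clamp a bad hint */
            for (i = *nrows; i > pos; i--)
                    rows[i] = rows[i - 1];  /* shift the tail down */
            rows[pos] = rev;
            (*nrows)++;
    }

With the stream above, appending A, B, C, D and then applying (-3)E and
(-3)F yields A, E, F, B, C, D, which is exactly the --topo-order
sequence.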
* Re: Change set based shallow clone
2006-09-10 3:49 ` Marco Costalba
@ 2006-09-10 4:13 ` Junio C Hamano
2006-09-10 4:23 ` Marco Costalba
0 siblings, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 4:13 UTC (permalink / raw)
To: Marco Costalba
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

"Marco Costalba" <mcostalba@gmail.com> writes:

> On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
>>
>> The example is
>>
>>       A   <--- tip of branch
>>      / \
>>     B   E
>>     |   |
>>     |   F
>>     |  /
>>     C
>>     |
>>     D
>>     ...
>
> Ok, now it's clear, thanks. But I still think that it should be
> possible to avoid the checking and reordering on the receiver side.
>
> Suppose for a moment that we split graph drawing from the sequence
> reordering problem, and that the receiver does not draw the graph
> immediately.
>
> As you described, in our case git-rev-list sends the following
> sequence:
>     A, B, C, D, E, F
>
> whereas git-rev-list --topo-order would have sent something like:
>     A, E, F, B, C, D
>
> Now I ask: is it possible to have a sequence (without latency) like
>     A, B, C, D, (-3)E, (-3)F
>
> where, for revisions that are not topologically ordered, git-rev-list
> gives a hint about the correct position in the sequence (3 revisions
> earlier, in our case), i.e. where the revision would have been if the
> sequence had been --topo-order?

When rev-list is writing E out, it does not know it is a descendant of
something it already emitted (i.e. C) because it hasn't looked at F nor
checked its parent yet. So asking for (-3)F may be fine, but I think
(-3)E is just a fantasy.
* Re: Change set based shallow clone
2006-09-10 4:13 ` Junio C Hamano
@ 2006-09-10 4:23 ` Marco Costalba
2006-09-10 4:46 ` Marco Costalba
2006-09-10 4:54 ` Junio C Hamano
1 sibling, 2 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 4:23 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>
> > On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> >>
> >> The example is
> >>
> >>       A   <--- tip of branch
> >>      / \
> >>     B   E
> >>     |   |
> >>     |   F
> >>     |  /
> >>     C
> >>     |
> >>     D
> >>     ...
> >
> > Ok, now it's clear, thanks. But I still think that it should be
> > possible to avoid the checking and reordering on the receiver side.
> >
> > Suppose for a moment that we split graph drawing from the sequence
> > reordering problem, and that the receiver does not draw the graph
> > immediately.
> >
> > As you described, in our case git-rev-list sends the following
> > sequence:
> >     A, B, C, D, E, F
> >
> > whereas git-rev-list --topo-order would have sent something like:
> >     A, E, F, B, C, D
> >
> > Now I ask: is it possible to have a sequence (without latency) like
> >     A, B, C, D, (-3)E, (-3)F
> >
> > where, for revisions that are not topologically ordered,
> > git-rev-list gives a hint about the correct position in the sequence
> > (3 revisions earlier, in our case), i.e. where the revision would
> > have been if the sequence had been --topo-order?
>
> When rev-list is writing E out, it does not know it is a descendant of
> something it already emitted (i.e. C) because it hasn't looked at F
> nor checked its parent yet. So asking for (-3)F may be fine, but I
> think (-3)E is just a fantasy.

Ok. What about something like this?

    A, B, C, D, E, (-3, 1)F

where -3 is the correct position in the sequence and 1 is the number of
revisions before F to which the -3 rule also applies.
* Re: Change set based shallow clone
2006-09-10 4:23 ` Marco Costalba
@ 2006-09-10 4:46 ` Marco Costalba
0 siblings, 0 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 4:46 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/10/06, Marco Costalba <mcostalba@gmail.com> wrote:
> On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
> > "Marco Costalba" <mcostalba@gmail.com> writes:
> >
> > > On 9/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > >>
> > >> The example is
> > >>
> > >>       A   <--- tip of branch
> > >>      / \
> > >>     B   E
> > >>     |   |
> > >>     |   F
> > >>     |  /
> > >>     C
> > >>     |
> > >>     D
> > >>     ...
> > >
> > > Now I ask: is it possible to have a sequence (without latency)
> > > like
> > >     A, B, C, D, (-3)E, (-3)F
> > >
> > > where, for revisions that are not topologically ordered,
> > > git-rev-list gives a hint about the correct position in the
> > > sequence (3 revisions earlier, in our case), i.e. where the
> > > revision would have been if the sequence had been --topo-order?
> >
> > When rev-list is writing E out, it does not know it is a descendant
> > of something it already emitted (i.e. C) because it hasn't looked at
> > F nor checked its parent yet. So asking for (-3)F may be fine, but I
> > think (-3)E is just a fantasy.
>
> Ok. What about something like this?
>
>     A, B, C, D, E, (-3, 1)F
>
> where -3 is the correct position in the sequence and 1 is the number
> of revisions before F to which the -3 rule also applies.

Suppose git-rev-list gives:

    A, B, H, C, D, E, F, G, I, L, M, N

and suppose git-rev-list --topo-order would have sent:

    A, B, C, D, E, F, G, H, I, L, M, N

Perhaps a more general fixup syntax could be:

    A, B, H, C, D, E, F, G, I(6, C, D, E, F, G, H), L, M, N

where 6 is the number of previous revisions to rearrange and what
follows is the corresponding correct sequence (C, D, E, F, G, H).

Note that the same subset could be rearranged multiple times if that
helps git-rev-list parsing; for example, it should be possible to have
something like:

    A, B, H, C, D, E, F, G, I(6, C, D, E, F, G, H), L, M(3, L, G, H), N

Marco
* Re: Change set based shallow clone
2006-09-10 4:23 ` Marco Costalba
2006-09-10 4:46 ` Marco Costalba
@ 2006-09-10 4:54 ` Junio C Hamano
2006-09-10 5:14 ` Marco Costalba
` (2 more replies)
1 sibling, 3 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 4:54 UTC (permalink / raw)
To: Marco Costalba
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

"Marco Costalba" <mcostalba@gmail.com> writes:

> >>
> >>       A   <--- tip of branch
> >>      / \
> >>     B   E
> >>     |   |
> >>     |   F
> >>     |  /
> >>     C
> >>     |
> >>     D
> >>     ...
>
> Ok. What about something like this?
>
>     A, B, C, D, E, (-3, 1)F
>
> where -3 is the correct position in the sequence and 1 is the number
> of revisions before F to which the -3 rule also applies.

That means F knows who its descendants are, and commit objects do not
have that information, so while traversing you need to keep track of
all the descendants yourself, don't you?

And how does that fix-up information help the user of the stream
anyway? If I understand your model correctly, the model does not
synchronously draw nodes as they are received, so it keeps track of
what it has seen so far. When it sees F it can notice that its parent,
C, is something it has seen, so it can tell F is wrong. It also knows
that it has seen E and E's parent is F (which turned out to be at a
wrong spot), so it can suspect E is also in the wrong spot (now, it is
fuzzy to me how that chain of suspicion ends at E and does not
propagate up to A, but let's think about that later).

There is one thing that the user knows a lot better than the generic
rev-list output part. It is the size of the output window (how tall
the window is). I wonder if there is a way to exploit it, either in
the user, or by telling the size of the window (and perhaps where the
display region at the top begins) to rev-list...

If we are dealing with a 3-item-tall window, we will traverse A B C D,
notice they form a single strand of pearls, and can make a mental note
that we do not have to worry about ancestors of D for now, because D's
ancestors will be drawn below C, which is already the putative bottom
edge of the window (when oddballs like E and F come later, they can
only push things down at the bottom edge of the window). Then rev-list
can postpone traversing D and go on to traverse other nodes that are
in the "active" list (in this case, E).

I am starting to suspect that introducing a "generation" header to the
commit objects might actually be a very good thing. For one thing,
rev-list would automatically always get the topology right if we did
so.

We would obviously need to update 'convert-objects' and tell everybody
that they need to rewrite their history if we take this route. That
kind of flag-day conversion used to be an Ok thing to do, but it is
getting harder and harder these days, for obvious reasons.
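Junio's windowing observation can be reduced to a small predicate --
hypothetical, of course, since rev-list knows nothing about windows
today: once the visible window is full and the traversal frontier has
collapsed to a single tip, every remaining commit is an ancestor of
that tip and can only appear below the bottom edge, so the producer may
safely pause.

    /* rows_emitted: commits already handed to the visualizer
     * window_rows:  how tall the display window is
     * nactive:      size of the traversal frontier ("active" list) */
    static int traversal_can_pause(int rows_emitted, int window_rows,
                                   int nactive)
    {
            return rows_emitted >= window_rows && nactive == 1;
    }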
* Re: Change set based shallow clone
2006-09-10 4:54 ` Junio C Hamano
@ 2006-09-10 5:14 ` Marco Costalba
2006-09-10 5:46 ` Junio C Hamano
2006-09-10 15:21 ` linux
2006-09-10 9:49 ` Jakub Narebski
2 siblings, 2 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 5:14 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>
> > Ok. What about something like this?
> >
> >     A, B, C, D, E, (-3, 1)F
> >
> > where -3 is the correct position in the sequence and 1 is the number
> > of revisions before F to which the -3 rule also applies.
>
> That means F knows who its descendants are, and commit objects do not
> have that information, so while traversing you need to keep track of
> all the descendants yourself, don't you?

Yes, you do! But you do it in git instead of in the visualizer ;-)
because I think that in any case someone definitely needs to keep
track of it.

> And how does that fix-up information help the user of the stream
> anyway? If I understand your model correctly, the model does not
> synchronously draw nodes as they are received,

Visualizers draw only what is on screen, so when you start them they
draw the first 20/30 lines only. And normally git's output is faster
than the user's scrolling, so by the time the user scrolls down the
revisions have already arrived.

> track of what it has seen so far. When it sees F it can notice
> that its parent, C, is something it has seen, so it can tell F
> is wrong. It also knows that it has seen E and E's parent is F
> (which turned out to be at a wrong spot), so it can suspect E is
> also in the wrong spot (now, it is fuzzy to me how that chain
> of suspicion ends at E and does not propagate up to A, but let's
> think about that later).

It is possible; in that case flicker will be involved, but it is very
uncommon, so in the large majority of all other cases we have no
flickering and early graph drawing.

Because the worst that can happen on the visualizer side is flickering,
we can accept a worse rare case in exchange for a big advantage (no
latency and no flickering) in the (very) common case: "the user scrolls
to a given page when the corresponding revisions have already arrived
and, if needed, been fixed up".

Marco
* Re: Change set based shallow clone
2006-09-10 5:14 ` Marco Costalba
@ 2006-09-10 5:46 ` Junio C Hamano
2006-09-10 15:21 ` linux
1 sibling, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2006-09-10 5:46 UTC (permalink / raw)
To: git

"Marco Costalba" <mcostalba@gmail.com> writes:

> On 9/10/06, Junio C Hamano <junkio@cox.net> wrote:
>> "Marco Costalba" <mcostalba@gmail.com> writes:
>>
>> > >>
>> > >>       A   <--- tip of branch
>> > >>      / \
>> > >>     B   E
>> > >>     |   |
>> > >>     |   F
>> > >>     |  /
>> > >>     C
>> > >>     |
>> > >>     D
>> > >>     ...
>> >
>> > Ok. What about something like this?
>> >
>> >     A, B, C, D, E, (-3, 1)F
>> >
>> > where -3 is the correct position in the sequence and 1 is the
>> > number of revisions before F to which the -3 rule also applies.
>>
>> That means F knows who its descendants are, and commit objects
>> do not have that information, so while traversing you need to
>> keep track of all the descendants yourself, don't you?
>
> Yes, you do! But you do it in git instead of in the visualizer ;-)
> because I think that in any case someone definitely needs to keep
> track of it.

Not really. To git-rev-list, which does not know how the visualizer
uses the data, what you are saying essentially means that it needs to
hold onto the data it parsed forever. That's where my earlier
suggestion for visualizers to take advantage of the "windowing"
information comes from. Since the chain depicted above between E..F can
be arbitrarily long (and you would need to keep track of information
for B and C too -- well, essentially everything), that is unacceptable
for rev-list unless you give it an additional clue as to when it can
discard that information.
* Re: Change set based shallow clone
2006-09-10 5:14 ` Marco Costalba
2006-09-10 5:46 ` Junio C Hamano
@ 2006-09-10 15:21 ` linux
2006-09-10 18:32 ` Marco Costalba
2006-09-11 9:56 ` Paul Mackerras
1 sibling, 2 replies; 101+ messages in thread
From: linux @ 2006-09-10 15:21 UTC (permalink / raw)
To: junkio, mcostalba; +Cc: git, jonsmirl, linux, paulus, torvalds

This conversation is going in so many directions at once that it's
getting annoying. The first-level decision requires input from the
point of view of gitk and qgit internals as to what's easy to
implement.

I'm trying to figure out the gitk code, but I'm not fluent in tcl, and
it has 39 non-boilerplate comment lines in 229 functions and 6308 lines
of source, so it requires fairly intensive grokking.

Still, while it obviously doesn't render bitmaps until the data appears
in the window, it appears as though at least part of the layout code
(the "layoutmore" function) is performed eagerly as soon as new data
arrives via the getcommitlines callback. (Indeed, it appears that Tk
does not inform it of window scroll events, so it can't wait any longer
to decide on the layout.)

Case 1: The visualizer is NOT CAPABLE of accepting out-of-order input.

Without very sophisticated caching in git-rev-list, it must process all
of the data before outputting any, in order to make an absolute
guarantee of no out-of-order data despite possibly messed-up
timestamps. It is possible to notice that the date-first traversal only
has one active chain and flush the queued data at that point, but that
situation is unlikely to arise in repositories of non-trivial size,
which are exactly the ones for which the batch-sorting delay is
annoying.

Case 2: The visualizer IS CAPABLE of accepting out-of-order input.

Regardless of whether the layout is done eagerly or lazily, this
requires the visualizer to potentially undo part of its layout and
re-do it, so it has a UI implementation cost. The re-doing does not
need to be highly efficient; any number of heuristics and exception
caches can reduce the occurrence of this in git-rev-list output to very
low levels. It's just absolutely excluding it, without losing
incremental output, that is difficult.

Heuristic: I suspect the most common wrong-timestamp case is a time
zone misconfiguration, so holding back output until the tree traversal
has advanced 24 hours will eliminate most of the problems. Atypically
large time jumps (like more than a year) could also trigger special
"gross clock error" handling.

Cache: whenever a child timestamped older than an ancestor is
encountered, this can be entered in a persistent cache that can be used
to give the child a different sorting priority next time. The simplest
implementation would propagate this information up a chain of bad
timestamps by one commit per git-rev-list invocation, but even that's
probably okay. (A study of timestamp-ordering problems in existing
repositories would be helpful for tuning these.)

In case 2, I utterly fail to see how delaying emitting the out-of-order
commit is of the slightest help to the UI. The simplest way to merge
out-of-order data is with an insertion sort (a.k.a. roll back and
reprocess forward), and the cost of that is minimized if the distance
to back up is minimized. Some "oops!" annotation on the git-rev-list
output may be helpful to tell the UI that it needs to search back, but
it already has an internal index of commits, so perhaps even that isn't
worth bothering with. Fancier output formats are also more work to
process.
With sufficient caching of exceptions in git-rev-list, it may be
practical to just have a single "oops, I screwed up; let's start again"
output line, which very rarely triggers.

But can we stop designing git-rev-list output formats until we've
figured out if and how to implement it in the visualizer? Or, more to
the point, visualizers plural. That's the hard part. Then we can see
what sort of git-rev-list output would be most convenient. For example,
is fixing a small number of out-of-place commits practical, or is it
better to purge and restart? The former avoids deleting
already-existing objects, while the latter avoids moving them.

The original problem is that the long delay between starting a git
history browser and being able to browse is annoying. The visualizer
UIs already support browsing while history is flowing in from
git-rev-list, but git-rev-list is reading and sorting the entire
history before outputting the first line.
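A minimal sketch of the 24-hour hold-back heuristic described above,
with invented names: generated commits sit in a buffer kept sorted by
date, and a row is released only once the traversal frontier has moved
at least a day past it, so a commit whose clock was off by a time zone
still gets sorted into place before being printed.

    #include <stdio.h>

    #define SLACK (24 * 3600)    /* hold back one day of history */

    struct row { const char *name; long date; };

    /* pending[] is kept sorted oldest-first, newest at pending[*n - 1].
     * frontier_date is the timestamp of the commit the traversal just
     * reached; a buffered row more than SLACK newer than that can no
     * longer be displaced by a misdated straggler, so emit it. */
    static void maybe_flush(struct row *pending, int *n,
                            long frontier_date)
    {
            while (*n > 0 &&
                   pending[*n - 1].date > frontier_date + SLACK) {
                    printf("%s\n", pending[*n - 1].name);
                    (*n)--;
            }
    }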
* Re: Change set based shallow clone
2006-09-10 15:21 ` linux
@ 2006-09-10 18:32 ` Marco Costalba
2006-09-11 9:56 ` Paul Mackerras
0 siblings, 0 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-10 18:32 UTC (permalink / raw)
To: linux@horizon.com; +Cc: junkio, git, jonsmirl, paulus, torvalds

>
> But can we stop designing git-rev-list output formats until we've
> figured out if and how to implement it in the visualizer? Or, more to
> the point, visualizers plural. That's the hard part. Then we can see
> what sort of git-rev-list output would be most convenient.
>

Regarding qgit we have two cases:

- Arrival of a fixup set for out-of-order commits _not_ currently
displayed on the screen.

This is by far the most common case, especially for big archives. In
this case the fixup speed depends on how far the fixed commits are from
the last loaded commit (it's an insertion in a vector). In any case it
is far quicker than redrawing the graph. There is no flickering side
effect here. Implementation is easy.

- Arrival of a fixup set for out-of-order commits currently displayed
on the screen.

Here the implementation _could_ be more difficult, but given how rarely
this case occurs (a rare case among rare cases) we could accept the
brutal approach of resetting the complete graph, *but not the loaded
revision data*, and redrawing the graph. With this approach the
implementation is almost as easy as before, but flickering is involved.

I agree with you that we should have fixup information as soon as
git-rev-list discovers out-of-order commits.

Marco
* Re: Change set based shallow clone
2006-09-10 15:21 ` linux
2006-09-10 18:32 ` Marco Costalba
@ 2006-09-11 9:56 ` Paul Mackerras
2006-09-11 12:39 ` linux
1 sibling, 1 reply; 101+ messages in thread
From: Paul Mackerras @ 2006-09-11 9:56 UTC (permalink / raw)
To: linux; +Cc: junkio, mcostalba, git, jonsmirl, torvalds

linux@horizon.com writes:

> I'm trying to figure out the gitk code, but I'm not fluent in tcl, and
> it has 39 non-boilerplate comment lines in 229 functions and 6308
> lines of source, so it requires fairly intensive grokking.

Sorry :)

> Still, while it obviously doesn't render bitmaps until the data
> appears in the window, it appears as though at least part of the
> layout (the "layoutmore" function) code is performed eagerly as
> soon as new data arrives via the getcommitlines callback.

It doesn't render bitmaps as such at the Tcl level; what it does is
create items on canvases, add text to text widgets, etc. Tk takes care
of turning it into bitmaps.

The work of drawing the graph is done in three stages - layout,
optimization and drawing. The layout and optimization stages are done
when we have enough data because, at least for the layout stage,
earlier rows can affect later rows arbitrarily far away. For the
optimization stage, I don't think the changes it makes can propagate
arbitrarily far, so it would be possible to defer that stage, and I am
looking into implementing that. The drawing stage is done on demand,
just for the part that's actually visible.

> (Indeed, it appears that Tk does not inform it of window scroll
> events, so it can't wait any longer to decide on the layout.)

The Tcl code does get to know when the canvas has been scrolled - see
the scrollcanv procedure.

> In case 2, I utterly fail to see how delaying emitting the
> out-of-order commit is of the slightest help to the UI. The simplest
> way to merge

Indeed. I would want to see it as early as possible.

> out-of-order data is with an insertion sort (a.k.a. roll back and
> reprocess forward), and the cost of that is minimized if the distance
> to back up is minimized.

That part isn't too hard; I already have modifications to gitk to
handle that much of it. The harder part is backing up the graph drawing
algorithm.

> For example, is fixing a small number of out-of-place commits
> practical, or is it better to purge and restart? The former avoids
> deleting already-existing objects, while the latter avoids moving
> them.

For gitk, I'm thinking of a reorder buffer of say 10 entries at the
front end to cope with minor misorderings; then if misordering occurs
that is outside the scope of the reorder buffer to fix, freeze the
layout algorithms at that point, read in the rest of the commits and
reorder as necessary, then at the end restart the layout algorithm from
scratch, probably with a popup to inform the user what happened. If the
user could set the size of the reorder buffer then they could avoid
having that happen in future, at the cost of it taking longer to show
the first screenful of commits.

Paul.
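Transposed out of Tcl into C for the sake of illustration, the front
end Paul describes might look roughly like this (the names and the
fixed size are invented): a late arrival is insertion-sorted into the
last ten rows if it fits, and anything worse makes the function report
failure so the caller can freeze the layout and re-sort everything.

    #include <stdio.h>
    #include <string.h>

    #define REORDER_N 10

    struct row { const char *name; long date; };

    static struct row buf[REORDER_N + 1];       /* newest first */
    static int nbuf;
    static long newest_released = 0x7fffffffL;  /* nothing released yet */

    static void release(struct row r)
    {
            printf("%s\n", r.name); /* hand the row to the layout code */
    }

    /* Returns 0 on success, -1 when the misordering is beyond the
     * buffer's reach and the caller must freeze and re-sort. */
    static int add_row(struct row r)
    {
            int i = nbuf;
            if (r.date > newest_released)
                    return -1;      /* we already emitted past this row */
            while (i > 0 && buf[i - 1].date < r.date) {
                    buf[i] = buf[i - 1];  /* insertion sort, newest first */
                    i--;
            }
            buf[i] = r;
            if (++nbuf > REORDER_N) {     /* full: release the newest */
                    release(buf[0]);
                    newest_released = buf[0].date;
                    memmove(buf, buf + 1, --nbuf * sizeof(*buf));
            }
            return 0;
    }

(A real front end would also drain buf[] in order when the stream
ends.)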
* Re: Change set based shallow clone
2006-09-11 9:56 ` Paul Mackerras
@ 2006-09-11 12:39 ` linux
0 siblings, 0 replies; 101+ messages in thread
From: linux @ 2006-09-11 12:39 UTC (permalink / raw)
To: linux, paulus; +Cc: git, jonsmirl, junkio, mcostalba, torvalds

>> I'm trying to figure out the gitk code, but I'm not fluent in tcl,
>> and it has 39 non-boilerplate comment lines in 229 functions and 6308
>> lines of source, so it requires fairly intensive grokking.
>
> Sorry :)

I was more making excuses for my failure to understand its properties.
Still, if you get the chance, a couple of paragraphs of comments on the
primary data structures would be an immense help.

Anyway, thanks for the summary.

>> For example, is fixing a small number of out-of-place commits
>> practical, or is it better to purge and restart? The former avoids
>> deleting already-existing objects, while the latter avoids moving
>> them.

> For gitk, I'm thinking of a reorder buffer of say 10 entries at the
> front end to cope with minor misorderings; then if misordering occurs
> that is outside the scope of the reorder buffer to fix, freeze the
> layout algorithms at that point, read in the rest of the commits and
> reorder as necessary, then at the end restart the layout algorithm
> from scratch, probably with a popup to inform the user what happened.
> If the user could set the size of the reorder buffer then they could
> avoid having that happen in future, at the cost of it taking longer to
> show the first screenful of commits.

My thoughts were that anything you could do, git-rev-list could do
better, including the re-order buffer, so don't even bother. (For
example, I was thinking of a time- or depth-based one rather than a
count-based one.)

So I read your suggestion as a "purge and restart" preference, with
git-rev-list keeping a cache of problems encountered so that on most
runs, it doesn't have to do it. Putting that logic in git-rev-list
means that you don't have to detect ordering problems or do any
sorting, just handle an "oops!" line in the output.

Basically, given the example graph (with letters indicating timestamps)

    ...G--D--C--B---A
             \     /
              F---E

You'd get something like

    A B E
    B C
    C D
    D G
    E F
    *** ERROR - restarting - please wait
    A B E
    B C
    E F
    F C
    C D
    D G
    G ...
* Re: Change set based shallow clone
2006-09-10 4:54 ` Junio C Hamano
2006-09-10 5:14 ` Marco Costalba
@ 2006-09-10 9:49 ` Jakub Narebski
2006-09-10 10:28 ` Josef Weidendorfer
2 siblings, 0 replies; 101+ messages in thread
From: Jakub Narebski @ 2006-09-10 9:49 UTC (permalink / raw)
To: git

Junio C Hamano wrote:

> I am starting to suspect that introducing a "generation" header to
> the commit objects might actually be a very good thing. For
> one thing, rev-list would automatically always get the topology
> right if we did so.
>
> We would obviously need to update 'convert-objects' and tell everybody
> that they need to rewrite their history if we take this route.
> That kind of flag-day conversion used to be an Ok thing to do,
> but it is getting harder and harder these days, for obvious
> reasons.

Wouldn't it be possible to keep the code not much more complex if some
(newer) commit objects had the "generation" header and some (older)
ones did not? Git would use the generation header when it is present,
and the current timestamp-based heuristic code when it is not.

It would of course be better if the new commits had correct
"generation" headers, so insertion of a "new" commit after an "old"
commit would carry some extra overhead...

--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: Change set based shallow clone
2006-09-10 4:54 ` Junio C Hamano
2006-09-10 5:14 ` Marco Costalba
2006-09-10 9:49 ` Jakub Narebski
@ 2006-09-10 10:28 ` Josef Weidendorfer
2 siblings, 0 replies; 101+ messages in thread
From: Josef Weidendorfer @ 2006-09-10 10:28 UTC (permalink / raw)
To: Junio C Hamano
Cc: Marco Costalba, Linus Torvalds, Paul Mackerras, Jon Smirl, linux@horizon.com, Git Mailing List

On Sunday 10 September 2006 06:54, Junio C Hamano wrote:
> I am starting to suspect that introducing a "generation" header to
> the commit objects might actually be a very good thing. For
> one thing, rev-list would automatically always get the topology
> right if we did so.

I do not think that any commits need to be changed; only "rev-list
--topo-order" needs to build its own private cache of sequence numbers
for commits.

AFAICS, a generation number is purely redundant information, as a
single run of "rev-list --topo-order" can recalculate it. If rev-list
cannot find the number in its private per-repository cache of sequence
numbers, it calculates it the hard way (same as currently) and puts the
numbers it found into the cache afterwards.

The cache should be similar in size to a pack index of a fully packed
repository, but it could probably be made much smaller by storing only
one sequence number for each linear run of commits with a single parent
each.

Josef
* Re: Change set based shallow clone
@ 2006-09-09 10:31 linux
2006-09-09 13:00 ` Marco Costalba
0 siblings, 1 reply; 101+ messages in thread
From: linux @ 2006-09-09 10:31 UTC (permalink / raw)
To: mcostalba; +Cc: git, linux
> If the out of order revisions are a small fraction of the total, then
> it could be possible to have something like
>
> git rev-list --topo-order --with-appended-fixups HEAD
>
> where, while git-rev-list is working _without sorting the whole tree
> first_, when it finds an out of order revision it stores it in a
> fixup-list buffer, and *at the end* of the normal git-rev-list run the
> buffer is flushed to the receiver, so that the drawing logic does not
> change and the out of order revisions arrive at the end, already
> packed, sorted and prepared by git-rev-list.
I don't think I understand your proposal. The problem arises when
git-rev-list finds a commit that it should have listed before something
that it has already output.
Just for example:
Commit D: Ancestor B
Commit B: Ancestor A
Commit C: Ancestor B
Commit C is the problem, because if git-rev-list has already output B,
there's no way to back up and insert it in the right place.
How is waiting to output the already-behind-schedule commit C going
to help anything? The idea in gitk is to rewind the layout algorithm
to before B was added, insert C, and replay from there. This is
most efficient if C is encountered as soon after its correct place
as possible.
* Re: Change set based shallow clone
2006-09-09 10:31 linux
@ 2006-09-09 13:00 ` Marco Costalba
0 siblings, 0 replies; 101+ messages in thread
From: Marco Costalba @ 2006-09-09 13:00 UTC (permalink / raw)
To: linux@horizon.com; +Cc: git

On 9 Sep 2006 06:31:57 -0400, linux@horizon.com <linux@horizon.com> wrote:
> > If the out of order revisions are a small fraction of the total,
> > then it could be possible to have something like
> >
> > git rev-list --topo-order --with-appended-fixups HEAD
> >
> > where, while git-rev-list is working _without sorting the whole tree
> > first_, when it finds an out of order revision it stores it in a
> > fixup-list buffer, and *at the end* of the normal git-rev-list run
> > the buffer is flushed to the receiver, so that the drawing logic
> > does not change and the out of order revisions arrive at the end,
> > already packed, sorted and prepared by git-rev-list.
>
> I don't think I understand your proposal. The problem arises when
> git-rev-list finds a commit that it should have listed before
> something that it has already output.
>
> Just for example:
>
> Commit D: Ancestor B
> Commit B: Ancestor A
> Commit C: Ancestor B
>
> Commit C is the problem, because if git-rev-list has already output B,
> there's no way to back up and insert it in the right place.
>
> How is waiting to output the already-behind-schedule commit C going
> to help anything?

It helps in a number of ways from the receiver's point of view:

- The receiver doesn't need to keep a sorted list of loaded revisions.

- The receiver doesn't need to stop parsing input and redraw the graph
in the middle of the loading.

- The receiver doesn't need to check for possible out of order commits
when loading data, but can be assured that the data that arrives is
--topo-order consistent (although it may not be complete).

- It splits the in-order parsing code (performance critical and the
common case) from the fixup code (run at the end and on a small set of
commits).

- It allows a much faster implementation, because the sorting is done
only in git-rev-list and only once.

Following your example, my suggestion was something like this. Instead
of

    git-rev-list HEAD

    Commit D: Ancestor B
    Commit B: Ancestor A
    Commit C: Ancestor B
    Commit A: Ancestor E
    Commit E: Ancestor F
    Commit F: Ancestor G
    ....
    Commit V: Ancestor Z

git-rev-list --topo-order --with-appended-fixups HEAD sends something
like:

    Commit D: Ancestor B
    Commit B: Ancestor A
    Commit A: Ancestor E
    Commit E: Ancestor F
    Commit F: Ancestor G
    ....
    Commit V: Ancestor Z
    (* end of correctly sorted commits, start of out of order commits *)
    Commit C: Ancestor B
    ......
    Commit N: Ancestor M

So the receiver's drawing code, designed to work with --topo-order
style output, could as usual draw the graph down to "Commit V: Ancestor
Z" (but without waiting for git-rev-list's latency!!), and the new
fixup code could *at the end* loop across the *already sorted* fixup
set:

    Commit C: Ancestor B
    ......
    Commit N: Ancestor M

and update the graph in one go.

Of course some flag/syntax should be introduced to mark, in the
git-rev-list output stream, the end of the sorted commits and the
beginning of the fixups.

Marco
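To sketch what the receiving end of this two-phase stream could look
like (everything here is hypothetical; no --with-appended-fixups option
exists in git): rows from the sorted run are appended as they arrive,
and each trailing fixup is spliced in directly above its ancestor,
which is where a child belongs in --topo-order output, assuming -- as
in the example above -- that the fixup's own descendants were already
drawn higher up.

    #include <stdio.h>
    #include <string.h>

    static const char *rows[1024];  /* revision names, newest first */
    static int nrows;

    /* Phase 1: the sorted run is appended as it arrives. */
    static void append_row(const char *rev)
    {
            rows[nrows++] = rev;
    }

    /* Phase 2: splice "Commit rev: Ancestor anc" in above the
     * ancestor, since a commit must precede its parents. */
    static void apply_fixup(const char *rev, const char *anc)
    {
            int i, pos = nrows;             /* fallback: append at end */
            for (i = 0; i < nrows; i++)
                    if (!strcmp(rows[i], anc)) {
                            pos = i;        /* child goes above parent */
                            break;
                    }
            memmove(rows + pos + 1, rows + pos,
                    (nrows - pos) * sizeof(*rows));
            rows[pos] = rev;
            nrows++;
    }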