From: linux@horizon.com
To: git@vger.kernel.org, jonsmirl@gmail.com, linux@horizon.com
Subject: Re: Change set based shallow clone
Date: 7 Sep 2006 21:01:12 -0400
Message-ID: <20060908010112.6962.qmail@science.horizon.com>
In-Reply-To: <9e4733910609071252ree73effwb06358e9a22ba965@mail.gmail.com>

> Here's a change set based shallow clone scheme I've been thinking
> about, does it have potential?

Well, let's look at the problems...

You might want to look at the explanation of the current git network
protocol at
http://www.spinics.net/lists/git/msg08899.html

I'm only just understanding it myself, but as best I understand it,
it has four phases:

- First, the server lists all of the heads it has, by SHA1 and name.
- Second, the client tells it which ones it wants, by SHA1.
- Third, an interactive protocol is entered in which the client
  tells the server which commits it has in terms the server can
  understand.  This proceeds as follows:

  - For each head the client has, it lists commit objects going
    backwards in history.
  - As soon as the server sees one that it has heard of, it sends
    back "ACK <hash> continue".  The client stops tracing that
    branch and starts in on the next one.
  - (Periodically, the client sends a ping to the server, which
    responds "NAK" if it's still alive.)
  - This continues until the client says "done".
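
A rough sketch of that third phase, in Python rather than anything git
actually ships.  The names (negotiate, client_parents, server_known) are
made up for illustration, the wire format is ignored, and "stops tracing
that branch" is simplified to "stops descending below the acknowledged
commit":

```python
def negotiate(client_heads, client_parents, server_known):
    """Return the commits the server would acknowledge as common.

    client_heads:   commit ids at the tips of the client's branches
    client_parents: dict of commit id -> list of parent ids (client side)
    server_known:   set of commit ids the server has
    """
    common = set()
    seen = set()
    for head in client_heads:
        stack = [head]
        while stack:
            commit = stack.pop()
            if commit in seen:
                continue
            seen.add(commit)
            # The client announces that it has this commit.
            if commit in server_known:
                # The server answers "ACK <hash> continue"; the client
                # does not look any further back along this line.
                common.add(commit)
                continue
            stack.extend(client_parents.get(commit, []))
    return common
```

Commits that exist only on the client (local work) are simply never
acknowledged; the walk keeps going until it reaches something the server
recognises or runs out of history.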

At this point, the server knows a set of commits which the client has,
and a set of commits which the client wants.  Then it invokes

	git-rev-list --objects-edge <want1> <want2> ^<have1> ^<have2>

(In practice, there will be a lot more than two of each, but forgive
me if I simplify.)

This builds a list of commits that are ancestors of <want1> or <want2>
but not of <have1> or <have2>.  (A commit is considered an ancestor of
itself, so <want1> and <want2> are in the set, while <have1> and <have2>
are not.)
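
As a toy model of that set difference (not how git-rev-list is
implemented; the function names and the dict-of-parents representation
are my own assumptions):

```python
def ancestors(parents, tips):
    """All commits reachable from tips, counting each tip as its own ancestor."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def commits_to_send(parents, wants, haves):
    """Ancestors of any want that are not ancestors of any have."""
    return ancestors(parents, wants) - ancestors(parents, haves)
```

For a history a---b---c---d with a side branch b---e, wanting d and e
while having b yields c, d and e.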

Then it enumerates the complete trees of each of those commits.  Then it
subtracts from that object set all of the trees and blobs in the <have1>
and <have2> commits.

This is a good approximation to the set of objects that the client needs
to be sent.  It does not search the ancestors of <have1> and <have2>,
so it is possible that a copy of some object exists in an older commit,
and the state in <want1> is a reversion to a pre-<have1> state, but it's
unlikely and not worth looking for.  Sending it redundantly is harmless.
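
The same approximation in sketch form.  The data layout is hypothetical:
trees maps a tree id to the ids it contains, blobs have no entry, and
commit_tree maps a commit to its top-level tree.  Commit objects
themselves are left out to keep it short:

```python
def tree_objects(trees, tree_id):
    """Collect tree_id plus every tree and blob reachable from it."""
    found = {tree_id}
    for child in trees.get(tree_id, []):  # blobs have no entry in trees
        found |= tree_objects(trees, child)
    return found


def objects_to_send(trees, commit_tree, send_commits, have_commits):
    """Everything in the trees being sent, minus what the have trees contain."""
    needed = set()
    for commit in send_commits:
        needed |= tree_objects(trees, commit_tree[commit])
    for commit in have_commits:
        needed -= tree_objects(trees, commit_tree[commit])
    return needed
```

Only the trees of the <have> commits themselves are subtracted, not those
of their ancestors, which is exactly why a reverted file can end up being
sent a second time.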


Then, this set of objects is piped to git-pack-objects, which finds them
and builds a pack file.  If the objects are stored locally as deltas
relative to other objects which are either to be sent, or are in the
<have1> or <have2> commits (which information is included in the data
output by git-rev-list), the delta is copied to the pack file verbatim.
Otherwise, more CPU is spent to pack it.
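
The reuse test amounts to something like the following.  This is my
reading of the rule just described, not git-pack-objects code, and the
function and parameter names are invented:

```python
def can_reuse_delta(delta_base, to_send, client_has):
    """True if a stored delta can be copied into the pack verbatim.

    delta_base: id of the object the local delta is relative to,
                or None if the object is stored whole
    to_send:    ids of objects going into this pack
    client_has: ids of objects in the <have> commits
    """
    if delta_base is None:
        return False
    return delta_base in to_send or delta_base in client_has
```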

Again, the server is allowed to check for deltas against ancestors
of <have1> or <have2>, but doesn't bother for efficiency.


Fourth, the pack file is sent to the client, which unpacks it (and
discards any duplicates).

> When the client wants a shallow clone it starts by telling the server
> all of the HEADs and how many change sets down each of those HEADs it
> has locally. That's a small amount of data to transmit and it can be
> easily tracked. Let's ignore merged branches for the moment.

When you say "change set", I'm going to assume you mean "commit object".

Okay.  Now, the server hasn't heard of one or more of those commit
objects, because they're local changes.  What then?


Another issue is that a client with a nearly-full copy has to do a full
walk of its history to determine the depth count that it has.  That can
be more than 2500 commits down in the git repository, and worse in the
mozilla one.  It's actually pretty quick (git-show-branch --more=99999
will do it), but git normally tries to avoid full traversals like the
plague.

Oh, and was "for the moment" supposed to last past the end of your e-mail?
I don't see what to do if there's a merge in the history and the depth
on different sides is not equal.  E.g. the history looks like:

...a---b---c---d---e---f
                  /     \
      ...w---x---y       HEAD
                        /
        ...p---q---r---s

Where "..." means that there are ancestors, but they're missing.
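
To make the problem concrete, here is a small sketch that counts the
depth along every path in that picture.  The parents table is my
transcription of the diagram (e merges d and y, HEAD merges f and s), and
path_depths is a made-up helper, not a git command:

```python
def path_depths(parents, tip):
    """Lengths of all paths from tip down to a commit whose parents are missing."""
    known = parents.get(tip, [])
    if not known:
        return [1]
    return sorted(1 + d for p in known for d in path_depths(parents, p))


parents = {
    "HEAD": ["f", "s"],
    "f": ["e"], "e": ["d", "y"], "d": ["c"], "c": ["b"], "b": ["a"], "a": [],
    "y": ["x"], "x": ["w"], "w": [],
    "s": ["r"], "r": ["q"], "q": ["p"], "p": [],
}
depths = path_depths(parents, "HEAD")
```

There are three different answers for one head, so a single "N change
sets down" number does not describe what the client has.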

> If you haven't updated for six months when the server walks backwards
> for 10 change sets it's not going to find anything you have locally.
> When this situation is encountered the server needs to generate a
> delta just for you between one of the change sets it knows you have
> and one of the 10 change sets you want. By generating this one-off
> delta it lets you avoid the need to fetch all of the objects back to a
> common branch ancestor. The delta functions as a jump over the
> intervening space.

Your choice of words keeps giving me the impression that you believe
that a "change set" is a monolithic object that includes all the changes
made to all the files.  It's neither monolithic nor composed of changes.
A commit object consists solely of metadata, and contains a pointer to
a tree object, which points recursively to the entire project state at
the time of the commit.

There is massive sharing of component objects between successive
commits, but they are NOT stored as deltas relative to one another.

The pack-forming heuristics tend to achieve that effect, but it is not
guaranteed or required by design.

Please understand that, deep in your bones: git is based on snapshots,
not deltas.
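
A toy content-addressed store shows the idea.  This is NOT git's real
object format (flat trees only, invented encoding, hypothetical put and
snapshot helpers); it only illustrates that each commit names a complete
tree and that unchanged files are shared by identity rather than stored
as differences:

```python
import hashlib

store = {}


def put(kind, payload):
    """Store an object under the SHA-1 of its kind and content; return the id."""
    data = f"{kind}\n{payload}".encode()
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = (kind, payload)
    return oid


def snapshot(files, message, parent=None):
    """Record a full snapshot: one blob per file, one tree, one commit."""
    entries = sorted((name, put("blob", text)) for name, text in files.items())
    tree = put("tree", "\n".join(f"{oid} {name}" for name, oid in entries))
    # The commit holds only metadata plus a pointer to the tree.
    return put("commit", f"tree {tree}\nparent {parent}\n\n{message}")
```

Committing two files and then changing one of them leaves a single copy
of the unchanged blob in the store, referenced from both trees, and a
complete second copy of the changed one.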


But okay, so we've sent the client the latest 10 commits, with a dangling
tail at the bottom.  (The files may have been sent as deltas against the
old state, or just fresh compressed copies, but that doesn't matter.)
Then the heads like "origin" have been advanced.

So the old commit history is now unreferenced garbage; nothing points
to it, and it will be deleted next time git-prune is run.  Is that
the intended behavior?  Or should updates to an existing clone always
complete the connections?
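
In sketch form, with a hypothetical unreachable helper standing in for
whatever git-prune really does, and commit names invented for the
example:

```python
def unreachable(parents, refs):
    """Commits present locally that no ref can reach.

    parents: dict of locally present commit id -> list of parent ids
             (a parent may be absent from the dict: the dangling tail)
    refs:    dict of ref name -> commit id
    """
    live = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit in parents and commit not in live:
            live.add(commit)
            stack.extend(parents[commit])
    return set(parents) - live


local = {"old1": [], "old2": ["old1"], "new1": ["missing"], "new2": ["new1"]}
garbage = unreachable(local, {"origin": "new2"})
```

Once "origin" points at new2, whose history stops at a parent the client
never received, old1 and old2 are reachable from nothing.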
