Re: Resumable clone/Gittorrent (again)

From: Maaartin-1 <grajcar1@seznam.cz>
To: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Resumable clone/Gittorrent (again)
Date: Thu, 06 Jan 2011 04:34:51 +0100
Message-ID: <4D25385B.3010103@seznam.cz>
In-Reply-To: <AANLkTi=_R53fm5Er0CdtZCFvDpE-Dqt8tMHAubcjOUBb@mail.gmail.com>

On 11-01-06 02:32, Nguyen Thai Ngoc Duy wrote:
> On Thu, Jan 6, 2011 at 6:28 AM, Maaartin <grajcar1@seznam.cz> wrote:
>> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

>> I haven't read the whole other thread yet, but what about going the other way
>> round? Use a single commit as a chain, create deltas assuming that all
>> ancestors are already available. The packs may arrive out of order, so the
>> decompression may have to wait. The number of commits may be one order of
>> magnitude larger than the number of paths (there are currently 2254 paths
>> and 24235 commits in git.git), so grouping consecutive commits into one larger
>> pack may be useful.
> 
> The number of commits can increase fast. I'd rather have a
> small/stable number over time.

In theory, I could create many commits per second. I could create many
unique paths per second, too. But I don't think that really happens. I
know of no repository larger than git.git, and I don't want to download
one just to see how many commits, paths, and objects it contains, but I'd
suppose it's fewer than one million commits, which should be manageable,
especially when commits get grouped together as I described below.

> And commits depend on other commits so
> you can't verify a commit until you have got all of its parents. That
> does apply to files too, but then one file chain does not interfere with
> other file chains.

That's true, but the verification is done locally on the client. It
consumes no network traffic and no server resources, so I consider it
cheap. I need less than half a minute (using only a single core) to
verify the whole git.git repository (36 MB). This is no problem, even if
verification has to wait until the download finishes. I'm sure the OP of
[1] would be happy if he could wait for this.

>> The advantage is that the packs stay stable over time, you may create them
>> using the most aggressive and time-consuming settings and store them forever.
>> You could create packs for single commits, packs for non-overlapping
>> consecutive pairs of them, for non-overlapping pairs of pairs, etc. I mean with
>> commits numbered 0, 1, 2, ... create packs [0,1], [2,3], ..., [0,3], [4,7],
>> etc. The reason for this is obviously to allow reading groups of commits from
>> different servers so that they fit together (similar to Buddy memory
>> allocation). Of course, there are things like branches bringing chaos in this
>> simple scheme, but I'm sure this can be solved somehow.
> 
> Pack encoding can change.

I see I didn't explain it clearly enough (or I am missing something
completely). I know why the packs normally used by git can't be used for
this purpose. Let me retry: Let's assume there's a commit chain
A-B-C-D-E-F-..., the client already has commit B and requests commit F.
It may send requests to up to 4 servers, asking for C, D, E, and F,
respectively. The server being asked for E _creates_ a pack containing
all the information needed to create E given _all of_ A, B, C, D. As the
base for any blob/whatever in E it may choose any blob contained in any
of these commits. Of course, it may also choose a blob already packed in
this pack. It may not choose any other blob, so any client having all
ancestors of E can use the pack. Different servers and/or program
versions may create different packs for E, but all of them are
_interchangeable_. Because of this, it makes sense to _store_ such a pack
for future reuse.
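
The restriction above can be sketched roughly as follows. This is only
an illustration, not git code: the chain, the object ids (a1, b1, ...),
and the function names are all made up for the example, and a pack is
modelled simply as a mapping from each packed object to its delta base.

```python
# Hypothetical model: a linear chain A-B-C-D-E-F, where each commit
# introduces a few new objects (the ids are invented for illustration).
CHAIN = ["A", "B", "C", "D", "E", "F"]
NEW_OBJECTS = {
    "A": {"a1", "a2"},
    "B": {"b1"},
    "C": {"c1"},
    "D": {"d1"},
    "E": {"e1", "e2"},
    "F": {"f1"},
}


def allowed_bases(commit):
    """Objects a pack for `commit` may use as delta bases: everything
    introduced by its ancestors, plus the objects of the commit itself
    (those are the ones stored in the pack)."""
    idx = CHAIN.index(commit)
    allowed = set()
    for c in CHAIN[: idx + 1]:
        allowed |= NEW_OBJECTS[c]
    return allowed


def pack_is_interchangeable(commit, deltas):
    """`deltas` maps each packed object id to its delta base, or to None
    when the object is stored whole. Delta cycles inside the pack are
    not checked in this sketch."""
    allowed = allowed_bases(commit)
    packs_exactly_commit = set(deltas) == NEW_OBJECTS[commit]
    bases_ok = all(
        base is None or (base in allowed and base != obj)
        for obj, base in deltas.items()
    )
    return packs_exactly_commit and bases_ok


# Two different servers may pick different bases; both packs are usable
# by any client that has A, B, C, and D.
print(pack_is_interchangeable("E", {"e1": "b1", "e2": "e1"}))
print(pack_is_interchangeable("E", {"e1": None, "e2": "d1"}))
# A base taken from F is not an ancestor object, so this pack is rejected.
print(pack_is_interchangeable("E", {"e1": "f1", "e2": None}))
```

Any pack passing this check depends only on the ancestors of E, which is
what makes packs from different servers interchangeable.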

Compared to the way git packing normally works, this is a restriction,
but I don't think it leads to significantly worse compression. You guys
working on git can confirm or disprove it.

> And packs can contain objects you don't want
> to share (i.e. hidden from public view).

This pack would contain only commit E. I also described pairing, which is
intended for greater efficiency. In this case a server creates a pack
that allows, e.g., creating commits E and F given all their ancestors
(while another server creates a pack for C and D). This way the number of
packs needed may be a fraction of the total number of commits requested.
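
A minimal sketch of how a client might pick such groups, assuming
commits on a linear chain are numbered 0, 1, 2, ... and packs exist only
for aligned power-of-two ranges ([0,1], [2,3], [0,3], [4,7], ...). The
function name aligned_blocks is made up for this example, and branches
are ignored entirely.

```python
def aligned_blocks(first, last):
    """Cover commits first..last (inclusive) with the largest aligned
    power-of-two blocks, i.e. ranges [k * 2**n, (k + 1) * 2**n - 1]."""
    blocks = []
    pos = first
    while pos <= last:
        size = 1
        # Double the block while it stays aligned and inside the range.
        while pos % (size * 2) == 0 and pos + size * 2 - 1 <= last:
            size *= 2
        blocks.append((pos, pos + size - 1))
        pos += size
    return blocks


# With A=0, B=1, ..., F=5: the client has B and requests F, so it needs
# commits 2..5, which are covered by the pairs (C, D) and (E, F).
print(aligned_blocks(2, 5))   # [(2, 3), (4, 5)]
print(aligned_blocks(2, 9))   # [(2, 3), (4, 7), (8, 9)]
```

Each returned range could be requested from a different server, and
because the ranges are aligned, every server would produce a pack for
the same set of commits.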

>> Another problem is the client requesting commits A and B while declaring to
>> possess commits C and D. When both C and D are ancestors of either A or B, you
>> can ignore it (as you assume this while packing, anyway). The other case is
>> less probable, unless e.g. C is the master and A is a developing branch.
>> Currently, I've no idea how to optimize this and whether this could be
>> important.
> 
> As I said, we can request just part of a chain (from A+B to C+D).
> git-fetch should be used if the repo is quite up to date though. It's
> just more efficient.

[1] http://article.gmane.org/gmane.comp.version-control.git/164564
