git.vger.kernel.org archive mirror
* Re: Resumable clone/Gittorrent (again) - stable packs?
@ 2011-01-06  2:29 Zenaan Harkness
  2011-01-06 17:05 ` Shawn Pearce
  2011-01-06 21:09 ` Nicolas Pitre
  0 siblings, 2 replies; 22+ messages in thread
From: Zenaan Harkness @ 2011-01-06  2:29 UTC (permalink / raw)
  To: git

Bittorrent requires some stability around torrent files.

Can packs be generated deterministically?
If not by two separate repos, what about by one particular repo?

For Linus' linux-2.6.git, that repo is considered 'canonical' by many.

Pack-torrents could be ~1MiB, ~10MiB, ~100MiB, ~1GiB, or as configured
in a particular repo; that repo is then the canonical location for
pack-torrents for everyone who considers it canonical.

Perhaps a heuristic/algorithm: once ten (sequentially generated) 10MiB
pack-torrents are floating around, they could simply be concatenated
into a 100MiB pack-torrent with a deterministic name, SHA, etc., so
that all the 10MiB pack-torrent files that torrent clients already hold
can be re-used and combined locally into the 100MiB torrent on demand.

Same for 100MiB -> 1GiB pack-torrents.

Individual extra commits:
While only a "small" number of additional commits have gone into a repo
since the last pack-torrent, clients fall back to git-fetch for those.

If Linus' linux-2.6.git (the currently configured "canonical" repo)
goes offline, simply configure a new canonical remote repo.

Branches:
Other repos branching from linux-2.6.git could create their own
consistent 50MiB (or as configured) pack-torrents containing only the
commits missing from linux-2.6 (i.e., those missing from that repo's
"canonical" upstream).

This would require clients to have a recursive torrent locator (I
start at linux-net.git, which requires linux-2.6.git, so I fetch those
packs as well as the linux-net.git packs).

Perhaps have a system-wide or per-user git repo/torrent config, or ask
the user running git-clone linux-net.git: "Do you have an existing
git.vger.kernel.org/linux-2.6.git archive?"

Zen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06  2:29 Resumable clone/Gittorrent (again) - stable packs? Zenaan Harkness
@ 2011-01-06 17:05 ` Shawn Pearce
  2011-01-10 16:39   ` John Wyzer
  2011-01-06 21:09 ` Nicolas Pitre
  1 sibling, 1 reply; 22+ messages in thread
From: Shawn Pearce @ 2011-01-06 17:05 UTC (permalink / raw)
  To: Zenaan Harkness; +Cc: git

On Wed, Jan 5, 2011 at 18:29, Zenaan Harkness <zen@freedbms.net> wrote:
> Bittorrent requires some stability around torrent files.
>
> Can packs be generated deterministically?

No.  We have been trying to avoid doing that, because it ties us into
one particular compression scheme.  We can't tune the algorithm and
get better compression later, because it would generate a different
pack.  We also rely on the system's libz to generate the compressed
data.  A version change to libz may generate a different encoding for
the same uncompressed data, simply because they made a tweak to how
the compression was performed.  Likewise our own delta compression
code can be tweaked to produce a different (but logically identical)
delta between the same two objects.

Right now packs aren't deterministic because we use multiple threads
to generate the deltas; the thread scheduling impacts which base
objects deltas are tried against, because threads can steal work from
each other if one finishes before another.  Disabling threading
entirely slows down delta compression considerably on multi-core
machines, but it does remove this work-stealing, making the pack
deterministic... but only for this exact Git binary, with this same
shared libz.  If the system libz or Git changes, all bets are off.

We've been down this road before; we don't want to box ourselves into
a tight corner by fixing these tunable portions of the compression
algorithms for all time.

-- 
Shawn.


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06  2:29 Resumable clone/Gittorrent (again) - stable packs? Zenaan Harkness
  2011-01-06 17:05 ` Shawn Pearce
@ 2011-01-06 21:09 ` Nicolas Pitre
  2011-01-07  2:36   ` Zenaan Harkness
  1 sibling, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2011-01-06 21:09 UTC (permalink / raw)
  To: Zenaan Harkness; +Cc: git

On Thu, 6 Jan 2011, Zenaan Harkness wrote:

> Bittorrent requires some stability around torrent files.
> 
> Can packs be generated deterministically?

They _could_, but we do _not_ want to do that.

The only thing which is stable in Git is the canonical representation of 
objects, and the objects they depend on, expressed by their SHA1 
signature.  Any BitTorrent-alike design for Git must be based on that 
property and not the packed representation of those objects which is not 
meant to be stable.

If you don't want to design anything and simply want to reuse the
current BitTorrent codebase, then create a Git bundle from some release
version and seed that bundle for a sufficiently long period to be worth
it.  Then falling back to git fetch to bring the repo up to date with
the very latest commits should be small and quick.  When that final
fetch gets too big, it's time to start seeding another, more up-to-date
bundle.
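This suggestion can be sketched with standard git commands; the
throwaway repo, tag name, and bundle filename below are illustrative,
not from the thread:

```shell
# Sketch of the bundle-seeding idea in a throwaway repo (names illustrative).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "release commit"
git tag v1.0

# A bundle is a single stable file: ideal for seeding over BitTorrent.
git bundle create release.bundle --all HEAD

# Bundles are self-contained, so anyone can verify and clone them offline.
git bundle verify release.bundle
git clone -q release.bundle from-bundle
```

A receiver would then run a normal git fetch against the real
repository to pick up whatever landed after the bundle was cut.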


Nicolas


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06 21:09 ` Nicolas Pitre
@ 2011-01-07  2:36   ` Zenaan Harkness
  2011-01-07  4:33     ` Nicolas Pitre
  0 siblings, 1 reply; 22+ messages in thread
From: Zenaan Harkness @ 2011-01-07  2:36 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

On Fri, Jan 7, 2011 at 08:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Thu, 6 Jan 2011, Zenaan Harkness wrote:
>
>> Bittorrent requires some stability around torrent files.
>>
>> Can packs be generated deterministically?
>
> They _could_, but we do _not_ want to do that.
>
> The only thing which is stable in Git is the canonical representation of
> objects, and the objects they depend on, expressed by their SHA1
> signature.  Any BitTorrent-alike design for Git must be based on that
> property and not the packed representation of those objects which is not
> meant to be stable.
>
> If you don't want to design anything and simply reuse current BitTorrent
> codebase then simply create a Git bundle from some release version and
> seed that bundle for a sufficiently long period to be worth it.  Then
> falling back to git fetch in order to bring the repo up to date with the
> very latest commits should be small and quick.  When that clone gets too
> big then it's time to start seeding another more up-to-date bundle.

Thanks guys for the explanations.

So, we don't _want_ to generate packs deterministically.
BUT, we _can_ reliably unpack a pack (duh).

So if my configured "canonical upstream" decides on a particular
compression etc., my git client doesn't care what has been chosen
upstream.

What is important for torrent-able packs, though, is stability over
some time period, no matter what the format.

There's been much talk of caching, invalidating of caches, overlapping
torrent-packs etc.

In every case, for torrents to work, the P2P'd files must have some
stability over some time period.
(If this assumption is incorrect, please clarify, not counting
every-file-is-a-torrent and every-commit-is-a-torrent.)

So, torrentable options:
- torrent per commit
- torrent per pack
- torrent per torrent-archive - new file format

Torrent per commit - too small, too many torrents; we need larger
p2p-able sizes in general.

Torrent per pack - packs are created non-deterministically, both
between hosts and even intra-host (libz upgrade, nr_threads change, git
pack algorithm optimization).

A new torrent format that came "close enough" to current git pack
performance (CPU load, threadability, size) could potentially become a
new version of the git pack file format - we don't want to store two
sets of pack files on disk if we can sensibly avoid it.  That is
unlikely to happen: I can't conceive of a torrentable format that would
be anything but worse than pack files, so it would be rejected from git
master.

Can we relax the perceived requirement to create pack files
deterministically?
Well, over what time period are pack files stable in a particular git
repository?
And over what time period do we require stable files for torrenting?

Can we simply configure our local git to keep specified pack files for
a specified time period, and use those as torrent-packs?
Perhaps the torrent file could have a UseBy date?
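There is an existing mechanism close to this (noting it here as an
editorial aside, not something proposed in the thread): a
pack-<hash>.keep marker file makes git repack/gc leave that pack alone,
so a pack being seeded could be pinned for the life of its torrent.  A
throwaway-repo sketch:

```shell
# Sketch: pin current packs with .keep files so later repacks leave them alone.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m one
git gc --quiet                      # pack the loose objects

# A pack-<hash>.keep file next to a pack marks it as kept.
for p in .git/objects/pack/*.pack; do
    : > "${p%.pack}.keep"
done

git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m two
git repack -a -d -q                 # kept packs survive a full repack
```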

Zen


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  2:36   ` Zenaan Harkness
@ 2011-01-07  4:33     ` Nicolas Pitre
  2011-01-07  5:22       ` Jeff King
  2011-01-10 11:48       ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 22+ messages in thread
From: Nicolas Pitre @ 2011-01-07  4:33 UTC (permalink / raw)
  To: Zenaan Harkness; +Cc: git


On Fri, 7 Jan 2011, Zenaan Harkness wrote:

> On Fri, Jan 7, 2011 at 08:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Thu, 6 Jan 2011, Zenaan Harkness wrote:
> >
> >> Bittorrent requires some stability around torrent files.
> >>
> >> Can packs be generated deterministically?
> >
> > They _could_, but we do _not_ want to do that.
> >
> > The only thing which is stable in Git is the canonical representation of
> > objects, and the objects they depend on, expressed by their SHA1
> > signature.  Any BitTorrent-alike design for Git must be based on that
> > property and not the packed representation of those objects which is not
> > meant to be stable.
> >
> > If you don't want to design anything and simply reuse current BitTorrent
> > codebase then simply create a Git bundle from some release version and
> > seed that bundle for a sufficiently long period to be worth it.  Then
> > falling back to git fetch in order to bring the repo up to date with the
> > very latest commits should be small and quick.  When that clone gets too
> > big then it's time to start seeding another more up-to-date bundle.
> 
> Thanks guys for the explanations.
> 
> So, we don't _want_ to generate packs deterministically.
> BUT, we _can_ reliably unpack a pack (duh).

Of course.

> So if my configured "canonical upstream" decides on a particular
> compression etc, I (my git client) doesn't care what has been chosen
> by my upstream.

Indeed.  This is like saying: I'm sending you the value 52, but I chose
to use the representation "24 + 28", while someone else might decide to
encode that value as "13 * 4" instead.  You are still able to decode it
to the same value in both cases.
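The same point can be made concrete with git itself; a throwaway-repo
sketch (repacking with different delta settings may change the pack
bytes, but never the object IDs):

```shell
# Sketch: the "value" (object ids) survives any change of "representation" (pack).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
for i in 1 2 3; do
    echo "line $i" >> file
    git add file
    git -c user.name=demo -c user.email=demo@example.invalid \
        commit -q -m "commit $i"
done

before=$(git rev-parse HEAD)
git repack -a -d -q --window=10 --depth=10   # one packing
git repack -a -d -q --window=0 --depth=0     # a different packing: no deltas
after=$(git rev-parse HEAD)
# The packed bytes may differ between the two repacks; every object id is unchanged.
```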

> What is important for torrent-able packs though is stability over some
> time period, no matter what the format.

Hence my suggestion to simply seed a Git bundle over BitTorrent.
Bundles are files designed to be used with completely ad hoc transports,
and you can fetch from them just as if they were a remote repository.

> There's been much talk of caching, invalidating of caches, overlapping
> torrent-packs etc.

And in my humble opinion this is all just crap.  All those suggestions
are fragile, create administrative issues, eat up server resources, and
they are all suboptimal in the end.  No one has implemented a working
prototype so far, either.

We don't want caches.  Fundamentally, we do not need any cache.  Caches
are a pain to administer on a busy server anyway, as they eat disk
space, and they also represent a much bigger security risk than a
read-only operation.

Furthermore, a cache is good only for the common case that everyone
wants, but with Git you cannot presume that everyone is at the same
version locally.  So either you do a custom transfer for each client to
minimize the amount transferred, in which case caching the result might
not benefit many people, or you make the cached data bigger so as to
cover more cases while making each transfer suboptimal.

Finally, we do have a cache already, and that's the existing packs
themselves.  During a clone, the vast majority of the transferred data
is streamed straight out of those existing packs without further
processing, as we try to reuse as much data as possible from those
packs so as not to recompute/recompress that data all the time.

> In every case, for torrents to work, the P2P'd files must have some
> stability over some time period.
> (If this assumption is incorrect, please clarify, not counting
> every-file-is-a-torrent and every-commit-is-a-torrent.)
> 
> So, torrentable options:
> - torrent per commit
> - torrent per pack
> - torrent per torrent-archive - new file format
> 
> Torrent per commit - too small, too many torrents; we need larger
> p2p-able sizes in general.
> 
> Torrent per pack - packs non-deterministically created, both between
> hosts and even intra-host (libz upgrade, nr_threads change, git pack
> algorithm optimization).
> 
> A new torrent format, if "close enough" to current git pack
> performance (cpu load, threadability, size) is potential for new
> version of git pack file format - we don't want to store two sets of
> pack files on disk, if sensible to not do so; unlikely to happen - I
> can't conceive that a torrentable format would be anything but worse
> than pack files and therefore would be rejected from git master.
> 
> Can we relax the perceived requirement to deterministically create
> pack files?
> Well, over what time period are pack files stable in a particular git?
> Over what time period do we require stable files for torrenting?
> 
> Can we simply configure our local git to keep specified pack files for
> specified time period?
> And use those for torrent-packs?
> Perhaps the torrent file could have a UseBy date?

Again, this is just too much complexity for so little gain.

Here's what I suggest:

	cd my_project
	BUNDLENAME=my_project_$(date "+%s").gitbundle
	git bundle create $BUNDLENAME --all
	maketorrent-console your_favorite_tracker $BUNDLENAME

Then start seeding that bundle, and upload $BUNDLENAME.torrent as 
bundle.torrent inside my_project.git on your server.

Now... Git clients could be improved to first check for the availability 
of the file "bundle.torrent" on the remote side, either directly in 
my_project.git, or through some Git protocol extension.  Or even better, 
the torrent hash could be stored in a Git ref, such as 
refs/bittorrent/bundle and the client could use that to retrieve the 
bundle.torrent file through some other means.

When the bundle.torrent file is retrieved, just pull the torrent
content (and seed it some more to be nice).  Then simply run "git clone"
using the original arguments, but with the obtained bundle instead of
the original URL.  Then replace the bundle file path in .git/config
with the actual remote URL.  Finally, perform a "git pull" to bring in
the new commits that were added to the remote repository since the
bundle was created.  That final pull will be small and quick.
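Those client-side steps can be sketched end to end with a local
stand-in for both the seeded bundle and the remote repository (paths
and names illustrative):

```shell
# Sketch of the described client flow, with local stand-ins for remote+torrent.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for the real remote repository; the bundle plays the torrent payload.
git init -q remote
git -C remote -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "old release"
git -C remote bundle create ../my_project.gitbundle --all HEAD
git -C remote -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "newer commit, not in the bundle"

# Client: clone from the torrent-obtained bundle instead of the URL...
git clone -q my_project.gitbundle my_project
# ...then point origin at the actual remote and pull the small remainder.
git -C my_project remote set-url origin "$work/remote"
git -C my_project pull -q origin HEAD
```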

After a while, that final pull will get bigger, as the difference
between the bundled version and the current tip of the remote
repository will grow.  So every so often, say every 3 months, it might
be a good idea to create a new bundle that includes the latest commits,
to make that final pull small and quick again.

Isn't that sufficient?


Nicolas


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  4:33     ` Nicolas Pitre
@ 2011-01-07  5:22       ` Jeff King
  2011-01-07  5:31         ` Jeff King
  2011-01-10 11:48       ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff King @ 2011-01-07  5:22 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Zenaan Harkness, git

On Thu, Jan 06, 2011 at 11:33:51PM -0500, Nicolas Pitre wrote:

> Here's what I suggest:
> 
> 	cd my_project
> 	BUNDLENAME=my_project_$(date "+%s").gitbundle
> 	git bundle create $BUNDLENAME --all
> 	maketorrent-console your_favorite_tracker $BUNDLENAME
> 
> Then start seeding that bundle, and upload $BUNDLENAME.torrent as 
> bundle.torrent inside my_project.git on your server.
> 
> Now... Git clients could be improved to first check for the availability 
> of the file "bundle.torrent" on the remote side, either directly in 
> my_project.git, or through some Git protocol extension.  Or even better, 
> the torrent hash could be stored in a Git ref, such as 
> refs/bittorrent/bundle and the client could use that to retrieve the 
> bundle.torrent file through some other means.

I really like the simplicity of this idea. It could even be generalized
to handle more traditional mirrors, too. Just slice up the refs/mirrors
namespace to provide different methods of fetching some initial set of
objects. For example, upload-pack might advertise (in addition to the
usual refs):

  refs/mirrors/bundle/torrent
  refs/mirrors/bundle/http
  refs/mirrors/fetch/git
  refs/mirrors/fetch/http

and the client can decide its preferred way of getting data: a bundle by
http or by torrent, or connecting directly to some other git repository
by git protocol or http. It would fetch the appropriate ref, which would
contain a blob in some method-specific format. For torrent, it would be
a torrent file. For the others, probably a newline-delimited set of
URLs. You could also provide a torrent-magnet ref if you didn't even
want to distribute the torrent file.

And no matter which method is used, at the end you have some set of
refs and objects, and you can re-try your (now much smaller) fetch. And
there are a few obvious optimizations:

  1. When you get the initial set of refs from the master, remember
     them. If the mirror actually satisfies everything you were going to
     fetch, then you don't even have to reconnect for the final fetch.

  2. You can optionally cache the mirror list and go straight to a
     mirror for future fetches instead of checking the master. This is
     only reasonable if the mirrors are kept up to date and provide good
     incremental access (i.e., only actual git-protocol mirrors, not a
     torrent or http file).
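The ref-plus-blob mechanics can be demonstrated with today's git; note
that refs/mirrors/* is the proposal in this mail, not an existing
convention, and the mirror URL below is made up:

```shell
# Sketch: serve a mirror list as a blob behind a ref, fetch just that ref.
set -e
work=$(mktemp -d)
cd "$work"

git init -q server
git -C server -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m init

# Server: write a newline-delimited mirror list as a blob and point a ref
# at it (refs/mirrors/* follows the proposal above, not a standard).
blob=$(printf 'http://mirror.example/repo.git\n' | git -C server hash-object -w --stdin)
git -C server update-ref refs/mirrors/fetch/http "$blob"

# Client: fetch only that ref, then read the method-specific blob content.
git init -q client
git -C client fetch -q "$work/server" refs/mirrors/fetch/http
git -C client cat-file blob FETCH_HEAD
```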

-Peff


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  5:22       ` Jeff King
@ 2011-01-07  5:31         ` Jeff King
  2011-01-07 10:04           ` Zenaan Harkness
  2011-01-07 18:52           ` Ilari Liusvaara
  0 siblings, 2 replies; 22+ messages in thread
From: Jeff King @ 2011-01-07  5:31 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Zenaan Harkness, git

On Fri, Jan 07, 2011 at 12:22:07AM -0500, Jeff King wrote:

>   refs/mirrors/bundle/torrent
>   refs/mirrors/bundle/http
>   refs/mirrors/fetch/git
>   refs/mirrors/fetch/http
> 
> and the client can decide its preferred way of getting data: a bundle by
> http or by torrent, or connecting directly to some other git repository
> by git protocol or http. It would fetch the appropriate ref, which would
> contain a blob in some method-specific format. For torrent, it would be
> a torrent file. For the others, probably a newline-delimited set of
> URLs. You could also provide a torrent-magnet ref if you didn't even
> want to distribute the torrent file.
> 
> And no matter which method is used, at the end you have some set of refs
> and objects, and you can re-try your (now much smaller) fetch.

And I think it is probably obvious to you, Nicolas, since these are
problems you have been thinking about for some time, but the reason I am
interested in this expanded definition of mirroring is for a few
features people have been asking for:

  1. restartable clone; any bundle format is easily restartable using
     standard protocols

  2. avoid too-big clones; I remember the gentoo folks wanting to
     disallow full clones from their actual dev machines and push people
     off to some more static method of pulling. I think not just because
     of restartability, but because of the load on the dev machines

  3. people on low-bandwidth servers who fork major projects; if I write
     three kernel patches and host a git server, I would really like
     people to only fetch my patches from me and get the rest of it from
     kernel.org

-Peff


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  5:31         ` Jeff King
@ 2011-01-07 10:04           ` Zenaan Harkness
  2011-01-07 18:52           ` Ilari Liusvaara
  1 sibling, 0 replies; 22+ messages in thread
From: Zenaan Harkness @ 2011-01-07 10:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Nicolas Pitre, git

On Fri, Jan 7, 2011 at 16:31, Jeff King <peff@peff.net> wrote:
> On Fri, Jan 07, 2011 at 12:22:07AM -0500, Jeff King wrote:
> the reason I am
> interested in this expanded definition of mirroring is for a few
> features people have been asking for:
>
>  1. restartable clone; any bundle format is easily restartable using
>     standard protocols

This is very important to me.  I have failed to establish an initial
clone of a few larger projects - some Apache projects, and opentaps
most recently.  It is getting _really_ frustrating.


>  2. avoid too-big clones; I remember the gentoo folks wanting to
>     disallow full clones from their actual dev machines and push people
>     off to some more static method of pulling. I think not just because
>     of restartability, but because of the load on the dev machines

And of course the lack of restartability causes an ongoing increase in
the load on the machines delivering those large clones.


>  3. people on low-bandwidth servers who fork major projects; if I write
>     three kernel patches and host a git server, I would really like
>     people to only fetch my patches from me and get the rest of it from
>     kernel.org

This is not so much of a problem - it can already be handled by
cloning your linux-full.git to a private dir and publishing only your
shallow "personal patches only" clone, or better still, just a tarball
of your 3 patches, or emailing them, etc.


So I agree with the big issues being restartable large clones and
lowering server loads.

Zen


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  5:31         ` Jeff King
  2011-01-07 10:04           ` Zenaan Harkness
@ 2011-01-07 18:52           ` Ilari Liusvaara
  2011-01-07 19:17             ` Jeff King
  2011-01-10 21:07             ` Sam Vilain
  1 sibling, 2 replies; 22+ messages in thread
From: Ilari Liusvaara @ 2011-01-07 18:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 12:31:19AM -0500, Jeff King wrote:
> 
>   3. people on low-bandwidth servers who fork major projects; if I write
>      three kernel patches and host a git server, I would really like
>      people to only fetch my patches from me and get the rest of it from
>      kernel.org

One client-side-only feature that could be useful:

The ability to contact multiple servers in sequence, each time
advertising everything obtained so far, and then to treat the new repo
as a clone of the last address.

This would be very handy if, e.g., you happen to have a local mirror
of, say, the Linux kernel and want to fetch some related project
without messing with alternates or downloading everything again:

git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo

This would first fetch everything from the local source and then update
from the remote, likely being vastly faster.
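No such --use-mirror option exists today; its effect can be
approximated with existing commands (clone the mirror, repoint origin,
fetch the difference).  The repo names below are local stand-ins:

```shell
# Sketch: approximate the proposed --use-mirror with clone + remote set-url + pull.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-ins: a big local mirror and a fork holding one extra commit.
git init -q linux-2.6
git -C linux-2.6 -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "shared history"
git clone -q linux-2.6 linux-foo
git -C linux-foo -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "fork-only patch"

# The bulk of the data comes from the local mirror...
git clone -q "$work/linux-2.6" project
# ...then origin is repointed at the real project and only the delta moves.
git -C project remote set-url origin "$work/linux-foo"
git -C project pull -q origin HEAD
```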

-Ilari


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 18:52           ` Ilari Liusvaara
@ 2011-01-07 19:17             ` Jeff King
  2011-01-07 21:45               ` Ilari Liusvaara
  2011-01-10 21:07             ` Sam Vilain
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff King @ 2011-01-07 19:17 UTC (permalink / raw)
  To: Ilari Liusvaara; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 08:52:18PM +0200, Ilari Liusvaara wrote:

> On Fri, Jan 07, 2011 at 12:31:19AM -0500, Jeff King wrote:
> > 
> >   3. people on low-bandwidth servers who fork major projects; if I write
> >      three kernel patches and host a git server, I would really like
> >      people to only fetch my patches from me and get the rest of it from
> >      kernel.org
> 
> One client-side-only feature that could be useful:
> 
> Ability to contact multiple servers in sequence, each time advertising
> everything obtained so far. Then treat the new repo as clone of the last
> address.
> 
> This would e.g. be very handy if you happen to have local mirror of say, Linux
> kernel and want to fetch some related project without messing with alternates
> or downloading everything again:
> 
> git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
> 
> This would first fetch everything from local source and then update that
> from remote, likely being vastly faster.

I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a
repo? In that case, isn't that basically the same as --reference? Or is
it a local mirror list?

If the latter, then yeah, I think it is a good idea. Clients should
definitely be able to ignore, override, or add to mirror lists provided
by servers. The server can provide hints about useful mirrors, but it is
up to the client to decide which methods are useful to it and which
mirrors are closest.

Of course there are some servers who will want to do more than hint
(e.g., the gentoo case where they really don't want people cloning from
the main machine). For those cases, though, I think it is best to
provide the hint and to reject clients who don't follow it (e.g., by
barfing on somebody who tries to do a full clone). You have to implement
that rejection layer anyway for older clients.

-Peff


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 19:17             ` Jeff King
@ 2011-01-07 21:45               ` Ilari Liusvaara
  2011-01-07 21:56                 ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Ilari Liusvaara @ 2011-01-07 21:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 02:17:19PM -0500, Jeff King wrote:
> On Fri, Jan 07, 2011 at 08:52:18PM +0200, Ilari Liusvaara wrote:
> 
> > 
> > git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
> > 
> > This would first fetch everything from local source and then update that
> > from remote, likely being vastly faster.
> 
> I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a
> repo? In that case, isn't that basically the same as --reference? Or is
> it a local mirror list?

Yes, it is a repo. No, it isn't the same as --reference. It is a list
of mirrors to try first before connecting to the final repository, and
each entry can be any type of repository URL (local, true smart
transport, smart HTTP, dumb HTTP, etc...).

The idea is that you have a list of mirrors that are faster than the
final repository, but not necessarily complete. You want to download
most of the stuff from them.

> If the latter, then yeah, I think it is a good idea. Clients should
> definitely be able to ignore, override, or add to mirror lists provided
> by servers. The server can provide hints about useful mirrors, but it is
> up to the client to decide which methods are useful to it and which
> mirrors are closest.

This is essentially adding mirrors to the mirror list (modulo that
mirrors are not assumed to be complete).

Security:

Confidentiality: The connection to a mirror must traverse only trusted
links, or be encrypted, if material from the mirror is sensitive.

Integrity: The same integrity as the connection to the final repo
(assuming SHA-1 can't be collided), due to the fact that git object
naming is securely unique.

> Of course there are some servers who will want to do more than hint
> (e.g., the gentoo case where they really don't want people cloning from
> the main machine). For those cases, though, I think it is best to
> provide the hint and to reject clients who don't follow it (e.g., by
> barfing on somebody who tries to do a full clone). You have to implement
> that rejection layer anyway for older clients.

With an option like this, a client could do:

git clone --use-mirror=http://git.example.org/base/foo git://git.example.org/foo

This would first grab stuff via HTTP (well-packed dumb HTTP is very
light on the server) and then continue via git:// (now much cheaper
because the client is relatively up to date).

-Ilari


* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 21:45               ` Ilari Liusvaara
@ 2011-01-07 21:56                 ` Jeff King
  2011-01-07 22:21                   ` Ilari Liusvaara
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2011-01-07 21:56 UTC (permalink / raw)
  To: Ilari Liusvaara; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote:

> > I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a
> > repo? In that case, isn't that basically the same as --reference? Or is
> > it a local mirror list?
> 
> Yes, it is a repo. No, it isn't the same as --reference. It is list
> of mirrors to try first before connecting to final repository and can
> be any type of repository URL (local, true smart transport, smart HTTP,
> dumb HTTP, etc...)

OK, I understand what you mean. I was thrown off by your example using a
local repository (in which case you probably would want --reference to
save disk space, unless the burden of alternates management is too
much).

So yeah, I think we are on the same page, except that you were
proposing to pass the mirror directly, and I was proposing to pass a
mirror file containing a list of mirrors.  Yours is much simpler and is
probably what people want most of the time.

> > If the latter, then yeah, I think it is a good idea. Clients should
> > definitely be able to ignore, override, or add to mirror lists provided
> > by servers. The server can provide hints about useful mirrors, but it is
> > up to the client to decide which methods are useful to it and which
> > mirrors are closest.
> 
> This is essentially adding mirrors to mirror list (modulo that mirrors
> are not assumed to be complete).

I think there should always be an assumption that mirrors are not
necessarily complete. That is necessary for bundle-like mirrors to be
feasible, since updating the bundle for every commit defeats the
purpose.

It would be nice for there to be a way for some mirrors to be marked as
"should be considered complete and authoritative", since we can optimize
out the final check of the master in that case (as well as for future
fetches). But that's a future feature. My plan was to leave space in the
mirror list for arbitrary metadata of that sort.

> > Of course there are some servers who will want to do more than hint
> > (e.g., the gentoo case where they really don't want people cloning from
> > the main machine). For those cases, though, I think it is best to
> > provide the hint and to reject clients who don't follow it (e.g., by
> > barfing on somebody who tries to do a full clone). You have to implement
> > that rejection layer anyway for older clients.
> 
> With option like this, a client could do:
> 
> git clone --use-mirror=http://git.example.org/base/foo git://git.example.org/foo
> 
> To first grab stuff via HTTP (well-packed dumb HTTP is very light on the
> server) and then continue via git:// (now much cheaper because client is
> relatively up to date).

Yes, exactly.
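
For illustration, that two-phase flow can be approximated locally with commands that exist today: a bare clone stands in for the well-packed HTTP mirror, and the original repository for the git:// server (all paths below are made up for the sketch):

```shell
# Set up a "server" and a slightly stale "mirror" of it
git init -q upstream
git -C upstream -c user.name=t -c user.email=t@t commit -q --allow-empty -m old
git clone -q --bare upstream mirror.git        # the stale, well-packed mirror
git -C upstream -c user.name=t -c user.email=t@t commit -q --allow-empty -m new

# Client: bulk fetch from the mirror first, then a cheap top-up fetch
git init -q client && cd client
git fetch -q ../mirror.git '+refs/heads/*:refs/remotes/mirror/*'   # bulk phase
git fetch -q ../upstream   '+refs/heads/*:refs/remotes/origin/*'   # only "new" travels
```

The second fetch is cheap precisely because the mirror already supplied almost everything, which is the point of --use-mirror.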

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 21:56                 ` Jeff King
@ 2011-01-07 22:21                   ` Ilari Liusvaara
  2011-01-07 22:27                     ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Ilari Liusvaara @ 2011-01-07 22:21 UTC (permalink / raw)
  To: Jeff King; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 04:56:31PM -0500, Jeff King wrote:
> On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote:
> 
> 
> I think there should always be an assumption that mirrors are not
> necessarily complete. That is necessary for bundle-like mirrors to be
> feasible, since updating the bundle for every commit defeats the
> purpose.

Also add a protocol that grabs a bundle from HTTP and then opens it
up? :-)

> It would be nice for there to be a way for some mirrors to be marked as
> "should be considered complete and authoritative", since we can optimize
> out the final check of the master in that case (as well as for future
> fetches). But that's a future feature. My plan was to leave space in the
> mirror list for arbitrary metadata of that sort.

The first thing one should get when connecting to another repository
is its list of references.  One can see from there whether what one has
got is complete or not (so --use-mirror only allows skipping commit
negotiation and the fetch, not the whole connection, since the
repositories are contacted in order)...

-Ilari

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 22:21                   ` Ilari Liusvaara
@ 2011-01-07 22:27                     ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2011-01-07 22:27 UTC (permalink / raw)
  To: Ilari Liusvaara; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Sat, Jan 08, 2011 at 12:21:33AM +0200, Ilari Liusvaara wrote:

> On Fri, Jan 07, 2011 at 04:56:31PM -0500, Jeff King wrote:
> > On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote:
> > 
> > 
> > I think there should always be an assumption that mirrors are not
> > necessarily complete. That is necessary for bundle-like mirrors to be
> > feasible, since updating the bundle for every commit defeats the
> > purpose.
> 
> Also add a protocol that grabs a bundle from HTTP and then opens it
> up? :-)

Well, yes, that still needs to be implemented. But it's all client-side,
so the server just has to provide the bundle somewhere.
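
Each client-side step already works by hand; only the glue is missing. A fully local sketch (repository and file names are invented for the example):

```shell
# "Server" publishes a bundle, then gains one more commit
git init -q server
git -C server -c user.name=t -c user.email=t@t commit -q --allow-empty -m one
git -C server bundle create ../my_project.bundle HEAD --all
git -C server -c user.name=t -c user.email=t@t commit -q --allow-empty -m two

# Client: clone from the static bundle (the resumable download),
# then repoint at the live server and fetch the remainder
git clone -q my_project.bundle my_project
cd my_project
git remote set-url origin ../server
git fetch -q origin                    # transfers only commit "two"
```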

> > It would be nice for there to be a way for some mirrors to be marked as
> > "should be considered complete and authoritative", since we can optimize
> > out the final check of the master in that case (as well as for future
> > fetches). But that's a future feature. My plan was to leave space in the
> > mirror list for arbitrary metadata of that sort.
> 
> The first thing one should get when connecting to another repository
> is its list of references.  One can see from there whether what one has
> got is complete or not (so --use-mirror only allows skipping commit
> negotiation and the fetch, not the whole connection, since the
> repositories are contacted in order)...

Yes, but it would be cool to be able to skip even that connect in some
cases (e.g., mirrors can be useful not just to take load off the master,
but also when the master isn't available, either for downtime or because
the client is behind a firewall). But the default should definitely be
to double-check that the master is right, and we can leave more advanced
cases for later (we just need to be aware of leaving room for them now).

I'm going to start working on a patch series for this, so hopefully
we'll see how it's shaping up in a day or two.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  4:33     ` Nicolas Pitre
  2011-01-07  5:22       ` Jeff King
@ 2011-01-10 11:48       ` Nguyen Thai Ngoc Duy
  2011-01-10 13:50         ` Nicolas Pitre
  1 sibling, 1 reply; 22+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-10 11:48 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Zenaan Harkness, git

On Fri, Jan 7, 2011 at 11:33 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
> Here's what I suggest:
>
>        cd my_project
>        BUNDLENAME=my_project_$(date "+%s").gitbundle
>        git bundle create $BUNDLENAME --all
>        maketorrent-console your_favorite_tracker $BUNDLENAME
>
> Then start seeding that bundle, and upload $BUNDLENAME.torrent as
> bundle.torrent inside my_project.git on your server.

I was about to ask if we could put more "trailer" sha-1 checksums into
the bundle, so we can verify which part is corrupt without
re-downloading the whole thing (this is over http/ftp, not torrent).

But I realize it's easier to split the bundle into multiple packs, so
we can verify and re-download only the corrupt ones. Logically it is
still a single pack. Splitting lets us put more sha-1 checksums in
without changing the pack format. The packs can be merged back into one
with the "index-pack --pack-stream" patch I sent elsewhere.
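
As a rough sketch of the idea with plain tools: per-chunk checksums let a client re-fetch only the damaged piece. This uses raw byte chunks in place of the self-contained packs proposed above, and all file names are illustrative:

```shell
# Stand-in for a real bundle (3 MB of random data)
head -c 3000000 /dev/urandom > my_project.bundle

# Publisher: split into chunks and record a checksum per chunk
split -b 1000000 my_project.bundle chunk.
sha1sum chunk.* > chunks.sha1

# ...client fetches the chunks and chunks.sha1 over dumb HTTP/FTP...

# Client: verify (names exactly which chunk is corrupt, if any)
sha1sum -c --quiet chunks.sha1
cat chunk.* > rebuilt.bundle       # lexicographic glob order restores the file
cmp -s my_project.bundle rebuilt.bundle && echo "reassembled OK"
```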
-- 
Duy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-10 11:48       ` Nguyen Thai Ngoc Duy
@ 2011-01-10 13:50         ` Nicolas Pitre
  0 siblings, 0 replies; 22+ messages in thread
From: Nicolas Pitre @ 2011-01-10 13:50 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Zenaan Harkness, git

On Mon, 10 Jan 2011, Nguyen Thai Ngoc Duy wrote:

> On Fri, Jan 7, 2011 at 11:33 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
> > Here's what I suggest:
> >
> >        cd my_project
> >        BUNDLENAME=my_project_$(date "+%s").gitbundle
> >        git bundle create $BUNDLENAME --all
> >        maketorrent-console your_favorite_tracker $BUNDLENAME
> >
> > Then start seeding that bundle, and upload $BUNDLENAME.torrent as
> > bundle.torrent inside my_project.git on your server.
> 
> I was about to ask if we could put more "trailer" sha-1 checksums into
> the bundle, so we can verify which part is corrupt without
> re-downloading the whole thing (this is over http/ftp, not torrent).

Aren't HTTP and FTP based on TCP, which is meant to be a reliable
transport protocol already?  In that case, isn't the final SHA-1
embedded in the bundle/pack sufficient?  Normally, your HTTP/FTP client
should get you all the data or partial data, but not wrong data.
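
Indeed, the last 20 bytes of a pack (including the pack embedded in a bundle) are the SHA-1 of everything that precedes them, so a truncated download is always detectable.  A quick local check on a freshly built pack (assuming the default SHA-1 object format):

```shell
# Build a tiny repo and pack it
git init -q demo && cd demo
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m init
git repack -adq

# Recompute the trailer checksum and compare with the stored one
p=$(ls .git/objects/pack/pack-*.pack)
size=$(wc -c < "$p")
body=$(head -c $((size - 20)) "$p" | sha1sum | cut -d' ' -f1)
trailer=$(tail -c 20 "$p" | od -An -tx1 | tr -d ' \n')
test "$body" = "$trailer" && echo "trailer checksum intact"
```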


Nicolas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06 17:05 ` Shawn Pearce
@ 2011-01-10 16:39   ` John Wyzer
  2011-01-10 21:42     ` Sam Vilain
  2011-01-11  0:03     ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 22+ messages in thread
From: John Wyzer @ 2011-01-10 16:39 UTC (permalink / raw)
  To: git

On 06/01/11 18:05, Shawn Pearce wrote:
> On Wed, Jan 5, 2011 at 18:29, Zenaan Harkness<zen@freedbms.net>  wrote:
>> Bittorrent requires some stability around torrent files.
>>
>> Can packs be generated deterministically?
>

I hope that I don't get something technically wrong (I did not read any 
code, only skimmed the docs) and that this question is not redundant:

Why not provide an alternative mode for the git:// protocol that, 
instead of retrieving one big packed blob, breaks the transfer down to 
the smallest atomic objects in the repository? Those objects do not 
change and should be able to survive partial transfers.
While this might not be as efficient network-traffic-wise, it would 
provide a solution for those behind unreliable connections.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 18:52           ` Ilari Liusvaara
  2011-01-07 19:17             ` Jeff King
@ 2011-01-10 21:07             ` Sam Vilain
  1 sibling, 0 replies; 22+ messages in thread
From: Sam Vilain @ 2011-01-10 21:07 UTC (permalink / raw)
  To: Ilari Liusvaara
  Cc: Jeff King, Nicolas Pitre, Zenaan Harkness, git, Shawn Pearce,
	Nguyen Thai Ngoc Duy, Joshua Roys, Nick Edelen, Jonas Fonseca

On 08/01/11 07:52, Ilari Liusvaara wrote:
> Ability to contact multiple servers in sequence, each time advertising
> everything obtained so far. Then treat the new repo as clone of the last
> address.
>
> This would e.g. be very handy if you happen to have local mirror of say, Linux
> kernel and want to fetch some related project without messing with alternates
> or downloading everything again:
>
> git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
>
> This would first fetch everything from local source and then update that
> from remote, likely being vastly faster.

Coming to this discussion a little late, I'll summarise the previous
research.

First, the idea of applying the straight BitTorrent protocol to the pack
files was raised, but as Nicolas mentions, this is not useful because
the pack files are not deterministic.  The protocol was revisited based
around the part which is stable, object manifests.  The RFC is at
http://utsl.gen.nz/gittorrent/rfc.html and the prototype code (an
unsuccessful GSoC project) is at http://repo.or.cz/w/VCS-Git-Torrent.git

After some thought, I decided that the BitTorrent protocol itself is all
cruft and that trying to cut it down to something useful was a waste of time.
So, this is where the idea of "automatic mirroring" came from.  With
Automatic Mirroring, the two main functions of P2P operation - peer
discovery and partial transfer - are broken into discrete features.

I wrote this patch series so far, for "client-side mirroring":

http://thread.gmane.org/gmane.comp.version-control.git/133626/focus=133628

The later levels are roughly discussed on this page:

http://code.google.com/p/gittorrent/wiki/MirrorSync

The "mirror sync" part is the complicated one, and as others have noted,
no truly successful prototype has yet been built.  Actually, the Perl
gittorrent implementation did manage to perform an incremental clone; it
just didn't wrap it up nicely.  But I won't go into that too much.
There was also another GSoC project to look at caching the object-list
generation, the most expensive part of the process in the Perl
implementation.  This was a generic mechanism for accelerating object-graph
traversal and showed promise; unfortunately, it was never merged.

The client-side mirroring patch, in its current form, already supports
out-of-date mirrors.  It saves refs first into
'refs/mirrors/hostname/...' and finally contacts the main server to
check what objects it is still missing.  So, if there was a regular
bittorrent+bundle transport available, it would be a useful way to
support an incremental clone; the client would first clone the (static)
bittorrent bundle, unpack it with its refs into the 'refs/mirrors/xxx/'
namespace, making the subsequent 'git fetch' to get the most recent
objects a much more efficient operation.
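
A rough local re-enactment of that flow, with a bundle playing the static mirror and a sibling repository playing the master (names are illustrative; the actual series wires this into fetch itself):

```shell
# "Master" publishes a bundle, then moves ahead by one commit
git init -q master-repo
git -C master-repo -c user.name=t -c user.email=t@t commit -q --allow-empty -m one
git -C master-repo bundle create ../mirror.bundle HEAD --all
git -C master-repo -c user.name=t -c user.email=t@t commit -q --allow-empty -m two

git init -q client && cd client
# 1. unpack the mirror's refs into a refs/mirrors/ namespace
git fetch -q ../mirror.bundle '+refs/heads/*:refs/mirrors/example.org/*'
# 2. contact the "master"; only the objects the bundle lacked travel
git fetch -q ../master-repo '+refs/heads/*:refs/remotes/origin/*'
```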

Hope that helps!

Cheers,
Sam

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-10 16:39   ` John Wyzer
@ 2011-01-10 21:42     ` Sam Vilain
  2011-01-11  0:03     ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 22+ messages in thread
From: Sam Vilain @ 2011-01-10 21:42 UTC (permalink / raw)
  To: John Wyzer; +Cc: git

On 11/01/11 05:39, John Wyzer wrote:
> Why not provide an alternative mode for the git:// protocol that
> instead of retrieving a big packaged blob breaks this down to the
> smallest atomic objects from the repository? Those are not changing
> and should be able to survive partial transfers.
> While this might not be as efficient network traffic-wise it would
> provide a solution for those behind breaking connections.

To put this into numbers: for perl.git that might mean transferring 2GB
of data instead of a 70MB pack.
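
That gap is easy to reproduce on a throwaway repository by comparing loose-object size before packing with the pack size afterwards (the repository below is invented for the measurement):

```shell
# Three commits touching one file that changes only slightly each time
git init -q sizes && cd sizes
for i in 1 2 3; do
  seq 1 2000 > data.txt && echo "rev $i" >> data.txt
  git add data.txt
  git -c user.name=t -c user.email=t@t commit -q -m "rev $i"
done

git count-objects -v | grep '^size:'        # loose objects, before packing
git repack -adq
git count-objects -v | grep '^size-pack:'   # one delta-compressed pack, far smaller
```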

Sam

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-10 16:39   ` John Wyzer
  2011-01-10 21:42     ` Sam Vilain
@ 2011-01-11  0:03     ` Nguyen Thai Ngoc Duy
  2011-01-11  0:57       ` J.H.
  1 sibling, 1 reply; 22+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-11  0:03 UTC (permalink / raw)
  To: John Wyzer; +Cc: git

On Mon, Jan 10, 2011 at 11:39 PM, John Wyzer <john.wyzer@gmx.de> wrote:
> Why not provide an alternative mode for the git:// protocol that instead of
> retrieving a big packaged blob breaks this down to the smallest atomic
> objects from the repository? Those are not changing and should be able to
> survive partial transfers.
> While this might not be as efficient network traffic-wise it would provide a
> solution for those behind breaking connections.

That's what I'm getting to, except that I'll send deltas as much as I can.
-- 
Duy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-11  0:03     ` Nguyen Thai Ngoc Duy
@ 2011-01-11  0:57       ` J.H.
  2011-01-11  1:56         ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 22+ messages in thread
From: J.H. @ 2011-01-11  0:57 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: John Wyzer, git

On 01/10/2011 04:03 PM, Nguyen Thai Ngoc Duy wrote:
> On Mon, Jan 10, 2011 at 11:39 PM, John Wyzer <john.wyzer@gmx.de> wrote:
>> Why not provide an alternative mode for the git:// protocol that instead of
>> retrieving a big packaged blob breaks this down to the smallest atomic
>> objects from the repository? Those are not changing and should be able to
>> survive partial transfers.
>> While this might not be as efficient network traffic-wise it would provide a
>> solution for those behind breaking connections.
> 
> That's what I'm getting to, except that I'll send deltas as much as I can.

While I think we need to come up with a mechanism to allow for resumable
fetches (I'm thinking of slow, sporadic links and larger repos like the
kernel, for instance), breaking the repo up into chunks that are too
small will adversely affect the overall transfer and could cause just as
much system thrash on the upstream provider.

I'd be curious to see what the system-impact and performance numbers
are, though, as I do think getting some sort of resumability is
important; but resumability at the expense of being able to get the
data out quickly and efficiently is not going to be a good trade-off :-/

- John 'Warthog9' Hawley

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-11  0:57       ` J.H.
@ 2011-01-11  1:56         ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 22+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-11  1:56 UTC (permalink / raw)
  To: J.H.; +Cc: John Wyzer, git

On Tue, Jan 11, 2011 at 7:57 AM, J.H. <warthog9@kernel.org> wrote:
> On 01/10/2011 04:03 PM, Nguyen Thai Ngoc Duy wrote:
>> On Mon, Jan 10, 2011 at 11:39 PM, John Wyzer <john.wyzer@gmx.de> wrote:
>>> Why not provide an alternative mode for the git:// protocol that instead of
>>> retrieving a big packaged blob breaks this down to the smallest atomic
>>> objects from the repository? Those are not changing and should be able to
>>> survive partial transfers.
>>> While this might not be as efficient network traffic-wise it would provide a
>>> solution for those behind breaking connections.
>>
>> That's what I'm getting to, except that I'll send deltas as much as I can.
>
> While I think we need to come up with a mechanism to allow for resumable
> fetches (I'm thinking of slow, sporadic links and larger repos like the
> kernel, for instance), breaking the repo up into chunks that are too
> small will adversely affect the overall transfer and could cause just as
> much system thrash on the upstream provider.
>
> I'd be curious to see what the system-impact and performance numbers
> are, though, as I do think getting some sort of resumability is
> important; but resumability at the expense of being able to get the
> data out quickly and efficiently is not going to be a good trade-off :-/

Yeah, I'm interested in those numbers too. Let me get a prototype
working, then we'll have numbers to discuss.
-- 
Duy

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2011-01-11  1:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-06  2:29 Resumable clone/Gittorrent (again) - stable packs? Zenaan Harkness
2011-01-06 17:05 ` Shawn Pearce
2011-01-10 16:39   ` John Wyzer
2011-01-10 21:42     ` Sam Vilain
2011-01-11  0:03     ` Nguyen Thai Ngoc Duy
2011-01-11  0:57       ` J.H.
2011-01-11  1:56         ` Nguyen Thai Ngoc Duy
2011-01-06 21:09 ` Nicolas Pitre
2011-01-07  2:36   ` Zenaan Harkness
2011-01-07  4:33     ` Nicolas Pitre
2011-01-07  5:22       ` Jeff King
2011-01-07  5:31         ` Jeff King
2011-01-07 10:04           ` Zenaan Harkness
2011-01-07 18:52           ` Ilari Liusvaara
2011-01-07 19:17             ` Jeff King
2011-01-07 21:45               ` Ilari Liusvaara
2011-01-07 21:56                 ` Jeff King
2011-01-07 22:21                   ` Ilari Liusvaara
2011-01-07 22:27                     ` Jeff King
2011-01-10 21:07             ` Sam Vilain
2011-01-10 11:48       ` Nguyen Thai Ngoc Duy
2011-01-10 13:50         ` Nicolas Pitre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).