* Resumable clone/Gittorrent (again) - stable packs?
@ 2011-01-06  2:29 Zenaan Harkness
  2011-01-06 17:05 ` Shawn Pearce
  2011-01-06 21:09 ` Nicolas Pitre
  0 siblings, 2 replies; 22+ messages in thread
From: Zenaan Harkness @ 2011-01-06 2:29 UTC (permalink / raw)
To: git

Bittorrent requires some stability around torrent files.

Can packs be generated deterministically? If not by two separate repos, what
about by one particular repo? For Linus' linux-2.6.git, that repo is
considered 'canonical' by many.

Pack-torrents could be ~1MiB, ~10MiB, ~100MiB, ~1GiB, or as configured in a
particular repo; that repo is then the canonical location for pack-torrents
for all who consider it canonical.

Perhaps a heuristic/algorithm: once ten 10MiB (sequentially generated)
pack-torrents are floating around, they could simply be concatenated to
create a 100MiB pack-torrent, with a deterministic name and SHA etc., so
that all those 10MiB pack-torrent files that torrent clients already have
can be re-used and locally combined into the 100MiB torrent as needed, on
demand. Same for 100MiB -> 1GiB pack-torrents.

Individual extra commits: while a "small" number of additional commits go
into a repo, clients fall back to git-fetch.

If Linus' linux-2.6.git (the currently configured "canonical" repo) goes
offline, simply configure a new remote canonical repo.

Branches: other "branches" repos of linux-2.6.git could create their own
consistent 50MiB (or as configured) pack-torrents containing only the
commits missing from that repo's "canonical" upstream (ie, those missing
from linux-2.6.git). This would require clients to have a recursive torrent
locator (I start at linux-net.git, which requires linux-2.6.git, so I go
get those packs as well as the linux-net.git packs). Perhaps have a
system-wide or user-wide git repo/torrent config, or ask the user running
git-clone linux-net.git: "Do you have an existing
git.vger.kernel.org/linux-2.6.git archive?".
Zen ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06  2:29 Resumable clone/Gittorrent (again) - stable packs? Zenaan Harkness
@ 2011-01-06 17:05 ` Shawn Pearce
  2011-01-10 16:39   ` John Wyzer
  2011-01-06 21:09 ` Nicolas Pitre
  1 sibling, 1 reply; 22+ messages in thread
From: Shawn Pearce @ 2011-01-06 17:05 UTC (permalink / raw)
To: Zenaan Harkness; +Cc: git

On Wed, Jan 5, 2011 at 18:29, Zenaan Harkness <zen@freedbms.net> wrote:
> Bittorrent requires some stability around torrent files.
>
> Can packs be generated deterministically?

No. We have been trying to avoid doing that, because it ties us into one
particular compression scheme. We can't tune the algorithm and get better
compression later, because it would generate a different pack.

We also rely on the system's libz to generate the compressed data. A
version change to libz may generate a different encoding for the same
uncompressed data, simply because of a tweak to how the compression is
performed. Likewise, our own delta compression code can be tweaked to
produce a different (but logically identical) delta between the same two
objects.

Right now packs aren't deterministic because they use multiple threads to
generate the deltas; the thread scheduling affects which base objects
deltas are tried against, because threads can steal work from each other if
one finishes before the others. Disabling threading entirely slows down
delta compression considerably on multi-core machines, but does remove this
work-stealing, making the pack deterministic... but only for this exact Git
binary, with this same shared libz. If the system libz or Git changes, all
bets are off.

We've been down this road before; we don't want to box ourselves into a
tight corner by fixing these tunable portions of the compression algorithms
for all time.

-- 
Shawn.
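[Editor's note: Shawn's libz point can be illustrated with a small sketch.
Python's zlib stands in here for the system libz; the pack format itself is
not involved. The same input compresses to different byte streams under
different settings, yet decompresses to identical data - which is the only
stability Git actually guarantees.]

```python
import zlib

# The same logical content (a stand-in for an uncompressed Git object).
data = b"the quick brown fox jumps over the lazy dog\n" * 200

# Two "implementations": think of a libz upgrade, or a different
# compression level chosen by the packing code.
stream_a = zlib.compress(data, 1)  # fast, larger output
stream_b = zlib.compress(data, 9)  # slow, smaller output

# The compressed bytes differ, so a pack containing them would differ...
print(stream_a == stream_b)               # False
# ...but both inflate back to the identical canonical object data.
print(zlib.decompress(stream_a) == data)  # True
print(zlib.decompress(stream_b) == data)  # True
```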
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06 17:05 ` Shawn Pearce
@ 2011-01-10 16:39   ` John Wyzer
  2011-01-10 21:42     ` Sam Vilain
  2011-01-11  0:03     ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 22+ messages in thread
From: John Wyzer @ 2011-01-10 16:39 UTC (permalink / raw)
To: git

On 06/01/11 18:05, Shawn Pearce wrote:
> On Wed, Jan 5, 2011 at 18:29, Zenaan Harkness <zen@freedbms.net> wrote:
>> Bittorrent requires some stability around torrent files.
>>
>> Can packs be generated deterministically?

I hope that I don't get something technically wrong (I did not read any
code, only skimmed the docs) and that this question is not redundant:

Why not provide an alternative mode for the git:// protocol that, instead
of retrieving one big packaged blob, breaks the transfer down to the
smallest atomic objects in the repository? Those do not change and should
be able to survive partial transfers.

While this might not be as efficient network-traffic-wise, it would provide
a solution for those on connections that keep breaking.
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-10 16:39 ` John Wyzer
@ 2011-01-10 21:42   ` Sam Vilain
  2011-01-11  0:03   ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 22+ messages in thread
From: Sam Vilain @ 2011-01-10 21:42 UTC (permalink / raw)
To: John Wyzer; +Cc: git

On 11/01/11 05:39, John Wyzer wrote:
> Why not provide an alternative mode for the git:// protocol that,
> instead of retrieving one big packaged blob, breaks the transfer down
> to the smallest atomic objects in the repository? Those do not change
> and should be able to survive partial transfers.
> While this might not be as efficient network-traffic-wise, it would
> provide a solution for those on connections that keep breaking.

To put this into numbers: for perl.git that might mean transferring 2GB of
data instead of 70MB of pack.

Sam
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-10 16:39 ` John Wyzer
  2011-01-10 21:42 ` Sam Vilain
@ 2011-01-11  0:03   ` Nguyen Thai Ngoc Duy
  2011-01-11  0:57     ` J.H.
  1 sibling, 1 reply; 22+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-11 0:03 UTC (permalink / raw)
To: John Wyzer; +Cc: git

On Mon, Jan 10, 2011 at 11:39 PM, John Wyzer <john.wyzer@gmx.de> wrote:
> Why not provide an alternative mode for the git:// protocol that,
> instead of retrieving one big packaged blob, breaks the transfer down
> to the smallest atomic objects in the repository? Those do not change
> and should be able to survive partial transfers.
> While this might not be as efficient network-traffic-wise, it would
> provide a solution for those on connections that keep breaking.

That's what I'm getting to, except that I'll send deltas as much as I can.
-- 
Duy
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-11  0:03 ` Nguyen Thai Ngoc Duy
@ 2011-01-11  0:57   ` J.H.
  2011-01-11  1:56     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 22+ messages in thread
From: J.H. @ 2011-01-11 0:57 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: John Wyzer, git

On 01/10/2011 04:03 PM, Nguyen Thai Ngoc Duy wrote:
> On Mon, Jan 10, 2011 at 11:39 PM, John Wyzer <john.wyzer@gmx.de> wrote:
>> Why not provide an alternative mode for the git:// protocol that,
>> instead of retrieving one big packaged blob, breaks the transfer down
>> to the smallest atomic objects in the repository? Those do not change
>> and should be able to survive partial transfers.
>> While this might not be as efficient network-traffic-wise, it would
>> provide a solution for those on connections that keep breaking.
>
> That's what I'm getting to, except that I'll send deltas as much as I can.

While I think we need to come up with a mechanism to allow resumable
fetches (I'm thinking of slow, sporadic links and larger repos like the
kernel, for instance), breaking the repo up into chunks that are too small
will adversely affect the overall transfer and could cause just as much
system thrash on the upstream provider.

I'd be curious to see what the system-impact numbers and performance
differences are, though. I do think getting some sort of resumability is
important, but resumability at the expense of being able to get the data
out quickly and efficiently is not going to be a good trade-off :-/

- John 'Warthog9' Hawley
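[Editor's note: the chunk-size concern can be made concrete with a toy cost
model. Every figure below is invented for illustration (loosely inspired by
Sam Vilain's perl.git numbers earlier in the thread), not a measurement.]

```python
# Toy model: one well-deltified pack vs. per-object fetches.
PACK_BYTES  = 70 * 1024 * 1024        # assumed size of one pack
LOOSE_BYTES = 2 * 1024 * 1024 * 1024  # same objects without cross-object deltas
NUM_OBJECTS = 1_000_000               # assumed object count
OVERHEAD    = 64                      # assumed per-request protocol overhead

pack_total  = PACK_BYTES + OVERHEAD * 1            # one request, one pack
loose_total = LOOSE_BYTES + OVERHEAD * NUM_OBJECTS  # one request per object

print(loose_total // pack_total)  # 30 -- about 30x more data on the wire
```

The per-request overhead is almost noise here; the dominant cost is losing
delta compression, which is why "smallest atomic objects" is so expensive.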
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-11  0:57 ` J.H.
@ 2011-01-11  1:56   ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 22+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-11 1:56 UTC (permalink / raw)
To: J.H.; +Cc: John Wyzer, git

On Tue, Jan 11, 2011 at 7:57 AM, J.H. <warthog9@kernel.org> wrote:
> While I think we need to come up with a mechanism to allow resumable
> fetches (I'm thinking of slow, sporadic links and larger repos like the
> kernel, for instance), breaking the repo up into chunks that are too
> small will adversely affect the overall transfer and could cause just
> as much system thrash on the upstream provider.
>
> I'd be curious to see what the system-impact numbers and performance
> differences are, though. I do think getting some sort of resumability
> is important, but resumability at the expense of being able to get the
> data out quickly and efficiently is not going to be a good trade-off :-/

Yeah, I'm interested in those numbers too. Let me get a prototype working,
then we'll have numbers to discuss.
-- 
Duy
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06  2:29 Resumable clone/Gittorrent (again) - stable packs? Zenaan Harkness
  2011-01-06 17:05 ` Shawn Pearce
@ 2011-01-06 21:09 ` Nicolas Pitre
  2011-01-07  2:36   ` Zenaan Harkness
  1 sibling, 1 reply; 22+ messages in thread
From: Nicolas Pitre @ 2011-01-06 21:09 UTC (permalink / raw)
To: Zenaan Harkness; +Cc: git

On Thu, 6 Jan 2011, Zenaan Harkness wrote:

> Bittorrent requires some stability around torrent files.
>
> Can packs be generated deterministically?

They _could_, but we do _not_ want to do that.

The only thing that is stable in Git is the canonical representation of
objects, and the objects they depend on, expressed by their SHA1 signature.
Any BitTorrent-like design for Git must be based on that property, not on
the packed representation of those objects, which is not meant to be
stable.

If you don't want to design anything and simply want to reuse the current
BitTorrent codebase, then simply create a Git bundle from some release
version and seed that bundle for a sufficiently long period to be worth it.
Then falling back to git fetch in order to bring the repo up to date with
the very latest commits should be small and quick. When that final fetch
gets too big, it's time to start seeding another, more up-to-date bundle.

Nicolas
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-06 21:09 ` Nicolas Pitre
@ 2011-01-07  2:36   ` Zenaan Harkness
  2011-01-07  4:33     ` Nicolas Pitre
  0 siblings, 1 reply; 22+ messages in thread
From: Zenaan Harkness @ 2011-01-07 2:36 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git

On Fri, Jan 7, 2011 at 08:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Thu, 6 Jan 2011, Zenaan Harkness wrote:
>
>> Bittorrent requires some stability around torrent files.
>>
>> Can packs be generated deterministically?
>
> They _could_, but we do _not_ want to do that.
>
> The only thing that is stable in Git is the canonical representation of
> objects, and the objects they depend on, expressed by their SHA1
> signature. Any BitTorrent-like design for Git must be based on that
> property, not on the packed representation of those objects, which is
> not meant to be stable.
>
> If you don't want to design anything and simply want to reuse the
> current BitTorrent codebase, then simply create a Git bundle from some
> release version and seed that bundle for a sufficiently long period to
> be worth it. Then falling back to git fetch in order to bring the repo
> up to date with the very latest commits should be small and quick.
> When that final fetch gets too big, it's time to start seeding
> another, more up-to-date bundle.

Thanks guys for the explanations.

So, we don't _want_ to generate packs deterministically. BUT, we _can_
reliably unpack a pack (duh). So if my configured "canonical upstream"
decides on a particular compression etc., my git client doesn't care what
has been chosen by my upstream.

What is important for torrent-able packs, though, is stability over some
time period, no matter what the format. There's been much talk of caching,
invalidating of caches, overlapping torrent-packs, etc. In every case, for
torrents to work, the P2P'd files must have some stability over some time
period.
(If this assumption is incorrect, please clarify; not counting
every-file-is-a-torrent and every-commit-is-a-torrent.)

So, torrentable options:
- torrent per commit
- torrent per pack
- torrent per torrent-archive (a new file format)

Torrent per commit: too small, too many torrents; we need larger p2p-able
sizes in general.

Torrent per pack: packs are created non-deterministically, both between
hosts and even intra-host (libz upgrade, nr_threads change, git pack
algorithm optimization).

A new torrent format, if "close enough" to current git pack performance
(CPU load, threadability, size), is potentially a new version of the git
pack file format - we don't want to store two sets of pack files on disk
if it is sensible not to. Unlikely to happen: I can't conceive of a
torrentable format that would be anything but worse than pack files, and
it would therefore be rejected from git master.

Can we relax the perceived requirement to deterministically create pack
files? Over what time period are pack files stable in a particular git
repository? Over what time period do we require stable files for
torrenting?

Can we simply configure our local git to keep specified pack files for a
specified time period, and use those as torrent-packs? Perhaps the torrent
file could have a UseBy date?

Zen
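[Editor's note: the "UseBy date" idea could look something like this
sketch. The `use_by` field and the freshness policy are hypothetical, not
anything Git or BitTorrent defines: a repo would keep serving a frozen
torrent-pack only inside its stability window, then regenerate and reseed.]

```python
from datetime import datetime, timedelta

def torrent_pack_is_fresh(created, use_by, now):
    """Hypothetical policy: serve the frozen pack only inside its window."""
    return created <= now < use_by

created = datetime(2011, 1, 1)
use_by = created + timedelta(days=90)  # assumed 3-month stability window

print(torrent_pack_is_fresh(created, use_by, datetime(2011, 2, 15)))  # True
print(torrent_pack_is_fresh(created, use_by, datetime(2011, 5, 1)))   # False
```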
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  2:36 ` Zenaan Harkness
@ 2011-01-07  4:33   ` Nicolas Pitre
  2011-01-07  5:22     ` Jeff King
  2011-01-10 11:48     ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 22+ messages in thread
From: Nicolas Pitre @ 2011-01-07 4:33 UTC (permalink / raw)
To: Zenaan Harkness; +Cc: git

On Fri, 7 Jan 2011, Zenaan Harkness wrote:

> Thanks guys for the explanations.
>
> So, we don't _want_ to generate packs deterministically.
> BUT, we _can_ reliably unpack a pack (duh).

Of course.

> So if my configured "canonical upstream" decides on a particular
> compression etc., my git client doesn't care what has been chosen by
> my upstream.

Indeed. This is like saying: I'm sending you the value 52, but I chose to
use the representation "24 + 28", while someone else might decide to encode
that value as "13 * 4" instead.
You are still able to decode it to the same result in both cases.

> What is important for torrent-able packs, though, is stability over
> some time period, no matter what the format.

Hence my suggestion to simply seed a Git bundle over BitTorrent. Bundles
are files designed to be used by completely ad hoc transports, and you can
fetch from them just as if they were a remote repository.

> There's been much talk of caching, invalidating of caches, overlapping
> torrent-packs, etc.

And in my humble opinion this is all just crap. All those suggestions are
fragile, create administrative issues, eat up server resources, and they
are all suboptimal in the end. No one has implemented a working prototype
so far either.

We don't want caches. Fundamentally, we do not need any cache. Caches are a
pain to administer on a busy server anyway, as they eat disk space, and
they also represent a much bigger security risk compared to a read-only
operation.

Furthermore, a cache is good only for the common case that everyone wants.
But with Git, you cannot presume that everyone is at the same version
locally. So either you do a custom transfer for each client to minimize
transfers, in which case caching the result might not benefit that many
people, or you make the cached data bigger to cover more cases while making
the transfer suboptimal.

Finally, we do have a cache already, and that's the existing packs
themselves. During a clone, the vast majority of the transferred data is
streamed, without further processing, straight out of those existing packs,
as we try to reuse as much data as possible from them so as not to
recompute/recompress that data all the time.

> In every case, for torrents to work, the P2P'd files must have some
> stability over some time period.
> (If this assumption is incorrect, please clarify; not counting
> every-file-is-a-torrent and every-commit-is-a-torrent.)
> So, torrentable options:
> - torrent per commit
> - torrent per pack
> - torrent per torrent-archive (a new file format)
>
> Torrent per commit: too small, too many torrents; we need larger
> p2p-able sizes in general.
>
> Torrent per pack: packs are created non-deterministically, both between
> hosts and even intra-host (libz upgrade, nr_threads change, git pack
> algorithm optimization).
>
> A new torrent format, if "close enough" to current git pack performance
> (CPU load, threadability, size), is potentially a new version of the
> git pack file format - we don't want to store two sets of pack files on
> disk if it is sensible not to. Unlikely to happen: I can't conceive of
> a torrentable format that would be anything but worse than pack files,
> and it would therefore be rejected from git master.
>
> Can we relax the perceived requirement to deterministically create pack
> files? Over what time period are pack files stable in a particular git
> repository? Over what time period do we require stable files for
> torrenting?
>
> Can we simply configure our local git to keep specified pack files for
> a specified time period, and use those as torrent-packs? Perhaps the
> torrent file could have a UseBy date?

Again, this is just too much complexity for so little gain. Here's what I
suggest:

  cd my_project
  BUNDLENAME=my_project_$(date "+%s").gitbundle
  git bundle create $BUNDLENAME --all
  maketorrent-console your_favorite_tracker $BUNDLENAME

Then start seeding that bundle, and upload $BUNDLENAME.torrent as
bundle.torrent inside my_project.git on your server.

Now... Git clients could be improved to first check for the availability of
the file "bundle.torrent" on the remote side, either directly in
my_project.git, or through some Git protocol extension. Or even better, the
torrent hash could be stored in a Git ref, such as refs/bittorrent/bundle,
and the client could use that to retrieve the bundle.torrent file through
some other means.
When the bundle.torrent file is retrieved, just pull the torrent content
(and seed it some more to be nice). Then simply run "git clone" using the
original arguments, but with the obtained bundle instead of the original
URL. Then replace the remote URL in .git/config with the actual remote URL
instead of the bundle file path. And finally perform a "git pull" to bring
in the new commits that were added to the remote repository since the
bundle was created. That final pull will be small and quick.

After a while, that final pull will get bigger as the difference between
the bundled version and the current tip in the remote repository grows. So
every so often, say every 3 months, it might be a good idea to create a new
bundle so that the latest commits are included in it, in order to make that
final pull small and quick again.

Isn't that sufficient?

Nicolas
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  4:33 ` Nicolas Pitre
@ 2011-01-07  5:22   ` Jeff King
  2011-01-07  5:31     ` Jeff King
  2011-01-10 11:48   ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff King @ 2011-01-07 5:22 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Zenaan Harkness, git

On Thu, Jan 06, 2011 at 11:33:51PM -0500, Nicolas Pitre wrote:

> Here's what I suggest:
>
>   cd my_project
>   BUNDLENAME=my_project_$(date "+%s").gitbundle
>   git bundle create $BUNDLENAME --all
>   maketorrent-console your_favorite_tracker $BUNDLENAME
>
> Then start seeding that bundle, and upload $BUNDLENAME.torrent as
> bundle.torrent inside my_project.git on your server.
>
> Now... Git clients could be improved to first check for the
> availability of the file "bundle.torrent" on the remote side, either
> directly in my_project.git, or through some Git protocol extension. Or
> even better, the torrent hash could be stored in a Git ref, such as
> refs/bittorrent/bundle, and the client could use that to retrieve the
> bundle.torrent file through some other means.

I really like the simplicity of this idea. It could even be generalized to
handle more traditional mirrors, too. Just slice up the refs/mirrors
namespace to provide different methods of fetching some initial set of
objects. For example, upload-pack might advertise (in addition to the usual
refs):

  refs/mirrors/bundle/torrent
  refs/mirrors/bundle/http
  refs/mirrors/fetch/git
  refs/mirrors/fetch/http

and the client can decide its preferred way of getting data: a bundle by
http or by torrent, or connecting directly to some other git repository by
git protocol or http. It would fetch the appropriate ref, which would
contain a blob in some method-specific format. For torrent, it would be a
torrent file. For the others, probably a newline-delimited set of URLs. You
could also provide a torrent-magnet ref if you didn't even want to
distribute the torrent file.
And no matter what method is used, at the end you have some set of refs and
objects, and you can re-try your (now much smaller) fetch.

And there are a few obvious optimizations:

 1. When you get the initial set of refs from the master, remember them.
    If the mirror actually satisfies everything you were going to fetch,
    then you don't even have to reconnect for the final fetch.

 2. You can optionally cache the mirror list, and go straight to a mirror
    for future fetches instead of checking the master. This is only a
    reasonable thing to do if the mirrors are kept up to date and provide
    good incremental access (i.e., only actual git-protocol mirrors, not
    torrent or http files).

-Peff
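[Editor's note: the advertised-refs scheme implies a simple client-side
selection step. A sketch follows; the ref names track the refs/mirrors
layout proposed above, but the preference order is an assumed client
policy, and nothing here is an actual Git protocol extension.]

```python
# Choose a mirror method from advertised refs, in client preference order.
PREFERENCES = [
    "refs/mirrors/fetch/git",       # incremental access, preferred
    "refs/mirrors/bundle/torrent",  # bulk, resumable via BitTorrent
    "refs/mirrors/bundle/http",     # bulk, plain resumable download
]

def choose_mirror(advertised_refs):
    """Return the first advertised mirror ref this client can use."""
    for ref in PREFERENCES:
        if ref in advertised_refs:
            return ref
    return None  # no usable mirror: fall back to a plain clone

# Example advertisement: only bundle-style mirrors are offered.
advertised = {
    "refs/heads/master": "1234...",
    "refs/mirrors/bundle/torrent": "abcd...",  # blob holding a .torrent
    "refs/mirrors/bundle/http": "ef01...",     # blob holding a URL list
}

print(choose_mirror(advertised))  # refs/mirrors/bundle/torrent
```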
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  5:22 ` Jeff King
@ 2011-01-07  5:31   ` Jeff King
  2011-01-07 10:04     ` Zenaan Harkness
  2011-01-07 18:52     ` Ilari Liusvaara
  0 siblings, 2 replies; 22+ messages in thread
From: Jeff King @ 2011-01-07 5:31 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Zenaan Harkness, git

On Fri, Jan 07, 2011 at 12:22:07AM -0500, Jeff King wrote:

>   refs/mirrors/bundle/torrent
>   refs/mirrors/bundle/http
>   refs/mirrors/fetch/git
>   refs/mirrors/fetch/http
>
> and the client can decide its preferred way of getting data: a bundle
> by http or by torrent, or connecting directly to some other git
> repository by git protocol or http. It would fetch the appropriate
> ref, which would contain a blob in some method-specific format. For
> torrent, it would be a torrent file. For the others, probably a
> newline-delimited set of URLs. You could also provide a torrent-magnet
> ref if you didn't even want to distribute the torrent file.
>
> And no matter what method is used, at the end you have some set of
> refs and objects, and you can re-try your (now much smaller) fetch.

And I think it is probably obvious to you, Nicolas, since these are
problems you have been thinking about for some time, but the reason I am
interested in this expanded definition of mirroring is a few features
people have been asking for:

 1. Restartable clone; any bundle format is easily restartable using
    standard protocols.

 2. Avoiding too-big clones; I remember the gentoo folks wanting to
    disallow full clones from their actual dev machines and push people
    off to some more static method of pulling. I think not just because
    of restartability, but because of the load on the dev machines.

 3. People on low-bandwidth servers who fork major projects; if I write
    three kernel patches and host a git server, I would really like
    people to only fetch my patches from me and get the rest from
    kernel.org.

-Peff
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  5:31 ` Jeff King
@ 2011-01-07 10:04   ` Zenaan Harkness
  2011-01-07 18:52   ` Ilari Liusvaara
  1 sibling, 0 replies; 22+ messages in thread
From: Zenaan Harkness @ 2011-01-07 10:04 UTC (permalink / raw)
To: Jeff King; +Cc: Nicolas Pitre, git

On Fri, Jan 7, 2011 at 16:31, Jeff King <peff@peff.net> wrote:
> On Fri, Jan 07, 2011 at 12:22:07AM -0500, Jeff King wrote:
> the reason I am interested in this expanded definition of mirroring is
> a few features people have been asking for:
>
> 1. Restartable clone; any bundle format is easily restartable using
>    standard protocols.

This is very important to me. I have failed to establish an initial repo
for a few larger projects, some apache projects and opentaps most recently.
It is getting _really_ frustrating.

> 2. Avoiding too-big clones; I remember the gentoo folks wanting to
>    disallow full clones from their actual dev machines and push people
>    off to some more static method of pulling. I think not just because
>    of restartability, but because of the load on the dev machines.

And of course the lack of restartability causes an ongoing increase in the
load on the machines delivering those large clones.

> 3. People on low-bandwidth servers who fork major projects; if I write
>    three kernel patches and host a git server, I would really like
>    people to only fetch my patches from me and get the rest from
>    kernel.org.

This is not so much of a problem - it can already be handled by cloning
your linux-full.git to a private dir and only publishing your shallow
"personal patches only" clone, or better still, just a tarball of your 3
patches, or emailing them, etc.

So I agree the big issues are restartable large clones and lowering server
loads.

Zen
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07  5:31 ` Jeff King
  2011-01-07 10:04 ` Zenaan Harkness
@ 2011-01-07 18:52   ` Ilari Liusvaara
  2011-01-07 19:17     ` Jeff King
  2011-01-10 21:07     ` Sam Vilain
  1 sibling, 2 replies; 22+ messages in thread
From: Ilari Liusvaara @ 2011-01-07 18:52 UTC (permalink / raw)
To: Jeff King; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 12:31:19AM -0500, Jeff King wrote:
>
> 3. People on low-bandwidth servers who fork major projects; if I write
>    three kernel patches and host a git server, I would really like
>    people to only fetch my patches from me and get the rest from
>    kernel.org.

One client-side-only feature that could be useful:

The ability to contact multiple servers in sequence, each time advertising
everything obtained so far, then treating the new repo as a clone of the
last address.

This would be very handy if, e.g., you happen to have a local mirror of,
say, the Linux kernel and want to fetch some related project without
messing with alternates or downloading everything again:

  git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo

This would first fetch everything from the local source and then update
that from the remote, likely being vastly faster.

-Ilari
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 18:52 ` Ilari Liusvaara
@ 2011-01-07 19:17   ` Jeff King
  2011-01-07 21:45     ` Ilari Liusvaara
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2011-01-07 19:17 UTC (permalink / raw)
To: Ilari Liusvaara; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 08:52:18PM +0200, Ilari Liusvaara wrote:

> One client-side-only feature that could be useful:
>
> The ability to contact multiple servers in sequence, each time
> advertising everything obtained so far, then treating the new repo as
> a clone of the last address.
>
> This would be very handy if, e.g., you happen to have a local mirror
> of, say, the Linux kernel and want to fetch some related project
> without messing with alternates or downloading everything again:
>
>   git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
>
> This would first fetch everything from the local source and then
> update that from the remote, likely being vastly faster.

I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a
repo? In that case, isn't that basically the same as --reference? Or is it
a local mirror list?

If the latter, then yeah, I think it is a good idea. Clients should
definitely be able to ignore, override, or add to mirror lists provided by
servers. The server can provide hints about useful mirrors, but it is up to
the client to decide which methods are useful to it and which mirrors are
closest.

Of course there are some servers that will want to do more than hint (e.g.,
the gentoo case, where they really don't want people cloning from the main
machine).
For those cases, though, I think it is best to provide the hint and to
reject clients who don't follow it (e.g., by barfing on somebody who tries
to do a full clone). You have to implement that rejection layer anyway for
older clients.

-Peff
* Re: Resumable clone/Gittorrent (again) - stable packs?
  2011-01-07 19:17 ` Jeff King
@ 2011-01-07 21:45   ` Ilari Liusvaara
  2011-01-07 21:56     ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Ilari Liusvaara @ 2011-01-07 21:45 UTC (permalink / raw)
To: Jeff King; +Cc: Nicolas Pitre, Zenaan Harkness, git

On Fri, Jan 07, 2011 at 02:17:19PM -0500, Jeff King wrote:
> On Fri, Jan 07, 2011 at 08:52:18PM +0200, Ilari Liusvaara wrote:
>
>>   git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
>>
>> This would first fetch everything from the local source and then
>> update that from the remote, likely being vastly faster.
>
> I'm not clear in your example what ~/repositories/linux-2.6 is. Is it
> a repo? In that case, isn't that basically the same as --reference? Or
> is it a local mirror list?

Yes, it is a repo. No, it isn't the same as --reference. It is a list of
mirrors to try first before connecting to the final repository, and each
entry can be any type of repository URL (local, true smart transport,
smart HTTP, dumb HTTP, etc.).

The idea is that you have a list of mirrors that are faster than the final
repository, but not necessarily complete. You want to download most of the
stuff from there.

> If the latter, then yeah, I think it is a good idea. Clients should
> definitely be able to ignore, override, or add to mirror lists
> provided by servers. The server can provide hints about useful
> mirrors, but it is up to the client to decide which methods are useful
> to it and which mirrors are closest.

This is essentially adding mirrors to the mirror list (modulo that mirrors
are not assumed to be complete).

Security:

- Confidentiality: the connection to a mirror must traverse only trusted
  links, or be encrypted, if material from the mirror is sensitive.
- Integrity: the same integrity as the connection to the final repo
  (assuming SHA-1 can't be collided), because git object naming is
  securely unique.
> Of course there are some servers who will want to do more than hint > (e.g., the gentoo case where they really don't want people cloning from > the main machine). For those cases, though, I think it is best to > provide the hint and to reject clients who don't follow it (e.g., by > barfing on somebody who tries to do a full clone). You have to implement > that rejection layer anyway for older clients. With an option like this, a client could do: git clone --use-mirror=http://git.example.org/base/foo git://git.example.org/foo to first grab stuff via HTTP (well-packed dumb HTTP is very light on the server) and then continue via git:// (now much cheaper because the client is relatively up to date). -Ilari ^ permalink raw reply [flat|nested] 22+ messages in thread
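[Editor's note: the --use-mirror option discussed here was a proposal and was never part of stock git, but the mirror-first pattern it describes can be approximated with plain git commands. The sketch below uses invented, throwaway local repositories ("final", "mirror", "work") purely for illustration.]

```shell
#!/bin/sh
# Sketch: approximate the proposed --use-mirror behaviour with stock
# git. Local throwaway repositories stand in for real URLs.
set -e
tmp=$(mktemp -d); cd "$tmp"

# The "final" repository.
git init -q -b main final
git -C final -c user.name=x -c user.email=x@y commit -q --allow-empty -m one

# A fast mirror, taken before the newest commit lands in "final",
# so it is fast but incomplete.
git clone -q final mirror
git -C final -c user.name=x -c user.email=x@y commit -q --allow-empty -m two

# Client: clone from the mirror first, then repoint at the final
# repository; the second fetch only has to move what the mirror lacked.
git clone -q mirror work
git -C work remote set-url origin "$tmp/final"
git -C work fetch -q origin
git -C work merge -q --ff-only origin/main
```

After the final fetch and fast-forward, the working clone has both commits even though the bulk of the transfer came from the (stale) mirror.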
* Re: Resumable clone/Gittorrent (again) - stable packs? 2011-01-07 21:45 ` Ilari Liusvaara @ 2011-01-07 21:56 ` Jeff King 2011-01-07 22:21 ` Ilari Liusvaara 0 siblings, 1 reply; 22+ messages in thread From: Jeff King @ 2011-01-07 21:56 UTC (permalink / raw) To: Ilari Liusvaara; +Cc: Nicolas Pitre, Zenaan Harkness, git On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote: > > I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a > > repo? In that case, isn't that basically the same as --reference? Or is > > it a local mirror list? > > Yes, it is a repo. No, it isn't the same as --reference. It is list > of mirrors to try first before connecting to final repository and can > be any type of repository URL (local, true smart transport, smart HTTP, > dumb HTTP, etc...) OK, I understand what you mean. I was thrown off by your example using a local repository (in which case you probably would want --reference to save disk space, unless the burden of alternates management is too much). So yeah, I think we are on the same page, except that you were proposing to pass the mirror directly, and I was proposing passing a mirror file which would contain a list of mirrors. Yours is much simpler and would probably be what people want most of the time. > > If the latter, then yeah, I think it is a good idea. Clients should > > definitely be able to ignore, override, or add to mirror lists provided > > by servers. The server can provide hints about useful mirrors, but it is > > up to the client to decide which methods are useful to it and which > > mirrors are closest. > > This is essentially adding mirrors to mirror list (modulo that mirrors > are not assumed to be complete). I think there should always be an assumption that mirrors are not necessarily complete. That is necessary for bundle-like mirrors to be feasible, since updating the bundle for every commit defeats the purpose. 
It would be nice for there to be a way for some mirrors to be marked as "should be considered complete and authoritative", since we can optimize out the final check of the master in that case (as well as for future fetches). But that's a future feature. My plan was to leave space in the mirror list for arbitrary metadata of that sort. > > Of course there are some servers who will want to do more than hint > > (e.g., the gentoo case where they really don't want people cloning from > > the main machine). For those cases, though, I think it is best to > > provide the hint and to reject clients who don't follow it (e.g., by > > barfing on somebody who tries to do a full clone). You have to implement > > that rejection layer anyway for older clients. > > With option like this, a client could do: > > git clone --use-mirror=http://git.example.org/base/foo git://git.example.org/foo > > To first grab stuff via HTTP (well-packed dumb HTTP is very light on the > server) and then continue via git:// (now much cheaper because client is > relatively up to date). Yes, exactly. -Peff ^ permalink raw reply [flat|nested] 22+ messages in thread
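[Editor's note: no mirror-list format was ever standardized, but Jeff's "space for arbitrary metadata" might have looked something like the following. This fragment is entirely hypothetical; the hostnames, keys, and layout are invented for illustration.]

```text
# Hypothetical mirror list served by the master: one mirror per line,
# followed by optional key=value metadata. Illustrative only.
http://mirror1.example.org/foo.git   complete=false
git://mirror2.example.org/foo.git    complete=true authoritative=true
http://cdn.example.org/foo.bundle    type=bundle
```

A "complete and authoritative" flag of this sort is what would let a client skip the final check of the master.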
* Re: Resumable clone/Gittorrent (again) - stable packs? 2011-01-07 21:56 ` Jeff King @ 2011-01-07 22:21 ` Ilari Liusvaara 2011-01-07 22:27 ` Jeff King 0 siblings, 1 reply; 22+ messages in thread From: Ilari Liusvaara @ 2011-01-07 22:21 UTC (permalink / raw) To: Jeff King; +Cc: Nicolas Pitre, Zenaan Harkness, git On Fri, Jan 07, 2011 at 04:56:31PM -0500, Jeff King wrote: > On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote: > > > I think there should always be an assumption that mirrors are not > necessarily complete. That is necessary for bundle-like mirrors to be > feasible, since updating the bundle for every commit defeats the > purpose. Also add a protocol that grabs a bundle from HTTP and then opens it up? :-) > It would be nice for there to be a way for some mirrors to be marked as > "should be considered complete and authoritative", since we can optimize > out the final check of the master in that case (as well as for future > fetches). But that's a future feature. My plan was to leave space in the > mirror list for arbitrary metadata of that sort. The first thing one should get/do when connecting to another repository is its list of references. One can see from there whether what one has got is complete or not (with --use-mirror this only allows skipping commit negotiation and the fetch, not the whole connection, since the repositories are contacted in order)... -Ilari ^ permalink raw reply [flat|nested] 22+ messages in thread
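[Editor's note: Ilari's point — that the ref advertisement a client receives on connect already reveals whether a mirror is current — can be illustrated with `git ls-remote`. The "master" and "mirror" repositories below are invented local stand-ins.]

```shell
#!/bin/sh
# Sketch: detect a stale mirror by comparing its ref advertisement
# against the master's. Local throwaway repos stand in for real URLs.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main master
git -C master -c user.name=x -c user.email=x@y commit -q --allow-empty -m one
git clone -q --mirror master mirror
git -C master -c user.name=x -c user.email=x@y commit -q --allow-empty -m two

# The ref listing is the first thing a client obtains on connect; if
# the mirror advertises the same tips as the master, the final fetch
# has nothing left to do.
at_master=$(git ls-remote master refs/heads/main)
at_mirror=$(git ls-remote mirror refs/heads/main)
if [ "$at_master" = "$at_mirror" ]; then
    echo "mirror is current"
else
    echo "mirror is stale"
fi
```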
* Re: Resumable clone/Gittorrent (again) - stable packs? 2011-01-07 22:21 ` Ilari Liusvaara @ 2011-01-07 22:27 ` Jeff King 0 siblings, 0 replies; 22+ messages in thread From: Jeff King @ 2011-01-07 22:27 UTC (permalink / raw) To: Ilari Liusvaara; +Cc: Nicolas Pitre, Zenaan Harkness, git On Sat, Jan 08, 2011 at 12:21:33AM +0200, Ilari Liusvaara wrote: > On Fri, Jan 07, 2011 at 04:56:31PM -0500, Jeff King wrote: > > On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote: > > > > > > I think there should always be an assumption that mirrors are not > > necessarily complete. That is necessary for bundle-like mirrors to be > > feasible, since updating the bundle for every commit defeats the > > purpose. > > Also add protocol that grabs a bundle from HTTP and then opens that > up? :-) Well, yes, that still needs to be implemented. But it's all client-side, so the server just has to provide the bundle somewhere. > > It would be nice for there to be a way for some mirrors to be marked as > > "should be considered complete and authoritative", since we can optimize > > out the final check of the master in that case (as well as for future > > fetches). But that's a future feature. My plan was to leave space in the > > mirror list for arbitrary metadata of that sort. > > The first thing one should get/do when connecting to another repository > is its list of references. One can see from there if what one has got > is complete or not (with --use-mirror that only allows skipping commit > negotiation and fetch, not the whole connection due to the fact that the > repositories are contacted in order)... Yes, but it would be cool to be able to skip even that connect in some cases (e.g., mirrors can be useful not just to take load off the master, but also when the master isn't available, either for downtime or because the client is behind a firewall). 
But the default should definitely be to double-check that the master is right, and we can leave more advanced cases for later (we just need to be aware of leaving room for them now). I'm going to start working on a patch series for this, so hopefully we'll see how it's shaping up in a day or two. -Peff ^ permalink raw reply [flat|nested] 22+ messages in thread
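[Editor's note: the client side Jeff mentions is close to trivial even without new code, because a bundle file is a valid clone and fetch source. A minimal sketch, with local paths standing in for a bundle that would really arrive over HTTP or BitTorrent:]

```shell
#!/bin/sh
# Sketch: use a downloaded bundle as the initial clone source, then
# top up from the live server. Local paths stand in for real URLs.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main server
git -C server -c user.name=x -c user.email=x@y commit -q --allow-empty -m base

# Server side: publish a static, resumable artifact.
git -C server bundle create "$tmp/repo.bundle" HEAD main

# The server gains a commit after the bundle was made.
git -C server -c user.name=x -c user.email=x@y commit -q --allow-empty -m newer

# Client side: "opening up" the bundle is just cloning from it...
git clone -q "$tmp/repo.bundle" work
# ...then the live fetch only has to move the post-bundle objects.
git -C work remote set-url origin "$tmp/server"
git -C work fetch -q origin
git -C work merge -q --ff-only origin/main
```

The double-check of the master that Jeff wants as the default is exactly the final fetch/merge step here.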
* Re: Resumable clone/Gittorrent (again) - stable packs? 2011-01-07 18:52 ` Ilari Liusvaara 2011-01-07 19:17 ` Jeff King @ 2011-01-10 21:07 ` Sam Vilain 1 sibling, 0 replies; 22+ messages in thread From: Sam Vilain @ 2011-01-10 21:07 UTC (permalink / raw) To: Ilari Liusvaara Cc: Jeff King, Nicolas Pitre, Zenaan Harkness, git, Shawn Pearce, Nguyen Thai Ngoc Duy, Joshua Roys, Nick Edelen, Jonas Fonseca On 08/01/11 07:52, Ilari Liusvaara wrote: > Ability to contact multiple servers in sequence, each time advertising > everything obtained so far. Then treat the new repo as clone of the last > address. > > This would e.g. be very handy if you happen to have local mirror of say, Linux > kernel and want to fetch some related project without messing with alternates > or downloading everything again: > > git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo > > This would first fetch everything from local source and then update that > from remote, likely being vastly faster. Coming to this discussion a little late, I'll summarise the previous research. First, the idea of applying the straight BitTorrent protocol to the pack files was raised, but as Nicolas mentions, this is not useful because the pack files are not deterministic. The protocol was revisited based around the part which is stable, object manifests. The RFC is at http://utsl.gen.nz/gittorrent/rfc.html and the prototype code (an unsuccessful GSoC project) is at http://repo.or.cz/w/VCS-Git-Torrent.git After some thought, I decided that the BitTorrent protocol itself is all cruft and that trying to cut it down to be useful was a waste of time. So, this is where the idea of "automatic mirroring" came from. With Automatic Mirroring, the two main functions of P2P operation - peer discovery and partial transfer - are broken into discrete features. 
I wrote this patch series so far, for "client-side mirroring": http://thread.gmane.org/gmane.comp.version-control.git/133626/focus=133628 The later levels are roughly discussed on this page: http://code.google.com/p/gittorrent/wiki/MirrorSync The "mirror sync" part is the complicated one, and as others have noted, no truly successful prototype has yet been built. Actually the Perl gittorrent implementation did manage to perform an incremental clone; it just didn't wrap it up nicely. But I won't go into that too much. There was also another GSoC project to look at caching the object list generation, the most expensive part of the process in the Perl implementation. This was a generic mechanism for accelerating object graph traversal and showed promise; unfortunately, it was never merged. The client-side mirroring patch, in its current form, already supports out-of-date mirrors. It saves refs first into 'refs/mirrors/hostname/...' and finally contacts the main server to check what objects it is still missing. So, if there were a regular bittorrent+bundle transport available, it would be a useful way to support an incremental clone; the client would first clone the (static) bittorrent bundle, unpack it with its refs into the 'refs/mirrors/xxx/' namespace, making the subsequent 'git fetch' to get the most recent objects a much more efficient operation. Hope that helps! Cheers, Sam ^ permalink raw reply [flat|nested] 22+ messages in thread
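[Editor's note: the 'refs/mirrors/hostname/...' layout Sam describes can be mimicked with an explicit fetch refspec in stock git; the sketch below does not reproduce his patch series, and the repository names ("server", "mirror.example", "work") are invented.]

```shell
#!/bin/sh
# Sketch: stash a (possibly stale) mirror's refs under refs/mirrors/...
# and let the final fetch from the real server fill in the gap.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main server
git -C server -c user.name=x -c user.email=x@y commit -q --allow-empty -m one
git clone -q --mirror server mirror.example   # stand-in for a remote mirror
git -C server -c user.name=x -c user.email=x@y commit -q --allow-empty -m two

git init -q -b main work
# Bulk transfer from the mirror, kept out of the normal ref namespace.
git -C work fetch -q "$tmp/mirror.example" \
    "+refs/heads/*:refs/mirrors/mirror.example/*"
# Final contact with the real server moves only the missing objects.
git -C work fetch -q "$tmp/server" \
    "+refs/heads/*:refs/remotes/origin/*"
```

After this, refs/mirrors/mirror.example/main still points at the mirror's stale tip, while refs/remotes/origin/main has the server's current tip.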
* Re: Resumable clone/Gittorrent (again) - stable packs? 2011-01-07 4:33 ` Nicolas Pitre 2011-01-07 5:22 ` Jeff King @ 2011-01-10 11:48 ` Nguyen Thai Ngoc Duy 2011-01-10 13:50 ` Nicolas Pitre 1 sibling, 1 reply; 22+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-01-10 11:48 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Zenaan Harkness, git On Fri, Jan 7, 2011 at 11:33 AM, Nicolas Pitre <nico@fluxnic.net> wrote: > Here's what I suggest: > > cd my_project > BUNDLENAME=my_project_$(date "+%s").gitbundle > git bundle create $BUNDLENAME --all > maketorrent-console your_favorite_tracker $BUNDLENAME > > Then start seeding that bundle, and upload $BUNDLENAME.torrent as > bundle.torrent inside my_project.git on your server. I was about to ask if we could put more "trailer" SHA-1 checksums into the bundle, so we can verify which part is corrupt without redownloading the whole thing (this is over HTTP/FTP, not torrent). But I realize it's just easier to split the bundle into multiple packs, so we can verify and redownload only corrupt packs. Logically it is still a single pack. Splitting helps put more SHA-1 checksums in without changing the pack format. The packs will be merged back into one with the "index-pack --pack-stream" patch I sent elsewhere. -- Duy ^ permalink raw reply [flat|nested] 22+ messages in thread
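[Editor's note: outside of git proper, the per-piece verification Duy wants amounts to the standard split-plus-checksum-manifest pattern, sketched here with generic tools on an arbitrary file. GNU coreutils (`split -b`, `sha1sum -c`) are assumed; file and piece names are invented.]

```shell
#!/bin/sh
# Sketch: per-piece checksums so a client can re-download only the
# corrupt piece of a large pack/bundle, not the whole file.
set -e
tmp=$(mktemp -d); cd "$tmp"
head -c 100000 /dev/urandom > big.pack   # stand-in for a large pack

# Publisher: fixed-size pieces plus a checksum manifest.
split -b 32768 big.pack piece.
sha1sum piece.* > MANIFEST

# Client: verify every piece (a mismatch names exactly which piece to
# re-fetch), then reassemble.
sha1sum -c --quiet MANIFEST
cat piece.* > rebuilt.pack
cmp big.pack rebuilt.pack
```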
* Re: Resumable clone/Gittorrent (again) - stable packs? 2011-01-10 11:48 ` Nguyen Thai Ngoc Duy @ 2011-01-10 13:50 ` Nicolas Pitre 0 siblings, 0 replies; 22+ messages in thread From: Nicolas Pitre @ 2011-01-10 13:50 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Zenaan Harkness, git On Mon, 10 Jan 2011, Nguyen Thai Ngoc Duy wrote: > On Fri, Jan 7, 2011 at 11:33 AM, Nicolas Pitre <nico@fluxnic.net> wrote: > > Here's what I suggest: > > > > cd my_project > > BUNDLENAME=my_project_$(date "+%s").gitbundle > > git bundle create $BUNDLENAME --all > > maketorrent-console your_favorite_tracker $BUNDLENAME > > > > Then start seeding that bundle, and upload $BUNDLENAME.torrent as > > bundle.torrent inside my_project.git on your server. > > I was about to ask if we could put more "trailer" sha-1 checksums to > the bundle, so we can verify which part is corrupt without > redownloading the whole thing (this is over http/ftp.. not torrent). Aren't HTTP and FTP based on TCP, which is meant to be a reliable transport protocol already? In this case, isn't the final SHA-1 embedded in the bundle/pack sufficient? Normally, your HTTP/FTP client should get you all data or partial data, but not wrong data. Nicolas ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2011-01-11 1:56 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-01-06 2:29 Resumable clone/Gittorrent (again) - stable packs? Zenaan Harkness 2011-01-06 17:05 ` Shawn Pearce 2011-01-10 16:39 ` John Wyzer 2011-01-10 21:42 ` Sam Vilain 2011-01-11 0:03 ` Nguyen Thai Ngoc Duy 2011-01-11 0:57 ` J.H. 2011-01-11 1:56 ` Nguyen Thai Ngoc Duy 2011-01-06 21:09 ` Nicolas Pitre 2011-01-07 2:36 ` Zenaan Harkness 2011-01-07 4:33 ` Nicolas Pitre 2011-01-07 5:22 ` Jeff King 2011-01-07 5:31 ` Jeff King 2011-01-07 10:04 ` Zenaan Harkness 2011-01-07 18:52 ` Ilari Liusvaara 2011-01-07 19:17 ` Jeff King 2011-01-07 21:45 ` Ilari Liusvaara 2011-01-07 21:56 ` Jeff King 2011-01-07 22:21 ` Ilari Liusvaara 2011-01-07 22:27 ` Jeff King 2011-01-10 21:07 ` Sam Vilain 2011-01-10 11:48 ` Nguyen Thai Ngoc Duy 2011-01-10 13:50 ` Nicolas Pitre
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).