* How to resume a broken clone?
From: zhifeng hu @ 2013-11-28 3:13 UTC (permalink / raw)
To: git

Hello all,

Today I tried to clone the Linux kernel repository:

    git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I am in China, and our bandwidth is very limited -- less than 50 KB/s. The
clone progresses very slowly and breaks again and again, so I cannot
easily clone the kernel. I have done some research on resuming a clone,
but found no good solution to this problem.

Would it be possible to add support for resuming an interrupted clone
transfer, the way BitTorrent (for example) resumes downloads?

zhifeng hu

^ permalink raw reply  [flat|nested] 36+ messages in thread
* Re: How to resume a broken clone?
From: Trần Ngọc Quân @ 2013-11-28 7:39 UTC (permalink / raw)
To: zhifeng hu
Cc: git

On 28/11/2013 10:13, zhifeng hu wrote:
> Hello all,
> Today I tried to clone the Linux kernel repository:
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I am in China, and our bandwidth is very limited -- less than 50 KB/s.

This repo is really big. Consider using the --depth option if you don't
want the full history, or clone from somewhere with better bandwidth:

    $ git clone --depth=1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

You may also choose another mirror (github.com, for example); see
git-clone(1).

--
Trần Ngọc Quân.
* Re: How to resume a broken clone?
From: zhifeng hu @ 2013-11-28 7:41 UTC (permalink / raw)
To: Trần Ngọc Quân
Cc: git

Thanks for the reply, but I am a developer: I want to clone the full
repository, because I need to view code from very early in the history.

zhifeng hu

On Nov 28, 2013, at 3:39 PM, Trần Ngọc Quân <vnwildman@gmail.com> wrote:

> This repo is really big. Consider using the --depth option if you don't
> want the full history, or clone from somewhere with better bandwidth:
>
>     $ git clone --depth=1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> You may also choose another mirror (github.com, for example); see
> git-clone(1).
* Re: How to resume a broken clone?
From: Duy Nguyen @ 2013-11-28 8:14 UTC (permalink / raw)
To: zhifeng hu
Cc: Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu <zf@ancientrocklab.com> wrote:
> Thanks for the reply, but I am a developer: I want to clone the full
> repository, because I need to view code from very early in the history.

If it works with --depth=1, you can incrementally run "git fetch
--depth=N" with N larger and larger.

But it may be easier to ask a kernel.org admin, or any developer with a
public web server, to provide a git bundle you can download via HTTP.
Then you can fetch on top of it.
--
Duy
* Re: How to resume a broken clone?
From: Karsten Blees @ 2013-11-28 8:35 UTC (permalink / raw)
To: Duy Nguyen, zhifeng hu
Cc: Trần Ngọc Quân, Git Mailing List

On 28.11.2013 09:14, Duy Nguyen wrote:
> If it works with --depth=1, you can incrementally run "git fetch
> --depth=N" with N larger and larger.
>
> But it may be easier to ask a kernel.org admin, or any developer with a
> public web server, to provide a git bundle you can download via HTTP.
> Then you can fetch on top of it.

Or simply download the individual files (via ftp/http, which is
resumable) and clone locally from the downloaded copy:

> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cd linux
> git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

(The second command clones from the local directory created by wget.)
* Re: How to resume a broken clone?
From: Duy Nguyen @ 2013-11-28 8:50 UTC (permalink / raw)
To: Karsten Blees
Cc: zhifeng hu, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 3:35 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
> Or simply download the individual files (via ftp/http) and clone locally:
>
>> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> cd linux
>> git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Yeah, I didn't realize it is published over dumb http too. You may need
to be careful with this, though, because it's not atomic: you may get
refs that point nowhere, because you were already done with the "pack"
directory by the time you came to fetching "refs", and so never saw the
new packs... If the dumb commit walker supports resume (I don't know),
it would be safer to do

    git clone http://git.kernel.org/....

If it does not support resume, I don't think it would be hard to add.
--
Duy
* Re: How to resume a broken clone?
From: zhifeng hu @ 2013-11-28 8:55 UTC (permalink / raw)
To: Duy Nguyen
Cc: Karsten Blees, Trần Ngọc Quân, Git Mailing List

The repository is growing fast, and things get harder. The size has
already reached several GB, and it may eventually be TB. How do we
handle it then? If a transfer breaks and cannot be resumed, we waste
both time and bandwidth.

Git should support resuming transfers better; right now it is not doing
that part of its job well. Sharing code, managing code, transferring
code -- isn't that what we imagine a VCS to be?

zhifeng hu

On Nov 28, 2013, at 4:50 PM, Duy Nguyen <pclouds@gmail.com> wrote:

> Yeah, I didn't realize it is published over dumb http too. You may need
> to be careful with this, though, because it's not atomic: you may get
> refs that point nowhere... If the dumb commit walker supports resume
> (I don't know), it would be safer to do
>
>     git clone http://git.kernel.org/....
>
> If it does not support resume, I don't think it would be hard to add.
* Re: How to resume a broken clone?
From: Duy Nguyen @ 2013-11-28 9:09 UTC (permalink / raw)
To: zhifeng hu
Cc: Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 3:55 PM, zhifeng hu <zf@ancientrocklab.com> wrote:
> The repository is growing fast, and things get harder. The size has
> already reached several GB, and it may eventually be TB. How do we
> handle it then? If a transfer breaks and cannot be resumed, we waste
> both time and bandwidth.
>
> Git should support resuming transfers better.

You're welcome to step up and do it. Off the top of my head there are a
few options:

 - better integration with git bundles: provide a way to seamlessly
   create/fetch/resume the bundles with "git clone" and "git fetch"

 - shallow/narrow clone: the idea is to get a small part of the repo --
   one depth, a few paths -- then get more and more over many
   iterations, so if one iteration fails we don't lose everything

 - stabilize pack order, so we can resume downloading a pack

 - remote alternates: the repo asks for more and more objects as you
   need them (so goodbye to the distributed model)
--
Duy
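The first option -- bundle integration -- already works manually today. Below is a local sketch of the workflow; all paths are illustrative, and in practice the bundle file would sit on a plain HTTP server where a client can download it resumably (e.g. with `wget -c`) before cloning from it.

```shell
set -e
tmp=$(mktemp -d)

# "Server" side: a repository, and a single-file bundle of all of it.
git init -q "$tmp/server"
git -C "$tmp/server" config user.email demo@example.com
git -C "$tmp/server" config user.name demo
git -C "$tmp/server" commit -q --allow-empty -m "old history"
git -C "$tmp/server" bundle create "$tmp/clone.bundle" HEAD --all

# "Client" side: clone from the fully downloaded bundle, then repoint
# origin at the live repository and fetch whatever is newer on top.
git clone -q "$tmp/clone.bundle" "$tmp/client"
git -C "$tmp/client" remote set-url origin "file://$tmp/server"

git -C "$tmp/server" commit -q --allow-empty -m "newer work"
git -C "$tmp/client" fetch -q origin
[ "$(git -C "$tmp/client" rev-list --count FETCH_HEAD)" = 2 ]
echo "bundle bootstrap + incremental fetch OK"
```

The bundle download is resumable because it is a static file; only the final small incremental fetch goes through the non-resumable pack protocol.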
* Re: How to resume a broken clone?
From: Jeff King @ 2013-11-28 9:29 UTC (permalink / raw)
To: Duy Nguyen
Cc: zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:

> You're welcome to step up and do it. Off the top of my head there are a
> few options:
>
>  - better integration with git bundles: provide a way to seamlessly
>    create/fetch/resume the bundles with "git clone" and "git fetch"

I posted patches for this last year. One of the things that I got hung
up on was that I spooled the bundle to disk, and then cloned from it,
which meant that you needed twice the disk space for a moment. I wanted
to teach index-pack to "--fix-thin" a pack that was already on disk, so
that we could spool to disk and then finalize it without making another
copy.

One of the downsides of this approach is that it requires the repo
provider (or somebody else) to provide the bundle. I think that is
something that a big site like GitHub would do (and probably push the
bundles out to a CDN, too, to make getting them faster). But it's not a
universal solution.

>  - stabilize pack order, so we can resume downloading a pack

I think stabilizing in all cases (e.g., including ones where the content
has changed) is hard, but I wonder if it would be enough to handle the
easy cases, where nothing has changed. If the server does not use
multiple threads for delta computation, it should generate the same pack
from the same on-disk data deterministically. We just need a way for the
client to indicate that it has the same partial pack.

I'm thinking that the server would report some opaque hash representing
the current pack. The client would record that, along with the number of
pack bytes it received. If the transfer is interrupted, the client comes
back with the hash/bytes pair. The server starts to generate the pack,
checks whether the hash matches, and if so, says "here is the same pack,
resuming at byte X".

What would need to go into such a hash? It would need to represent the
exact bytes that will go into the pack, but without actually generating
those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
base (if applicable), length> for each object would be enough. We should
know that after calling compute_write_order. If the client has a match,
we should be able to skip ahead to the correct byte.

>  - remote alternates: the repo asks for more and more objects as you
>    need them (so goodbye to the distributed model)

This is also something I've been playing with, but just for very large
objects (so to support something like git-media, but below the object
graph layer). I don't think it would apply here, as the kernel has a lot
of small objects, and getting them in the tight delta'd pack format
increases efficiency a lot.

-Peff
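The fingerprint idea can be illustrated in miniature. This sketch hashes the ordered list of <object id, type, size> records for every object in a repository; it is only a stand-in for the real thing, which would have to use pack-objects' actual write order and delta-base information rather than `cat-file`'s object enumeration.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" config user.email demo@example.com
git -C "$tmp/repo" config user.name demo
git -C "$tmp/repo" commit -q --allow-empty -m one

# Hash the object records rather than the pack bytes themselves.
pack_fingerprint() {
    git -C "$1" cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objecttype) %(objectsize)' |
    sha1sum | cut -d' ' -f1
}

before=$(pack_fingerprint "$tmp/repo")

# Same object set -> same fingerprint: the server could offer to
# resume at the byte offset the client reports.
[ "$(pack_fingerprint "$tmp/repo")" = "$before" ] && echo "same pack, resume"

# New objects -> different fingerprint: the server must start over.
git -C "$tmp/repo" commit -q --allow-empty -m two
[ "$(pack_fingerprint "$tmp/repo")" != "$before" ] && echo "changed, restart"
```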
* Re: How to resume a broken clone?
From: Duy Nguyen @ 2013-11-28 10:17 UTC (permalink / raw)
To: Jeff King
Cc: zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 4:29 PM, Jeff King <peff@peff.net> wrote:
> I'm thinking that the server would report some opaque hash representing
> the current pack. The client would record that, along with the number of
> pack bytes it received. If the transfer is interrupted, the client comes
> back with the hash/bytes pair. The server starts to generate the pack,
> checks whether the hash matches, and if so, says "here is the same pack,
> resuming at byte X".
>
> What would need to go into such a hash? It would need to represent the
> exact bytes that will go into the pack, but without actually generating
> those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
> base (if applicable), length> for each object would be enough. We should
> know that after calling compute_write_order. If the client has a match,
> we should be able to skip ahead to the correct byte.

Exactly. The hash would include the list of SHA-1s and each object's
source, the git version (so changes in code or default values are
covered), the list of config keys/values that may impact the pack
generation algorithm (like window size...), .git/shallow, refs/replace,
.git/graft, and all or most of the command line options. If we audit
the code carefully, I think we can cover all input that influences pack
generation. From then on it's just a matter of protocol extension.

It also opens an opportunity for optional server-side caching: just
save the pack and associate it with the hash. Next time the client asks
to resume, the server has everything ready.
--
Duy
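Such an all-inputs resume key can be sketched as follows, with heavy hedging: the input set below (git version, a couple of pack.* settings, the shallow file, the object list) is illustrative rather than complete, and real code would live inside pack-objects where the full input set is known.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" config user.email demo@example.com
git -C "$tmp/repo" config user.name demo
git -C "$tmp/repo" commit -q --allow-empty -m one

# Combine (a subset of) everything that influences pack generation
# into one opaque key.
resume_key() {
    {
        git version
        git -C "$1" config --get pack.window || echo "pack.window=default"
        git -C "$1" config --get pack.depth  || echo "pack.depth=default"
        cat "$1/.git/shallow" 2>/dev/null || true
        git -C "$1" cat-file --batch-all-objects \
            --batch-check='%(objectname) %(objecttype) %(objectsize)'
    } | sha1sum | cut -d' ' -f1
}

k1=$(resume_key "$tmp/repo")

# Changing a config knob that affects packing invalidates the key, so
# a pack cached under $k1 would never be wrongly resumed.
git -C "$tmp/repo" config pack.window 5
k2=$(resume_key "$tmp/repo")
[ "$k1" != "$k2" ] && echo "key invalidated"
```

A server could keep generated packs in a directory keyed by this hash to serve resumed requests instantly, along the lines of the caching Duy suggests.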
* Re: How to resume a broken clone?
From: Shawn Pearce @ 2013-11-28 19:15 UTC (permalink / raw)
To: Jeff King
Cc: Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 1:29 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:
>> You're welcome to step up and do it. Off the top of my head there are
>> a few options:
>>
>>  - better integration with git bundles: provide a way to seamlessly
>>    create/fetch/resume the bundles with "git clone" and "git fetch"

We have been thinking about formalizing the /clone.bundle hack used by
repo on Android. If the server has the bundle, add a capability in the
refs advertisement saying it is available, and the clone client can
first fetch $URL/clone.bundle.

For most Git repositories the bundle can be constructed by saving the
bundle reference header into a file, e.g.
$GIT_DIR/objects/pack/pack-$NAME.bh, at the same time the pack is
created. The bundle can be served by combining the .bh and .pack
streams onto the network. It is very little additional disk overhead
for the origin server, but allows resumable clone, provided the server
has not done a GC.

> I posted patches for this last year. One of the things that I got hung
> up on was that I spooled the bundle to disk, and then cloned from it.
> Which meant that you needed twice the disk space for a moment.

I don't think this is a huge concern. In many cases the checked-out
copy of the repository approaches a sizable fraction of the .pack
itself. If you don't have 2x .pack disk available at clone time, you
may be in trouble anyway as you try to work with the repository
post-clone.

> I wanted to teach index-pack to "--fix-thin" a pack that was already
> on disk, so that we could spool to disk, and then finalize it without
> making another copy.

Don't you need to separate the bundle header from the pack data before
you do this? If the bundle is only used at clone time, there is no
--fix-thin step.

> One of the downsides of this approach is that it requires the repo
> provider (or somebody else) to provide the bundle. I think that is
> something that a big site like GitHub would do (and probably push the
> bundles out to a CDN, too, to make getting them faster). But it's not
> a universal solution.

See above; I think you can reasonably do the /clone.bundle
automatically on any HTTP server. Big sites might choose to have
/clone.bundle do a redirect into a caching CDN that fills itself by
going to the application servers to obtain the current data. This is
what we do for Android.

>> - stabilize pack order, so we can resume downloading a pack
>
> I'm thinking that the server would report some opaque hash representing
> the current pack. The client would record that, along with the number
> of pack bytes it received. If the transfer is interrupted, the client
> comes back with the hash/bytes pair. The server starts to generate the
> pack, checks whether the hash matches, and if so, says "here is the
> same pack, resuming at byte X".

An important part of this is that the want set must be identical to the
prior request. It is entirely possible the branch tips have advanced
since the prior packing attempt started.

> What would need to go into such a hash? It would need to represent the
> exact bytes that will go into the pack, but without actually generating
> those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
> base (if applicable), length> for each object would be enough.

I don't think length is sufficient. The repository could have
recompressed an object with the same length but a different libz
encoding. I wonder if loose object recompression is reliable enough
about libz encoding to resume in the middle of an object? Is it just
based on the libz version?

You may need to include information about the source of the object,
e.g. the trailing 20-byte hash in the source pack file.
* Re: How to resume a broken clone?
From: Jeff King @ 2013-12-04 20:08 UTC (permalink / raw)
To: Shawn Pearce
Cc: Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:

> >> - better integration with git bundles: provide a way to seamlessly
> >>   create/fetch/resume the bundles with "git clone" and "git fetch"
>
> We have been thinking about formalizing the /clone.bundle hack used by
> repo on Android. If the server has the bundle, add a capability in the
> refs advertisement saying it is available, and the clone client can
> first fetch $URL/clone.bundle.

Yes, that was going to be my next step after getting the bundle fetch
support in.

If we are going to do this, though, I'd really love for it to not be
"hey, fetch .../clone.bundle from me", but a full-fledged "here are
full URLs of my mirrors". Then you can redirect a non-http cloner to
http to grab the bundle. Or redirect them to a CDN. Or even somebody
else's server entirely (e.g., "go fetch from Linus first, my piddly
server cannot feed you the whole kernel"). Some of the redirects you
can do by issuing an http redirect to "/clone.bundle", but the
cross-protocol ones are tricky.

If we advertise it as a blob in a specialized ref (e.g.,
"refs/mirrors"), it does not add much overhead over a simple
capability. There are a few extra round trips to actually fetch the
blob (the client sends a want and no haves, then the server sends the
pack), but I think that's negligible when we are talking about
redirecting a full clone. In either case, we have to hang up the
original connection, fetch the mirror, and then come back.

> For most Git repositories the bundle can be constructed by saving the
> bundle reference header into a file, e.g.
> $GIT_DIR/objects/pack/pack-$NAME.bh, at the same time the pack is
> created. The bundle can be served by combining the .bh and .pack
> streams onto the network. It is very little additional disk overhead
> for the origin server,

That's clever. It does not work out of the box if you are using
alternates, but I think it could be adapted in certain situations.
E.g., if you layer the pack so that one "base" repo always has its full
pack at the start, which is something we're already doing at GitHub.

> but allows resumable clone, provided the server has not done a GC.

As an aside, the current transfer-resuming code in http.c is
questionable. It does not use etags or any sort of invalidation
mechanism, but just assumes hitting the same URL will give the same
bytes. That _usually_ works for dumb fetching of objects and packfiles,
though it is possible for a pack to change representation without
changing name. My bundle patches inherited the same flaw, but it is
much worse there, because your URL may very well just be "clone.bundle"
that gets updated periodically.

> > I posted patches for this last year. One of the things that I got
> > hung up on was that I spooled the bundle to disk, and then cloned
> > from it. Which meant that you needed twice the disk space for a
> > moment.
>
> I don't think this is a huge concern. In many cases the checked-out
> copy of the repository approaches a sizable fraction of the .pack
> itself. If you don't have 2x .pack disk available at clone time, you
> may be in trouble anyway as you try to work with the repository
> post-clone.

Yeah, in retrospect I was being stupid to let that hold it up. I'll
revisit the patches (I've rebased them forward over the past year, so
it shouldn't be too bad).

> > I wanted to teach index-pack to "--fix-thin" a pack that was already
> > on disk, so that we could spool to disk, and then finalize it
> > without making another copy.
>
> Don't you need to separate the bundle header from the pack data before
> you do this?

Yes, though it isn't hard. We have to fetch part of the bundle header
into memory during discover_refs(), since that is when we realize we
are getting a bundle and not just the refs. From there you can spool
the bundle header to disk, and then the packfile separately. My
original implementation did that, though I don't remember if that one
got posted to the list (after realizing that I couldn't just
"--fix-thin" directly, I simplified it to just spool the whole thing to
a single file).

> If the bundle is only used at clone time, there is no --fix-thin step.

Yes, for the particular use case of a clone mirror, you wouldn't need
to --fix-thin. But I think "git fetch https://example.com/foo.bundle"
should work in the general case (and it does with my patches).

> See above; I think you can reasonably do the /clone.bundle
> automatically on any HTTP server.

Yeah, the ".bh" trick you mentioned is low enough impact to the server
that we could just unconditionally make it part of the repack.

> I don't think length is sufficient. The repository could have
> recompressed an object with the same length but a different libz
> encoding. [...]
>
> You may need to include information about the source of the object,
> e.g. the trailing 20-byte hash in the source pack file.

Yeah, I think you're right that it's too flaky without recording the
source. At any rate, I think I prefer the bundle approach you mentioned
above. It solves the same problem, and is a lot more flexible (e.g.,
for offloading to other servers).

-Peff
* Re: How to resume broke clone ? 2013-12-04 20:08 ` Jeff King @ 2013-12-05 6:50 ` Shawn Pearce 2013-12-05 13:21 ` Michael Haggerty 2013-12-05 16:04 ` Jeff King 0 siblings, 2 replies; 36+ messages in thread From: Shawn Pearce @ 2013-12-05 6:50 UTC (permalink / raw) To: Jeff King Cc: Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List On Wed, Dec 4, 2013 at 12:08 PM, Jeff King <peff@peff.net> wrote: > On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote: > >> >> - better integration with git bundles, provide a way to seamlessly >> >> create/fetch/resume the bundles with "git clone" and "git fetch" >> >> We have been thinking about formalizing the /clone.bundle hack used by >> repo on Android. If the server has the bundle, add a capability in the >> refs advertisement saying its available, and the clone client can >> first fetch $URL/clone.bundle. > > Yes, that was going to be my next step after getting the bundle fetch > support in. Yay! > If we are going to do this, though, I'd really love for it > to not be "hey, fetch .../clone.bundle from me", but a full-fledged > "here are full URLs of my mirrors". Ack. I agree completely. > Then you can redirect a non-http cloner to http to grab the bundle. Or > redirect them to a CDN. Or even somebody else's server entirely (e.g., > "go fetch from Linus first, my piddly server cannot feed you the whole > kernel"). Some of the redirects you can do by issuing an http redirect > to "/clone.bundle", but the cross-protocol ones are tricky. Ack. My thoughts exactly. Especially the part of "my piddly server shouldn't have to serve you a clone of Linus' tree when there are many public hosts mirroring his code available to anyone". It is simply not fair to clone Linus' tree off some guy's home ADSL connection, his uplink probably sucks. But it is reasonable to fetch his incremental delta after cloning from some other well known and well connected source. 
> If we advertise it as a blob in a specialized ref (e.g., "refs/mirrors") > it does not add much overhead over a simple capability. There are a few > extra round trips to actually fetch the blob (client sends a want and no > haves, then server sends the pack), but I think that's negligible when > we are talking about redirecting a full clone. In either case, we have > to hang up the original connection, fetch the mirror, and then come > back. I wasn't thinking about using a "well known blob" for this. Jonathan, Dave, Colby and I were kicking this idea around on Monday during lunch. If the initial ref advertisement included a "mirrors" capability the client could respond with "want mirrors" instead of the usual want/have negotiation. The server could then return the mirror URLs as pkt-lines, one per pkt. Its one extra RTT, but this is trivial compared to the cost to really clone the repository. These pkt-lines need to be a bit more than just URL. Or we need a new URL like "bundle:http://...." to denote a resumable bundle over HTTP vs. a normal HTTP URL that might not be a bundle file, and is just a better connected server. The mirror URLs could be stored in $GIT_DIR/config as a simple multi-value variable. Unfortunately that isn't easily remotely editable. But I am not sure I care? GitHub doesn't let you edit $GIT_DIR/config, but it doesn't need to. For most repositories hosted at GitHub, GitHub is probably the best connected server for that repository. For repositories that are incredibly high traffic GitHub might out of its own interest want to configure mirror URLs on some sort of CDN to distribute the network traffic closer to the edges. Repository owners just shouldn't have to worry about these sorts of details. It should be managed by the hosting service. In my case for android.googlesource.com we want bundles on the CDN near the network edges, and our repository owners don't care to know the details of that. 
They just want our server software to make it all happen, and our servers already manage $GIT_DIR/config for them. They also mostly manage /clone.bundle on the CDN. And /clone.bundle is an ugly, limited hack.

For the average home user sharing their working repository over git:// from their home ADSL or cable connection, editing .git/config is easier than a blob in refs/mirrors. They already know how to edit .git/config to manage remotes. Heck, remote.origin.url might already be a good mirror address to advertise, especially if the client isn't on the same /24 as the server and the remote.origin.url is something like "git.kernel.org". :-)

>> For most Git repositories the bundle can be constructed by saving the
>> bundle reference header into a file, e.g.
>> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
>> created. The bundle can be served by combining the .bh and .pack
>> streams onto the network. It is very little additional disk overhead
>> for the origin server,
>
> That's clever. It does not work out of the box if you are using
> alternates, but I think it could be adapted in certain situations. E.g.,
> if you layer the pack so that one "base" repo always has its full pack
> at the start, which is something we're already doing at GitHub.

Yes, well, I was assuming the pack was a fully connected repack. Using alternates always creates a partial pack. But if you have an alternate, maybe that alternate should be given as a mirror URL? And allow the client to recurse the alternate mirror URL list too? By listing the alternate as a mirror, a client could discover the resumable clone bundle in the alternate and grab that first to bootstrap, reducing the amount it has to obtain in a non-resumable way.

Or... the descendant repository could offer its own bundle with the "must have" assertions from the alternate at the time it repacked. So the .bh file would have a number of ^ lines and the bundle would be built with a "--not ..." list.

>> but allows resumable clone, provided the server has not done a GC.
>
> As an aside, the current transfer-resuming code in http.c is
> questionable. It does not use etags or any sort of invalidation
> mechanism, but just assumes hitting the same URL will give the same
> bytes.

Yeah, our lunch conversation eventually reached this part too. repo's /clone.bundle hack is equally stupid and assumes a resume will get the correct data, with no validation. If you resume with the wrong data while inside the pack stream, the pack will be invalid; the SHA-1 trailer won't match. But you won't know until you have downloaded the entire useless file. Resuming a 700M download after the first 10M only to find out the first 10M is mismatched sucks. What really got us worried was that the bundle header has no checksums, and a resume in the bundle header from the wrong version could be interesting.

> That _usually_ works for dumb fetching of objects and packfiles,
> though it is possible for a pack to change representation without
> changing name.

Yes. And this is why the packfile name algorithm is horribly flawed. I keep saying we should change it to name the pack using the last 20 bytes of the file, but ... nobody has written the patch for that? :-)

> My bundle patches inherited the same flaw, but it is much worse there,
> because your URL may very well just be "clone.bundle" that gets updated
> periodically.

Yup, you followed the same thing we did in repo, which is horribly wrong. We should try to use ETag if available to safely resume, and we should encourage people to use stronger names when pointing to URLs that are resumable, like a bundle on a CDN. If the URL is offered by the server in pkt-lines after the advertisement, it's easy for the server to return the current CDN URL, and easy for the server to implement enforcement of the URLs being unique. Especially if you manage the CDN automatically; e.g. Android uses tools to build the CDN files and push them out. It's easy for us to ensure these have unique URLs on every push. A bundling server bundling once a day or once a week could simply date-stamp each run.

>> > I posted patches for this last year. One of the things that I got hung
>> > up on was that I spooled the bundle to disk, and then cloned from it.
>> > Which meant that you needed twice the disk space for a moment.
>>
>> I don't think this is a huge concern. In many cases the checked out
>> copy of the repository approaches a sizable fraction of the .pack
>> itself. If you don't have 2x .pack disk available at clone time you
>> may be in trouble anyway as you try to work with the repository post
>> clone.
>
> Yeah, in retrospect I was being stupid to let that hold it up. I'll
> revisit the patches (I've rebased them forward over the past year, so it
> shouldn't be too bad).

I keep prodding Jonathan to work on this too, because I'd really like to get this out of repo and just have it be something git knows how to do. And bigger mirrors like git.kernel.org could do a quick grep/sort/uniq -c through their access logs and periodically bundle up a few repositories that are cloned often. E.g. we all know git.kernel.org should just bundle Linus' repository.

^ permalink raw reply	[flat|nested] 36+ messages in thread
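The failure mode described above, where a bad resume is only detectable once the whole file has arrived, follows from the pack format: the only whole-file integrity check is a 20-byte SHA-1 trailer over everything that precedes it. A toy sketch of that check (the pack bytes here are fabricated; a real pack starts with a "PACK" header and compressed object data):

```python
import hashlib

def pack_trailer_ok(pack_bytes):
    # A pack's last 20 bytes are the SHA-1 of everything before them.
    if len(pack_bytes) < 20:
        return False
    return hashlib.sha1(pack_bytes[:-20]).digest() == pack_bytes[-20:]

body = b"PACK...pretend object data..."
good = body + hashlib.sha1(body).digest()

# Splice in one wrong byte, as a resume against changed content would:
bad = b"X" + good[1:]

print(pack_trailer_ok(good))  # True
print(pack_trailer_ok(bad))   # False, but only after reading the whole file
```

Nothing in the middle of the stream can catch the mismatch earlier, which is why resuming against a URL whose content may have changed needs a validator such as an ETag or a content-derived name.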
* Re: How to resume broke clone ? 2013-12-05 6:50 ` Shawn Pearce @ 2013-12-05 13:21 ` Michael Haggerty 2013-12-05 15:11 ` Shawn Pearce 2013-12-05 16:12 ` Jeff King 2013-12-05 16:04 ` Jeff King 1 sibling, 2 replies; 36+ messages in thread From: Michael Haggerty @ 2013-12-05 13:21 UTC (permalink / raw) To: Shawn Pearce Cc: Jeff King, Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List This discussion has mostly been about letting small Git servers delegate the work of an initial clone to a beefier server. I haven't seen any explicit mention of the inverse: Suppose a company has a central Git server that is meant to be the "single source of truth", but has worldwide offices and wants to locate bootstrap mirrors in each office. The end users would not even want to know that there are multiple servers. Hosters like GitHub might also encourage their big customers to set up bootstrap mirror(s) in-house to make cloning faster for their users while reducing internet traffic and the burden on their own infrastructure. The goal would be to make the system transparent to users and easily reconfigurable as circumstances change. One alternative would be to ask users to clone from their local mirror. The local mirror would give them whatever it has, then do the equivalent of a permanent redirect to tell the client "from now on, use the central server" to get the rest of the initial clone and for future fetches. But this would require users to know which mirror is "local". A better alternative would be to ask users to clone from the central server. In this case, the central server would want to tell the clients to grab what they can from their local bootstrap mirror and then come back to the central server for any remainders. The trick is that which bootstrap mirror is "local" would vary from client to client. 
I suppose that this could be implemented using what you have discussed by having the central server direct the client to a URL that resolves differently for different clients, CDN-like. Alternatively, the central Git server could itself look where a request is coming from and use some intelligence to redirect the client to the closest bootstrap mirror from its own list. Or the server could pass the client a list of known mirrors, and the client could try to determine which one is closest (and reachable!). I'm not sure that this idea is interesting, but I just wanted to throw it out there as a related use case that seems a bit different than what you have been discussing. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: How to resume broke clone ? 2013-12-05 13:21 ` Michael Haggerty @ 2013-12-05 15:11 ` Shawn Pearce 2013-12-05 16:12 ` Jeff King 1 sibling, 0 replies; 36+ messages in thread From: Shawn Pearce @ 2013-12-05 15:11 UTC (permalink / raw) To: Michael Haggerty Cc: Jeff King, Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Dec 5, 2013 at 5:21 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> This discussion has mostly been about letting small Git servers delegate
> the work of an initial clone to a beefier server. I haven't seen any
> explicit mention of the inverse:
>
> Suppose a company has a central Git server that is meant to be the
> "single source of truth", but has worldwide offices and wants to locate
> bootstrap mirrors in each office. The end users would not even want to
> know that there are multiple servers. Hosters like GitHub might also
> encourage their big customers to set up bootstrap mirror(s) in-house to
> make cloning faster for their users while reducing internet traffic and
> the burden on their own infrastructure. The goal would be to make the
> system transparent to users and easily reconfigurable as circumstances
> change.

I think there is a different way to do that. Build a caching Git proxy server. And teach Git clients to use it.

One idea we had at $DAY_JOB a couple of years ago was to build a daemon that sat in the background and continuously fetched content from repository upstreams. We made it efficient by modifying the Git protocol to use a hanging network socket, and the upstream server would broadcast pack files down these hanging streams as pushes were received. The original intent was for an Android developer to be able to have his working tree forest of 500 repositories subscribe to our internal server's broadcast stream. We figured if the server knows exactly which refs every client has, because they all have the same ones, and their streams are all still open and active, then the server can make exactly one incremental thin pack and send the same copy to every client. It's "just" a socket write problem. Instead of packing the same stuff 100x for 100 clients, it's packed once and sent 100x.

Then we realized remote offices could also install this software on a local server, and use this as a fan-out distributor within the LAN. We were originally thinking about some remote offices on small Internet connections, where delivery of 10 MiB x 20 was a lot but delivery of 10 MiB once and local fan-out on the Ethernet was easy. The JGit patches for this work are still pending[1].

If clients had a local Git-aware cache server in their office and ~/.gitconfig had the address of it, your problem becomes simple. Clients clone from the public URL, e.g. GitHub, but the local cache server first gives the client a URL to clone from itself. After that is complete, the client can fetch from the upstream. The cache server can be self-maintaining, watching its requests to see what is accessed often-ish, and keeping those repositories current-ish locally by running git fetch itself in the background.

It's easy to do this with bundles on "CDN" like HTTP. Just use the office's caching HTTP proxy server, assuming its cache is big enough for those large Git bundle payloads, and the viral cat videos. But you are at the mercy of the upstream bundler rebuilding the bundles. And refetching them in whole. Neither of which is great.

A simple self-contained server that doesn't accept pushes, but knows how to clone repositories, fetch them periodically, and run `git gc`, works well. And the mirror URL extension we have been discussing in this thread would work fine here. The cache server can return URLs that point to itself. Or flat out proxy the Git transaction with the origin server.
[1] https://git.eclipse.org/r/#/q/owner:wetherbeei%2540google.com+status:open,n,z ^ permalink raw reply [flat|nested] 36+ messages in thread
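The pack-once/fan-out idea above reduces to building the payload a single time and writing the same bytes to every subscriber stream. A minimal in-process sketch of that shape (socketpair stands in for real client connections; this is not the JGit implementation):

```python
import socket

def broadcast(build_pack, subscriber_conns):
    payload = build_pack()        # pack exactly once...
    for conn in subscriber_conns:
        conn.sendall(payload)     # ...then it's "just" a socket write per client
    return payload

# Three in-process "clients" subscribed to the stream:
pairs = [socket.socketpair() for _ in range(3)]

calls = []
def build_pack():
    calls.append(1)               # count how many times we actually pack
    return b"THIN-PACK-BYTES"

broadcast(build_pack, [srv for srv, _ in pairs])
print(len(calls))                                                    # 1
print(all(cli.recv(64) == b"THIN-PACK-BYTES" for _, cli in pairs))   # True
```

The saving is exactly the one described: the expensive step (delta search and packing) runs once, while the per-client cost drops to a buffer copy.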
* Re: How to resume broke clone ? 2013-12-05 13:21 ` Michael Haggerty 2013-12-05 15:11 ` Shawn Pearce @ 2013-12-05 16:12 ` Jeff King 1 sibling, 0 replies; 36+ messages in thread From: Jeff King @ 2013-12-05 16:12 UTC (permalink / raw) To: Michael Haggerty Cc: Shawn Pearce, Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Dec 05, 2013 at 02:21:09PM +0100, Michael Haggerty wrote:
> A better alternative would be to ask users to clone from the central
> server. In this case, the central server would want to tell the clients
> to grab what they can from their local bootstrap mirror and then come
> back to the central server for any remainders. The trick is that which
> bootstrap mirror is "local" would vary from client to client.
>
> I suppose that this could be implemented using what you have discussed
> by having the central server direct the client to a URL that resolves
> differently for different clients, CDN-like. Alternatively, the central
> Git server could itself look where a request is coming from and use some
> intelligence to redirect the client to the closest bootstrap mirror from
> its own list. Or the server could pass the client a list of known
> mirrors, and the client could try to determine which one is closest (and
> reachable!).

Exactly. I think this will mostly happen via CDN, but I had also envisioned that the server could add metadata to a list of possible mirrors, like:

  [mirror "ko-us"]
  url = http://git.us.kernel.org/...
  zone = us

  [mirror "ko-cn"]
  url = http://git.cn.kernel.org/...
  zone = cn

If the "zone" keys follow a micro-format convention, then the client knows that it prefers "cn" over "us" (either on the command line, or a local config option in ~/.gitconfig).

The biggest problem with all of this is that the server has to know about the mirrors. If you want to set up an in-house mirror for something hosted on GitHub, but it's only available to people in your company, then GitHub would not want to advertise it.
You need some way to tell your clients about the mirror (and that is the inverse-mirror "fetch from the mirror, which tells you it is just a bootstrap and to now switch to the real repo" scheme that I think you were describing earlier). -Peff ^ permalink raw reply [flat|nested] 36+ messages in thread
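The mirror microformat Peff sketches above is plain git-config syntax, so a client-side chooser is only a few lines. A hypothetical sketch (pick_mirror and the zone-preference list are invented names; Python's configparser stands in for git's own config reader, and the elided URLs are kept as placeholders):

```python
import configparser

# The [mirror] blocks from the message above, verbatim:
MIRRORS = """
[mirror "ko-us"]
url = http://git.us.kernel.org/...
zone = us

[mirror "ko-cn"]
url = http://git.cn.kernel.org/...
zone = cn
"""

def pick_mirror(config_text, zone_prefs):
    cp = configparser.ConfigParser()
    cp.read_string(config_text)
    mirrors = [(cp[s]["zone"], cp[s]["url"])
               for s in cp.sections() if s.startswith('mirror "')]
    for zone in zone_prefs:                    # first preferred zone wins
        for z, url in mirrors:
            if z == zone:
                return url
    return mirrors[0][1] if mirrors else None  # fall back to any mirror

print(pick_mirror(MIRRORS, ["cn", "us"]))  # http://git.cn.kernel.org/...
```

The zone preference order would come from the user's ~/.gitconfig or the command line, exactly as suggested in the message.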
* Re: How to resume broke clone ? 2013-12-05 6:50 ` Shawn Pearce 2013-12-05 13:21 ` Michael Haggerty @ 2013-12-05 16:04 ` Jeff King 2013-12-05 18:01 ` Junio C Hamano 2013-12-05 20:28 ` [PATCH] pack-objects: name pack files after trailer hash Jeff King 1 sibling, 2 replies; 36+ messages in thread From: Jeff King @ 2013-12-05 16:04 UTC (permalink / raw) To: Shawn Pearce Cc: Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Wed, Dec 04, 2013 at 10:50:27PM -0800, Shawn Pearce wrote:
> I wasn't thinking about using a "well known blob" for this.
>
> Jonathan, Dave, Colby and I were kicking this idea around on Monday
> during lunch. If the initial ref advertisement included a "mirrors"
> capability the client could respond with "want mirrors" instead of the
> usual want/have negotiation. The server could then return the mirror
> URLs as pkt-lines, one per pkt. Its one extra RTT, but this is trivial
> compared to the cost to really clone the repository.

I don't think this is any more or less efficient than the blob scheme. In both cases, the client sends a single "want" line and no "have" lines, and then the server responds with the output (either pkt-lines, or a single-blob pack).

What I like about the blob approach is:

1. It requires zero extra code on the server. This makes implementation simple, but also means you can deploy it on existing servers (or even on non-pkt-line servers like dumb http).

2. It's very debuggable from the client side. You can fetch the blob, look at it, and decide which mirror you want outside of git if you want to (true, you can teach the git client to dump the pkt-line URLs, too, but that's extra code). You could even do this with an existing git client that has not yet learned about the mirror redirect.

3. It removes any size or structure limits that the protocol imposes (I was planning to use git-config format for the blob itself). The URLs themselves aren't big, but we may want to annotate them with metadata. You mentioned "this is a bundle" versus "this is a regular http server" below. You might also want to provide network location information (e.g., "this is a good mirror if you are in Asia"), though for the most part I'd expect that to happen magically via CDN. When we discussed this before, the concept came up of offering not just a clone bundle, but "slices" of history (as thin-pack bundles), so that a fetch could grab a sequence of resumable slices, starting with what they have, and then topping off with a true fetch. You would want to provide the start and end points of each slice.

4. You can manage it remotely via the git protocol (more discussion below).

5. A clone done with "--mirror" will actually propagate the mirror file automatically.

What are the advantages of the pkt-line approach? The biggest one I can think of is that it does not pollute the refs namespace. While (5) is convenient in some cases, it would make it more of a pain if you are trying to keep a clone mirror up to date, but do _not_ want to pass along upstream's mirror file. You may want to have a server implementation that offers a dynamic mirror, rather than a true object we have in the ODB. That is possible with a mirror blob, but is slightly harder (you have to fake the object rather than just dumping a line).

> These pkt-lines need to be a bit more than just URL. Or we need a new
> URL like "bundle:http://...." to denote a resumable bundle over HTTP
> vs. a normal HTTP URL that might not be a bundle file, and is just a
> better connected server.

Right, I think that's the most critical one (though you could also just use the convention of ".bundle" in the URL). I think we may want to leave room for more metadata, though.

> The mirror URLs could be stored in $GIT_DIR/config as a simple
> multi-value variable. Unfortunately that isn't easily remotely
> editable. But I am not sure I care?

For big sites that manage the bundles on behalf of the user, I don't think it is an issue.
For somebody running their own small site, I think it is a useful way of moving the data to the server.

> For the average home user sharing their working repository over git://
> from their home ADSL or cable connection, editing .git/config is
> easier than a blob in refs/mirrors. They already know how to edit
> .git/config to manage remotes.

Yes, but it's editing .git/config on the server, not on the client, which may be slightly harder for some people. I do think we'd want some tool support on the client side. git-config recently learned to read from a blob. The next step is:

  git config --blob=refs/mirrors --edit

or

  git config --blob=refs/mirrors mirror.ko.url git://git.kernel.org/...
  git config --blob=refs/mirrors mirror.ko.bundle true

We can't add tool support for editing .git/config on the server side, because the method for doing so isn't standard.

> Heck, remote.origin.url might already
> be a good mirror address to advertise, especially if the client isn't
> on the same /24 as the server and the remote.origin.url is something
> like "git.kernel.org". :-)

You could have a "git-advertise-upstream" that generates a mirror blob from your remotes config and pushes it to your publishing point. That may be overkill, but I don't think it's possible with a .git/config-based solution.

> > That's clever. It does not work out of the box if you are using
> > alternates, but I think it could be adapted in certain situations. E.g.,
> > if you layer the pack so that one "base" repo always has its full pack
> > at the start, which is something we're already doing at GitHub.
>
> Yes, well, I was assuming the pack was a fully connected repack.
> Alternates always creates a partial pack. But if you have an
> alternate, that alternate maybe should be given as a mirror URL? And
> allow the client to recurse the alternate mirror URL list too?

The problem for us is not that we have a partial pack, but that the alternates pack has a lot of other junk in it.
A linux.git clone is 650MB or so. The packfile for all of the linux.git forks together on GitHub is several gigabytes. > What really got us worried was the bundle header has no checksums, and > a resume in the bundle header from the wrong version could be > interesting. The bundle header is small enough that you should just throw it away if you didn't get the whole thing (IIRC, that is what my patches do, because it does not do _anything_ until we receive the whole ref advertisement, at which point we decide if it is smart, dumb, or a bundle). > Yes. And this is why the packfile name algorithm is horribly flawed. I > keep saying we should change it to name the pack using the last 20 > bytes of the file but ... nobody has written the patch for that? :-) Totally agree. I think we could also get rid of the horrible hacks in repack where we pack to a tempfile, then have to do another tempfile dance (which is not atomic!) to move the same-named packfile out of the way. If the name were based on the content, we could just throw away our new pack if one of the same name is already there (just like we do for loose objects). I haven't looked at making such a patch, but I think it shouldn't be too complicated. My big worry would be weird fallouts from some hidden part of the code that we don't realize is depending on the current naming scheme. :) -Peff ^ permalink raw reply [flat|nested] 36+ messages in thread
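The repack simplification Peff describes falls out naturally once names are content-derived: if a pack of the same name already exists, its bytes must be identical, so the freshly written one can simply be dropped. A hypothetical sketch of that install step (install_pack is an invented helper, not a real git function):

```python
import os
import tempfile

def install_pack(pack_dir, name, data):
    # With content-based names, same name implies same bytes.
    final = os.path.join(pack_dir, "pack-%s.pack" % name)
    if os.path.exists(final):
        return final            # keep the old file; it may already be mmap'd
    fd, tmp = tempfile.mkstemp(dir=pack_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.rename(tmp, final)       # atomic on POSIX; no rename dance needed
    return final

d = tempfile.mkdtemp()
first = install_pack(d, "abc123", b"pack bytes")
second = install_pack(d, "abc123", b"pack bytes")   # no-op: already present
print(first == second)          # True
print(len(os.listdir(d)))       # 1
```

This is exactly the policy already used for loose objects: an existing file with the right name is trusted, and the writer's copy is discarded.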
* Re: How to resume broke clone ? 2013-12-05 16:04 ` Jeff King @ 2013-12-05 18:01 ` Junio C Hamano 2013-12-05 19:08 ` Jeff King 0 siblings, 1 reply; 36+ messages in thread From: Junio C Hamano @ 2013-12-05 18:01 UTC (permalink / raw) To: Jeff King Cc: Shawn Pearce, Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List

Jeff King <peff@peff.net> writes:

> Right, I think that's the most critical one (though you could also just
> use the convention of ".bundle" in the URL). I think we may want to
> leave room for more metadata, though.

Good. I like this line of thinking.

>> Heck, remote.origin.url might already
>> be a good mirror address to advertise, especially if the client isn't
>> on the same /24 as the server and the remote.origin.url is something
>> like "git.kernel.org". :-)
>
> You could have a "git-advertise-upstream" that generates a mirror blob
> from your remotes config and pushes it to your publishing point. That
> may be overkill, but I don't think it's possible with a
> .git/config-based solution.

I do not think I follow. The upload-pack service could be taught to pay attention to the uploadpack.advertiseUpstream config at runtime, advertise 'mirror' capability, and then respond with the list of remote.*.url it uses when asked (if we go with the pkt-line based approach).

Alternatively, it could also be taught to pay attention to the same config at runtime, create a blob to advertise the list of remote.*.url it uses and store it in refs/mirror (or do this purely in-core without actually writing to the refs/ namespace), and emit an entry for refs/mirror using that blob object name in the ls-remote part of the response (if we go with the magic blob based approach).

>> Yes. And this is why the packfile name algorithm is horribly flawed. I
>> keep saying we should change it to name the pack using the last 20
>> bytes of the file but ... nobody has written the patch for that? :-)
>
> Totally agree. I think we could also get rid of the horrible hacks in
> repack where we pack to a tempfile, then have to do another tempfile
> dance (which is not atomic!) to move the same-named packfile out of the
> way. If the name were based on the content, we could just throw away our
> new pack if one of the same name is already there (just like we do for
> loose objects).

Yay.

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: How to resume broke clone ? 2013-12-05 18:01 ` Junio C Hamano @ 2013-12-05 19:08 ` Jeff King 0 siblings, 0 replies; 36+ messages in thread From: Jeff King @ 2013-12-05 19:08 UTC (permalink / raw) To: Junio C Hamano Cc: Shawn Pearce, Duy Nguyen, zhifeng hu, Karsten Blees, Trần Ngọc Quân, Git Mailing List On Thu, Dec 05, 2013 at 10:01:28AM -0800, Junio C Hamano wrote: > > You could have a "git-advertise-upstream" that generates a mirror blob > > from your remotes config and pushes it to your publishing point. That > > may be overkill, but I don't think it's possible with a > > .git/config-based solution. > > I do not think I follow. The upload-pack service could be taught to > pay attention to the uploadpack.advertiseUpstream config at runtime, > advertise 'mirror' capability, and then respond with the list of > remote.*.url it uses when asked (if we go with the pkt-line based > approach). I was assuming a triangular workflow, where your publishing point (that other people will fetch from) does not know anything about the upstream. Like: $ git clone git://git.kernel.org/pub/scm/git/git.git $ hack hack hack; commit commit commit $ git remote add me myserver:/var/git/git.git $ git push me $ git advertise-upstream origin me If your publishing point is already fetching from another upstream, then yeah, I'd agree that dynamically generating it from the config is fine. > Alternatively, it could also be taught to pay attention > to the same config at runtime, create an blob to advertise the list > of remote.*.url it uses and store it in refs/mirror (or do this > purely in-core without actually writing to the refs/ namespace), and > emit an entry for refs/mirror using that blob object name in the > ls-remote part of the response (if we go with the magic blob based > approach). Yes. The pkt-line versus refs distinction is purely a protocol issue. 
You can do anything you want on the backend with either of them, including faking the ref (you can also accept fake pushes to refs/mirror, too, if you really want people to be able to upload that way). But it is worth considering what implementation difficulties we would run across in either case. Producing a fake refs/mirror blob that responds like a normal ref is more work than just dumping the lines. If we're always just going to generate it dynamically anyway, then we can save ourselves some effort. -Peff ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH] pack-objects: name pack files after trailer hash 2013-12-05 16:04 ` Jeff King 2013-12-05 18:01 ` Junio C Hamano @ 2013-12-05 20:28 ` Jeff King 2013-12-05 21:56 ` Shawn Pearce ` (2 more replies) 1 sibling, 3 replies; 36+ messages in thread From: Jeff King @ 2013-12-05 20:28 UTC (permalink / raw) To: Shawn Pearce; +Cc: Git Mailing List On Thu, Dec 05, 2013 at 11:04:18AM -0500, Jeff King wrote: > > Yes. And this is why the packfile name algorithm is horribly flawed. I > > keep saying we should change it to name the pack using the last 20 > > bytes of the file but ... nobody has written the patch for that? :-) > > Totally agree. I think we could also get rid of the horrible hacks in > repack where we pack to a tempfile, then have to do another tempfile > dance (which is not atomic!) to move the same-named packfile out of the > way. If the name were based on the content, we could just throw away our > new pack if one of the same name is already there (just like we do for > loose objects). > > I haven't looked at making such a patch, but I think it shouldn't be too > complicated. My big worry would be weird fallouts from some hidden part > of the code that we don't realize is depending on the current naming > scheme. :) So here's the first part, that actually changes the name. It passes the test suite, so it must be good, right? And just look at that diffstat. This actually applies on top of e74435a (sha1write: make buffer const-correct, 2013-10-24), which is on another topic. Since the sha1 parameter to write_idx_file is no longer used as both an in- and out- parameter (yuck), we can make it const. The second half would be to simplify git-repack. The current behavior is to replace the old packfile with a tricky rename dance. Which is still correct, but overly complicated. 
We should be able to just drop the new packfile, since we know the bytes are identical (or rename the new one over the old, though I think keeping the old is probably kinder to the disk cache, especially if another process already has it mmap'd).

-- >8 --
Subject: pack-objects: name pack files after trailer hash

Our current scheme for naming packfiles is to calculate the sha1 hash of the sorted list of objects contained in the packfile. This gives us a unique name, so we are reasonably sure that two packs with the same name will contain the same objects.

It does not, however, tell us that two such packs have the exact same bytes. This makes things awkward if we repack the same set of objects. Due to run-to-run variations, the bytes may not be identical (e.g., changed zlib or git versions, different source object reuse due to new packs in the repository, or even different deltas due to races during a multi-threaded delta search).

In theory, this could be helpful to a program that cares that the packfile contains a certain set of objects, but does not care about the particular representation. In practice, no part of git makes use of that, and in many cases it is potentially harmful. For example, if a dumb http client fetches the .idx file, it must be sure to get the exact .pack that matches it. Similarly, a partial transfer of a .pack file cannot be safely resumed, as the actual bytes may have changed. This could also affect a local client which opened the .idx and .pack files, closes the .pack file (due to memory or file descriptor limits), and then re-opens a changed packfile.

In all of these cases, git can detect the problem, as we have the sha1 of the bytes themselves in the pack trailer (which we verify on transfer), and the .idx file references the trailer from the matching packfile. But it would be simpler and more efficient to actually get the correct bytes, rather than noticing the problem and having to restart the operation.

This patch simply uses the pack trailer sha1 as the pack name. It should be similarly unique, but covers the exact representation of the objects. Other parts of git should not care, as the pack name is returned by pack-objects and is essentially opaque. One test needs to be updated, because it actually corrupts a pack and expects that re-packing the corrupted bytes will use the same name. It won't anymore, but we can easily just use the name that pack-objects hands back.

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-write.c          | 8 +-------
 pack.h                | 2 +-
 t/t5302-pack-index.sh | 4 ++--
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/pack-write.c b/pack-write.c
index ca9e63b..ddc174e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -44,14 +44,13 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts)
  */
 const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects,
 			   int nr_objects, const struct pack_idx_option *opts,
-			   unsigned char *sha1)
+			   const unsigned char *sha1)
 {
 	struct sha1file *f;
 	struct pack_idx_entry **sorted_by_sha, **list, **last;
 	off_t last_obj_offset = 0;
 	uint32_t array[256];
 	int i, fd;
-	git_SHA_CTX ctx;
 	uint32_t index_version;
 
 	if (nr_objects) {
@@ -114,9 +113,6 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	}
 	sha1write(f, array, 256 * 4);
 
-	/* compute the SHA1 hash of sorted object names. */
-	git_SHA1_Init(&ctx);
-
 	/*
 	 * Write the actual SHA1 entries..
 	 */
@@ -128,7 +124,6 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 			sha1write(f, &offset, 4);
 		}
 		sha1write(f, obj->sha1, 20);
-		git_SHA1_Update(&ctx, obj->sha1, 20);
 		if ((opts->flags & WRITE_IDX_STRICT) &&
 		    (i && !hashcmp(list[-2]->sha1, obj->sha1)))
 			die("The same object %s appears twice in the pack",
@@ -178,7 +173,6 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	sha1write(f, sha1, 20);
 	sha1close(f, NULL, ((opts->flags & WRITE_IDX_VERIFY)
 			    ? CSUM_CLOSE : CSUM_FSYNC));
-	git_SHA1_Final(sha1, &ctx);
 	return index_name;
 }
 
diff --git a/pack.h b/pack.h
index aa6ee7d..12d9516 100644
--- a/pack.h
+++ b/pack.h
@@ -76,7 +76,7 @@ struct pack_idx_entry {
 struct progress;
 
 typedef int (*verify_fn)(const unsigned char*, enum object_type, unsigned long, void*, int*);
-extern const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects, int nr_objects, const struct pack_idx_option *, unsigned char *sha1);
+extern const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects, int nr_objects, const struct pack_idx_option *, const unsigned char *sha1);
 extern int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, off_t offset, off_t len, unsigned int nr);
 extern int verify_pack_index(struct packed_git *);
 extern int verify_pack(struct packed_git *, verify_fn fn, struct progress *, uint32_t);
diff --git a/t/t5302-pack-index.sh b/t/t5302-pack-index.sh
index fe82025..4bbb718 100755
--- a/t/t5302-pack-index.sh
+++ b/t/t5302-pack-index.sh
@@ -174,11 +174,11 @@ test_expect_success \
 test_expect_success \
     '[index v1] 5) pack-objects happily reuses corrupted data' \
     'pack4=$(git pack-objects test-4 <obj-list) &&
-     test -f "test-4-${pack1}.pack"'
+     test -f "test-4-${pack4}.pack"'
 
 test_expect_success \
     '[index v1] 6) newly created pack is BAD !' \
-    'test_must_fail git verify-pack -v "test-4-${pack1}.pack"'
+    'test_must_fail git verify-pack -v "test-4-${pack4}.pack"'
 
 test_expect_success \
     '[index v2] 1) stream pack to repository' \
-- 
1.8.5.524.g6743da6

^ permalink raw reply related	[flat|nested] 36+ messages in thread
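The naming change in the patch above, in miniature: the old scheme hashed the sorted object names, so two byte-wise different packs of the same objects collided, while the trailer-based name distinguishes them. (Toy data; real object names are binary SHA-1s and real packs are not these strings.)

```python
import hashlib

def old_pack_name(object_names):
    ctx = hashlib.sha1()
    for name in sorted(object_names):   # sorted object list, as before
        ctx.update(name)
    return ctx.hexdigest()

def new_pack_name(pack_bytes):
    return pack_bytes[-20:].hex()       # trailer: SHA-1 of preceding bytes

objs = [b"\x01" * 20, b"\x02" * 20]

# Two different byte representations of the same objects:
pack_a = b"one delta choice"
pack_a += hashlib.sha1(pack_a).digest()
pack_b = b"another delta choice"
pack_b += hashlib.sha1(pack_b).digest()

print(old_pack_name(objs) == old_pack_name(objs[::-1]))   # True: bytes ignored
print(new_pack_name(pack_a) == new_pack_name(pack_b))     # False: bytes matter
```

Since the trailer is already the SHA-1 of the pack's own bytes, the new name costs nothing extra to compute and is exactly what a resuming client or dumb-http fetcher needs to validate.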
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-05 20:28 ` [PATCH] pack-objects: name pack files after trailer hash Jeff King @ 2013-12-05 21:56 ` Shawn Pearce 2013-12-05 22:59 ` Junio C Hamano 2013-12-16 7:41 ` Michael Haggerty 2 siblings, 0 replies; 36+ messages in thread From: Shawn Pearce @ 2013-12-05 21:56 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List On Thu, Dec 5, 2013 at 12:28 PM, Jeff King <peff@peff.net> wrote: > Subject: pack-objects: name pack files after trailer hash > > Our current scheme for naming packfiles is to calculate the > sha1 hash of the sorted list of objects contained in the > packfile. This gives us a unique name, so we are reasonably > sure that two packs with the same name will contain the same > objects. Yay-by: Shawn Pearce <spearce@spearce.org> > --- > pack-write.c | 8 +------- > pack.h | 2 +- > t/t5302-pack-index.sh | 4 ++-- > 3 files changed, 4 insertions(+), 10 deletions(-) Obviously this is correct given the diffstat. :-) ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-05 20:28 ` [PATCH] pack-objects: name pack files after trailer hash Jeff King 2013-12-05 21:56 ` Shawn Pearce @ 2013-12-05 22:59 ` Junio C Hamano 2013-12-06 22:18 ` Jeff King 2013-12-16 7:41 ` Michael Haggerty 2 siblings, 1 reply; 36+ messages in thread From: Junio C Hamano @ 2013-12-05 22:59 UTC (permalink / raw) To: Jeff King; +Cc: Shawn Pearce, Git Mailing List Jeff King <peff@peff.net> writes: > The second half would be to simplify git-repack. The current behavior is > to replace the old packfile with a tricky rename dance. Which is still > correct, but overly complicated. We should be able to just drop the new > packfile, since we know the bytes are identical (or rename the new one > over the old, though I think keeping the old is probably kinder to the > disk cache, especially if another process already has it mmap'd). Concurred. > One test needs to be updated, because it actually corrupts a > pack and expects that re-packing the corrupted bytes will > use the same name. It won't anymore, but we can easily just > use the name that pack-objects hands back. Re-reading the tests in that script, I am not sure if keeping these tests is even a sane thing to do, by the way. It "expects" that certain breakages are propagated, and anybody who breaks that expectation by improving pack-objects etc. to catch such breakages will be yelled at by breaking the test that used to pass. Seeing that the way the test scripts are line-wrapped follows the ancient convention, I suspect that this may be because it predates our more recent best practice to document known breakages with test_expect_failure. 
> diff --git a/t/t5302-pack-index.sh b/t/t5302-pack-index.sh > index fe82025..4bbb718 100755 > --- a/t/t5302-pack-index.sh > +++ b/t/t5302-pack-index.sh > @@ -174,11 +174,11 @@ test_expect_success \ > test_expect_success \ > '[index v1] 5) pack-objects happily reuses corrupted data' \ > 'pack4=$(git pack-objects test-4 <obj-list) && > - test -f "test-4-${pack1}.pack"' > + test -f "test-4-${pack4}.pack"' > > test_expect_success \ > '[index v1] 6) newly created pack is BAD !' \ > - 'test_must_fail git verify-pack -v "test-4-${pack1}.pack"' > + 'test_must_fail git verify-pack -v "test-4-${pack4}.pack"' A good thing is that the above hunks are the right thing to do, even if we are to modernise these tests so that they document a known breakage with expect-failure. Thanks. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-05 22:59 ` Junio C Hamano @ 2013-12-06 22:18 ` Jeff King 0 siblings, 0 replies; 36+ messages in thread From: Jeff King @ 2013-12-06 22:18 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, Git Mailing List On Thu, Dec 05, 2013 at 02:59:45PM -0800, Junio C Hamano wrote: > > One test needs to be updated, because it actually corrupts a > > pack and expects that re-packing the corrupted bytes will > > use the same name. It won't anymore, but we can easily just > > use the name that pack-objects hands back. > > Re-reading the tests in that script, I am not sure if keeping these > tests is even a sane thing to do, by the way. It "expects" that > certain breakages are propagated, and anybody who breaks that > expectation by improving pack-objects etc. to catch such breakages > will be yelled at by breaking the test that used to pass. I had a similar thought, but I figured I would leave it for the person who _does_ make that change. The yelling will be a good signal that they've got it right, and they can clean up the test (either by dropping it, or modifying it to check the right thing) at that point. > Seeing that the way the test scripts are line-wrapped follows the > ancient convention, I suspect that this may be because it predates > our more recent best practice to document known breakages with > test_expect_failure. I read it more as "make sure that the v1 index breaks, so when we are testing v2 we know it is not an accident that we notice the breakage". But I also see your reason, and I think it would be fine to use test_expect_failure. -Peff ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-05 20:28 ` [PATCH] pack-objects: name pack files after trailer hash Jeff King 2013-12-05 21:56 ` Shawn Pearce 2013-12-05 22:59 ` Junio C Hamano @ 2013-12-16 7:41 ` Michael Haggerty 2013-12-16 19:04 ` Jeff King 2 siblings, 1 reply; 36+ messages in thread From: Michael Haggerty @ 2013-12-16 7:41 UTC (permalink / raw) To: Jeff King; +Cc: Shawn Pearce, Git Mailing List On 12/05/2013 09:28 PM, Jeff King wrote: > [...] > This patch simply uses the pack trailer sha1 as the pack > name. It should be similarly unique, but covers the exact > representation of the objects. Other parts of git should not > care, as the pack name is returned by pack-objects and is > essentially opaque. > [...] Peff, The old naming scheme is documented in Documentation/git-pack-objects.txt, under "OPTIONS" -> "base-name": > base-name:: > Write into a pair of files (.pack and .idx), using > <base-name> to determine the name of the created file. > When this option is used, the two files are written in > <base-name>-<SHA-1>.{pack,idx} files. <SHA-1> is a hash > of the sorted object names to make the resulting filename > based on the pack content, and written to the standard > output of the command. The documentation should either be updated or the description of the naming scheme should be removed altogether. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-16 7:41 ` Michael Haggerty @ 2013-12-16 19:04 ` Jeff King 2013-12-16 19:19 ` Jonathan Nieder 2013-12-16 19:33 ` Junio C Hamano 0 siblings, 2 replies; 36+ messages in thread From: Jeff King @ 2013-12-16 19:04 UTC (permalink / raw) To: Michael Haggerty; +Cc: Junio C Hamano, Shawn Pearce, Git Mailing List On Mon, Dec 16, 2013 at 08:41:38AM +0100, Michael Haggerty wrote: > The old naming scheme is documented in > Documentation/git-pack-objects.txt, under "OPTIONS" -> "base-name": > > > base-name:: > > Write into a pair of files (.pack and .idx), using > > <base-name> to determine the name of the created file. > > When this option is used, the two files are written in > > <base-name>-<SHA-1>.{pack,idx} files. <SHA-1> is a hash > > of the sorted object names to make the resulting filename > > based on the pack content, and written to the standard > > output of the command. > > The documentation should either be updated or the description of the > naming scheme should be removed altogether. Thanks. I looked in Documentation/technical for anything to update, but didn't imagine we would be advertising the format in the user-facing documentation. :) The original patch is in next, so here's one on top. I just updated the description. I was tempted to explicitly say something like "this is opaque and meaningless to you, don't rely on it", but I don't know that there is any need. -- >8 -- Subject: docs: update pack-objects "base-name" description As of 1190a1a, the SHA-1 used to determine the filename is now calculated differently. Update the documentation to reflect this. Noticed-by: Michael Haggerty <mhagger@alum.mit.edu> Signed-off-by: Jeff King <peff@peff.net> --- On top of jk/name-pack-after-byte-representations, naturally. 
Documentation/git-pack-objects.txt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index d94edcd..c69affc 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -51,8 +51,7 @@ base-name:: <base-name> to determine the name of the created file. When this option is used, the two files are written in <base-name>-<SHA-1>.{pack,idx} files. <SHA-1> is a hash - of the sorted object names to make the resulting filename - based on the pack content, and written to the standard + of the bytes of the packfile, and is written to the standard output of the command. --stdout:: -- 1.8.5.524.g6743da6 ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-16 19:04 ` Jeff King @ 2013-12-16 19:19 ` Jonathan Nieder 2013-12-16 19:28 ` Jeff King 2013-12-16 19:33 ` Junio C Hamano 1 sibling, 1 reply; 36+ messages in thread From: Jonathan Nieder @ 2013-12-16 19:19 UTC (permalink / raw) To: Jeff King Cc: Michael Haggerty, Junio C Hamano, Shawn Pearce, Git Mailing List Jeff King wrote: > The original patch is in next, so here's one on top. I just updated the > description. Thanks. > I was tempted to explicitly say something like "this is > opaque and meaningless to you, don't rely on it", but I don't know that > there is any need. [...] > On top of jk/name-pack-after-byte-representations, naturally. I think there is --- if someone starts caring about the SHA-1 used, they won't be able to act on old packfiles that were created before this change. How about something like the following instead? -- >8 -- From: Jeff King <peff@peff.net> Subject: pack-objects doc: treat output filename as opaque After 1190a1a (pack-objects: name pack files after trailer hash, 2013-12-05), the SHA-1 used to determine the filename is calculated differently. Update the documentation to not guarantee anything more than that the SHA-1 depends on the pack content somehow. Hopefully this will discourage readers from depending on the old or the new calculation. Reported-by: Michael Haggerty <mhagger@alum.mit.edu> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> --- Documentation/git-pack-objects.txt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index d94edcd..cdab9ed 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -51,8 +51,7 @@ base-name:: <base-name> to determine the name of the created file. When this option is used, the two files are written in <base-name>-<SHA-1>.{pack,idx} files. 
<SHA-1> is a hash - of the sorted object names to make the resulting filename - based on the pack content, and written to the standard + based on the pack content and is written to the standard output of the command. --stdout:: -- 1.8.5.1 ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-16 19:19 ` Jonathan Nieder @ 2013-12-16 19:28 ` Jeff King 2013-12-16 19:37 ` Junio C Hamano 0 siblings, 1 reply; 36+ messages in thread From: Jeff King @ 2013-12-16 19:28 UTC (permalink / raw) To: Jonathan Nieder Cc: Michael Haggerty, Junio C Hamano, Shawn Pearce, Git Mailing List On Mon, Dec 16, 2013 at 11:19:33AM -0800, Jonathan Nieder wrote: > > I was tempted to explicitly say something like "this is > > opaque and meaningless to you, don't rely on it", but I don't know that > > there is any need. > [...] > > On top of jk/name-pack-after-byte-representations, naturally. > > I think there is --- if someone starts caring about the SHA-1 used, > they won't be able to act on old packfiles that were created before > this change. How about something like the following instead? Right, my point was that I do not think anybody has ever cared, and I do not see them starting now. But that is just my intuition. > diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt > index d94edcd..cdab9ed 100644 > --- a/Documentation/git-pack-objects.txt > +++ b/Documentation/git-pack-objects.txt > @@ -51,8 +51,7 @@ base-name:: > <base-name> to determine the name of the created file. > When this option is used, the two files are written in > <base-name>-<SHA-1>.{pack,idx} files. <SHA-1> is a hash > - of the sorted object names to make the resulting filename > - based on the pack content, and written to the standard > + based on the pack content and is written to the standard I'm fine with that. I was worried it would get clunky, but the way you have worded it is good. -Peff ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-16 19:28 ` Jeff King @ 2013-12-16 19:37 ` Junio C Hamano 0 siblings, 0 replies; 36+ messages in thread From: Junio C Hamano @ 2013-12-16 19:37 UTC (permalink / raw) To: Jeff King Cc: Jonathan Nieder, Michael Haggerty, Shawn Pearce, Git Mailing List Jeff King <peff@peff.net> writes: > On Mon, Dec 16, 2013 at 11:19:33AM -0800, Jonathan Nieder wrote: > >> > I was tempted to explicitly say something like "this is >> > opaque and meaningless to you, don't rely on it", but I don't know that >> > there is any need. >> [...] >> > On top of jk/name-pack-after-byte-representations, naturally. >> >> I think there is --- if someone starts caring about the SHA-1 used, >> they won't be able to act on old packfiles that were created before >> this change. How about something like the following instead? > > Right, my point was that I do not think anybody has ever cared, and I do > not see them starting now. But that is just my intuition. > >> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt >> index d94edcd..cdab9ed 100644 >> --- a/Documentation/git-pack-objects.txt >> +++ b/Documentation/git-pack-objects.txt >> @@ -51,8 +51,7 @@ base-name:: >> <base-name> to determine the name of the created file. >> When this option is used, the two files are written in >> <base-name>-<SHA-1>.{pack,idx} files. <SHA-1> is a hash >> - of the sorted object names to make the resulting filename >> - based on the pack content, and written to the standard >> + based on the pack content and is written to the standard > > I'm fine with that. I was worried it would get clunky, but the way you > have worded it is good. Our mails crossed; I think the above is good. Thanks. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-16 19:04 ` Jeff King 2013-12-16 19:19 ` Jonathan Nieder @ 2013-12-16 19:33 ` Junio C Hamano 2013-12-16 19:35 ` Jeff King 1 sibling, 1 reply; 36+ messages in thread From: Junio C Hamano @ 2013-12-16 19:33 UTC (permalink / raw) To: Jeff King; +Cc: Michael Haggerty, Shawn Pearce, Git Mailing List Jeff King <peff@peff.net> writes: > I was tempted to explicitly say something like "this is > opaque and meaningless to you, don't rely on it", but I don't know that > there is any need. Thanks. When we did the original naming, it was envisioned that we may use the name for fsck to make sure that the pack contains what it contains in the name, but it never materialized. The most prominent and useful characteristic of the new naming scheme is that two packfiles with the same name must be identical, and we may want to start using it some time later once everybody repacked their packs with the updated pack-objects. But until that time comes, some packs in existing repositories will hash to their names while others do not, so spelling out how the new names are derived without saying older pack-objects used to name their output differently may add more confusion than it is worth. > <base-name> to determine the name of the created file. > When this option is used, the two files are written in > <base-name>-<SHA-1>.{pack,idx} files. <SHA-1> is a hash > + of the bytes of the packfile, and is written to the standard "hash of the bytes of the packfile" tempts one to do $ sha1sum .git/objects/pack/pack-*.pack but that is not what we expect. I wonder if there are better ways to phrase it (or alternatively perhaps we want to make that expectation hold by updating our code to hash)? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] pack-objects: name pack files after trailer hash 2013-12-16 19:33 ` Junio C Hamano @ 2013-12-16 19:35 ` Jeff King 0 siblings, 0 replies; 36+ messages in thread From: Jeff King @ 2013-12-16 19:35 UTC (permalink / raw) To: Junio C Hamano; +Cc: Michael Haggerty, Shawn Pearce, Git Mailing List On Mon, Dec 16, 2013 at 11:33:11AM -0800, Junio C Hamano wrote: > > <base-name> to determine the name of the created file. > > When this option is used, the two files are written in > > <base-name>-<SHA-1>.{pack,idx} files. <SHA-1> is a hash > > + of the bytes of the packfile, and is written to the standard > > "hash of the bytes of the packfile" tempts one to do > > $ sha1sum .git/objects/pack/pack-*.pack > > but that is not what we expect. I wonder if there are better ways to > phrase it (or alternatively perhaps we want to make that expectation > hold by updating our code to hash)? Yeah, I wondered about that, but didn't think it was worth the verbosity to explain that the true derivation. I think Jonathan's suggestion takes care of it, though. -Peff ^ permalink raw reply [flat|nested] 36+ messages in thread
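Junio's caveat can be made concrete with a synthetic file shaped like a pack: payload bytes followed by a 20-byte SHA-1 trailer computed over the payload. The new-style name is that trailer, i.e. the hash of everything except the final 20 bytes, which is why running `sha1sum` over the whole file gives a different answer. (A real pack has a "PACK" header and object data; only the trailer layout matters for this demo, and `python3` is assumed for hex-to-binary conversion.)

```shell
cd "$(mktemp -d)"
# build payload + 20-byte sha1 trailer, the way pack files are laid out
printf 'not a real pack, just payload' > payload
trailer=$(sha1sum payload | cut -d' ' -f1)
cp payload fake.pack
# append the trailer as raw bytes (hex -> binary; python3 assumed present)
python3 -c 'import sys,binascii; sys.stdout.buffer.write(binascii.unhexlify(sys.argv[1]))' \
  "$trailer" >> fake.pack
# new-style name: sha1 of all but the trailing 20 bytes == the trailer
size=$(wc -c < fake.pack)
name=$(head -c $((size - 20)) fake.pack | sha1sum | cut -d' ' -f1)
echo "$name"
# hashing the whole file (the "sha1sum" temptation) does not match
whole=$(sha1sum fake.pack | cut -d' ' -f1)
echo "$whole"
```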
* Re: How to resume broke clone ? 2013-11-28 8:14 ` Duy Nguyen 2013-11-28 8:35 ` Karsten Blees @ 2013-11-28 9:20 ` Tay Ray Chuan 2013-11-28 9:29 ` zhifeng hu 1 sibling, 1 reply; 36+ messages in thread From: Tay Ray Chuan @ 2013-11-28 9:20 UTC (permalink / raw) To: Duy Nguyen; +Cc: zhifeng hu, Trần Ngọc Quân, Git Mailing List On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen <pclouds@gmail.com> wrote: > On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu <zf@ancientrocklab.com> wrote: >> Thanks for reply, But I am developer, I want to clone full repository, I need to view code since very early. > > if it works with --depth =1, you can incrementally run "fetch > --depth=N" with N larger and larger. I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially create a "shallow" clone then 2) incrementally deepen your clone. Zhifeng, in the course of your research into resumable cloning, you might have learnt that while it's a really valuable feature, it's also a pretty hard problem at the same time. So it's not because git doesn't want to have this feature. -- Cheers, Ray Chuan ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: How to resume broke clone ? 2013-11-28 9:20 ` How to resume broke clone ? Tay Ray Chuan @ 2013-11-28 9:29 ` zhifeng hu 2013-11-28 19:35 ` Shawn Pearce 2013-11-28 21:54 ` Jakub Narebski 0 siblings, 2 replies; 36+ messages in thread From: zhifeng hu @ 2013-11-28 9:29 UTC (permalink / raw) To: Tay Ray Chuan Cc: Duy Nguyen, Trần Ngọc Quân, Git Mailing List Once you use git clone --depth or git fetch --depth and later want to deepen the history, you may face problems: git fetch --depth=105 error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b zhifeng hu On Nov 28, 2013, at 5:20 PM, Tay Ray Chuan <rctay89@gmail.com> wrote: > On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen <pclouds@gmail.com> wrote: >> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu <zf@ancientrocklab.com> wrote: >>> Thanks for reply, But I am developer, I want to clone full repository, I need to view code since very early. >> >> if it works with --depth =1, you can incrementally run "fetch >> --depth=N" with N larger and larger. > > I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially > create a "shallow" clone then 2) incrementally deepen your clone. > > Zhifeng, in the course of your research into resumable cloning, you > might have learnt that while it's a really valuable feature, it's also > a pretty hard problem at the same time. So it's not because git > doesn't want to have this feature. > > -- > Cheers, > Ray Chuan > -- > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: How to resume broke clone ? 2013-11-28 9:29 ` zhifeng hu @ 2013-11-28 19:35 ` Shawn Pearce 2013-11-28 21:54 ` Jakub Narebski 1 sibling, 0 replies; 36+ messages in thread From: Shawn Pearce @ 2013-11-28 19:35 UTC (permalink / raw) To: zhifeng hu Cc: Tay Ray Chuan, Duy Nguyen, Trần Ngọc Quân, Git Mailing List On Thu, Nov 28, 2013 at 1:29 AM, zhifeng hu <zf@ancientrocklab.com> wrote: > Once using git clone —depth or git fetch —depth, > While you want to move backward. > you may face problem > > git fetch --depth=105 > error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a > error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef > error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b > error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db > error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b We now have a resumable bundle available through our kernel.org mirror. The bundle is 658M. mkdir linux cd linux git init wget https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/clone.bundle sha1sum clone.bundle 96831de0b81713333e5ebba94edb31e37e70e1df clone.bundle git fetch -u ./clone.bundle refs/*:refs/* git reset --hard You can also use our mirror as an upstream, as we have servers in Asia that lag no more than 5 or 6 minutes behind kernel.org: git remote add origin https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/ ^ permalink raw reply [flat|nested] 36+ messages in thread
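Since a bundle is just an ordinary file on disk (which is what makes the wget transfer resumable), the fetch-from-bundle flow above can be rehearsed entirely locally before spending slow bandwidth on the real 658M download. Repository and branch names below are invented for the demo; `symbolic-ref` pins the branch name so the sketch does not depend on the git version's default:

```shell
cd "$(mktemp -d)"
git init -q src
git -C src symbolic-ref HEAD refs/heads/trunk   # fixed branch name for the demo
git -C src -c user.email=a@example.com -c user.name=a \
  commit -q --allow-empty -m 'first commit'
# a bundle packages refs + objects into one resumable file
git -C src bundle create ../clone.bundle --all
mkdir dst && cd dst
git init -q
# fetch from the bundle exactly as from a remote
git fetch -q ../clone.bundle 'refs/*:refs/*'
git checkout -q trunk
git log --oneline
```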
* Re: How to resume broke clone ? 2013-11-28 9:29 ` zhifeng hu 2013-11-28 19:35 ` Shawn Pearce @ 2013-11-28 21:54 ` Jakub Narebski 1 sibling, 0 replies; 36+ messages in thread From: Jakub Narebski @ 2013-11-28 21:54 UTC (permalink / raw) To: git zhifeng hu <zf <at> ancientrocklab.com> writes: > > Once using git clone —depth or git fetch —depth, > While you want to move backward. > you may face problem > > git fetch --depth=105 > error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a > error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef > error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b > error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db > error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b BTW. there was (is?) a bundler service at http://bundler.caurea.org/ but I don't know if it can create Linux-size bundle. -- Jakub Narębski ^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: How to resume broke clone ?
@ 2013-11-28 8:32 Max Kirillov
2013-11-28 9:12 ` Jeff King
0 siblings, 1 reply; 36+ messages in thread
From: Max Kirillov @ 2013-11-28 8:32 UTC (permalink / raw)
To: zhifeng hu; +Cc: git
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I am in china. our bandwidth is very limitation. Less than 50Kb/s.
You could manually download the big pack files from some http remote.
For example http://repo.or.cz/r/linux.git
* create a new repository, add the remote there.
* download files with wget or whatever:
http://repo.or.cz/r/linux.git/objects/info/packs
and then the files mentioned in it. Currently they are:
http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx
http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack
and put them in the corresponding places.
* then run fetch or pull. I believe it should run fast then, though I
have not tested it.
Br,
--
Max
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: How to resume broke clone ? 2013-11-28 8:32 Max Kirillov @ 2013-11-28 9:12 ` Jeff King 0 siblings, 0 replies; 36+ messages in thread From: Jeff King @ 2013-11-28 9:12 UTC (permalink / raw) To: Max Kirillov; +Cc: zhifeng hu, git On Thu, Nov 28, 2013 at 01:32:36AM -0700, Max Kirillov wrote: > > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > > > I am in china. our bandwidth is very limitation. Less than 50Kb/s. > > You could manually download big packed bundled from some http remote. > For example http://repo.or.cz/r/linux.git > > * create a new repository, add the remote there. > > * download files with wget or whatever: > http://repo.or.cz/r/linux.git/objects/info/packs > also files mentioned in the file. Currently they are: > http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx > http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack > > and put them to the corresponding places. > > * then run fetch of pull. I believe it should run fast then. Though I > have not test it. You would also need to set up local refs so that git knows you have those objects. The simplest way to do it is to just fetch by dumb-http, which can resume the pack transfer. I think that clone is also very eager to clean up the partial transfer if the initial fetch fails. So you would want to init manually: git init linux cd linux git remote add origin http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git GIT_SMART_HTTP=0 git fetch -vv and then you can follow that up with regular smart fetches, which should be much smaller. It would be even simpler if you could fetch the whole thing as a bundle, rather than over dumb-http. But that requires the server side (or some third party who has fast access to the repo) cooperating and making a bundle available. -Peff ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2013-12-16 19:37 UTC | newest] Thread overview: 36+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-11-28 3:13 How to resume broke clone ? zhifeng hu 2013-11-28 7:39 ` Trần Ngọc Quân 2013-11-28 7:41 ` zhifeng hu 2013-11-28 8:14 ` Duy Nguyen 2013-11-28 8:35 ` Karsten Blees 2013-11-28 8:50 ` Duy Nguyen 2013-11-28 8:55 ` zhifeng hu 2013-11-28 9:09 ` Duy Nguyen 2013-11-28 9:29 ` Jeff King 2013-11-28 10:17 ` Duy Nguyen 2013-11-28 19:15 ` Shawn Pearce 2013-12-04 20:08 ` Jeff King 2013-12-05 6:50 ` Shawn Pearce 2013-12-05 13:21 ` Michael Haggerty 2013-12-05 15:11 ` Shawn Pearce 2013-12-05 16:12 ` Jeff King 2013-12-05 16:04 ` Jeff King 2013-12-05 18:01 ` Junio C Hamano 2013-12-05 19:08 ` Jeff King 2013-12-05 20:28 ` [PATCH] pack-objects: name pack files after trailer hash Jeff King 2013-12-05 21:56 ` Shawn Pearce 2013-12-05 22:59 ` Junio C Hamano 2013-12-06 22:18 ` Jeff King 2013-12-16 7:41 ` Michael Haggerty 2013-12-16 19:04 ` Jeff King 2013-12-16 19:19 ` Jonathan Nieder 2013-12-16 19:28 ` Jeff King 2013-12-16 19:37 ` Junio C Hamano 2013-12-16 19:33 ` Junio C Hamano 2013-12-16 19:35 ` Jeff King 2013-11-28 9:20 ` How to resume broke clone ? Tay Ray Chuan 2013-11-28 9:29 ` zhifeng hu 2013-11-28 19:35 ` Shawn Pearce 2013-11-28 21:54 ` Jakub Narebski -- strict thread matches above, loose matches on Subject: below -- 2013-11-28 8:32 Max Kirillov 2013-11-28 9:12 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).