* Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
@ 2010-07-28 0:13 Elijah Newren
From: Elijah Newren @ 2010-07-28 0:13 UTC
To: Shawn O. Pearce; +Cc: Nguyễn Thái Ngọc, git
Hi,
2010/7/27 Shawn O. Pearce <spearce@spearce.org>:
> I would prefer doing something more like what we do with shallow
> on the client side. Record in a magic file the path(s) that we
> did actually obtain. During fsck, rev-list, or read-tree the
> client skips over any paths that don't match that file's listing.
> Then we can keep the same commit SHA-1s, but we won't complain that
> there are objects missing.
I recently decided to take a crack at implementing sparse clones, due
to a crazy idea I had (which might not be as crazy as I thought since
you suggest something similar, though more limited). I was going to
wait until I actually got somewhere tangible with it to post an RFC,
particularly since it may take me a while, but since it's fresh on
everyone's minds perhaps now is good anyway.
Does the following seem sane, or are there big gotchas that I'm just unaware of?
0) Sparse clones have "all" commit objects, but not all trees/blobs.
Note that "all" only means all that are reachable from the refs being
downloaded, of course. I think this is widely agreed upon and has
been suggested many times on this list.
1) A user controls sparseness by passing rev-list arguments to clone.
This allows a user to control sparseness both in terms of span of
content (files/directories) and depth of history. It can also be used
to limit to a subset of refs (cloning just one or two branches instead
of all branches and tags). For example,
$ git clone ssh://repo.git dst -- Documentation/
$ git clone ssh://repo.git dst master~6..master
$ git clone ssh://repo.git dst -3
(Note that the destination argument becomes mandatory for those doing
a sparse clone in order to disambiguate it from rev-list options.)
This method also means users don't need much training to learn how to
use sparse clones -- they just use syntax they've already learned with
log, and clone will pass this info on to upload-pack.
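As a rough illustration of the three limiting styles above, the same arguments can already be handed to rev-list in an ordinary repository. This is only a sketch, not sparse clone itself (clone does not accept these arguments). The throwaway repository, its file names, and its commit count are all made up for the example.

```shell
#!/bin/sh
# Build a throwaway repository with 8 commits; every second commit
# touches Documentation/, the others touch src/. Layout is hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example"
git config user.email "example@example.com"
mkdir Documentation src
for i in 1 2 3 4 5 6 7 8; do
    if [ $((i % 2)) -eq 0 ]; then
        echo "doc $i" > Documentation/notes.txt
    else
        echo "code $i" > src/main.c
    fi
    git add .
    git commit -q -m "commit $i"
done

# The three limiting styles from the clone examples, fed to rev-list directly.
by_path=$(git rev-list --count master -- Documentation/)   # commits touching Documentation/
by_range=$(git rev-list --count master~6..master)           # commits in the range
by_count=$(git rev-list --count -3 master)                  # at most three commits
echo "path: $by_path  range: $by_range  count: $by_count"
```

The point is only that users already know this syntax from log and rev-list; how clone would forward it to upload-pack is the open part of the proposal.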
There is a slight question as to whether users should have to specify
"--all HEAD" with all sparse clones or whether it should be assumed
when no other refs are listed.
2) Sparse checkouts are automatically invoked with the path(s) from
the specified rev-list arguments.
We can't check out content that we don't have. :-)
This has a slight downside -- it makes sparse checkouts and sparse
clones slight misfits: the syntax (.gitignore style vs. rev-list
arguments) is a bit different, and sparse checkouts can exclude
certain paths whereas my sparse clones would only be able to *include*
paths. I don't see this as a deal-breaker, but even if others
disagree I think a more general path-exclusion mechanism for the
revision walking machinery would be really nice for reasons beyond
just this one. I've often wanted to do something like
git log -S'important code phrase' --EXCLUDE-PATH=big-data-dir
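For comparison, this is roughly how the existing sparse checkout feature from point 2 is driven: .gitignore-style patterns in $GIT_DIR/info/sparse-checkout, enabled through core.sparseCheckout. It is a minimal sketch on a made-up repository. The file names are assumptions, and nothing here is part of the proposal itself.

```shell
#!/bin/sh
# Throwaway repository with one documentation file and one source file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
mkdir Documentation src
echo "notes" > Documentation/notes.txt
echo "int main(void) { return 0; }" > src/main.c
git add .
git commit -q -m "initial"

# Enable sparse checkout and keep only Documentation/ in the work tree.
git config core.sparseCheckout true
mkdir -p .git/info
echo "Documentation/" > .git/info/sparse-checkout
git read-tree -m -u HEAD   # re-apply the index to the work tree with the patterns

ls
```

The objects for src/ are still in the repository here; only the work tree is trimmed. Leaving them out of the repository as well is what a sparse clone would add.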
3) The limiting rev-list arguments passed to clone are stored.
However, relative arguments such as "-3" or "master~6" first need to
be translated into one or more exclude ranges written as "^<sha1>".
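For the range case, one way to see this translation is git rev-parse, which resolves a range into the tip plus an excluded "^<sha1>" entry. This is a sketch on a made-up repository and covers only the range form; a count such as "-3" is deliberately left out.

```shell
#!/bin/sh
# Throwaway repository with 8 linear commits (contents are hypothetical).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
for i in 1 2 3 4 5 6 7 8; do
    echo "v$i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
done

# Resolve the relative range into absolute object names.
stored=$(git rev-parse master~6..master)
tip=$(git rev-parse master)
base=$(git rev-parse master~6)
echo "$stored"
```

The output contains the tip's object name and the base's object name prefixed with "^", and that pair is what would be stored in place of the relative expression.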
4) All revision-walking operations automatically use these limiting args.
This should be a simple code change, and would enable rev-list, log,
etc. to avoid missing blobs/trees and thus enable them to work with
sparse clones. fsck would take a bit more work, since it doesn't use
the setup_revisions() and revision.h walking machinery, but shouldn't
be too bad (I hope).
There are also performance ramifications: There should be no
measurable performance overhead for non-sparse clones (something that
might be a problem with a different implementation that did a
does-this-exist check each time it referenced a blob). It should also
be a significant performance boost for those using it, as operations
will only need to deal with the subset of the repository they specify
(faster downloads, stats, logs, etc.).
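Points 3 and 4 together can be approximated from the outside with a small wrapper that appends stored limits to every rev-list call. This is only a sketch of the idea, not how it would be done inside setup_revisions(). The file .git/sparse-limits and the function sparse_rev_list are hypothetical names invented for the example.

```shell
#!/bin/sh
# Throwaway repository: 6 commits, every second one touches Documentation/.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
mkdir Documentation src
for i in 1 2 3 4 5 6; do
    if [ $((i % 2)) -eq 0 ]; then
        echo "doc $i" > Documentation/notes.txt
    else
        echo "code $i" > src/main.c
    fi
    git add .
    git commit -q -m "commit $i"
done

# Hypothetical storage location for the limiting args; not a real Git file.
limits_file=.git/sparse-limits
printf '%s\n' "^$(git rev-parse master~4)" "--" "Documentation/" > "$limits_file"

# Wrapper that appends the stored limits to whatever the caller asked for.
# Word splitting of $(cat ...) is intentional; paths with spaces need more care.
sparse_rev_list() {
    git rev-list "$@" $(cat "$limits_file")
}

full=$(git rev-list --count master)
limited=$(sparse_rev_list --count master)
echo "full: $full  limited: $limited"
```

With the limits applied, the walk never reaches commits or paths outside the stored subset, which is the property that would let log and friends avoid missing objects.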
5) "Densifying" a sparse clone can be done
One can fetch a new pack and replace the limiting rev-list args with
the new choice. The sparse checkout information needs to be updated
too.
(So users probably would want to densify a sparse clone with "pull"
rather than "fetch", as manually updating sparse checkouts may be a
bit of a hassle.)
6) Cloning-from/fetching-from/pushing-to sparse clones is supported.
Future fetches and pushes also make use of the limiting arguments.
Receives do as well, but only to make sure the pack obtained is not
"more sparse" than what the receiving repository already has.
(Uploads ignore the stored rev-list arguments and instead use the
rev-list arguments passed to them; an upload will die if asked for
content not locally available to it.)
7) Operations that need unavailable data simply error out
Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
clone). However, hopefully the error messages state what extra
information needs to be downloaded so the user can appropriately
"densify" their repository.
Thanks,
Elijah
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Avery Pennarun @ 2010-07-28 1:05 UTC
To: Elijah Newren; +Cc: Shawn O. Pearce, Nguyễn Thái Ngọc, git

2010/7/27 Elijah Newren <newren@gmail.com>:
> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>
> Note that "all" only means all that are reachable from the refs being
> downloaded, of course. I think this is widely agreed upon and has
> been suggested many times on this list.

I think downloading all commit objects would require very low
bandwidth and storage space, so it should be harmless.

In fact, I have a pretty strong impression that also downloading all
*tree* objects would be fine too. But I've never actually gone and
counted them to see what the stats are like. Still, I'd assume that
the vast majority of repo space is blobs, not trees, and that trees
are highly compatible with deltification.

Note that if you happen to want to implement it in a way that you'll
also get all the commit objects from your submodules too (which I
highly encourage :)) then downloading the trees is the easiest way.
Otherwise you won't know which submodule commits you need.

> 1) A user controls sparseness by passing rev-list arguments to clone.
>
> This allows a user to control sparseness both in terms of span of
> content (files/directories) and depth of history. It can also be used
> to limit to a subset of refs (cloning just one or two branches instead
> of all branches and tags). For example,
>   $ git clone ssh://repo.git dst -- Documentation/
>   $ git clone ssh://repo.git dst master~6..master
>   $ git clone ssh://repo.git dst -3
> (Note that the destination argument becomes mandatory for those doing
> a sparse clone in order to disambiguate it from rev-list options.)

It's really too bad that the dst argument took up that slot which, in
every other git command, is where the list of revs would go :(

Other than that, I think the syntax looks nice.

> There is a slight question as to whether users should have to specify
> "--all HEAD" with all sparse clones or whether it should be assumed
> when no other refs are listed.

Since downloading commits is so cheap anyway, I'd suggest just
defaulting to downloading all the refs, as clone currently does. If
people don't like it, they can do what they currently do:

  git init
  git remote add ...
  git fetch

Not that pretty, but then again, it's rarely needed.

> 2) Sparse checkouts are automatically invoked with the path(s) from
> the specified rev-list arguments.
>
> Can't checkout content that we don't have. :-)
>
> This has a slight downside -- it makes sparse checkouts and sparse
> clones slight misfits: the syntax (.gitignore style vs. rev-list
> arguments) is a bit different, and sparse checkouts can exclude
> certain paths whereas my sparse clones would only be able to *include*
> paths. I don't see this as a deal-breaker, but even if others
> disagree I think a more general path-exclusion mechanism for the
> revision walking machinery would be really nice for reasons beyond
> just this one. I've often wanted to do something like
>   git log -S'important code phrase' --EXCLUDE-PATH=big-data-dir

I don't totally understand what you mean here. But I do think that if
you can *mostly* trim down a tree, excluding every little thing is not
that important. As was discussed on the other thread, it seems like
*most* people are trimming down their trees (currently using
submodules) just to make stuff faster, and getting rid of 90% of the
unwanted cruft is probably fine; getting rid of 100% of it isn't that
much more of a speed boost.

I guess my point is, more complex exclusions could always be added
later but they aren't so important right away.

> 3) The limiting rev-list arguments passed to clone are stored.
>
> However, relative arguments such as "-3" or "master~6" first need to
> be translated into one or more exclude ranges written as "^<sha1>".

Just run them through rev-parse, I think.

> 4) All revision-walking operations automatically use these limiting args.
>
> This should be a simple code change, and would enable rev-list, log,
> etc. to avoid missing blobs/trees and thus enable them to work with
> sparse clones. fsck would take a bit more work, since it doesn't use
> the setup_revisions() and revision.h walking machinery, but shouldn't
> be too bad (I hope).

I don't know if this implementation detail would be better or worse
than just having the tools auto-trim their activities when they run
into a missing object. But maybe. It does sound sort of elegant: this
way they *won't* run into the missing objects.

Beware, however, that

  git log -- Documentation

outputs a different set of commits than just

  git log

You don't want to enable history simplification here; I think that
means you want --full-history on by default for the "stored" path
limiting, but not for any command-line path limiting. That could be
slightly messy.

> 5) "Densifying" a sparse clone can be done
>
> One can fetch a new pack and replace the limiting rev-list args with
> the new choice. The sparse checkout information needs to be updated
> too.
>
> (So users probably would want to densify a sparse clone with "pull"
> rather than "fetch", as manually updating sparse checkouts may be a
> bit of a hassle.)

I think this would work, but unless you want to re-download some
(possibly lots of) objects you've already got, it would require some
kind of extra support from the server, I think. Maybe that's a rare
enough case that few people will care and it could be fixed later.

I don't think the pull vs. fetch distinction is valid; I would be very
surprised if pull un-sparsified my checkout, just as I would be
surprised if merge did. And pull is just fetch+merge.

> 6) Cloning-from/fetching-from/pushing-to sparse clones is supported.
>
> Future fetches and pushes also make use of the limiting arguments.
> Receives do as well, but only to make sure the pack obtained is not
> "more sparse" than what the receiving repository already has.
> (uploads ignore the stored rev-list arguments, instead using the
> rev-list arguments passed to it -- it will die if asked for content
> not locally available to it.)

This scares me a little. It's a reminder that it's all-too-easy to
get your repository into a really messed up state by going in and
screwing with your sparseness parameters at the wrong time.

It would make me more comfortable if there was some kind of "oh god,
just fix it by downloading any objects you think are missing" mode :)
In fact, git could benefit from that in general - every now and then
someone on the list asks about a repository they managed to mangle by
corrupting a pack or something, and there's no really good answer to
that.

> 7) Operations that need unavailable data simply error out
>
> Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
> clone). However, hopefully the error messages state what extra
> information needs to be downloaded so the user can appropriately
> "densify" their repository.

That sounds good to me.

Have fun,

Avery
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Nguyen Thai Ngoc Duy @ 2010-07-28 3:06 UTC
To: Avery Pennarun; +Cc: Elijah Newren, Shawn O. Pearce, git

2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
> 2010/7/27 Elijah Newren <newren@gmail.com>:
>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>
>> Note that "all" only means all that are reachable from the refs being
>> downloaded, of course. I think this is widely agreed upon and has
>> been suggested many times on this list.
>
> I think downloading all commit objects would require very low
> bandwidth and storage space, so it should be harmless.

Here you go. A pack with only commits and trees of git.git#master is
15M. With blobs, it is 28M. Git is not a typical repo with large blobs
though.
--
Duy
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Nguyen Thai Ngoc Duy @ 2010-07-28 3:38 UTC
To: Avery Pennarun; +Cc: Elijah Newren, Shawn O. Pearce, git

(corrected reply context, sorry)

On Wed, Jul 28, 2010 at 1:06 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> 2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
>> 2010/7/27 Elijah Newren <newren@gmail.com>:
>>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>>
>>> Note that "all" only means all that are reachable from the refs being
>>> downloaded, of course. I think this is widely agreed upon and has
>>> been suggested many times on this list.
>>
>> I think downloading all commit objects would require very low
>> bandwidth and storage space, so it should be harmless.
>
> > In fact, I have a pretty strong impression that also downloading
> > all *tree* objects would be fine too.
>
> Here you go. A pack with only commits and trees of git.git#master is
> 15M. With blobs, it is 28M. Git is not a typical repo with large blobs
> though.
> --
> Duy
>
--
Duy
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Avery Pennarun @ 2010-07-28 3:58 UTC
To: Nguyen Thai Ngoc Duy; +Cc: Elijah Newren, Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 1:06 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> 2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
>> 2010/7/27 Elijah Newren <newren@gmail.com>:
>>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>>
>>> Note that "all" only means all that are reachable from the refs being
>>> downloaded, of course. I think this is widely agreed upon and has
>>> been suggested many times on this list.
>>
>> I think downloading all commit objects would require very low
>> bandwidth and storage space, so it should be harmless.
>
> > In fact, I have a pretty strong impression that also downloading
> > all *tree* objects would be fine too.
>
> Here you go. A pack with only commits and trees of git.git#master is
> 15M. With blobs, it is 28M. Git is not a typical repo with large blobs
> though.

Hmm, that's very interesting - more than half the repo is just tree
and commit objects? Maybe that's not so shocking after all, given the
tendency in the git project to use long commit messages and relatively
short patches.

Was your pack carefully ordered for best deltification?

Knowing how much of that is commits vs. trees would also be very
interesting. But if so, only saving half the space is kind of
disappointing. If you have a script around for generating this, it
would be very interesting to compare the results with, say, the Linux
kernel repo (especially since it seems to be the #1 example of
"submodules people don't want to check out because they're so bloody
huge").

In bup, I know the trees+commits are much smaller than the blobs, so
my intuition was telling me it would be the same in git. It's
entirely possible that I was wrong, though. In retrospect, bup uses
really short computer-generated commit messages, and backs up large
numbers of files at once, most of which never change (and thus most of
the trees never change). Commits+trees end up somewhere around 0.5%
of the total repo size.

Have fun,

Avery
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Sverre Rabbelier @ 2010-07-28 6:12 UTC
To: Avery Pennarun; +Cc: Nguyen Thai Ngoc Duy, Elijah Newren, Shawn O. Pearce, git

Heya,

On Tue, Jul 27, 2010 at 22:58, Avery Pennarun <apenwarr@gmail.com> wrote:
> Hmm, that's very interesting - more than half the repo is just tree
> and commit objects? Maybe that's not so shocking after all, given the
> tendency in the git project to use long commit messages and relatively
> short patches.

Note that in the case of the ginormous-tree this holds as well. Many
small files, but in insanely deeply nested directories with insane
fan-out. If you need to download all the trees you don't save that
much bandwidth. OTOH, I'm not only concerned about bandwidth, just
being able to run 'git status' without it taking half a minute would
be sweet.

--
Cheers,

Sverre Rabbelier
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Nguyen Thai Ngoc Duy @ 2010-07-28 7:59 UTC
To: Sverre Rabbelier; +Cc: Avery Pennarun, Elijah Newren, Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 4:12 PM, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> OTOH, I'm not only concerned about bandwidth, just
> being able to run 'git status' without it taking half a minute would
> be sweet.

Doesn't the assume-unchanged bit or sparse checkout help?
--
Duy
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Sverre Rabbelier @ 2010-07-28 14:48 UTC
To: Nguyen Thai Ngoc Duy; +Cc: Avery Pennarun, Elijah Newren, Shawn O. Pearce, git

Heya,

On Wed, Jul 28, 2010 at 02:59, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> Doesn't assume-unchanged bit or sparse checkout help?

See my earlier part, assume-unchanged doesn't help due to the
gitignore files. Sparse checkout isn't an option since I really need
those files to be there, I just don't ever modify them. Really what I
need is read-only checkout ;). "Git, please ignore the existence of
this directory and all its files/subdirectories, ktnx".

--
Cheers,

Sverre Rabbelier
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Nguyen Thai Ngoc Duy @ 2010-07-28 7:11 UTC
To: Avery Pennarun; +Cc: Elijah Newren, Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 1:58 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Wed, Jul 28, 2010 at 1:06 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> 2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
>>> 2010/7/27 Elijah Newren <newren@gmail.com>:
>>>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>>>
>>>> Note that "all" only means all that are reachable from the refs being
>>>> downloaded, of course. I think this is widely agreed upon and has
>>>> been suggested many times on this list.
>>>
>>> I think downloading all commit objects would require very low
>>> bandwidth and storage space, so it should be harmless.
>>
>> > In fact, I have a pretty strong impression that also downloading
>> > all *tree* objects would be fine too.
>>
>> Here you go. A pack with only commits and trees of git.git#master is
>> 15M. With blobs, it is 28M. Git is not a typical repo with large blobs
>> though.
>
> Hmm, that's very interesting - more than half the repo is just tree
> and commit objects? Maybe that's not so shocking after all, given the
> tendency in the git project to use long commit messages and relatively
> short patches.
>
> Was your pack carefully ordered for best deltification?

I did not do any optimization. It's pack-objects' defaults. I only
filtered blobs out and that's what fetch-pack would receive.

> Knowing how much of that is commits vs. trees would also be very interesting.

Commits alone take 7.8 MB. Well.. all those commit messages..

> But if so, only saving half the space is kind of disappointing. If
> you have a script around for generating this, it would be very
> interesting to compare the results with, say, the Linux kernel repo
> (especially since it seems to be the #1 example of "submodules people
> don't want to check out because they're so bloody huge").

I modified show_object() in pack-objects.c to skip certain objects.
Actually I started with git-fetch-pack, but you can do

  git rev-parse master | git pack-objects --revs --delta-base-offset pack

Then verify what's in the pack with

  git verify-pack -v pack-*.idx
--
Duy
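The commits-vs-trees-vs-blobs breakdown Avery asks about can be read off the verify-pack listing from Duy's two commands. The sketch below sums the in-pack size per object type with awk. It assumes the usual `verify-pack -v` column layout (object name, type, size, size in pack, offset), runs on a made-up three-commit repository rather than git.git, and packs blobs too, since the blob filtering Duy describes needed a modified pack-objects.

```shell
#!/bin/sh
# Throwaway repository with 3 commits, each changing one file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
for i in 1 2 3; do
    echo "v$i" > a.txt
    git add a.txt
    git commit -q -m "commit $i"
done

# Pack everything reachable from master into pack-<hash>.pack/.idx in the cwd.
git rev-parse master | git pack-objects --revs --delta-base-offset -q pack >/dev/null

# Column 2 is the object type, column 4 the size inside the pack.
# Summary lines ("non delta: ...", "... ok") do not match the type filter.
stats=$(git verify-pack -v pack-*.idx |
    awk '$2 ~ /^(commit|tree|blob|tag)$/ { n[$2]++; bytes[$2] += $4 }
         END { for (t in n) printf "%s %d objects %d bytes\n", t, n[t], bytes[t] }')
echo "$stats"
```

Run against a real repository, the three byte totals give the split being discussed; the absolute numbers depend on how the pack was deltified.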
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Elijah Newren @ 2010-07-28 3:31 UTC
To: Avery Pennarun; +Cc: Shawn O. Pearce, Nguyễn Thái Ngọc, git

2010/7/27 Avery Pennarun <apenwarr@gmail.com>:
> Note that if you happen to want to implement it in a way that you'll
> also get all the commit objects from your submodules too (which I
> highly encourage :)) then downloading the trees is the easiest way.
> Otherwise you won't know which submodule commits you need.

Makes sense. Seems like a good reason to include all the trees.

> Since downloading commits is so cheap anyway, I'd suggest just
> defaulting to downloading all the refs, as clone currently does. If
> people don't like it, they can do what they currently do:
>
>   git init
>   git remote add ...
>   git fetch
>
> Not that pretty, but then again, it's rarely needed.

Would you suggest then parsing the limiting arguments passed to clone
and disallowing refs? Or just making it non-useful by always
appending "--all HEAD"?

>> 2) Sparse checkouts are automatically invoked with the path(s) from
>> the specified rev-list arguments.
<snip>
> I don't totally understand what you mean here. But I do think that if

Basically, I mean what you stated much more succinctly and eloquently
right here:

> I guess my point is, more complex exclusions could always be added
> later but they aren't so important right away.

>> 4) All revision-walking operations automatically use these limiting args.
<snip>
> It does sound sort of elegant: this way they *won't* run into the missing objects.
> Beware, however, that
>
>   git log -- Documentation
>
> outputs a different set of commits than just
>
>   git log

Yes, exactly. In a sparse clone, why wouldn't one want the behavior
of the former automatically, without having to specify the paths on
the command line every time they ran log (or rev-list or fast-export
or...etc., especially if they cloned N directories rather than just
1)?

Actually, I can kind of see the desire to see the 'real' log since the
users do happen to have all commits locally, but it almost seems like
it should be the case that requires a special option to be passed to
git log ('--ignore-sparse-limiting'?). But trying to get that option
to work in conjunction with other options (--stat, -S, -p, etc.) would
be really hard, if not impossible.

>> 5) "Densifying" a sparse clone can be done
<snip>
> I think this would work, but unless you want to re-download some
> (possibly lots of) objects you've already got, it would require some
> kind of extra support from the server, I think. Maybe that's a rare
> enough case that few people will care and it could be fixed later.

For my first implementation, my plan was to simply re-download ALL
(not just some or lots of) objects I've already got in such a case. A
bit wasteful to be sure, but I was hoping it was rare enough to
"densify" a clone that it wouldn't be a big deal...and that support
for smarter downloads could be added later.

> I don't think the pull vs. fetch distinction is valid; I would be very
> surprised if pull un-sparsified my checkout, just as I would be
> surprised if merge did. And pull is just fetch+merge.

Right, I don't think pull should un-sparsify either the checkout OR
the clone by default (it should have fetch pass the same limiting
arguments and only download an equivalently sparse set of updates).
Your point about pull=fetch+merge (or fetch+rebase) makes sense, which
I guess means that un-sparsifying a clone+checkout should be a
separate toplevel command ("densify"?) rather than a special option
for fetch/pull.

>> 6) Cloning-from/fetching-from/pushing-to sparse clones is supported.
>>
>> Future fetches and pushes also make use of the limiting arguments.
>> Receives do as well, but only to make sure the pack obtained is not
>> "more sparse" than what the receiving repository already has.
>> (uploads ignore the stored rev-list arguments, instead using the
>> rev-list arguments passed to it -- it will die if asked for content
>> not locally available to it.)
>
> This scares me a little. It's a reminder that it's all-too-easy to
> get your repository into a really messed up state by going in and
> screwing with your sparseness parameters at the wrong time.

I don't follow. Why would people be "screwing with sparseness
parameters"? My basic idea was that there would be only three ways to
change sparseness parameters for clones, with only the first two
documented: the initial clone command, the "densify" command (someone
probably needs to think of a better name), and reading the source code
to figure out what bits on your disk to change and changing them.

Here's why I want the clone-able/fetch-able/pull-able sparse clone
functionality: I like having translators (who only need maybe one
file) or technical writers (who only need the Documentation/
subdirectory) or other similar folks having the ability to collaborate
on the subset of the repository that they need to do their work.
Thus, it makes sense for them to be able to clone from, pull from, and
push to each other. The only two rules that I think are necessary to
enable such behavior are:

  * No repository can provide information that it doesn't have
    (should be pretty easy to enforce...)
  * No repository accepts less data than it expects in its repository
    (i.e. you can push to a sparse clone or a real clone, but need to
    provide data that fulfills its rev-list limiting arguments)

> It would make me more comfortable if there was some kind of "oh god,
> just fix it by downloading any objects you think are missing" mode :)
> In fact, git could benefit from that in general - every now and then
> someone on the list asks about a repository they managed to mangle by
> corrupting a pack or something, and there's no really good answer to
> that.

For sparse clones, isn't that mode just running the "densify" command
with no limiting arguments?

>> 7) Operations that need unavailable data simply error out
>>
>> Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
>> clone). However, hopefully the error messages state what extra
>> information needs to be downloaded so the user can appropriately
>> "densify" their repository.
>
> That sounds good to me.

Thanks for the detailed feedback. :-)
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
From: Elijah Newren @ 2010-07-31 22:36 UTC
To: Avery Pennarun; +Cc: Shawn O. Pearce, Nguyễn Thái Ngọc, git

Hi,

2010/7/27 Elijah Newren <newren@gmail.com>:
> 2010/7/27 Avery Pennarun <apenwarr@gmail.com>:
>> Note that if you happen to want to implement it in a way that you'll
>> also get all the commit objects from your submodules too (which I
>> highly encourage :)) then downloading the trees is the easiest way.
>> Otherwise you won't know which submodule commits you need.
>
> Makes sense. Seems like a good reason to include all the trees.

Actually, having thought about it more, I don't see the reason for
getting all the commit objects from submodules (unless those
submodules are at paths specified for download). If a user has
specified that they just want the Documentation subdirectory, why
would it matter if the submodule under src/widgets was downloaded?
They don't want to do anything with any of its contents, so I don't
see why they'd need its trees or commits. Am I missing something?

Also, I'm rethinking the download-all-commits aspect too. This is
partially due to Nguyễn's stats (and special usecases like
translators), partially because of security issues (it has already
been stated that only including stuff meant to be public is an
important security concern for clone[1], and commit logs for changes
completely outside specified paths might be considered non-public
data[2]), and partially because it reinforces my whole rev-list
limiting args idea (it makes it really clear that 'git log' should
automatically behave like 'git log -- Documentation/' in a sparse
clone of just Documentation/).

[1] e.g. http://article.gmane.org/gmane.comp.version-control.git/115835

[2] This isn't just theoretical either. I have a couple big important
(to $dayjob and thus me) sparse-clone usecases in this situation and
have for a few years, but gave up on it thinking it wouldn't be
possible with sparse clones. I instead wrote a fast filtering
mechanism using fast-export/fast-import that creates a new repository
and keeps track of the mapping between sha1sums in unfiltered and
filtered repos, allowing changes to be grafted between the two. Kind
of a pain, and suboptimal for a few reasons. It'd be really nice if I
could replace this stuff with sparse clones, but can't do that if
commit logs corresponding to changes completely outside the sparse
paths are included.
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
2010-07-28 0:13 Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing) Elijah Newren
2010-07-28 1:05 ` Avery Pennarun
@ 2010-07-28 3:36 ` Nguyen Thai Ngoc Duy
2010-07-28 3:59 ` Elijah Newren
2010-08-13 17:31 ` Enrico Weigelt
2 siblings, 1 reply; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-28 3:36 UTC (permalink / raw)
To: Elijah Newren; +Cc: Shawn O. Pearce, git

2010/7/28 Elijah Newren <newren@gmail.com>:
> 1) A user controls sparseness by passing rev-list arguments to clone.
>
> This allows a user to control sparseness both in terms of span of
> content (files/directories) and depth of history. It can also be used
> to limit to a subset of refs (cloning just one or two branches instead
> of all branches and tags). For example,
> $ git clone ssh://repo.git dst -- Documentation/

Are pathspecs supported in addition to prefixes?

> $ git clone ssh://repo.git dst master~6..master
> $ git clone ssh://repo.git dst -3
> (Note that the destination argument becomes mandatory for those doing
> a sparse clone in order to disambiguate it from rev-list options.)
>
> This method also means users don't need much training to learn how to
> use sparse clones -- they just use syntax they've already learned with
> log, and clone will pass this info on to upload-pack.
>
> There is a slight question as to whether users should have to specify
> "--all HEAD" with all sparse clones or whether it should be assumed
> when no other refs are listed.

So you basically kill off shallow clone too, with "master~6..master". I
wonder what happens if the user does "git clone ... master~6..master~3"?

> 4) All revision-walking operations automatically use these limiting args.
>
> This should be a simple code change, and would enable rev-list, log,
> etc. to avoid missing blobs/trees and thus enable them to work with
> sparse clones. fsck would take a bit more work, since it doesn't use
> the setup_revisions() and revision.h walking machinery, but shouldn't
> be too bad (I hope).
>
> There are also performance ramifications: There should be no
> measurable performance overhead for non-sparse clones (something that
> might be a problem with a different implementation that did
> does-this-exist check each time it references a blob). It should also
> be a significant performance boost for those using it, as operations
> will only need to deal with the subset of the repository they specify
> (faster downloads, stats, logs, etc.)

Revision walking is not the only gate to access objects. Others, like
the diff machinery, also need to be taught about rev-list limits.

> 5) "Densifying" a sparse clone can be done
>
> One can fetch a new pack and replace the limiting rev-list args with
> the new choice. The sparse checkout information needs to be updated
> too.
>
> (So users probably would want to densify a sparse clone with "pull"
> rather than "fetch", as manually updating sparse checkouts may be a
> bit of a hassle.)

What information would you send to the server to request a new pack in
a sparse clone? Currently we send all commit tips. rev-list has a notion
of subtracting commit trees. I don't know if it can "add" or "subtract"
a tree prefix though.
--
Duy
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
2010-07-28 3:36 ` Nguyen Thai Ngoc Duy
@ 2010-07-28 3:59 ` Elijah Newren
2010-07-29 10:29 ` Nguyen Thai Ngoc Duy
0 siblings, 1 reply; 16+ messages in thread
From: Elijah Newren @ 2010-07-28 3:59 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: Shawn O. Pearce, git

On Tue, Jul 27, 2010 at 9:36 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> 2010/7/28 Elijah Newren <newren@gmail.com>:
>> 1) A user controls sparseness by passing rev-list arguments to clone.
<snip>
>> For example,
>> $ git clone ssh://repo.git dst -- Documentation/
>
> Are pathspecs supported in addition to prefixes?

Basically, whatever git log or git rev-list accepts. I think I saw some
other discussion about making those adopt some of the code/capability of
git grep, which would automatically benefit sparse clones. But until
then, no, because I need to be able to take these arguments and
automatically pass them on to log, rev-list, etc.

> So you basically kill off shallow clone too, with "master~6..master".

Yes, that was part of the plan...extend the capabilities of shallow
clones in two ways: allowing the user to specify a cutoff via a revision
identifier as well as a number of commits, and allowing people to clone
(and fetch-from/push-to) other "shallow" clones.

> I wonder what happens if the user does "git clone ... master~6..master~3"?

Currently, that'd break -- just as it does for fast-export (see
t/t9350-fast-export.sh, 'no exact-ref revisions included'). I had been
thinking of trying to get that fixed for both cases by making it result
in a "master" branch that is "three commits behind" what you
clone/fast-export from. You'd have to look for and disallow other
special cases like "git fast-export ... master^1 master^2" or
"git clone ... :/searchstring". I'm not sure how this interacts with
Avery's suggestion to just ignore branch/tag limiting.

> Revision walking is not the only gate to access objects. Others, like
> the diff machinery, also need to be taught about rev-list limits.

Right, good point. Are there others than the diff machinery (and the
fsck special case) that you know of?

> What information would you send to the server to request a new pack in
> a sparse clone? Currently we send all commit tips. rev-list has a notion
> of subtracting commit trees. I don't know if it can "add" or "subtract"
> a tree prefix though.

When "densifying" a sparse clone, I was (initially at least) just going
to treat it like an initial clone and re-download _everything_ (even if
sparsifying rather than densifying). I assumed it'd be rare to want to
do such an operation, but yeah, in the future someone might want a
smarter way to handle it.
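To make the "master~6..master~3" case concrete, here is a toy model of
the range semantics discussed above. It is a sketch, not git's revision
walker: the commit graph is a plain dict, and `ancestors`,
`nth_ancestor` and `rev_range` are hypothetical helper names. The range
`A..B` is taken to mean "everything reachable from B that is not
reachable from A", and `X~n` means following first parents n times.

```python
def ancestors(parents, tip):
    """Return every commit reachable from `tip`, including `tip` itself."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def nth_ancestor(parents, tip, n):
    """Follow first parents n times, in the spirit of `tip~n`."""
    for _ in range(n):
        tip = parents[tip][0]
    return tip


def rev_range(parents, exclude, include):
    """Commits in `exclude..include`: reachable from include, not from exclude."""
    return ancestors(parents, include) - ancestors(parents, exclude)
```

On a linear history c0 through c9 with master at c9, the range
master~6..master~3 selects c4, c5 and c6. The named tip c9 is not part
of the result, which is why the proposal above would leave the cloned
"master" pointing at c6, three commits behind the source.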
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
2010-07-28 3:59 ` Elijah Newren
@ 2010-07-29 10:29 ` Nguyen Thai Ngoc Duy
0 siblings, 0 replies; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-29 10:29 UTC (permalink / raw)
To: Elijah Newren; +Cc: Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 1:59 PM, Elijah Newren <newren@gmail.com> wrote:
>> Revision walking is not the only gate to access objects. Others, like
>> the diff machinery, also need to be taught about rev-list limits.
>
> Right, good point. Are there others than the diff machinery (and the
> fsck special case) that you know of?

A lot (I just found out as I was pushing subtree clone as far as I
could). For merging, you can hardly limit sha1 access. When writing a
tree, git is particularly paranoid and checks for sha1 existence
(has_sha1_file and assert_sha1_type). You can find that has_sha1_file is
used in many places, not just commit/write-tree.

A Narrow/Sparse/Subtree/Lazy/Whatever-it-is clone will have a hard
time... Oh, the lazy one will not.
--
Duy
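The existence checks mentioned above are the crux: in a sparse clone,
objects outside the requested paths are missing on purpose. The sketch
below is not git's has_sha1_file or assert_sha1_type; it is a
hypothetical model (`in_sparse_paths` and `missing_required` are names
made up here) of what a path-aware check might look like, where a
missing object only counts as an error if its path falls inside the
sparse prefixes. It assumes each prefix ends with a slash.

```python
def in_sparse_paths(path, prefixes):
    """True if `path` is one of the sparse directories or lies inside one.

    Assumes every prefix ends with "/", e.g. "Documentation/".
    """
    return any(path == p.rstrip("/") or path.startswith(p) for p in prefixes)


def missing_required(entries, present, prefixes):
    """Return paths whose objects are absent but should have been fetched.

    entries  -- iterable of (path, object_id) pairs, e.g. flattened tree entries
    present  -- set of object ids available in the local object store
    prefixes -- sparse path prefixes the clone was limited to
    """
    return [
        path
        for path, oid in entries
        if oid not in present and in_sparse_paths(path, prefixes)
    ]
```

An object missing under src/ is tolerated in a Documentation/ clone,
while one missing under Documentation/ is still reported. Every caller
that today asks "does this object exist?" would need the path as well,
which is one way to read the point that so many code paths are
affected.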
* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
2010-07-28 0:13 Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing) Elijah Newren
2010-07-28 1:05 ` Avery Pennarun
2010-07-28 3:36 ` Nguyen Thai Ngoc Duy
@ 2010-08-13 17:31 ` Enrico Weigelt
2010-08-13 19:19 ` Truncating history (Re: Sparse clones) Jonathan Nieder
2 siblings, 1 reply; 16+ messages in thread
From: Enrico Weigelt @ 2010-08-13 17:31 UTC (permalink / raw)
To: git

* Elijah Newren <newren@gmail.com> wrote:

Hi folks,

as I'm doing many backups via git (e.g. hourly sql dumps), I'd like to
cut off the history (e.g. at the n'th past commit) and reclaim the
space, both on the local and the remote side (even differently).

So let me propose another approach: fake-roots

Fake-roots are special refs that declare certain commit objects to be
root commits. Each time git walks down the history, it checks whether
the current commit is a fake-root and, if so, treats it as having no
ancestor. That should be generic enough to let everything else (commit,
push, gc, etc.) work as usual.

The only tricky point is when to update remote fake-roots: the remote
should not cut off my local repo (unless explicitly asked). So remote
fake-roots should only be imported if the local/receiving side no
longer has the dropped commits.

Hmm, maybe it's even wiser to go one step back in history and introduce
fake-empties (which also have no parents) instead of fake-roots? A
fake-empty is imported as soon as the original object is missing.

Of course, it's important that this feature has to be explicitly
enabled (maybe even on a per-remote basis) to prevent security flaws.

cu
--
----------------------------------------------------------------------
 Enrico Weigelt, metux IT service -- http://www.metux.de/

 phone:  +49 36207 519931  email: weigelt@metux.de
 mobile: +49 151 27565287  icq:   210169427         skype: nekrad666
----------------------------------------------------------------------
 Embedded-Linux / Porting / Opensource-QM / Distributed Systems
----------------------------------------------------------------------
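The fake-root proposal above boils down to two rules, which can be
modelled in a few lines. This is a sketch of the idea only, not
existing git behaviour: `walk_history` and `should_import_fake_root`
are hypothetical names, the commit graph is a plain dict, and "has the
dropped commits" is approximated as "has any parent of the fake-root in
the local object set".

```python
def walk_history(parents, tip, fake_roots=frozenset()):
    """Return commits reachable from `tip`, treating fake-roots as parentless."""
    seen = set()
    order = []
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        order.append(commit)
        if commit in fake_roots:
            # Pretend this commit has no ancestors: stop walking here.
            continue
        stack.extend(parents.get(commit, ()))
    return order


def should_import_fake_root(commit, parents, local_objects):
    """Import a remote fake-root only if none of its parents exist locally.

    This keeps a remote from cutting off history the local side still has.
    """
    return all(p not in local_objects for p in parents.get(commit, ()))
```

On a linear history c0 through c5, marking c3 as a fake-root makes a
walk from c5 visit only c5, c4 and c3, so a later gc could in principle
reclaim c0 to c2. A receiving repository that still holds c2 would
decline to import that fake-root.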
* Truncating history (Re: Sparse clones)
2010-08-13 17:31 ` Enrico Weigelt
@ 2010-08-13 19:19 ` Jonathan Nieder
0 siblings, 0 replies; 16+ messages in thread
From: Jonathan Nieder @ 2010-08-13 19:19 UTC (permalink / raw)
To: git

Hi,

Enrico Weigelt wrote:
> Fake-roots are special refs that declare certain commit objects to be
> root commits. Each time git walks down the history, it checks whether
> the current commit is a fake-root and, if so, treats it as having no
> ancestor. That should be generic enough to let everything else
> (commit, push, gc, etc.) work as usual.
>
> The only tricky point is when to update remote fake-roots: the
> remote should not cut off my local repo (unless explicitly asked).
> So remote fake-roots should only be imported if the local/receiving
> side no longer has the dropped commits.

You may be interested in grafts and replacement refs; see
git-filter-branch(1) and git-replace(1) for some hints.

Good luck,
Jonathan