From: Elijah Newren <newren@gmail.com>
To: git@vger.kernel.org
Cc: pclouds@gmail.com, Elijah Newren <newren@gmail.com>
Subject: [RFC PATCH 01/15] README-sparse-clone: Add a basic writeup of my ideas for sparse clones
Date: Sat, 4 Sep 2010 18:13:53 -0600
Message-ID: <1283645647-1891-2-git-send-email-newren@gmail.com>
In-Reply-To: <1283645647-1891-1-git-send-email-newren@gmail.com>

This write-up just has basic ideas, strategies, notes of what needs to be
done, etc. It needs to be pruned, cleaned up, corrected as I learn more,
moved elsewhere, etc.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
README-sparse-clone | 283 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 283 insertions(+), 0 deletions(-)
create mode 100644 README-sparse-clone
diff --git a/README-sparse-clone b/README-sparse-clone
new file mode 100644
index 0000000..cfeeef3
--- /dev/null
+++ b/README-sparse-clone
@@ -0,0 +1,283 @@
+This is my set of notes on implementing sparse clones, which I define
+as a clone where not all blob, tree, or commit objects are downloaded.
+This includes sparseness both relative to span of directories and
+depth of history.
+
+(Note: This project has work-in-progress patches -- no promises about
+quality, speed of implementation, not rebasing, etc. etc.)
+
+*** Summary ***
+
+ Basic Idea:
+ 0) Only relevant blobs, trees, and commits (+ ancestry) are downloaded.
+ User View:
+ U1) A user controls sparseness by passing rev-list arguments to clone.
+ U2) "Densifying" a sparse clone can be done (with new rev-list arguments)
+ U3) Cloning-from/fetching-from/pushing-to sparse clones is supported.
+ U4) Operations that need unavailable data simply error out
+ U5) Old style shallow clones (--depth argument to clone) are obsolete
+ U6) Miscellaneous notes
+ Internals:
+ I1) The limiting rev-list arguments passed to clone are stored.
+ I2) All revision-walking operations automatically use the limiting args.
+ I3) The index only contains paths matching the sparse limits
+ I4) Loading a missing commit results in a fake commit being created
+ I5) In sparse clones, a special merge strategy must be used
+ I6) Miscellaneous notes
+
+*** Basic Idea ***
+
+0) Only relevant blobs, trees, and commits (+ ancestry) are downloaded.
+
+Only the relevant blobs, trees, and commits are downloaded.
+Irrelevant blobs and trees are left out entirely (see items I2 & I3
+for how we avoid accessing these).
+
+To ensure minimum necessary connectivity, we also download basic
+information from otherwise excluded commits:
+ * parents of these commits
+ * trees matching the specified sparse path(s)
+but, for security and space reasons, do not download
+ * author
+ * author date
+ * committer
+ * committer date
+ * log message
+Such commits are still considered "missing" (see item I4 for more
+details about how we handle "missing" commits).
+
+Tags/branches are downloaded if specified (or, if no branch/tag is
+specified, all tags/branches are downloaded).
+
+Security note: No modifications are done to existing trees, meaning
+that sparse clones will download the name of "irrelevant" blobs/trees
+with their type, mode, and sha1sum if (and only if) such blobs/trees
+are siblings of a relevant blob/tree. It is assumed that such
+information is okay to be transmitted and need not remain private; if
+such information does need to remain private, an alternate mechanism
+involving rewriting commits will be necessary (such as git-subtree).
+
+*** User View ***
+
+U1) A user controls sparseness by passing rev-list arguments to clone.
+
+This allows a user to control sparseness both in terms of span of
+content (files/directories) and depth of history. It can also be used
+to limit to a subset of refs (cloning just one or two branches instead
+of all branches and tags). For example,
+ $ git clone ssh://repo.git dst -- Documentation/
+ $ git clone ssh://repo.git dst master~6..master
+ $ git clone ssh://repo.git dst -3
+(Note that the destination argument becomes mandatory for those doing
+a sparse clone in order to disambiguate it from rev-list options.)
+
+This method also means users don't need much training to learn how to
+use sparse clones -- they just use syntax they've already learned with
+log, and clone will pass this info on to upload-pack.
+
+There is a difference due to inclusive revision specifications
+(master, master~6, v4.15.6) vs. exclusive ones (-3, ^master,
+^master~6). Inclusive revisions must be branch or tag names
+(e.g. stable or v1.8, but not master~6 or v4.18.2~1 or sha1sum or
+:/<search string>)[1]. "HEAD --all"
+are assumed if no inclusive revisions are specified. (Note: Avery
+seems to suggest always assuming "HEAD --all", at least at first.)
+
+[1] This limitation on inclusive revisions could be relaxed in the
+future for specifications derived from branch names, as long as each
+branch has no more than one associated derived revision specification.
+For example, master~6 would mean to clone a copy of the master branch
+on the remote side, excluding the last 6 commits, so that you start
+out "6 commits behind" the remote. Obviously, it wouldn't make sense
+to have both "master^1" and "master^2" specified, since we then
+wouldn't know where master should point in the clone.
+
+U2) "Densifying" a sparse clone can be done (with new rev-list arguments)
+
+One can fetch a new pack, replace the original limiting rev-list args
+with the new choice (see item I1), and update the working copy to
+reflect the changes. As users wouldn't expect a "fetch" or a "merge"
+to un-sparsify a checkout, a dedicated operation performs all three
+steps.
+
+[First cut will be to just redownload everything, instead of just the
+necessary data. I'm thinking it won't be a common operation, and it
+could always be improved later.]
+
+U3) Cloning-from/fetching-from/pushing-to sparse clones is supported.
+
+This allows people who need to operate on a subset of the repository
+(e.g. translators, technical writers, etc.) to collaborate on that
+subset. I think one simple rule should enable this:
+
+ * The receiving repository specifies the limiting rev-list arguments
+ to use (if the sending repository does not have the relevant data,
+ it will naturally error out)
+
+By having the receiving side specify the limiting rev-list arguments,
+it ensures that any data it receives fulfills its needs. The sending
+side then uses this information when creating a pack to determine the
+necessary objects to send, ignoring anything outside the paths/ranges
+specified in those limits. If the sending side is a sparse clone that
+does not have the necessary data specified by the receiver, then
+pack-objects will hit a nasty low-level missing object error, aborting
+the operation. In the future, we could maybe add a nicer error
+message.
+
+One special case:
+ * When cloning a repository, if the user did not specify any
+ limiting rev-list arguments, use those from the repository being
+ cloned. (Don't require the user to type out all the paths every
+ time; e.g. 'git clone URL DEST -- PATH1 PATH2 PATH3 PATH4...')
+
+U4) Operations that need unavailable data simply error out
+
+Although no normal git command should be disabled entirely, there will
+be cases when some git commands cannot function without more data.
+
+Examples:
+ * merge, cherry-pick, rebase (if unavailable files needed)
+ * upload-pack (if more data requested than available in a sparse clone)
+
+Merge, cherry-pick, and rebase deserve special consideration to
+operate in sparse clones (see item I5), since merge strategies
+normally require full trees.
+
+U5) Old style shallow clones (--depth argument to clone) are obsolete
+
+Since one can pass "-3" to get a "shallow" clone, old-style shallow
+clones are obsolete. New style shallow/sparse clones will also be
+more capable, since one can
+ * exclude based on commit (e.g. ^master~10) in addition to depth
+ * clone/push/pull from/to shallow clones
+
+What to do with old style shallow clones? Probably deprecate them,
+make the --depth argument to clone print an error message suggesting
+the new syntax, and then gut the related code at some point in the
+future.
+
+U6) Miscellaneous notes
+ * fsck & status should print a notice when working on a sparse clone
+ * paths in limiting rev-list args *must* follow '--' (current or
+ future remote repo may be bare, meaning setup_revisions will
+ complain about nonexistent paths specified without a preceding
+ '--'). Having all paths follow a '--' will also make it easier to
+ find them and pass them on to diff machinery (see item I2).
+ * notes hierarchy may also need to be made sparse, so that only
+ notes pointing to downloaded objects are downloaded. This
+ implies missing blobs/trees, and maybe even "missing" commits.
+ But how do I avoid traversing the wrong notes on the client side?
+ Ouch. Maybe just include all notes? Or exclude all notes?
+
+*** Internals ***
+
+I1) The limiting rev-list arguments passed to clone are stored.
+
+However, relative arguments such as "-3" or "^master~6" first need to
+be translated into one or more exclude ranges written as "^<sha1>".
+
+I2) All revision-walking operations automatically use the limiting args.
+
+This should be a simple code change, and would enable rev-list, log,
+diff (which also uses the revision walking machinery), etc. to avoid
+missing blobs/trees/commits and thus enable them to work with sparse
+clones. fsck would take a bit more work, since it doesn't use the
+setup_revisions() and revision.h walking machinery, but shouldn't be
+too bad (I hope).
+
+Also, the pathspecs (or the diff options they generate) are available
+easily for operations that need them (see I3).
+
+I3) The index only contains paths matching the sparse limits
+
+Since not all trees are downloaded, not all files can even be
+referenced in the index. Further, in some cases, the only thing that
+can be referenced is a tree rather than a file. We only want paths
+matching the relevant sparse limits to be included in the index. This
+means two things:
+ * When extracting entries from trees into the index, the sparse limits
+ need to be taken into consideration
+ * Whenever writing trees, using the index is no longer sufficient.
+ Instead, the files in the index are used to record
+ sha1sums/modes/filenames for paths within the sparse limits, and
+ another tree (typically from HEAD) is used to record
+ sha1sums/modes/filenames/types for paths outside the sparse
+ limits.
+
+Note that writing trees from the index can occur with commit, merge,
+checkout (-m), revert/cherry-pick --no-commit, and write-tree. All
+need to be updated to either provide a relevant tree or error out when
+run from a sparse clone.
+
+I4) Loading a missing commit results in a fake commit being created
+
+Fake commits have correct parentage and an appropriate (sparse) tree
+(since those pieces of information are available), but blank author &
+committer, 0 for times & timezones, and a commit log message such as
+the following:
+ This commit is missing from this sparse clone. You can use the
+ densify command to download missing commits and files.
+
+This allows the following to work:
+ * git commit (which needs tree/file sha1sums that were not modified,
+ though if a given tree is unmodified, no subtree/subfile sha1s are
+ needed)
+ * tags & branches (which can correctly point at missing commits)
+ * git show (with a branch/tag/commit)
+ * git prune (missing objects correctly reference their parent(s))
+ * git fsck (missing commits still referenced)
+
+Extra notes:
+ * Stored in a file using multiple lines of: <commit> <tree> <parent1> ...
+ * Only referenced when git would otherwise die
+
+I5) In sparse clones, a special merge strategy must be used
+
+Most merge strategies work at the file/content level. Since many
+files and even whole trees will be unavailable, a special strategy
+that works with tree-level items is necessary. It should only perform
+trivial merges when forced to operate at the tree-level (modified on
+at most one side of history, and probably no rename handling at least
+at first). When such trivial merges are not possible, it should fail
+with a helpful error message noting the needed tree contents.
+
+For non-missing blobs, standard merge strategies may be used.
+
+I6) Miscellaneous notes
+ * thin-packs: git pack-objects needs to be told to only delta
+ against objects that match the sparse limits, otherwise the
+ receiving side will not be able to use the resulting pack.
+
+----------------------------------------------------------------------
+
+Testcases needed:
+ * basics: checkout, status, diff, log (w/ options!), add, commit
+ * extras: blame, apply, bisect, branch, tag, grep, reset
+ * maintenance: fsck, prune, gc/repack, verify-pack
+ * plumbing: {read,write,ls,commit,merge,tar,diff}-tree, mktree
+ * direct: cat-file, show (esp. missing obj. or tag/branch of such)
+ * merge strat.: merge, cherry-pick/revert, rebase
+ * communication: pull, push, fetch, clone, bundle, archive
+ * protocols: http, ssh, git, rsync
+ * rewrite: filter-branch, fast-{export, import}
+ * notes: ?
+
+ General:
+ 'clone NON-BARE-REPO dst PATHS' should fail (needs double dash)!
+ git rev-list master should show subset of available commits
+ Keep Index sparse:
+ git add <path> for <path> not in git_sparse_pathspec should error out
+ update-index on <path> not in git_sparse_pathspec should error out
+ Sparse Index Handling:
+ merge into branch yet to be born, revert
+ checkout -m (to real branch, from valid or yet-to-be born branch)
+
+ Major TODOs:
+ * fetch
+ * push
+ * don't pass revlist arguments on command line to upload pack; use protocol
+ * densify command
+ * missing commits
+ * fix thin packs to only delta against objects within sparse limits
+ * lots more testcases
+ * cleanup FIXMEs
--
1.7.2.2.140.gd06af