From: Elijah Newren <newren@gmail.com>
To: git@vger.kernel.org
Cc: pclouds@gmail.com, Elijah Newren <newren@gmail.com>
Subject: [RFC PATCH 01/15] README-sparse-clone: Add a basic writeup of my ideas for sparse clones
Date: Sat,  4 Sep 2010 18:13:53 -0600
Message-ID: <1283645647-1891-2-git-send-email-newren@gmail.com>
In-Reply-To: <1283645647-1891-1-git-send-email-newren@gmail.com>

This write-up just has basic ideas, strategies, notes of what needs to be
done, etc.  It needs to be pruned, cleaned up, corrected as I learn more,
moved elsewhere, etc.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 README-sparse-clone |  283 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 283 insertions(+), 0 deletions(-)
 create mode 100644 README-sparse-clone

diff --git a/README-sparse-clone b/README-sparse-clone
new file mode 100644
index 0000000..cfeeef3
--- /dev/null
+++ b/README-sparse-clone
@@ -0,0 +1,283 @@
+This is my set of notes on implementing sparse clones, which I define
+as a clone where not all blob, tree, or commit objects are downloaded.
+This includes sparseness both relative to span of directories and
+depth of history.
+
+(Note: This project has work-in-progress patches -- no promises about
+quality, speed of implementation, not rebasing, etc. etc.)
+
+*** Summary ***
+
+  Basic Idea:
+    0) Only relevant blobs, trees, and commits (+ ancestry) are downloaded.
+  User View:
+    U1) A user controls sparseness by passing rev-list arguments to clone.
+    U2) "Densifying" a sparse clone can be done (with new rev-list arguments)
+    U3) Cloning-from/fetching-from/pushing-to sparse clones is supported.
+    U4) Operations that need unavailable data simply error out
+    U5) Old style shallow clones (--depth argument to clone) are obsolete
+    U6) Miscellaneous notes
+  Internals:
+    I1) The limiting rev-list arguments passed to clone are stored.
+    I2) All revision-walking operations automatically use the limiting args.
+    I3) The index only contains paths matching the sparse limits
+    I4) Loading a missing commit results in a fake commit being created
+    I5) In sparse clones, a special merge strategy must be used
+    I6) Miscellaneous notes
+
+*** Basic Idea ***
+
+0) Only relevant blobs, trees, and commits (+ ancestry) are downloaded.
+
+Only the relevant blobs, trees, and commits are downloaded.
+Irrelevant blobs and trees are left out entirely (see items I2 & I3
+for how we avoid accessing these).
+
+To ensure minimum necessary connectivity, we also download basic
+information from otherwise excluded commits:
+  * parents of these commits
+  * trees matching the specified sparse path(s)
+but, for security and space reasons, do not download:
+  * author
+  * author date
+  * committer
+  * committer date
+  * log message
+Such commits are still considered "missing" (see item I4 for more
+details about how we handle "missing" commits).
+
+Tags/branches are downloaded if specified (or, if no branch/tag is
+specified, all tags/branches are downloaded).
+
+Security note: No modifications are done to existing trees, meaning
+that sparse clones will download the name of "irrelevant" blobs/trees
+with their type, mode, and sha1sum if (and only if) such blobs/trees
+are siblings of a relevant blob/tree.  It is assumed that such
+information is okay to be transmitted and need not remain private; if
+such information does need to remain private, an alternate mechanism
+involving rewriting commits will be necessary (such as git-subtree).
+
+*** User View ***
+
+U1) A user controls sparseness by passing rev-list arguments to clone.
+
+This allows a user to control sparseness both in terms of span of
+content (files/directories) and depth of history.  It can also be used
+to limit to a subset of refs (cloning just one or two branches instead
+of all branches and tags).  For example,
+  $ git clone ssh://repo.git dst -- Documentation/
+  $ git clone ssh://repo.git dst master~6..master
+  $ git clone ssh://repo.git dst -3
+(Note that the destination argument becomes mandatory for those doing
+a sparse clone in order to disambiguate it from rev-list options.)
+
+This method also means users don't need much training to learn how to
+use sparse clones -- they just use syntax they've already learned with
+log, and clone will pass this info on to upload-pack.
+
+There is a difference due to inclusive revision specifications
+(master, master~6, v4.15.6) vs. exclusive ones (-3, ^master,
+^master~6).  Inclusive revisions must be branch or tag names
+(e.g. stable or v1.8, but not master~6 or v4.18.2~1 or sha1sum or
+:/<search string>)[1].  "HEAD --all"
+are assumed if no inclusive revisions are specified.  (Note: Avery
+seems to suggest always assuming "HEAD --all", at least at first.)
+
+[1] This limitation on inclusive revisions could be relaxed in the
+future for specifications derived from branch names, as long as each
+branch has no more than one associated derived revision specification.
+For example, master~6 would mean to clone a copy of the master branch
+on the remote side, excluding the last 6 commits, so that you start
+out "6 commits behind" the remote.  Obviously, it wouldn't make sense
+to have both "master^1" and "master^2" specified, since we then
+wouldn't know where master should point in the clone.
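A minimal sketch of how the arguments in U1 might be sorted into
inclusive revisions, exclusive revisions, and paths. The names
SparseLimits and split_limits are hypothetical and nothing here is
part of git; three-dot ranges and ranges with an omitted side are
deliberately left unhandled.

```python
from dataclasses import dataclass, field


@dataclass
class SparseLimits:
    inclusive: list = field(default_factory=list)  # refs to clone, e.g. "master"
    exclusive: list = field(default_factory=list)  # e.g. "^master~6" or "-3"
    paths: list = field(default_factory=list)      # everything after "--"


def split_limits(args):
    """Split hypothetical sparse-clone arguments into the three groups."""
    limits = SparseLimits()
    if "--" in args:
        cut = args.index("--")
        revs, limits.paths = args[:cut], list(args[cut + 1:])
    else:
        revs = args
    for arg in revs:
        if ".." in arg and not arg.startswith("^"):
            # "A..B" is shorthand for "^A B".
            lo, hi = arg.split("..", 1)
            limits.exclusive.append("^" + lo)
            limits.inclusive.append(hi)
        elif arg.startswith("^") or (arg.startswith("-") and arg[1:].isdigit()):
            limits.exclusive.append(arg)
        else:
            limits.inclusive.append(arg)
    if not limits.inclusive:
        # Default described in the notes when no inclusive revision is given.
        limits.inclusive = ["HEAD", "--all"]
    return limits
```

With this split, "master~6..master" yields the exclusive revision
"^master~6" and the inclusive branch name "master", which satisfies
the rule that inclusive revisions be plain branch or tag names.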
+
+U2) "Densifying" a sparse clone can be done (with new rev-list arguments)
+
+One can fetch a new pack, replace the original limiting rev-list args
+with the new choice (see item I1), and update the working copy to
+reflect the changes.  As users wouldn't expect a "fetch" or a "merge"
+to un-sparsify a checkout, there's a special operation for performing
+all three operations.
+
+[First cut will be to just redownload everything, instead of just the
+necessary data.  I'm thinking it won't be a common operation, and it
+could always be improved later.]
+
+U3) Cloning-from/fetching-from/pushing-to sparse clones is supported.
+
+This allows people who need to operate on a subset of the repository
+(e.g. translators, technical writers, etc.) to collaborate on that
+subset.  I think one simple rule should enable this:
+
+  * The receiving repository specifies the limiting rev-list arguments
+    to use (if the sending repository does not have the relevant data,
+    it will naturally error out)
+
+By having the receiving side specify the limiting rev-list arguments,
+it ensures that any data it receives fulfills its needs.  The sending
+side then uses this information when creating a pack to determine the
+necessary objects to send, ignoring anything outside the paths/ranges
+specified in those limits.  If the sending side is a sparse clone that
+does not have the necessary data specified by the receiver, then
+pack-objects will hit a nasty low-level missing object error, aborting
+the operation.  In the future, we could maybe add a nicer error
+message.
+
+One special case:
+  * When cloning a repository, if the user did not specify any
+    limiting rev-list arguments, use those from the repository being
+    cloned.  (Don't require the user to type out all the paths every
+    time; e.g. 'git clone URL DEST -- PATH1 PATH2 PATH3 PATH4...')
+
+U4) Operations that need unavailable data simply error out
+
+Although no normal git command should be disabled entirely, there will
+be cases when some git commands cannot function without more data.
+
+Examples:
+  * merge, cherry-pick, rebase (if unavailable files needed)
+  * upload-pack (if more data requested than available in a sparse clone)
+
+Merge, cherry-pick, and rebase deserve special consideration to
+operate in sparse clones (see item I5), since merge strategies
+normally require full trees.
+
+U5) Old style shallow clones (--depth argument to clone) are obsolete
+
+Since one can pass "-3" to get a "shallow" clone, old-style shallow
+clones are obsolete.  New style shallow/sparse clones will also be
+more capable, since one can
+  * exclude based on commit (e.g. ^master~10) in addition to depth
+  * clone/push/pull from/to shallow clones
+
+What to do with old style shallow clones?  Probably deprecate them,
+make the --depth argument to clone print an error message suggesting
+the new syntax, and then gut the related code at some point in the
+future.
+
+U6) Miscellaneous notes
+  * fsck & status should print a notice when working on a sparse clone
+  * paths in limiting rev-list args *must* follow '--' (current or
+    future remote repo may be bare, meaning setup_revisions will
+    complain about nonexistent paths specified without a preceding
+    '--').  Having all paths follow a '--' will also make it easier to
+    find them and pass them on to diff machinery (see item I2).
+  * notes hierarchy may also need to be made sparse, so that only
+    notes pointing to downloaded objects are downloaded.  This
+    implies missing blobs/trees, and maybe even "missing" commits.
+    But how do I avoid traversing the wrong notes on the client side?
+    Ouch.  Maybe just include all notes?  Or exclude all notes?
+
+*** Internals ***
+
+I1) The limiting rev-list arguments passed to clone are stored.
+
+However, relative arguments such as "-3" or "^master~6" first need to
+be translated into one or more exclude ranges written as "^<sha1>".
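One way the translation in I1 could work for a count limit such as
"-3", sketched on a toy commit graph. The commits mapping and the
function boundary_excludes are hypothetical stand-ins, not git APIs,
and the sketch assumes every commit has a later timestamp than its
parents and that max_count is at least 1.

```python
import heapq


def boundary_excludes(commits, tips, max_count):
    """Return "^<id>" excludes equivalent to limiting the walk to max_count.

    commits maps id -> (timestamp, [parent ids]); it is a toy stand-in
    for the object database.
    """
    # Walk newest-first, as a date-ordered revision walk would.
    heap = [(-commits[t][0], t) for t in set(tips)]
    heapq.heapify(heap)
    seen = set(tips)
    included = []
    while heap and len(included) < max_count:
        _, cid = heapq.heappop(heap)
        included.append(cid)
        for parent in commits[cid][1]:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-commits[parent][0], parent))
    kept = set(included)
    # Boundary: parents of kept commits that were not kept themselves.
    boundary = {p for c in included for p in commits[c][1] if p not in kept}
    return sorted("^" + b for b in boundary)
```

On a linear history this gives a single exclude; when the cut crosses
a merge it gives one exclude per boundary parent, which is why the
stored form may need more than one "^<sha1>".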
+
+I2) All revision-walking operations automatically use the limiting args.
+
+This should be a simple code change, and would enable rev-list, log,
+diff (which also uses the revision walking machinery), etc. to avoid
+missing blobs/trees/commits and thus enable them to work with sparse
+clones.  fsck would take a bit more work, since it doesn't use the
+setup_revisions() and revision.h walking machinery, but shouldn't be
+too bad (I hope).
+
+Also, the pathspecs (or the diff options they generate) are available
+easily for operations that need them (see I3).
+
+I3) The index only contains paths matching the sparse limits
+
+Since not all trees are downloaded, not all files can even be
+referenced in the index.  Further, in some cases, the only thing that
+can be referenced is a tree rather than a file.  We only want paths
+matching the relevant sparse limits to be included in the index.  This
+means two things:
+  * When extracting entries from trees into the index, the sparse limits
+    need to be taken into consideration
+  * Whenever writing trees, using the index is no longer sufficient.
+    Instead, the files in the index are used to record
+    sha1sums/modes/filenames for paths within the sparse limits, and
+    another tree (typically from HEAD) is used to record
+    sha1sums/modes/filenames/types for paths outside the sparse
+    limits.
+
+Note that writing trees from the index can occur with commit, merge,
+checkout (-m), revert/cherry-pick --no-commit, and write-tree.  All
+need to be updated to either provide a relevant tree or error out when
+run from a sparse clone.
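The second point in I3 can be sketched with flat dictionaries standing
in for the index and for the tree taken from HEAD. This is only an
illustration under stated assumptions: the helpers in_limits and
entries_for_write_tree are hypothetical, entries are (mode, type, sha1)
tuples, and real trees are nested rather than flat.

```python
def in_limits(path, prefixes):
    """True if path lies at or under one of the sparse path prefixes."""
    for prefix in prefixes:
        prefix = prefix.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            return True
    return False


def entries_for_write_tree(index, head_tree, prefixes):
    """Combine in-limit index entries with out-of-limit HEAD entries."""
    stray = [p for p in index if not in_limits(p, prefixes)]
    if stray:
        # The index is only allowed to hold paths inside the sparse limits.
        raise ValueError("index has paths outside the sparse limits: %s" % stray)
    # Outside the limits, carry over whatever HEAD recorded (blob or tree).
    merged = {p: e for p, e in head_tree.items() if not in_limits(p, prefixes)}
    # Inside the limits, the index is authoritative (including deletions).
    merged.update(index)
    return merged
```

Entries outside the limits are copied by sha1 only, so a whole
unavailable subtree can be carried forward without its contents.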
+
+I4) Loading a missing commit results in a fake commit being created
+
+Fake commits have correct parentage and an appropriate (sparse) tree
+(since those pieces of information are available), but blank author &
+committer, 0 for times & timezones, and a commit log message such as
+the following:
+  This commit is missing from this sparse clone.  You can use the
+  densify command to download missing commits and files.
+
+This allows the following to work:
+  * git commit (which needs tree/file sha1sums that were not modified,
+    though if a given tree is unmodified, no subtree/subfile sha1s are
+    needed)
+  * tags & branches (which can correctly point at missing commits)
+  * git show (with a branch/tag/commit)
+  * git prune (missing objects correctly reference their parent(s))
+  * git fsck (missing commits still referenced)
+
+Extra notes:
+  * Stored in a file using multiple lines of: <commit> <tree> <parent1> ...
+  * Only referenced when git would otherwise die
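A sketch of reading the "<commit> <tree> <parent1> ..." lines from I4
into fake commit records. FakeCommit and load_missing_commits are
hypothetical names; the blank identities, zero times, and placeholder
message follow the description above, but the exact on-disk format is
an assumption.

```python
from dataclasses import dataclass

PLACEHOLDER_MESSAGE = (
    "This commit is missing from this sparse clone.  You can use the\n"
    "densify command to download missing commits and files.\n"
)


@dataclass(frozen=True)
class FakeCommit:
    sha: str
    tree: str
    parents: tuple
    author: str = ""          # blank: identities are not downloaded
    committer: str = ""
    author_time: int = 0      # 0 for times and timezones
    committer_time: int = 0
    message: str = PLACEHOLDER_MESSAGE


def load_missing_commits(text):
    """Parse lines of "<commit> <tree> <parent1> ..." into a lookup table."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) < 2:
            raise ValueError("malformed line: %r" % line)
        sha, tree, *parents = fields
        table[sha] = FakeCommit(sha, tree, tuple(parents))
    return table
```

A lookup in this table would happen only at the point where git would
otherwise die on a missing commit object.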
+
+I5) In sparse clones, a special merge strategy must be used
+
+Most merge strategies work at the file/content level.  Since many
+files and even whole trees will be unavailable, a special strategy
+that works with tree-level items is necessary.  It should only perform
+trivial merges when forced to operate at the tree-level (modified on
+at most one side of history, and probably no rename handling at least
+at first).  When such trivial merges are not possible, it should fail
+with a helpful error message noting the needed tree contents.
+
+For non-missing blobs, standard merge strategies may be used.
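A rough sketch of the tree-level trivial merge described in I5, using
dictionaries of name -> sha1 for one tree level. The names
NeedsContent, trivial_merge_entry, and trivial_merge_tree are
hypothetical. Accepting an identical change on both sides goes
slightly beyond "modified on at most one side" and is an assumption of
this sketch; there is no rename handling.

```python
class NeedsContent(Exception):
    """Raised when entries changed on both sides and contents are required."""


def trivial_merge_entry(base, ours, theirs):
    """Merge one entry by sha1 alone; None means the entry is absent."""
    if ours == theirs:
        return ours      # unchanged, or changed identically on both sides
    if ours == base:
        return theirs    # only their side changed
    if theirs == base:
        return ours      # only our side changed
    raise NeedsContent()


def trivial_merge_tree(base, ours, theirs):
    """Merge one tree level, failing with the list of entries needed."""
    merged = {}
    conflicts = []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        try:
            sha = trivial_merge_entry(
                base.get(name), ours.get(name), theirs.get(name))
        except NeedsContent:
            conflicts.append(name)
            continue
        if sha is not None:
            merged[name] = sha
    if conflicts:
        raise NeedsContent("need contents of: " + ", ".join(conflicts))
    return merged
```

The error carries the names of the conflicting entries so the message
can tell the user which tree contents would have to be fetched.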
+
+I6) Miscellaneous notes
+  * thin-packs: git pack-objects needs to be told to only delta
+    against objects that match the sparse limits, otherwise the
+    receiving side will not be able to use the resulting pack.
+
+----------------------------------------------------------------------
+
+Testcases needed:
+  * basics:        checkout, status, diff, log (w/ options!), add, commit
+  * extras:        blame, apply, bisect, branch, tag, grep, reset
+  * maintenance:   fsck, prune, gc/repack, verify-pack
+  * plumbing:      {read,write,ls,commit,merge,tar,diff}-tree, mktree
+  * direct:        cat-file, show (esp. missing obj. or tag/branch of such)
+  * merge strat.:  merge, cherry-pick/revert, rebase
+  * communication: pull, push, fetch, clone, bundle, archive
+  * protocols:     http, ssh, git, rsync
+  * rewrite:       filter-branch, fast-{export, import}
+  * notes:         ?
+
+  General:
+    'clone NON-BARE-REPO dst PATHS' should fail (needs double dash)!
+    git rev-list master should show subset of available commits
+  Keep Index sparse:
+    git add <path> for <path> not in git_sparse_pathspec should error out
+    update-index on <path> not in git_sparse_pathspec should error out
+  Sparse Index Handling:
+    merge into branch yet to be born, revert
+    checkout -m  (to real branch, from valid or yet-to-be born branch)
+
+  Major TODOs:
+    * fetch
+    * push
+    * don't pass rev-list arguments on command line to upload-pack; use protocol
+    * densify command
+    * missing commits
+    * fix thin packs to only delta against objects within sparse limits
+    * lots more testcases
+    * cleanup FIXMEs
-- 
1.7.2.2.140.gd06af
