All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, peff@peff.net, git@jeffhostetler.com,
	sbeller@google.com, dstolee@microsoft.com
Subject: [PATCH 01/14] graph: add packed graph design document
Date: Thu, 25 Jan 2018 09:02:18 -0500	[thread overview]
Message-ID: <20180125140231.65604-2-dstolee@microsoft.com> (raw)
In-Reply-To: <20180125140231.65604-1-dstolee@microsoft.com>

Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/packed-graph.txt | 185 +++++++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)
 create mode 100644 Documentation/technical/packed-graph.txt

diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt
new file mode 100644
index 0000000000..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.
+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The packed graph is a file that stores the commit graph structure along
+with some extra metadata to speed up graph walks. This format allows a
+consumer to load the following info for a commit:
+
+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.
+
+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a root commit. The recursive definition
+is easier to use for computation and the following property:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N <= M, then A cannot reach B. That is, we know without searching
+    that B is not an ancestor of A because it is further from a root commit
+    than A.
+
+    Conversely, when checking if A is an ancestor of B, then we only need
+    to walk commits until all commits on the walk boundary have generation
+    number at most N. If we walk commits using a priority queue seeded by
+    generation numbers, then we always expand the boundary commit with highest
+    generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+    If A and B are commits with commit time X and Y, respectively, and
+    X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<oid>.graph' in the pack
+  directory. This could be stored in an alternate.
+
+- The most-recent graph file OID is stored in a 'graph-head' file for
+  immediate access and storing backup graphs. This could be stored in an
+  alternate, and refers to a 'graph-<oid>.graph' file in the same pack
+  directory.
+
+- The core.graph config setting must be on to create or consume graph files.
+
+- The graph file is only a supplemental structure. If a user downgrades
+  or disables the 'core.graph' config setting, then the existing ODB is
+  sufficient.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer ID to
+  seek into the single graph file. It is possible to extend the model
+  for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'graph' builtin. These are not
+  updated automatically during clone or fetch.
+
+- There is no '--verify' option for the 'graph' builtin to verify the
+  contents of the graph file.
+
+- The graph only considers commits existing in packfiles and does not
+  walk to fill in reachable commits. [Small]
+
+- When rewriting the graph, we do not check for a commit still existing
+  in the ODB, so garbage collection may remove commits
+
+- Generation numbers are not computed in the current version. The file
+  format supports storing them, along with a mechanism to upgrade from
+  a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+  are not currently computed, so all generation numbers will be marked as
+  0 (or "uncomputed"). A later patch will include this calculation.
+
+- The current implementation of the 'graph' builtin walks all packed objects
+  to find a complete list of commits in packfiles. If some commits are
+  stored as loose objects, then these do not appear in the graph. This is
+  handled gracefully by the file format, but it would cause incorrect
+  generation number calculations. We should implement the construct_graph()
+  method in a way that walks all commits reachable from some starting set
+  and then can use complete information for generation numbers. (Some
+  care must be taken around shallow clones.)
+
+- The graph is not currently integrated with grafts.
+
+- After computing and storing generation numbers, we must make graph
+  walks aware of generation numbers to gain performance benefits. This
+  will mostly be accomplished by swapping a commit-date-ordered priority
+  queue with one ordered by generation number. The following operations
+  are important candidates:
+
+    - paint_down_to_common()
+    - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+  When writing a new graph, we could check that the ODB still contains
+  the commits and choose to remove the commits that are deleted from the
+  ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+  object for a commit. This passes through lookup_tree() and consequently
+  lookup_object(). Also, it calls lookup_commit() when loading the parents.
+  These method calls check the ODB for object existence, even if the
+  consumer does not need the content. For example, we do not need the
+  tree contents when computing merge bases. Now that commit parsing is
+  removed from the computation time, these lookup operations are the
+  slowest operations keeping graph walks from being fast. Consider
+  loading these objects without verifying their existence in the ODB and
+  only loading them fully when consumers need them. Consider a method
+  such as "ensure_tree_loaded(commit)" that fully loads a tree before
+  using commit->tree.
+
+- The current design uses the 'graph' builtin to generate the graph. When
+  this feature stabilizes enough to recommend to most users, we should
+  add automatic graph writes to common operations that create many commits.
+  For example, one coulde compute a graph on 'clone' and 'fetch' commands.
+
+Related Links
+-------------
+
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+    Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+    An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+    Discussion about generation numbers on commits and how they interact
+    with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+    More discussion about generation numbers and not storing them inside
+    commit objects. A valuable quote:
+
+    "I think we should be moving more in the direction of keeping
+     repo-local caches for optimizations. Reachability bitmaps have been
+     a big performance win. I think we should be doing the same with our
+     properties of commits. Not just generation numbers, but making it
+     cheap to access the graph structure without zlib-inflating whole
+     commit objects (i.e., packv4 or something like the "metapacks" I
+     proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+    A patch to remove the ahead-behind calculation from 'status'.
-- 
2.16.0


  reply	other threads:[~2018-01-25 14:02 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-25 14:02 [PATCH 00/14] Serialized Commit Graph Derrick Stolee
2018-01-25 14:02 ` Derrick Stolee [this message]
2018-01-25 20:04   ` [PATCH 01/14] graph: add packed graph design document Stefan Beller
2018-01-26 12:49     ` Derrick Stolee
2018-01-26 18:17       ` Stefan Beller
2018-01-25 21:14   ` Junio C Hamano
2018-01-26 13:06     ` Derrick Stolee
2018-01-26 14:13   ` Duy Nguyen
2018-01-25 14:02 ` [PATCH 02/14] packed-graph: add core.graph setting Derrick Stolee
2018-01-25 20:17   ` Stefan Beller
2018-01-25 20:40     ` Derrick Stolee
2018-01-25 21:43   ` Junio C Hamano
2018-01-26 13:08     ` Derrick Stolee
2018-01-25 14:02 ` [PATCH 03/14] packed-graph: create git-graph builtin Derrick Stolee
2018-01-25 21:45   ` Stefan Beller
2018-01-26 13:13     ` Derrick Stolee
2018-01-25 23:01   ` Junio C Hamano
2018-01-26 13:14     ` Derrick Stolee
2018-01-26 14:16       ` Duy Nguyen
2018-01-25 14:02 ` [PATCH 04/14] packed-graph: add format document Derrick Stolee
2018-01-25 22:06   ` Junio C Hamano
2018-01-25 22:18     ` Stefan Beller
2018-01-25 22:29       ` Junio C Hamano
2018-01-26 13:22         ` Derrick Stolee
2018-01-25 22:07   ` Stefan Beller
2018-01-26 13:25     ` Derrick Stolee
2018-01-25 14:02 ` [PATCH 05/14] packed-graph: implement construct_graph() Derrick Stolee
2018-01-25 23:21   ` Stefan Beller
2018-01-26 20:47     ` Junio C Hamano
2018-01-26 20:55   ` Junio C Hamano
2018-01-26 21:14     ` Andreas Schwab
2018-01-26 22:04       ` Junio C Hamano
2018-01-25 14:02 ` [PATCH 06/14] packed-graph: implement git-graph --write Derrick Stolee
2018-01-25 23:28   ` Stefan Beller
2018-01-26 13:28     ` Derrick Stolee
2018-01-25 14:02 ` [PATCH 07/14] packed-graph: implement git-graph --read Derrick Stolee
2018-01-25 14:02 ` [PATCH 08/14] graph: implement git-graph --update-head Derrick Stolee
2018-01-25 14:02 ` [PATCH 09/14] packed-graph: implement git-graph --clear Derrick Stolee
2018-01-25 23:35   ` Stefan Beller
2018-01-25 14:02 ` [PATCH 10/14] packed-graph: teach git-graph --delete-expired Derrick Stolee
2018-01-25 14:02 ` [PATCH 11/14] commit: integrate packed graph with commit parsing Derrick Stolee
2018-01-26 19:38   ` Stefan Beller
2018-01-25 14:02 ` [PATCH 12/14] packed-graph: read only from specific pack-indexes Derrick Stolee
2018-01-25 14:02 ` [PATCH 13/14] packed-graph: close under reachability Derrick Stolee
2018-01-25 14:02 ` [PATCH 14/14] packed-graph: teach git-graph to read commits Derrick Stolee
2018-01-25 15:46 ` [PATCH 00/14] Serialized Commit Graph Ævar Arnfjörð Bjarmason
2018-01-25 16:09   ` Derrick Stolee
2018-01-25 23:06     ` Ævar Arnfjörð Bjarmason
2018-01-26 12:15       ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180125140231.65604-2-dstolee@microsoft.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.