From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-2.9 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,T_RP_MATCHES_RCVD shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id E38BF1F404 for ; Thu, 25 Jan 2018 14:02:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751444AbeAYOCo (ORCPT ); Thu, 25 Jan 2018 09:02:44 -0500 Received: from mail-qt0-f193.google.com ([209.85.216.193]:43922 "EHLO mail-qt0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751161AbeAYOCn (ORCPT ); Thu, 25 Jan 2018 09:02:43 -0500 Received: by mail-qt0-f193.google.com with SMTP id s3so19351919qtb.10 for ; Thu, 25 Jan 2018 06:02:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=/Jc8Y2B1oLiJROpxHI1S3UVpNFoEgHlhudSImJtrDYo=; b=toW9gVw+BwwEVElSzNGZwlfayufAcKxPxuy46+iTbd9J02e2JT+ywDwVOURnFmstos +Ek9VlkG5vc+3LC2gOrKmHBvf66sFfI3+GtD0FTCxuZF3XxO7+omaDsXKdZ2e7c5Imz+ o3+jAPKAnwwXGgzlbxVMGSWki6norjN6I+A/q2jZkEiianiGNIMxbQeulGW417oKRM9w qy74Xu/dtqnoYwrkzdfydDOA/asY8/P9TMsTzBsjqTk9UFRgeimDcQ1J6Wh2DdagN4I7 on1w6+GJD/ilO9jSda+vG4GxmpSgmMeVY+jyOD18NHJjRi0XXx5F4mLGLd9Rl6R5lhrK XU2Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=/Jc8Y2B1oLiJROpxHI1S3UVpNFoEgHlhudSImJtrDYo=; b=uB50rEXd8XCEOdsqEEpKWvtlfPZpSBgs9ZJ0GThHeZGUH/zjsHBNgHRhGdMQYKKivV r5/y6ubFJGLz8X0RxDX1GB+0ifFxJp5kbHQ7yWv8nr5pwYls6ai7eGA+itZ/yjTBM5Ni g8WZQ8xzawX2/2ntKRILivGoX8paxHrVPa7Ty71uBQiBcAVR/ra6VotnjaCW7eC4nwAW dPKNkwBke9Ia20/DCpEZJYCzHtquptWEBodvlTKnVjeyOtsfflQBXdSg7rzLQto8Azdx pURK/dOtUmkRXZkCzSOtokih6gwlOXtUrVpLnByO5diYBdCmfuWjTPR3RbyAB1nLESFq V9MQ== X-Gm-Message-State: AKwxyteMFpAfaoWgZBVt020Ii19sEDuVugZYSV0rxQxrIUXeyXMFvoAR P5G0Hj6Z83ay6EXxPElnO86AHK+e X-Google-Smtp-Source: AH8x224M+iBfVeIgQVl8HfPQmEUUbLAWgb4ljE8MNCwFtNqs9QCs3NISPn0LXTUvgVEYRB3ORN6gZA== X-Received: by 10.200.46.139 with SMTP id h11mr16112429qta.111.1516888961859; Thu, 25 Jan 2018 06:02:41 -0800 (PST) Received: from stolee-linux.corp.microsoft.com ([2001:4898:8010:0:eb4a:5dff:fe0f:7308]) by smtp.gmail.com with ESMTPSA id y123sm1850875qka.42.2018.01.25.06.02.36 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Thu, 25 Jan 2018 06:02:36 -0800 (PST) From: Derrick Stolee X-Google-Original-From: Derrick Stolee To: git@vger.kernel.org Cc: gitster@pobox.com, peff@peff.net, git@jeffhostetler.com, sbeller@google.com, dstolee@microsoft.com Subject: [PATCH 01/14] graph: add packed graph design document Date: Thu, 25 Jan 2018 09:02:18 -0500 Message-Id: <20180125140231.65604-2-dstolee@microsoft.com> X-Mailer: git-send-email 2.16.0 In-Reply-To: <20180125140231.65604-1-dstolee@microsoft.com> References: <20180125140231.65604-1-dstolee@microsoft.com> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Add Documentation/technical/packed-graph.txt with details of the planned packed graph feature, including future plans. Signed-off-by: Derrick Stolee --- Documentation/technical/packed-graph.txt | 185 +++++++++++++++++++++++++++++++ 1 file changed, 185 insertions(+) create mode 100644 Documentation/technical/packed-graph.txt diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt new file mode 100644 index 0000000000..fcc0c83874 --- /dev/null +++ b/Documentation/technical/packed-graph.txt @@ -0,0 +1,185 @@ +Git Packed Graph Design Notes +============================= + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows above 100K. +The merge base calculation shows up in many user-facing commands, such +as 'status' and 'fetch' and can take minutes to compute depending on +data shape. There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to avoid topological order mistakes. + +The packed graph is a file that stores the commit graph structure along +with some extra metadata to speed up graph walks. This format allows a +consumer to load the following info for a commit: + +1. The commit OID. +2. The list of parents. +3. The commit date. +4. The root tree OID. +5. An integer ID for fast lookups in the graph. +6. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +By providing an integer ID we can avoid lookups in the graph as we walk +commits. Specifically, we need to provide the integer ID of the parent +commits so we navigate directly to their information on request. + +Define the "generation number" of a commit recursively as follows: + * A commit with no parents (a root commit) has generation number 1. + * A commit with at least one parent has generation number 1 more than + the largest generation number among its parents. +Equivalently, the generation number is one more than the length of a +longest path from the commit to a root commit. The recursive definition +is easier to use for computation and the following property: + + If A and B are commits with generation numbers N and M, respectively, + and N <= M, then A cannot reach B. That is, we know without searching + that B is not an ancestor of A because it is further from a root commit + than A. + + Conversely, when checking if A is an ancestor of B, then we only need + to walk commits until all commits on the walk boundary have generation + number at most N. If we walk commits using a priority queue seeded by + generation numbers, then we always expand the boundary commit with highest + generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + + If A and B are commits with commit time X and Y, respectively, and + X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation can make +mistakes with topological orders (such as "git log" with default order), +but is not used when the topological order is required (such as merge +base calculations, "git log --graph"). + +Design Details +-------------- + +- A graph file is stored in a file named 'graph-.graph' in the pack + directory. This could be stored in an alternate. + +- The most-recent graph file OID is stored in a 'graph-head' file for + immediate access and storing backup graphs. This could be stored in an + alternate, and refers to a 'graph-.graph' file in the same pack + directory. + +- The core.graph config setting must be on to create or consume graph files. + +- The graph file is only a supplemental structure. If a user downgrades + or disables the 'core.graph' config setting, then the existing ODB is + sufficient. + +- The file format includes parameters for the object id length + and hash algorithm, so a future change of hash algorithm does + not require a change in format. + +Current Limitations +------------------- + +- Only one graph file is used at one time. This allows the integer ID to + seek into the single graph file. It is possible to extend the model + for multiple graph files, but that is currently not part of the design. + +- .graph files are managed only by the 'graph' builtin. These are not + updated automatically during clone or fetch. + +- There is no '--verify' option for the 'graph' builtin to verify the + contents of the graph file. + +- The graph only considers commits existing in packfiles and does not + walk to fill in reachable commits. [Small] + +- When rewriting the graph, we do not check for a commit still existing + in the ODB, so garbage collection may remove commits + +- Generation numbers are not computed in the current version. The file + format supports storing them, along with a mechanism to upgrade from + a file without generation numbers to one that uses them. + +Future Work +----------- + +- The file format includes room for precomputed generation numbers. These + are not currently computed, so all generation numbers will be marked as + 0 (or "uncomputed"). A later patch will include this calculation. + +- The current implementation of the 'graph' builtin walks all packed objects + to find a complete list of commits in packfiles. If some commits are + stored as loose objects, then these do not appear in the graph. This is + handled gracefully by the file format, but it would cause incorrect + generation number calculations. We should implement the construct_graph() + method in a way that walks all commits reachable from some starting set + and then can use complete information for generation numbers. (Some + care must be taken around shallow clones.) + +- The graph is not currently integrated with grafts. + +- After computing and storing generation numbers, we must make graph + walks aware of generation numbers to gain performance benefits. This + will mostly be accomplished by swapping a commit-date-ordered priority + queue with one ordered by generation number. The following operations + are important candidates: + + - paint_down_to_common() + - 'log --topo-order' + +- The graph currently only adds commits to a previously existing graph. + When writing a new graph, we could check that the ODB still contains + the commits and choose to remove the commits that are deleted from the + ODB. For performance reasons, this check should remain optional. + +- Currently, parse_commit_gently() requires filling in the root tree + object for a commit. This passes through lookup_tree() and consequently + lookup_object(). Also, it calls lookup_commit() when loading the parents. + These method calls check the ODB for object existence, even if the + consumer does not need the content. For example, we do not need the + tree contents when computing merge bases. Now that commit parsing is + removed from the computation time, these lookup operations are the + slowest operations keeping graph walks from being fast. Consider + loading these objects without verifying their existence in the ODB and + only loading them fully when consumers need them. Consider a method + such as "ensure_tree_loaded(commit)" that fully loads a tree before + using commit->tree. + +- The current design uses the 'graph' builtin to generate the graph. When + this feature stabilizes enough to recommend to most users, we should + add automatic graph writes to common operations that create many commits. + For example, one coulde compute a graph on 'clone' and 'fetch' commands. + +Related Links +------------- + +[0] https://bugs.chromium.org/p/git/issues/detail?id=8 + Chromium work item for: Serialized Commit Graph + +[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/ + An abandoned patch that introduced generation numbers. + +[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ + Discussion about generation numbers on commits and how they interact + with fsck. + +[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba + More discussion about generation numbers and not storing them inside + commit objects. A valuable quote: + + "I think we should be moving more in the direction of keeping + repo-local caches for optimizations. Reachability bitmaps have been + a big performance win. I think we should be doing the same with our + properties of commits. Not just generation numbers, but making it + cheap to access the graph structure without zlib-inflating whole + commit objects (i.e., packv4 or something like the "metapacks" I + proposed a few years ago)." + +[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u + A patch to remove the ahead-behind calculation from 'status'. -- 2.16.0