git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, newren@gmail.com,
	"Derrick Stolee" <stolee@gmail.com>,
	"René Scharfe" <l.s.r@web.de>,
	"Derrick Stolee" <derrickstolee@github.com>
Subject: [PATCH v3 00/10] Cleanups around index operations
Date: Thu, 07 Jan 2021 16:32:01 +0000	[thread overview]
Message-ID: <pull.829.v3.git.1610037131.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.829.v2.git.1609729758.gitgitgadget@gmail.com>

I've taken a professional interest in the index lately, and I've been trying
mostly to learn about it and measure different operations. Along the way,
I've seen some possible improvements in documentation, behavior, and
tracing.

This series collects most of my thoughts so far, including:

 1. Adding extra trace2 regions and statistics (similar to [1]) (Patches
    1-5).
 2. Update documentation about the cached tree extension (Patches 6-7).
 3. Improved the performance of verify_cache() (Patches 8-9).

Thanks, -Stolee

[1]
https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@gmail.com/


UPDATES IN V3
=============

 * Added a patch that fixes previous uses of "cached tree"

 * Updated preamble patch with typos and semantic corrections.

Derrick Stolee (9):
  tree-walk: report recursion counts
  unpack-trees: add trace2 regions
  cache-tree: use trace2 in cache_tree_update()
  cache-tree: trace regions for I/O
  cache-tree: trace regions for prime_cache_tree
  index-format: use 'cache tree' over 'cached tree'
  index-format: update preamble to cache tree extension
  index-format: discuss recursion of cached-tree better
  cache-tree: speed up consecutive path comparisons

René Scharfe (1):
  cache-tree: use ce_namelen() instead of strlen()

 Documentation/technical/index-format.txt | 42 ++++++++++++++++++------
 cache-tree.c                             | 30 +++++++++++++----
 t/t7104-reset-hard.sh                    |  2 +-
 tree-walk.c                              | 33 +++++++++++++++++++
 unpack-trees.c                           |  5 +++
 5 files changed, 94 insertions(+), 18 deletions(-)


base-commit: 71ca53e8125e36efbda17293c50027d31681a41f
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-829%2Fderrickstolee%2Fcache-tree%2Fbasics-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-829/derrickstolee/cache-tree/basics-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/829

Range-diff vs v2:

  1:  0e500c86f39 =  1:  0e500c86f39 tree-walk: report recursion counts
  2:  4157b91acf8 =  2:  4157b91acf8 unpack-trees: add trace2 regions
  3:  8959d57abdd =  3:  8959d57abdd cache-tree: use trace2 in cache_tree_update()
  4:  1d8a797ee26 =  4:  1d8a797ee26 cache-tree: trace regions for I/O
  5:  2b2e70bb77c =  5:  2b2e70bb77c cache-tree: trace regions for prime_cache_tree
  -:  ----------- >  6:  2d7b18c2e0b index-format: use 'cache tree' over 'cached tree'
  6:  75b51483d3c !  7:  c5cffb5956e index-format: update preamble to cached tree extension
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    index-format: update preamble to cached tree extension
     +    index-format: update preamble to cache tree extension
      
     -    I had difficulty in my efforts to learn about the cached tree extension
     +    I had difficulty in my efforts to learn about the cache tree extension
          based on the documentation and code because I had an incorrect
          assumption about how it behaved. This might be due to some ambiguity in
          the documentation, so this change modifies the beginning of the cached
     @@ Commit message
      
          3. When exactly are "new" trees created?
      
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## Documentation/technical/index-format.txt ##
      @@ Documentation/technical/index-format.txt: Git index format
       
     - === Cached tree
     + === Cache tree
       
     --  Cached tree extension contains pre-computed hashes for trees that can
     +-  Cache tree extension contains pre-computed hashes for trees that can
      -  be derived from the index. It helps speed up tree object generation
      -  from index for a new commit.
      -
     @@ Documentation/technical/index-format.txt: Git index format
      +  Since the index does not record entries for directories, the cache
      +  entries cannot describe tree objects that already exist in the object
      +  database for regions of the index that are unchanged from an existing
     -+  commit. The cached tree extension stores a recursive tree structure that
     ++  commit. The cache tree extension stores a recursive tree structure that
      +  describes the trees that already exist and completely match sections of
      +  the cache entries. This speeds up tree object generation from the index
      +  for a new commit by only computing the trees that are "new" to that
     -+  commit.
     ++  commit. It also assists when comparing the index to another tree, such
     ++  as `HEAD^{tree}`, since sections of the index can be skipped when a tree
     ++  comparison demonstrates equality.
      +
      +  The recursive tree structure uses nodes that store a number of cache
      +  entries, a list of subnodes, and an object ID (OID). The OID references
     -+  the exising tree for that node, if it is known to exist. The subnodes
     -+  correspond to subdirectories that themselves have cached tree nodes. The
     ++  the existing tree for that node, if it is known to exist. The subnodes
     ++  correspond to subdirectories that themselves have cache tree nodes. The
      +  number of cache entries corresponds to the number of cache entries in
      +  the index that describe paths within that tree's directory.
      +
     -+  Note that the path for a given tree is part of the parent node in-memory
     -+  but is part of the child in the file format. The root tree has an empty
     -+  string for its name and its name does not exist in-memory.
     ++  The extension tracks the full directory structure in the cache tree
     ++  extension, but this is generally smaller than the full cache entry list.
      +
      +  When a path is updated in index, Git invalidates all nodes of the
     -+  recurisive cached tree corresponding to the parent directories of that
     ++  recursive cache tree corresponding to the parent directories of that
      +  path. We store these tree nodes as being "invalid" by using "-1" as the
     -+  number of cache entries. To create trees corresponding to the current
     -+  index, Git only walks the invalid tree nodes and uses the cached OIDs
     -+  for the valid trees to construct new trees. In this way, Git only
     -+  constructs trees on the order of the number of changed paths (and their
     -+  depth in the working directory). This comes at a cost of tracking the
     -+  full directory structure in the cached tree extension, but this is
     -+  generally smaller than the full cache entry list in the index.
     ++  number of cache entries. Invalid nodes still store a span of index
     ++  entries, allowing Git to focus its efforts when reconstructing a full
     ++  cache tree.
       
         The signature for this extension is { 'T', 'R', 'E', 'E' }.
       
  7:  b2bb141a254 =  8:  97c06c80a85 index-format: discuss recursion of cached-tree better
  8:  5298694786e =  9:  2532f5cc189 cache-tree: use ce_namelen() instead of strlen()
  9:  72edd7bb427 = 10:  7c1c206a0bc cache-tree: speed up consecutive path comparisons

-- 
gitgitgadget

  parent reply	other threads:[~2021-01-07 16:33 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-30 19:26 [PATCH 0/8] Cleanups around index operations Derrick Stolee via GitGitGadget
2020-12-30 19:26 ` [PATCH 1/8] tree-walk: report recursion counts Derrick Stolee via GitGitGadget
2020-12-30 19:42   ` Elijah Newren
2020-12-30 19:51     ` Derrick Stolee
2020-12-30 19:26 ` [PATCH 2/8] unpack-trees: add trace2 regions Derrick Stolee via GitGitGadget
2020-12-30 19:45   ` Elijah Newren
2020-12-30 19:26 ` [PATCH 3/8] cache-tree: use trace2 in cache_tree_update() Derrick Stolee via GitGitGadget
2020-12-30 19:26 ` [PATCH 4/8] cache-tree: trace regions for I/O Derrick Stolee via GitGitGadget
2020-12-30 19:26 ` [PATCH 5/8] cache-tree: trace regions for prime_cache_tree Derrick Stolee via GitGitGadget
2020-12-30 19:48   ` Elijah Newren
2020-12-30 19:53     ` Derrick Stolee
2020-12-30 19:26 ` [PATCH 6/8] index-format: update preamble to cached tree extension Derrick Stolee via GitGitGadget
2020-12-30 20:00   ` Elijah Newren
2020-12-30 19:26 ` [PATCH 7/8] index-format: discuss recursion of cached-tree better Derrick Stolee via GitGitGadget
2020-12-30 19:26 ` [PATCH 8/8] cache-tree: avoid path comparison loop when silent Derrick Stolee via GitGitGadget
2020-12-30 20:14   ` Elijah Newren
2021-01-06  8:55     ` Junio C Hamano
2021-01-06 12:08       ` Derrick Stolee
2020-12-31 12:34   ` René Scharfe
2020-12-31 16:46     ` Derrick Stolee
2021-01-01 13:30       ` René Scharfe
2021-01-02 15:19       ` [PATCH] cache-tree: use ce_namelen() instead of strlen() René Scharfe
2021-01-04  1:26         ` Derrick Stolee
2021-01-05 12:05         ` Junio C Hamano
2021-01-02 15:31       ` [PATCH 8/8] cache-tree: avoid path comparison loop when silent René Scharfe
2020-12-30 20:19 ` [PATCH 0/8] Cleanups around index operations Elijah Newren
2020-12-30 20:24   ` Derrick Stolee
2021-01-04  3:09 ` [PATCH v2 0/9] " Derrick Stolee via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 1/9] tree-walk: report recursion counts Derrick Stolee via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 2/9] unpack-trees: add trace2 regions Derrick Stolee via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 3/9] cache-tree: use trace2 in cache_tree_update() Derrick Stolee via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 4/9] cache-tree: trace regions for I/O Derrick Stolee via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 5/9] cache-tree: trace regions for prime_cache_tree Derrick Stolee via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 6/9] index-format: update preamble to cached tree extension Derrick Stolee via GitGitGadget
2021-01-07  2:10     ` Junio C Hamano
2021-01-07 11:51       ` Derrick Stolee
2021-01-07 20:12         ` Junio C Hamano
2021-01-07 21:26         ` Junio C Hamano
2021-01-04  3:09   ` [PATCH v2 7/9] index-format: discuss recursion of cached-tree better Derrick Stolee via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 8/9] cache-tree: use ce_namelen() instead of strlen() René Scharfe via GitGitGadget
2021-01-04  3:09   ` [PATCH v2 9/9] cache-tree: speed up consecutive path comparisons Derrick Stolee via GitGitGadget
2021-01-07 16:32   ` Derrick Stolee via GitGitGadget [this message]
2021-01-07 16:32     ` [PATCH v3 01/10] tree-walk: report recursion counts Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 02/10] unpack-trees: add trace2 regions Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 03/10] cache-tree: use trace2 in cache_tree_update() Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 04/10] cache-tree: trace regions for I/O Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 05/10] cache-tree: trace regions for prime_cache_tree Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 06/10] index-format: use 'cache tree' over 'cached tree' Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 07/10] index-format: update preamble to cache tree extension Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 08/10] index-format: discuss recursion of cached-tree better Derrick Stolee via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 09/10] cache-tree: use ce_namelen() instead of strlen() René Scharfe via GitGitGadget
2021-01-07 16:32     ` [PATCH v3 10/10] cache-tree: speed up consecutive path comparisons Derrick Stolee via GitGitGadget
2021-01-16  6:58     ` [PATCH v3 00/10] Cleanups around index operations Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.829.v3.git.1610037131.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=newren@gmail.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).