Re: [PATCH 01/14] graph: add packed graph design document

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org, peff@peff.net, git@jeffhostetler.com,
	sbeller@google.com, dstolee@microsoft.com
Subject: Re: [PATCH 01/14] graph: add packed graph design document
Date: Thu, 25 Jan 2018 13:14:44 -0800	[thread overview]
Message-ID: <xmqqzi518wu3.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <20180125140231.65604-2-dstolee@microsoft.com> (Derrick Stolee's message of "Thu, 25 Jan 2018 09:02:18 -0500")

Derrick Stolee <stolee@gmail.com> writes:

> Add Documentation/technical/packed-graph.txt with details of the planned
> packed graph feature, including future plans.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/packed-graph.txt | 185 +++++++++++++++++++++++++++++++
>  1 file changed, 185 insertions(+)
>  create mode 100644 Documentation/technical/packed-graph.txt

I really wanted to like having this patch at the beginning, but
unfortunatelly I didn't see the actual file format description,
which was a bit disappointing.  An example of the things that I was
curious about was how the "integer ID" is used to access into the
file.  If we could somehow use "integer ID" as an index into an
array of fixed size elements, it would be ideal to gain "fast
lookups", but because of the "list of parents" thing, it needs some
trickery to do so, and that was among the things that I wanted to
see how much thought went into the design, for example.

> diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt
> new file mode 100644
> index 0000000000..fcc0c83874
> --- /dev/null
> +++ b/Documentation/technical/packed-graph.txt
> @@ -0,0 +1,185 @@
> +Git Packed Graph Design Notes
> +=============================
> +
> +Git walks the commit graph for many reasons, including:
> +
> +1. Listing and filtering commit history.
> +2. Computing merge bases.
> +
> +These operations can become slow as the commit count grows above 100K.
> +The merge base calculation shows up in many user-facing commands, such
> +as 'status' and 'fetch' and can take minutes to compute depending on
> +data shape. There are two main costs here:

s/data shape/history shape/ may make it even clearer.

> +1. The commit OID.
> +2. The list of parents.
> +3. The commit date.
> +4. The root tree OID.
> +5. An integer ID for fast lookups in the graph.
> +6. The generation number (see definition below).
> +
> +Values 1-4 satisfy the requirements of parse_commit_gently().
> +
> +By providing an integer ID we can avoid lookups in the graph as we walk
> +commits. Specifically, we need to provide the integer ID of the parent
> +commits so we navigate directly to their information on request.

Commits created after a packed graph file is built may of course not
appear in a packed graph file, but that is OK because they never need
to be listed as parents of commits in the file.  So "list of parents"
can always refer to the parents using the "integer ID for fast lookup".

Makes sense.  Item 2. might want to say "The list of parents, using
the fast lookup integer ID (see 5.) as reference instead of OID",
though.

> +Define the "generation number" of a commit recursively as follows:
> + * A commit with no parents (a root commit) has generation number 1.
> + * A commit with at least one parent has generation number 1 more than
> +   the largest generation number among its parents.
> +Equivalently, the generation number is one more than the length of a
> +longest path from the commit to a root commit.

When a commit A can reach roots X and Y, and Y is further than X,
the distance between Y and A becomes A's generation number.  "One
more than the length of the path from the commit to the furthest
root commit it can reach", in other words.

> +The recursive definition
> +is easier to use for computation and the following property:
> +
> +    If A and B are commits with generation numbers N and M, respectively,
> +    and N <= M, then A cannot reach B. That is, we know without searching
> +    that B is not an ancestor of A because it is further from a root commit
> +    than A.
> +
> +    Conversely, when checking if A is an ancestor of B, then we only need
> +    to walk commits until all commits on the walk boundary have generation
> +    number at most N. If we walk commits using a priority queue seeded by
> +    generation numbers, then we always expand the boundary commit with highest
> +    generation number and can easily detect the stopping condition.

These are both true.  It would be nice if an optimized walker
algorithm can also deal with history with recent commits for which
we do not yet know the generation numbers (i.e. you first traverse
and assign generation numbers and record in packed graph, then
history grows but we haven't added the new commits to the packed
graph yet).

> +- A graph file is stored in a file named 'graph-<oid>.graph' in the pack
> +  directory. This could be stored in an alternate.

Is that <oid> really an object name?  The <hash> that appears in the
name of a packfile pack-<hash>.pack is *not* an <oid>, and I somehow
suspect that you are doing a similar "use hash of (some) contents to
make it uniquely identify the content", not "information about a
single object that is identified by the <oid>".

> +- The graph file is only a supplemental structure. If a user downgrades
> +  or disables the 'core.graph' config setting, then the existing ODB is
> +  sufficient.

OK, that is exactly what I meant to say in a few paragraphs above
that I wanted to see.

> +Current Limitations
> +-------------------
> +
> +- Only one graph file is used at one time. This allows the integer ID to
> +  seek into the single graph file. It is possible to extend the model
> +  for multiple graph files, but that is currently not part of the design.
> +
> +- .graph files are managed only by the 'graph' builtin. These are not
> +  updated automatically during clone or fetch.

In addition to "clone or fetch", I presume operations that locally
create commits do not automatically create them, right?

> +- After computing and storing generation numbers, we must make graph
> +  walks aware of generation numbers to gain performance benefits. This
> +  will mostly be accomplished by swapping a commit-date-ordered priority
> +  queue with one ordered by generation number. The following operations
> +  are important candidates:
> +
> +    - paint_down_to_common()
> +    - 'log --topo-order'

Do you mean that this round only writes "graph" without any actualy
consumers?  It is somewhat hard to assess the value of what is
stored in the new file without the consumers.

Anyway, thanks for starting this.  Let's read on.

next prev parent reply	other threads:[~2018-01-25 21:14 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-25 14:02 [PATCH 00/14] Serialized Commit Graph Derrick Stolee
2018-01-25 14:02 ` [PATCH 01/14] graph: add packed graph design document Derrick Stolee
2018-01-25 20:04   ` Stefan Beller
2018-01-26 12:49     ` Derrick Stolee
2018-01-26 18:17       ` Stefan Beller
2018-01-25 21:14   ` Junio C Hamano [this message]
2018-01-26 13:06     ` Derrick Stolee
2018-01-26 14:13   ` Duy Nguyen
2018-01-25 14:02 ` [PATCH 02/14] packed-graph: add core.graph setting Derrick Stolee
2018-01-25 20:17   ` Stefan Beller
2018-01-25 20:40     ` Derrick Stolee
2018-01-25 21:43   ` Junio C Hamano
2018-01-26 13:08     ` Derrick Stolee
2018-01-25 14:02 ` [PATCH 03/14] packed-graph: create git-graph builtin Derrick Stolee
2018-01-25 21:45   ` Stefan Beller
2018-01-26 13:13     ` Derrick Stolee
2018-01-25 23:01   ` Junio C Hamano
2018-01-26 13:14     ` Derrick Stolee
2018-01-26 14:16       ` Duy Nguyen
2018-01-25 14:02 ` [PATCH 04/14] packed-graph: add format document Derrick Stolee
2018-01-25 22:06   ` Junio C Hamano
2018-01-25 22:18     ` Stefan Beller
2018-01-25 22:29       ` Junio C Hamano
2018-01-26 13:22         ` Derrick Stolee
2018-01-25 22:07   ` Stefan Beller
2018-01-26 13:25     ` Derrick Stolee
2018-01-25 14:02 ` [PATCH 05/14] packed-graph: implement construct_graph() Derrick Stolee
2018-01-25 23:21   ` Stefan Beller
2018-01-26 20:47     ` Junio C Hamano
2018-01-26 20:55   ` Junio C Hamano
2018-01-26 21:14     ` Andreas Schwab
2018-01-26 22:04       ` Junio C Hamano
2018-01-25 14:02 ` [PATCH 06/14] packed-graph: implement git-graph --write Derrick Stolee
2018-01-25 23:28   ` Stefan Beller
2018-01-26 13:28     ` Derrick Stolee
2018-01-25 14:02 ` [PATCH 07/14] packed-graph: implement git-graph --read Derrick Stolee
2018-01-25 14:02 ` [PATCH 08/14] graph: implement git-graph --update-head Derrick Stolee
2018-01-25 14:02 ` [PATCH 09/14] packed-graph: implement git-graph --clear Derrick Stolee
2018-01-25 23:35   ` Stefan Beller
2018-01-25 14:02 ` [PATCH 10/14] packed-graph: teach git-graph --delete-expired Derrick Stolee
2018-01-25 14:02 ` [PATCH 11/14] commit: integrate packed graph with commit parsing Derrick Stolee
2018-01-26 19:38   ` Stefan Beller
2018-01-25 14:02 ` [PATCH 12/14] packed-graph: read only from specific pack-indexes Derrick Stolee
2018-01-25 14:02 ` [PATCH 13/14] packed-graph: close under reachability Derrick Stolee
2018-01-25 14:02 ` [PATCH 14/14] packed-graph: teach git-graph to read commits Derrick Stolee
2018-01-25 15:46 ` [PATCH 00/14] Serialized Commit Graph Ævar Arnfjörð Bjarmason
2018-01-25 16:09   ` Derrick Stolee
2018-01-25 23:06     ` Ævar Arnfjörð Bjarmason
2018-01-26 12:15       ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqzi518wu3.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.