git.vger.kernel.org archive mirror
From: Derrick Stolee <stolee@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>,
	Derrick Stolee <dstolee@microsoft.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	peff@peff.net, sbeller@google.com, avarab@gmail.com,
	larsxschneider@gmail.com, gitster@pobox.com
Subject: Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph
Date: Sat, 7 Apr 2018 21:17:47 -0400
Message-ID: <eb861bc6-b9f0-ccaa-7cb6-abdb0d343c3d@gmail.com>
In-Reply-To: <86y3hyeu6c.fsf@gmail.com>

On 4/7/2018 2:40 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
> [...]
>> On the Linux repository, performance tests were run for the following
>> command:
>>
>>      git log --graph --oneline -1000
>>
>>      Before: 0.92s
>>      After:  0.66s
>>      Rel %: -28.3%
>>
>> Adding '-- kernel/' to the command requires loading the root tree
>> for every commit that is walked. There was no measurable performance
>> change as a result of this patch.
> In the "Git Merge contributor summit notes" [1] one can read that:
>
>> - VSTS adds bloom filters to know which paths have changed on the commit
>> - tree-same check in the bloom filter is fast; speeds up file history checks
>> - if the file history is _very_ sparse, then bloom filter is useful
> Could this method speed up also the second case mentioned here?  Can
> anyone explain how this "path-changed bloom filter" works in VSTS?
>    

The idea is simple: for every commit, store a Bloom filter encoding 
the set of paths that are not TREESAME against the first parent. (One 
detail: cap the number of paths, and simply store "TOO_BIG" for 
commits with too many diffs.)

When performing 'git log -- path' queries, the most important detail for 
considering how to advance the walk is whether the commit is TREESAME to 
its first parent. For a deep path in a large repo, this is almost always 
true. When a Bloom filter says "TREESAME" (i.e. "this path is not in my 
set") it is always correct, so we can set the treesame bit and continue 
without walking any trees. When a Bloom filter says "MAYBE NOT TREESAME" 
(i.e. "this path is probably in my set") you only need to do the same 
work as before: walk the trees to compare against your first parent.
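
A minimal Python sketch of the scheme described above, not the VSTS
implementation. The filter size, the number of hashes, the use of SHA-1
for hashing, the cap value, and all function names are assumptions made
for illustration; compare_trees stands in for the existing tree diff
against the first parent.

```python
import hashlib

MAX_CHANGED_PATHS = 512  # hypothetical cap; the email gives no number
TOO_BIG = None           # sentinel stored for commits with too many diffs


class BloomFilter:
    """Tiny Bloom filter; the sizing here is illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, path):
        # Derive num_hashes bit positions from salted SHA-1 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{path}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, path):
        for pos in self._positions(path):
            self.bits |= 1 << pos

    def maybe_contains(self, path):
        # False means "definitely not in the set"; True means "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(path))


def build_filter(changed_paths):
    """Encode the paths that differ from the first parent, or TOO_BIG.

    The caller is assumed to pass every changed path, including the
    directories that contain changed files.
    """
    if len(changed_paths) > MAX_CHANGED_PATHS:
        return TOO_BIG
    bloom = BloomFilter()
    for path in changed_paths:
        bloom.add(path)
    return bloom


def is_treesame(bloom, path, compare_trees):
    """Return True if the commit is TREESAME to its first parent at path."""
    if bloom is not TOO_BIG and not bloom.maybe_contains(path):
        return True  # definite answer, no trees are walked
    # "Maybe not TREESAME" or TOO_BIG: do the same work as before.
    return compare_trees(path)
```

Because a Bloom filter never reports a false negative, the early return
is always correct; only the "maybe" answers cost a tree comparison.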

If a Bloom filter has a false-positive rate of X%, then you can reduce 
the number of tree comparisons by up to (100-X)%. This is very 
important for large repos where some paths were changed only ten times 
or so: the full graph still needs to be walked, and it is helpful to 
avoid parsing too many trees.
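
To make the arithmetic concrete, here is a small Python sketch. The
commit count and false-positive rate are made-up numbers, the function
name is hypothetical, and the model ignores commits stored as "TOO_BIG".

```python
def expected_tree_comparisons(commits_walked, commits_changing_path, fp_rate):
    """Expected tree comparisons when every commit has a Bloom filter.

    Commits that really changed the path always need a comparison;
    the rest need one only on a false positive.
    """
    unchanged = commits_walked - commits_changing_path
    return commits_changing_path + fp_rate * unchanged


# Hypothetical repo: 700,000 commits walked, 10 of them touch the path,
# and the filters have a 2% false-positive rate.
without_filter = 700_000
with_filter = expected_tree_comparisons(700_000, 10, 0.02)
print(f"{with_filter:.0f} comparisons instead of {without_filter}")
print(f"reduction: {1 - with_filter / without_filter:.1%}")
```

With these numbers the walk does roughly 14,010 comparisons instead of
700,000, a reduction of about 98%, which matches the (100-X)% estimate
when the path changes rarely.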


> Could we add something like this to the commit-graph file? 

I'm not sure if it is necessary for client-side operations, but it is 
one of the reasons the commit-graph file has the idea of an "optional 
chunk". It could be added to the file format (without changing version 
numbers) and be ignored by clients that don't understand it. Computing 
the filters could also be gated by a config setting. My guess is that 
only server-side operations will need the improved response time, and 
can bear the cost of computing them when writing the commit-graph file. 
Clients are less likely to be patient waiting for a lot of diff 
calculations.
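
The "optional chunk" idea can be sketched in Python as follows. This is
a deliberately simplified container (4-byte ID, 4-byte length, payload),
not the actual commit-graph layout, and the chunk IDs are hypothetical.

```python
import struct


def write_chunks(chunks):
    """Serialize (chunk_id, payload) pairs; chunk_id is 4 bytes."""
    out = bytearray()
    for chunk_id, payload in chunks:
        out += chunk_id + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_chunks(data, known_ids):
    """Return payloads for known chunk IDs and skip everything else."""
    found = {}
    offset = 0
    while offset < len(data):
        chunk_id = data[offset:offset + 4]
        (length,) = struct.unpack_from(">I", data, offset + 4)
        if chunk_id in known_ids:
            found[chunk_id] = data[offset + 8:offset + 8 + length]
        # Unknown chunks are stepped over without raising an error.
        offset += 8 + length
    return found


# b"CMTS" and b"BLMF" are made-up IDs for commit data and Bloom filters.
data = write_chunks([(b"CMTS", b"abc"), (b"BLMF", b"\x01\x02")])
old_client = read_chunks(data, {b"CMTS"})
new_client = read_chunks(data, {b"CMTS", b"BLMF"})
```

An older reader that only knows the first ID still parses the file,
which is why a new chunk would not require a version bump.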

If we add commit-graph file downloads to the protocol, then the server 
could do this computation and send the data to all clients. But that 
would be "secondary" information that clients may want to verify, 
which is as difficult as computing it themselves.

Thanks,

-Stolee


