From: Jakub Narebski <jnareb@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Derrick Stolee <dstolee@microsoft.com>,
"git\@vger.kernel.org" <git@vger.kernel.org>,
peff@peff.net, sbeller@google.com, avarab@gmail.com,
larsxschneider@gmail.com, gitster@pobox.com
Subject: Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph
Date: Wed, 11 Apr 2018 22:41:12 +0200 [thread overview]
Message-ID: <864lkh5vdj.fsf@gmail.com> (raw)
In-Reply-To: <eb861bc6-b9f0-ccaa-7cb6-abdb0d343c3d@gmail.com> (Derrick Stolee's message of "Sat, 7 Apr 2018 21:17:47 -0400")
Derrick Stolee <stolee@gmail.com> writes:
> On 4/7/2018 2:40 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>> [...]
>>> On the Linux repository, performance tests were run for the following
>>> command:
>>>
>>> git log --graph --oneline -1000
>>>
>>> Before: 0.92s
>>> After: 0.66s
>>> Rel %: -28.3%
>>>
>>> Adding '-- kernel/' to the command requires loading the root tree
>>> for every commit that is walked. There was no measureable performance
>>> change as a result of this patch.
>>
>> In the "Git Merge contributor summit notes" [1] one can read that:
>>
>>> - VSTS adds bloom filters to know which paths have changed on the commit
>>> - tree-same check in the bloom filter is fast; speeds up file history checks
>>> - if the file history is _very_ sparse, then bloom filter is useful
>>
>> Could this method speed up also the second case mentioned here? Can
>> anyone explain how this "path-changed bloom filter" works in VSTS?
>>
>
> The idea is simple: for every commit, store a Bloom filter containing
> the list of paths that are not TREESAME against the first parent. (A
> slight detail: have a max cap on the number of paths, and store simply
> "TOO_BIG" for commits with too many diffs.)
Isn't Bloom filter with all bits set to 1 equivalent of "too big"? It
would answer each query with "possibly in set, check it".
> When performing 'git log -- path' queries, the most important detail
> for considering how to advance the walk is whether the commit is
> TREESAME to its first parent. For a deep path in a large repo, this is
> almost always true. When a Bloom filter says "TREESAME" (i.e. "this
> path is not in my set") it is always correct, so we can set the
> treesame bit and continue without walking any trees. When a Bloom
> filter says "MAYBE NOT TREESAME" (i.e. "this path is probably in my
> set") you only need to do the same work as before: walk the trees to
> compare against your first parent.
>
> If a Bloom filter has a false-positive rate of X%, then you can
> possibly drop your number of tree comparisons by (100-X)%. This is
> very important for large repos where some paths were changed only ten
> times or so, the full graph needs to be walked and it is helpful to
> avoid parsing too many trees.
So this works only for exact file pathnames, or does checking for
subdirectory also work? I guess it cannot work for globbing patterns
(like "git log -- '*.c'").
I guess that it speeds up "git log -- <pathname>", but also "git blame
<pathname>".
Sidenote: I have stumbled upon https://github.com/efficient/ffbf project
(fast-forward Bllom filters - for exact matching of large number of
patterns), where they invent cache-efficient version of Bloom filter
algorithm. The paper that the project is based on also might be
interesting, if only because it points to other improvements of classic
Bloom filters.
>> Could we add something like this to the commit-graph file?
>
> I'm not sure if it is necessary for client-side operations, but it is
> one of the reasons the commit-graph file has the idea of an "optional
> chunk". It could be added to the file format (without changing version
> numbers) and be ignored by clients that don't understand it. I could
> also be gated by a config setting for computing them. My guess is that
> only server-side operations will need the added response time, and can
> bear the cost of computing them when writing the commit-graph
> file. Clients are less likely to be patient waiting for a lot of diff
> calculations.
So the problem is performing large amount of diff calculations. I
wonder if it would be possible to add Bloom filters incrementally, when
for some reason we need to calculate diff anyway.
>
> If we add commit-graph file downloads to the protocol, then the server
> could do this computation and send the data to all clients. But that
> would be "secondary" information that maybe clients want to verify,
> which is as difficult as computing it themselves.
Can you tell us how much work (how much time) must spend the server to
calculate those per-commit "files changed" Bloom fillters?
Best,
--
Jakub Narębski
prev parent reply other threads:[~2018-04-11 20:41 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-03 12:00 [PATCH 0/3] Lazy-load trees when reading commit-graph Derrick Stolee
2018-04-03 12:00 ` [PATCH 1/3] commit: create get_commit_tree() method Derrick Stolee
2018-04-03 12:00 ` [PATCH 2/3] treewide: use get_commit_tree() for tree access Derrick Stolee
2018-04-03 12:00 ` [PATCH 3/3] commit-graph: lazy-load trees Derrick Stolee
2018-04-03 18:00 ` Stefan Beller
2018-04-03 18:22 ` Derrick Stolee
2018-04-03 18:37 ` Stefan Beller
2018-04-03 12:15 ` [PATCH 0/3] Lazy-load trees when reading commit-graph Derrick Stolee
2018-04-03 13:06 ` Jeff King
2018-04-03 13:14 ` Derrick Stolee
2018-04-03 20:20 ` Jeff King
2018-04-04 12:08 ` Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 0/4] " Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 1/4] treewide: rename tree to maybe_tree Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 2/4] commit: create get_commit_tree() method Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 3/4] treewide: replace maybe_tree with accessor methods Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 4/4] commit-graph: lazy-load trees for commits Derrick Stolee
2018-04-06 19:21 ` [PATCH v2 0/4] Lazy-load trees when reading commit-graph Jeff King
2018-04-06 19:41 ` Derrick Stolee
2018-04-06 19:45 ` Stefan Beller
2018-04-08 23:18 ` Junio C Hamano
2018-04-09 13:15 ` Derrick Stolee
2018-04-09 17:25 ` Stefan Beller
2018-04-07 18:40 ` Jakub Narebski
2018-04-08 1:17 ` Derrick Stolee
2018-04-11 20:41 ` Jakub Narebski [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=864lkh5vdj.fsf@gmail.com \
--to=jnareb@gmail.com \
--cc=avarab@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=larsxschneider@gmail.com \
--cc=peff@peff.net \
--cc=sbeller@google.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).