From: Jakub Narebski <jnareb@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Derrick Stolee <dstolee@microsoft.com>,
"git\@vger.kernel.org" <git@vger.kernel.org>,
peff@peff.net, sbeller@google.com, avarab@gmail.com,
larsxschneider@gmail.com, gitster@pobox.com
Subject: Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph
Date: Wed, 11 Apr 2018 22:41:12 +0200 [thread overview]
Message-ID: <864lkh5vdj.fsf@gmail.com> (raw)
In-Reply-To: <eb861bc6-b9f0-ccaa-7cb6-abdb0d343c3d@gmail.com> (Derrick Stolee's message of "Sat, 7 Apr 2018 21:17:47 -0400")
Derrick Stolee <stolee@gmail.com> writes:
> On 4/7/2018 2:40 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>> [...]
>>> On the Linux repository, performance tests were run for the following
>>> command:
>>>
>>> git log --graph --oneline -1000
>>>
>>> Before: 0.92s
>>> After: 0.66s
>>> Rel %: -28.3%
>>>
>>> Adding '-- kernel/' to the command requires loading the root tree
>>> for every commit that is walked. There was no measureable performance
>>> change as a result of this patch.
>>
>> In the "Git Merge contributor summit notes" [1] one can read that:
>>
>>> - VSTS adds bloom filters to know which paths have changed on the commit
>>> - tree-same check in the bloom filter is fast; speeds up file history checks
>>> - if the file history is _very_ sparse, then bloom filter is useful
>>
>> Could this method speed up also the second case mentioned here? Can
>> anyone explain how this "path-changed bloom filter" works in VSTS?
>>
>
> The idea is simple: for every commit, store a Bloom filter containing
> the list of paths that are not TREESAME against the first parent. (A
> slight detail: have a max cap on the number of paths, and store simply
> "TOO_BIG" for commits with too many diffs.)
Isn't Bloom filter with all bits set to 1 equivalent of "too big"? It
would answer each query with "possibly in set, check it".
> When performing 'git log -- path' queries, the most important detail
> for considering how to advance the walk is whether the commit is
> TREESAME to its first parent. For a deep path in a large repo, this is
> almost always true. When a Bloom filter says "TREESAME" (i.e. "this
> path is not in my set") it is always correct, so we can set the
> treesame bit and continue without walking any trees. When a Bloom
> filter says "MAYBE NOT TREESAME" (i.e. "this path is probably in my
> set") you only need to do the same work as before: walk the trees to
> compare against your first parent.
>
> If a Bloom filter has a false-positive rate of X%, then you can
> possibly drop your number of tree comparisons by (100-X)%. This is
> very important for large repos where some paths were changed only ten
> times or so, the full graph needs to be walked and it is helpful to
> avoid parsing too many trees.
So this works only for exact file pathnames, or does checking for
subdirectory also work? I guess it cannot work for globbing patterns
(like "git log -- '*.c'").
I guess that it speeds up "git log -- <pathname>", but also "git blame
<pathname>".
Sidenote: I have stumbled upon https://github.com/efficient/ffbf project
(fast-forward Bllom filters - for exact matching of large number of
patterns), where they invent cache-efficient version of Bloom filter
algorithm. The paper that the project is based on also might be
interesting, if only because it points to other improvements of classic
Bloom filters.
>> Could we add something like this to the commit-graph file?
>
> I'm not sure if it is necessary for client-side operations, but it is
> one of the reasons the commit-graph file has the idea of an "optional
> chunk". It could be added to the file format (without changing version
> numbers) and be ignored by clients that don't understand it. I could
> also be gated by a config setting for computing them. My guess is that
> only server-side operations will need the added response time, and can
> bear the cost of computing them when writing the commit-graph
> file. Clients are less likely to be patient waiting for a lot of diff
> calculations.
So the problem is performing large amount of diff calculations. I
wonder if it would be possible to add Bloom filters incrementally, when
for some reason we need to calculate diff anyway.
>
> If we add commit-graph file downloads to the protocol, then the server
> could do this computation and send the data to all clients. But that
> would be "secondary" information that maybe clients want to verify,
> which is as difficult as computing it themselves.
Can you tell us how much work (how much time) must spend the server to
calculate those per-commit "files changed" Bloom fillters?
Best,
--
Jakub Narębski
prev parent reply other threads:[~2018-04-11 20:41 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-03 12:00 [PATCH 0/3] Lazy-load trees when reading commit-graph Derrick Stolee
2018-04-03 12:00 ` [PATCH 1/3] commit: create get_commit_tree() method Derrick Stolee
2018-04-03 12:00 ` [PATCH 2/3] treewide: use get_commit_tree() for tree access Derrick Stolee
2018-04-03 12:00 ` [PATCH 3/3] commit-graph: lazy-load trees Derrick Stolee
2018-04-03 18:00 ` Stefan Beller
2018-04-03 18:22 ` Derrick Stolee
2018-04-03 18:37 ` Stefan Beller
2018-04-03 12:15 ` [PATCH 0/3] Lazy-load trees when reading commit-graph Derrick Stolee
2018-04-03 13:06 ` Jeff King
2018-04-03 13:14 ` Derrick Stolee
2018-04-03 20:20 ` Jeff King
2018-04-04 12:08 ` Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 0/4] " Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 1/4] treewide: rename tree to maybe_tree Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 2/4] commit: create get_commit_tree() method Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 3/4] treewide: replace maybe_tree with accessor methods Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 4/4] commit-graph: lazy-load trees for commits Derrick Stolee
2018-04-06 19:21 ` [PATCH v2 0/4] Lazy-load trees when reading commit-graph Jeff King
2018-04-06 19:41 ` Derrick Stolee
2018-04-06 19:45 ` Stefan Beller
2018-04-08 23:18 ` Junio C Hamano
2018-04-09 13:15 ` Derrick Stolee
2018-04-09 17:25 ` Stefan Beller
2018-04-07 18:40 ` Jakub Narebski
2018-04-08 1:17 ` Derrick Stolee
2018-04-11 20:41 ` Jakub Narebski [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=864lkh5vdj.fsf@gmail.com \
--to=jnareb@gmail.com \
--cc=avarab@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=larsxschneider@gmail.com \
--cc=peff@peff.net \
--cc=sbeller@google.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.