From: Abhishek Kumar <abhishekkumar8222@gmail.com>
To: "Jakub Narębski" <jnareb@gmail.com>
Cc: git@vger.kernel.org, stolee@gmail.com, abhishekkumar8222@gmail.com
Subject: Re: [GSOC] Blog about weeks 4, 5
Date: Tue, 14 Jul 2020 11:53:39 +0530 [thread overview]
Message-ID: <20200714062339.GA10242@Abhishek-Arch> (raw)
In-Reply-To: <85imerqj7g.fsf@gmail.com>
On Mon, Jul 13, 2020 at 10:00:03PM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>
> > Hello everyone!
> >
> > Over the last two weeks, I have worked on refining the performance
> > report on generation numbers. Here are our conclusions:
> >
> > - Corrected Commit Dates With Monotonically Offset (i.e. generation
> > number v5) performs better than topological levels but is still walks
> > too many commits when compared with Corrected Commit Dates.
>
> Thank you for your work examining different approaches to introducing
> generation number v2.
>
> > Number of commits walked (git merge-base v4.8 v4.9, on linux repository):
> >
> > Topological Level : 635579
> > Corrected Commit Date : 167468
> > Corrected Commit Date With Monotonic Offset: 506577
>
> It is a bit strange that requiring monotonic offsets leads to so much
> of a difference in performance (in commits walked).
>
> >
> > As such, I am expecting that we will store Corrected Commit Date in an
> > additional chunk (called "generation data chunk") and store topological
> > levels into CDAT. Thus, old Git clients can operate as expected, with
> > new Git clients using the better generation number.
> >
> > - Using a new chunk does affect the locality of reference but did not
> > impact the performance appreciably.
> > - This does increase the size of commit graph file by nearly 5%.
>
> All right, it seems like it is the way to go.
>
> > You can read more in my report [1] and the pull request with
> > instructions to replicate the results [2].
> >
> > [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/T/#mda33f6e13873df55901768e8fd6d774282002146
> > [2]: https://github.com/abhishekkumar2718/git/pull/1
> >
> > I talk a bit more about a patch I worked on, trying to improve
> > performance of commit graph write using buffers which ultimately did not
> > work and is dropped. Up next is actually implementing the generation
> > number and take care of all little details.
> >
> > https://abhishekkumar2718.github.io/programming/2020/07/05/gsoc-weeks-4-5.html
> >
> > Feedback and suggestions welcome!
>
> Some comments about the blog entry contents:
>
> AK> Dr. Stolee pointed out ... [to] use the number of commits as a
> AK> metric instead of wall clock timing (which can be influenced by other
> AK> factors like CPU usage at the time).
>
> There are a few factors. If we compare similar algorithms, that might
> be a good decision.
>
> First, one can try to reduce the influence of random factors on the wall
> clock timing by using statistics. For example one can try to detect and
> remove outliers by using robust statistics measures to detect them, like
> tools like for example Dumbbench [3], hyperfine [4] or bench [5]. After
> warmup, one approach is to compute the robust estimate of value, e.g.
> median, and robust estimate of dispersion, e.g. MAD = median absolute
> deviation, and use those to detect outliers, e.g. rescale MAD and mark
> as outlier and remove entries that are more than "three sigma" of robust
> dispersion away from robust estimate of value. Dumbbench [3] has good
> explanation.
>
> [3]: https://metacpan.org/pod/Dumbbench#HOW-IT-WORKS-AND-WHY-IT-DOESN'T
> [4]: https://github.com/sharkdp/hyperfine
> [5]: https://github.com/Gabriel439/bench
That's interesting. When you think about it, medians are a better
measure than average because medians are robust to the outliers.
>
> Second, because of pecularities of current processor architecture
> (caches, data prefetching, branch prediction) performing more operations
> might in admittedly rare cases be faster than doing less operations. One
> such example can be found in the CppCon 2019 talk by Andrei Alexandrescu
> "Speed Is Found In The Minds of People" [6][7] about 'small sort', where
> doing more operations results in, on average, faster sort. This of
> course has a possibility to happen only if difference with the number of
> operations is small enough... nevertheless it might be a good idea to at
> least check that the wall clock time agrees with conclusions from the
> number of commits walked, for at least a few examples.
>
> [6]: https://www.youtube.com/watch?v=FJJTYQYB1JQ
> [7]: https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf
>
> AK> With the second report, storing corrected commit date in GDAT as
> AK> well as computing topological levels seems like a no-brainer. I have
> AK> started working on the patch and will push to the mailing list after
> AK> some discussion on the report.
>
> Do you have any numbers how much does providing backward compatibility
> cost at `git commit-graph write`, that is how much more time it takes to
> computer topological levels during computation of corrected
> committerdate compared to storing GENERATION_NUMBER_MAX in place of
> topological level, and whether having topological level (as tie-breaker)
> helps with Git performance when using commit-graphh for querying? Does
> having topological levels as tie-breaker or secondary negative-cut
> reachability index helps at all?
>
We do have timings comparing the time to compute topological levels as
compared to storing GENERATION_NUMBER_MAX in place [1]:
Writing GENERATION_NUMBER_MAX to commit-graph: 14.175s
Writing topological levels to commit-graph: 14.331s
That's around 160ms and 1% percent faster.
I do think there's a case to be made for GENERATION_NUMBER_MAX because
the performance degradation for old Git would help in faster adoption
(Junio was in favor of this, the last time we discussed alternatives [2]).
It is a double-edged sword as we force people who cannot upgrade git
into worse performance.
I do not have anything for using topological level as a tie-breaker.
Will benchmark and get back to you.
[1]: https://lore.kernel.org/git/20200630150056.GA4111@Abhishek-Arch/
[2]: https://lore.kernel.org/git/xmqq8sjp1mnz.fsf@gitster.c.googlers.com/
>
> Thank you for your work and for the report.
>
> P.S. Would it be possible to put GSoC entries into separate 'GSoC'
> category instead of generic 'Programming' one, or add a 'GSoC' tag?
>
Great idea! Try this out: https://abhishekkumar2718.github.io/gsoc/
> Best,
> --
> Jakub Narębski
prev parent reply other threads:[~2020-07-14 6:25 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20200706182213.GA51227@Abhishek-Arch>
2020-07-07 2:24 ` [GSOC] Blog about weeks 4, 5 Abhishek Kumar
2020-07-13 20:00 ` Jakub Narębski
2020-07-14 6:23 ` Abhishek Kumar [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200714062339.GA10242@Abhishek-Arch \
--to=abhishekkumar8222@gmail.com \
--cc=85imerqj7g.fsf@gmail.com \
--cc=git@vger.kernel.org \
--cc=jnareb@gmail.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).