Re: [GSOC] Blog about weeks 4, 5

From: Abhishek Kumar <abhishekkumar8222@gmail.com>
To: "Jakub Narębski" <jnareb@gmail.com>
Cc: git@vger.kernel.org, stolee@gmail.com, abhishekkumar8222@gmail.com
Subject: Re: [GSOC] Blog about weeks 4, 5
Date: Tue, 14 Jul 2020 11:53:39 +0530
Message-ID: <20200714062339.GA10242@Abhishek-Arch>
In-Reply-To: <85imerqj7g.fsf@gmail.com>

On Mon, Jul 13, 2020 at 10:00:03PM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> 
> > Hello everyone!
> >
> > Over the last two weeks, I have worked on refining the performance
> > report on generation numbers. Here are our conclusions:
> >
> > - Corrected Commit Date With Monotonic Offset (i.e. generation
> >   number v5) performs better than topological levels but still walks
> >   too many commits when compared with Corrected Commit Date.
> 
> Thank you for your work examining different approaches to introducing
> generation number v2.
> 
> > Number of commits walked (git merge-base v4.8 v4.9, on linux repository):
> >
> > Topological Level                          : 635579
> > Corrected Commit Date                      : 167468
> > Corrected Commit Date With Monotonic Offset: 506577
> 
> It is a bit strange that requiring monotonic offsets leads to so much
> of a difference in performance (in commits walked).
> 
> >
> > As such, I am expecting that we will store Corrected Commit Date in an
> > additional chunk (called "generation data chunk") and store topological
> > levels into CDAT. Thus, old Git clients can operate as expected, with
> > new Git clients using the better generation number.
> >
> > - Using a new chunk does affect the locality of reference but did not
> >   impact the performance appreciably.
> > - This does increase the size of commit graph file by nearly 5%.
> 
> All right, it seems like it is the way to go.
> 
> > You can read more in my report [1] and the pull request with
> > instructions to replicate the results [2].
> >
> > [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/T/#mda33f6e13873df55901768e8fd6d774282002146
> > [2]: https://github.com/abhishekkumar2718/git/pull/1
> >
> > I talk a bit more about a patch I worked on, trying to improve
> > performance of commit graph write using buffers, which ultimately did
> > not work and was dropped. Up next is actually implementing the
> > generation number and taking care of all the little details.
> >
> > https://abhishekkumar2718.github.io/programming/2020/07/05/gsoc-weeks-4-5.html
> >
> > Feedback and suggestions welcome!
> 
> Some comments about the blog entry contents:
> 
> AK> Dr. Stolee pointed out ... [to] use the number of commits as a
> AK> metric instead of wall clock timing (which can be influenced by other
> AK> factors like CPU usage at the time).
> 
> There are a few factors.  If we compare similar algorithms, that might
> be a good decision.
> 
> First, one can try to reduce the influence of random factors on the wall
> clock timing by using statistics.  For example, one can detect and
> remove outliers using robust statistical measures, in the manner of
> tools like Dumbbench [3], hyperfine [4] or bench [5].  After warmup,
> one approach is to compute a robust estimate of the value, e.g. the
> median, and a robust estimate of dispersion, e.g. MAD (median absolute
> deviation), then rescale the MAD and remove as outliers the entries
> that lie more than "three sigma" of robust dispersion away from the
> robust estimate of the value.  Dumbbench [3] has a good explanation.
> 
> [3]: https://metacpan.org/pod/Dumbbench#HOW-IT-WORKS-AND-WHY-IT-DOESN'T
> [4]: https://github.com/sharkdp/hyperfine
> [5]: https://github.com/Gabriel439/bench

That's interesting. When you think about it, the median is a better
measure than the mean because it is robust to outliers.
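
To check my understanding, here is a rough Python sketch of the
median/MAD filtering you describe. It is not what Dumbbench does
internally. The function name, the 1.4826 rescaling constant (the usual
factor for normally distributed data) and the sample timings are my own
assumptions for illustration, not measurements.

```python
import statistics

def remove_outliers(samples, n_sigmas=3.0):
    """Drop samples more than n_sigmas robust sigmas away from the median."""
    med = statistics.median(samples)
    # MAD: median of the absolute deviations from the median.
    mad = statistics.median([abs(x - med) for x in samples])
    # Rescale MAD so it estimates the standard deviation for normal data.
    robust_sigma = 1.4826 * mad
    if robust_sigma == 0:
        # More than half of the samples are identical; nothing to rescale.
        return list(samples)
    return [x for x in samples if abs(x - med) <= n_sigmas * robust_sigma]

# Hypothetical wall clock timings in seconds, with one disturbed run.
timings = [14.18, 14.21, 14.17, 14.25, 14.20, 19.75]
kept = remove_outliers(timings)
print(kept)
print(statistics.median(kept))
```

The single disturbed run shifts the mean noticeably but barely moves the
median, which is why the median is used as the centre here.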

> 
> Second, because of peculiarities of current processor architecture
> (caches, data prefetching, branch prediction) performing more operations
> might in admittedly rare cases be faster than doing fewer operations. One
> such example can be found in the CppCon 2019 talk by Andrei Alexandrescu
> "Speed Is Found In The Minds of People" [6][7] about 'small sort', where
> doing more operations results in, on average, faster sort.  This of
> course can happen only if the difference in the number of operations is
> small enough... nevertheless it might be a good idea to at least check
> that the wall clock time agrees with conclusions from the number of
> commits walked, for at least a few examples.
> 
> [6]: https://www.youtube.com/watch?v=FJJTYQYB1JQ
> [7]: https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf
> 
> AK> With the second report, storing corrected commit date in GDAT as
> AK> well as computing topological levels seems like a no-brainer. I have
> AK> started working on the patch and will push to the mailing list after
> AK> some discussion on the report.
> 
> Do you have any numbers on how much providing backward compatibility
> costs at `git commit-graph write`, that is, how much more time it takes
> to compute topological levels during computation of corrected
> committerdate compared to storing GENERATION_NUMBER_MAX in place of
> topological level, and whether having topological level (as tie-breaker)
> helps with Git performance when using commit-graph for querying?  Does
> having topological levels as tie-breaker or secondary negative-cut
> reachability index help at all?
> 
We do have timings comparing computing topological levels against
storing GENERATION_NUMBER_MAX in their place [1]:

Writing GENERATION_NUMBER_MAX to commit-graph: 14.175s
Writing topological levels to commit-graph:    14.331s

That's a difference of around 160ms, or about 1%, in favor of writing
GENERATION_NUMBER_MAX.
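
For reference, the arithmetic behind that figure, as a small Python
snippet. The variable names are mine, and the overhead is taken relative
to the GENERATION_NUMBER_MAX timing.

```python
# Timings in seconds, taken from the two lines above.
t_generation_max = 14.175
t_topological = 14.331

diff_ms = (t_topological - t_generation_max) * 1000
overhead_pct = 100 * (t_topological - t_generation_max) / t_generation_max

print(f"{diff_ms:.0f} ms slower, {overhead_pct:.1f}% overhead")
# prints "156 ms slower, 1.1% overhead"
```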

I do think there's a case to be made for GENERATION_NUMBER_MAX because
the performance degradation for old Git would encourage faster adoption
(Junio was in favor of this the last time we discussed alternatives [2]).
It is a double-edged sword, though, as we would force people who cannot
upgrade Git into worse performance.

I do not have anything for using topological level as a tie-breaker.
Will benchmark and get back to you.

[1]: https://lore.kernel.org/git/20200630150056.GA4111@Abhishek-Arch/
[2]: https://lore.kernel.org/git/xmqq8sjp1mnz.fsf@gitster.c.googlers.com/

> 
> Thank you for your work and for the report.
> 
> P.S. Would it be possible to put GSoC entries into separate 'GSoC'
> category instead of generic 'Programming' one, or add a 'GSoC' tag?
> 

Great idea! Try this out: https://abhishekkumar2718.github.io/gsoc/

> Best,
> --
> Jakub Narębski
