From: Felipe Contreras <felipe.contreras@gmail.com>
To: Elijah Newren <newren@gmail.com>, Jeff King <peff@peff.net>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Jonathan Tan" <jonathantanmy@google.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [RFC] Bump {diff,merge}.renameLimit ?
Date: Mon, 12 Jul 2021 15:58:53 -0500
Message-ID: <60ecad0dadf2c_a68ed208e7@natae.notmuch>
In-Reply-To: <CABPp-BEdUmxXVCx=5pb0=LN-0YBtrEB-wngPC5vys6fjVctgaQ@mail.gmail.com>

Elijah Newren wrote:
> On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@peff.net> wrote:

> > > * I think the median file size is a better predictor of rename
> > >   performance than mean file size, and median file size is ~2.5x smaller
> > >   than the mean[18].
> >
> > There you might get hit with the quadratic-update thing again, though.
> > The big files are more likely to be touched, so could be weighted more
> > (though are they more likely to have been added/deleted/renamed? Who
> > knows).
> 
> I'll agree that big files are more likely to be updated, but I don't
> think renames are weighted towards bigger files.  In fact, I wrote a
> quick script to look at the sizes of all the renamed files in the
> history of v2.6.25, and the mean (8034.1) and median (3866) of the
> renamed file sizes in that history are comparable to the mean
> (11150.3) and median (4198) of the file sizes in the v2.6.25 tree.
> 
> I re-did the calculations using v5.5, and found that the mean
> (12495.1) and median (3702) sizes of renames in all linux history up
> to that point again were a bit less than the mean (13449.2) and median
> (3860) file size of a file in the final v5.5 tree.
> 
> Granted, this is a bit hand-wavy (what about creations or deletions?
> Is there too much bias from the fact that I did rename sizes over all
> history (due to needing enough to get statistics) while just grabbing
> regular file sizes just in the end tree?), but I think it provides
> pretty good first order approximation suggesting that mean/median
> sizes of files involved in rename detection will be similar to the
> mean/median sizes of other files within the relevant trees.
> 
> > I don't think file size matters all _that_ much, though, as it has a
> > linear relationship to time spent. Whereas the number of entries is
> > quadratic. And of course the whole experiment is ball-parking in the
> > first place. We're looking for order-of-magnitude approximations, I'd
> > think.
> 
> I agree that the number of entries is what's important; in fact,
> that's why I think the median file size is more important than the
> mean file size:

That is almost always the case (except in unskewed distributions where
the mean is equal to the median).

Another option instead of an opaque configuration like 'renamelimit'
--which is almost entirely arbitrary for most users--would be to have
'renamelevel'. A renamelevel of 5 would be the median, so that's already
more meaningful than any value of renamelimit.

A renamelevel of 9 would be the equivalent of the 9th decile, so that
would catch 90% of renames.
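
As a rough illustration of that mapping (nothing git implements today; the
function name and the sample data are made up), the levels could be derived
from observed rename counts, assuming one count per diff or merge:

```python
import statistics

def level_to_limit(rename_counts, level):
    """Return the decile cut point for a hypothetical 'renamelevel' (1-9).

    rename_counts: observed number of rename candidates per diff or merge
    (illustrative input; where these numbers come from is an assumption).
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    # quantiles(..., n=10) returns the 9 cut points between the deciles.
    deciles = statistics.quantiles(rename_counts, n=10)
    return deciles[level - 1]

# Made-up sample: one observation for each count from 1 to 99.
sample = list(range(1, 100))
print(level_to_limit(sample, 5))  # the median of the sample
print(level_to_limit(sample, 9))  # the 9th decile of the sample
```

The table of nine values produced this way is the kind of thing that could
be hard-coded.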

If the distribution follows a Pareto distribution (which is often the
case), the formula to calculate the different deciles is trivial, but it
would also be possible to hard-code all the different levels.
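
For the Pareto case, a minimal sketch: the quantile function of a Pareto
(Type I) distribution is x_min * (1 - p)^(-1/alpha), so level k corresponds
to x_min * (10 / (10 - k))^(1/alpha). The names below are hypothetical, and
x_min and alpha are placeholders that would have to be fitted to real
rename counts.

```python
def pareto_level_to_limit(level, x_min, alpha):
    """Decile of a Pareto (Type I) distribution for a 'renamelevel' of 1-9.

    x_min: scale parameter (smallest possible value), assumed > 0.
    alpha: shape parameter, assumed > 0.
    Both are placeholders here, not values measured from any repository.
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    if x_min <= 0 or alpha <= 0:
        raise ValueError("x_min and alpha must be positive")
    # Quantile at p = level / 10, rewritten to avoid computing 1 - p
    # in floating point: (1 - p)^(-1/alpha) == (10 / (10 - level))^(1/alpha)
    return x_min * (10 / (10 - level)) ** (1 / alpha)

# Placeholder parameters, purely for illustration.
for level in range(1, 10):
    print(level, pareto_level_to_limit(level, x_min=1.0, alpha=1.0))
```

The result is a float; how to round it into an integer limit is left open.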

-- 
Felipe Contreras
