Re: [PATCH] tree-diff: avoid alloca for large allocations

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Kirill Smelkov <kirr@mns.spb.ru>
Subject: Re: [PATCH] tree-diff: avoid alloca for large allocations
Date: Tue, 7 Jun 2016 21:36:56 -0400	[thread overview]
Message-ID: <20160608013656.GA7277@sigill.intra.peff.net> (raw)
In-Reply-To: <xmqq8tygo6s4.fsf@gitster.mtv.corp.google.com>

On Tue, Jun 07, 2016 at 05:36:59PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > An alternative to this would be implement something like:
> >
> >   struct tree *tp, tp_fallback[2];
> >   if (nparent <= ARRAY_SIZE(tp_fallback))
> >           tp = tp_fallback;
> >   else
> > 	  ALLOC_ARRAY(tp, nparent);
> >   ...
> >   if (tp != tp_fallback)
> > 	  free(tp);
> >
> > That would let us drop our xalloca() portability code
> > entirely. But in my measurements, this seemed to perform
> > slightly worse than the xalloca solution.
> 
> It indeed is curious why this "obvious" alternative performed
> worse.

Yeah. I'd be happy if somebody wanted to prove me wrong here. It's
entirely possible I just did something stupid. I had wrapped up the
above in the FAST_ARRAY_ALLOC macro. It would waste the stack space when
we _did_ have to call malloc, but that should almost never happen in my
benchmark. It also allocates an extra slot on the stack for non-merge
commits. So I guess the wasted 24 bytes or whatever could have in
impact. It's also possible that the extra variable simply tickles the
optimizer in some way; I didn't look at the generated asm.

FWIW, here are the timings I had come up with for running "git log --raw
--no-abbrev --no-renames" on linux.git (the same benchmark from the
commit that originally added the alloca). The "time" output at the end
is best-of-five, with the "attempt" lines showing the wall-clock time
for each:

  [original, with xalloca]
  Attempt 1: 30.669
  Attempt 2: 30.782
  Attempt 3: 31.807
  Attempt 4: 31.152
  Attempt 5: 30.243

  real    0m30.243s
  user    0m30.112s
  sys     0m0.128s

  [xmalloc/free instead of xalloca]
  Attempt 1: 31.306
  Attempt 2: 31.814
  Attempt 3: 31.138
  Attempt 4: 31.787
  Attempt 5: 32.23

  real    0m31.138s
  user    0m30.976s
  sys     0m0.160s

  [local var with fallback to xmalloc]
  Attempt 1: 30.926
  Attempt 2: 31.466
  Attempt 3: 31.865
  Attempt 4: 31.345
  Attempt 5: 33.159

  real    0m30.926s
  user    0m30.804s
  sys     0m0.120s

So the improvement here over even a naive malloc/free is _really_ small.
You'll note that the best-of-five for xmalloc is actually smaller than
the worst-case with xalloca. But the mean is still 2% better. The mean
for the "fallback" case aren't really any better (which makes me wonder
if I was somehow accidentally calling malloc each time).

So I dunno. For 2%, I was tempted to simply say "screw it, let's just
forget this micro-optimization and call xmalloc". This patch is the
conservative choice in that it has the same performance profile as the
original. Or so I thought. I didn't record the old timings, but they
looked similar to the original. I just recreated them while writing this
email and got:

  [xalloca with fallback to xmalloc]
  Attempt 1: 31.356
  Attempt 2: 31.725
  Attempt 3: 31.454
  Attempt 4: 30.898
  Attempt 5: 31.937

  real    0m30.898s
  user    0m30.752s
  sys     0m0.144s

which really isn't much better than the local-var case. I wonder if just
having the conditional stack/heap is what kills us (either itself, or
because it affects the optimizer).

I dunno.

> > +#define FAST_ARRAY_ALLOC(x, nr) do { \
> > +	if ((nr) <= 2) \
> > +		(x) = xalloca((nr) * sizeof(*(x))); \
> > +	else \
> > +		ALLOC_ARRAY((x), nr); \
> > +} while(0)
> > +#define FAST_ARRAY_FREE(x, nr) do { \
> > +	if ((nr) > 2) \
> > +		free((x)); \
> > +} while(0)
> 
> An obvious and clean implementation of the idea.
> 
> The only slight worry I have is that nr must stay constant until the
> time the caller calls FREE(), but for the only three callsite pairs
> it is clear nparent will stay constant and this is local to the
> file.

Yep, I had the same worry. I think it's OK because the damage is limited
to tree-diff.c. I'm not sure I would promote this macro for general use.

-Peff

     prev parent reply	other threads:[~2016-06-08  1:37 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-07 22:53 [PATCH] tree-diff: avoid alloca for large allocations Jeff King
2016-06-08  0:36 ` Junio C Hamano
2016-06-08  1:36   ` Jeff King [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160608013656.GA7277@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=kirr@mns.spb.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).