Re: [PATCH 2/2] util: add util function buffer_zero_avx512()

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Robert Hoo <robert.hu@linux.intel.com>
To: Richard Henderson <richard.henderson@linaro.org>,
	qemu-devel@nongnu.org, pbonzini@redhat.com, laurent@vivier.eu,
	philmd@redhat.com, berrange@redhat.com
Cc: robert.hu@intel.com
Subject: Re: [PATCH 2/2] util: add util function buffer_zero_avx512()
Date: Mon, 24 Feb 2020 15:07:46 +0800	[thread overview]
Message-ID: <160d077042713fc46b36650946712a43e6e89b51.camel@linux.intel.com> (raw)
In-Reply-To: <ee2ef44a-737b-e989-7f20-18a69e19d430@linaro.org>

Thanks Richard:-)
Sorry for late reply.
On Thu, 2020-02-13 at 10:20 -0800, Richard Henderson wrote:
> On 2/12/20 11:52 PM, Robert Hoo wrote:
> > And initialize buffer_is_zero() with it, when Intel AVX512F is
> > available on host.
> > 
> > This function utilizes Intel AVX512 fundamental instructions which
> > perform over previous AVX2 instructions.
> 
> Is it not still true that any AVX512 insn will cause the entire cpu
> package,
> not just the current core, to drop frequency by 20%?
> 
> As far as I know one should only use the 512-bit instructions when
> you can
> overcome that frequency drop, which seems unlikely in this
> case.  That said...
> I don't think so. AVX512 has been applied in various places.
> > +    if (unlikely(len < 64)) { /*buff less than 512 bits,
> > unlikely*/
> > +        return buffer_zero_int(buf, len);
> > +    }
> 
> First, len < 64 has been eliminated already in select_accel_fn.
> Second, len < 256 is not handled properly by the code below...
> 
Right. I'm going to fix this in v2.
> 
> > +    /* Begin with an unaligned head of 64 bytes.  */
> > +    t = _mm512_loadu_si512(buf);
> > +    p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
> > +    e = (__m512i *)(((uintptr_t)buf + len) & -64);
> > +
> > +    /* Loop over 64-byte aligned blocks of 256.  */
> > +    while (p < e) {
> > +        __builtin_prefetch(p);
> > +        if (unlikely(_mm512_test_epi64_mask(t, t))) {
> > +            return false;
> > +        }
> > +        t = p[-4] | p[-3] | p[-2] | p[-1];
> > +        p += 4;
> > +    }
> > +
> > +    t |= _mm512_loadu_si512(buf + len - 4 * 64);
> > +    t |= _mm512_loadu_si512(buf + len - 3 * 64);
> > +    t |= _mm512_loadu_si512(buf + len - 2 * 64);
> > +    t |= _mm512_loadu_si512(buf + len - 1 * 64);
> 
> ... because this final sequence loads 256 bytes.
> 
> Rather than make a second test vs 256 in buffer_zero_avx512, I wonder
> if it
> would be better to have select_accel_fn do the job.  Have a global
> variable
> buffer_accel_size alongside buffer_accel so there's only one branch
> (mis)predict to worry about.
> 
Thanks Richard, very enlightening!
Inspired by your suggestion, I'm thinking go further: use immediate
rather than a global variable, so that saves 1 memory(/cache) access. 

#ifdef CONFIG_AVX512F_OPT   
#define OPTIMIZE_LEN    256
#else
#define OPTIMIZE_LEN    64
#endif
> FWIW, something that the compiler should do, but doesn't currently,
> is use
> vpternlogq to perform a 3-input OR.  Something like
> 
>     /* 0xfe -> orABC */
>     t = _mm512_ternarylogic_epi64(t, p[-4], p[-3], 0xfe);
>     t = _mm512_ternarylogic_epi64(t, p[-2], p[-1], 0xfe);
> 
Very enlightening. Yes, seems compiler doesn't do this.
I tried explicitly use this, however, looks it will have more
instructions generated, and unit test shows it performs less than then
conventional code.
Let me keep the conventional code for this moment, will ask around and
dig further outside this patch.

> 
> r~

next prev parent reply	other threads:[~2020-02-24  7:08 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-13  7:52 [PATCH 0/2] Add AVX512F optimization option and buffer_zero_avx512() Robert Hoo
2020-02-13  7:52 ` [PATCH 1/2] configure: add configure option avx512f_opt Robert Hoo
2020-02-13  7:52 ` [PATCH 2/2] util: add util function buffer_zero_avx512() Robert Hoo
2020-02-13 10:30   ` Paolo Bonzini
2020-02-13 11:58     ` Robert Hoo
2020-02-13 18:20   ` Richard Henderson
2020-02-24  7:07     ` Robert Hoo [this message]
2020-02-24 16:13       ` Richard Henderson
2020-02-25  7:34         ` Robert Hoo
2020-02-25 15:29           ` Richard Henderson
2020-02-13  8:40 ` [PATCH 0/2] Add AVX512F optimization option and buffer_zero_avx512() no-reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=160d077042713fc46b36650946712a43e6e89b51.camel@linux.intel.com \
    --to=robert.hu@linux.intel.com \
    --cc=berrange@redhat.com \
    --cc=laurent@vivier.eu \
    --cc=pbonzini@redhat.com \
    --cc=philmd@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=robert.hu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).