Re: Fwd: [PATCH v3 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	"Richardson, Bruce" <bruce.richardson@intel.com>
Subject: Re: Fwd: [PATCH v3 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
Date: Mon, 29 Jan 2018 19:29:15 +0200	[thread overview]
Message-ID: <20180129170206-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <2601191342CEEE43887BDE71AB97725890565030@IRSMSX103.ger.corp.intel.com>

On Mon, Jan 29, 2018 at 09:29:52AM +0000, Ananyev, Konstantin wrote:
> Hi Michael,
> 
> > 
> > On Mon, Jan 15, 2018 at 04:15:00PM +0100, Maxime Coquelin wrote:
> > > Hi Michael,
> > >
> > > FYI:
> > >
> > > -------- Forwarded Message --------
> > > Subject: [dpdk-dev] [PATCH v3 2/2] eal/x86: Use lock-prefixed instructions
> > > to reduce cost of rte_smp_mb()
> > > Date: Mon, 15 Jan 2018 15:09:31 +0000
> > > From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > To: dev@dpdk.org
> > > CC: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > >
> > > On x86 it  is possible to use lock-prefixed instructions to get
> > > the similar effect as mfence.
> > > As pointed by Java guys, on most modern HW that gives a better
> > > performance than using mfence:
> > > https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > > That patch adopts that technique for rte_smp_mb() implementation.
> > > On BDW 2.2 mb_autotest on single lcore reports 2X cycle reduction,
> > > i.e. from ~110 to ~55 cycles per operation.
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > >  .../common/include/arch/x86/rte_atomic.h           | 44
> > > +++++++++++++++++++++-
> > >  1 file changed, 42 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > > b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > > index 8469f97e1..9d466d94a 100644
> > > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > > @@ -26,12 +26,52 @@ extern "C" {
> > >   #define	rte_rmb() _mm_lfence()
> > >  -#define rte_smp_mb() rte_mb()
> > > -
> > >  #define rte_smp_wmb() rte_compiler_barrier()
> > >   #define rte_smp_rmb() rte_compiler_barrier()
> > >  +/*
> > > + * From Intel Software Development Manual; Vol 3;
> > > + * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
> > > + * ...
> > > + * . Reads are not reordered with other reads.
> > > + * . Writes are not reordered with older reads.
> > > + * . Writes to memory are not reordered with other writes,
> > > + *   with the following exceptions:
> > > + *   . streaming stores (writes) executed with the non-temporal move
> > > + *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
> > > + *   . string operations (see Section 8.2.4.1).
> > > + *  ...
> > > + * . Reads may be reordered with older writes to different locations but
> > > not
> > > + * with older writes to the same location.
> > > + * . Reads or writes cannot be reordered with I/O instructions,
> > > + * locked instructions, or serializing instructions.
> > > + * . Reads cannot pass earlier LFENCE and MFENCE instructions.
> > > + * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE
> > > instructions.
> > > + * . LFENCE instructions cannot pass earlier reads.
> > > + * . SFENCE instructions cannot pass earlier writes ...
> > > + * . MFENCE instructions cannot pass earlier reads, writes ...
> > > + *
> > > + * As pointed by Java guys, that makes possible to use lock-prefixed
> > > + * instructions to get the same effect as mfence and on most modern HW
> > > + * that gives a better perfomance then using mfence:
> > > + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > > + * Basic idea is to use lock prefixed add with some dummy memory location
> > > + * as the destination. From their experiments 128B(2 cache lines) below
> > > + * current stack pointer looks like a good candidate.
> > > + * So below we use that techinque for rte_smp_mb() implementation.
> > > + */
> > > +
> > > +static __rte_always_inline void
> > > +rte_smp_mb(void)
> > > +{
> > > +#ifdef RTE_ARCH_I686
> > > +	asm volatile("lock addl $0, -128(%%esp); " ::: "memory");
> > > +#else
> > > +	asm volatile("lock addl $0, -128(%%rsp); " ::: "memory");
> > > +#endif
> > > +}
> > > +
> > >  #define rte_io_mb() rte_mb()
> > >   #define rte_io_wmb() rte_compiler_barrier()
> > 
> > In my testing this appears to be suboptimal when the calling
> > function is large. The following seems to work better:
> > 
> > +static __rte_always_inline void
> > +rte_smp_mb(void)
> > +{
> > +#ifdef RTE_ARCH_I686
> > +	asm volatile("lock addl $0, -132(%%esp); " ::: "memory");
> > +#else
> > +	asm volatile("lock addl $0, -132(%%rsp); " ::: "memory");
> > +#endif
> > +}
> > +
> > 
> > The reason most likely is that 128 still overlaps the x86
> > red zone by 4 bytes.
> 
> I tried what you suggested but for my cases didn't see any improvement so far.
> Can you explain a bit more why do you expect it to be faster?
> Probably some particular scenario?
> Konstantin

It would depend on how much of a redzone is in use.
If the last 4 bytes of the red zone get used, you will
see a bit more stalls when trying to run the xor command.
You will need to call it from a function with lots of
scratch/temporary variables for that to be the case.

> > 
> > Feel free to reuse, and add
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > 
> > > --
> > > 2.13.6

next prev parent reply	other threads:[~2018-01-29 17:29 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-01 11:12 [PATCH 1/2] test/test: introduce new test-case for rte_smp_mb() Konstantin Ananyev
2017-12-01 11:12 ` [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb() Konstantin Ananyev
2017-12-01 18:04   ` Stephen Hemminger
2017-12-01 23:08     ` Ananyev, Konstantin
2018-03-08 21:15       ` Stephen Hemminger
2017-12-11 17:11   ` Bruce Richardson
2017-12-11 17:30     ` Ananyev, Konstantin
2017-12-18 15:34   ` [PATCH v2 0/2] eal/x86: Optimize rte_smp_mb() and create a new test case for it Konstantin Ananyev
2017-12-18 15:34   ` [PATCH v2 1/2] test/test: introduce new test-case for rte_smp_mb() Konstantin Ananyev
2018-01-12 17:23     ` Thomas Monjalon
2018-01-12 17:58       ` Ananyev, Konstantin
2018-01-13 13:54     ` Wiles, Keith
2018-01-13 13:54     ` Wiles, Keith
2018-01-15 15:04     ` [PATCH v3 0/2] eal/x86: Optimize rte_smp_mb() and create a new test case for it Konstantin Ananyev
2018-01-15 15:04     ` [PATCH v3 1/2] test/test: introduce new test-case for rte_smp_mb() Konstantin Ananyev
2018-01-16  0:16       ` Thomas Monjalon
2018-01-15 15:04     ` [PATCH v3 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb() Konstantin Ananyev
2018-01-15 15:09       ` Konstantin Ananyev
     [not found]         ` <8b05f533-d146-7f97-48f4-82ddcfc3613b@redhat.com>
2018-01-16  1:54           ` Fwd: " Michael S. Tsirkin
2018-01-29  9:29             ` Ananyev, Konstantin
2018-01-29 17:29               ` Michael S. Tsirkin [this message]
2018-01-29 15:47         ` Thomas Monjalon
2018-01-30 19:33           ` Stephen Hemminger
2017-12-18 15:34   ` [PATCH v2 " Konstantin Ananyev
2017-12-18 15:46     ` Bruce Richardson
2017-12-11 17:08 ` [PATCH 1/2] test/test: introduce new test-case for rte_smp_mb() Bruce Richardson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180129170206-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=konstantin.ananyev@intel.com \
    --cc=maxime.coquelin@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.