Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	the arch/x86 maintainers <x86@kernel.org>,
	Davidlohr Bueso <dbueso@suse.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	virtualization@lists.linux-foundation.org
Subject: Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
Date: Tue, 12 Jan 2016 15:57:38 +0200	[thread overview]
Message-ID: <20160112150032-mutt-send-email-mst@redhat.com> (raw)
In-Reply-To: <CA+55aFynbkeuUGs9s-q+fLY6MeRBA6MjEyWWbbe7A5AaqsAknw@mail.gmail.com>

On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
> > constantly cheaper (by at least half the latency) than MFENCE. While there
> > was a decent amount of variation, this difference remained rather constant.
> 
> Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
> use on old cpu's without one (ie 32-bit).
> 
> I'm not actually convinced that mfence is necessarily a good idea. I
> could easily see it being microcode, for example.
> 
> At least on my Haswell, the "lock addq" is pretty much exactly half
> the cost of "mfence".
> 
>                      Linus

mfence was high on some traces I was seeing, so I got curious, too:

---->
main.c
---->


extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif
#ifdef lock
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	for (i = 0; i < 10000000; ++i) {
		x = i - j;
		barrier();
		j = x;
	}

	return 0;
}
---->
Makefile:
---->

ALL = xchg xchgrz lock mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --

all: ${ALL}
clean:
	rm -f ${ALL}
run: all
	for file in ${ALL}; do echo ${PERF} ./$$file ; ${PERF} ./$$file; done

.PHONY: all clean run

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

----->

Is this a good way to test it?

E.g. on my laptop I get:

perf stat -r 10 --log-fd 1 -- ./xchg

 Performance counter stats for './xchg' (10 runs):

         53.236967 task-clock                #    0.992 CPUs utilized            ( +-  0.09% )
                10 context-switches          #    0.180 K/sec                    ( +-  1.70% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.691 K/sec                    ( +-  1.13% )
       190,997,612 cycles                    #    3.588 GHz                      ( +-  0.04% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,654,850 instructions              #    0.42  insns per cycle          ( +-  0.01% )
        10,122,372 branches                  #  190.138 M/sec                    ( +-  0.01% )
             4,514 branch-misses             #    0.04% of all branches          ( +-  3.37% )

       0.053642809 seconds time elapsed                                          ( +-  0.12% )

perf stat -r 10 --log-fd 1 -- ./xchgrz

 Performance counter stats for './xchgrz' (10 runs):

         53.189533 task-clock                #    0.997 CPUs utilized            ( +-  0.22% )
                 0 context-switches          #    0.000 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.694 K/sec                    ( +-  0.75% )
       190,785,621 cycles                    #    3.587 GHz                      ( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,602,086 instructions              #    0.42  insns per cycle          ( +-  0.00% )
        10,112,154 branches                  #  190.115 M/sec                    ( +-  0.01% )
             3,743 branch-misses             #    0.04% of all branches          ( +-  4.02% )

       0.053343693 seconds time elapsed                                          ( +-  0.23% )

perf stat -r 10 --log-fd 1 -- ./lock

 Performance counter stats for './lock' (10 runs):

         53.096434 task-clock                #    0.997 CPUs utilized            ( +-  0.16% )
                 0 context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.693 K/sec                    ( +-  0.98% )
       190,796,621 cycles                    #    3.593 GHz                      ( +-  0.02% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,601,376 instructions              #    0.42  insns per cycle          ( +-  0.01% )
        10,112,074 branches                  #  190.447 M/sec                    ( +-  0.01% )
             3,475 branch-misses             #    0.03% of all branches          ( +-  1.33% )

       0.053252678 seconds time elapsed                                          ( +-  0.16% )

perf stat -r 10 --log-fd 1 -- ./mfence

 Performance counter stats for './mfence' (10 runs):

        126.376473 task-clock                #    0.999 CPUs utilized            ( +-  0.21% )
                 0 context-switches          #    0.002 K/sec                    ( +- 66.67% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.289 K/sec                    ( +-  0.84% )
       456,147,770 cycles                    #    3.609 GHz                      ( +-  0.01% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,892,416 instructions              #    0.18  insns per cycle          ( +-  0.00% )
        10,163,220 branches                  #   80.420 M/sec                    ( +-  0.01% )
             4,653 branch-misses             #    0.05% of all branches          ( +-  1.27% )

       0.126539273 seconds time elapsed                                          ( +-  0.21% )

perf stat -r 10 --log-fd 1 -- ./lfence

 Performance counter stats for './lfence' (10 runs):

         47.617861 task-clock                #    0.997 CPUs utilized            ( +-  0.06% )
                 0 context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.764 K/sec                    ( +-  0.45% )
       170,767,856 cycles                    #    3.586 GHz                      ( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,581,607 instructions              #    0.47  insns per cycle          ( +-  0.00% )
        10,108,508 branches                  #  212.284 M/sec                    ( +-  0.00% )
             3,320 branch-misses             #    0.03% of all branches          ( +-  1.12% )

       0.047768505 seconds time elapsed                                          ( +-  0.07% )

perf stat -r 10 --log-fd 1 -- ./sfence

 Performance counter stats for './sfence' (10 runs):

         20.156676 task-clock                #    0.988 CPUs utilized            ( +-  0.45% )
                 3 context-switches          #    0.159 K/sec                    ( +- 12.15% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.002 M/sec                    ( +-  0.87% )
        72,212,225 cycles                    #    3.583 GHz                      ( +-  0.33% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,479,149 instructions              #    1.11  insns per cycle          ( +-  0.00% )
        10,090,785 branches                  #  500.618 M/sec                    ( +-  0.01% )
             3,626 branch-misses             #    0.04% of all branches          ( +-  3.59% )

       0.020411208 seconds time elapsed                                          ( +-  0.52% )


So mfence is more expensive than locked instructions/xchg, but sfence/lfence
are slightly faster, and xchg and locked instructions are very close if
not the same.

I poked at some 10 intel and AMD machines and the numbers are different
but the results seem more or less consistent with this.

>From size point of view xchg is longer and xchgrz pokes at the red zone
which seems unnecessarily hacky, so good old lock+addl is probably the
best.

There isn't any extra magic behind mfence, is there?
E.g. I think lock orders accesses to WC memory as well,
so apparently mb() can be redefined unconditionally, without
looking at XMM2:

--->
x86: drop mfence in favor of lock+addl

mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, same as we always did on old 32-bit.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

I'll play with this some more before posting this as a
non-stand alone patch. Is there a macro-benchmark where mb
is prominent?

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index a584e1c..f0d36e2 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -15,15 +15,15 @@
  * Some non-Intel clones support out of order store. wmb() ceases to be a
  * nop for these.
  */
-#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+#define mb() asm volatile("lock; addl $0,0(%%esp)":::"memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
+#define mb()	asm volatile("lock; addl $0,0(%%rsp)":::"memory")
 #define rmb()	asm volatile("lfence":::"memory")
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif
 
 #ifdef CONFIG_X86_PPRO_FENCE
 #define dma_rmb()	rmb()
 #else

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

next prev parent reply	other threads:[~2016-01-12 13:57 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
2015-10-27 23:27   ` David Howells
2015-12-04 12:01   ` [tip:locking/core] locking/cmpxchg, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
2015-10-27 20:03   ` Davidlohr Bueso
2015-12-04 12:01   ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
2015-10-27 21:33   ` Linus Torvalds
2015-10-27 22:01     ` Davidlohr Bueso
2015-10-27 22:37     ` Peter Zijlstra
2015-10-28 19:49       ` Davidlohr Bueso
2015-11-02 20:15       ` Davidlohr Bueso
2015-11-03  0:06         ` Linus Torvalds
2015-11-03  1:36           ` Davidlohr Bueso
2016-01-12 13:57           ` Michael S. Tsirkin [this message]
2016-01-12 17:20             ` Linus Torvalds
2016-01-12 17:20               ` Linus Torvalds
2016-01-12 17:45               ` Michael S. Tsirkin
2016-01-12 17:45                 ` Michael S. Tsirkin
2016-01-12 18:04                 ` Linus Torvalds
2016-01-12 18:04                   ` Linus Torvalds
2016-01-12 20:30               ` Andy Lutomirski
2016-01-12 20:54                 ` Linus Torvalds
2016-01-12 20:54                   ` Linus Torvalds
2016-01-12 20:59                   ` Andy Lutomirski
2016-01-12 20:59                     ` Andy Lutomirski
2016-01-12 21:37                     ` Linus Torvalds
2016-01-12 21:37                       ` Linus Torvalds
2016-01-12 22:14                       ` Michael S. Tsirkin
2016-01-12 22:14                       ` Michael S. Tsirkin
2016-01-13 16:20                       ` Michael S. Tsirkin
2016-01-13 16:20                         ` Michael S. Tsirkin
2016-01-12 22:21                     ` Michael S. Tsirkin
2016-01-12 22:21                       ` Michael S. Tsirkin
2016-01-12 22:55                       ` H. Peter Anvin
2016-01-12 22:55                         ` H. Peter Anvin
2016-01-12 23:24                         ` Linus Torvalds
2016-01-13 16:17                           ` Borislav Petkov
2016-01-13 16:17                             ` Borislav Petkov
2016-01-13 16:25                             ` Michael S. Tsirkin
2016-01-13 16:25                               ` Michael S. Tsirkin
2016-01-13 16:33                               ` Borislav Petkov
2016-01-13 16:33                                 ` Borislav Petkov
2016-01-13 16:42                                 ` Michael S. Tsirkin
2016-01-13 16:42                                   ` Michael S. Tsirkin
2016-01-13 16:53                                   ` Borislav Petkov
2016-01-13 16:53                                     ` Borislav Petkov
2016-01-13 17:00                                     ` Michael S. Tsirkin
2016-01-13 17:00                                     ` Michael S. Tsirkin
2016-01-13 18:38                                   ` Linus Torvalds
2016-01-13 18:38                                     ` Linus Torvalds
2016-01-12 23:24                         ` Linus Torvalds
2016-01-12 13:57           ` Michael S. Tsirkin
2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
2015-12-04 12:01   ` [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation tip-bot for Davidlohr Bueso

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:a584e1c dfblob:f0d36e2 )
 OR (
bs:"Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160112150032-mutt-send-email-mst@redhat.com \
    --to=mst@redhat.com \
    --cc=dave@stgolabs.net \
    --cc=dbueso@suse.de \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.