All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>,
	Davidlohr Bueso <dbueso@suse.de>,
	Peter Zijlstra <peterz@infradead.org>,
	the arch/x86 maintainers <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Andy Lutomirski <luto@kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	virtualization <virtualization@lists.linux-foundation.org>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
Date: Wed, 13 Jan 2016 18:20:28 +0200	[thread overview]
Message-ID: <20160113174645-mutt-send-email-mst@redhat.com> (raw)
In-Reply-To: <CA+55aFyuR1YCZjC9++E4kpvRxgoM4sqzhNaS27EZPFh9CuKjYg@mail.gmail.com>

On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> 
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks!

So sp is read on return from function I think.  I added a function and sure
enough, it slows the add 0(sp) variant down. It's still faster than mfence for
me though! Testing code + results below. Reaching below stack, or
allocating extra 4 bytes above the stack pointer gives us back the performance.


> But yes:
> 
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
> 
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
> 
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
> 
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
> 
>                  Linus

I'll try this in the kernel now, will report, though I'm
not optimistic a high level benchmark can show this
kind of thing.


---------------
main.c:
---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif


#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif
#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);

	return 0;
}
---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}
clean:
	rm -f ${ALL}
run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done
perf time: run
time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

--------------------------------------------
perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
       0.080420565 seconds time elapsed                                          ( +-  2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
       0.087798571 seconds time elapsed                                          ( +-  2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
       0.083023724 seconds time elapsed                                          ( +-  2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
       0.102880750 seconds time elapsed                                          ( +-  0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
       0.084917420 seconds time elapsed                                          ( +-  3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
       0.156014715 seconds time elapsed                                          ( +-  0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
       0.077731443 seconds time elapsed                                          ( +-  0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
       0.036655741 seconds time elapsed                                          ( +-  0.21% )

WARNING: multiple messages have this Message-ID (diff)
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Andy Lutomirski <luto@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Davidlohr Bueso <dbueso@suse.de>,
	Peter Zijlstra <peterz@infradead.org>,
	the arch/x86 maintainers <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	virtualization <virtualization@lists.linux-foundation.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
Date: Wed, 13 Jan 2016 18:20:28 +0200	[thread overview]
Message-ID: <20160113174645-mutt-send-email-mst@redhat.com> (raw)
In-Reply-To: <CA+55aFyuR1YCZjC9++E4kpvRxgoM4sqzhNaS27EZPFh9CuKjYg@mail.gmail.com>

On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> 
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks!

So sp is read on return from function I think.  I added a function and sure
enough, it slows the add 0(sp) variant down. It's still faster than mfence for
me though! Testing code + results below. Reaching below stack, or
allocating extra 4 bytes above the stack pointer gives us back the performance.


> But yes:
> 
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
> 
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
> 
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
> 
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
> 
>                  Linus

I'll try this in the kernel now, will report, though I'm
not optimistic a high level benchmark can show this
kind of thing.


---------------
main.c:
---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif


#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif
#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);

	return 0;
}
---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}
clean:
	rm -f ${ALL}
run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done
perf time: run
time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

--------------------------------------------
perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
       0.080420565 seconds time elapsed                                          ( +-  2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
       0.087798571 seconds time elapsed                                          ( +-  2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
       0.083023724 seconds time elapsed                                          ( +-  2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
       0.102880750 seconds time elapsed                                          ( +-  0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
       0.084917420 seconds time elapsed                                          ( +-  3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
       0.156014715 seconds time elapsed                                          ( +-  0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
       0.077731443 seconds time elapsed                                          ( +-  0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
       0.036655741 seconds time elapsed                                          ( +-  0.21% )

  parent reply	other threads:[~2016-01-13 16:20 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
2015-10-27 23:27   ` David Howells
2015-12-04 12:01   ` [tip:locking/core] locking/cmpxchg, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
2015-10-27 20:03   ` Davidlohr Bueso
2015-12-04 12:01   ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
2015-10-27 21:33   ` Linus Torvalds
2015-10-27 22:01     ` Davidlohr Bueso
2015-10-27 22:37     ` Peter Zijlstra
2015-10-28 19:49       ` Davidlohr Bueso
2015-11-02 20:15       ` Davidlohr Bueso
2015-11-03  0:06         ` Linus Torvalds
2015-11-03  1:36           ` Davidlohr Bueso
2016-01-12 13:57           ` Michael S. Tsirkin
2016-01-12 17:20             ` Linus Torvalds
2016-01-12 17:20               ` Linus Torvalds
2016-01-12 17:45               ` Michael S. Tsirkin
2016-01-12 17:45                 ` Michael S. Tsirkin
2016-01-12 18:04                 ` Linus Torvalds
2016-01-12 18:04                   ` Linus Torvalds
2016-01-12 20:30               ` Andy Lutomirski
2016-01-12 20:54                 ` Linus Torvalds
2016-01-12 20:54                   ` Linus Torvalds
2016-01-12 20:59                   ` Andy Lutomirski
2016-01-12 20:59                     ` Andy Lutomirski
2016-01-12 21:37                     ` Linus Torvalds
2016-01-12 21:37                       ` Linus Torvalds
2016-01-12 22:14                       ` Michael S. Tsirkin
2016-01-12 22:14                       ` Michael S. Tsirkin
2016-01-13 16:20                       ` Michael S. Tsirkin [this message]
2016-01-13 16:20                         ` Michael S. Tsirkin
2016-01-12 22:21                     ` Michael S. Tsirkin
2016-01-12 22:21                       ` Michael S. Tsirkin
2016-01-12 22:55                       ` H. Peter Anvin
2016-01-12 22:55                         ` H. Peter Anvin
2016-01-12 23:24                         ` Linus Torvalds
2016-01-13 16:17                           ` Borislav Petkov
2016-01-13 16:17                             ` Borislav Petkov
2016-01-13 16:25                             ` Michael S. Tsirkin
2016-01-13 16:25                               ` Michael S. Tsirkin
2016-01-13 16:33                               ` Borislav Petkov
2016-01-13 16:33                                 ` Borislav Petkov
2016-01-13 16:42                                 ` Michael S. Tsirkin
2016-01-13 16:42                                   ` Michael S. Tsirkin
2016-01-13 16:53                                   ` Borislav Petkov
2016-01-13 16:53                                     ` Borislav Petkov
2016-01-13 17:00                                     ` Michael S. Tsirkin
2016-01-13 17:00                                     ` Michael S. Tsirkin
2016-01-13 18:38                                   ` Linus Torvalds
2016-01-13 18:38                                     ` Linus Torvalds
2016-01-12 23:24                         ` Linus Torvalds
2016-01-12 13:57           ` Michael S. Tsirkin
2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
2015-12-04 12:01   ` [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation tip-bot for Davidlohr Bueso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160113174645-mutt-send-email-mst@redhat.com \
    --to=mst@redhat.com \
    --cc=dave@stgolabs.net \
    --cc=dbueso@suse.de \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=luto@kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.