Date: Wed, 13 Jan 2016 18:20:28 +0200
From: "Michael S. Tsirkin"
To: Linus Torvalds
Cc: Andy Lutomirski, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, virtualization,
	"H. Peter Anvin", Thomas Gleixner, "Paul E. McKenney", Ingo Molnar
Subject: Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
Message-ID: <20160113174645-mutt-send-email-mst@redhat.com>
References: <20151027223744.GB11242@worktop.amr.corp.intel.com>
	<20151102201535.GB1707@linux-uzut.site>
	<20160112150032-mutt-send-email-mst@redhat.com>
	<56956276.1090705@kernel.org>

On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks!

So the stack pointer is read on return from a function, I think. I
added a function call and, sure enough, it slows the locksp
("lock; addl $0,0(%rsp)") variant down. It's still faster than mfence
for me, though! Testing code and results are below.

Reaching below the stack, or allocating an extra 4 bytes above the
stack pointer, gives the performance back.

> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the
> stack pointer, so "-4(%rsp)" sounds good to me. There should never be
> any pending writes in the write buffer, because even if it *was*
> live, it would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
>                    Linus

I'll try this in the kernel now and will report back, though I'm not
optimistic that a high-level benchmark can show this kind of thing.
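For concreteness, the kernel-side change I'd try looks roughly like the
sketch below. This is untested and purely illustrative, not a patch:
the macro name and its placement in arch/x86/include/asm/barrier.h are
just my assumptions about where such a change would land, and the final
form may well differ.

/*
 * Illustrative sketch only, not the actual patch: use a lock-prefixed
 * add just below the stack pointer as the full barrier on x86-64
 * instead of mfence.  Safe in the kernel because there is no red zone.
 */
#ifdef CONFIG_X86_64
#define mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
#else
/* 32-bit: always use the locked add, dropping the mfence alternative */
#define mb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#endif

The "cc" clobber is needed because the locked add modifies the flags,
which the current mfence-based definition does not.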
--------------- main.c: ---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif

#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif
#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);
	return 0;
}

--------------- Makefile: ---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}

clean:
	rm -f ${ALL}

run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done

perf time: run

time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

--------------------------------------------

perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
       0.080420565 seconds time elapsed                         ( +- 2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
       0.087798571 seconds time elapsed                         ( +- 2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
       0.083023724 seconds time elapsed                         ( +- 2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
       0.102880750 seconds time elapsed                         ( +- 0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
       0.084917420 seconds time elapsed                         ( +- 3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
       0.156014715 seconds time elapsed                         ( +- 0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
       0.077731443 seconds time elapsed                         ( +- 0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
       0.036655741 seconds time elapsed                         ( +- 0.21% )