* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() [not found] ` <CA+55aFynbkeuUGs9s-q+fLY6MeRBA6MjEyWWbbe7A5AaqsAknw@mail.gmail.com> @ 2016-01-12 13:57 ` Michael S. Tsirkin [not found] ` <20160112150032-mutt-send-email-mst@redhat.com> 1 sibling, 0 replies; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 13:57 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote: > > > > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is > > constantly cheaper (by at least half the latency) than MFENCE. While there > > was a decent amount of variation, this difference remained rather constant. > > Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we > use on old cpu's without one (ie 32-bit). > > I'm not actually convinced that mfence is necessarily a good idea. I > could easily see it being microcode, for example. > > At least on my Haswell, the "lock addq" is pretty much exactly half > the cost of "mfence". > > Linus mfence was high on some traces I was seeing, so I got curious, too: ----> main.c ----> extern volatile int x; volatile int x; #ifdef __x86_64__ #define SP "rsp" #else #define SP "esp" #endif #ifdef lock #define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory") #endif #ifdef xchg #define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0) #endif #ifdef xchgrz /* same as xchg but poking at gcc red zone */ #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) #endif #ifdef mfence #define barrier() asm("mfence" ::: "memory") #endif #ifdef lfence #define barrier() asm("lfence" ::: "memory") #endif #ifdef sfence #define barrier() asm("sfence" ::: "memory") #endif int main(int argc, char **argv) { int i; int j = 1234; /* * Test barrier in a loop. We also poke at a volatile variable in an * attempt to make it a bit more realistic - this way there's something * in the store-buffer. */ for (i = 0; i < 10000000; ++i) { x = i - j; barrier(); j = x; } return 0; } ----> Makefile: ----> ALL = xchg xchgrz lock mfence lfence sfence CC = gcc CFLAGS += -Wall -O2 -ggdb PERF = perf stat -r 10 --log-fd 1 -- all: ${ALL} clean: rm -f ${ALL} run: all for file in ${ALL}; do echo ${PERF} ./$$file ; ${PERF} ./$$file; done .PHONY: all clean run ${ALL}: main.c ${CC} ${CFLAGS} -D$@ -o $@ main.c -----> Is this a good way to test it? E.g. on my laptop I get: perf stat -r 10 --log-fd 1 -- ./xchg Performance counter stats for './xchg' (10 runs): 53.236967 task-clock # 0.992 CPUs utilized ( +- 0.09% ) 10 context-switches # 0.180 K/sec ( +- 1.70% ) 0 CPU-migrations # 0.000 K/sec 37 page-faults # 0.691 K/sec ( +- 1.13% ) 190,997,612 cycles # 3.588 GHz ( +- 0.04% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,654,850 instructions # 0.42 insns per cycle ( +- 0.01% ) 10,122,372 branches # 190.138 M/sec ( +- 0.01% ) 4,514 branch-misses # 0.04% of all branches ( +- 3.37% ) 0.053642809 seconds time elapsed ( +- 0.12% ) perf stat -r 10 --log-fd 1 -- ./xchgrz Performance counter stats for './xchgrz' (10 runs): 53.189533 task-clock # 0.997 CPUs utilized ( +- 0.22% ) 0 context-switches # 0.000 K/sec 0 CPU-migrations # 0.000 K/sec 37 page-faults # 0.694 K/sec ( +- 0.75% ) 190,785,621 cycles # 3.587 GHz ( +- 0.03% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,602,086 instructions # 0.42 insns per cycle ( +- 0.00% ) 10,112,154 branches # 190.115 M/sec ( +- 0.01% ) 3,743 branch-misses # 0.04% of all branches ( +- 4.02% ) 0.053343693 seconds time elapsed ( +- 0.23% ) perf stat -r 10 --log-fd 1 -- ./lock Performance counter stats for './lock' (10 runs): 53.096434 task-clock # 0.997 CPUs utilized ( +- 0.16% ) 0 context-switches # 0.002 K/sec ( +-100.00% ) 0 CPU-migrations # 0.000 K/sec 37 page-faults # 0.693 K/sec ( +- 0.98% ) 190,796,621 cycles # 3.593 GHz ( +- 0.02% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,601,376 instructions # 0.42 insns per cycle ( +- 0.01% ) 10,112,074 branches # 190.447 M/sec ( +- 0.01% ) 3,475 branch-misses # 0.03% of all branches ( +- 1.33% ) 0.053252678 seconds time elapsed ( +- 0.16% ) perf stat -r 10 --log-fd 1 -- ./mfence Performance counter stats for './mfence' (10 runs): 126.376473 task-clock # 0.999 CPUs utilized ( +- 0.21% ) 0 context-switches # 0.002 K/sec ( +- 66.67% ) 0 CPU-migrations # 0.000 K/sec 36 page-faults # 0.289 K/sec ( +- 0.84% ) 456,147,770 cycles # 3.609 GHz ( +- 0.01% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,892,416 instructions # 0.18 insns per cycle ( +- 0.00% ) 10,163,220 branches # 80.420 M/sec ( +- 0.01% ) 4,653 branch-misses # 0.05% of all branches ( +- 1.27% ) 0.126539273 seconds time elapsed ( +- 0.21% ) perf stat -r 10 --log-fd 1 -- ./lfence Performance counter stats for './lfence' (10 runs): 47.617861 task-clock # 0.997 CPUs utilized ( +- 0.06% ) 0 context-switches # 0.002 K/sec ( +-100.00% ) 0 CPU-migrations # 0.000 K/sec 36 page-faults # 0.764 K/sec ( +- 0.45% ) 170,767,856 cycles # 3.586 GHz ( +- 0.03% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,581,607 instructions # 0.47 insns per cycle ( +- 0.00% ) 10,108,508 branches # 212.284 M/sec ( +- 0.00% ) 3,320 branch-misses # 0.03% of all branches ( +- 1.12% ) 0.047768505 seconds time elapsed ( +- 0.07% ) perf stat -r 10 --log-fd 1 -- ./sfence Performance counter stats for './sfence' (10 runs): 20.156676 task-clock # 0.988 CPUs utilized ( +- 0.45% ) 3 context-switches # 0.159 K/sec ( +- 12.15% ) 0 CPU-migrations # 0.000 K/sec 36 page-faults # 0.002 M/sec ( +- 0.87% ) 72,212,225 cycles # 3.583 GHz ( +- 0.33% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,479,149 instructions # 1.11 insns per cycle ( +- 0.00% ) 10,090,785 branches # 500.618 M/sec ( +- 0.01% ) 3,626 branch-misses # 0.04% of all branches ( +- 3.59% ) 0.020411208 seconds time elapsed ( +- 0.52% ) So mfence is more expensive than locked instructions/xchg, but sfence/lfence are slightly faster, and xchg and locked instructions are very close if not the same. I poked at some 10 intel and AMD machines and the numbers are different but the results seem more or less consistent with this. From size point of view xchg is longer and xchgrz pokes at the red zone which seems unnecessarily hacky, so good old lock+addl is probably the best. There isn't any extra magic behind mfence, is there? E.g. I think lock orders accesses to WC memory as well, so apparently mb() can be redefined unconditionally, without looking at XMM2: ---> x86: drop mfence in favor of lock+addl mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, same as we always did on old 32-bit. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> --- I'll play with this some more before posting this as a non-stand alone patch. Is there a macro-benchmark where mb is prominent? diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h index a584e1c..f0d36e2 100644 --- a/arch/x86/include/asm/barrier.h +++ b/arch/x86/include/asm/barrier.h @@ -15,15 +15,15 @@ * Some non-Intel clones support out of order store. wmb() ceases to be a * nop for these. */ -#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2) +#define mb() asm volatile("lock; addl $0,0(%%esp)":::"memory") #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2) #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM) #else +#define mb() asm volatile("lock; addl $0,0(%%rsp)":::"memory") #define rmb() asm volatile("lfence":::"memory") #define wmb() asm volatile("sfence" ::: "memory") #endif #ifdef CONFIG_X86_PPRO_FENCE #define dma_rmb() rmb() #else > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply related [flat|nested] 20+ messages in thread
[parent not found: <20160112150032-mutt-send-email-mst@redhat.com>]
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() [not found] ` <20160112150032-mutt-send-email-mst@redhat.com> @ 2016-01-12 17:20 ` Linus Torvalds 2016-01-12 17:45 ` Michael S. Tsirkin 2016-01-12 20:30 ` Andy Lutomirski 0 siblings, 2 replies; 20+ messages in thread From: Linus Torvalds @ 2016-01-12 17:20 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > #ifdef xchgrz > /* same as xchg but poking at gcc red zone */ > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) > #endif That's not safe in general. gcc might be using its redzone, so doing xchg into it is unsafe. But.. > Is this a good way to test it? .. it's fine for some basic testing. It doesn't show any subtle interactions (ie some operations may have different dynamic behavior when the write buffers are busy etc), but as a baseline for "how fast can things go" the stupid raw loop is fine. And while the xchg into the redzoen wouldn't be acceptable as a real implementation, for timing testing it's likely fine (ie you aren't hitting the problem it can cause). > So mfence is more expensive than locked instructions/xchg, but sfence/lfence > are slightly faster, and xchg and locked instructions are very close if > not the same. Note that we never actually *use* lfence/sfence. They are pointless instructions when looking at CPU memory ordering, because for pure CPU memory ordering stores and loads are already ordered. The only reason to use lfence/sfence is after you've used nontemporal stores for IO. That's very very rare in the kernel. So I wouldn't worry about those. But yes, it does sound like mfence is just a bad idea too. > There isn't any extra magic behind mfence, is there? No. I think the only issue is that there has never been any real reason for CPU designers to try to make mfence go particularly fast. Nobody uses it, again with the exception of some odd loops that use nontemporal stores, and for those the cost tends to always be about the nontemporal accesses themselves (often to things like GPU memory over PCIe), and the mfence cost of a few extra cycles is negligible. The reason "lock ; add $0" has generally been the fastest we've found is simply that locked ops have been important for CPU designers. So I think the patch is fine, and we should likely drop the use of mfence.. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 17:20 ` Linus Torvalds @ 2016-01-12 17:45 ` Michael S. Tsirkin 2016-01-12 18:04 ` Linus Torvalds 2016-01-12 20:30 ` Andy Lutomirski 1 sibling, 1 reply; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 17:45 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > #ifdef xchgrz > > /* same as xchg but poking at gcc red zone */ > > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) > > #endif > > That's not safe in general. gcc might be using its redzone, so doing > xchg into it is unsafe. > > But.. > > > Is this a good way to test it? > > .. it's fine for some basic testing. It doesn't show any subtle > interactions (ie some operations may have different dynamic behavior > when the write buffers are busy etc), but as a baseline for "how fast > can things go" the stupid raw loop is fine. And while the xchg into > the redzoen wouldn't be acceptable as a real implementation, for > timing testing it's likely fine (ie you aren't hitting the problem it > can cause). > > > So mfence is more expensive than locked instructions/xchg, but sfence/lfence > > are slightly faster, and xchg and locked instructions are very close if > > not the same. > > Note that we never actually *use* lfence/sfence. They are pointless > instructions when looking at CPU memory ordering, because for pure CPU > memory ordering stores and loads are already ordered. > > The only reason to use lfence/sfence is after you've used nontemporal > stores for IO. By the way, the comment in barrier.h says: /* * Some non-Intel clones support out of order store. wmb() ceases to be * a nop for these. */ and while the 1st sentence may well be true, if you have an SMP system with out of order stores, making wmb not a nop will not help. Additionally as you point out, wmb is not a nop even for regular intel CPUs because of these weird use-cases. Drop this comment? > That's very very rare in the kernel. So I wouldn't > worry about those. Right - I'll leave these alone, whoever wants to optimize this path will have to do the necessary research. > But yes, it does sound like mfence is just a bad idea too. > > > There isn't any extra magic behind mfence, is there? > > No. > > I think the only issue is that there has never been any real reason > for CPU designers to try to make mfence go particularly fast. Nobody > uses it, again with the exception of some odd loops that use > nontemporal stores, and for those the cost tends to always be about > the nontemporal accesses themselves (often to things like GPU memory > over PCIe), and the mfence cost of a few extra cycles is negligible. > > The reason "lock ; add $0" has generally been the fastest we've found > is simply that locked ops have been important for CPU designers. > > So I think the patch is fine, and we should likely drop the use of mfence.. > > Linus OK so should I repost after a bit more testing? I don't believe this will affect the kernel build benchmark, but I'll try :) -- MST ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 17:45 ` Michael S. Tsirkin @ 2016-01-12 18:04 ` Linus Torvalds 0 siblings, 0 replies; 20+ messages in thread From: Linus Torvalds @ 2016-01-12 18:04 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 9:45 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > By the way, the comment in barrier.h says: > > /* > * Some non-Intel clones support out of order store. wmb() ceases to be > * a nop for these. > */ > > and while the 1st sentence may well be true, if you have > an SMP system with out of order stores, making wmb > not a nop will not help. > > Additionally as you point out, wmb is not a nop even > for regular intel CPUs because of these weird use-cases. > > Drop this comment? We should drop it, yes. We dropped support for CONFIG_X86_OOSTORE almost two years ago. See commit 09df7c4c8097 ("x86: Remove CONFIG_X86_OOSTORE") and it was questionable for a long time even before that (perhaps ever). So the comment is stale. We *do* still use the non-nop rmb/wmb for IO barriers, but even that is generally questionable. See our "copy_user_64.S" for an actual use of "movnt" followed by sfence. There's a couple of other cases too. So that's all correct, but the point is that when we use "movnt" we don't actually use "wmb()", we are doing assembly, and the assembly should just use sfence directly. So it's actually very questionable to ever make even the IO wmb()/rmb() functions use lfence/sfence. They should never really need it. But at the same time, I _really_ don't think we care enough. I'd rather leave those non-smp barrier cases alone as historial unless somebody can point to a case where they care about the performance. We also do have the whole PPRO_FENCE thing, which we can hopefully get rid of at some point too. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 17:20 ` Linus Torvalds 2016-01-12 17:45 ` Michael S. Tsirkin @ 2016-01-12 20:30 ` Andy Lutomirski 2016-01-12 20:54 ` Linus Torvalds 1 sibling, 1 reply; 20+ messages in thread From: Andy Lutomirski @ 2016-01-12 20:30 UTC (permalink / raw) To: Linus Torvalds, Michael S. Tsirkin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On 01/12/2016 09:20 AM, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote: >> #ifdef xchgrz >> /* same as xchg but poking at gcc red zone */ >> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) >> #endif > > That's not safe in general. gcc might be using its redzone, so doing > xchg into it is unsafe. > > But.. > >> Is this a good way to test it? > > .. it's fine for some basic testing. It doesn't show any subtle > interactions (ie some operations may have different dynamic behavior > when the write buffers are busy etc), but as a baseline for "how fast > can things go" the stupid raw loop is fine. And while the xchg into > the redzoen wouldn't be acceptable as a real implementation, for > timing testing it's likely fine (ie you aren't hitting the problem it > can cause). I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) was better because it avoided stomping on very-likely-to-be-hot write buffers. --Andy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:30 ` Andy Lutomirski @ 2016-01-12 20:54 ` Linus Torvalds 2016-01-12 20:59 ` Andy Lutomirski 0 siblings, 1 reply; 20+ messages in thread From: Linus Torvalds @ 2016-01-12 20:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Michael S. Tsirkin, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote: > > I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) > was better because it avoided stomping on very-likely-to-be-hot write > buffers. I suspect it could go either way. You want a small constant (for the isntruction size), but any small constant is likely to be within the current stack frame anyway. I don't think 0(%rsp) is particularly likely to have a spill on it right then and there, but who knows.. And 64(%rsp) is possibly going to be cold in the L1 cache, especially if it's just after a deep function call. Which it might be. So it might work the other way. So my guess would be that you wouldn't be able to measure the difference. It might be there, but probably too small to really see in any noise. But numbers talk, bullshit walks. It would be interesting to be proven wrong. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:54 ` Linus Torvalds @ 2016-01-12 20:59 ` Andy Lutomirski 2016-01-12 21:37 ` Linus Torvalds 2016-01-12 22:21 ` Michael S. Tsirkin 0 siblings, 2 replies; 20+ messages in thread From: Andy Lutomirski @ 2016-01-12 20:59 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Michael S. Tsirkin, Andy Lutomirski, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) >> was better because it avoided stomping on very-likely-to-be-hot write >> buffers. > > I suspect it could go either way. You want a small constant (for the > isntruction size), but any small constant is likely to be within the > current stack frame anyway. I don't think 0(%rsp) is particularly > likely to have a spill on it right then and there, but who knows.. > > And 64(%rsp) is possibly going to be cold in the L1 cache, especially > if it's just after a deep function call. Which it might be. So it > might work the other way. > > So my guess would be that you wouldn't be able to measure the > difference. It might be there, but probably too small to really see in > any noise. > > But numbers talk, bullshit walks. It would be interesting to be proven wrong. Here's an article with numbers: http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ I think they're suggesting using a negative offset, which is safe as long as it doesn't page fault, even though we have the redzone disabled. --Andy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:59 ` Andy Lutomirski @ 2016-01-12 21:37 ` Linus Torvalds 2016-01-12 22:14 ` Michael S. Tsirkin 2016-01-13 16:20 ` Michael S. Tsirkin 2016-01-12 22:21 ` Michael S. Tsirkin 1 sibling, 2 replies; 20+ messages in thread From: Linus Torvalds @ 2016-01-12 21:37 UTC (permalink / raw) To: Andy Lutomirski Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Michael S. Tsirkin, Andy Lutomirski, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > Here's an article with numbers: > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ Well, that's with the busy loop and one set of code generation. It doesn't show the "oops, deeper stack isn't even in the cache any more due to call chains" issue. But yes: > I think they're suggesting using a negative offset, which is safe as > long as it doesn't page fault, even though we have the redzone > disabled. I think a negative offset might work very well. Partly exactly *because* we have the redzone disabled: we know that inside the kernel, we'll never have any live stack frame accesses under the stack pointer, so "-4(%rsp)" sounds good to me. There should never be any pending writes in the write buffer, because even if it *was* live, it would have been read off first. Yeah, it potentially does extend the stack cache footprint by another 4 bytes, but that sounds very benign. So perhaps it might be worth trying to switch the "mfence" to "lock ; addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate for x86-32. I'd still want to see somebody try to benchmark it. I doubt it's noticeable, but making changes because you think it might save a few cycles without then even measuring it is just wrong. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 21:37 ` Linus Torvalds @ 2016-01-12 22:14 ` Michael S. Tsirkin 2016-01-13 16:20 ` Michael S. Tsirkin 1 sibling, 0 replies; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 22:14 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Ingo Molnar On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > Here's an article with numbers: > > > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ > > Well, that's with the busy loop and one set of code generation. It > doesn't show the "oops, deeper stack isn't even in the cache any more > due to call chains" issue. > > But yes: > > > I think they're suggesting using a negative offset, which is safe as > > long as it doesn't page fault, even though we have the redzone > > disabled. > > I think a negative offset might work very well. Partly exactly > *because* we have the redzone disabled: we know that inside the > kernel, we'll never have any live stack frame accesses under the stack > pointer, so "-4(%rsp)" sounds good to me. There should never be any > pending writes in the write buffer, because even if it *was* live, it > would have been read off first. > > Yeah, it potentially does extend the stack cache footprint by another > 4 bytes, but that sounds very benign. > > So perhaps it might be worth trying to switch the "mfence" to "lock ; > addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate > for x86-32. > > I'd still want to see somebody try to benchmark it. I doubt it's > noticeable, but making changes because you think it might save a few > cycles without then even measuring it is just wrong. > > Linus Oops, I posted v2 with just offset 0 before reading the rest of this thread. I did try with offset 0 and didn't measure any change on any perf bench test, or on kernel build. I wonder which benchmark stresses smp_mb the most. I'll look into using a negative offset. -- MST ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 21:37 ` Linus Torvalds 2016-01-12 22:14 ` Michael S. Tsirkin @ 2016-01-13 16:20 ` Michael S. Tsirkin 1 sibling, 0 replies; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-13 16:20 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Ingo Molnar On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > Here's an article with numbers: > > > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ > > Well, that's with the busy loop and one set of code generation. It > doesn't show the "oops, deeper stack isn't even in the cache any more > due to call chains" issue. It's an interesting read, thanks! So sp is read on return from function I think. I added a function and sure enough, it slows the add 0(sp) variant down. It's still faster than mfence for me though! Testing code + results below. Reaching below stack, or allocating extra 4 bytes above the stack pointer gives us back the performance. > But yes: > > > I think they're suggesting using a negative offset, which is safe as > > long as it doesn't page fault, even though we have the redzone > > disabled. > > I think a negative offset might work very well. Partly exactly > *because* we have the redzone disabled: we know that inside the > kernel, we'll never have any live stack frame accesses under the stack > pointer, so "-4(%rsp)" sounds good to me. There should never be any > pending writes in the write buffer, because even if it *was* live, it > would have been read off first. > > Yeah, it potentially does extend the stack cache footprint by another > 4 bytes, but that sounds very benign. > > So perhaps it might be worth trying to switch the "mfence" to "lock ; > addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate > for x86-32. > > > I'd still want to see somebody try to benchmark it. I doubt it's > noticeable, but making changes because you think it might save a few > cycles without then even measuring it is just wrong. > > Linus I'll try this in the kernel now, will report, though I'm not optimistic a high level benchmark can show this kind of thing. --------------- main.c: --------------- extern volatile int x; volatile int x; #ifdef __x86_64__ #define SP "rsp" #else #define SP "esp" #endif #ifdef lock #define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0) #endif #ifdef locksp #define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory") #endif #ifdef lockrz #define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory") #endif #ifdef xchg #define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0) #endif #ifdef xchgrz /* same as xchg but poking at gcc red zone */ #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) #endif #ifdef mfence #define barrier() asm("mfence" ::: "memory") #endif #ifdef lfence #define barrier() asm("lfence" ::: "memory") #endif #ifdef sfence #define barrier() asm("sfence" ::: "memory") #endif void __attribute__ ((noinline)) test(int i, int *j) { /* * Test barrier in a loop. We also poke at a volatile variable in an * attempt to make it a bit more realistic - this way there's something * in the store-buffer. */ x = i - *j; barrier(); *j = x; } int main(int argc, char **argv) { int i; int j = 1234; for (i = 0; i < 10000000; ++i) test(i, &j); return 0; } --------------- ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence CC = gcc CFLAGS += -Wall -O2 -ggdb PERF = perf stat -r 10 --log-fd 1 -- TIME = /usr/bin/time -f %e FILTER = cat all: ${ALL} clean: rm -f ${ALL} run: all for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done perf time: run time: RUN=${TIME} perf: RUN=${PERF} perf: FILTER=grep elapsed .PHONY: all clean run perf time xchgrz: CFLAGS += -mno-red-zone ${ALL}: main.c ${CC} ${CFLAGS} -D$@ -o $@ main.c -------------------------------------------- perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed 0.080420565 seconds time elapsed ( +- 2.31% ) perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed 0.087798571 seconds time elapsed ( +- 2.58% ) perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed 0.083023724 seconds time elapsed ( +- 2.44% ) perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed 0.102880750 seconds time elapsed ( +- 0.13% ) perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed 0.084917420 seconds time elapsed ( +- 3.28% ) perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed 0.156014715 seconds time elapsed ( +- 0.16% ) perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed 0.077731443 seconds time elapsed ( +- 0.12% ) perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed 0.036655741 seconds time elapsed ( +- 0.21% ) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:59 ` Andy Lutomirski 2016-01-12 21:37 ` Linus Torvalds @ 2016-01-12 22:21 ` Michael S. Tsirkin 2016-01-12 22:55 ` H. Peter Anvin 1 sibling, 1 reply; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 22:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Andy Lutomirski, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Linus Torvalds, Ingo Molnar On Tue, Jan 12, 2016 at 12:59:58PM -0800, Andy Lutomirski wrote: > On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote: > >> > >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) > >> was better because it avoided stomping on very-likely-to-be-hot write > >> buffers. > > > > I suspect it could go either way. You want a small constant (for the > > isntruction size), but any small constant is likely to be within the > > current stack frame anyway. I don't think 0(%rsp) is particularly > > likely to have a spill on it right then and there, but who knows.. > > > > And 64(%rsp) is possibly going to be cold in the L1 cache, especially > > if it's just after a deep function call. Which it might be. So it > > might work the other way. > > > > So my guess would be that you wouldn't be able to measure the > > difference. It might be there, but probably too small to really see in > > any noise. > > > > But numbers talk, bullshit walks. It would be interesting to be proven wrong. > > Here's an article with numbers: > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ > > I think they're suggesting using a negative offset, which is safe as > long as it doesn't page fault, even though we have the redzone > disabled. > > --Andy OK so I'll have to tweak the test to put something on stack to measure the difference: my test tweaks a global variable instead. I'll try that by tomorrow. I couldn't measure any difference between mfence and lock+addl except in a micro-benchmark, but hey since we are tweaking this, let's do the optimal thing. -- MST ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 22:21 ` Michael S. Tsirkin @ 2016-01-12 22:55 ` H. Peter Anvin 2016-01-12 23:24 ` Linus Torvalds [not found] ` <CA+55aFz44oYK5GD3=-Y2E7bNwmJ0Ko5GaZ+iMQDamLALiw7RqQ@mail.gmail.com> 0 siblings, 2 replies; 20+ messages in thread From: H. Peter Anvin @ 2016-01-12 22:55 UTC (permalink / raw) To: Michael S. Tsirkin, Andy Lutomirski Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Andy Lutomirski, Thomas Gleixner, Paul E. McKenney, Linus Torvalds, Ingo Molnar On 01/12/16 14:21, Michael S. Tsirkin wrote: > > OK so I'll have to tweak the test to put something > on stack to measure the difference: my test tweaks a > global variable instead. > I'll try that by tomorrow. > > I couldn't measure any difference between mfence and lock+addl > except in a micro-benchmark, but hey since we are tweaking this, > let's do the optimal thing. > Be careful with this: if it only shows up in a microbenchmark, we may introduce a hard-to-debug regression for no real benefit. -hpa ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 22:55 ` H. Peter Anvin @ 2016-01-12 23:24 ` Linus Torvalds [not found] ` <CA+55aFz44oYK5GD3=-Y2E7bNwmJ0Ko5GaZ+iMQDamLALiw7RqQ@mail.gmail.com> 1 sibling, 0 replies; 20+ messages in thread From: Linus Torvalds @ 2016-01-12 23:24 UTC (permalink / raw) To: H. Peter Anvin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, Linux Kernel Mailing List, Michael S. Tsirkin, the arch/x86 maintainers, Andy Lutomirski, Andy Lutomirski, Paul E. McKenney, Thomas Gleixner, virtualization, Ingo Molnar On Tue, Jan 12, 2016 at 2:55 PM, H. Peter Anvin <hpa@zytor.com> wrote: > > Be careful with this: if it only shows up in a microbenchmark, we may > introduce a hard-to-debug regression for no real benefit. So I can pretty much guarantee that it shouldn't regress from a correctness angle, since we rely *heavily* on locked instructions being barriers, in locking and in various other situations. Indeed, much more so than we ever rely on "smp_mb()". The places that rely on smp_mb() are pretty few in the end. So I think the only issue is whether sometimes "mfence" might be faster. So far, I've never actually heard of that being the case. The fence instructions have always sucked when I've seen them. But talking to the hw people about this is certainly a good idea regardless. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <CA+55aFz44oYK5GD3=-Y2E7bNwmJ0Ko5GaZ+iMQDamLALiw7RqQ@mail.gmail.com>]
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() [not found] ` <CA+55aFz44oYK5GD3=-Y2E7bNwmJ0Ko5GaZ+iMQDamLALiw7RqQ@mail.gmail.com> @ 2016-01-13 16:17 ` Borislav Petkov 2016-01-13 16:25 ` Michael S. Tsirkin 0 siblings, 1 reply; 20+ messages in thread From: Borislav Petkov @ 2016-01-13 16:17 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, Linux Kernel Mailing List, Michael S. Tsirkin, the arch/x86 maintainers, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Ingo Molnar On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote: > But talking to the hw people about this is certainly a good idea regardless. I'm not seeing it in this thread but I might've missed it too. Anyway, I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't touch them. Do we care? -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:17 ` Borislav Petkov @ 2016-01-13 16:25 ` Michael S. Tsirkin 2016-01-13 16:33 ` Borislav Petkov 0 siblings, 1 reply; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-13 16:25 UTC (permalink / raw) To: Borislav Petkov Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Linus Torvalds, Ingo Molnar On Wed, Jan 13, 2016 at 05:17:04PM +0100, Borislav Petkov wrote: > On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote: > > But talking to the hw people about this is certainly a good idea regardless. > > I'm not seeing it in this thread but I might've missed it too. Anyway, > I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't > touch them. > > Do we care? Which flag do you refer to, exactly? > -- > Regards/Gruss, > Boris. > > ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:25 ` Michael S. Tsirkin @ 2016-01-13 16:33 ` Borislav Petkov 2016-01-13 16:42 ` Michael S. Tsirkin 0 siblings, 1 reply; 20+ messages in thread From: Borislav Petkov @ 2016-01-13 16:33 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Linus Torvalds, Ingo Molnar On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote: > Which flag do you refer to, exactly? All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:33 ` Borislav Petkov @ 2016-01-13 16:42 ` Michael S. Tsirkin 2016-01-13 16:53 ` Borislav Petkov 2016-01-13 18:38 ` Linus Torvalds 0 siblings, 2 replies; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-13 16:42 UTC (permalink / raw) To: Borislav Petkov Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Linus Torvalds, Ingo Molnar On Wed, Jan 13, 2016 at 05:33:31PM +0100, Borislav Petkov wrote: > On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote: > > Which flag do you refer to, exactly? > > All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF Oh, I think this means we need a "cc" clobber. This also seems to be a bug on !XMM CPUs. cmpxchg.h gets it right. I'll send a patch. > -- > Regards/Gruss, > Boris. > > ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:42 ` Michael S. Tsirkin @ 2016-01-13 16:53 ` Borislav Petkov 2016-01-13 17:00 ` Michael S. Tsirkin 2016-01-13 18:38 ` Linus Torvalds 1 sibling, 1 reply; 20+ messages in thread From: Borislav Petkov @ 2016-01-13 16:53 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Linus Torvalds, Ingo Molnar On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote: > Oh, I think this means we need a "cc" clobber. Btw, does your microbenchmark do it too? Because, the "cc" clobber should cause additional handling of flags, depending on the context. It won't matter if the context doesn't need rFLAGS handling in the benchmark but if we start using LOCK; ADD in the kernel, I can imagine some places where mb() is used and rFLAGS are live, causing gcc to either reorder code or stash them away... -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:53 ` Borislav Petkov @ 2016-01-13 17:00 ` Michael S. Tsirkin 0 siblings, 0 replies; 20+ messages in thread From: Michael S. Tsirkin @ 2016-01-13 17:00 UTC (permalink / raw) To: Borislav Petkov Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Linus Torvalds, Ingo Molnar On Wed, Jan 13, 2016 at 05:53:20PM +0100, Borislav Petkov wrote: > On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote: > > Oh, I think this means we need a "cc" clobber. > > Btw, does your microbenchmark do it too? Yes - I fixed it now, but it did not affect the result. We'd need some code where gcc carries flags around though. > Because, the "cc" clobber should cause additional handling of flags, > depending on the context. It won't matter if the context doesn't need > rFLAGS handling in the benchmark but if we start using LOCK; ADD in the > kernel, I can imagine some places where mb() is used and rFLAGS are > live, causing gcc to either reorder code or stash them away... It will reorder code but not necessarily for the worse :) Best I can do, I will add cc clobber to kernel and see whether binary size grows. > -- > Regards/Gruss, > Boris. > > ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:42 ` Michael S. Tsirkin 2016-01-13 16:53 ` Borislav Petkov @ 2016-01-13 18:38 ` Linus Torvalds 1 sibling, 0 replies; 20+ messages in thread From: Linus Torvalds @ 2016-01-13 18:38 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Borislav Petkov, Andy Lutomirski, H. Peter Anvin, Paul E. McKenney, Thomas Gleixner, virtualization, Ingo Molnar On Wed, Jan 13, 2016 at 8:42 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > Oh, I think this means we need a "cc" clobber. It's probably a good idea to add one. Historically, gcc doesn't need one on x86, and always considers flags clobbered. We are probably missing the cc clobber in a *lot* of places for this reason. But even if not necessary, it's probably a good thing to add for documentation, and in case gcc semantcs ever change. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2016-01-13 18:38 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1445975631-17047-1-git-send-email-dave@stgolabs.net>
[not found] ` <1445975631-17047-4-git-send-email-dave@stgolabs.net>
[not found] ` <CA+55aFw8d29E26o4eB8VgbZBC5Ot2y9K=T_yE5Dj0dFdyVgOUA@mail.gmail.com>
[not found] ` <20151027223744.GB11242@worktop.amr.corp.intel.com>
[not found] ` <20151102201535.GB1707@linux-uzut.site>
[not found] ` <CA+55aFynbkeuUGs9s-q+fLY6MeRBA6MjEyWWbbe7A5AaqsAknw@mail.gmail.com>
2016-01-12 13:57 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Michael S. Tsirkin
[not found] ` <20160112150032-mutt-send-email-mst@redhat.com>
2016-01-12 17:20 ` Linus Torvalds
2016-01-12 17:45 ` Michael S. Tsirkin
2016-01-12 18:04 ` Linus Torvalds
2016-01-12 20:30 ` Andy Lutomirski
2016-01-12 20:54 ` Linus Torvalds
2016-01-12 20:59 ` Andy Lutomirski
2016-01-12 21:37 ` Linus Torvalds
2016-01-12 22:14 ` Michael S. Tsirkin
2016-01-13 16:20 ` Michael S. Tsirkin
2016-01-12 22:21 ` Michael S. Tsirkin
2016-01-12 22:55 ` H. Peter Anvin
2016-01-12 23:24 ` Linus Torvalds
[not found] ` <CA+55aFz44oYK5GD3=-Y2E7bNwmJ0Ko5GaZ+iMQDamLALiw7RqQ@mail.gmail.com>
2016-01-13 16:17 ` Borislav Petkov
2016-01-13 16:25 ` Michael S. Tsirkin
2016-01-13 16:33 ` Borislav Petkov
2016-01-13 16:42 ` Michael S. Tsirkin
2016-01-13 16:53 ` Borislav Petkov
2016-01-13 17:00 ` Michael S. Tsirkin
2016-01-13 18:38 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox