* [RFC patch] cmpxchg_double: remove local variables to get better performance
From: Alex Shi @ 2012-03-02 8:31 UTC
To: tglx, hpa@zytor.com, mingo@redhat.com, x86@kernel.org, linux-kernel@vger.kernel.org, jeremy, jbeulich
Cc: Andi Kleen, asit.k.mallick@intel.com

There are some local variables in cmpxchg_double macro, seems these are
used to for force casting on input variables to transfer them into '*p1'
type. May there are some reason I don't know. But I just saw 2 problems
here:

1, user may mis-use the macro, like give a 'long' type o1, but just use
a 'int*' or 'char*' p1.
If we remove the force cast here, gcc will check the mis-using in
compiling. and user can get the error report in compiling for such
issues.

2, local variable increased the data section, and bring extra memory bus
accesses, that hurt performance in this critical macro.

I did a little experiment on my nhm i7 desktop, to run the macro with a
fixed times, here is the data:
                        using local vars    no local variable
with lock prefix,       267700578ns         232079696ns
without lock prefix,    34715666ns          34687566ns

So, we may need rethink about the local variable usage here.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index b3b7332..8bf9127 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -210,17 +210,15 @@ extern void __add_wrong_size(void)
 #define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2) \
 ({ \
 	bool __ret; \
-	__typeof__(*(p1)) __old1 = (o1), __new1 = (n1); \
-	__typeof__(*(p2)) __old2 = (o2), __new2 = (n2); \
 	BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long)); \
 	BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long)); \
 	VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long))); \
 	VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2)); \
 	asm volatile(pfx "cmpxchg%c4b %2; sete %0" \
-		     : "=a" (__ret), "+d" (__old2), \
+		     : "=a" (__ret), "+d" (o2), \
 		       "+m" (*(p1)), "+m" (*(p2)) \
-		     : "i" (2 * sizeof(long)), "a" (__old1), \
-		       "b" (__new1), "c" (__new2)); \
+		     : "i" (2 * sizeof(long)), "a" (o1), \
+		       "b" (n1), "c" (n2)); \
 	__ret; \
 })
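For scale, the timings quoted in the message above can be turned into relative savings. This is only a worked-arithmetic sketch; `percent_saved` is a hypothetical helper name, and the inputs are the four figures from the table, nothing measured independently.

```c
/*
 * Relative time saved, in percent, when a run drops from
 * before_ns to after_ns. Hypothetical helper for illustration.
 */
static double percent_saved(double before_ns, double after_ns)
{
	return (before_ns - after_ns) / before_ns * 100.0;
}
```

Fed with the quoted numbers, `percent_saved(267700578.0, 232079696.0)` comes to roughly 13.3 for the locked variant, while `percent_saved(34715666.0, 34687566.0)` stays below 0.1 for the variant without the lock prefix, so the reported difference shows up only in the locked case.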
* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
From: Jan Beulich @ 2012-03-02 8:54 UTC
To: Alex Shi
Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx, Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org, hpa@zytor.com

>>> On 02.03.12 at 09:31, Alex Shi <alex.shi@intel.com> wrote:
> There are some local variables in cmpxchg_double macro, seems these are
> used to for force casting on input variables to transfer them into '*p1'
> type. May there are some reason I don't know. But I just saw 2 problems
> here:
>
> 1, user may mis-use the macro, like give a 'long' type o1, but just use
> a 'int*' or 'char*' p1.

No - see the BUILD_BUG_ON()s right after the lines you suggest to
remove.

Further, it seems to be intentional to allow _compatible_ types for
o1 and o2 - you could pass in a literal number without L suffix here,
which I don't think you can anymore with the intermediate variable
removed.

> If we remove the force cast here, gcc will check the mis-using in
> compiling. and user can get the error report in compiling for such
> issues.
>
> 2, local variable increased the data section, and bring extra memory bus

These aren't static, so the data section can't possibly increase.

> accesses, that hurt performance in this critical macro.

With optimization enabled, the compiler should eliminate all unnecessary
intermediate variables.

> I did a little experiment on my nhm i7 desktop, to run the macro with a
> fixed times, here is the data:
>                         using local vars    no local variable
> with lock prefix,       267700578ns         232079696ns
> without lock prefix,    34715666ns          34687566ns
>
> So, we may need rethink about the local variable usage here.
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>

Sorry, but if this counts, this is a nack from me.

Jan

> diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
> index b3b7332..8bf9127 100644
> --- a/arch/x86/include/asm/cmpxchg.h
> +++ b/arch/x86/include/asm/cmpxchg.h
> @@ -210,17 +210,15 @@ extern void __add_wrong_size(void)
>  #define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2) \
>  ({ \
>  	bool __ret; \
> -	__typeof__(*(p1)) __old1 = (o1), __new1 = (n1); \
> -	__typeof__(*(p2)) __old2 = (o2), __new2 = (n2); \
>  	BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long)); \
>  	BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long)); \
>  	VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long))); \
>  	VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2)); \
>  	asm volatile(pfx "cmpxchg%c4b %2; sete %0" \
> -		     : "=a" (__ret), "+d" (__old2), \
> +		     : "=a" (__ret), "+d" (o2), \
>  		       "+m" (*(p1)), "+m" (*(p2)) \
> -		     : "i" (2 * sizeof(long)), "a" (__old1), \
> -		       "b" (__new1), "c" (__new2)); \
> +		     : "i" (2 * sizeof(long)), "a" (o1), \
> +		       "b" (n1), "c" (n2)); \
>  	__ret; \
>  })
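The two points made in the reply above (the size check that follows the temporaries, and the implicit conversion the typed temporaries perform) can be tried outside the kernel. What follows is a minimal userspace sketch, assuming GCC or Clang in C11 mode or later. `cas_typed` and `clear_if_equal` are hypothetical names, `_Static_assert` merely stands in for BUILD_BUG_ON(), and the `__atomic_compare_exchange_n` builtin replaces the inline asm, so this is not the kernel macro itself.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace analogue of the typed temporaries in
 * __cmpxchg_double(). It operates on a single long-sized word.
 */
#define cas_typed(ptr, old, new)                                          \
({                                                                        \
	/* Stand-in for BUILD_BUG_ON(): reject wrongly sized targets. */  \
	_Static_assert(sizeof(*(ptr)) == sizeof(long),                    \
		       "target must be long-sized");                      \
	/* Convert both values to the pointee type, like __old1/__new1. */ \
	__typeof__(*(ptr)) cas_old = (old), cas_new = (new);              \
	__atomic_compare_exchange_n((ptr), &cas_old, cas_new, false,      \
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);  \
})

/* Plain int literals are accepted: the temporary widens 0 to long. */
static bool clear_if_equal(long *counter, long expected)
{
	return cas_typed(counter, expected, 0);
}
```

In this sketch a call such as `cas_typed(&counter, 5, 0)` builds with unsuffixed literals, while pointing the macro at a `char` object would be rejected at compile time by the size check. Whether the temporaries leave any trace in the generated code depends on the optimization level, which is the question raised in the reply.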
* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
From: Alex Shi @ 2012-03-02 9:00 UTC
To: Jan Beulich
Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx, Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org, hpa@zytor.com

On Fri, 2012-03-02 at 08:54 +0000, Jan Beulich wrote:
> >>> On 02.03.12 at 09:31, Alex Shi <alex.shi@intel.com> wrote:
> > There are some local variables in cmpxchg_double macro, seems these are
> > used to for force casting on input variables to transfer them into '*p1'
> > type. May there are some reason I don't know. But I just saw 2 problems
> > here:
> >
> > 1, user may mis-use the macro, like give a 'long' type o1, but just use
> > a 'int*' or 'char*' p1.
>
> No - see the BUILD_BUG_ON()s right after the lines you suggest to
> remove.
>
> Further, it seems to be intentional to allow _compatible_ types for
> o1 and o2 - you could pass in a literal number without L suffix here,
> which I don't think you can anymore with the intermediate variable
> removed.

Yes, we can use cast for intermediate data. And actually, current kernel
has live mis-used case on cmpxchg(), that I plan to point out too.

-- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -203,12 +203,12 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)

 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
-	int dirty_count = kvm->tlbs_dirty;
+	long dirty_count = kvm->tlbs_dirty;

 	smp_mb();
 	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
 		++kvm->stat.remote_tlb_flush;
-	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
+	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0L);
 }

>
> > If we remove the force cast here, gcc will check the mis-using in
> > compiling. and user can get the error report in compiling for such
> > issues.
> >
> > 2, local variable increased the data section, and bring extra memory bus
>
> These aren't static, so the data section can't possibly increase.

sorry, it is text section increasing.

>
> > accesses, that hurt performance in this critical macro.
>
> With optimization enabled, the compiler should eliminate all unnecessary
> intermediate variables.

oh, I don't know now. I will recheck this point.

>
> > I did a little experiment on my nhm i7 desktop, to run the macro with a
> > fixed times, here is the data:
> >                         using local vars    no local variable
> > with lock prefix,       267700578ns         232079696ns
> > without lock prefix,    34715666ns          34687566ns
> >
> > So, we may need rethink about the local variable usage here.
> >
> > Signed-off-by: Alex Shi <alex.shi@intel.com>
>
> Sorry, but if this counts, this is a nack from me.
>
> Jan
>
> > diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
> > index b3b7332..8bf9127 100644
> > --- a/arch/x86/include/asm/cmpxchg.h
> > +++ b/arch/x86/include/asm/cmpxchg.h
> > @@ -210,17 +210,15 @@ extern void __add_wrong_size(void)
> >  #define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2) \
> >  ({ \
> >  	bool __ret; \
> > -	__typeof__(*(p1)) __old1 = (o1), __new1 = (n1); \
> > -	__typeof__(*(p2)) __old2 = (o2), __new2 = (n2); \
> >  	BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long)); \
> >  	BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long)); \
> >  	VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long))); \
> >  	VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2)); \
> >  	asm volatile(pfx "cmpxchg%c4b %2; sete %0" \
> > -		     : "=a" (__ret), "+d" (__old2), \
> > +		     : "=a" (__ret), "+d" (o2), \
> >  		       "+m" (*(p1)), "+m" (*(p2)) \
> > -		     : "i" (2 * sizeof(long)), "a" (__old1), \
> > -		       "b" (__new1), "c" (__new2)); \
> > +		     : "i" (2 * sizeof(long)), "a" (o1), \
> > +		       "b" (n1), "c" (n2)); \
> >  	__ret; \
> >  })
* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
From: Jan Beulich @ 2012-03-02 9:11 UTC
To: Alex Shi
Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx, Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org, hpa@zytor.com

>>> On 02.03.12 at 10:00, Alex Shi <alex.shi@intel.com> wrote:
> Yes, we can use cast for intermediate data. And actually, current kernel
> has live mis-used case on cmpxchg(), that I plan to point out too.
>
> -- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -203,12 +203,12 @@ static bool make_all_cpus_request(struct kvm *kvm,
> unsigned int req)
>
>  void kvm_flush_remote_tlbs(struct kvm *kvm)
>  {
> -	int dirty_count = kvm->tlbs_dirty;
> +	long dirty_count = kvm->tlbs_dirty;
>
>  	smp_mb();
>  	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
>  		++kvm->stat.remote_tlb_flush;
> -	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
> +	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0L);

Indeed - the cmpxchg would fail if the value doesn't fit. But this is not
to say that in certain cases it isn't valid to pass an int for the second
and/or third argument. (And quite likely the issue here is theoretical
only anyway.)

In particular, requiring an L suffix here on literals should be avoided.

Jan

>  }
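The remark that the cmpxchg "would fail if the value doesn't fit" can be reproduced in userspace. The sketch below assumes an LP64 target (64-bit long, 32-bit int) and the GCC/Clang `__atomic` builtins; the function names are hypothetical and only model the snapshot-then-reset pattern of kvm_flush_remote_tlbs() quoted above, with the counter taken to be a long as the proposed change implies.

```c
#include <stdbool.h>

/*
 * Snapshot the counter into an int, then try to reset it to zero.
 * If the counter exceeds the int range, the snapshot loses the upper
 * bits (implementation-defined conversion; GCC and Clang wrap), so
 * the compare no longer matches and the exchange does not happen.
 */
static bool reset_with_int_snapshot(long *counter)
{
	int snapshot = (int)*counter;   /* may drop the upper bits */
	long expected = snapshot;       /* widened back for the compare */

	return __atomic_compare_exchange_n(counter, &expected, 0L, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Same pattern with a full-width snapshot, as in the proposed change. */
static bool reset_with_long_snapshot(long *counter)
{
	long expected = *counter;

	return __atomic_compare_exchange_n(counter, &expected, 0L, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

With a small counter value both variants behave the same, which fits the remark that the issue is likely theoretical. Only once the counter no longer fits in an int does the first variant leave it untouched.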
* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
From: Alex Shi @ 2012-03-02 15:12 UTC
To: Jan Beulich
Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx, Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org, hpa@zytor.com

On 03/02/2012 05:11 PM, Jan Beulich wrote:
>>>> On 02.03.12 at 10:00, Alex Shi <alex.shi@intel.com> wrote:
>> Yes, we can use cast for intermediate data. And actually, current kernel
>> has live mis-used case on cmpxchg(), that I plan to point out too.
>>
>> -- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -203,12 +203,12 @@ static bool make_all_cpus_request(struct kvm *kvm,
>> unsigned int req)
>>
>>  void kvm_flush_remote_tlbs(struct kvm *kvm)
>>  {
>> -	int dirty_count = kvm->tlbs_dirty;
>> +	long dirty_count = kvm->tlbs_dirty;
>>
>>  	smp_mb();
>>  	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
>>  		++kvm->stat.remote_tlb_flush;
>> -	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
>> +	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0L);
>
> Indeed - the cmpxchg would fail if the value doesn't fit. But this is not
> to say that in certain cases it isn't valid to pass an int for the second
> and/or third argument. (And quite likely the issue here is theoretical
> only anyway.)

It may cause potential issue, if it is tlbs_dirty mis-used here, not
dirty_count. If so, it may cause data damage.

>
> In particular, requiring an L suffix here on literals should be avoided.

Even the each macro may save 0x40 bytes text, and bring more 10%
execution speed?

>
> Jan
>
>>  }
* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
From: Jan Beulich @ 2012-03-02 15:30 UTC
To: Alex Shi
Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx, Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org, hpa@zytor.com

>>> On 02.03.12 at 16:12, Alex Shi <alex.shi@intel.com> wrote:
> On 03/02/2012 05:11 PM, Jan Beulich wrote:
>> In particular, requiring an L suffix here on literals should be avoided.
>
>
> Even the each macro may save 0x40 bytes text, and bring more 10%
> execution speed?

Again - if you see meaningful text size differences with optimization
properly enabled, this may need investigation at the compiler end.

Jan
* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
From: Alex Shi @ 2012-03-03 6:03 UTC
To: Jan Beulich
Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx, Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org, hpa@zytor.com

On 03/02/2012 11:30 PM, Jan Beulich wrote:
>>>> On 02.03.12 at 16:12, Alex Shi <alex.shi@intel.com> wrote:
>> On 03/02/2012 05:11 PM, Jan Beulich wrote:
>>> In particular, requiring an L suffix here on literals should be avoided.
>>
>>
>> Even the each macro may save 0x40 bytes text, and bring more 10%
>> execution speed?
>
> Again - if you see meaningful text size differences with optimization
> properly enabled, this may need investigation at the compiler end.

Yes, you are right. I take back this patch.

>
> Jan
>