public inbox for linux-kernel@vger.kernel.org
* [RFC patch] cmpxchg_double: remove local variables to get better performance
@ 2012-03-02  8:31 Alex Shi
  2012-03-02  8:54 ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Shi @ 2012-03-02  8:31 UTC
  To: tglx, hpa@zytor.com, mingo@redhat.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, jeremy, jbeulich
  Cc: Andi Kleen, asit.k.mallick@intel.com

The cmpxchg_double macro uses some local variables, apparently to force
the input values into the type of '*p1'. There may be a reason for this
that I don't know about, but I see 2 problems here:

1, A user may misuse the macro, e.g. pass a 'long' o1 together with an
'int *' or 'char *' p1.
If we remove the forced cast, gcc will check for such misuse at compile
time, and the user will get an error report for it.

2, The local variables increase the data section and cause extra memory
bus accesses, which hurts performance in this critical macro.
I did a little experiment on my nhm i7 desktop, running the macro a
fixed number of times; here is the data:
			 using local vars         no local variable
with lock prefix,         267700578ns             232079696ns
without lock prefix,      34715666ns              34687566ns

So we may need to rethink the local variable usage here.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index b3b7332..8bf9127 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -210,17 +210,15 @@ extern void __add_wrong_size(void)
 #define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2)			\
 ({									\
 	bool __ret;							\
-	__typeof__(*(p1)) __old1 = (o1), __new1 = (n1);			\
-	__typeof__(*(p2)) __old2 = (o2), __new2 = (n2);			\
 	BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long));			\
 	BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long));			\
 	VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long)));		\
 	VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2));	\
 	asm volatile(pfx "cmpxchg%c4b %2; sete %0"			\
-		     : "=a" (__ret), "+d" (__old2),			\
+		     : "=a" (__ret), "+d" (o2),				\
 		       "+m" (*(p1)), "+m" (*(p2))			\
-		     : "i" (2 * sizeof(long)), "a" (__old1),		\
-		       "b" (__new1), "c" (__new2));			\
+		     : "i" (2 * sizeof(long)), "a" (o1),		\
+		       "b" (n1), "c" (n2));				\
 	__ret;								\
 })
 




* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
  2012-03-02  8:31 [RFC patch] cmpxchg_double: remove local variables to get better performance Alex Shi
@ 2012-03-02  8:54 ` Jan Beulich
  2012-03-02  9:00   ` Alex Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2012-03-02  8:54 UTC
  To: Alex Shi
  Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx,
	Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org,
	hpa@zytor.com

>>> On 02.03.12 at 09:31, Alex Shi <alex.shi@intel.com> wrote:
> There are some local variables in cmpxchg_double macro, seems these are
> used to for force casting on input variables to transfer them into '*p1'
> type. May there are some reason I don't know. But I just saw 2 problems
> here:
> 
> 1, user may mis-use the macro, like give a 'long' type o1, but just use
> a 'int*' or 'char*' p1.  

No - see the BUILD_BUG_ON()s right after the lines you suggest to
remove.

Further, it seems to be intentional to allow _compatible_ types for
o1 and o2 - you could pass in a literal number without L suffix here,
which I don't think you can anymore with the intermediate variable
removed.
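A minimal sketch of the type difference in question, assuming an LP64 target (such as x86-64 Linux) and the GCC extensions the kernel already relies on. The macro names WIDTH_WITH_LOCAL and WIDTH_DIRECT are hypothetical and exist only for illustration; they report the width of the operand that would end up bound to the asm constraint in each variant.

```c
#include <stddef.h>

/*
 * Hypothetical helper: with the intermediate variable, the operand takes
 * the type of *(p), whatever type the caller passed for o.
 */
#define WIDTH_WITH_LOCAL(p, o)				\
({							\
	__typeof__(*(p)) __old = (o);			\
	sizeof(__old);					\
})

/*
 * Hypothetical helper: without the intermediate variable, the operand
 * keeps the caller's type, so a plain literal stays an int.
 */
#define WIDTH_DIRECT(p, o)	(sizeof(o))
```

With a 'long *' p, WIDTH_WITH_LOCAL(p, 0) yields sizeof(long), while WIDTH_DIRECT(p, 0) yields sizeof(int). An int operand handed to a register constraint is a 32-bit operand, so a macro built that way presumably could not rely on the upper half of the register unless every caller wrote 0L or an explicit cast.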

> If we remove the force cast here, gcc will check the mis-using in
> compiling. and user can get the error report in compiling for such
> issues.
> 
> 2, local variable increased the data section, and bring extra memory bus

These aren't static, so the data section can't possibly increase.

> accesses, that hurt performance in this critical macro.

With optimization enabled, the compiler should eliminate all unnecessary
intermediate variables.
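One way to check this outside the kernel is sketched below. It is an assumption-laden stand-in, not the kernel macro: it uses the GCC builtin __sync_bool_compare_and_swap() on a single long instead of cmpxchg16b, and the names CAS_WITH_LOCALS, CAS_DIRECT, cas_with_locals and cas_direct are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the macro body: typed locals, then the CAS. */
#define CAS_WITH_LOCALS(p, o, n)				\
({								\
	__typeof__(*(p)) __old = (o), __new = (n);		\
	__sync_bool_compare_and_swap((p), __old, __new);	\
})

/* Same operation with the arguments passed straight through. */
#define CAS_DIRECT(p, o, n)					\
	__sync_bool_compare_and_swap((p), (o), (n))

/* Wrappers so the generated code of both variants can be compared. */
bool cas_with_locals(long *p, long o, long n)
{
	return CAS_WITH_LOCALS(p, o, n);
}

bool cas_direct(long *p, long o, long n)
{
	return CAS_DIRECT(p, o, n);
}
```

Building this with "gcc -O2 -S" and comparing the bodies of the two wrappers shows whether the locals leave any trace in the generated code; building with -O0 would be expected to show stack traffic for them, which is one possible source of a measured difference.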

> I did a little experiment on my nhm i7 desktop, to run the macro with a
> fixed times, here is the data:
> 			 using local vars         no local variable
> with lock prefix,         267700578ns             232079696ns
> without lock prefix,      34715666ns              34687566ns
> 
> So, we may need rethink about the local variable usage here. 
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>

Sorry, but if this counts, this is a nack from me.

Jan






* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
  2012-03-02  8:54 ` Jan Beulich
@ 2012-03-02  9:00   ` Alex Shi
  2012-03-02  9:11     ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Shi @ 2012-03-02  9:00 UTC
  To: Jan Beulich
  Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx,
	Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org,
	hpa@zytor.com

On Fri, 2012-03-02 at 08:54 +0000, Jan Beulich wrote:
> >>> On 02.03.12 at 09:31, Alex Shi <alex.shi@intel.com> wrote:
> > There are some local variables in cmpxchg_double macro, seems these are
> > used to for force casting on input variables to transfer them into '*p1'
> > type. May there are some reason I don't know. But I just saw 2 problems
> > here:
> > 
> > 1, user may mis-use the macro, like give a 'long' type o1, but just use
> > a 'int*' or 'char*' p1.  
> 
> No - see the BUILD_BUG_ON()s right after the lines you suggest to
> remove.
> 
> Further, it seems to be intentional to allow _compatible_ types for
> o1 and o2 - you could pass in a literal number without L suffix here,
> which I don't think you can anymore with the intermediate variable
> removed.

Yes, we can use a cast for intermediate data. And actually, the current
kernel has a live misuse of cmpxchg(), which I plan to point out too.

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -203,12 +203,12 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)

 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
-       int dirty_count = kvm->tlbs_dirty;
+       long dirty_count = kvm->tlbs_dirty;

        smp_mb();
        if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
                ++kvm->stat.remote_tlb_flush;
-       cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
+       cmpxchg(&kvm->tlbs_dirty, dirty_count, 0L);
 }

> 
> > If we remove the force cast here, gcc will check the mis-using in
> > compiling. and user can get the error report in compiling for such
> > issues.
> > 
> > 2, local variable increased the data section, and bring extra memory bus
> 
> These aren't static, so the data section can't possibly increase.

Sorry, I meant that the text section increases.
> 
> > accesses, that hurt performance in this critical macro.
> 
> With optimization enabled, the compiler should eliminate all unnecessary
> intermediate variables.

Oh, I am not sure now. I will recheck this point.
> 
> > I did a little experiment on my nhm i7 desktop, to run the macro with a
> > fixed times, here is the data:
> > 			 using local vars         no local variable
> > with lock prefix,         267700578ns             232079696ns
> > without lock prefix,      34715666ns              34687566ns
> > 
> > So, we may need rethink about the local variable usage here. 
> > 
> > Signed-off-by: Alex Shi <alex.shi@intel.com>
> 
> Sorry, but if this counts, this is a nack from me.
> 
> Jan
> 




* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
  2012-03-02  9:00   ` Alex Shi
@ 2012-03-02  9:11     ` Jan Beulich
  2012-03-02 15:12       ` Alex Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2012-03-02  9:11 UTC
  To: Alex Shi
  Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx,
	Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org,
	hpa@zytor.com

>>> On 02.03.12 at 10:00, Alex Shi <alex.shi@intel.com> wrote:
> Yes, we can use cast for intermediate data. And actually, current kernel
> has live mis-used case on cmpxchg(), that I plan to point out too. 
> 
> -- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -203,12 +203,12 @@ static bool make_all_cpus_request(struct kvm *kvm, 
> unsigned int req)
> 
>  void kvm_flush_remote_tlbs(struct kvm *kvm)
>  {
> -       int dirty_count = kvm->tlbs_dirty;
> +       long dirty_count = kvm->tlbs_dirty;
> 
>         smp_mb();
>         if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
>                 ++kvm->stat.remote_tlb_flush;
> -       cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
> +       cmpxchg(&kvm->tlbs_dirty, dirty_count, 0L);

Indeed - the cmpxchg would fail if the value doesn't fit. But this is not
to say that in certain cases it isn't valid to pass an int for the second
and/or third argument. (And quite likely the issue here is theoretical
only anyway.)
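The failure mode can be sketched in user space as follows. This is a model under stated assumptions, not the KVM code: it assumes an LP64 target and GCC's wrapping behaviour when a long is narrowed to int, it uses the builtin __sync_bool_compare_and_swap() in place of the kernel's cmpxchg(), and the function names are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the pattern: snapshot a long counter into an int,
 * then reset the counter only if it still equals the snapshot.
 */
bool reset_with_int_snapshot(long *counter)
{
	int snapshot = (int)*counter;	/* narrows; high bits are lost */

	/* The snapshot is widened back to long for the comparison. */
	return __sync_bool_compare_and_swap(counter, (long)snapshot, 0L);
}

/* Same pattern with a snapshot as wide as the counter. */
bool reset_with_long_snapshot(long *counter)
{
	long snapshot = *counter;

	return __sync_bool_compare_and_swap(counter, snapshot, 0L);
}
```

For a counter value that fits in an int, both variants reset it. For a value such as (1L << 32) + 5, the int snapshot holds 5, the comparison fails, and the counter is left unchanged; nothing is overwritten, which matches the observation that the cmpxchg simply fails.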

In particular, requiring an L suffix here on literals should be avoided.

Jan

>  }




* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
  2012-03-02  9:11     ` Jan Beulich
@ 2012-03-02 15:12       ` Alex Shi
  2012-03-02 15:30         ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Shi @ 2012-03-02 15:12 UTC
  To: Jan Beulich
  Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx,
	Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org,
	hpa@zytor.com

On 03/02/2012 05:11 PM, Jan Beulich wrote:

>>>> On 02.03.12 at 10:00, Alex Shi <alex.shi@intel.com> wrote:
>> Yes, we can use cast for intermediate data. And actually, current kernel
>> has live mis-used case on cmpxchg(), that I plan to point out too. 
>>
>> -- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -203,12 +203,12 @@ static bool make_all_cpus_request(struct kvm *kvm, 
>> unsigned int req)
>>
>>  void kvm_flush_remote_tlbs(struct kvm *kvm)
>>  {
>> -       int dirty_count = kvm->tlbs_dirty;
>> +       long dirty_count = kvm->tlbs_dirty;
>>
>>         smp_mb();
>>         if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
>>                 ++kvm->stat.remote_tlb_flush;
>> -       cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
>> +       cmpxchg(&kvm->tlbs_dirty, dirty_count, 0L);
> 
> Indeed - the cmpxchg would fail if the value doesn't fit. But this is not
> to say that in certain cases it isn't valid to pass an int for the second
> and/or third argument. (And quite likely the issue here is theoretical
> only anyway.)


It could cause a real issue if it is tlbs_dirty that is misused here
rather than dirty_count. If so, it may corrupt data.

> 
> In particular, requiring an L suffix here on literals should be avoided.


Even if each use of the macro may save 0x40 bytes of text and run more
than 10% faster?

> 
> Jan
> 
>>  }
> 
> 




* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
  2012-03-02 15:12       ` Alex Shi
@ 2012-03-02 15:30         ` Jan Beulich
  2012-03-03  6:03           ` Alex Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2012-03-02 15:30 UTC
  To: Alex Shi
  Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx,
	Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org,
	hpa@zytor.com

>>> On 02.03.12 at 16:12, Alex Shi <alex.shi@intel.com> wrote:
> On 03/02/2012 05:11 PM, Jan Beulich wrote:
>> In particular, requiring an L suffix here on literals should be avoided.
> 
> 
> Even the each macro may save 0x40 bytes text, and bring more 10%
> execution speed?

Again - if you see meaningful text size differences with optimization
properly enabled, this may need investigation at the compiler end.

Jan



* Re: [RFC patch] cmpxchg_double: remove local variables to get better performance
  2012-03-02 15:30         ` Jan Beulich
@ 2012-03-03  6:03           ` Alex Shi
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Shi @ 2012-03-03  6:03 UTC
  To: Jan Beulich
  Cc: jeremy, asit.k.mallick@intel.com, x86@kernel.org, tglx,
	Andi Kleen, mingo@redhat.com, linux-kernel@vger.kernel.org,
	hpa@zytor.com

On 03/02/2012 11:30 PM, Jan Beulich wrote:

>>>> On 02.03.12 at 16:12, Alex Shi <alex.shi@intel.com> wrote:
>> On 03/02/2012 05:11 PM, Jan Beulich wrote:
>>> In particular, requiring an L suffix here on literals should be avoided.
>>
>>
>> Even the each macro may save 0x40 bytes text, and bring more 10%
>> execution speed?
> 
> Again - if you see meaningful text size differences with optimization
> properly enabled, this may need investigation at the compiler end.


Yes, you are right.
I withdraw this patch.

> 
> Jan
> 


