public inbox for linux-ia64@vger.kernel.org
* [RFC Patch]Use ar.kr2 for smp_processor_id
@ 2007-02-08  3:28 Zou Nan hai
  2007-02-08  4:27 ` Zou Nan hai
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08  3:28 UTC (permalink / raw)
  To: linux-ia64

Pin ar.kr2 on each CPU so that smp_processor_id() can read it.
This reduces the memory footprint of each smp_processor_id() call,
since the CPU number no longer has to be loaded from thread_info.

It is also useful for implementing sys_getcpu in the fast path.

I tested the patch lightly by booting a 16p system and then taking
some CPUs offline and back online through /sys/.
 

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>


diff -Nraup linux-2.6.20/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
--- linux-2.6.20/arch/ia64/kernel/setup.c	2007-02-04 13:44:54.000000000 -0500
+++ b/arch/ia64/kernel/setup.c	2007-02-07 23:30:03.000000000 -0500
@@ -458,6 +458,9 @@ early_param("elfcorehdr", parse_elfcoreh
 void __init
 setup_arch (char **cmdline_p)
 {
+	/* setup SMP processor id */
+	ia64_set_kr(IA64_KR_CPU_ID, (current_thread_info()->cpu));
+
 	unw_init();
 
 	ia64_patch_vtop((u64) __start___vtop_patchlist, (u64) __end___vtop_patchlist);
diff -Nraup linux-2.6.20/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- linux-2.6.20/arch/ia64/kernel/smpboot.c	2007-02-04 13:44:54.000000000 -0500
+++ b/arch/ia64/kernel/smpboot.c	2007-02-07 23:29:44.000000000 -0500
@@ -445,8 +445,12 @@ smp_callin (void)
 int __devinit
 start_secondary (void *unused)
 {
+	/* setup SMP processor id */
+	ia64_set_kr(IA64_KR_CPU_ID, (current_thread_info()->cpu));
+	
 	/* Early console may use I/O ports */
 	ia64_set_kr(IA64_KR_IO_BASE, __pa(ia64_iobase));
+
 	Dprintk("start_secondary: starting CPU 0x%x\n", hard_smp_processor_id());
 	efi_map_pal_code();
 	cpu_init();
diff -Nraup linux-2.6.20/include/asm-ia64/kregs.h b/include/asm-ia64/kregs.h
--- linux-2.6.20/include/asm-ia64/kregs.h	2007-02-04 13:44:54.000000000 -0500
+++ b/include/asm-ia64/kregs.h	2007-02-07 23:28:21.000000000 -0500
@@ -14,6 +14,7 @@
  */
 #define IA64_KR_IO_BASE		0	/* ar.k0: legacy I/O base address */
 #define IA64_KR_TSSD		1	/* ar.k1: IVE uses this as the TSSD */
+#define IA64_KR_CPU_ID		2	/* ar.k2: Processor ID */
 #define IA64_KR_PER_CPU_DATA	3	/* ar.k3: physical per-CPU base */
 #define IA64_KR_CURRENT_STACK	4	/* ar.k4: what's mapped in IA64_TR_CURRENT_STACK */
 #define IA64_KR_FPU_OWNER	5	/* ar.k5: fpu-owner (UP only, at the moment) */
diff -Nraup linux-2.6.20/include/asm-ia64/smp.h b/include/asm-ia64/smp.h
--- linux-2.6.20/include/asm-ia64/smp.h	2007-02-04 13:44:54.000000000 -0500
+++ b/include/asm-ia64/smp.h	2007-02-08 00:53:58.000000000 -0500
@@ -45,7 +45,7 @@ ia64_get_lid (void)
 #define SMP_IRQ_REDIRECTION	(1 << 0)
 #define SMP_IPI_REDIRECTION	(1 << 1)
 
-#define raw_smp_processor_id() (current_thread_info()->cpu)
+#define raw_smp_processor_id() (ia64_get_kr(IA64_KR_CPU_ID))
 
 extern struct smp_boot_data {
 	int cpu_count;




  


* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
@ 2007-02-08  4:27 ` Zou Nan hai
  2007-02-08  4:59 ` Zou Nan hai
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08  4:27 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
> 
> Historically ar.k2 has been reserved for debugging purposes, for
> example in ivt.S.  Debuggers often need a location that can be used to
> track progress, it has to be somewhere that does not rely on TLB
> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
> perfect for this.
> 
  OK, so it seems that kr3 is currently used only by ia64_itc_printk_clock?
> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
> be unique on every cpu) and caching the corresponding cpu number when
> it changes.
> 
  But why do we need to cache it at all?

  If we put it in kr3, it is already in a register, so
  smp_processor_id() could be very fast, and later sys_getcpu could
  be very fast too.
 

  Thanks
  Zou Nan hai


* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
  2007-02-08  4:27 ` Zou Nan hai
@ 2007-02-08  4:59 ` Zou Nan hai
  2007-02-08  5:11 ` Zou Nan hai
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08  4:59 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2007-02-08 at 14:37, Keith Owens wrote:
> Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
> >On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
> >> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
> >> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
> >> 
> >> Historically ar.k2 has been reserved for debugging purposes, for
> >> example in ivt.S.  Debuggers often need a location that can be used
> >> to track progress, it has to be somewhere that does not rely on TLB
> >> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
> >> perfect for this.
> >> 
> >  Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
> >> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed
> >> to be unique on every cpu) and caching the corresponding cpu number
> >> when it changes.
> >> 
> >  But why do we even need to cache it? 
> >
> >  It is already in a register if we put it to kr3. 
> >  so smp_processor_id() could be very fast. and later sys_getcpu can
> >also be very fast.
> 
> ar.k3 is currently used for the address of the per-cpu data area,
> which speeds up access to all the per-cpu data.  Changing ar.k3 to hold
> the cpu number means an extra array calculation and lookup for every
> per-cpu variable, slowing down the rest of the system.
> 

  Are you sure ar.k3 is used for per-cpu data access? A disassembly of
vmlinux shows that only the MCA code and ia64_itc_printk_clock use
ar.k3.

Thanks
Zou Nan hai



* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
  2007-02-08  4:27 ` Zou Nan hai
  2007-02-08  4:59 ` Zou Nan hai
@ 2007-02-08  5:11 ` Zou Nan hai
  2007-02-08  6:04 ` Keith Owens
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08  5:11 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2007-02-08 at 14:55, Keith Owens wrote:
> Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
> >Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
> >>On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
> >>> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
> >>> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
> >>> 
> >>> Historically ar.k2 has been reserved for debugging purposes, for
> >>> example in ivt.S.  Debuggers often need a location that can be used
> >>> to track progress, it has to be somewhere that does not rely on TLB
> >>> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
> >>> perfect for this.
> >>> 
> >>  Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
> >>> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed
> >>> to be unique on every cpu) and caching the corresponding cpu number
> >>> when it changes.
> >>> 
> >>  But why do we even need to cache it? 
> >>
> >>  It is already in a register if we put it to kr3. 
> >>  so smp_processor_id() could be very fast. and later sys_getcpu can
> >>also be very fast.
> >
> >ar.k3 is currently used for the address of the per-cpu data area, which
> >speeds up access to all the per-cpu data.  Changing ar.k3 to hold the
> >cpu number means an extra array calculation and lookup for every
> >per-cpu variable, slowing down the rest of the system.
> 
> Correction: ar.k3 contains the physical address of the per-cpu data
> area, virtual access to per-cpu data goes via the cpu local TLB and
> does not rely on an ar.k<n> variable.  ar.k3 is used in the MCA
> assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h
> and arch/ia64/kernel/mca_asm.S.
> 

 Since MCA is a slow path, I think putting smp_processor_id in ar.kr3
 is a gain.

 We could even optimize get_cpu_var based on this...

 Thanks
 Zou Nan hai


* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (2 preceding siblings ...)
  2007-02-08  5:11 ` Zou Nan hai
@ 2007-02-08  6:04 ` Keith Owens
  2007-02-08  6:37 ` Keith Owens
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08  6:04 UTC (permalink / raw)
  To: linux-ia64

Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
>Pin ar.kr2 of each CPU, so that smp_processor_id can use it.

Historically ar.k2 has been reserved for debugging purposes, for
example in ivt.S.  Debuggers often need a location that can be used to
track progress.  It has to be somewhere that does not rely on TLB
entries and is guaranteed to appear in MCA/INIT records; ar.k2 is
perfect for this.

Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
be unique on every cpu) and caching the corresponding cpu number when
it changes.
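
A rough user-space sketch of that check-and-cache idea, as I read it.
This is only a simulation under stated assumptions: fake_k3 stands in
for the value read from ar.k3, and read_unique_key(), slow_lookup_cpu()
and struct cpu_cache are hypothetical names that exist in no kernel or
library. The shift by 16 is an arbitrary mapping chosen for the
simulation, not how a physical per-cpu address maps to a cpu number.

```c
#include <stdint.h>

/* Simulated per-CPU unique register value (stand-in for ar.k3). */
static uint64_t fake_k3;
/* Counts how often the slow path runs, to show the cache working. */
static unsigned slow_lookups;

/* Hypothetical: cheap read of the per-CPU unique value. */
static uint64_t read_unique_key(void)
{
	return fake_k3;
}

/* Hypothetical: expensive mapping from the unique value to a cpu number. */
static int slow_lookup_cpu(uint64_t key)
{
	slow_lookups++;
	return (int)(key >> 16);	/* arbitrary mapping for the simulation */
}

struct cpu_cache {
	uint64_t key;	/* unique value seen at the last slow lookup */
	int cpu;	/* cpu number cached for that value */
	int valid;	/* 0 until the first lookup has been done */
};

/* Return the cpu number, redoing the slow lookup only when the key changed. */
static int cached_cpu(struct cpu_cache *c)
{
	uint64_t key = read_unique_key();

	if (!c->valid || c->key != key) {
		c->cpu = slow_lookup_cpu(key);
		c->key = key;
		c->valid = 1;
	}
	return c->cpu;
}
```

As long as the caller stays on one cpu, the slow lookup runs once; it
runs again only after the unique value changes.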



* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (3 preceding siblings ...)
  2007-02-08  6:04 ` Keith Owens
@ 2007-02-08  6:37 ` Keith Owens
  2007-02-08  6:55 ` Keith Owens
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08  6:37 UTC (permalink / raw)
  To: linux-ia64

Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
>On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
>> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
>> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
>> 
>> Historically ar.k2 has been reserved for debugging purposes, for
>> example in ivt.S.  Debuggers often need a location that can be used to
>> track progress, it has to be somewhere that does not rely on TLB
>> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
>> perfect for this.
>> 
>  Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
>> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
>> be unique on every cpu) and caching the corresponding cpu number when
>> it changes.
>> 
>  But why do we even need to cache it? 
>
>  It is already in a register if we put it to kr3. 
>  so smp_processor_id() could be very fast. and later sys_getcpu can
>also be very fast.

ar.k3 is currently used for the address of the per-cpu data area, which
speeds up access to all the per-cpu data.  Changing ar.k3 to hold the
cpu number means an extra array calculation and lookup for every
per-cpu variable, slowing down the rest of the system.



* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (4 preceding siblings ...)
  2007-02-08  6:37 ` Keith Owens
@ 2007-02-08  6:55 ` Keith Owens
  2007-02-08  7:14 ` Zou Nan hai
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08  6:55 UTC (permalink / raw)
  To: linux-ia64

Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
>Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
>>On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
>>> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
>>> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
>>> 
>>> Historically ar.k2 has been reserved for debugging purposes, for
>>> example in ivt.S.  Debuggers often need a location that can be used to
>>> track progress, it has to be somewhere that does not rely on TLB
>>> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
>>> perfect for this.
>>> 
>>  Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
>>> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
>>> be unique on every cpu) and caching the corresponding cpu number when
>>> it changes.
>>> 
>>  But why do we even need to cache it? 
>>
>>  It is already in a register if we put it to kr3. 
>>  so smp_processor_id() could be very fast. and later sys_getcpu can
>>also be very fast.
>
>ar.k3 is currently used for the address of the per-cpu data area, which
>speeds up access to all the per-cpu data.  Changing ar.k3 to hold the
>cpu number means an extra array calculation and lookup for every
>per-cpu variable, slowing down the rest of the system.

Correction: ar.k3 contains the physical address of the per-cpu data
area, virtual access to per-cpu data goes via the cpu local TLB and
does not rely on an ar.k<n> variable.  ar.k3 is used in the MCA
assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h and
arch/ia64/kernel/mca_asm.S.



* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (5 preceding siblings ...)
  2007-02-08  6:55 ` Keith Owens
@ 2007-02-08  7:14 ` Zou Nan hai
  2007-02-08  7:38 ` Zou Nan hai
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08  7:14 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2007-02-08 at 16:40, Keith Owens wrote:
> Zou Nan hai (on 08 Feb 2007 13:11:49 +0800) wrote:
> >On Thu, 2007-02-08 at 14:55, Keith Owens wrote:
> >> Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
> >> Correction: ar.k3 contains the physical address of the per-cpu data
> >> area, virtual access to per-cpu data goes via the cpu local TLB and
> >> does not rely on an ar.k<n> variable.  ar.k3 is used in the MCA
> >> assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h
> >> and arch/ia64/kernel/mca_asm.S.
> >> 
> >
> > Since MCA is slow path, 
> > so I think put smp_processor_id in ar.kr3 is a gain.
> >
> > We could even optimize get_cpu_var based on this...
> 
> (1) Somebody else (not me) gets to fix up and test the MCA handler
>     assembler code - lots of luck.
> 
> (2) smp_processor_id() in the IA64 kernel is accessed via struct
>     thread_info.cpu.  That maps to a simple memory access with code
>     like this:
> 
>        adds r14=3252,r13
>        ;;
>        ld4 r15=[r14]
> 
>     The stop bits usually get amortized away with other code.
>     thread_info.cpu will normally be cached in L1 so reading
>     smp_processor_id() is relatively fast.
> 
> (3) Reading smp_processor_id() from ar.k3 in the kernel is 10 times
>     slower than the existing kernel code.  See the timing program
>     below.
> 
> (4) If the justification for storing cpu number in ar.k<n> is to speed
>     up user space, how can user space tell if the current kernel stores
>     the physical address of the per-cpu data in k3 or if it stores the
>     cpu number in k3?  Detecting which variant of the kernel is running
>     will slow down user space.
> 
> 
> Timing results on 'modprobe measure'
> 
> init_measure: empty_loop 2000007 cpu_loop 3000011 k3_loop 11999992
> 
> module measure.c
> 
> -----------------------------------------------------------------------
> 
> #include <linux/init.h>
> #include <linux/kernel.h>
> #include <linux/module.h>
> #include <linux/preempt.h>
> #include <asm/kregs.h>
> #include <asm/timex.h>
> 
> MODULE_LICENSE("GPL");
> 
> #define LOOPS 1000000
> 
> static int __init init_measure(void)
> {
>         int loop;
>         register int cpu;
>         unsigned long start, end, empty_loop, cpu_loop, k3_loop;
>         printk("%s: start\n", __FUNCTION__);
>         preempt_disable();
> 
>         local_irq_disable();
>         start = get_cycles();
>         barrier();
>         for (loop = 0; loop < LOOPS; ++loop) {
>                 /* ensure that all loops are the same size (2 bundles) */
>                 asm volatile ("nop 0; nop 0; nop 0;");
>                 barrier();
>         };
>         end = get_cycles();
>         barrier();
>         local_irq_enable();
>         empty_loop = end - start;
> 
>         local_irq_disable();
>         start = get_cycles();
>         barrier();
>         for (loop = 0; loop < LOOPS; ++loop) {
>                 /* hand code the read of smp_processor_id() to stop gcc
>                  * moving the address calculation outside the loop
>                  */
>                 asm volatile ("adds r14=%0,r13"
>                               ";;"
>                               "ld4 r15=[r14]"
>                               : :
>                               "i" (IA64_TASK_SIZE + offsetof(struct thread_info, cpu)) :
>                               "r14", "r15" );
>                 barrier();
>         };
>         end = get_cycles();
>         barrier();
>         local_irq_enable();
>         cpu_loop = end - start;
> 
>         local_irq_disable();
>         start = get_cycles();
>         barrier();
>         for (loop = 0; loop < LOOPS; ++loop) {
>                 cpu = ia64_get_kr(IA64_KR_PER_CPU_DATA);
>                 barrier();
>         };
>         end = get_cycles();
>         barrier();
>         local_irq_enable();
>         k3_loop = end - start;
> 
>         preempt_enable();
>         printk("%s: empty_loop %ld cpu_loop %ld k3_loop %ld\n", __FUNCTION__, empty_loop, cpu_loop, k3_loop);
>         return 0;
> }
> 
> static void __exit exit_measure(void)
> {
>         printk("%s: start\n", __FUNCTION__);
>         printk("%s: end\n", __FUNCTION__);
> }
> 
> module_init(init_measure)
> module_exit(exit_measure)
> 
> -----------------------------------------------------------------------
> 
> objdump of the interesting bits (the three loops):
> 
> empty loop:
> 
>   40:   09 08 00 50 00 21       [MMI]       mov r1=r40
>   46:   00 00 00 02 00 e0                   nop.m 0x0
>   4c:   81 6c 64 84                         adds r15=3272,r13;;
>   50:   0a 18 00 1e 10 10       [MMI]       ld4 r3=[r15];;
>   56:   20 08 0c 00 42 00                   adds r2=1,r3
>   5c:   00 00 04 00                         nop.i 0x0
>   60:   0b 00 00 00 01 00       [MMI]       nop.m 0x0;;
>   66:   00 10 3c 20 23 00                   st4 [r15]=r2
>   6c:   00 00 04 00                         nop.i 0x0;;
>   70:   0b 00 00 02 07 00       [MMI]       rsm 0x4000;;
>   76:   50 02 b0 44 08 00                   mov.m r37=ar.itc
>   7c:   00 00 04 00                         nop.i 0x0;;
>   80:   0b 70 fc 78 84 24       [MMI]       mov r14=999999;;
>   86:   00 00 00 02 00 00                   nop.m 0x0
>   8c:   e0 08 aa 00                         mov.i ar.lc=r14;;
>   90:   01 00 00 00 01 00       [MII]       nop.m 0x0
>   96:   00 00 00 02 00 00                   nop.i 0x0
>   9c:   00 00 04 00                         nop.i 0x0;;
>   a0:   10 00 00 00 01 00       [MIB]       nop.m 0x0
>   a6:   00 00 00 02 00 a0                   nop.i 0x0
>   ac:   f0 ff ff 48                         br.cloop.sptk.few 90 <init_module+0x90>
>   b0:   0b 20 01 58 22 04       [MMI]       mov.m r36=ar.itc;;
>   b6:   00 00 04 0c 00 00                   ssm 0x4000
>   bc:   00 00 04 00                         nop.i 0x0;;
>   c0:   0b 00 00 00 30 00       [MMI]       srlz.d;;
> 
> Read smp_processor_id:
> 
>   c6:   00 00 04 0e 00 00                   rsm 0x4000
>   cc:   00 00 04 00                         nop.i 0x0;;
>   d0:   01 18 01 58 22 04       [MII]       mov.m r35=ar.itc
>   d6:   00 00 00 02 00 00                   nop.i 0x0
>   dc:   00 00 04 00                         nop.i 0x0;;
>   e0:   0a 40 fc 78 84 24       [MMI]       mov r8=999999;;
>   e6:   00 00 00 02 00 00                   nop.m 0x0
>   ec:   80 08 aa 00                         mov.i ar.lc=r8
>   f0:   0b 70 d0 1a 19 21       [MMI]       adds r14=3252,r13;;
>   f6:   f0 00 38 20 20 00                   ld4 r15=[r14]
>   fc:   00 00 04 00                         nop.i 0x0;;
>  100:   10 00 00 00 01 00       [MIB]       nop.m 0x0
>  106:   00 00 00 02 00 a0                   nop.i 0x0
>  10c:   f0 ff ff 48                         br.cloop.sptk.few f0 <init_module+0xf0>
>  110:   0b 10 01 58 22 04       [MMI]       mov.m r34=ar.itc;;
>  116:   00 00 04 0c 00 00                   ssm 0x4000
>  11c:   00 00 04 00                         nop.i 0x0;;
>  120:   0b 00 00 00 30 00       [MMI]       srlz.d;;
> 
> Read ar.k3:
> 
>  126:   00 00 04 0e 00 00                   rsm 0x4000
>  12c:   00 00 04 00                         nop.i 0x0;;
>  130:   01 08 01 58 22 04       [MII]       mov.m r33=ar.itc
>  136:   00 00 00 02 00 00                   nop.i 0x0
>  13c:   00 00 04 00                         nop.i 0x0;;
>  140:   0a 48 fc 78 84 24       [MMI]       mov r9=999999;;
>  146:   00 00 00 02 00 00                   nop.m 0x0
>  14c:   90 08 aa 00                         mov.i ar.lc=r9
>  150:   01 70 00 06 22 04       [MII]       mov.m r14=ar.k3
>  156:   00 00 00 02 00 00                   nop.i 0x0
>  15c:   00 00 04 00                         nop.i 0x0;;
>  160:   10 00 00 00 01 00       [MIB]       nop.m 0x0
>  166:   00 00 00 02 00 a0                   nop.i 0x0
>  16c:   f0 ff ff 48                         br.cloop.sptk.few 150 <init_module+0x150>
>  170:   0b 00 01 58 22 04       [MMI]       mov.m r32=ar.itc;;
>  176:   00 00 04 0c 00 00                   ssm 0x4000
>  17c:   00 00 04 00                         nop.i 0x0;;
>  180:   01 00 00 00 30 00       [MII]       srlz.d
> 
 
OK,
  I think that using a static value to cache getcpu will cause heavy
bouncing of the cache line that contains the static value if multiple
CPUs call getcpu very frequently.

  Would implementing current_thread_info()->cpu in an fsys call then
be better?

Thanks
Zou Nan hai
  

 


* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (6 preceding siblings ...)
  2007-02-08  7:14 ` Zou Nan hai
@ 2007-02-08  7:38 ` Zou Nan hai
  2007-02-08  8:28 ` peterc
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08  7:38 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2007-02-08 at 15:14, Zou Nan hai wrote:
> Ok, 
>   I think using a static value to cache getcpu will heavily bounced on
> that cache line contain the static value if multi cpus calls getcpu
> very frequently. 
> 
>   then implement current_thread_info()->cpu in fsys call should be
> better?
> 
> Thanks
> Zou Nan hai
>   
> 
  Maybe it would be better to let glibc cache the CPU ID and node ID in
thread-local storage?

Zou Nan hai

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (7 preceding siblings ...)
  2007-02-08  7:38 ` Zou Nan hai
@ 2007-02-08  8:28 ` peterc
  2007-02-08  8:40 ` Keith Owens
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: peterc @ 2007-02-08  8:28 UTC (permalink / raw)
  To: linux-ia64

>>>>> "Zou" == Zou Nan hai <nanhai.zou@intel.com> writes:

Zou> Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
Zou> This will save some memory foot-print when smp_processor_id() is
Zou> called.

Accessing ar.k? takes 12 cycles... it's faster in the kernel to
use current_thread_info()->cpu which if cache hot can be done in two
cycles.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (8 preceding siblings ...)
  2007-02-08  8:28 ` peterc
@ 2007-02-08  8:40 ` Keith Owens
  2007-02-08 18:03 ` Luck, Tony
  2007-02-08 23:59 ` Keith Owens
  11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08  8:40 UTC (permalink / raw)
  To: linux-ia64

Zou Nan hai (on 08 Feb 2007 13:11:49 +0800) wrote:
>On Thu, 2007-02-08 at 14:55, Keith Owens wrote:
>> Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
>> Correction: ar.k3 contains the physical address of the per-cpu data
>> area, virtual access to per-cpu data goes via the cpu local TLB and
>> does not rely on an ar.k<n> variable.  ar.k3 is used in the MCA
>> assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h and
>> arch/ia64/kernel/mca_asm.S.
>> 
>
> Since MCA is slow path, 
> so I think put smp_processor_id in ar.kr3 is a gain.
>
> We could even optimize get_cpu_var based on this...

(1) Somebody else (not me) gets to fix up and test the MCA handler
    assembler code - lots of luck.

(2) smp_processor_id() in the IA64 kernel is accessed via struct
    thread_info.cpu.  That maps to a simple memory access with code
    like this:

       adds r14=3252,r13
       ;;
       ld4 r15=[r14]

    The stop bits usually get amortized away with other code.
    thread_info.cpu will normally be cached in L1 so reading
    smp_processor_id() is relatively fast.

(3) Reading smp_processor_id() from ar.k3 in the kernel is 10 times
    slower than the existing kernel code.  See the timing program
    below.

(4) If the justification for storing cpu number in ar.k<n> is to speed
    up user space, how can user space tell if the current kernel stores
    the physical address of the per-cpu data in k3 or if it stores the
    cpu number in k3?  Detecting which variant of the kernel is running
    will slow down user space.


Timing results on 'modprobe measure'

init_measure: empty_loop 2000007 cpu_loop 3000011 k3_loop 11999992
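
The "10 times" figure in (3) can be checked from these totals by
subtracting the empty-loop cost and dividing by the loop count.  A
small user-space sketch of that arithmetic follows; the helper name
net_cycles_per_iter() and the report() wrapper are made up for
illustration, and the numbers are the ones printed above.

```c
#include <assert.h>
#include <stdio.h>

#define LOOPS 1000000UL

/* Hypothetical helper: cycles per iteration attributable to the loop
 * body, after subtracting the cost of the empty loop.
 */
static double net_cycles_per_iter(unsigned long total, unsigned long empty,
				  unsigned long loops)
{
	assert(loops > 0 && total >= empty);
	return (double)(total - empty) / (double)loops;
}

/* Print the per-read cost for the totals reported by the module. */
static void report(void)
{
	const unsigned long empty_loop = 2000007;
	const unsigned long cpu_loop = 3000011;
	const unsigned long k3_loop = 11999992;
	double cpu = net_cycles_per_iter(cpu_loop, empty_loop, LOOPS);
	double k3 = net_cycles_per_iter(k3_loop, empty_loop, LOOPS);

	printf("thread_info.cpu %.2f cycles, ar.k3 %.2f cycles, ratio %.1f\n",
	       cpu, k3, k3 / cpu);
}
```

Calling report() from a main() gives roughly 1 cycle per read of
thread_info.cpu against roughly 10 cycles per read of ar.k3, on top of
a 2 cycle empty loop.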

module measure.c

-----------------------------------------------------------------------

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/preempt.h>
#include <asm/kregs.h>
#include <asm/timex.h>

MODULE_LICENSE("GPL");

#define LOOPS 1000000

static int __init init_measure(void)
{
	int loop;
	register int cpu;
	unsigned long start, end, empty_loop, cpu_loop, k3_loop;
	printk("%s: start\n", __FUNCTION__);
	preempt_disable();

	local_irq_disable();
	start = get_cycles();
	barrier();
	for (loop = 0; loop < LOOPS; ++loop) {
		/* ensure that all loops are the same size (2 bundles) */
		asm volatile ("nop 0; nop 0; nop 0;");
		barrier();
	};
	end = get_cycles();
	barrier();
	local_irq_enable();
	empty_loop = end - start;

	local_irq_disable();
	start = get_cycles();
	barrier();
	for (loop = 0; loop < LOOPS; ++loop) {
		/* hand code the read of smp_processor_id() to stop gcc moving
		 * the address calculation outside the loop
		 */
		asm volatile ("adds r14=%0,r13"
			      ";;"
			      "ld4 r15=[r14]"
			      : :
			      "i" (IA64_TASK_SIZE + offsetof(struct thread_info, cpu)) :
			      "r14", "r15" );
		barrier();
	};
	end = get_cycles();
	barrier();
	local_irq_enable();
	cpu_loop = end - start;

	local_irq_disable();
	start = get_cycles();
	barrier();
	for (loop = 0; loop < LOOPS; ++loop) {
		cpu = ia64_get_kr(IA64_KR_PER_CPU_DATA);
		barrier();
	};
	end = get_cycles();
	barrier();
	local_irq_enable();
	k3_loop = end - start;

	preempt_enable();
	printk("%s: empty_loop %lu cpu_loop %lu k3_loop %lu\n", __FUNCTION__, empty_loop, cpu_loop, k3_loop);
	return 0;
}

static void __exit exit_measure(void)
{
	printk("%s: start\n", __FUNCTION__);
	printk("%s: end\n", __FUNCTION__);
}

module_init(init_measure)
module_exit(exit_measure)

-----------------------------------------------------------------------

objdump of the interesting bits (the three loops):

empty loop:

  40:	09 08 00 50 00 21 	[MMI]       mov r1=r40
  46:	00 00 00 02 00 e0 	            nop.m 0x0
  4c:	81 6c 64 84       	            adds r15=3272,r13;;
  50:	0a 18 00 1e 10 10 	[MMI]       ld4 r3=[r15];;
  56:	20 08 0c 00 42 00 	            adds r2=1,r3
  5c:	00 00 04 00       	            nop.i 0x0
  60:	0b 00 00 00 01 00 	[MMI]       nop.m 0x0;;
  66:	00 10 3c 20 23 00 	            st4 [r15]=r2
  6c:	00 00 04 00       	            nop.i 0x0;;
  70:	0b 00 00 02 07 00 	[MMI]       rsm 0x4000;;
  76:	50 02 b0 44 08 00 	            mov.m r37=ar.itc
  7c:	00 00 04 00       	            nop.i 0x0;;
  80:	0b 70 fc 78 84 24 	[MMI]       mov r14=999999;;
  86:	00 00 00 02 00 00 	            nop.m 0x0
  8c:	e0 08 aa 00       	            mov.i ar.lc=r14;;
  90:	01 00 00 00 01 00 	[MII]       nop.m 0x0
  96:	00 00 00 02 00 00 	            nop.i 0x0
  9c:	00 00 04 00       	            nop.i 0x0;;
  a0:	10 00 00 00 01 00 	[MIB]       nop.m 0x0
  a6:	00 00 00 02 00 a0 	            nop.i 0x0
  ac:	f0 ff ff 48       	            br.cloop.sptk.few 90 <init_module+0x90>
  b0:	0b 20 01 58 22 04 	[MMI]       mov.m r36=ar.itc;;
  b6:	00 00 04 0c 00 00 	            ssm 0x4000
  bc:	00 00 04 00       	            nop.i 0x0;;
  c0:	0b 00 00 00 30 00 	[MMI]       srlz.d;;

Read smp_processor_id:

  c6:	00 00 04 0e 00 00 	            rsm 0x4000
  cc:	00 00 04 00       	            nop.i 0x0;;
  d0:	01 18 01 58 22 04 	[MII]       mov.m r35=ar.itc
  d6:	00 00 00 02 00 00 	            nop.i 0x0
  dc:	00 00 04 00       	            nop.i 0x0;;
  e0:	0a 40 fc 78 84 24 	[MMI]       mov r8=999999;;
  e6:	00 00 00 02 00 00 	            nop.m 0x0
  ec:	80 08 aa 00       	            mov.i ar.lc=r8
  f0:	0b 70 d0 1a 19 21 	[MMI]       adds r14=3252,r13;;
  f6:	f0 00 38 20 20 00 	            ld4 r15=[r14]
  fc:	00 00 04 00       	            nop.i 0x0;;
 100:	10 00 00 00 01 00 	[MIB]       nop.m 0x0
 106:	00 00 00 02 00 a0 	            nop.i 0x0
 10c:	f0 ff ff 48       	            br.cloop.sptk.few f0 <init_module+0xf0>
 110:	0b 10 01 58 22 04 	[MMI]       mov.m r34=ar.itc;;
 116:	00 00 04 0c 00 00 	            ssm 0x4000
 11c:	00 00 04 00       	            nop.i 0x0;;
 120:	0b 00 00 00 30 00 	[MMI]       srlz.d;;

Read ar.k3:

 126:	00 00 04 0e 00 00 	            rsm 0x4000
 12c:	00 00 04 00       	            nop.i 0x0;;
 130:	01 08 01 58 22 04 	[MII]       mov.m r33=ar.itc
 136:	00 00 00 02 00 00 	            nop.i 0x0
 13c:	00 00 04 00       	            nop.i 0x0;;
 140:	0a 48 fc 78 84 24 	[MMI]       mov r9=999999;;
 146:	00 00 00 02 00 00 	            nop.m 0x0
 14c:	90 08 aa 00       	            mov.i ar.lc=r9
 150:	01 70 00 06 22 04 	[MII]       mov.m r14=ar.k3
 156:	00 00 00 02 00 00 	            nop.i 0x0
 15c:	00 00 04 00       	            nop.i 0x0;;
 160:	10 00 00 00 01 00 	[MIB]       nop.m 0x0
 166:	00 00 00 02 00 a0 	            nop.i 0x0
 16c:	f0 ff ff 48       	            br.cloop.sptk.few 150 <init_module+0x150>
 170:	0b 00 01 58 22 04 	[MMI]       mov.m r32=ar.itc;;
 176:	00 00 04 0c 00 00 	            ssm 0x4000
 17c:	00 00 04 00       	            nop.i 0x0;;
 180:	01 00 00 00 30 00 	[MII]       srlz.d


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (9 preceding siblings ...)
  2007-02-08  8:40 ` Keith Owens
@ 2007-02-08 18:03 ` Luck, Tony
  2007-02-08 23:59 ` Keith Owens
  11 siblings, 0 replies; 13+ messages in thread
From: Luck, Tony @ 2007-02-08 18:03 UTC (permalink / raw)
  To: linux-ia64

Summarizing thread that I was sleeping through:

1) Use ar.kr2 for ...
No ... as Keith pointed out there is debug code in ivt.S to use
it to track the last few traps, and if that isn't being used it
is very handy for other debugging uses.  I won't give up the last
of these registers unless it is for some cause which is a clear and
obvious major win in performance or functionality.  An allegedly
faster way to find the cpu number is not a clear win (if the
percpu variable is in cache, then it is clearly faster to read
from memory).

2) Use ar.kr3 for cpu number, and then make the MCA code index an
array to get the phys address of the per-cpu area.
Messes with a lot of MCA code, and for a microscopic improvement
over my proposed getcpu() code.  Yes, you can avoid ever doing the system
call ... but only running the system call when you have migrated to a
different cpu should cover most calls [possible exception ... a future
scheduler might frequently move a process between logical cpus that
share all cache levels, since there is no cache penalty for running
on other cpus in the same cache domain].  So I'm not looking favourably
at this option at the moment ... but could change my mind if presented
with some data on getcpu() usage.
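
For reference, option 2 amounts to replacing a direct read of the
per-cpu physical address with a table lookup keyed by cpu number.  A
rough C model of that lookup is below.  It is not kernel code:
percpu_paddr[], NR_CPUS_SKETCH and both helper names are hypothetical,
and the real MCA handler is assembler.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 16	/* hypothetical; sized like the 16p test system */

/* Hypothetical table: physical address of each cpu's per-cpu area. */
static uint64_t percpu_paddr[NR_CPUS_SKETCH];

/* Stand-in for what boot code would record once per cpu. */
static void register_percpu_paddr(unsigned int cpu, uint64_t paddr)
{
	assert(cpu < NR_CPUS_SKETCH);
	percpu_paddr[cpu] = paddr;
}

/* With option 2 the kernel register holds the cpu number, so finding
 * the physical address costs one extra dependent load from the table.
 */
static uint64_t this_paddr_from_cpu(unsigned int cpu_from_kr)
{
	assert(cpu_from_kr < NR_CPUS_SKETCH);
	return percpu_paddr[cpu_from_kr];
}
```

The extra load would sit in the MCA path, which is where the code
churn mentioned above comes from.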

-Tony

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
  2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
                   ` (10 preceding siblings ...)
  2007-02-08 18:03 ` Luck, Tony
@ 2007-02-08 23:59 ` Keith Owens
  11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08 23:59 UTC (permalink / raw)
  To: linux-ia64

Zou Nan hai (on 08 Feb 2007 15:14:54 +0800) wrote:
>  I think using a static value to cache getcpu will heavily bounced on
>that cache line contain the static value if multi cpus calls getcpu very
>frequently. 

AFAICT, Tony's suggestion[*] is all in user space, e.g. glibc.  Each
application will get its own thread local copy of the static variable,
there is no globally shared static value so no cache line bouncing.

>  then implement current_thread_info()->cpu in fsys call should be
>better?

Maybe.  Implement it, time it and see which is faster.

[*] http://marc.theaimsgroup.com/?l=linux-ia64&m=117087180232044&w=2

BTW Tony, in that code there is no need to initialise cpu to ~0 nor to
test for that value.  On the first call it is guaranteed that ar.k3 != save_ar_k3.
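
A user-space cache along these lines could look roughly like the
sketch below.  This is not Tony's code.  The hooks read_cpu_token() and
slow_getcpu() are hypothetical stand-ins for a read of ar.k3 and for
the getcpu system call, the sim_* names are a fake environment so the
sketch can run anywhere, and the token is assumed to be nonzero and
different on each cpu (as a per-cpu physical address would be).  The
thread-local saved token starts at zero, so the first call always
takes the slow path without any ~0 sentinel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hooks, passed in so the sketch stays portable. */
struct cpu_hooks {
	uint64_t (*read_cpu_token)(void);	/* stand-in for reading ar.k3 */
	int (*slow_getcpu)(void);		/* stand-in for the system call */
};

/* One copy per thread (__thread is the gcc/glibc spelling), so no
 * cache line is shared between cpus.
 */
static __thread uint64_t saved_token;	/* zero until the first call */
static __thread int cached_cpu;

static int cached_getcpu(const struct cpu_hooks *h)
{
	uint64_t token = h->read_cpu_token();

	assert(token != 0);	/* assumption: a valid token is never zero */
	if (token != saved_token) {
		/* first call, or the thread has moved to another cpu */
		cached_cpu = h->slow_getcpu();
		saved_token = token;
	}
	return cached_cpu;
}

/* Simulated environment, for illustration only. */
static uint64_t sim_token;
static int sim_cpu;
static int sim_slow_calls;

static uint64_t sim_read_token(void)
{
	return sim_token;
}

static int sim_slow_getcpu(void)
{
	++sim_slow_calls;
	return sim_cpu;
}

static const struct cpu_hooks sim_hooks = { sim_read_token, sim_slow_getcpu };
```

With sim_token and sim_cpu set, repeated calls to
cached_getcpu(&sim_hooks) only bump sim_slow_calls when the token
changes, which is the "system call only after migration" behaviour
being discussed.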


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-02-08 23:59 UTC | newest]

Thread overview: 13+ messages
2007-02-08  3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
2007-02-08  4:27 ` Zou Nan hai
2007-02-08  4:59 ` Zou Nan hai
2007-02-08  5:11 ` Zou Nan hai
2007-02-08  6:04 ` Keith Owens
2007-02-08  6:37 ` Keith Owens
2007-02-08  6:55 ` Keith Owens
2007-02-08  7:14 ` Zou Nan hai
2007-02-08  7:38 ` Zou Nan hai
2007-02-08  8:28 ` peterc
2007-02-08  8:40 ` Keith Owens
2007-02-08 18:03 ` Luck, Tony
2007-02-08 23:59 ` Keith Owens
