* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
@ 2007-02-08 4:27 ` Zou Nan hai
2007-02-08 4:59 ` Zou Nan hai
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08 4:27 UTC (permalink / raw)
To: linux-ia64
On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
>
> Historically ar.k2 has been reserved for debugging purposes, for
> example in ivt.S. Debuggers often need a location that can be used to
> track progress, it has to be somewhere that does not rely on TLB
> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
> perfect for this.
>
OK, it seems that kr3 is currently only used by ia64_itc_printk_clock?
> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
> be unique on every cpu) and caching the corresponding cpu number when
> it changes.
>
But why do we even need to cache it?
If we put it in kr3 it is already in a register, so smp_processor_id()
could be very fast, and later sys_getcpu could also be very fast.
Thanks
Zou Nan hai
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
2007-02-08 4:27 ` Zou Nan hai
@ 2007-02-08 4:59 ` Zou Nan hai
2007-02-08 5:11 ` Zou Nan hai
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08 4:59 UTC (permalink / raw)
To: linux-ia64
On Thu, 2007-02-08 at 14:37, Keith Owens wrote:
> Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
> >On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
> >> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
> >> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
> >>
> >> Historically ar.k2 has been reserved for debugging purposes, for
> >> example in ivt.S. Debuggers often need a location that can be used
> to
> >> track progress, it has to be somewhere that does not rely on TLB
> >> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
> >> perfect for this.
> >>
> > Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
> >> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed
> to
> >> be unique on every cpu) and caching the corresponding cpu number
> when
> >> it changes.
> >>
> > But why do we even need to cache it?
> >
> > It is already in a register if we put it to kr3.
> > so smp_processor_id() could be very fast. and later sys_getcpu can
> >also be very fast.
>
> ar.k3 is currently used for the address of the per-cpu data area,
> which
> speeds up access to all the per-cpu data. Changing ar.k3 to hold the
> cpu number means an extra array calculation and lookup for every
> per-cpu variable, slowing down the rest of the system.
>
Are you sure ar.k3 is used for per-cpu data access? Disassembly of
vmlinux shows that only the MCA code and ia64_itc_printk_clock use
ar.k3.
Thanks
Zou Nan hai
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
2007-02-08 4:27 ` Zou Nan hai
2007-02-08 4:59 ` Zou Nan hai
@ 2007-02-08 5:11 ` Zou Nan hai
2007-02-08 6:04 ` Keith Owens
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08 5:11 UTC (permalink / raw)
To: linux-ia64
On Thu, 2007-02-08 at 14:55, Keith Owens wrote:
> Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
> >Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
> >>On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
> >>> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
> >>> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
> >>>
> >>> Historically ar.k2 has been reserved for debugging purposes, for
> >>> example in ivt.S. Debuggers often need a location that can be
> used to
> >>> track progress, it has to be somewhere that does not rely on TLB
> >>> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
> >>> perfect for this.
> >>>
> >> Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
> >>> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed
> to
> >>> be unique on every cpu) and caching the corresponding cpu number
> when
> >>> it changes.
> >>>
> >> But why do we even need to cache it?
> >>
> >> It is already in a register if we put it to kr3.
> >> so smp_processor_id() could be very fast. and later sys_getcpu can
> >>also be very fast.
> >
> >ar.k3 is currently used for the address of the per-cpu data area,
> which
> >speeds up access to all the per-cpu data. Changing ar.k3 to hold the
> >cpu number means an extra array calculation and lookup for every
> >per-cpu variable, slowing down the rest of the system.
>
> Correction: ar.k3 contains the physical address of the per-cpu data
> area, virtual access to per-cpu data goes via the cpu local TLB and
> does not rely on an ar.k<n> variable. ar.k3 is used in the MCA
> assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h
> and
> arch/ia64/kernel/mca_asm.S.
>
Since MCA is a slow path, I think putting smp_processor_id in ar.kr3
is a gain.
We could even optimize get_cpu_var based on this...
Thanks
Zou Nan hai
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (2 preceding siblings ...)
2007-02-08 5:11 ` Zou Nan hai
@ 2007-02-08 6:04 ` Keith Owens
2007-02-08 6:37 ` Keith Owens
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08 6:04 UTC (permalink / raw)
To: linux-ia64
Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
>Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
Historically ar.k2 has been reserved for debugging purposes, for
example in ivt.S. Debuggers often need a location that can be used to
track progress, it has to be somewhere that does not rely on TLB
entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
perfect for this.
Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
be unique on every cpu) and caching the corresponding cpu number when
it changes.
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (3 preceding siblings ...)
2007-02-08 6:04 ` Keith Owens
@ 2007-02-08 6:37 ` Keith Owens
2007-02-08 6:55 ` Keith Owens
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08 6:37 UTC (permalink / raw)
To: linux-ia64
Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
>On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
>> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
>> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
>>
>> Historically ar.k2 has been reserved for debugging purposes, for
>> example in ivt.S. Debuggers often need a location that can be used to
>> track progress, it has to be somewhere that does not rely on TLB
>> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
>> perfect for this.
>>
> Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
>> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
>> be unique on every cpu) and caching the corresponding cpu number when
>> it changes.
>>
> But why do we even need to cache it?
>
> It is already in a register if we put it to kr3.
> so smp_processor_id() could be very fast. and later sys_getcpu can
>also be very fast.
ar.k3 is currently used for the address of the per-cpu data area, which
speeds up access to all the per-cpu data. Changing ar.k3 to hold the
cpu number means an extra array calculation and lookup for every
per-cpu variable, slowing down the rest of the system.
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (4 preceding siblings ...)
2007-02-08 6:37 ` Keith Owens
@ 2007-02-08 6:55 ` Keith Owens
2007-02-08 7:14 ` Zou Nan hai
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08 6:55 UTC (permalink / raw)
To: linux-ia64
Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
>Zou Nan hai (on 08 Feb 2007 12:27:31 +0800) wrote:
>>On Thu, 2007-02-08 at 14:04, Keith Owens wrote:
>>> Zou Nan hai (on 08 Feb 2007 11:28:44 +0800) wrote:
>>> >Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
>>>
>>> Historically ar.k2 has been reserved for debugging purposes, for
>>> example in ivt.S. Debuggers often need a location that can be used to
>>> track progress, it has to be somewhere that does not rely on TLB
>>> entries and is guaranteed to appear in MCA/INIT records - ar.k2 is
>>> perfect for this.
>>>
>> Ok, seems that current kr3 is only used by ia64_itc_printk_clock?
>>> Use Tony's suggestion of testing for a change in ar.k3 (guaranteed to
>>> be unique on every cpu) and caching the corresponding cpu number when
>>> it changes.
>>>
>> But why do we even need to cache it?
>>
>> It is already in a register if we put it to kr3.
>> so smp_processor_id() could be very fast. and later sys_getcpu can
>>also be very fast.
>
>ar.k3 is currently used for the address of the per-cpu data area, which
>speeds up access to all the per-cpu data. Changing ar.k3 to hold the
>cpu number means an extra array calculation and lookup for every
>per-cpu variable, slowing down the rest of the system.
Correction: ar.k3 contains the physical address of the per-cpu data
area, virtual access to per-cpu data goes via the cpu local TLB and
does not rely on an ar.k<n> variable. ar.k3 is used in the MCA
assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h and
arch/ia64/kernel/mca_asm.S.
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (5 preceding siblings ...)
2007-02-08 6:55 ` Keith Owens
@ 2007-02-08 7:14 ` Zou Nan hai
2007-02-08 7:38 ` Zou Nan hai
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08 7:14 UTC (permalink / raw)
To: linux-ia64
On Thu, 2007-02-08 at 16:40, Keith Owens wrote:
> Zou Nan hai (on 08 Feb 2007 13:11:49 +0800) wrote:
> >On Thu, 2007-02-08 at 14:55, Keith Owens wrote:
> >> Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
> >> Correction: ar.k3 contains the physical address of the per-cpu data
> >> area, virtual access to per-cpu data goes via the cpu local TLB and
> >> does not rely on an ar.k<n> variable. ar.k3 is used in the MCA
> >> assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h
> >> and
> >> arch/ia64/kernel/mca_asm.S.
> >>
> >
> > Since MCA is slow path,
> > so I think put smp_processor_id in ar.kr3 is a gain.
> >
> > We could even optimize get_cpu_var based on this...
>
> (1) Somebody else (not me) gets to fix up and test the MCA handler
> assembler code - lots of luck.
>
> (2) smp_processor_id() in the IA64 kernel is accessed via struct
> thread_info.cpu. That maps to a simple memory access with code
> like this:
>
> adds r14=3252,r13
> ;;
> ld4 r15=[r14]
>
> The stop bits usually get amortized away with other code.
> thread_info.cpu will normally be cached in L1 so reading
> smp_processor_id() is relatively fast.
>
> (3) Reading smp_processor_id() from ar.k3 in the kernel is 10 times
> slower than the existing kernel code. See the timing program
> below.
>
> (4) If the justification for storing cpu number in ar.k<n> is to speed
> up user space, how can user space tell if the current kernel
> stores
> the physical address of the per-cpu data in k3 or if it stores the
> cpu number in k3? Detecting which variant of the kernel is
> running
> will slow down user space.
>
>
> Timing results on 'modprobe measure'
>
> init_measure: empty_loop 2000007 cpu_loop 3000011 k3_loop 11999992
>
> module measure.c
>
> -----------------------------------------------------------------------
>
> #include <linux/init.h>
> #include <linux/kernel.h>
> #include <linux/module.h>
> #include <linux/preempt.h>
> #include <asm/kregs.h>
> #include <asm/timex.h>
>
> MODULE_LICENSE("GPL");
>
> #define LOOPS 1000000
>
> static int __init init_measure(void)
> {
> int loop;
> register int cpu;
> unsigned long start, end, empty_loop, cpu_loop, k3_loop;
> printk("%s: start\n", __FUNCTION__);
> preempt_disable();
>
> local_irq_disable();
> start = get_cycles();
> barrier();
> for (loop = 0; loop < LOOPS; ++loop) {
> /* ensure that all loops are the same size (2 bundles)
> */
> asm volatile ("nop 0; nop 0; nop 0;");
> barrier();
> };
> end = get_cycles();
> barrier();
> local_irq_enable();
> empty_loop = end - start;
>
> local_irq_disable();
> start = get_cycles();
> barrier();
> for (loop = 0; loop < LOOPS; ++loop) {
> /* hand code the read of smp_processor_id() to stop
> gcc moving
> * the address calculation outside the loop
> */
> asm volatile ("adds r14=%0,r13"
> ";;"
> "ld4 r15=[r14]"
> : :
> "i" (IA64_TASK_SIZE + offsetof(struct
> thread_info, cpu)) :
> "r14", "r15" );
> barrier();
> };
> end = get_cycles();
> barrier();
> local_irq_enable();
> cpu_loop = end - start;
>
> local_irq_disable();
> start = get_cycles();
> barrier();
> for (loop = 0; loop < LOOPS; ++loop) {
> cpu = ia64_get_kr(IA64_KR_PER_CPU_DATA);
> barrier();
> };
> end = get_cycles();
> barrier();
> local_irq_enable();
> k3_loop = end - start;
>
> preempt_enable();
> printk("%s: empty_loop %ld cpu_loop %ld k3_loop %ld\n",
> __FUNCTION__, empty_loop, cpu_loop, k3_loop);
> return 0;
> }
>
> static void __exit exit_measure(void)
> {
> printk("%s: start\n", __FUNCTION__);
> printk("%s: end\n", __FUNCTION__);
> }
>
> module_init(init_measure)
> module_exit(exit_measure)
>
> -----------------------------------------------------------------------
>
> objdump of the interesting bits (the three loops):
>
> empty loop:
>
> 40: 09 08 00 50 00 21 [MMI] mov r1=r40
> 46: 00 00 00 02 00 e0 nop.m 0x0
> 4c: 81 6c 64 84 adds r15=3272,r13;;
> 50: 0a 18 00 1e 10 10 [MMI] ld4 r3=[r15];;
> 56: 20 08 0c 00 42 00 adds r2=1,r3
> 5c: 00 00 04 00 nop.i 0x0
> 60: 0b 00 00 00 01 00 [MMI] nop.m 0x0;;
> 66: 00 10 3c 20 23 00 st4 [r15]=r2
> 6c: 00 00 04 00 nop.i 0x0;;
> 70: 0b 00 00 02 07 00 [MMI] rsm 0x4000;;
> 76: 50 02 b0 44 08 00 mov.m r37=ar.itc
> 7c: 00 00 04 00 nop.i 0x0;;
> 80: 0b 70 fc 78 84 24 [MMI] mov r14=999999;;
> 86: 00 00 00 02 00 00 nop.m 0x0
> 8c: e0 08 aa 00 mov.i ar.lc=r14;;
> 90: 01 00 00 00 01 00 [MII] nop.m 0x0
> 96: 00 00 00 02 00 00 nop.i 0x0
> 9c: 00 00 04 00 nop.i 0x0;;
> a0: 10 00 00 00 01 00 [MIB] nop.m 0x0
> a6: 00 00 00 02 00 a0 nop.i 0x0
> ac: f0 ff ff 48 br.cloop.sptk.few 90
> <init_module+0x90>
> b0: 0b 20 01 58 22 04 [MMI] mov.m r36=ar.itc;;
> b6: 00 00 04 0c 00 00 ssm 0x4000
> bc: 00 00 04 00 nop.i 0x0;;
> c0: 0b 00 00 00 30 00 [MMI] srlz.d;;
>
> Read smp_processor_id:
>
> c6: 00 00 04 0e 00 00 rsm 0x4000
> cc: 00 00 04 00 nop.i 0x0;;
> d0: 01 18 01 58 22 04 [MII] mov.m r35=ar.itc
> d6: 00 00 00 02 00 00 nop.i 0x0
> dc: 00 00 04 00 nop.i 0x0;;
> e0: 0a 40 fc 78 84 24 [MMI] mov r8=999999;;
> e6: 00 00 00 02 00 00 nop.m 0x0
> ec: 80 08 aa 00 mov.i ar.lc=r8
> f0: 0b 70 d0 1a 19 21 [MMI] adds r14=3252,r13;;
> f6: f0 00 38 20 20 00 ld4 r15=[r14]
> fc: 00 00 04 00 nop.i 0x0;;
> 100: 10 00 00 00 01 00 [MIB] nop.m 0x0
> 106: 00 00 00 02 00 a0 nop.i 0x0
> 10c: f0 ff ff 48 br.cloop.sptk.few f0
> <init_module+0xf0>
> 110: 0b 10 01 58 22 04 [MMI] mov.m r34=ar.itc;;
> 116: 00 00 04 0c 00 00 ssm 0x4000
> 11c: 00 00 04 00 nop.i 0x0;;
> 120: 0b 00 00 00 30 00 [MMI] srlz.d;;
>
> Read ar.k3:
>
> 126: 00 00 04 0e 00 00 rsm 0x4000
> 12c: 00 00 04 00 nop.i 0x0;;
> 130: 01 08 01 58 22 04 [MII] mov.m r33=ar.itc
> 136: 00 00 00 02 00 00 nop.i 0x0
> 13c: 00 00 04 00 nop.i 0x0;;
> 140: 0a 48 fc 78 84 24 [MMI] mov r9=999999;;
> 146: 00 00 00 02 00 00 nop.m 0x0
> 14c: 90 08 aa 00 mov.i ar.lc=r9
> 150: 01 70 00 06 22 04 [MII] mov.m r14=ar.k3
> 156: 00 00 00 02 00 00 nop.i 0x0
> 15c: 00 00 04 00 nop.i 0x0;;
> 160: 10 00 00 00 01 00 [MIB] nop.m 0x0
> 166: 00 00 00 02 00 a0 nop.i 0x0
> 16c: f0 ff ff 48 br.cloop.sptk.few 150
> <init_module+0x150>
> 170: 0b 00 01 58 22 04 [MMI] mov.m r32=ar.itc;;
> 176: 00 00 04 0c 00 00 ssm 0x4000
> 17c: 00 00 04 00 nop.i 0x0;;
> 180: 01 00 00 00 30 00 [MII] srlz.d
>
OK.
I think using a static value to cache getcpu will cause heavy bouncing of
the cache line that contains the static value if multiple cpus call getcpu
very frequently.
Would implementing current_thread_info()->cpu as an fsys call be better?
Thanks
Zou Nan hai
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (6 preceding siblings ...)
2007-02-08 7:14 ` Zou Nan hai
@ 2007-02-08 7:38 ` Zou Nan hai
2007-02-08 8:28 ` peterc
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Zou Nan hai @ 2007-02-08 7:38 UTC (permalink / raw)
To: linux-ia64
On Thu, 2007-02-08 at 15:14, Zou Nan hai wrote:
> On Thu, 2007-02-08 at 16:40, Keith Owens wrote:
> > Zou Nan hai (on 08 Feb 2007 13:11:49 +0800) wrote:
> > >On Thu, 2007-02-08 at 14:55, Keith Owens wrote:
> > >> Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
> > >> Correction: ar.k3 contains the physical address of the per-cpu
> data
> > >> area, virtual access to per-cpu data goes via the cpu local TLB
> and
> > >> does not rely on an ar.k<n> variable. ar.k3 is used in the MCA
> > >> assembler handler, see GET_THIS_PADDR in
> include/asm-ia64/mca_asm.h
> > >> and
> > >> arch/ia64/kernel/mca_asm.S.
> > >>
> > >
> > > Since MCA is slow path,
> > > so I think put smp_processor_id in ar.kr3 is a gain.
> > >
> > > We could even optimize get_cpu_var based on this...
> >
> > (1) Somebody else (not me) gets to fix up and test the MCA handler
> > assembler code - lots of luck.
> >
> > (2) smp_processor_id() in the IA64 kernel is accessed via struct
> > thread_info.cpu. That maps to a simple memory access with code
> > like this:
> >
> > adds r14=3252,r13
> > ;;
> > ld4 r15=[r14]
> >
> > The stop bits usually get amortized away with other code.
> > thread_info.cpu will normally be cached in L1 so reading
> > smp_processor_id() is relatively fast.
> >
> > (3) Reading smp_processor_id() from ar.k3 in the kernel is 10 times
> > slower than the existing kernel code. See the timing program
> > below.
> >
> > (4) If the justification for storing cpu number in ar.k<n> is to
> speed
> > up user space, how can user space tell if the current kernel
> > stores
> > the physical address of the per-cpu data in k3 or if it stores
> the
> > cpu number in k3? Detecting which variant of the kernel is
> > running
> > will slow down user space.
> >
> >
> > Timing results on 'modprobe measure'
> >
> > init_measure: empty_loop 2000007 cpu_loop 3000011 k3_loop 11999992
> >
> > module measure.c
> >
> >
> -----------------------------------------------------------------------
> >
> > #include <linux/init.h>
> > #include <linux/kernel.h>
> > #include <linux/module.h>
> > #include <linux/preempt.h>
> > #include <asm/kregs.h>
> > #include <asm/timex.h>
> >
> > MODULE_LICENSE("GPL");
> >
> > #define LOOPS 1000000
> >
> > static int __init init_measure(void)
> > {
> > int loop;
> > register int cpu;
> > unsigned long start, end, empty_loop, cpu_loop, k3_loop;
> > printk("%s: start\n", __FUNCTION__);
> > preempt_disable();
> >
> > local_irq_disable();
> > start = get_cycles();
> > barrier();
> > for (loop = 0; loop < LOOPS; ++loop) {
> > /* ensure that all loops are the same size (2
> bundles)
> > */
> > asm volatile ("nop 0; nop 0; nop 0;");
> > barrier();
> > };
> > end = get_cycles();
> > barrier();
> > local_irq_enable();
> > empty_loop = end - start;
> >
> > local_irq_disable();
> > start = get_cycles();
> > barrier();
> > for (loop = 0; loop < LOOPS; ++loop) {
> > /* hand code the read of smp_processor_id() to stop
> > gcc moving
> > * the address calculation outside the loop
> > */
> > asm volatile ("adds r14=%0,r13"
> > ";;"
> > "ld4 r15=[r14]"
> > : :
> > "i" (IA64_TASK_SIZE + offsetof(struct
> > thread_info, cpu)) :
> > "r14", "r15" );
> > barrier();
> > };
> > end = get_cycles();
> > barrier();
> > local_irq_enable();
> > cpu_loop = end - start;
> >
> > local_irq_disable();
> > start = get_cycles();
> > barrier();
> > for (loop = 0; loop < LOOPS; ++loop) {
> > cpu = ia64_get_kr(IA64_KR_PER_CPU_DATA);
> > barrier();
> > };
> > end = get_cycles();
> > barrier();
> > local_irq_enable();
> > k3_loop = end - start;
> >
> > preempt_enable();
> > printk("%s: empty_loop %ld cpu_loop %ld k3_loop %ld\n",
> > __FUNCTION__, empty_loop, cpu_loop, k3_loop);
> > return 0;
> > }
> >
> > static void __exit exit_measure(void)
> > {
> > printk("%s: start\n", __FUNCTION__);
> > printk("%s: end\n", __FUNCTION__);
> > }
> >
> > module_init(init_measure)
> > module_exit(exit_measure)
> >
> >
> -----------------------------------------------------------------------
> >
> > objdump of the interesting bits (the three loops):
> >
> > empty loop:
> >
> > 40: 09 08 00 50 00 21 [MMI] mov r1=r40
> > 46: 00 00 00 02 00 e0 nop.m 0x0
> > 4c: 81 6c 64 84 adds r15=3272,r13;;
> > 50: 0a 18 00 1e 10 10 [MMI] ld4 r3=[r15];;
> > 56: 20 08 0c 00 42 00 adds r2=1,r3
> > 5c: 00 00 04 00 nop.i 0x0
> > 60: 0b 00 00 00 01 00 [MMI] nop.m 0x0;;
> > 66: 00 10 3c 20 23 00 st4 [r15]=r2
> > 6c: 00 00 04 00 nop.i 0x0;;
> > 70: 0b 00 00 02 07 00 [MMI] rsm 0x4000;;
> > 76: 50 02 b0 44 08 00 mov.m r37=ar.itc
> > 7c: 00 00 04 00 nop.i 0x0;;
> > 80: 0b 70 fc 78 84 24 [MMI] mov r14=999999;;
> > 86: 00 00 00 02 00 00 nop.m 0x0
> > 8c: e0 08 aa 00 mov.i ar.lc=r14;;
> > 90: 01 00 00 00 01 00 [MII] nop.m 0x0
> > 96: 00 00 00 02 00 00 nop.i 0x0
> > 9c: 00 00 04 00 nop.i 0x0;;
> > a0: 10 00 00 00 01 00 [MIB] nop.m 0x0
> > a6: 00 00 00 02 00 a0 nop.i 0x0
> > ac: f0 ff ff 48 br.cloop.sptk.few 90
> > <init_module+0x90>
> > b0: 0b 20 01 58 22 04 [MMI] mov.m r36=ar.itc;;
> > b6: 00 00 04 0c 00 00 ssm 0x4000
> > bc: 00 00 04 00 nop.i 0x0;;
> > c0: 0b 00 00 00 30 00 [MMI] srlz.d;;
> >
> > Read smp_processor_id:
> >
> > c6: 00 00 04 0e 00 00 rsm 0x4000
> > cc: 00 00 04 00 nop.i 0x0;;
> > d0: 01 18 01 58 22 04 [MII] mov.m r35=ar.itc
> > d6: 00 00 00 02 00 00 nop.i 0x0
> > dc: 00 00 04 00 nop.i 0x0;;
> > e0: 0a 40 fc 78 84 24 [MMI] mov r8=999999;;
> > e6: 00 00 00 02 00 00 nop.m 0x0
> > ec: 80 08 aa 00 mov.i ar.lc=r8
> > f0: 0b 70 d0 1a 19 21 [MMI] adds r14=3252,r13;;
> > f6: f0 00 38 20 20 00 ld4 r15=[r14]
> > fc: 00 00 04 00 nop.i 0x0;;
> > 100: 10 00 00 00 01 00 [MIB] nop.m 0x0
> > 106: 00 00 00 02 00 a0 nop.i 0x0
> > 10c: f0 ff ff 48 br.cloop.sptk.few f0
> > <init_module+0xf0>
> > 110: 0b 10 01 58 22 04 [MMI] mov.m r34=ar.itc;;
> > 116: 00 00 04 0c 00 00 ssm 0x4000
> > 11c: 00 00 04 00 nop.i 0x0;;
> > 120: 0b 00 00 00 30 00 [MMI] srlz.d;;
> >
> > Read ar.k3:
> >
> > 126: 00 00 04 0e 00 00 rsm 0x4000
> > 12c: 00 00 04 00 nop.i 0x0;;
> > 130: 01 08 01 58 22 04 [MII] mov.m r33=ar.itc
> > 136: 00 00 00 02 00 00 nop.i 0x0
> > 13c: 00 00 04 00 nop.i 0x0;;
> > 140: 0a 48 fc 78 84 24 [MMI] mov r9=999999;;
> > 146: 00 00 00 02 00 00 nop.m 0x0
> > 14c: 90 08 aa 00 mov.i ar.lc=r9
> > 150: 01 70 00 06 22 04 [MII] mov.m r14=ar.k3
> > 156: 00 00 00 02 00 00 nop.i 0x0
> > 15c: 00 00 04 00 nop.i 0x0;;
> > 160: 10 00 00 00 01 00 [MIB] nop.m 0x0
> > 166: 00 00 00 02 00 a0 nop.i 0x0
> > 16c: f0 ff ff 48 br.cloop.sptk.few 150
> > <init_module+0x150>
> > 170: 0b 00 01 58 22 04 [MMI] mov.m r32=ar.itc;;
> > 176: 00 00 04 0c 00 00 ssm 0x4000
> > 17c: 00 00 04 00 nop.i 0x0;;
> > 180: 01 00 00 00 30 00 [MII] srlz.d
> >
>
> Ok,
> I think using a static value to cache getcpu will heavily bounced on
> that cache line contain the static value if multi cpus calls getcpu
> very
> frequently.
>
> then implement current_thread_info()->cpu in fsys call should be
> better?
>
> Thanks
> Zou Nan hai
>
>
Maybe it would be better to let glibc cache the CPU ID and node ID in
thread-local storage?
Zou Nan hai
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (7 preceding siblings ...)
2007-02-08 7:38 ` Zou Nan hai
@ 2007-02-08 8:28 ` peterc
2007-02-08 8:40 ` Keith Owens
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: peterc @ 2007-02-08 8:28 UTC (permalink / raw)
To: linux-ia64
>>>>> "Zou" = Zou Nan hai <nanhai.zou@intel.com> writes:
Zou> Pin ar.kr2 of each CPU, so that smp_processor_id can use it.
Zou> This will save some memory foot-print when smp_processor_id() is
Zou> called.
Accessing ar.k? takes 12 cycles... it's faster in the kernel to
use current_thread_info()->cpu which if cache hot can be done in two
cycles.
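As a rough user-space model of what the current_thread_info()->cpu read amounts to (one add and one 4-byte load from a fixed offset off the current-task pointer), here is a minimal sketch. All struct layouts and names ending in _sketch are hypothetical simplifications, not the real kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layouts; the real kernel structures differ. */
struct task_sketch {
    char opaque[64];
};

struct thread_info_sketch {
    void *task;
    uint32_t flags;
    uint32_t cpu;            /* the field smp_processor_id() reads */
};

/* Model of the combined allocation: task structure, then thread_info. */
struct task_and_info_sketch {
    struct task_sketch task;
    struct thread_info_sketch info;
};

/* Plays the role of IA64_TASK_SIZE + offsetof(struct thread_info, cpu). */
#define CPU_OFFSET_SKETCH \
    (offsetof(struct task_and_info_sketch, info) + \
     offsetof(struct thread_info_sketch, cpu))

/* Equivalent of: adds r14=OFFSET,r13 ;; ld4 r15=[r14] */
static uint32_t read_cpu_sketch(const void *current_task)
{
    uint32_t cpu;

    memcpy(&cpu, (const unsigned char *)current_task + CPU_OFFSET_SKETCH,
           sizeof(cpu));
    return cpu;
}
```

The offset is a compile-time constant, so the whole lookup is an address calculation plus a load that normally hits L1.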
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (8 preceding siblings ...)
2007-02-08 8:28 ` peterc
@ 2007-02-08 8:40 ` Keith Owens
2007-02-08 18:03 ` Luck, Tony
2007-02-08 23:59 ` Keith Owens
11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08 8:40 UTC (permalink / raw)
To: linux-ia64
Zou Nan hai (on 08 Feb 2007 13:11:49 +0800) wrote:
>On Thu, 2007-02-08 at 14:55, Keith Owens wrote:
>> Keith Owens (on Thu, 08 Feb 2007 17:37:54 +1100) wrote:
>> Correction: ar.k3 contains the physical address of the per-cpu data
>> area, virtual access to per-cpu data goes via the cpu local TLB and
>> does not rely on an ar.k<n> variable. ar.k3 is used in the MCA
>> assembler handler, see GET_THIS_PADDR in include/asm-ia64/mca_asm.h
>> and
>> arch/ia64/kernel/mca_asm.S.
>>
>
> Since MCA is slow path,
> so I think put smp_processor_id in ar.kr3 is a gain.
>
> We could even optimize get_cpu_var based on this...
(1) Somebody else (not me) gets to fix up and test the MCA handler
assembler code - lots of luck.
(2) smp_processor_id() in the IA64 kernel is accessed via struct
thread_info.cpu. That maps to a simple memory access with code
like this:
adds r14=3252,r13
;;
ld4 r15=[r14]
The stop bits usually get amortized away with other code.
thread_info.cpu will normally be cached in L1 so reading
smp_processor_id() is relatively fast.
(3) Reading smp_processor_id() from ar.k3 in the kernel is 10 times
slower than the existing kernel code. See the timing program
below.
(4) If the justification for storing cpu number in ar.k<n> is to speed
up user space, how can user space tell if the current kernel stores
the physical address of the per-cpu data in k3 or if it stores the
cpu number in k3? Detecting which variant of the kernel is running
will slow down user space.
Timing results on 'modprobe measure'
init_measure: empty_loop 2000007 cpu_loop 3000011 k3_loop 11999992
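To make the arithmetic behind these totals explicit: dividing by the 1000000 iterations gives roughly 2, 3 and 12 cycles per iteration, and subtracting the empty-loop baseline leaves about 1 cycle for the thread_info read against about 10 for the ar.k3 read. A small sketch of that calculation (the helper name net_cycles is made up for illustration):

```c
#define LOOPS 1000000UL

/* Cycles per iteration after subtracting the empty-loop baseline. */
static double net_cycles(unsigned long loop_total, unsigned long empty_total)
{
    return (double)(loop_total - empty_total) / (double)LOOPS;
}
```

With the figures above, net_cycles(3000011, 2000007) is about 1.0 and net_cycles(11999992, 2000007) is about 10.0, which is where the factor of 10 in point (3) comes from.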
module measure.c
-----------------------------------------------------------------------
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/preempt.h>
#include <asm/kregs.h>
#include <asm/timex.h>
MODULE_LICENSE("GPL");
#define LOOPS 1000000
static int __init init_measure(void)
{
	int loop;
	register int cpu;
	unsigned long start, end, empty_loop, cpu_loop, k3_loop;

	printk("%s: start\n", __FUNCTION__);
	preempt_disable();

	local_irq_disable();
	start = get_cycles();
	barrier();
	for (loop = 0; loop < LOOPS; ++loop) {
		/* ensure that all loops are the same size (2 bundles) */
		asm volatile ("nop 0; nop 0; nop 0;");
		barrier();
	};
	end = get_cycles();
	barrier();
	local_irq_enable();
	empty_loop = end - start;

	local_irq_disable();
	start = get_cycles();
	barrier();
	for (loop = 0; loop < LOOPS; ++loop) {
		/* hand code the read of smp_processor_id() to stop gcc moving
		 * the address calculation outside the loop
		 */
		asm volatile ("adds r14=%0,r13"
			      ";;"
			      "ld4 r15=[r14]"
			      : :
			      "i" (IA64_TASK_SIZE + offsetof(struct thread_info, cpu)) :
			      "r14", "r15" );
		barrier();
	};
	end = get_cycles();
	barrier();
	local_irq_enable();
	cpu_loop = end - start;

	local_irq_disable();
	start = get_cycles();
	barrier();
	for (loop = 0; loop < LOOPS; ++loop) {
		cpu = ia64_get_kr(IA64_KR_PER_CPU_DATA);
		barrier();
	};
	end = get_cycles();
	barrier();
	local_irq_enable();
	k3_loop = end - start;

	preempt_enable();
	printk("%s: empty_loop %ld cpu_loop %ld k3_loop %ld\n",
	       __FUNCTION__, empty_loop, cpu_loop, k3_loop);
	return 0;
}

static void __exit exit_measure(void)
{
	printk("%s: start\n", __FUNCTION__);
	printk("%s: end\n", __FUNCTION__);
}

module_init(init_measure)
module_exit(exit_measure)
-----------------------------------------------------------------------
objdump of the interesting bits (the three loops):
empty loop:
40: 09 08 00 50 00 21 [MMI] mov r1=r40
46: 00 00 00 02 00 e0 nop.m 0x0
4c: 81 6c 64 84 adds r15=3272,r13;;
50: 0a 18 00 1e 10 10 [MMI] ld4 r3=[r15];;
56: 20 08 0c 00 42 00 adds r2=1,r3
5c: 00 00 04 00 nop.i 0x0
60: 0b 00 00 00 01 00 [MMI] nop.m 0x0;;
66: 00 10 3c 20 23 00 st4 [r15]=r2
6c: 00 00 04 00 nop.i 0x0;;
70: 0b 00 00 02 07 00 [MMI] rsm 0x4000;;
76: 50 02 b0 44 08 00 mov.m r37=ar.itc
7c: 00 00 04 00 nop.i 0x0;;
80: 0b 70 fc 78 84 24 [MMI] mov r14=999999;;
86: 00 00 00 02 00 00 nop.m 0x0
8c: e0 08 aa 00 mov.i ar.lc=r14;;
90: 01 00 00 00 01 00 [MII] nop.m 0x0
96: 00 00 00 02 00 00 nop.i 0x0
9c: 00 00 04 00 nop.i 0x0;;
a0: 10 00 00 00 01 00 [MIB] nop.m 0x0
a6: 00 00 00 02 00 a0 nop.i 0x0
ac: f0 ff ff 48 br.cloop.sptk.few 90 <init_module+0x90>
b0: 0b 20 01 58 22 04 [MMI] mov.m r36=ar.itc;;
b6: 00 00 04 0c 00 00 ssm 0x4000
bc: 00 00 04 00 nop.i 0x0;;
c0: 0b 00 00 00 30 00 [MMI] srlz.d;;
Read smp_processor_id:
c6: 00 00 04 0e 00 00 rsm 0x4000
cc: 00 00 04 00 nop.i 0x0;;
d0: 01 18 01 58 22 04 [MII] mov.m r35=ar.itc
d6: 00 00 00 02 00 00 nop.i 0x0
dc: 00 00 04 00 nop.i 0x0;;
e0: 0a 40 fc 78 84 24 [MMI] mov r8=999999;;
e6: 00 00 00 02 00 00 nop.m 0x0
ec: 80 08 aa 00 mov.i ar.lc=r8
f0: 0b 70 d0 1a 19 21 [MMI] adds r14=3252,r13;;
f6: f0 00 38 20 20 00 ld4 r15=[r14]
fc: 00 00 04 00 nop.i 0x0;;
100: 10 00 00 00 01 00 [MIB] nop.m 0x0
106: 00 00 00 02 00 a0 nop.i 0x0
10c: f0 ff ff 48 br.cloop.sptk.few f0 <init_module+0xf0>
110: 0b 10 01 58 22 04 [MMI] mov.m r34=ar.itc;;
116: 00 00 04 0c 00 00 ssm 0x4000
11c: 00 00 04 00 nop.i 0x0;;
120: 0b 00 00 00 30 00 [MMI] srlz.d;;
Read ar.k3:
126: 00 00 04 0e 00 00 rsm 0x4000
12c: 00 00 04 00 nop.i 0x0;;
130: 01 08 01 58 22 04 [MII] mov.m r33=ar.itc
136: 00 00 00 02 00 00 nop.i 0x0
13c: 00 00 04 00 nop.i 0x0;;
140: 0a 48 fc 78 84 24 [MMI] mov r9=999999;;
146: 00 00 00 02 00 00 nop.m 0x0
14c: 90 08 aa 00 mov.i ar.lc=r9
150: 01 70 00 06 22 04 [MII] mov.m r14=ar.k3
156: 00 00 00 02 00 00 nop.i 0x0
15c: 00 00 04 00 nop.i 0x0;;
160: 10 00 00 00 01 00 [MIB] nop.m 0x0
166: 00 00 00 02 00 a0 nop.i 0x0
16c: f0 ff ff 48 br.cloop.sptk.few 150 <init_module+0x150>
170: 0b 00 01 58 22 04 [MMI] mov.m r32=ar.itc;;
176: 00 00 04 0c 00 00 ssm 0x4000
17c: 00 00 04 00 nop.i 0x0;;
180: 01 00 00 00 30 00 [MII] srlz.d
* RE: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (9 preceding siblings ...)
2007-02-08 8:40 ` Keith Owens
@ 2007-02-08 18:03 ` Luck, Tony
2007-02-08 23:59 ` Keith Owens
11 siblings, 0 replies; 13+ messages in thread
From: Luck, Tony @ 2007-02-08 18:03 UTC (permalink / raw)
To: linux-ia64
Summarizing thread that I was sleeping through:
1) Use ar.kr2 for ...
No ... as Keith pointed out there is debug code in ivt.S to use
it to track the last few traps, and if that isn't being used it
is very handy for other debugging uses. I won't give up the last
of these registers unless it is for some cause which is a clear and
obvious major win in performance or functionality. An allegedly
faster way to find the cpu number is not a clear win (if the
percpu variable is in cache, then it is clearly faster to read
from memory).
2) Use ar.kr3 for cpu number, and then make the MCA code index an
array to get the phys address of the per-cpu area.
Messes with a lot of MCA code, and for a microscopic improvement
over my proposed getcpu() code. Yes, you can avoid ever doing the system
call ... but only running the system call when you have migrated to a
different cpu should cover most calls [possible exception ... a future
scheduler might frequently move a process between logical cpus that
share all cache levels, since there is no cache penalty for running
on other cpus in the same cache domain]. So I'm not looking favourably
at this option at the moment ... but could change my mind if presented
with some data on getcpu() usage.
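For illustration only, the difference between the current scheme and option 2 for the MCA path can be sketched as below. The table, its addresses and every name ending in _sketch are hypothetical; the real handler is assembler (GET_THIS_PADDR) and would also have to reach such a table without relying on TLB entries.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical table of per-cpu data physical addresses, one per cpu. */
static const uint64_t percpu_paddr_sketch[NR_CPUS_SKETCH] = {
    0x04000000, 0x04010000, 0x04020000, 0x04030000
};

/* Current scheme: ar.k3 already holds the physical address. */
static uint64_t this_paddr_current(uint64_t k3)
{
    return k3;
}

/* Option 2: ar.k3 holds the cpu number, so one more load is needed. */
static uint64_t this_paddr_proposed(uint64_t k3)
{
    return percpu_paddr_sketch[k3];
}
```

The extra indexing step is small in C, but in the MCA assembler it means touching and retesting code that is hard to exercise.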
-Tony
* Re: [RFC Patch]Use ar.kr2 for smp_processor_id
2007-02-08 3:28 [RFC Patch]Use ar.kr2 for smp_processor_id Zou Nan hai
` (10 preceding siblings ...)
2007-02-08 18:03 ` Luck, Tony
@ 2007-02-08 23:59 ` Keith Owens
11 siblings, 0 replies; 13+ messages in thread
From: Keith Owens @ 2007-02-08 23:59 UTC (permalink / raw)
To: linux-ia64
Zou Nan hai (on 08 Feb 2007 15:14:54 +0800) wrote:
> I think using a static value to cache getcpu will heavily bounced on
>that cache line contain the static value if multi cpus calls getcpu very
>frequently.
AFAICT, Tony's suggestion[*] is all in user space, e.g. glibc. Each
application will get its own thread local copy of the static variable,
there is no globally shared static value so no cache line bouncing.
> then implement current_thread_info()->cpu in fsys call should be
>better?
Maybe. Implement it, time it and see which is faster.
[*] http://marc.theaimsgroup.com/?l=linux-ia64&m=117087180232044&w=2
BTW Tony, in that code there is no need to initialise cpu to ~0 nor to
test for that value. On the first call it is guaranteed that ar.k3 != save_ar_k3.
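A minimal sketch of the user-space caching idea as described in this thread, not Tony's actual code: read_token_sketch() stands in for reading ar.k3 and slow_getcpu_sketch() for the real getcpu system call, and both are hypothetical stubs so the logic can be run anywhere. The saved token is left zero-initialised, relying on the point above that it cannot match ar.k3 on the first call.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the hardware register and the system call. */
static uint64_t fake_ar_k3 = 0x1000;
static int fake_cpu = 0;
static int syscalls_made = 0;

static uint64_t read_token_sketch(void)
{
    return fake_ar_k3;       /* real code would read ar.k3 */
}

static int slow_getcpu_sketch(void)
{
    ++syscalls_made;         /* real code would enter the kernel */
    return fake_cpu;
}

/* Per-thread cache, so no cache line is shared between threads. */
static __thread uint64_t saved_token;   /* zero never equals a real token */
static __thread int saved_cpu;

static int cached_getcpu_sketch(void)
{
    uint64_t token = read_token_sketch();

    if (token != saved_token) {          /* first call, or migrated */
        saved_cpu = slow_getcpu_sketch();
        saved_token = token;
    }
    return saved_cpu;
}
```

The slow path runs only when the token changes, that is, on the first call and after a migration to another cpu. The sketch ignores a migration that happens between reading the token and the slow call.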