bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] arm64: implement raw_smp_processor_id() using thread_info
@ 2024-05-01 15:42 Puranjay Mohan
  2024-05-01 16:23 ` Christoph Lameter (Ampere)
  2024-05-01 16:41 ` Mark Rutland
  0 siblings, 2 replies; 4+ messages in thread
From: Puranjay Mohan @ 2024-05-01 15:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Sumit Garg, Stephen Boyd,
	Douglas Anderson, Peter Zijlstra (Intel), Thomas Gleixner,
	Mark Rutland, linux-arm-kernel, linux-kernel, bpf
  Cc: puranjay12

ARM64 defines THREAD_INFO_IN_TASK which means the cpu id can be found
from current_thread_info()->cpu.

Implement raw_smp_processor_id() using the above. This decreases the
number of emitted instructions like in the following example:

Dump of assembler code for function bpf_get_smp_processor_id:
   0xffff8000802cd608 <+0>:     nop
   0xffff8000802cd60c <+4>:     nop
   0xffff8000802cd610 <+8>:     adrp    x0, 0xffff800082138000
   0xffff8000802cd614 <+12>:    mrs     x1, tpidr_el1
   0xffff8000802cd618 <+16>:    add     x0, x0, #0x8
   0xffff8000802cd61c <+20>:    ldrsw   x0, [x0, x1]
   0xffff8000802cd620 <+24>:    ret

After this patch:

Dump of assembler code for function bpf_get_smp_processor_id:
   0xffff8000802c9130 <+0>:     nop
   0xffff8000802c9134 <+4>:     nop
   0xffff8000802c9138 <+8>:     mrs     x0, sp_el0
   0xffff8000802c913c <+12>:    ldr     w0, [x0, #24]
   0xffff8000802c9140 <+16>:    ret

A microbenchmark[1] was built to measure the performance improvement
provided by this change. It calls the following function given number of
times and finds the runtime overhead:

static noinline int get_cpu_id(void)
{
	return smp_processor_id();
}

Run the benchmark like:
 modprobe smp_processor_id nr_function_calls=1000000000

      +--------------------------+------------------------+
      |        | Number of Calls |    Time taken          |
      +--------+-----------------+------------------------+
      | Before |   1000000000    |   1602888401ns         |
      +--------+-----------------+------------------------+
      | After  |   1000000000    |   1206212658ns         |
      +--------+-----------------+------------------------+
      |  Difference (decrease)   |   396675743ns (24.74%) |
      +---------------------------------------------------+

This improvement is in this very specific microbenchmark but it proves
the point.

The percpu variable cpu_number is left as it is because it is used in
set_smp_ipi_range()

[1] https://github.com/puranjaymohan/linux/commit/77d3fdd

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 arch/arm64/include/asm/smp.h | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index efb13112b408..88fd2ab805ec 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -34,13 +34,9 @@
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 /*
- * We don't use this_cpu_read(cpu_number) as that has implicit writes to
- * preempt_count, and associated (compiler) barriers, that we'd like to avoid
- * the expense of. If we're preemptible, the value can be stale at use anyway.
- * And we can't use this_cpu_ptr() either, as that winds up recursing back
- * here under CONFIG_DEBUG_PREEMPT=y.
+ * This relies on THREAD_INFO_IN_TASK, but arm64 defines that unconditionally.
  */
-#define raw_smp_processor_id() (*raw_cpu_ptr(&cpu_number))
+#define raw_smp_processor_id() (current_thread_info()->cpu)
 
 /*
  * Logical CPU mapping.
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] arm64: implement raw_smp_processor_id() using thread_info
  2024-05-01 15:42 [PATCH] arm64: implement raw_smp_processor_id() using thread_info Puranjay Mohan
@ 2024-05-01 16:23 ` Christoph Lameter (Ampere)
  2024-05-01 16:41 ` Mark Rutland
  1 sibling, 0 replies; 4+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-05-01 16:23 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Catalin Marinas, Will Deacon, Sumit Garg, Stephen Boyd,
	Douglas Anderson, Peter Zijlstra (Intel), Thomas Gleixner,
	Mark Rutland, linux-arm-kernel, linux-kernel, bpf, puranjay12,
	Russell King

On Wed, 1 May 2024, Puranjay Mohan wrote:

> Dump of assembler code for function bpf_get_smp_processor_id:
>   0xffff8000802cd608 <+0>:     nop
>   0xffff8000802cd60c <+4>:     nop
>   0xffff8000802cd610 <+8>:     adrp    x0, 0xffff800082138000
>   0xffff8000802cd614 <+12>:    mrs     x1, tpidr_el1
>   0xffff8000802cd618 <+16>:    add     x0, x0, #0x8
>   0xffff8000802cd61c <+20>:    ldrsw   x0, [x0, x1]
>   0xffff8000802cd620 <+24>:    ret

In general arm64 has inefficient per cpu variable access. On x86 it is 
possible to access the processor id via a segment register relative 
access with a single instruction.

Arm64 calculates the address of a percpu variable for each access. This 
result in inefficiencies because:

1. The address calculation is processor specific. Therefore preemption 
needs to be disabled during the calculation of the address and while it is 
in use.

2. Additional registers are used causing the compiler to potentially 
generate less efficient code.

3. Even RMV instructions on percpu variables require the disabling of 
preemption due to the address calculation.

Russel King has a patchset for NUMA text replication and as part of that 
he introduces per cpu kernel page tables.

https://lwn.net/Articles/957023/

If we had per cpu page tables then we could create a mapping for a fixed 
address virtual memory range to the physical per cpu area for each cpu.

With that the address calculation would no longer be necessary for per 
cpu variable access and workarounds like this would not be necessary 
anymore.

The retrieval of the cpu id would be a single instruction that 
performs a load from a fixed virtual address. No preemption etc would be 
required.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] arm64: implement raw_smp_processor_id() using thread_info
  2024-05-01 15:42 [PATCH] arm64: implement raw_smp_processor_id() using thread_info Puranjay Mohan
  2024-05-01 16:23 ` Christoph Lameter (Ampere)
@ 2024-05-01 16:41 ` Mark Rutland
  2024-05-01 17:12   ` Puranjay Mohan
  1 sibling, 1 reply; 4+ messages in thread
From: Mark Rutland @ 2024-05-01 16:41 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Catalin Marinas, Will Deacon, Sumit Garg, Stephen Boyd,
	Douglas Anderson, Peter Zijlstra (Intel), Thomas Gleixner,
	linux-arm-kernel, linux-kernel, bpf, puranjay12, Ard Biesheuvel

Hi Puranjay,

On Wed, May 01, 2024 at 03:42:36PM +0000, Puranjay Mohan wrote:
> ARM64 defines THREAD_INFO_IN_TASK which means the cpu id can be found
> from current_thread_info()->cpu.

Nice!

This is something that we'd wanted to do, but there were some historical
reasons that prevented that. I think it'd be worth describing that in the
commit message, e.g.

| Historically, arm64 implemented raw_smp_processor_id() as a read of
| current_thread_info()->cpu. This changed when arm64 moved thread_info into
| task struct, as at the time CONFIG_THREAD_INFO_IN_TASK made core code use
| thread_struct::cpu for the cpu number, and due to header dependencies
| prevented using this in raw_smp_processor_id(). As a workaround, we moved to
| using a percpu variable in commit:
|
|   57c82954e77fa12c ("arm64: make cpu number a percpu variable")
|
| Since then, thread_info::cpu was reintroduced, and core code was made to use
| this in commits:
|
|   001430c1910df65a ("arm64: add CPU field to struct thread_info")
|   bcf9033e5449bdca ("sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y")
|
| Consequently it is possible to use current_thread_info()->cpu again.

> Implement raw_smp_processor_id() using the above. This decreases the
> number of emitted instructions like in the following example:
> 
> Dump of assembler code for function bpf_get_smp_processor_id:
>    0xffff8000802cd608 <+0>:     nop
>    0xffff8000802cd60c <+4>:     nop
>    0xffff8000802cd610 <+8>:     adrp    x0, 0xffff800082138000
>    0xffff8000802cd614 <+12>:    mrs     x1, tpidr_el1
>    0xffff8000802cd618 <+16>:    add     x0, x0, #0x8
>    0xffff8000802cd61c <+20>:    ldrsw   x0, [x0, x1]
>    0xffff8000802cd620 <+24>:    ret
> 
> After this patch:
> 
> Dump of assembler code for function bpf_get_smp_processor_id:
>    0xffff8000802c9130 <+0>:     nop
>    0xffff8000802c9134 <+4>:     nop
>    0xffff8000802c9138 <+8>:     mrs     x0, sp_el0
>    0xffff8000802c913c <+12>:    ldr     w0, [x0, #24]
>    0xffff8000802c9140 <+16>:    ret
> 
> A microbenchmark[1] was built to measure the performance improvement
> provided by this change. It calls the following function given number of
> times and finds the runtime overhead:
> 
> static noinline int get_cpu_id(void)
> {
> 	return smp_processor_id();
> }
> 
> Run the benchmark like:
>  modprobe smp_processor_id nr_function_calls=1000000000
> 
>       +--------------------------+------------------------+
>       |        | Number of Calls |    Time taken          |
>       +--------+-----------------+------------------------+
>       | Before |   1000000000    |   1602888401ns         |
>       +--------+-----------------+------------------------+
>       | After  |   1000000000    |   1206212658ns         |
>       +--------+-----------------+------------------------+
>       |  Difference (decrease)   |   396675743ns (24.74%) |
>       +---------------------------------------------------+
> 
> This improvement is in this very specific microbenchmark but it proves
> the point.
> 
> The percpu variable cpu_number is left as it is because it is used in
> set_smp_ipi_range()
> 
> [1] https://github.com/puranjaymohan/linux/commit/77d3fdd
> 
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  arch/arm64/include/asm/smp.h | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> index efb13112b408..88fd2ab805ec 100644
> --- a/arch/arm64/include/asm/smp.h
> +++ b/arch/arm64/include/asm/smp.h
> @@ -34,13 +34,9 @@
>  DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
>  
>  /*
> - * We don't use this_cpu_read(cpu_number) as that has implicit writes to
> - * preempt_count, and associated (compiler) barriers, that we'd like to avoid
> - * the expense of. If we're preemptible, the value can be stale at use anyway.
> - * And we can't use this_cpu_ptr() either, as that winds up recursing back
> - * here under CONFIG_DEBUG_PREEMPT=y.
> + * This relies on THREAD_INFO_IN_TASK, but arm64 defines that unconditionally.
>   */
> -#define raw_smp_processor_id() (*raw_cpu_ptr(&cpu_number))
> +#define raw_smp_processor_id() (current_thread_info()->cpu)

I think we can (and should) delete the comment entirely.

Mark.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] arm64: implement raw_smp_processor_id() using thread_info
  2024-05-01 16:41 ` Mark Rutland
@ 2024-05-01 17:12   ` Puranjay Mohan
  0 siblings, 0 replies; 4+ messages in thread
From: Puranjay Mohan @ 2024-05-01 17:12 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Sumit Garg, Stephen Boyd,
	Douglas Anderson, Peter Zijlstra (Intel), Thomas Gleixner,
	linux-arm-kernel, linux-kernel, bpf, Ard Biesheuvel

Mark Rutland <mark.rutland@arm.com> writes:

> Hi Puranjay,
>
> On Wed, May 01, 2024 at 03:42:36PM +0000, Puranjay Mohan wrote:
>> ARM64 defines THREAD_INFO_IN_TASK which means the cpu id can be found
>> from current_thread_info()->cpu.
>
> Nice!
>
> This is something that we'd wanted to do, but there were some historical
> reasons that prevented that. I think it'd be worth describing that in the
> commit message, e.g.
>
> | Historically, arm64 implemented raw_smp_processor_id() as a read of
> | current_thread_info()->cpu. This changed when arm64 moved thread_info into
> | task struct, as at the time CONFIG_THREAD_INFO_IN_TASK made core code use
> | thread_struct::cpu for the cpu number, and due to header dependencies
> | prevented using this in raw_smp_processor_id(). As a workaround, we moved to
> | using a percpu variable in commit:
> |
> |   57c82954e77fa12c ("arm64: make cpu number a percpu variable")
> |
> | Since then, thread_info::cpu was reintroduced, and core code was made to use
> | this in commits:
> |
> |   001430c1910df65a ("arm64: add CPU field to struct thread_info")
> |   bcf9033e5449bdca ("sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y")
> |
> | Consequently it is possible to use current_thread_info()->cpu again.
>
>> Implement raw_smp_processor_id() using the above. This decreases the
>> number of emitted instructions like in the following example:
>> 
>> Dump of assembler code for function bpf_get_smp_processor_id:
>>    0xffff8000802cd608 <+0>:     nop
>>    0xffff8000802cd60c <+4>:     nop
>>    0xffff8000802cd610 <+8>:     adrp    x0, 0xffff800082138000
>>    0xffff8000802cd614 <+12>:    mrs     x1, tpidr_el1
>>    0xffff8000802cd618 <+16>:    add     x0, x0, #0x8
>>    0xffff8000802cd61c <+20>:    ldrsw   x0, [x0, x1]
>>    0xffff8000802cd620 <+24>:    ret
>> 
>> After this patch:
>> 
>> Dump of assembler code for function bpf_get_smp_processor_id:
>>    0xffff8000802c9130 <+0>:     nop
>>    0xffff8000802c9134 <+4>:     nop
>>    0xffff8000802c9138 <+8>:     mrs     x0, sp_el0
>>    0xffff8000802c913c <+12>:    ldr     w0, [x0, #24]
>>    0xffff8000802c9140 <+16>:    ret
>> 
>> A microbenchmark[1] was built to measure the performance improvement
>> provided by this change. It calls the following function given number of
>> times and finds the runtime overhead:
>> 
>> static noinline int get_cpu_id(void)
>> {
>> 	return smp_processor_id();
>> }
>> 
>> Run the benchmark like:
>>  modprobe smp_processor_id nr_function_calls=1000000000
>> 
>>       +--------------------------+------------------------+
>>       |        | Number of Calls |    Time taken          |
>>       +--------+-----------------+------------------------+
>>       | Before |   1000000000    |   1602888401ns         |
>>       +--------+-----------------+------------------------+
>>       | After  |   1000000000    |   1206212658ns         |
>>       +--------+-----------------+------------------------+
>>       |  Difference (decrease)   |   396675743ns (24.74%) |
>>       +---------------------------------------------------+
>> 
>> This improvement is in this very specific microbenchmark but it proves
>> the point.
>> 
>> The percpu variable cpu_number is left as it is because it is used in
>> set_smp_ipi_range()
>> 
>> [1] https://github.com/puranjaymohan/linux/commit/77d3fdd
>> 
>> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> ---
>>  arch/arm64/include/asm/smp.h | 8 ++------
>>  1 file changed, 2 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
>> index efb13112b408..88fd2ab805ec 100644
>> --- a/arch/arm64/include/asm/smp.h
>> +++ b/arch/arm64/include/asm/smp.h
>> @@ -34,13 +34,9 @@
>>  DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
>>  
>>  /*
>> - * We don't use this_cpu_read(cpu_number) as that has implicit writes to
>> - * preempt_count, and associated (compiler) barriers, that we'd like to avoid
>> - * the expense of. If we're preemptible, the value can be stale at use anyway.
>> - * And we can't use this_cpu_ptr() either, as that winds up recursing back
>> - * here under CONFIG_DEBUG_PREEMPT=y.
>> + * This relies on THREAD_INFO_IN_TASK, but arm64 defines that unconditionally.
>>   */
>> -#define raw_smp_processor_id() (*raw_cpu_ptr(&cpu_number))
>> +#define raw_smp_processor_id() (current_thread_info()->cpu)
>
> I think we can (and should) delete the comment entirely.

Sure,
I will add the information to the commit message and remove this comment
in the next version.

I think it would be useful to remove the cpu_number percpu variable as
well.

We can use &irq_stat in place of &cpu_number in set_smp_ipi_range() in
the calls to request_percpu_nmi/irq() as this is just a dummy value and
ipi_handler() doesn't use it.

There are no other users of cpu_number.

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-05-01 17:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-01 15:42 [PATCH] arm64: implement raw_smp_processor_id() using thread_info Puranjay Mohan
2024-05-01 16:23 ` Christoph Lameter (Ampere)
2024-05-01 16:41 ` Mark Rutland
2024-05-01 17:12   ` Puranjay Mohan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).