All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Yajun Deng" <yajun.deng@linux.dev>
To: "Christophe Leroy" <christophe.leroy@csgroup.eu>,
	akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/early_ioremap: Combine two loops to improve performance
Date: Tue, 27 Sep 2022 10:30:21 +0000	[thread overview]
Message-ID: <59951c46c30608a0909632828102bce1@linux.dev> (raw)
In-Reply-To: <f06b9b34-bb74-4d15-88a6-b44de2adcc3a@csgroup.eu>

September 27, 2022 4:47 PM, "Christophe Leroy" <christophe.leroy@csgroup.eu> wrote:

> Le 27/09/2022 à 09:52, Yajun Deng a écrit :
> 
>> The first loop will waring once if prev_map is init, we can add a
>> boolean variable to do that. So those two loops can be combined to
>> improve performance.
> 
> Do you have evidence of the improved performance ?
> 
> Looking at the generated code, I have the fealing that the performance
> is reduced by the !init_prev_map that is checked at every lap.
> 
> Before the patch:
> 
> 00000250 <early_ioremap_setup>:
> 250: 3d 20 00 00 lis r9,0
> 252: R_PPC_ADDR16_HA .init.data
> 254: 39 29 00 00 addi r9,r9,0
> 256: R_PPC_ADDR16_LO .init.data
> 258: 38 e0 00 10 li r7,16
> 25c: 39 09 00 04 addi r8,r9,4
> 260: 39 40 00 00 li r10,0
> 264: 7c e9 03 a6 mtctr r7
> 
> ---- First loop : ----
> 268: 55 47 10 3a rlwinm r7,r10,2,0,29
> 26c: 7c e7 40 2e lwzx r7,r7,r8
> 270: 2c 07 00 00 cmpwi r7,0
> 274: 41 a2 00 08 beq 27c <early_ioremap_setup+0x2c>
> 278: 0f e0 00 00 twui r0,0
> 27c: 39 4a 00 01 addi r10,r10,1
> 280: 42 00 ff e8 bdnz 268 <early_ioremap_setup+0x18>
> 
> 284: 39 00 00 10 li r8,16
> 288: 39 29 00 84 addi r9,r9,132
> 28c: 3d 40 ff b0 lis r10,-80
> 290: 7d 09 03 a6 mtctr r8
> 
> ---- Second loop : ----
> 294: 95 49 00 04 stwu r10,4(r9)
> 298: 3d 4a 00 04 addis r10,r10,4
> 29c: 42 00 ff f8 bdnz 294 <early_ioremap_setup+0x44>
> 
> 2a0: 4e 80 00 20 blr
> 
> After the patch:
> 
> 00000250 <early_ioremap_setup>:
> 250: 3d 20 00 00 lis r9,0
> 252: R_PPC_ADDR16_HA .init.data
> 254: 39 29 00 00 addi r9,r9,0
> 256: R_PPC_ADDR16_LO .init.data
> 258: 39 00 00 10 li r8,16
> 25c: 38 c9 00 04 addi r6,r9,4
> 260: 39 40 00 00 li r10,0
> 264: 39 29 00 88 addi r9,r9,136
> 268: 38 e0 00 00 li r7,0
> 26c: 7d 09 03 a6 mtctr r8
> 
> --- Loop ---
> 270: 70 e8 00 01 andi. r8,r7,1
> 274: 40 82 00 18 bne 28c <early_ioremap_setup+0x3c>
> 278: 7d 0a 30 2e lwzx r8,r10,r6
> 27c: 2c 08 00 00 cmpwi r8,0
> 280: 41 a2 00 0c beq 28c <early_ioremap_setup+0x3c>
> 284: 0f e0 00 00 twui r0,0
> 288: 38 e0 00 01 li r7,1
> 28c: 55 48 80 1e rlwinm r8,r10,16,0,15
> 290: 3d 08 ff b0 addis r8,r8,-80
> 294: 7d 0a 49 2e stwx r8,r10,r9
> 298: 39 4a 00 04 addi r10,r10,4
> 29c: 42 00 ff d4 bdnz 270 <early_ioremap_setup+0x20>
> 
> 2a0: 4e 80 00 20 blr
> 

Yes, I do it.

test1.c:
<===================================================================>
#include <stdio.h>
#include <stdlib.h>

#define FIX_BTMAPS_SLOTS 8

static void *prev_map[FIX_BTMAPS_SLOTS];
static unsigned long slot_virt[FIX_BTMAPS_SLOTS];

int main(void)
{
        int i;

        for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
                if (prev_map[i]) {
                        printf("warning!\n");
                        break;
                }

        for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
                slot_virt[i] = FIX_BTMAPS_SLOTS * i;

        return 0;
}
<===================================================================>

test2.c

<===================================================================>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define FIX_BTMAPS_SLOTS 8

static void *prev_map[FIX_BTMAPS_SLOTS];
static unsigned long slot_virt[FIX_BTMAPS_SLOTS];

int main(void)
{
        bool init_prev_map = false;
        int i;

        for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
                if (!init_prev_map && prev_map[i])
                        init_prev_map = true;

                slot_virt[i] = FIX_BTMAPS_SLOTS * i;
        }

        return 0;
}
<===================================================================>

$ gcc test1.c -o test1  && gcc test2.c -o test2
<===================================================================>

$ perf stat ./test1   

 Performance counter stats for './test1':

              0.17 msec task-clock:u              #    0.444 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                43      page-faults:u             #    0.246 M/sec                  
           154,813      cycles:u                  #    0.885 GHz                    
           106,234      instructions:u            #    0.69  insn per cycle         
            21,510      branches:u                #  122.990 M/sec                  
             1,409      branch-misses:u           #    6.55% of all branches        

       0.000393857 seconds time elapsed

       0.000420000 seconds user
       0.000000000 seconds sys



$ perf stat ./test2

 Performance counter stats for './test2':

              0.17 msec task-clock:u              #    0.439 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                43      page-faults:u             #    0.249 M/sec                  
           152,744      cycles:u                  #    0.884 GHz                    
           105,282      instructions:u            #    0.69  insn per cycle         
            21,342      branches:u                #  123.545 M/sec                  
             1,334      branch-misses:u           #    6.25% of all branches        

       0.000393084 seconds time elapsed

       0.000412000 seconds user
       0.000000000 seconds sys

<===================================================================>
It seems almost the same. If we change FIX_BTMAPS_SLOTS from 8 to 80000, 
It takes less time after the patch.
<===================================================================>

$ perf stat ./test1

 Performance counter stats for './test1':

              0.73 msec task-clock:u              #    0.768 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               355      page-faults:u             #    0.484 M/sec                  
         1,532,520      cycles:u                  #    2.087 GHz                    
         1,786,378      instructions:u            #    1.17  insn per cycle         
           261,798      branches:u                #  356.594 M/sec                  
             1,580      branch-misses:u           #    0.60% of all branches        

       0.000956129 seconds time elapsed

       0.000000000 seconds user
       0.000981000 seconds sys


$ perf stat ./test2

 Performance counter stats for './test2':

              0.60 msec task-clock:u              #    0.732 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               355      page-faults:u             #    0.589 M/sec                  
         1,066,851      cycles:u                  #    1.769 GHz                    
         1,865,418      instructions:u            #    1.75  insn per cycle         
           261,630      branches:u                #  433.808 M/sec                  
             1,369      branch-misses:u           #    0.52% of all branches        

       0.000824064 seconds time elapsed

       0.000846000 seconds user
       0.000000000 seconds sys






> Maybe using WARN_ON_ONCE() could be the solution. But looking at the
> code generated if using it, I still have the feeling that GCC has a
> better code with the original code.
> 
>> Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
>> ---
>> mm/early_ioremap.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>> 
>> diff --git a/mm/early_ioremap.c b/mm/early_ioremap.c
>> index 9bc12e526ed0..3076fb47c685 100644
>> --- a/mm/early_ioremap.c
>> +++ b/mm/early_ioremap.c
>> @@ -70,14 +70,15 @@ static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata;
>> 
>> void __init early_ioremap_setup(void)
>> {
>> + bool init_prev_map = false;
>> int i;
>> 
>> - for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
>> - if (WARN_ON(prev_map[i]))
>> - break;
>> + for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
>> + if (!init_prev_map && WARN_ON(prev_map[i]))
>> + init_prev_map = true;
>> 
>> - for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
>> slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
>> + }
>> }
>> 
>> static int __init check_early_ioremap_leak(void)


      reply	other threads:[~2022-09-27 10:30 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-27  7:52 [PATCH] mm/early_ioremap: Combine two loops to improve performance Yajun Deng
2022-09-27  8:47 ` Christophe Leroy
2022-09-27 10:30   ` Yajun Deng [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=59951c46c30608a0909632828102bce1@linux.dev \
    --to=yajun.deng@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=christophe.leroy@csgroup.eu \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.