[PATCH v1] mm/contpte: Optimize loop to reduce redundant operations

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
@ 2025-04-07  9:22 Xavier
  2025-04-07 11:29 ` Lance Yang
  2025-04-10 21:25 ` Barry Song
  0 siblings, 2 replies; 49+ messages in thread
From: Xavier @ 2025-04-07  9:22 UTC (permalink / raw)
  To: catalin.marinas, will, akpm, baohua, ryan.roberts, ioworker0
  Cc: xavier_qy, linux-arm-kernel, linux-kernel

This commit optimizes the contpte_ptep_get function by adding early
 termination logic. It checks if the dirty and young bits of orig_pte
 are already set and skips redundant bit-setting operations during
 the loop. This reduces unnecessary iterations and improves performance.

Signed-off-by: Xavier <xavier_qy@163.com>
---
 arch/arm64/mm/contpte.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..ca15d8f52d14 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 
 	pte_t pte;
 	int i;
+	bool dirty = false;
+	bool young = false;
 
 	ptep = contpte_align_down(ptep);
 
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
-		if (pte_dirty(pte))
+		if (!dirty && pte_dirty(pte)) {
+			dirty = true;
 			orig_pte = pte_mkdirty(orig_pte);
+		}
 
-		if (pte_young(pte))
+		if (!young && pte_young(pte)) {
+			young = true;
 			orig_pte = pte_mkyoung(orig_pte);
+		}
+
+		if (dirty && young)
+			break;
 	}
 
 	return orig_pte;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-07  9:22 [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations Xavier
@ 2025-04-07 11:29 ` Lance Yang
  2025-04-07 12:56   ` Xavier
  2025-04-10 21:25 ` Barry Song
  1 sibling, 1 reply; 49+ messages in thread
From: Lance Yang @ 2025-04-07 11:29 UTC (permalink / raw)
  To: xavier_qy
  Cc: akpm, baohua, catalin.marinas, ioworker0, linux-arm-kernel,
	linux-kernel, ryan.roberts, will

Thanks for the patch. Would the following change be better?

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 55107d27d3f8..64eb3b2fbf06 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -174,6 +174,9 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 
 		if (pte_young(pte))
 			orig_pte = pte_mkyoung(orig_pte);
+
+		if (pte_young(orig_pte) && pte_dirty(orig_pte))
+			break;
 	}
 
 	return orig_pte;
-- 

We can check the orig_pte flags directly instead of using extra boolean
variables, which gives us an early-exit when both dirty and young flags
are set.

Also, is this optimization really needed for the common case?

Thanks,
Lance


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re:Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-07 11:29 ` Lance Yang
@ 2025-04-07 12:56   ` Xavier
  2025-04-07 13:31     ` Lance Yang
  0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-04-07 12:56 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, baohua, catalin.marinas, linux-arm-kernel, linux-kernel,
	ryan.roberts, will



Hi Lance,

Thanks for your feedback, my response is as follows.

--
Thanks,
Xavier





At 2025-04-07 19:29:22, "Lance Yang" <ioworker0@gmail.com> wrote:
>Thanks for the patch. Would the following change be better?
>
>diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>index 55107d27d3f8..64eb3b2fbf06 100644
>--- a/arch/arm64/mm/contpte.c
>+++ b/arch/arm64/mm/contpte.c
>@@ -174,6 +174,9 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> 
> 		if (pte_young(pte))
> 			orig_pte = pte_mkyoung(orig_pte);
>+
>+		if (pte_young(orig_pte) && pte_dirty(orig_pte))
>+			break;
> 	}
> 
> 	return orig_pte;
>-- 
>
>We can check the orig_pte flags directly instead of using extra boolean
>variables, which gives us an early-exit when both dirty and young flags
>are set.
Your way of writing the code is indeed more concise. However, I think
 using boolean variables might be more efficient. Although it introduces
 additional variables, comparing boolean values is likely to be more
 efficient than checking bit settings.

>
>Also, is this optimization really needed for the common case?
This function is on a high-frequency execution path. During debugging,
 I found that in most cases, the first few pages are already marked as
 both dirty and young. But currently, the program still has to complete
 the entire loop of 16 ptep iterations, which seriously reduces the efficiency.
>
>Thanks,
>Lance

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-07 12:56   ` Xavier
@ 2025-04-07 13:31     ` Lance Yang
  2025-04-07 16:19       ` Dev Jain
  0 siblings, 1 reply; 49+ messages in thread
From: Lance Yang @ 2025-04-07 13:31 UTC (permalink / raw)
  To: Xavier
  Cc: akpm, baohua, catalin.marinas, linux-arm-kernel, linux-kernel,
	ryan.roberts, will

On Mon, Apr 7, 2025 at 8:56 PM Xavier <xavier_qy@163.com> wrote:
>
>
>
> Hi Lance,
>
> Thanks for your feedback, my response is as follows.
>
> --
> Thanks,
> Xavier
>
>
>
>
>
> At 2025-04-07 19:29:22, "Lance Yang" <ioworker0@gmail.com> wrote:
> >Thanks for the patch. Would the following change be better?
> >
> >diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >index 55107d27d3f8..64eb3b2fbf06 100644
> >--- a/arch/arm64/mm/contpte.c
> >+++ b/arch/arm64/mm/contpte.c
> >@@ -174,6 +174,9 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >
> >               if (pte_young(pte))
> >                       orig_pte = pte_mkyoung(orig_pte);
> >+
> >+              if (pte_young(orig_pte) && pte_dirty(orig_pte))
> >+                      break;
> >       }
> >
> >       return orig_pte;
> >--
> >
> >We can check the orig_pte flags directly instead of using extra boolean
> >variables, which gives us an early-exit when both dirty and young flags
> >are set.
> Your way of writing the code is indeed more concise. However, I think
>  using boolean variables might be more efficient. Although it introduces
>  additional variables, comparing boolean values is likely to be more
>  efficient than checking bit settings.
>
> >
> >Also, is this optimization really needed for the common case?
> This function is on a high-frequency execution path. During debugging,
>  I found that in most cases, the first few pages are already marked as
>  both dirty and young. But currently, the program still has to complete
>  the entire loop of 16 ptep iterations, which seriously reduces the efficiency.

Hmm... agreed that this patch helps when early PTEs are dirty/young, but
for late-ones-only cases, it only introduces overhead with no benefit, IIUC.

So, let's wait for folks to take a look ;)

Thanks,
Lance

> >
> >Thanks,
> >Lance


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-07 13:31     ` Lance Yang
@ 2025-04-07 16:19       ` Dev Jain
  2025-04-08  8:58         ` [PATCH v2 0/1] " Xavier
  2025-04-08  9:17         ` [PATCH v1] " Lance Yang
  0 siblings, 2 replies; 49+ messages in thread
From: Dev Jain @ 2025-04-07 16:19 UTC (permalink / raw)
  To: Lance Yang, Xavier
  Cc: akpm, baohua, catalin.marinas, linux-arm-kernel, linux-kernel,
	ryan.roberts, will


Hi Xavier,

On 07/04/25 7:01 pm, Lance Yang wrote:
> On Mon, Apr 7, 2025 at 8:56 PM Xavier <xavier_qy@163.com> wrote:
>>
>>
>>
>> Hi Lance,
>>
>> Thanks for your feedback, my response is as follows.
>>
>> --
>> Thanks,
>> Xavier
>>
>>
>>
>>
>>
>> At 2025-04-07 19:29:22, "Lance Yang" <ioworker0@gmail.com> wrote:
>>> Thanks for the patch. Would the following change be better?
>>>
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index 55107d27d3f8..64eb3b2fbf06 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -174,6 +174,9 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>
>>>                if (pte_young(pte))
>>>                        orig_pte = pte_mkyoung(orig_pte);
>>> +
>>> +              if (pte_young(orig_pte) && pte_dirty(orig_pte))
>>> +                      break;
>>>        }

Quite the coincidence, I was thinking of doing exactly this some days 
back and testing it out : ) Can you do a microanalysis whether this gets 
us a benefit or not? This looks like an optimization on paper but may 
not be one after all because CONT_PTES is only 16 and a simple loop 
without extra if-conditions may just be faster.

>>>
>>>        return orig_pte;
>>> --
>>>
>>> We can check the orig_pte flags directly instead of using extra boolean
>>> variables, which gives us an early-exit when both dirty and young flags
>>> are set.
>> Your way of writing the code is indeed more concise. However, I think
>>   using boolean variables might be more efficient. Although it introduces
>>   additional variables, comparing boolean values is likely to be more
>>   efficient than checking bit settings.
>>
>>>
>>> Also, is this optimization really needed for the common case?
>> This function is on a high-frequency execution path. During debugging,
>>   I found that in most cases, the first few pages are already marked as
>>   both dirty and young. But currently, the program still has to complete
>>   the entire loop of 16 ptep iterations, which seriously reduces the efficiency.
> 
> Hmm... agreed that this patch helps when early PTEs are dirty/young, but
> for late-ones-only cases, it only introduces overhead with no benefit, IIUC.
> 
> So, let's wait for folks to take a look ;)
> 
> Thanks,
> Lance
> 
>>>
>>> Thanks,
>>> Lance
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-07 16:19       ` Dev Jain
@ 2025-04-08  8:58         ` Xavier
  2025-04-08  8:58           ` [PATCH v2 1/1] " Xavier
  2025-04-08  9:17         ` [PATCH v1] " Lance Yang
  1 sibling, 1 reply; 49+ messages in thread
From: Xavier @ 2025-04-08  8:58 UTC (permalink / raw)
  To: dev.jain, akpm, baohua, ryan.roberts, catalin.marinas, ioworker0
  Cc: linux-arm-kernel, linux-kernel, will, xavier_qy

Hi Dev,

Thanks for your valuable feedback earlier. I have optimized the patch again,
 eliminating additional boolean variables check and allowing the loop to break
 immediately once both dirty and young flags are set. This approach should
 theoretically be more efficient than the original code. Thank you for taking
 the time to review this work, I'm happy to make further adjustments based on
 your suggestions!

Xavier (1):
  mm/contpte: Optimize loop to reduce redundant operations

 arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-08  8:58         ` [PATCH v2 0/1] " Xavier
@ 2025-04-08  8:58           ` Xavier
  2025-04-09  4:09             ` Gavin Shan
  0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-04-08  8:58 UTC (permalink / raw)
  To: dev.jain, akpm, baohua, ryan.roberts, catalin.marinas, ioworker0
  Cc: linux-arm-kernel, linux-kernel, will, xavier_qy

This commit optimizes the contpte_ptep_get function by adding early
 termination logic. It checks if the dirty and young bits of orig_pte
 are already set and skips redundant bit-setting operations during
 the loop. This reduces unnecessary iterations and improves performance.

Signed-off-by: Xavier <xavier_qy@163.com>
---
 arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..034d153d7d19 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(__contpte_try_unfold);
 
+#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
+	do { \
+		int _start = start; \
+		pte_t *_ptep = ptep; \
+		while (_start++ < CONT_PTES) { \
+			if (pte_##flag(__ptep_get(_ptep++))) { \
+				orig_pte = pte_mk##flag(orig_pte); \
+				break; \
+			} \
+		} \
+	} while (0)
+
 pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 {
 	/*
@@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
+			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
+			break;
+		}
 
-		if (pte_young(pte))
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
+			break;
+		}
 	}
 
 	return orig_pte;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-08  8:58           ` [PATCH v2 1/1] " Xavier
@ 2025-04-09  4:09             ` Gavin Shan
  2025-04-09 15:10               ` Xavier
  0 siblings, 1 reply; 49+ messages in thread
From: Gavin Shan @ 2025-04-09  4:09 UTC (permalink / raw)
  To: Xavier, dev.jain, akpm, baohua, ryan.roberts, catalin.marinas,
	ioworker0
  Cc: linux-arm-kernel, linux-kernel, will

Hi Xavier,

On 4/8/25 6:58 PM, Xavier wrote:
> This commit optimizes the contpte_ptep_get function by adding early
>   termination logic. It checks if the dirty and young bits of orig_pte
>   are already set and skips redundant bit-setting operations during
>   the loop. This reduces unnecessary iterations and improves performance.
> 
> Signed-off-by: Xavier <xavier_qy@163.com>
> ---
>   arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
>   1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..034d153d7d19 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>   }
>   EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>   

I'm wandering how it can work. More details are given below.

> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> +	do { \
> +		int _start = start; \
> +		pte_t *_ptep = ptep; \
> +		while (_start++ < CONT_PTES) { \
> +			if (pte_##flag(__ptep_get(_ptep++))) { \
> +				orig_pte = pte_mk##flag(orig_pte); \
> +				break; \
> +			} \
> +		} \
> +	} while (0)
> +

CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
collects the flag for ptep[start, CONT_PTES].

>   pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>   {
>   	/*
> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>   	for (i = 0; i < CONT_PTES; i++, ptep++) {
>   		pte = __ptep_get(ptep);
>   
> -		if (pte_dirty(pte))
> +		if (pte_dirty(pte)) {
>   			orig_pte = pte_mkdirty(orig_pte);
> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
> +			break;
> +		}
>   
> -		if (pte_young(pte))
> +		if (pte_young(pte)) {
>   			orig_pte = pte_mkyoung(orig_pte);
> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
> +			break;
> +		}
>   	}
>   
>   	return orig_pte;

There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
conflicting to the outer loop, which iterates ptep[0, CONT_PTES].

Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
early with luck, we don't have to step into another cache line. Note that extra checks needs
more CPU cycles.

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re:Re: [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-09  4:09             ` Gavin Shan
@ 2025-04-09 15:10               ` Xavier
  2025-04-10  0:58                 ` Gavin Shan
  0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-04-09 15:10 UTC (permalink / raw)
  To: Gavin Shan
  Cc: dev.jain, akpm, baohua, ryan.roberts, catalin.marinas, ioworker0,
	linux-arm-kernel, linux-kernel, will

Hi Gavin,

Thank you for carefully reviewing this patch and raising your questions. 
I'll try to explain and answer them below.

At 2025-04-09 12:09:48, "Gavin Shan" <gshan@redhat.com> wrote:
>Hi Xavier,
>
>On 4/8/25 6:58 PM, Xavier wrote:
>> This commit optimizes the contpte_ptep_get function by adding early
>>   termination logic. It checks if the dirty and young bits of orig_pte
>>   are already set and skips redundant bit-setting operations during
>>   the loop. This reduces unnecessary iterations and improves performance.
>> 
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>>   arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..034d153d7d19 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>   }
>>   EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>   
>
>I'm wandering how it can work. More details are given below.
>
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> +	do { \
>> +		int _start = start; \
>> +		pte_t *_ptep = ptep; \
>> +		while (_start++ < CONT_PTES) { \
>> +			if (pte_##flag(__ptep_get(_ptep++))) { \
>> +				orig_pte = pte_mk##flag(orig_pte); \
>> +				break; \
>> +			} \
>> +		} \
>> +	} while (0)
>> +
>
>CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
>collects the flag for ptep[start, CONT_PTES].
>
>>   pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>   {
>>   	/*
>> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>   	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>   		pte = __ptep_get(ptep);
>>   
>> -		if (pte_dirty(pte))
>> +		if (pte_dirty(pte)) {
>>   			orig_pte = pte_mkdirty(orig_pte);
>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>> +			break;
>> +		}
>>   
>> -		if (pte_young(pte))
>> +		if (pte_young(pte)) {
>>   			orig_pte = pte_mkyoung(orig_pte);
>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>> +			break;
>> +		}
>>   	}
>>   
>>   	return orig_pte;
>
>There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
>flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
>flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
>conflicting to the outer loop, which iterates ptep[0, CONT_PTES].

No flags will be missed. The outer loop is used to check for the first flag,
which could be either the dirty or young flag.
Once this flag (let's assume it's the dirty flag) is found in the i-th PTE,
the dirty flag of orig_pte will be set, and the code will immediately enter
the inner loop, namely CHECK_CONTPTE_FLAG. This inner loop will continue
to check only for the young flag starting from the i-th position, and we needn't
concern about the dirty flag anymore.
If CHECK_CONTPTE_FLAG finds the young flag in the j-th PTE, the young flag
of orig_pte will be set. At this point, both the young and dirty flags of
orig_pte have been set, and there's no need for further loop judgments, so
the both the inner and outer loops can be exited directly. This approach
reduces unnecessary repeated traversals and judgments.

>
>Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
>However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
>size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
>early with luck, we don't have to step into another cache line. Note that extra checks needs
>more CPU cycles.

Compared to the previous function, this code doesn't add any extra checks.
Even in the worst-case scenario, where neither a dirty nor a young flag is
found among the 16 PTEs, the number of checks is the same as in the original
function. If any flag is found earlier, the optimized patch will reduce the
number of subsequent checks for that flag compared to the original code.

I'm not sure if my explanation is clear. If you have any further questions,
feel free to communicate with me again.

--
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-09 15:10               ` Xavier
@ 2025-04-10  0:58                 ` Gavin Shan
  2025-04-10  2:53                   ` Xavier
  0 siblings, 1 reply; 49+ messages in thread
From: Gavin Shan @ 2025-04-10  0:58 UTC (permalink / raw)
  To: Xavier
  Cc: dev.jain, akpm, baohua, ryan.roberts, catalin.marinas, ioworker0,
	linux-arm-kernel, linux-kernel, will

Hi Xavier,

On 4/10/25 1:10 AM, Xavier wrote:
> 
> Thank you for carefully reviewing this patch and raising your questions.
> I'll try to explain and answer them below.
> 

Not a problem :)

> 
> At 2025-04-09 12:09:48, "Gavin Shan" <gshan@redhat.com> wrote:
>> Hi Xavier,
>>
>> On 4/8/25 6:58 PM, Xavier wrote:
>>> This commit optimizes the contpte_ptep_get function by adding early
>>>    termination logic. It checks if the dirty and young bits of orig_pte
>>>    are already set and skips redundant bit-setting operations during
>>>    the loop. This reduces unnecessary iterations and improves performance.
>>>
>>> Signed-off-by: Xavier <xavier_qy@163.com>
>>> ---
>>>    arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
>>>    1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index bcac4f55f9c1..034d153d7d19 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>    }
>>>    EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>>    
>>
>> I'm wandering how it can work. More details are given below.
>>
>>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>>> +	do { \
>>> +		int _start = start; \
>>> +		pte_t *_ptep = ptep; \
>>> +		while (_start++ < CONT_PTES) { \
>>> +			if (pte_##flag(__ptep_get(_ptep++))) { \
>>> +				orig_pte = pte_mk##flag(orig_pte); \
>>> +				break; \
>>> +			} \
>>> +		} \
>>> +	} while (0)
>>> +

nit: the variable _start and _ptep can be dropped since the caller is going to
bail after CHECK_CONTPTE_FLAG(). However, I'm wandering it's going to introduce
burden to contpte_ptep_get() for its readability, much more complex than I
thought.

>>
>> CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
>> collects the flag for ptep[start, CONT_PTES].
>>
>>>    pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>    {
>>>    	/*
>>> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>    	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>    		pte = __ptep_get(ptep);
>>>    
>>> -		if (pte_dirty(pte))
>>> +		if (pte_dirty(pte)) {
>>>    			orig_pte = pte_mkdirty(orig_pte);
>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>>> +			break;
>>> +		}
>>>    
>>> -		if (pte_young(pte))
>>> +		if (pte_young(pte)) {
>>>    			orig_pte = pte_mkyoung(orig_pte);
>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>>> +			break;
>>> +		}
>>>    	}
>>>    
>>>    	return orig_pte;
>>
>> There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
>> flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
>> flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
>> conflicting to the outer loop, which iterates ptep[0, CONT_PTES].
> 
> No flags will be missed. The outer loop is used to check for the first flag,
> which could be either the dirty or young flag.
> Once this flag (let's assume it's the dirty flag) is found in the i-th PTE,
> the dirty flag of orig_pte will be set, and the code will immediately enter
> the inner loop, namely CHECK_CONTPTE_FLAG. This inner loop will continue
> to check only for the young flag starting from the i-th position, and we needn't
> concern about the dirty flag anymore.
> If CHECK_CONTPTE_FLAG finds the young flag in the j-th PTE, the young flag
> of orig_pte will be set. At this point, both the young and dirty flags of
> orig_pte have been set, and there's no need for further loop judgments, so
> the both the inner and outer loops can be exited directly. This approach
> reduces unnecessary repeated traversals and judgments.
> 

Thanks for the explanation. I missed that the subsequent young bit is collected
on pte_dirty(). Similarly, the subsequent dirty bit is collected on pte_young().
Now I can see all (dirty | young) bits are collected with a lost.

>>
>> Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
>> However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
>> size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
>> early with luck, we don't have to step into another cache line. Note that extra checks needs
>> more CPU cycles.
> 
> Compared to the previous function, this code doesn't add any extra checks.
> Even in the worst-case scenario, where neither a dirty nor a young flag is
> found among the 16 PTEs, the number of checks is the same as in the original
> function. If any flag is found earlier, the optimized patch will reduce the
> number of subsequent checks for that flag compared to the original code.
> 

There are 32 checks in the original code (assuming we have 4kb base page size
and CONT_PTES == 16) and the number of checks is fixed. With your patch applied,
the number becomes 32 + (number from CHECK_CONTPTE_FLAG) if all PTEs have
pte_dirty() and none of them has pte_young(), if I don't miss anything here.


> I'm not sure if my explanation is clear. If you have any further questions,
> feel free to communicate with me again.
> 

Thanks for the explanation, helped me to understand how all (dirty|young) bits
are collected without a lost.

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re:Re: [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-10  0:58                 ` Gavin Shan
@ 2025-04-10  2:53                   ` Xavier
  2025-04-10  3:06                     ` Gavin Shan
  0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-04-10  2:53 UTC (permalink / raw)
  To: Gavin Shan
  Cc: dev.jain, akpm, baohua, ryan.roberts, catalin.marinas, ioworker0,
	linux-arm-kernel, linux-kernel, will



Hi Gavin,


At 2025-04-10 08:58:46, "Gavin Shan" <gshan@redhat.com> wrote:
>Hi Xavier,
>
>On 4/10/25 1:10 AM, Xavier wrote:
>> 
>> Thank you for carefully reviewing this patch and raising your questions.
>> I'll try to explain and answer them below.
>> 
>
>Not a problem :)
>
>> 
>> At 2025-04-09 12:09:48, "Gavin Shan" <gshan@redhat.com> wrote:
>>> Hi Xavier,
>>>
>>> On 4/8/25 6:58 PM, Xavier wrote:
>>>> This commit optimizes the contpte_ptep_get function by adding early
>>>>    termination logic. It checks if the dirty and young bits of orig_pte
>>>>    are already set and skips redundant bit-setting operations during
>>>>    the loop. This reduces unnecessary iterations and improves performance.
>>>>
>>>> Signed-off-by: Xavier <xavier_qy@163.com>
>>>> ---
>>>>    arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
>>>>    1 file changed, 20 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index bcac4f55f9c1..034d153d7d19 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>>>    
>>>
>>> I'm wandering how it can work. More details are given below.
>>>
>>>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>>>> +	do { \
>>>> +		int _start = start; \
>>>> +		pte_t *_ptep = ptep; \
>>>> +		while (_start++ < CONT_PTES) { \
>>>> +			if (pte_##flag(__ptep_get(_ptep++))) { \
>>>> +				orig_pte = pte_mk##flag(orig_pte); \
>>>> +				break; \
>>>> +			} \
>>>> +		} \
>>>> +	} while (0)
>>>> +
>
>nit: the variable _start and _ptep can be dropped since the caller is going to
>bail after CHECK_CONTPTE_FLAG(). However, I'm wandering it's going to introduce
>burden to contpte_ptep_get() for its readability, much more complex than I
>thought.

Not adding these two variables in the current code has no impact. The main purpose
of adding them is to improve maintainability and prevent the external code from
continuing operations unknowingly after the macro has modified the variables.

>
>>>
>>> CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
>>> collects the flag for ptep[start, CONT_PTES].
>>>
>>>>    pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>    {
>>>>    	/*
>>>> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>    	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>    		pte = __ptep_get(ptep);
>>>>    
>>>> -		if (pte_dirty(pte))
>>>> +		if (pte_dirty(pte)) {
>>>>    			orig_pte = pte_mkdirty(orig_pte);
>>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>>>> +			break;
>>>> +		}
>>>>    
>>>> -		if (pte_young(pte))
>>>> +		if (pte_young(pte)) {
>>>>    			orig_pte = pte_mkyoung(orig_pte);
>>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>>>> +			break;
>>>> +		}
>>>>    	}
>>>>    
>>>>    	return orig_pte;
>>>
>>> There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
>>> flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
>>> flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
>>> conflicting to the outer loop, which iterates ptep[0, CONT_PTES].
>> 
>> No flags will be missed. The outer loop is used to check for the first flag,
>> which could be either the dirty or young flag.
>> Once this flag (let's assume it's the dirty flag) is found in the i-th PTE,
>> the dirty flag of orig_pte will be set, and the code will immediately enter
>> the inner loop, namely CHECK_CONTPTE_FLAG. This inner loop will continue
>> to check only for the young flag starting from the i-th position, and we needn't
>> concern about the dirty flag anymore.
>> If CHECK_CONTPTE_FLAG finds the young flag in the j-th PTE, the young flag
>> of orig_pte will be set. At this point, both the young and dirty flags of
>> orig_pte have been set, and there's no need for further loop judgments, so
>> the both the inner and outer loops can be exited directly. This approach
>> reduces unnecessary repeated traversals and judgments.
>> 
>
>Thanks for the explanation. I missed that the subsequent young bit is collected
>on pte_dirty(). Similarly, the subsequent dirty bit is collected on pte_young().
>Now I can see all (dirty | young) bits are collected with a lost.
>
>>>
>>> Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
>>> However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
>>> size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
>>> early with luck, we don't have to step into another cache line. Note that extra checks needs
>>> more CPU cycles.
>> 
>> Compared to the previous function, this code doesn't add any extra checks.
>> Even in the worst-case scenario, where neither a dirty nor a young flag is
>> found among the 16 PTEs, the number of checks is the same as in the original
>> function. If any flag is found earlier, the optimized patch will reduce the
>> number of subsequent checks for that flag compared to the original code.
>> 
>
>There are 32 checks in the original code (assuming we have 4kb base page size
>and CONT_PTES == 16) and the number of checks is fixed. With your patch applied,
>the number becomes 32 + (number from CHECK_CONTPTE_FLAG) if all PTEs have
>pte_dirty() and none of them has pte_young(), if I don't miss anything here.
>

If all PTEs are dirty and none are young, the code will enter CHECK_CONTPTE_FLAG
when it encounters the first dirty PTE. Inside CHECK_CONTPTE_FLAG, it only checks
for the young bit (16 - 0) times and then exits all loops. In this case, the total number
of checks is 1 + 16, which is less than the original 32.

>
>> I'm not sure if my explanation is clear. If you have any further questions,
>> feel free to communicate with me again.
>> 
>
>Thanks for the explanation, helped me to understand how all (dirty|young) bits
>are collected without a lost.
>

If you still have any questions, feel free to discuss.

--
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-10  2:53                   ` Xavier
@ 2025-04-10  3:06                     ` Gavin Shan
  0 siblings, 0 replies; 49+ messages in thread
From: Gavin Shan @ 2025-04-10  3:06 UTC (permalink / raw)
  To: Xavier
  Cc: dev.jain, akpm, baohua, ryan.roberts, catalin.marinas, ioworker0,
	linux-arm-kernel, linux-kernel, will

Hi Xavier,

On 4/10/25 12:53 PM, Xavier wrote:
> At 2025-04-10 08:58:46, "Gavin Shan" <gshan@redhat.com> wrote:
>>
>> On 4/10/25 1:10 AM, Xavier wrote:
>>>
>>> Thank you for carefully reviewing this patch and raising your questions.
>>> I'll try to explain and answer them below.
>>>
>>
>> Not a problem :)
>>
>>>
>>> At 2025-04-09 12:09:48, "Gavin Shan" <gshan@redhat.com> wrote:
>>>> Hi Xavier,
>>>>
>>>> On 4/8/25 6:58 PM, Xavier wrote:
>>>>> This commit optimizes the contpte_ptep_get function by adding early
>>>>>     termination logic. It checks if the dirty and young bits of orig_pte
>>>>>     are already set and skips redundant bit-setting operations during
>>>>>     the loop. This reduces unnecessary iterations and improves performance.
>>>>>
>>>>> Signed-off-by: Xavier <xavier_qy@163.com>
>>>>> ---
>>>>>     arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
>>>>>     1 file changed, 20 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>> index bcac4f55f9c1..034d153d7d19 100644
>>>>> --- a/arch/arm64/mm/contpte.c
>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>>>>     
>>>>
>>>> I'm wandering how it can work. More details are given below.
>>>>
>>>>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>>>>> +	do { \
>>>>> +		int _start = start; \
>>>>> +		pte_t *_ptep = ptep; \
>>>>> +		while (_start++ < CONT_PTES) { \
>>>>> +			if (pte_##flag(__ptep_get(_ptep++))) { \
>>>>> +				orig_pte = pte_mk##flag(orig_pte); \
>>>>> +				break; \
>>>>> +			} \
>>>>> +		} \
>>>>> +	} while (0)
>>>>> +
>>
>> nit: the variable _start and _ptep can be dropped since the caller is going to
>> bail after CHECK_CONTPTE_FLAG(). However, I'm wandering it's going to introduce
>> burden to contpte_ptep_get() for its readability, much more complex than I
>> thought.
> 
> Not adding these two variables in the current code has no impact. The main purpose
> of adding them is to improve maintainability and prevent the external code from
> continuing operations unknowingly after the macro has modified the variables.
> 
>>
>>>>
>>>> CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
>>>> collects the flag for ptep[start, CONT_PTES].
>>>>
>>>>>     pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>>     {
>>>>>     	/*
>>>>> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>>     	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>>     		pte = __ptep_get(ptep);
>>>>>     
>>>>> -		if (pte_dirty(pte))
>>>>> +		if (pte_dirty(pte)) {
>>>>>     			orig_pte = pte_mkdirty(orig_pte);
>>>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>>>>> +			break;
>>>>> +		}
>>>>>     
>>>>> -		if (pte_young(pte))
>>>>> +		if (pte_young(pte)) {
>>>>>     			orig_pte = pte_mkyoung(orig_pte);
>>>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>>>>> +			break;
>>>>> +		}
>>>>>     	}
>>>>>     
>>>>>     	return orig_pte;
>>>>
>>>> There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
>>>> flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
>>>> flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
>>>> conflicting to the outer loop, which iterates ptep[0, CONT_PTES].
>>>
>>> No flags will be missed. The outer loop is used to check for the first flag,
>>> which could be either the dirty or young flag.
>>> Once this flag (let's assume it's the dirty flag) is found in the i-th PTE,
>>> the dirty flag of orig_pte will be set, and the code will immediately enter
>>> the inner loop, namely CHECK_CONTPTE_FLAG. This inner loop will continue
>>> to check only for the young flag starting from the i-th position, and we needn't
>>> concern about the dirty flag anymore.
>>> If CHECK_CONTPTE_FLAG finds the young flag in the j-th PTE, the young flag
>>> of orig_pte will be set. At this point, both the young and dirty flags of
>>> orig_pte have been set, and there's no need for further loop judgments, so
>>> the both the inner and outer loops can be exited directly. This approach
>>> reduces unnecessary repeated traversals and judgments.
>>>
>>
>> Thanks for the explanation. I missed that the subsequent young bit is collected
>> on pte_dirty(). Similarly, the subsequent dirty bit is collected on pte_young().
>> Now I can see all (dirty | young) bits are collected with a lost.
>>
>>>>
>>>> Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
>>>> However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
>>>> size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
>>>> early with luck, we don't have to step into another cache line. Note that extra checks needs
>>>> more CPU cycles.
>>>
>>> Compared to the previous function, this code doesn't add any extra checks.
>>> Even in the worst-case scenario, where neither a dirty nor a young flag is
>>> found among the 16 PTEs, the number of checks is the same as in the original
>>> function. If any flag is found earlier, the optimized patch will reduce the
>>> number of subsequent checks for that flag compared to the original code.
>>>
>>
>> There are 32 checks in the original code (assuming we have 4kb base page size
>> and CONT_PTES == 16) and the number of checks is fixed. With your patch applied,
>> the number becomes 32 + (number from CHECK_CONTPTE_FLAG) if all PTEs have
>> pte_dirty() and none of them has pte_young(), if I don't miss anything here.
>>
> 
> If all PTEs are dirty and none are young, the code will enter CHECK_CONTPTE_FLAG
> when it encounters the first dirty PTE. Inside CHECK_CONTPTE_FLAG, it only checks
> for the young bit (16 - 0) times and then exits all loops. In this case, the total number
> of checks is 1 + 16, which is less than the original 32.
> 

You're correct. The outer loop is going to stop on pte_dirty() or pte_young(). Well,
you can see the code becomes hard to be understood, at least to me :)

[...]

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-07 16:19       ` Dev Jain
  2025-04-08  8:58         ` [PATCH v2 0/1] " Xavier
@ 2025-04-08  9:17         ` Lance Yang
  2025-04-09 15:15           ` Xavier
  1 sibling, 1 reply; 49+ messages in thread
From: Lance Yang @ 2025-04-08  9:17 UTC (permalink / raw)
  To: Dev Jain
  Cc: Xavier, akpm, baohua, catalin.marinas, linux-arm-kernel,
	linux-kernel, ryan.roberts, will

On Tue, Apr 8, 2025 at 12:19 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
> Hi Xavier,
>
> On 07/04/25 7:01 pm, Lance Yang wrote:
> > On Mon, Apr 7, 2025 at 8:56 PM Xavier <xavier_qy@163.com> wrote:
> >>
> >>
> >>
> >> Hi Lance,
> >>
> >> Thanks for your feedback, my response is as follows.
> >>
> >> --
> >> Thanks,
> >> Xavier
> >>
> >>
> >>
> >>
> >>
> >> At 2025-04-07 19:29:22, "Lance Yang" <ioworker0@gmail.com> wrote:
> >>> Thanks for the patch. Would the following change be better?
> >>>
> >>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >>> index 55107d27d3f8..64eb3b2fbf06 100644
> >>> --- a/arch/arm64/mm/contpte.c
> >>> +++ b/arch/arm64/mm/contpte.c
> >>> @@ -174,6 +174,9 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >>>
> >>>                if (pte_young(pte))
> >>>                        orig_pte = pte_mkyoung(orig_pte);
> >>> +
> >>> +              if (pte_young(orig_pte) && pte_dirty(orig_pte))
> >>> +                      break;
> >>>        }
>
> Quite the coincidence, I was thinking of doing exactly this some days
> back and testing it out : ) Can you do a microanalysis whether this gets
> us a benefit or not? This looks like an optimization on paper but may
> not be one after all because CONT_PTES is only 16 and a simple loop
> without extra if-conditions may just be faster.

Yeah, I was thinking the same ;)

This change seems reasonable in theory. But with CONT_PTES=16, keeping
it simple might actually be faster, IMO.

Thanks,
Lance

>
> >>>
> >>>        return orig_pte;
> >>> --
> >>>
> >>> We can check the orig_pte flags directly instead of using extra boolean
> >>> variables, which gives us an early-exit when both dirty and young flags
> >>> are set.
> >> Your way of writing the code is indeed more concise. However, I think
> >>   using boolean variables might be more efficient. Although it introduces
> >>   additional variables, comparing boolean values is likely to be more
> >>   efficient than checking bit settings.
> >>
> >>>
> >>> Also, is this optimization really needed for the common case?
> >> This function is on a high-frequency execution path. During debugging,
> >>   I found that in most cases, the first few pages are already marked as
> >>   both dirty and young. But currently, the program still has to complete
> >>   the entire loop of 16 ptep iterations, which seriously reduces the efficiency.
> >
> > Hmm... agreed that this patch helps when early PTEs are dirty/young, but
> > for late-ones-only cases, it only introduces overhead with no benefit, IIUC.
> >
> > So, let's wait for folks to take a look ;)
> >
> > Thanks,
> > Lance
> >
> >>>
> >>> Thanks,
> >>> Lance
> >
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re:Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-08  9:17         ` [PATCH v1] " Lance Yang
@ 2025-04-09 15:15           ` Xavier
  0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-09 15:15 UTC (permalink / raw)
  To: Lance Yang
  Cc: Dev Jain, akpm, baohua, catalin.marinas, linux-arm-kernel,
	linux-kernel, ryan.roberts, will



Hi Lance,


At 2025-04-08 17:17:27, "Lance Yang" <ioworker0@gmail.com> wrote:
>On Tue, Apr 8, 2025 at 12:19 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>> Hi Xavier,
>>
>> On 07/04/25 7:01 pm, Lance Yang wrote:
>> > On Mon, Apr 7, 2025 at 8:56 PM Xavier <xavier_qy@163.com> wrote:
>> >>
>> >>
>> >>
>> >> Hi Lance,
>> >>
>> >> Thanks for your feedback, my response is as follows.
>> >>
>> >> --
>> >> Thanks,
>> >> Xavier
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> At 2025-04-07 19:29:22, "Lance Yang" <ioworker0@gmail.com> wrote:
>> >>> Thanks for the patch. Would the following change be better?
>> >>>
>> >>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> >>> index 55107d27d3f8..64eb3b2fbf06 100644
>> >>> --- a/arch/arm64/mm/contpte.c
>> >>> +++ b/arch/arm64/mm/contpte.c
>> >>> @@ -174,6 +174,9 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> >>>
>> >>>                if (pte_young(pte))
>> >>>                        orig_pte = pte_mkyoung(orig_pte);
>> >>> +
>> >>> +              if (pte_young(orig_pte) && pte_dirty(orig_pte))
>> >>> +                      break;
>> >>>        }
>>
>> Quite the coincidence, I was thinking of doing exactly this some days
>> back and testing it out : ) Can you do a microanalysis whether this gets
>> us a benefit or not? This looks like an optimization on paper but may
>> not be one after all because CONT_PTES is only 16 and a simple loop
>> without extra if-conditions may just be faster.
>
>Yeah, I was thinking the same ;)
>
>This change seems reasonable in theory. But with CONT_PTES=16, keeping
>it simple might actually be faster, IMO.

Please take a look at patch v2. I've updated the code without introducing any
additional checks. The number of checks in the new patch is definitely less than
or equal to that in the original code.

--
Thanks,
Xavier


>
>Thanks,
>Lance
>
>>
>> >>>
>> >>>        return orig_pte;
>> >>> --
>> >>>
>> >>> We can check the orig_pte flags directly instead of using extra boolean
>> >>> variables, which gives us an early-exit when both dirty and young flags
>> >>> are set.
>> >> Your way of writing the code is indeed more concise. However, I think
>> >>   using boolean variables might be more efficient. Although it introduces
>> >>   additional variables, comparing boolean values is likely to be more
>> >>   efficient than checking bit settings.
>> >>
>> >>>
>> >>> Also, is this optimization really needed for the common case?
>> >> This function is on a high-frequency execution path. During debugging,
>> >>   I found that in most cases, the first few pages are already marked as
>> >>   both dirty and young. But currently, the program still has to complete
>> >>   the entire loop of 16 ptep iterations, which seriously reduces the efficiency.
>> >
>> > Hmm... agreed that this patch helps when early PTEs are dirty/young, but
>> > for late-ones-only cases, it only introduces overhead with no benefit, IIUC.
>> >
>> > So, let's wait for folks to take a look ;)
>> >
>> > Thanks,
>> > Lance
>> >
>> >>>
>> >>> Thanks,
>> >>> Lance
>> >
>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-07  9:22 [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations Xavier
  2025-04-07 11:29 ` Lance Yang
@ 2025-04-10 21:25 ` Barry Song
  2025-04-11 12:03   ` David Laight
  2025-04-11 17:30   ` Dev Jain
  1 sibling, 2 replies; 49+ messages in thread
From: Barry Song @ 2025-04-10 21:25 UTC (permalink / raw)
  To: Xavier
  Cc: catalin.marinas, will, akpm, ryan.roberts, ioworker0,
	linux-arm-kernel, linux-kernel

On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
>
> This commit optimizes the contpte_ptep_get function by adding early
>  termination logic. It checks if the dirty and young bits of orig_pte
>  are already set and skips redundant bit-setting operations during
>  the loop. This reduces unnecessary iterations and improves performance.
>
> Signed-off-by: Xavier <xavier_qy@163.com>
> ---
>  arch/arm64/mm/contpte.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..ca15d8f52d14 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>
>         pte_t pte;
>         int i;
> +       bool dirty = false;
> +       bool young = false;
>
>         ptep = contpte_align_down(ptep);
>
>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>                 pte = __ptep_get(ptep);
>
> -               if (pte_dirty(pte))
> +               if (!dirty && pte_dirty(pte)) {
> +                       dirty = true;
>                         orig_pte = pte_mkdirty(orig_pte);
> +               }
>
> -               if (pte_young(pte))
> +               if (!young && pte_young(pte)) {
> +                       young = true;
>                         orig_pte = pte_mkyoung(orig_pte);
> +               }
> +
> +               if (dirty && young)
> +                       break;

This kind of optimization is always tricky. Dev previously tried a similar
approach to reduce the loop count, but it ended up causing performance
degradation:
https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/

So we may need actual data to validate this idea.

>         }
>
>         return orig_pte;
> --
> 2.34.1
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-10 21:25 ` Barry Song
@ 2025-04-11 12:03   ` David Laight
  2025-04-12  7:18     ` Barry Song
  2025-04-11 17:30   ` Dev Jain
  1 sibling, 1 reply; 49+ messages in thread
From: David Laight @ 2025-04-11 12:03 UTC (permalink / raw)
  To: Barry Song
  Cc: Xavier, catalin.marinas, will, akpm, ryan.roberts, ioworker0,
	linux-arm-kernel, linux-kernel

On Fri, 11 Apr 2025 09:25:39 +1200
Barry Song <21cnbao@gmail.com> wrote:

> On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
> >
> > This commit optimizes the contpte_ptep_get function by adding early
> >  termination logic. It checks if the dirty and young bits of orig_pte
> >  are already set and skips redundant bit-setting operations during
> >  the loop. This reduces unnecessary iterations and improves performance.
> >
> > Signed-off-by: Xavier <xavier_qy@163.com>
> > ---
> >  arch/arm64/mm/contpte.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index bcac4f55f9c1..ca15d8f52d14 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >
> >         pte_t pte;
> >         int i;
> > +       bool dirty = false;
> > +       bool young = false;
> >
> >         ptep = contpte_align_down(ptep);
> >
> >         for (i = 0; i < CONT_PTES; i++, ptep++) {
> >                 pte = __ptep_get(ptep);
> >
> > -               if (pte_dirty(pte))
> > +               if (!dirty && pte_dirty(pte)) {
> > +                       dirty = true;
> >                         orig_pte = pte_mkdirty(orig_pte);
> > +               }
> >
> > -               if (pte_young(pte))
> > +               if (!young && pte_young(pte)) {
> > +                       young = true;
> >                         orig_pte = pte_mkyoung(orig_pte);
> > +               }
> > +
> > +               if (dirty && young)
> > +                       break;  
> 
> This kind of optimization is always tricky. Dev previously tried a similar
> approach to reduce the loop count, but it ended up causing performance
> degradation:
> https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
> 
> So we may need actual data to validate this idea.

You might win with 3 loops.
The first looks for both 'dirty' and 'young'.
If it finds only one it jumps to a different loop that continues
the search but only looks for the other flag.

	David

> 
> >         }
> >
> >         return orig_pte;
> > --
> > 2.34.1
> >  
> 
> Thanks
> Barry
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-11 12:03   ` David Laight
@ 2025-04-12  7:18     ` Barry Song
  0 siblings, 0 replies; 49+ messages in thread
From: Barry Song @ 2025-04-12  7:18 UTC (permalink / raw)
  To: David Laight
  Cc: Xavier, catalin.marinas, will, akpm, ryan.roberts, ioworker0,
	linux-arm-kernel, linux-kernel

On Fri, Apr 11, 2025 at 8:03 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Fri, 11 Apr 2025 09:25:39 +1200
> Barry Song <21cnbao@gmail.com> wrote:
>
> > On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
> > >
> > > This commit optimizes the contpte_ptep_get function by adding early
> > >  termination logic. It checks if the dirty and young bits of orig_pte
> > >  are already set and skips redundant bit-setting operations during
> > >  the loop. This reduces unnecessary iterations and improves performance.
> > >
> > > Signed-off-by: Xavier <xavier_qy@163.com>
> > > ---
> > >  arch/arm64/mm/contpte.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > > index bcac4f55f9c1..ca15d8f52d14 100644
> > > --- a/arch/arm64/mm/contpte.c
> > > +++ b/arch/arm64/mm/contpte.c
> > > @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> > >
> > >         pte_t pte;
> > >         int i;
> > > +       bool dirty = false;
> > > +       bool young = false;
> > >
> > >         ptep = contpte_align_down(ptep);
> > >
> > >         for (i = 0; i < CONT_PTES; i++, ptep++) {
> > >                 pte = __ptep_get(ptep);
> > >
> > > -               if (pte_dirty(pte))
> > > +               if (!dirty && pte_dirty(pte)) {
> > > +                       dirty = true;
> > >                         orig_pte = pte_mkdirty(orig_pte);
> > > +               }
> > >
> > > -               if (pte_young(pte))
> > > +               if (!young && pte_young(pte)) {
> > > +                       young = true;
> > >                         orig_pte = pte_mkyoung(orig_pte);
> > > +               }
> > > +
> > > +               if (dirty && young)
> > > +                       break;
> >
> > This kind of optimization is always tricky. Dev previously tried a similar
> > approach to reduce the loop count, but it ended up causing performance
> > degradation:
> > https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
> >
> > So we may need actual data to validate this idea.
>
> You might win with 3 loops.
> The first looks for both 'dirty' and 'young'.
> If it finds only one it jumps to a different loop that continues
> the search but only looks for the other flag.

However, there's no concrete evidence that this loop is a hot path. It might
save a few instructions when the first PTE is both young and dirty, but in
many cases, none of the PTEs are either. Previous profiling indicates that
the actual hot path lies in the rmap walk to locate the page table entries
during folio_reference in the reclamation path.

I suspect the code you're optimizing isn't actually part of a hot path at all.

>
>         David
>
> >
> > >         }
> > >
> > >         return orig_pte;
> > > --
> > > 2.34.1
> > >
> >

Thanks
Barry


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-10 21:25 ` Barry Song
  2025-04-11 12:03   ` David Laight
@ 2025-04-11 17:30   ` Dev Jain
  2025-04-12  5:05     ` Lance Yang
  1 sibling, 1 reply; 49+ messages in thread
From: Dev Jain @ 2025-04-11 17:30 UTC (permalink / raw)
  To: Barry Song, Xavier
  Cc: catalin.marinas, will, akpm, ryan.roberts, ioworker0,
	linux-arm-kernel, linux-kernel, David Hildenbrand, Matthew Wilcox,
	Zi Yan

+others

On 11/04/25 2:55 am, Barry Song wrote:
> On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
>>
>> This commit optimizes the contpte_ptep_get function by adding early
>>   termination logic. It checks if the dirty and young bits of orig_pte
>>   are already set and skips redundant bit-setting operations during
>>   the loop. This reduces unnecessary iterations and improves performance.
>>
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>>   arch/arm64/mm/contpte.c | 13 +++++++++++--
>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..ca15d8f52d14 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>
>>          pte_t pte;
>>          int i;
>> +       bool dirty = false;
>> +       bool young = false;
>>
>>          ptep = contpte_align_down(ptep);
>>
>>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>>                  pte = __ptep_get(ptep);
>>
>> -               if (pte_dirty(pte))
>> +               if (!dirty && pte_dirty(pte)) {
>> +                       dirty = true;
>>                          orig_pte = pte_mkdirty(orig_pte);
>> +               }
>>
>> -               if (pte_young(pte))
>> +               if (!young && pte_young(pte)) {
>> +                       young = true;
>>                          orig_pte = pte_mkyoung(orig_pte);
>> +               }
>> +
>> +               if (dirty && young)
>> +                       break;
> 
> This kind of optimization is always tricky. Dev previously tried a similar
> approach to reduce the loop count, but it ended up causing performance
> degradation:
> https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
> 
> So we may need actual data to validate this idea.

The original v2 patch does not work, I changed it to the following:

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..db0ad38601db 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, 
unsigned long addr,
  }
  EXPORT_SYMBOL_GPL(__contpte_try_unfold);

+#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
+       int _start; \
+       pte_t *_ptep = ptep; \
+       for (_start = start; _start < CONT_PTES; _start++, ptep++) { \
+               if (pte_##flag(__ptep_get(_ptep))) { \
+                       orig_pte = pte_mk##flag(orig_pte); \
+                       break; \
+               } \
+       }
+
  pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
  {
         /*
@@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
         for (i = 0; i < CONT_PTES; i++, ptep++) {
                 pte = __ptep_get(ptep);

-               if (pte_dirty(pte))
+               if (pte_dirty(pte)) {
                         orig_pte = pte_mkdirty(orig_pte);
+                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
+                       break;
+               }

-               if (pte_young(pte))
+               if (pte_young(pte)) {
                         orig_pte = pte_mkyoung(orig_pte);
+                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
+                       break;
+               }
         }

         return orig_pte;

Some rudimentary testing with micromm reveals that this may be 
*slightly* faster. I cannot say for sure yet.

> 
>>          }
>>
>>          return orig_pte;
>> --
>> 2.34.1
>>
> 
> Thanks
> Barry
> 



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-11 17:30   ` Dev Jain
@ 2025-04-12  5:05     ` Lance Yang
  2025-04-12  5:27       ` Xavier
  2025-04-14  8:06       ` Ryan Roberts
  0 siblings, 2 replies; 49+ messages in thread
From: Lance Yang @ 2025-04-12  5:05 UTC (permalink / raw)
  To: Dev Jain, Xavier
  Cc: Barry Song, catalin.marinas, will, akpm, ryan.roberts,
	linux-arm-kernel, linux-kernel, David Hildenbrand, Matthew Wilcox,
	Zi Yan

On Sat, Apr 12, 2025 at 1:30 AM Dev Jain <dev.jain@arm.com> wrote:
>
> +others
>
> On 11/04/25 2:55 am, Barry Song wrote:
> > On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
> >>
> >> This commit optimizes the contpte_ptep_get function by adding early
> >>   termination logic. It checks if the dirty and young bits of orig_pte
> >>   are already set and skips redundant bit-setting operations during
> >>   the loop. This reduces unnecessary iterations and improves performance.
> >>
> >> Signed-off-by: Xavier <xavier_qy@163.com>
> >> ---
> >>   arch/arm64/mm/contpte.c | 13 +++++++++++--
> >>   1 file changed, 11 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >> index bcac4f55f9c1..ca15d8f52d14 100644
> >> --- a/arch/arm64/mm/contpte.c
> >> +++ b/arch/arm64/mm/contpte.c
> >> @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >>
> >>          pte_t pte;
> >>          int i;
> >> +       bool dirty = false;
> >> +       bool young = false;
> >>
> >>          ptep = contpte_align_down(ptep);
> >>
> >>          for (i = 0; i < CONT_PTES; i++, ptep++) {
> >>                  pte = __ptep_get(ptep);
> >>
> >> -               if (pte_dirty(pte))
> >> +               if (!dirty && pte_dirty(pte)) {
> >> +                       dirty = true;
> >>                          orig_pte = pte_mkdirty(orig_pte);
> >> +               }
> >>
> >> -               if (pte_young(pte))
> >> +               if (!young && pte_young(pte)) {
> >> +                       young = true;
> >>                          orig_pte = pte_mkyoung(orig_pte);
> >> +               }
> >> +
> >> +               if (dirty && young)
> >> +                       break;
> >
> > This kind of optimization is always tricky. Dev previously tried a similar
> > approach to reduce the loop count, but it ended up causing performance
> > degradation:
> > https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
> >
> > So we may need actual data to validate this idea.
>
> The original v2 patch does not work, I changed it to the following:
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..db0ad38601db 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm,
> unsigned long addr,
>   }
>   EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>
> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> +       int _start; \
> +       pte_t *_ptep = ptep; \
> +       for (_start = start; _start < CONT_PTES; _start++, ptep++) { \
> +               if (pte_##flag(__ptep_get(_ptep))) { \
> +                       orig_pte = pte_mk##flag(orig_pte); \
> +                       break; \
> +               } \
> +       }
> +
>   pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>   {
>          /*
> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>                  pte = __ptep_get(ptep);
>
> -               if (pte_dirty(pte))
> +               if (pte_dirty(pte)) {
>                          orig_pte = pte_mkdirty(orig_pte);
> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
> +                       break;
> +               }
>
> -               if (pte_young(pte))
> +               if (pte_young(pte)) {
>                          orig_pte = pte_mkyoung(orig_pte);
> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
> +                       break;
> +               }
>          }
>
>          return orig_pte;
>
> Some rudimentary testing with micromm reveals that this may be
> *slightly* faster. I cannot say for sure yet.

Yep, this change works as expected, IIUC.

However, I'm still wondering if the added complexity is worth it for
such a slight/negligible performance gain. That said, if we have
solid numbers/data to back it up, all doubts would disappear ;)

Thanks,
Lance

>
> >
> >>          }
> >>
> >>          return orig_pte;
> >> --
> >> 2.34.1
> >>
> >
> > Thanks
> > Barry
> >
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-12  5:05     ` Lance Yang
@ 2025-04-12  5:27       ` Xavier
  2025-04-14  8:06       ` Ryan Roberts
  1 sibling, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-12  5:27 UTC (permalink / raw)
  To: Lance Yang, Dev Jain, Barry Song, David Hildenbrand
  Cc: catalin.marinas, will, akpm, ryan.roberts, linux-arm-kernel,
	linux-kernel, Matthew Wilcox, Zi Yan



Hi,


At 2025-04-12 13:05:13, "Lance Yang" <ioworker0@gmail.com> wrote:
>On Sat, Apr 12, 2025 at 1:30 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>> +others
>>
>> On 11/04/25 2:55 am, Barry Song wrote:
>> > On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
>> >>
>> >> This commit optimizes the contpte_ptep_get function by adding early
>> >>   termination logic. It checks if the dirty and young bits of orig_pte
>> >>   are already set and skips redundant bit-setting operations during
>> >>   the loop. This reduces unnecessary iterations and improves performance.
>> >>
>> >> Signed-off-by: Xavier <xavier_qy@163.com>
>> >> ---
>> >>   arch/arm64/mm/contpte.c | 13 +++++++++++--
>> >>   1 file changed, 11 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> >> index bcac4f55f9c1..ca15d8f52d14 100644
>> >> --- a/arch/arm64/mm/contpte.c
>> >> +++ b/arch/arm64/mm/contpte.c
>> >> @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> >>
>> >>          pte_t pte;
>> >>          int i;
>> >> +       bool dirty = false;
>> >> +       bool young = false;
>> >>
>> >>          ptep = contpte_align_down(ptep);
>> >>
>> >>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>> >>                  pte = __ptep_get(ptep);
>> >>
>> >> -               if (pte_dirty(pte))
>> >> +               if (!dirty && pte_dirty(pte)) {
>> >> +                       dirty = true;
>> >>                          orig_pte = pte_mkdirty(orig_pte);
>> >> +               }
>> >>
>> >> -               if (pte_young(pte))
>> >> +               if (!young && pte_young(pte)) {
>> >> +                       young = true;
>> >>                          orig_pte = pte_mkyoung(orig_pte);
>> >> +               }
>> >> +
>> >> +               if (dirty && young)
>> >> +                       break;
>> >
>> > This kind of optimization is always tricky. Dev previously tried a similar
>> > approach to reduce the loop count, but it ended up causing performance
>> > degradation:
>> > https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
>> >
>> > So we may need actual data to validate this idea.

You're absolutely right. Optimization must be proven by test data. I will
 conduct some tests and validations next.

>>
>> The original v2 patch does not work, I changed it to the following:
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..db0ad38601db 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm,
>> unsigned long addr,
>>   }
>>   EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> +       int _start; \
>> +       pte_t *_ptep = ptep; \
>> +       for (_start = start; _start < CONT_PTES; _start++, ptep++) { \
>> +               if (pte_##flag(__ptep_get(_ptep))) { \
>> +                       orig_pte = pte_mk##flag(orig_pte); \
>> +                       break; \
>> +               } \
>> +       }
>> +
>>   pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>   {
>>          /*
>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>>                  pte = __ptep_get(ptep);
>>
>> -               if (pte_dirty(pte))
>> +               if (pte_dirty(pte)) {
>>                          orig_pte = pte_mkdirty(orig_pte);
>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>> +                       break;
>> +               }
>>
>> -               if (pte_young(pte))
>> +               if (pte_young(pte)) {
>>                          orig_pte = pte_mkyoung(orig_pte);
>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>> +                       break;
>> +               }
>>          }
>>
>>          return orig_pte;
>>
>> Some rudimentary testing with micromm reveals that this may be
>> *slightly* faster. I cannot say for sure yet.

Thank you for your suggestions on the modification. I will write some
test cases for verification.

>
>Yep, this change works as expected, IIUC.
>
>However, I'm still wondering if the added complexity is worth it for
>such a slight/negligible performance gain. That said, if we have
>solid numbers/data to back it up, all doubts would disappear ;)
>

Indeed, unless there is a noticeable improvement in efficiency, increasing
complexity may not be worth the cost. I will summarize some test data
to determine whether this optimization is worthwhile.

------
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-12  5:05     ` Lance Yang
  2025-04-12  5:27       ` Xavier
@ 2025-04-14  8:06       ` Ryan Roberts
  2025-04-14  8:51         ` Dev Jain
  1 sibling, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-04-14  8:06 UTC (permalink / raw)
  To: Lance Yang, Dev Jain, Xavier
  Cc: Barry Song, catalin.marinas, will, akpm, linux-arm-kernel,
	linux-kernel, David Hildenbrand, Matthew Wilcox, Zi Yan

On 12/04/2025 06:05, Lance Yang wrote:
> On Sat, Apr 12, 2025 at 1:30 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>> +others
>>
>> On 11/04/25 2:55 am, Barry Song wrote:
>>> On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
>>>>
>>>> This commit optimizes the contpte_ptep_get function by adding early
>>>>   termination logic. It checks if the dirty and young bits of orig_pte
>>>>   are already set and skips redundant bit-setting operations during
>>>>   the loop. This reduces unnecessary iterations and improves performance.
>>>>
>>>> Signed-off-by: Xavier <xavier_qy@163.com>
>>>> ---
>>>>   arch/arm64/mm/contpte.c | 13 +++++++++++--
>>>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index bcac4f55f9c1..ca15d8f52d14 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>
>>>>          pte_t pte;
>>>>          int i;
>>>> +       bool dirty = false;
>>>> +       bool young = false;
>>>>
>>>>          ptep = contpte_align_down(ptep);
>>>>
>>>>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>                  pte = __ptep_get(ptep);
>>>>
>>>> -               if (pte_dirty(pte))
>>>> +               if (!dirty && pte_dirty(pte)) {
>>>> +                       dirty = true;
>>>>                          orig_pte = pte_mkdirty(orig_pte);
>>>> +               }
>>>>
>>>> -               if (pte_young(pte))
>>>> +               if (!young && pte_young(pte)) {
>>>> +                       young = true;
>>>>                          orig_pte = pte_mkyoung(orig_pte);
>>>> +               }
>>>> +
>>>> +               if (dirty && young)
>>>> +                       break;
>>>
>>> This kind of optimization is always tricky. Dev previously tried a similar
>>> approach to reduce the loop count, but it ended up causing performance
>>> degradation:
>>> https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
>>>
>>> So we may need actual data to validate this idea.
>>
>> The original v2 patch does not work, I changed it to the following:
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..db0ad38601db 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm,
>> unsigned long addr,
>>   }
>>   EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> +       int _start; \
>> +       pte_t *_ptep = ptep; \
>> +       for (_start = start; _start < CONT_PTES; _start++, ptep++) { \
>> +               if (pte_##flag(__ptep_get(_ptep))) { \
>> +                       orig_pte = pte_mk##flag(orig_pte); \
>> +                       break; \
>> +               } \
>> +       }
>> +
>>   pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>   {
>>          /*
>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>>                  pte = __ptep_get(ptep);
>>
>> -               if (pte_dirty(pte))
>> +               if (pte_dirty(pte)) {
>>                          orig_pte = pte_mkdirty(orig_pte);
>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>> +                       break;
>> +               }
>>
>> -               if (pte_young(pte))
>> +               if (pte_young(pte)) {
>>                          orig_pte = pte_mkyoung(orig_pte);
>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>> +                       break;
>> +               }
>>          }
>>
>>          return orig_pte;
>>
>> Some rudimentary testing with micromm reveals that this may be
>> *slightly* faster. I cannot say for sure yet.
> 
> Yep, this change works as expected, IIUC.
> 
> However, I'm still wondering if the added complexity is worth it for
> such a slight/negligible performance gain. That said, if we have
> solid numbers/data to back it up, all doubts would disappear ;)

I agree with Barry; we need clear performance improvement numbers to consider
this type of optimization. I doubt there will be measurable improvement for 4K
and 64K base pages (because the the number of PTEs in a contpte block are 16 and
32 respectively). But 16K base pages may benefit given there are 128 PTEs in a
contpte block for that case.

Also FWIW, I'm struggling to understand CHECK_CONTPTE_FLAG(); Perhaps something
like this would suffice?

----8<----
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 55107d27d3f8..7787b116b339 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -169,11 +169,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
        for (i = 0; i < CONT_PTES; i++, ptep++) {
                pte = __ptep_get(ptep);

-               if (pte_dirty(pte))
+               if (pte_dirty(pte)) {
                        orig_pte = pte_mkdirty(orig_pte);
+                       if (pte_young(orig_pte))
+                               break;
+               }

-               if (pte_young(pte))
+               if (pte_young(pte)) {
                        orig_pte = pte_mkyoung(orig_pte);
+                       if (pte_dirty(orig_pte))
+                               break;
+               }
        }

        return orig_pte;
----8<----

Thanks,
Ryan


> 
> Thanks,
> Lance
> 
>>
>>>
>>>>          }
>>>>
>>>>          return orig_pte;
>>>> --
>>>> 2.34.1
>>>>
>>>
>>> Thanks
>>> Barry
>>>
>>



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-14  8:06       ` Ryan Roberts
@ 2025-04-14  8:51         ` Dev Jain
  2025-04-14 12:11           ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Dev Jain @ 2025-04-14  8:51 UTC (permalink / raw)
  To: Ryan Roberts, Lance Yang, Xavier
  Cc: Barry Song, catalin.marinas, will, akpm, linux-arm-kernel,
	linux-kernel, David Hildenbrand, Matthew Wilcox, Zi Yan



On 14/04/25 1:36 pm, Ryan Roberts wrote:
> On 12/04/2025 06:05, Lance Yang wrote:
>> On Sat, Apr 12, 2025 at 1:30 AM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>> +others
>>>
>>> On 11/04/25 2:55 am, Barry Song wrote:
>>>> On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
>>>>>
>>>>> This commit optimizes the contpte_ptep_get function by adding early
>>>>>    termination logic. It checks if the dirty and young bits of orig_pte
>>>>>    are already set and skips redundant bit-setting operations during
>>>>>    the loop. This reduces unnecessary iterations and improves performance.
>>>>>
>>>>> Signed-off-by: Xavier <xavier_qy@163.com>
>>>>> ---
>>>>>    arch/arm64/mm/contpte.c | 13 +++++++++++--
>>>>>    1 file changed, 11 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>> index bcac4f55f9c1..ca15d8f52d14 100644
>>>>> --- a/arch/arm64/mm/contpte.c
>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>> @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>>
>>>>>           pte_t pte;
>>>>>           int i;
>>>>> +       bool dirty = false;
>>>>> +       bool young = false;
>>>>>
>>>>>           ptep = contpte_align_down(ptep);
>>>>>
>>>>>           for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>>                   pte = __ptep_get(ptep);
>>>>>
>>>>> -               if (pte_dirty(pte))
>>>>> +               if (!dirty && pte_dirty(pte)) {
>>>>> +                       dirty = true;
>>>>>                           orig_pte = pte_mkdirty(orig_pte);
>>>>> +               }
>>>>>
>>>>> -               if (pte_young(pte))
>>>>> +               if (!young && pte_young(pte)) {
>>>>> +                       young = true;
>>>>>                           orig_pte = pte_mkyoung(orig_pte);
>>>>> +               }
>>>>> +
>>>>> +               if (dirty && young)
>>>>> +                       break;
>>>>
>>>> This kind of optimization is always tricky. Dev previously tried a similar
>>>> approach to reduce the loop count, but it ended up causing performance
>>>> degradation:
>>>> https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
>>>>
>>>> So we may need actual data to validate this idea.
>>>
>>> The original v2 patch does not work, I changed it to the following:
>>>
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index bcac4f55f9c1..db0ad38601db 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm,
>>> unsigned long addr,
>>>    }
>>>    EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>>
>>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>>> +       int _start; \
>>> +       pte_t *_ptep = ptep; \
>>> +       for (_start = start; _start < CONT_PTES; _start++, ptep++) { \
>>> +               if (pte_##flag(__ptep_get(_ptep))) { \
>>> +                       orig_pte = pte_mk##flag(orig_pte); \
>>> +                       break; \
>>> +               } \
>>> +       }
>>> +
>>>    pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>    {
>>>           /*
>>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>           for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>                   pte = __ptep_get(ptep);
>>>
>>> -               if (pte_dirty(pte))
>>> +               if (pte_dirty(pte)) {
>>>                           orig_pte = pte_mkdirty(orig_pte);
>>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>>> +                       break;
>>> +               }
>>>
>>> -               if (pte_young(pte))
>>> +               if (pte_young(pte)) {
>>>                           orig_pte = pte_mkyoung(orig_pte);
>>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>>> +                       break;
>>> +               }
>>>           }
>>>
>>>           return orig_pte;
>>>
>>> Some rudimentary testing with micromm reveals that this may be
>>> *slightly* faster. I cannot say for sure yet.
>>
>> Yep, this change works as expected, IIUC.
>>
>> However, I'm still wondering if the added complexity is worth it for
>> such a slight/negligible performance gain. That said, if we have
>> solid numbers/data to back it up, all doubts would disappear ;)
> 
> I agree with Barry; we need clear performance improvement numbers to consider
> this type of optimization. I doubt there will be measurable improvement for 4K
> and 64K base pages (because the the number of PTEs in a contpte block are 16 and
> 32 respectively). But 16K base pages may benefit given there are 128 PTEs in a
> contpte block for that case.
> 
> Also FWIW, I'm struggling to understand CHECK_CONTPTE_FLAG(); Perhaps something
> like this would suffice?

The idea of CHECK_CONTPTE_FLAG() is to check only for the other flag 
until it's found; the diff below will still keep checking for both until 
both are found.

> 
> ----8<----
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 55107d27d3f8..7787b116b339 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -169,11 +169,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>                  pte = __ptep_get(ptep);
> 
> -               if (pte_dirty(pte))
> +               if (pte_dirty(pte)) {
>                          orig_pte = pte_mkdirty(orig_pte);
> +                       if (pte_young(orig_pte))
> +                               break;
> +               }
> 
> -               if (pte_young(pte))
> +               if (pte_young(pte)) {
>                          orig_pte = pte_mkyoung(orig_pte);
> +                       if (pte_dirty(orig_pte))
> +                               break;
> +               }
>          }
> 
>          return orig_pte;
> ----8<----
> 
> Thanks,
> Ryan
> 
> 
>>
>> Thanks,
>> Lance
>>
>>>
>>>>
>>>>>           }
>>>>>
>>>>>           return orig_pte;
>>>>> --
>>>>> 2.34.1
>>>>>
>>>>
>>>> Thanks
>>>> Barry
>>>>
>>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-14  8:51         ` Dev Jain
@ 2025-04-14 12:11           ` Ryan Roberts
  2025-04-15  8:22             ` [mm/contpte v3 0/1] " Xavier
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-04-14 12:11 UTC (permalink / raw)
  To: Dev Jain, Lance Yang, Xavier
  Cc: Barry Song, catalin.marinas, will, akpm, linux-arm-kernel,
	linux-kernel, David Hildenbrand, Matthew Wilcox, Zi Yan

On 14/04/2025 09:51, Dev Jain wrote:
> 
> 
> On 14/04/25 1:36 pm, Ryan Roberts wrote:
>> On 12/04/2025 06:05, Lance Yang wrote:
>>> On Sat, Apr 12, 2025 at 1:30 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>> +others
>>>>
>>>> On 11/04/25 2:55 am, Barry Song wrote:
>>>>> On Mon, Apr 7, 2025 at 9:23 PM Xavier <xavier_qy@163.com> wrote:
>>>>>>
>>>>>> This commit optimizes the contpte_ptep_get function by adding early
>>>>>>    termination logic. It checks if the dirty and young bits of orig_pte
>>>>>>    are already set and skips redundant bit-setting operations during
>>>>>>    the loop. This reduces unnecessary iterations and improves performance.
>>>>>>
>>>>>> Signed-off-by: Xavier <xavier_qy@163.com>
>>>>>> ---
>>>>>>    arch/arm64/mm/contpte.c | 13 +++++++++++--
>>>>>>    1 file changed, 11 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>>> index bcac4f55f9c1..ca15d8f52d14 100644
>>>>>> --- a/arch/arm64/mm/contpte.c
>>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>>> @@ -163,17 +163,26 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>>>
>>>>>>           pte_t pte;
>>>>>>           int i;
>>>>>> +       bool dirty = false;
>>>>>> +       bool young = false;
>>>>>>
>>>>>>           ptep = contpte_align_down(ptep);
>>>>>>
>>>>>>           for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>>>                   pte = __ptep_get(ptep);
>>>>>>
>>>>>> -               if (pte_dirty(pte))
>>>>>> +               if (!dirty && pte_dirty(pte)) {
>>>>>> +                       dirty = true;
>>>>>>                           orig_pte = pte_mkdirty(orig_pte);
>>>>>> +               }
>>>>>>
>>>>>> -               if (pte_young(pte))
>>>>>> +               if (!young && pte_young(pte)) {
>>>>>> +                       young = true;
>>>>>>                           orig_pte = pte_mkyoung(orig_pte);
>>>>>> +               }
>>>>>> +
>>>>>> +               if (dirty && young)
>>>>>> +                       break;
>>>>>
>>>>> This kind of optimization is always tricky. Dev previously tried a similar
>>>>> approach to reduce the loop count, but it ended up causing performance
>>>>> degradation:
>>>>> https://lore.kernel.org/linux-mm/20240913091902.1160520-1-dev.jain@arm.com/
>>>>>
>>>>> So we may need actual data to validate this idea.
>>>>
>>>> The original v2 patch does not work, I changed it to the following:
>>>>
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index bcac4f55f9c1..db0ad38601db 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm,
>>>> unsigned long addr,
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>>>
>>>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>>>> +       int _start; \
>>>> +       pte_t *_ptep = ptep; \
>>>> +       for (_start = start; _start < CONT_PTES; _start++, ptep++) { \
>>>> +               if (pte_##flag(__ptep_get(_ptep))) { \
>>>> +                       orig_pte = pte_mk##flag(orig_pte); \
>>>> +                       break; \
>>>> +               } \
>>>> +       }
>>>> +
>>>>    pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>    {
>>>>           /*
>>>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>           for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>                   pte = __ptep_get(ptep);
>>>>
>>>> -               if (pte_dirty(pte))
>>>> +               if (pte_dirty(pte)) {
>>>>                           orig_pte = pte_mkdirty(orig_pte);
>>>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>>>> +                       break;
>>>> +               }
>>>>
>>>> -               if (pte_young(pte))
>>>> +               if (pte_young(pte)) {
>>>>                           orig_pte = pte_mkyoung(orig_pte);
>>>> +                       CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>>>> +                       break;
>>>> +               }
>>>>           }
>>>>
>>>>           return orig_pte;
>>>>
>>>> Some rudimentary testing with micromm reveals that this may be
>>>> *slightly* faster. I cannot say for sure yet.
>>>
>>> Yep, this change works as expected, IIUC.
>>>
>>> However, I'm still wondering if the added complexity is worth it for
>>> such a slight/negligible performance gain. That said, if we have
>>> solid numbers/data to back it up, all doubts would disappear ;)
>>
>> I agree with Barry; we need clear performance improvement numbers to consider
>> this type of optimization. I doubt there will be measurable improvement for 4K
>> and 64K base pages (because the the number of PTEs in a contpte block are 16 and
>> 32 respectively). But 16K base pages may benefit given there are 128 PTEs in a
>> contpte block for that case.
>>
>> Also FWIW, I'm struggling to understand CHECK_CONTPTE_FLAG(); Perhaps something
>> like this would suffice?
> 
> The idea of CHECK_CONTPTE_FLAG() is to check only for the other flag until it's
> found; the diff below will still keep checking for both until both are found.

Oh I see. Although you're effectively just saving a tbnz instruction, which will
be in the noise.

Anyway, come back with numbers and we can talk :)

> 
>>
>> ----8<----
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index 55107d27d3f8..7787b116b339 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -169,11 +169,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>          for (i = 0; i < CONT_PTES; i++, ptep++) {
>>                  pte = __ptep_get(ptep);
>>
>> -               if (pte_dirty(pte))
>> +               if (pte_dirty(pte)) {
>>                          orig_pte = pte_mkdirty(orig_pte);
>> +                       if (pte_young(orig_pte))
>> +                               break;
>> +               }
>>
>> -               if (pte_young(pte))
>> +               if (pte_young(pte)) {
>>                          orig_pte = pte_mkyoung(orig_pte);
>> +                       if (pte_dirty(orig_pte))
>> +                               break;
>> +               }
>>          }
>>
>>          return orig_pte;
>> ----8<----
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>> Thanks,
>>> Lance
>>>
>>>>
>>>>>
>>>>>>           }
>>>>>>
>>>>>>           return orig_pte;
>>>>>> -- 
>>>>>> 2.34.1
>>>>>>
>>>>>
>>>>> Thanks
>>>>> Barry
>>>>>
>>>>
>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-14 12:11           ` Ryan Roberts
@ 2025-04-15  8:22             ` Xavier
  2025-04-15  8:22               ` [mm/contpte v3 1/1] " Xavier
                                 ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Xavier @ 2025-04-15  8:22 UTC (permalink / raw)
  To: ryan.roberts, dev.jain, ioworker0, 21cnbao
  Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
	linux-kernel, will, willy, xavier_qy, ziy

Patch V3 has changed the while loop to a for loop according to the suggestions
of Dev. Meanwhile, to improve efficiency, the definition of local variables has
been removed. This macro is only used within the current function and there
will be no additional risks. In order to verify the optimization performance of
Patch V3, a test function has been designed. By repeatedly calling mlock in a
loop, the kernel is made to call contpte_ptep_get extensively to test the
optimization effect of this function.
The function's execution time and instruction statistics have been traced using
perf, and the following are the operation results on a certain Qualcomm mobile
phone chip:

Instruction Statistics - Before Optimization
#          count  event_name              # count / runtime
      20,814,352  branch-load-misses      # 662.244 K/sec
  41,894,986,323  branch-loads            # 1.333 G/sec
       1,957,415  iTLB-load-misses        # 62.278 K/sec
  49,872,282,100  iTLB-loads              # 1.587 G/sec
     302,808,096  L1-icache-load-misses   # 9.634 M/sec
  49,872,282,100  L1-icache-loads         # 1.587 G/sec

Total test time: 31.485237 seconds.

Instruction Statistics - After Optimization
#          count  event_name              # count / runtime
      19,340,524  branch-load-misses      # 688.753 K/sec
  38,510,185,183  branch-loads            # 1.371 G/sec
       1,812,716  iTLB-load-misses        # 64.554 K/sec
  47,673,923,151  iTLB-loads              # 1.698 G/sec
     675,853,661  L1-icache-load-misses   # 24.068 M/sec
  47,673,923,151  L1-icache-loads         # 1.698 G/sec

Total test time: 28.108048 seconds.

Function Statistics - Before Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1419716
Event count: 99618088900

Overhead   Symbol
21.42%     lock_release
21.26%     lock_acquire
20.88%     arch_counter_get_cntvct
14.32%     _raw_spin_unlock_irq
6.79%      contpte_ptep_get
2.20%      test_contpte_perf
1.82%      follow_page_pte
0.97%      lock_acquired
0.97%      rcu_is_watching
0.89%      mlock_pte_range
0.84%      sched_clock_noinstr
0.70%      handle_softirqs.llvm.8218488130471452153
0.58%      test_preempt_disable_long
0.57%      _raw_spin_unlock_irqrestore
0.54%      arch_stack_walk
0.51%      vm_normal_folio
0.48%      check_preemption_disabled
0.47%      stackinfo_get_task
0.36%      try_grab_folio
0.34%      preempt_count
0.32%      trace_preempt_on
0.29%      trace_preempt_off
0.24%      debug_smp_processor_id

Function Statistics - After Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1431006
Event count: 118856425042

Overhead   Symbol
22.59%     lock_release
22.13%     arch_counter_get_cntvct
22.08%     lock_acquire
15.32%     _raw_spin_unlock_irq
2.26%      test_contpte_perf
1.50%      follow_page_pte
1.49%      arch_stack_walk
1.30%      rcu_is_watching
1.09%      lock_acquired
1.07%      sched_clock_noinstr
0.88%      handle_softirqs.llvm.12507768597002095717
0.88%      trace_preempt_off
0.76%      _raw_spin_unlock_irqrestore
0.61%      check_preemption_disabled
0.52%      trace_preempt_on
0.50%      mlock_pte_range
0.43%      try_grab_folio
0.41%      folio_mark_accessed
0.40%      vm_normal_folio
0.38%      test_preempt_disable_long
0.28%      contpte_ptep_get
0.27%      __traceiter_android_rvh_preempt_disable
0.26%      debug_smp_processor_id
0.24%      return_address
0.20%      __pte_offset_map_lock
0.19%      unwind_next_frame_record

If there is no problem with my test program, it can be seen that there is a
significant performance improvement both in the overall number of instructions
and the execution time of contpte_ptep_get.

If any reviewers have time, you can also test it on your machines for comparison.
I have enabled THP and hugepages-64kB.

Test Function:
---
#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)

void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}
void test_contpte_perf()
{
	char *buf;
	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
	if (ret != 0) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);

		rwdata(buf);

		munlock(buf, TEST_SIZE);
	}
	
	free(buf);
}
---

Xavier (1):
  mm/contpte: Optimize loop to reduce redundant operations

 arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-15  8:22             ` [mm/contpte v3 0/1] " Xavier
@ 2025-04-15  8:22               ` Xavier
  2025-04-15  9:01                 ` [mm/contpte v3] " Markus Elfring
                                   ` (2 more replies)
  2025-04-16  2:10               ` [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations Andrew Morton
                                 ` (2 subsequent siblings)
  3 siblings, 3 replies; 49+ messages in thread
From: Xavier @ 2025-04-15  8:22 UTC (permalink / raw)
  To: ryan.roberts, dev.jain, ioworker0, 21cnbao
  Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
	linux-kernel, will, willy, xavier_qy, ziy

This commit optimizes the contpte_ptep_get function by adding early
 termination logic. It checks if the dirty and young bits of orig_pte
 are already set and skips redundant bit-setting operations during
 the loop. This reduces unnecessary iterations and improves performance.

Signed-off-by: Xavier <xavier_qy@163.com>
---
 arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..0acfee604947 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(__contpte_try_unfold);
 
+/* Note: in order to improve efficiency, using this macro will modify the
+ * passed-in parameters.*/
+#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
+    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
+		if (pte_##flag(__ptep_get(ptep))) { \
+				orig_pte = pte_mk##flag(orig_pte); \
+				break; \
+		} \
+    }
+
 pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 {
 	/*
@@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
+			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
+			break;
+		}
 
-		if (pte_young(pte))
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
+			break;
+		}
 	}
 
 	return orig_pte;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-15  8:22               ` [mm/contpte v3 1/1] " Xavier
@ 2025-04-15  9:01                 ` Markus Elfring
  2025-04-16  8:57                 ` [mm/contpte v3 1/1] " David Laight
  2025-04-16 12:54                 ` Ryan Roberts
  2 siblings, 0 replies; 49+ messages in thread
From: Markus Elfring @ 2025-04-15  9:01 UTC (permalink / raw)
  To: xavier_qy, linux-arm-kernel
  Cc: LKML, Andrew Morton, Barry Song, Catalin Marinas,
	David Hildenbrand, Dev Jain, Gavin Shan, Lance Yang,
	Matthew Wilcox, Ryan Roberts, Will Deacon, Zi Yan

> This commit optimizes the contpte_ptep_get function by …

> Signed-off-by: Xavier <xavier_qy@163.com>

I guess that a complete personal name would be more desirable according to
the Developer's Certificate of Origin.
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.15-rc2#n436


Should an other key word be specified for the prefix in the subject?


See also:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.15-rc2#n94

Regards,
Markus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-15  8:22               ` [mm/contpte v3 1/1] " Xavier
  2025-04-15  9:01                 ` [mm/contpte v3] " Markus Elfring
@ 2025-04-16  8:57                 ` David Laight
  2025-04-16 16:15                   ` Xavier
  2025-04-16 12:54                 ` Ryan Roberts
  2 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2025-04-16  8:57 UTC (permalink / raw)
  To: Xavier
  Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas,
	david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy

On Tue, 15 Apr 2025 16:22:05 +0800
Xavier <xavier_qy@163.com> wrote:

> This commit optimizes the contpte_ptep_get function by adding early
>  termination logic. It checks if the dirty and young bits of orig_pte
>  are already set and skips redundant bit-setting operations during
>  the loop. This reduces unnecessary iterations and improves performance.

Benchmarks?

As has been pointed out before CONT_PTES is small and IIRC dirty+young
is unusual.

> 
> Signed-off-by: Xavier <xavier_qy@163.com>
> ---
>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..0acfee604947 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>  }
>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>  
> +/* Note: in order to improve efficiency, using this macro will modify the
> + * passed-in parameters.*/

... this macro modifies ...

But you can make it obvious my passing by reference.
The compiler will generate the same code.

	David

> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> +		if (pte_##flag(__ptep_get(ptep))) { \
> +				orig_pte = pte_mk##flag(orig_pte); \
> +				break; \
> +		} \
> +    }
> +
>  pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>  {
>  	/*
> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>  		pte = __ptep_get(ptep);
>  
> -		if (pte_dirty(pte))
> +		if (pte_dirty(pte)) {
>  			orig_pte = pte_mkdirty(orig_pte);
> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
> +			break;
> +		}
>  
> -		if (pte_young(pte))
> +		if (pte_young(pte)) {
>  			orig_pte = pte_mkyoung(orig_pte);
> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
> +			break;
> +		}
>  	}
>  
>  	return orig_pte;



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-16  8:57                 ` [mm/contpte v3 1/1] " David Laight
@ 2025-04-16 16:15                   ` Xavier
  0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 16:15 UTC (permalink / raw)
  To: David Laight
  Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas,
	david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy




At 2025-04-16 16:57:06, "David Laight" <david.laight.linux@gmail.com> wrote:
>On Tue, 15 Apr 2025 16:22:05 +0800
>Xavier <xavier_qy@163.com> wrote:
>
>> This commit optimizes the contpte_ptep_get function by adding early
>>  termination logic. It checks if the dirty and young bits of orig_pte
>>  are already set and skips redundant bit-setting operations during
>>  the loop. This reduces unnecessary iterations and improves performance.
>
>Benchmarks?
>
>As has been pointed out before CONT_PTES is small and IIRC dirty+young
>is unusual.

I haven't found some suitable benchmark tests yet. I will write some more
general test scenarios. Please pay attention to the subsequent emails.

>
>> 
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>>  1 file changed, 18 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>  
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>
>... this macro modifies ...
>
>But you can make it obvious my passing by reference.
>The compiler will generate the same code.
>

This part may also be further refined.

--
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-15  8:22               ` [mm/contpte v3 1/1] " Xavier
  2025-04-15  9:01                 ` [mm/contpte v3] " Markus Elfring
  2025-04-16  8:57                 ` [mm/contpte v3 1/1] " David Laight
@ 2025-04-16 12:54                 ` Ryan Roberts
  2025-04-16 16:09                   ` Xavier
  2025-04-22  9:33                   ` Xavier
  2 siblings, 2 replies; 49+ messages in thread
From: Ryan Roberts @ 2025-04-16 12:54 UTC (permalink / raw)
  To: Xavier, dev.jain, ioworker0, 21cnbao
  Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
	linux-kernel, will, willy, ziy

On 15/04/2025 09:22, Xavier wrote:
> This commit optimizes the contpte_ptep_get function by adding early
>  termination logic. It checks if the dirty and young bits of orig_pte
>  are already set and skips redundant bit-setting operations during
>  the loop. This reduces unnecessary iterations and improves performance.
> 
> Signed-off-by: Xavier <xavier_qy@163.com>
> ---
>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..0acfee604947 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>  }
>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>  
> +/* Note: in order to improve efficiency, using this macro will modify the
> + * passed-in parameters.*/
> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> +		if (pte_##flag(__ptep_get(ptep))) { \
> +				orig_pte = pte_mk##flag(orig_pte); \
> +				break; \
> +		} \
> +    }

I'm really not a fan of this macro, it just obfuscates what is going on. I'd
personally prefer to see the 2 extra loops open coded below.

Or even better, could you provide results comparing this 3 loop version to the
simpler approach I suggested previously? If the performance is similar (which I
expect it will be, especially given Barry's point that your test always ensures
the first PTE is both young and dirty) then I'd prefer to go with the simpler code.


> +
>  pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>  {
>  	/*
> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>  		pte = __ptep_get(ptep);
>  
> -		if (pte_dirty(pte))
> +		if (pte_dirty(pte)) {
>  			orig_pte = pte_mkdirty(orig_pte);
> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
> +			break;
> +		}
>  
> -		if (pte_young(pte))
> +		if (pte_young(pte)) {
>  			orig_pte = pte_mkyoung(orig_pte);
> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
> +			break;
> +		}
>  	}
>  
>  	return orig_pte;

If we decide this is all worth the trouble, then I think we can (and *should*,
in order to be consistent) pull a similar stunt in contpte_ptep_get_lockless().

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-16 12:54                 ` Ryan Roberts
@ 2025-04-16 16:09                   ` Xavier
  2025-04-22  9:33                   ` Xavier
  1 sibling, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 16:09 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas, david, gshan,
	linux-arm-kernel, linux-kernel, will, willy, ziy



At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> This commit optimizes the contpte_ptep_get function by adding early
>>  termination logic. It checks if the dirty and young bits of orig_pte
>>  are already set and skips redundant bit-setting operations during
>>  the loop. This reduces unnecessary iterations and improves performance.
>> 
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>>  1 file changed, 18 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>  
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> +		if (pte_##flag(__ptep_get(ptep))) { \
>> +				orig_pte = pte_mk##flag(orig_pte); \
>> +				break; \
>> +		} \
>> +    }
>
>I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>personally prefer to see the 2 extra loops open coded below.
>
>Or even better, could you provide results comparing this 3 loop version to the
>simpler approach I suggested previously? If the performance is similar (which I
>expect it will be, especially given Barry's point that your test always ensures
>the first PTE is both young and dirty) then I'd prefer to go with the simpler code.

I will test the performance differences among these several implementations.
If the differences are not significant, I will try to implement it with the simpler method.

At the same time, according to Barry's suggestion, I will add some more general
test cases to verify whether the optimization is effective.

Since I'm a bit busy recently, it may take me a few days to complete this task.

>> +
>>  pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>  {
>>  	/*
>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>  		pte = __ptep_get(ptep);
>>  
>> -		if (pte_dirty(pte))
>> +		if (pte_dirty(pte)) {
>>  			orig_pte = pte_mkdirty(orig_pte);
>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>> +			break;
>> +		}
>>  
>> -		if (pte_young(pte))
>> +		if (pte_young(pte)) {
>>  			orig_pte = pte_mkyoung(orig_pte);
>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>> +			break;
>> +		}
>>  	}
>>  
>>  	return orig_pte;
>
>If we decide this is all worth the trouble, then I think we can (and *should*,
>in order to be consistent) pull a similar stunt in contpte_ptep_get_lockless().

OK. This function seems to be a bit more complicated. I'll give it a try to
figure out how to optimize it.

--
Thanks,
Xavier


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-16 12:54                 ` Ryan Roberts
  2025-04-16 16:09                   ` Xavier
@ 2025-04-22  9:33                   ` Xavier
  2025-04-30 23:17                     ` Barry Song
  1 sibling, 1 reply; 49+ messages in thread
From: Xavier @ 2025-04-22  9:33 UTC (permalink / raw)
  To: Ryan Roberts, dev.jain, ioworker0, 21cnbao
  Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
	linux-kernel, will, willy, ziy


Hi all,


At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> This commit optimizes the contpte_ptep_get function by adding early
>>  termination logic. It checks if the dirty and young bits of orig_pte
>>  are already set and skips redundant bit-setting operations during
>>  the loop. This reduces unnecessary iterations and improves performance.
>> 
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>>  1 file changed, 18 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>  
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> +		if (pte_##flag(__ptep_get(ptep))) { \
>> +				orig_pte = pte_mk##flag(orig_pte); \
>> +				break; \
>> +		} \
>> +    }
>
>I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>personally prefer to see the 2 extra loops open coded below.
>
>Or even better, could you provide results comparing this 3 loop version to the
>simpler approach I suggested previously? If the performance is similar (which I
>expect it will be, especially given Barry's point that your test always ensures
>the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
>

Based on the discussions in the previous email, two modifications were adopted
and tested, and the results are as follows:

Modification 1

pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	pte_t pte;
	int i;

	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte)) {
			orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(orig_pte))
				break;
		}

		if (pte_young(pte)) {
			orig_pte = pte_mkyoung(orig_pte);
			if (pte_dirty(orig_pte))
				break;
		}
	}

	return orig_pte;
}

Modification 2

pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	pte_t pte;
	int i;

	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte)) {
			orig_pte = pte_mkdirty(orig_pte);
			for (; i < CONT_PTES; i++, ptep++) {
				pte = __ptep_get(ptep);
				if (pte_young(pte)) {
					orig_pte = pte_mkyoung(orig_pte);
					break;
				}
			}
			break;
		}

		if (pte_young(pte)) {
			orig_pte = pte_mkyoung(orig_pte);
			i++;
			ptep++;
			for (; i < CONT_PTES; i++, ptep++) {
				pte = __ptep_get(ptep);
				if (pte_dirty(pte)) {
					orig_pte = pte_mkdirty(orig_pte);
					break;
				}
			}
			break;
		}
	}

	return orig_pte;
}

Test Code:

#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT 8
void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}
void clear_young_dirty(char *buf)
{
	if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
	if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
}
void set_one_young(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
		volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
	}
}

void test_contpte_perf() {
	char *buf;
	int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
	if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
	clear_young_dirty(buf);
#endif
#if TEST_CASE2
	set_one_young(buf);
#endif

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);

		munlock(buf, TEST_SIZE);
	}
	free(buf);
}
---

Descriptions of three test scenarios

Scenario 1
The data of all 16 PTEs are both dirty and young.
#define TEST_CASE2 0
#define TEST_CASE3 0

Scenario 2
Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
#define TEST_CASE2 1
#define TEST_CASE3 0

Scenario 3
Among the 16 PTEs, there are neither young nor dirty ones.
#define TEST_CASE2 0
#define TEST_CASE3 1


Test results

|Scenario 1         |       Original|  Modification 1|  Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions       |    37912436160|     18303833386|     18731580031|
|test time          |         4.2797|          2.2687|          2.2949|
|overhead of        |               |                |                |
|contpte_ptep_get() |         21.31%|           4.72%|           4.80%|

|Scenario 2         |       Original|  Modification 1|  Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions       |    36701270862|     38729716276|     36115790086|
|test time          |         3.2335|          3.5732|          3.0874|
|Overhead of        |               |                |                |
|contpte_ptep_get() |         32.26%|          41.35%|          33.57%|

|Scenario 3         |       Original|  Modification 1|  Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions       |    36706279735|     38305241759|     36750881878|
|test time          |         3.2008|          3.5389|          3.1249|
|Overhead of        |               |                |                |
|contpte_ptep_get() |         31.94%|          41.30%|          34.59%|


For Scenario 1, Modification 1 can achieve an instruction count benefit of
51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
benefit of 50.59% and a time benefit of 46.38%.

For Scenarios 2, Modification 2 can achieve an instruction count benefit of
1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
the instructions and time due to additional conditional checks.

For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
the branches taken by Modification 1 and Modification 2 should be the same as
those of the original code. In fact, the test results of Modification 2 seem
to be closer to those of the original code. I don't know why there is a
performance regression in Modification 1.

Therefore, I believe modifying the code according to Modification 2 can bring
maximum benefits. Everyone can discuss whether this approach is acceptable,
and if so, I will send Patch V4 to proceed with submitting this modification.


--
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-22  9:33                   ` Xavier
@ 2025-04-30 23:17                     ` Barry Song
  2025-05-01 12:39                       ` Xavier
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2025-04-30 23:17 UTC (permalink / raw)
  To: Xavier
  Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy

On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
>
>
> Hi all,
>
>
> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> >On 15/04/2025 09:22, Xavier wrote:
> >> This commit optimizes the contpte_ptep_get function by adding early
> >>  termination logic. It checks if the dirty and young bits of orig_pte
> >>  are already set and skips redundant bit-setting operations during
> >>  the loop. This reduces unnecessary iterations and improves performance.
> >>
> >> Signed-off-by: Xavier <xavier_qy@163.com>
> >> ---
> >>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> >>  1 file changed, 18 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >> index bcac4f55f9c1..0acfee604947 100644
> >> --- a/arch/arm64/mm/contpte.c
> >> +++ b/arch/arm64/mm/contpte.c
> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> >>  }
> >>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> >>
> >> +/* Note: in order to improve efficiency, using this macro will modify the
> >> + * passed-in parameters.*/
> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> >> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> >> +            if (pte_##flag(__ptep_get(ptep))) { \
> >> +                            orig_pte = pte_mk##flag(orig_pte); \
> >> +                            break; \
> >> +            } \
> >> +    }
> >
> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> >personally prefer to see the 2 extra loops open coded below.
> >
> >Or even better, could you provide results comparing this 3 loop version to the
> >simpler approach I suggested previously? If the performance is similar (which I
> >expect it will be, especially given Barry's point that your test always ensures
> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> >
>
> Based on the discussions in the previous email, two modifications were adopted
> and tested, and the results are as follows:
>
> Modification 1
>
> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> {
>         pte_t pte;
>         int i;
>
>         ptep = contpte_align_down(ptep);
>
>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>                 pte = __ptep_get(ptep);
>
>                 if (pte_dirty(pte)) {
>                         orig_pte = pte_mkdirty(orig_pte);
>                         if (pte_young(orig_pte))
>                                 break;
>                 }
>
>                 if (pte_young(pte)) {
>                         orig_pte = pte_mkyoung(orig_pte);
>                         if (pte_dirty(orig_pte))
>                                 break;
>                 }
>         }
>
>         return orig_pte;
> }
>
> Modification 2
>
> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> {
>         pte_t pte;
>         int i;
>
>         ptep = contpte_align_down(ptep);
>
>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>                 pte = __ptep_get(ptep);
>
>                 if (pte_dirty(pte)) {
>                         orig_pte = pte_mkdirty(orig_pte);
>                         for (; i < CONT_PTES; i++, ptep++) {
>                                 pte = __ptep_get(ptep);
>                                 if (pte_young(pte)) {
>                                         orig_pte = pte_mkyoung(orig_pte);
>                                         break;
>                                 }
>                         }
>                         break;
>                 }
>
>                 if (pte_young(pte)) {
>                         orig_pte = pte_mkyoung(orig_pte);
>                         i++;
>                         ptep++;
>                         for (; i < CONT_PTES; i++, ptep++) {
>                                 pte = __ptep_get(ptep);
>                                 if (pte_dirty(pte)) {
>                                         orig_pte = pte_mkdirty(orig_pte);
>                                         break;
>                                 }
>                         }
>                         break;
>                 }
>         }
>
>         return orig_pte;
> }
>
> Test Code:
>
> #define PAGE_SIZE 4096
> #define CONT_PTES 16
> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> #define YOUNG_BIT 8
> void rwdata(char *buf)
> {
>         for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>                 buf[i] = 'a';
>                 volatile char c = buf[i];
>         }
> }
> void clear_young_dirty(char *buf)
> {
>         if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>                 perror("madvise free failed");
>                 free(buf);
>                 exit(EXIT_FAILURE);
>         }
>         if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>                 perror("madvise free failed");
>                 free(buf);
>                 exit(EXIT_FAILURE);
>         }
> }
> void set_one_young(char *buf)
> {
>         for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>                 volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>         }
> }
>
> void test_contpte_perf() {
>         char *buf;
>         int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
>         if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
>                 perror("posix_memalign failed");
>                 exit(EXIT_FAILURE);
>         }
>
>         rwdata(buf);
> #if TEST_CASE2 || TEST_CASE3
>         clear_young_dirty(buf);
> #endif
> #if TEST_CASE2
>         set_one_young(buf);
> #endif
>
>         for (int j = 0; j < 500; j++) {
>                 mlock(buf, TEST_SIZE);
>
>                 munlock(buf, TEST_SIZE);
>         }
>         free(buf);
> }
> ---
>
> Descriptions of three test scenarios
>
> Scenario 1
> The data of all 16 PTEs are both dirty and young.
> #define TEST_CASE2 0
> #define TEST_CASE3 0
>
> Scenario 2
> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> #define TEST_CASE2 1
> #define TEST_CASE3 0
>
> Scenario 3
> Among the 16 PTEs, there are neither young nor dirty ones.
> #define TEST_CASE2 0
> #define TEST_CASE3 1
>
>
> Test results
>
> |Scenario 1         |       Original|  Modification 1|  Modification 2|
> |-------------------|---------------|----------------|----------------|
> |instructions       |    37912436160|     18303833386|     18731580031|
> |test time          |         4.2797|          2.2687|          2.2949|
> |overhead of        |               |                |                |
> |contpte_ptep_get() |         21.31%|           4.72%|           4.80%|
>
> |Scenario 2         |       Original|  Modification 1|  Modification 2|
> |-------------------|---------------|----------------|----------------|
> |instructions       |    36701270862|     38729716276|     36115790086|
> |test time          |         3.2335|          3.5732|          3.0874|
> |Overhead of        |               |                |                |
> |contpte_ptep_get() |         32.26%|          41.35%|          33.57%|
>
> |Scenario 3         |       Original|  Modification 1|  Modification 2|
> |-------------------|---------------|----------------|----------------|
> |instructions       |    36706279735|     38305241759|     36750881878|
> |test time          |         3.2008|          3.5389|          3.1249|
> |Overhead of        |               |                |                |
> |contpte_ptep_get() |         31.94%|          41.30%|          34.59%|
>
>
> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> benefit of 50.59% and a time benefit of 46.38%.
>
> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
> the instructions and time due to additional conditional checks.
>
> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> the branches taken by Modification 1 and Modification 2 should be the same as
> those of the original code. In fact, the test results of Modification 2 seem
> to be closer to those of the original code. I don't know why there is a
> performance regression in Modification 1.
>
> Therefore, I believe modifying the code according to Modification 2 can bring
> maximum benefits. Everyone can discuss whether this approach is acceptable,
> and if so, I will send Patch V4 to proceed with submitting this modification.
>

modification 2 is not correct. if pte0~pte14 are all young and no one
is dirty, we are
having lots of useless "for (; i < CONT_PTES; i++, ptep++)"

                 if (pte_young(pte)) {
                         orig_pte = pte_mkyoung(orig_pte);
                         i++;
                         ptep++;
                         for (; i < CONT_PTES; i++, ptep++) {
                                 pte = __ptep_get(ptep);
                                 if (pte_dirty(pte)) {
                                         orig_pte = pte_mkdirty(orig_pte);
                                         break;
                                 }
                         }
                         break;
                 }

Thanks,
Xavier


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-30 23:17                     ` Barry Song
@ 2025-05-01 12:39                       ` Xavier
  2025-05-01 21:19                         ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-05-01 12:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy



Hi Barry,


At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
>On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
>>
>>
>> Hi all,
>>
>>
>> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>> >On 15/04/2025 09:22, Xavier wrote:
>> >> This commit optimizes the contpte_ptep_get function by adding early
>> >>  termination logic. It checks if the dirty and young bits of orig_pte
>> >>  are already set and skips redundant bit-setting operations during
>> >>  the loop. This reduces unnecessary iterations and improves performance.
>> >>
>> >> Signed-off-by: Xavier <xavier_qy@163.com>
>> >> ---
>> >>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> >>  1 file changed, 18 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> >> index bcac4f55f9c1..0acfee604947 100644
>> >> --- a/arch/arm64/mm/contpte.c
>> >> +++ b/arch/arm64/mm/contpte.c
>> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> >>  }
>> >>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>> >>
>> >> +/* Note: in order to improve efficiency, using this macro will modify the
>> >> + * passed-in parameters.*/
>> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> >> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> >> +            if (pte_##flag(__ptep_get(ptep))) { \
>> >> +                            orig_pte = pte_mk##flag(orig_pte); \
>> >> +                            break; \
>> >> +            } \
>> >> +    }
>> >
>> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>> >personally prefer to see the 2 extra loops open coded below.
>> >
>> >Or even better, could you provide results comparing this 3 loop version to the
>> >simpler approach I suggested previously? If the performance is similar (which I
>> >expect it will be, especially given Barry's point that your test always ensures
>> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
>> >
>>
>> Based on the discussions in the previous email, two modifications were adopted
>> and tested, and the results are as follows:
>>
>> Modification 1
>>
>> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> {
>>         pte_t pte;
>>         int i;
>>
>>         ptep = contpte_align_down(ptep);
>>
>>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>>                 pte = __ptep_get(ptep);
>>
>>                 if (pte_dirty(pte)) {
>>                         orig_pte = pte_mkdirty(orig_pte);
>>                         if (pte_young(orig_pte))
>>                                 break;
>>                 }
>>
>>                 if (pte_young(pte)) {
>>                         orig_pte = pte_mkyoung(orig_pte);
>>                         if (pte_dirty(orig_pte))
>>                                 break;
>>                 }
>>         }
>>
>>         return orig_pte;
>> }
>>
>> Modification 2
>>
>> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> {
>>         pte_t pte;
>>         int i;
>>
>>         ptep = contpte_align_down(ptep);
>>
>>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>>                 pte = __ptep_get(ptep);
>>
>>                 if (pte_dirty(pte)) {
>>                         orig_pte = pte_mkdirty(orig_pte);
>>                         for (; i < CONT_PTES; i++, ptep++) {
>>                                 pte = __ptep_get(ptep);
>>                                 if (pte_young(pte)) {
>>                                         orig_pte = pte_mkyoung(orig_pte);
>>                                         break;
>>                                 }
>>                         }
>>                         break;
>>                 }
>>
>>                 if (pte_young(pte)) {
>>                         orig_pte = pte_mkyoung(orig_pte);
>>                         i++;
>>                         ptep++;
>>                         for (; i < CONT_PTES; i++, ptep++) {
>>                                 pte = __ptep_get(ptep);
>>                                 if (pte_dirty(pte)) {
>>                                         orig_pte = pte_mkdirty(orig_pte);
>>                                         break;
>>                                 }
>>                         }
>>                         break;
>>                 }
>>         }
>>
>>         return orig_pte;
>> }
>>
>> Test Code:
>>
>> #define PAGE_SIZE 4096
>> #define CONT_PTES 16
>> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>> #define YOUNG_BIT 8
>> void rwdata(char *buf)
>> {
>>         for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>>                 buf[i] = 'a';
>>                 volatile char c = buf[i];
>>         }
>> }
>> void clear_young_dirty(char *buf)
>> {
>>         if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>>                 perror("madvise free failed");
>>                 free(buf);
>>                 exit(EXIT_FAILURE);
>>         }
>>         if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>>                 perror("madvise free failed");
>>                 free(buf);
>>                 exit(EXIT_FAILURE);
>>         }
>> }
>> void set_one_young(char *buf)
>> {
>>         for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>>                 volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>>         }
>> }
>>
>> void test_contpte_perf() {
>>         char *buf;
>>         int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
>>         if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
>>                 perror("posix_memalign failed");
>>                 exit(EXIT_FAILURE);
>>         }
>>
>>         rwdata(buf);
>> #if TEST_CASE2 || TEST_CASE3
>>         clear_young_dirty(buf);
>> #endif
>> #if TEST_CASE2
>>         set_one_young(buf);
>> #endif
>>
>>         for (int j = 0; j < 500; j++) {
>>                 mlock(buf, TEST_SIZE);
>>
>>                 munlock(buf, TEST_SIZE);
>>         }
>>         free(buf);
>> }
>> ---
>>
>> Descriptions of three test scenarios
>>
>> Scenario 1
>> The data of all 16 PTEs are both dirty and young.
>> #define TEST_CASE2 0
>> #define TEST_CASE3 0
>>
>> Scenario 2
>> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
>> #define TEST_CASE2 1
>> #define TEST_CASE3 0
>>
>> Scenario 3
>> Among the 16 PTEs, there are neither young nor dirty ones.
>> #define TEST_CASE2 0
>> #define TEST_CASE3 1
>>
>>
>> Test results
>>
>> |Scenario 1         |       Original|  Modification 1|  Modification 2|
>> |-------------------|---------------|----------------|----------------|
>> |instructions       |    37912436160|     18303833386|     18731580031|
>> |test time          |         4.2797|          2.2687|          2.2949|
>> |overhead of        |               |                |                |
>> |contpte_ptep_get() |         21.31%|           4.72%|           4.80%|
>>
>> |Scenario 2         |       Original|  Modification 1|  Modification 2|
>> |-------------------|---------------|----------------|----------------|
>> |instructions       |    36701270862|     38729716276|     36115790086|
>> |test time          |         3.2335|          3.5732|          3.0874|
>> |Overhead of        |               |                |                |
>> |contpte_ptep_get() |         32.26%|          41.35%|          33.57%|
>>
>> |Scenario 3         |       Original|  Modification 1|  Modification 2|
>> |-------------------|---------------|----------------|----------------|
>> |instructions       |    36706279735|     38305241759|     36750881878|
>> |test time          |         3.2008|          3.5389|          3.1249|
>> |Overhead of        |               |                |                |
>> |contpte_ptep_get() |         31.94%|          41.30%|          34.59%|
>>
>>
>> For Scenario 1, Modification 1 can achieve an instruction count benefit of
>> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
>> benefit of 50.59% and a time benefit of 46.38%.
>>
>> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
>> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
>> the instructions and time due to additional conditional checks.
>>
>> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
>> the branches taken by Modification 1 and Modification 2 should be the same as
>> those of the original code. In fact, the test results of Modification 2 seem
>> to be closer to those of the original code. I don't know why there is a
>> performance regression in Modification 1.
>>
>> Therefore, I believe modifying the code according to Modification 2 can bring
>> maximum benefits. Everyone can discuss whether this approach is acceptable,
>> and if so, I will send Patch V4 to proceed with submitting this modification.
>>
>
>modification 2 is not correct. if pte0~pte14 are all young and no one
>is dirty, we are
>having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
>
>                 if (pte_young(pte)) {
>                         orig_pte = pte_mkyoung(orig_pte);
>                         i++;
>                         ptep++;
>                         for (; i < CONT_PTES; i++, ptep++) {
>                                 pte = __ptep_get(ptep);
>                                 if (pte_dirty(pte)) {
>                                         orig_pte = pte_mkdirty(orig_pte);
>                                         break;
>                                 }
>                         }
>                         break;
>                 }
>

I didn't understand which part you referred to when you said there were a lot of
useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
young and no one is dirty", Modification 2 will enter the following branch when
judging pte0:

if (pte_young(pte)) {
	orig_pte = pte_mkyoung(orig_pte);
	// The dirty status of pte0 has already been checked, skip it.
	i++;
	ptep++;
	// Then we only need to check whether pte1~pte15 are dirty.
	for (; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);
		if (pte_dirty(pte)) {
			// Exit as soon as a dirty entry is found.
			orig_pte = pte_mkdirty(orig_pte);
			break;
		}
	}
	// Exit directly here without going through the outer loop again.
	break;
}

In this scenario, the total number of judgments in Modification 2 is nearly half less
than that of the original code. I should have understood it correctly, right?


--

Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-05-01 12:39                       ` Xavier
@ 2025-05-01 21:19                         ` Barry Song
  2025-05-01 21:32                           ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2025-05-01 21:19 UTC (permalink / raw)
  To: Xavier
  Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy

On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
>
>
>
> Hi Barry,
>
>
> At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
> >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
> >>
> >>
> >> Hi all,
> >>
> >>
> >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> >> >On 15/04/2025 09:22, Xavier wrote:
> >> >> This commit optimizes the contpte_ptep_get function by adding early
> >> >>  termination logic. It checks if the dirty and young bits of orig_pte
> >> >>  are already set and skips redundant bit-setting operations during
> >> >>  the loop. This reduces unnecessary iterations and improves performance.
> >> >>
> >> >> Signed-off-by: Xavier <xavier_qy@163.com>
> >> >> ---
> >> >>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> >> >>  1 file changed, 18 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >> >> index bcac4f55f9c1..0acfee604947 100644
> >> >> --- a/arch/arm64/mm/contpte.c
> >> >> +++ b/arch/arm64/mm/contpte.c
> >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> >> >>  }
> >> >>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> >> >>
> >> >> +/* Note: in order to improve efficiency, using this macro will modify the
> >> >> + * passed-in parameters.*/
> >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> >> >> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> >> >> +            if (pte_##flag(__ptep_get(ptep))) { \
> >> >> +                            orig_pte = pte_mk##flag(orig_pte); \
> >> >> +                            break; \
> >> >> +            } \
> >> >> +    }
> >> >
> >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> >> >personally prefer to see the 2 extra loops open coded below.
> >> >
> >> >Or even better, could you provide results comparing this 3 loop version to the
> >> >simpler approach I suggested previously? If the performance is similar (which I
> >> >expect it will be, especially given Barry's point that your test always ensures
> >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> >> >
> >>
> >> Based on the discussions in the previous email, two modifications were adopted
> >> and tested, and the results are as follows:
> >>
> >> Modification 1
> >>
> >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> {
> >>         pte_t pte;
> >>         int i;
> >>
> >>         ptep = contpte_align_down(ptep);
> >>
> >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
> >>                 pte = __ptep_get(ptep);
> >>
> >>                 if (pte_dirty(pte)) {
> >>                         orig_pte = pte_mkdirty(orig_pte);
> >>                         if (pte_young(orig_pte))
> >>                                 break;
> >>                 }
> >>
> >>                 if (pte_young(pte)) {
> >>                         orig_pte = pte_mkyoung(orig_pte);
> >>                         if (pte_dirty(orig_pte))
> >>                                 break;
> >>                 }
> >>         }
> >>
> >>         return orig_pte;
> >> }
> >>
> >> Modification 2
> >>
> >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> {
> >>         pte_t pte;
> >>         int i;
> >>
> >>         ptep = contpte_align_down(ptep);
> >>
> >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
> >>                 pte = __ptep_get(ptep);
> >>
> >>                 if (pte_dirty(pte)) {
> >>                         orig_pte = pte_mkdirty(orig_pte);
> >>                         for (; i < CONT_PTES; i++, ptep++) {
> >>                                 pte = __ptep_get(ptep);
> >>                                 if (pte_young(pte)) {
> >>                                         orig_pte = pte_mkyoung(orig_pte);
> >>                                         break;
> >>                                 }
> >>                         }
> >>                         break;
> >>                 }
> >>
> >>                 if (pte_young(pte)) {
> >>                         orig_pte = pte_mkyoung(orig_pte);
> >>                         i++;
> >>                         ptep++;
> >>                         for (; i < CONT_PTES; i++, ptep++) {
> >>                                 pte = __ptep_get(ptep);
> >>                                 if (pte_dirty(pte)) {
> >>                                         orig_pte = pte_mkdirty(orig_pte);
> >>                                         break;
> >>                                 }
> >>                         }
> >>                         break;
> >>                 }
> >>         }
> >>
> >>         return orig_pte;
> >> }
> >>
> >> Test Code:
> >>
> >> #define PAGE_SIZE 4096
> >> #define CONT_PTES 16
> >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> >> #define YOUNG_BIT 8
> >> void rwdata(char *buf)
> >> {
> >>         for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> >>                 buf[i] = 'a';
> >>                 volatile char c = buf[i];
> >>         }
> >> }
> >> void clear_young_dirty(char *buf)
> >> {
> >>         if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> >>                 perror("madvise free failed");
> >>                 free(buf);
> >>                 exit(EXIT_FAILURE);
> >>         }
> >>         if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> >>                 perror("madvise free failed");
> >>                 free(buf);
> >>                 exit(EXIT_FAILURE);
> >>         }
> >> }
> >> void set_one_young(char *buf)
> >> {
> >>         for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> >>                 volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> >>         }
> >> }
> >>
> >> void test_contpte_perf() {
> >>         char *buf;
> >>         int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
> >>         if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
> >>                 perror("posix_memalign failed");
> >>                 exit(EXIT_FAILURE);
> >>         }
> >>
> >>         rwdata(buf);
> >> #if TEST_CASE2 || TEST_CASE3
> >>         clear_young_dirty(buf);
> >> #endif
> >> #if TEST_CASE2
> >>         set_one_young(buf);
> >> #endif
> >>
> >>         for (int j = 0; j < 500; j++) {
> >>                 mlock(buf, TEST_SIZE);
> >>
> >>                 munlock(buf, TEST_SIZE);
> >>         }
> >>         free(buf);
> >> }
> >> ---
> >>
> >> Descriptions of three test scenarios
> >>
> >> Scenario 1
> >> The data of all 16 PTEs are both dirty and young.
> >> #define TEST_CASE2 0
> >> #define TEST_CASE3 0
> >>
> >> Scenario 2
> >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> >> #define TEST_CASE2 1
> >> #define TEST_CASE3 0
> >>
> >> Scenario 3
> >> Among the 16 PTEs, there are neither young nor dirty ones.
> >> #define TEST_CASE2 0
> >> #define TEST_CASE3 1
> >>
> >>
> >> Test results
> >>
> >> |Scenario 1         |       Original|  Modification 1|  Modification 2|
> >> |-------------------|---------------|----------------|----------------|
> >> |instructions       |    37912436160|     18303833386|     18731580031|
> >> |test time          |         4.2797|          2.2687|          2.2949|
> >> |overhead of        |               |                |                |
> >> |contpte_ptep_get() |         21.31%|           4.72%|           4.80%|
> >>
> >> |Scenario 2         |       Original|  Modification 1|  Modification 2|
> >> |-------------------|---------------|----------------|----------------|
> >> |instructions       |    36701270862|     38729716276|     36115790086|
> >> |test time          |         3.2335|          3.5732|          3.0874|
> >> |Overhead of        |               |                |                |
> >> |contpte_ptep_get() |         32.26%|          41.35%|          33.57%|
> >>
> >> |Scenario 3         |       Original|  Modification 1|  Modification 2|
> >> |-------------------|---------------|----------------|----------------|
> >> |instructions       |    36706279735|     38305241759|     36750881878|
> >> |test time          |         3.2008|          3.5389|          3.1249|
> >> |Overhead of        |               |                |                |
> >> |contpte_ptep_get() |         31.94%|          41.30%|          34.59%|
> >>
> >>
> >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> >> benefit of 50.59% and a time benefit of 46.38%.
> >>
> >> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
> >> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
> >> the instructions and time due to additional conditional checks.
> >>
> >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> >> the branches taken by Modification 1 and Modification 2 should be the same as
> >> those of the original code. In fact, the test results of Modification 2 seem
> >> to be closer to those of the original code. I don't know why there is a
> >> performance regression in Modification 1.
> >>
> >> Therefore, I believe modifying the code according to Modification 2 can bring
> >> maximum benefits. Everyone can discuss whether this approach is acceptable,
> >> and if so, I will send Patch V4 to proceed with submitting this modification.
> >>
> >
> >modification 2 is not correct. if pte0~pte14 are all young and no one
> >is dirty, we are
> >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
> >
> >                 if (pte_young(pte)) {
> >                         orig_pte = pte_mkyoung(orig_pte);
> >                         i++;
> >                         ptep++;
> >                         for (; i < CONT_PTES; i++, ptep++) {
> >                                 pte = __ptep_get(ptep);
> >                                 if (pte_dirty(pte)) {
> >                                         orig_pte = pte_mkdirty(orig_pte);
> >                                         break;
> >                                 }
> >                         }
> >                         break;
> >                 }
> >
>
> I didn't understand which part you referred to when you said there were a lot of
> useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
> young and no one is dirty", Modification 2 will enter the following branch when
> judging pte0:
>
> if (pte_young(pte)) {
>         orig_pte = pte_mkyoung(orig_pte);
>         // The dirty status of pte0 has already been checked, skip it.
>         i++;
>         ptep++;
>         // Then we only need to check whether pte1~pte15 are dirty.
>         for (; i < CONT_PTES; i++, ptep++) {
>                 pte = __ptep_get(ptep);
>                 if (pte_dirty(pte)) {
>                         // Exit as soon as a dirty entry is found.
>                         orig_pte = pte_mkdirty(orig_pte);
>                         break;
>                 }
>         }
>         // Exit directly here without going through the outer loop again.
>         break;
> }
>
> In this scenario, the total number of judgments in Modification 2 is nearly half less
> than that of the original code. I should have understood it correctly, right?

You're right—I missed the part where you're also taking a break, even though
no one is dirty. Based on your data, modification 2 seems to be good.

But I don't quite understand why you are doing
         i++;
         ptep++;
before for (; i < CONT_PTES; i++, ptep++).

it seems to be wrong. if i==15, you will get i=16. you are skipping
the pte_dirty
check for i==15. it is also true for any value between 0 and 15. My
point is that
you should drop it and re-test.

>
>
> --
>
> Thanks,
> Xavier

Thanks
barry


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-05-01 21:19                         ` Barry Song
@ 2025-05-01 21:32                           ` Barry Song
  2025-05-04  2:39                             ` Xavier
  2025-05-08  7:03                             ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
  0 siblings, 2 replies; 49+ messages in thread
From: Barry Song @ 2025-05-01 21:32 UTC (permalink / raw)
  To: Xavier
  Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy

On Fri, May 2, 2025 at 9:19 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
> >
> >
> >
> > Hi Barry,
> >
> >
> > At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
> > >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
> > >>
> > >>
> > >> Hi all,
> > >>
> > >>
> > >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> > >> >On 15/04/2025 09:22, Xavier wrote:
> > >> >> This commit optimizes the contpte_ptep_get function by adding early
> > >> >>  termination logic. It checks if the dirty and young bits of orig_pte
> > >> >>  are already set and skips redundant bit-setting operations during
> > >> >>  the loop. This reduces unnecessary iterations and improves performance.
> > >> >>
> > >> >> Signed-off-by: Xavier <xavier_qy@163.com>
> > >> >> ---
> > >> >>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> > >> >>  1 file changed, 18 insertions(+), 2 deletions(-)
> > >> >>
> > >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > >> >> index bcac4f55f9c1..0acfee604947 100644
> > >> >> --- a/arch/arm64/mm/contpte.c
> > >> >> +++ b/arch/arm64/mm/contpte.c
> > >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> > >> >>  }
> > >> >>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> > >> >>
> > >> >> +/* Note: in order to improve efficiency, using this macro will modify the
> > >> >> + * passed-in parameters.*/
> > >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> > >> >> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> > >> >> +            if (pte_##flag(__ptep_get(ptep))) { \
> > >> >> +                            orig_pte = pte_mk##flag(orig_pte); \
> > >> >> +                            break; \
> > >> >> +            } \
> > >> >> +    }
> > >> >
> > >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> > >> >personally prefer to see the 2 extra loops open coded below.
> > >> >
> > >> >Or even better, could you provide results comparing this 3 loop version to the
> > >> >simpler approach I suggested previously? If the performance is similar (which I
> > >> >expect it will be, especially given Barry's point that your test always ensures
> > >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> > >> >
> > >>
> > >> Based on the discussions in the previous email, two modifications were adopted
> > >> and tested, and the results are as follows:
> > >>
> > >> Modification 1
> > >>
> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> > >> {
> > >>         pte_t pte;
> > >>         int i;
> > >>
> > >>         ptep = contpte_align_down(ptep);
> > >>
> > >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
> > >>                 pte = __ptep_get(ptep);
> > >>
> > >>                 if (pte_dirty(pte)) {
> > >>                         orig_pte = pte_mkdirty(orig_pte);
> > >>                         if (pte_young(orig_pte))
> > >>                                 break;
> > >>                 }
> > >>
> > >>                 if (pte_young(pte)) {
> > >>                         orig_pte = pte_mkyoung(orig_pte);
> > >>                         if (pte_dirty(orig_pte))
> > >>                                 break;
> > >>                 }
> > >>         }
> > >>
> > >>         return orig_pte;
> > >> }
> > >>
> > >> Modification 2
> > >>
> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> > >> {
> > >>         pte_t pte;
> > >>         int i;
> > >>
> > >>         ptep = contpte_align_down(ptep);
> > >>
> > >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
> > >>                 pte = __ptep_get(ptep);
> > >>
> > >>                 if (pte_dirty(pte)) {
> > >>                         orig_pte = pte_mkdirty(orig_pte);
> > >>                         for (; i < CONT_PTES; i++, ptep++) {
> > >>                                 pte = __ptep_get(ptep);
> > >>                                 if (pte_young(pte)) {
> > >>                                         orig_pte = pte_mkyoung(orig_pte);
> > >>                                         break;
> > >>                                 }
> > >>                         }
> > >>                         break;
> > >>                 }
> > >>
> > >>                 if (pte_young(pte)) {
> > >>                         orig_pte = pte_mkyoung(orig_pte);
> > >>                         i++;
> > >>                         ptep++;
> > >>                         for (; i < CONT_PTES; i++, ptep++) {
> > >>                                 pte = __ptep_get(ptep);
> > >>                                 if (pte_dirty(pte)) {
> > >>                                         orig_pte = pte_mkdirty(orig_pte);
> > >>                                         break;
> > >>                                 }
> > >>                         }
> > >>                         break;
> > >>                 }
> > >>         }
> > >>
> > >>         return orig_pte;
> > >> }
> > >>
> > >> Test Code:
> > >>
> > >> #define PAGE_SIZE 4096
> > >> #define CONT_PTES 16
> > >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> > >> #define YOUNG_BIT 8
> > >> void rwdata(char *buf)
> > >> {
> > >>         for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> > >>                 buf[i] = 'a';
> > >>                 volatile char c = buf[i];
> > >>         }
> > >> }
> > >> void clear_young_dirty(char *buf)
> > >> {
> > >>         if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> > >>                 perror("madvise free failed");
> > >>                 free(buf);
> > >>                 exit(EXIT_FAILURE);
> > >>         }
> > >>         if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> > >>                 perror("madvise free failed");
> > >>                 free(buf);
> > >>                 exit(EXIT_FAILURE);
> > >>         }
> > >> }
> > >> void set_one_young(char *buf)
> > >> {
> > >>         for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> > >>                 volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> > >>         }
> > >> }
> > >>
> > >> void test_contpte_perf() {
> > >>         char *buf;
> > >>         int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
> > >>         if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
> > >>                 perror("posix_memalign failed");
> > >>                 exit(EXIT_FAILURE);
> > >>         }
> > >>
> > >>         rwdata(buf);
> > >> #if TEST_CASE2 || TEST_CASE3
> > >>         clear_young_dirty(buf);
> > >> #endif
> > >> #if TEST_CASE2
> > >>         set_one_young(buf);
> > >> #endif
> > >>
> > >>         for (int j = 0; j < 500; j++) {
> > >>                 mlock(buf, TEST_SIZE);
> > >>
> > >>                 munlock(buf, TEST_SIZE);
> > >>         }
> > >>         free(buf);
> > >> }
> > >> ---
> > >>
> > >> Descriptions of three test scenarios
> > >>
> > >> Scenario 1
> > >> The data of all 16 PTEs are both dirty and young.
> > >> #define TEST_CASE2 0
> > >> #define TEST_CASE3 0
> > >>
> > >> Scenario 2
> > >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> > >> #define TEST_CASE2 1
> > >> #define TEST_CASE3 0
> > >>
> > >> Scenario 3
> > >> Among the 16 PTEs, there are neither young nor dirty ones.
> > >> #define TEST_CASE2 0
> > >> #define TEST_CASE3 1
> > >>
> > >>
> > >> Test results
> > >>
> > >> |Scenario 1         |       Original|  Modification 1|  Modification 2|
> > >> |-------------------|---------------|----------------|----------------|
> > >> |instructions       |    37912436160|     18303833386|     18731580031|
> > >> |test time          |         4.2797|          2.2687|          2.2949|
> > >> |overhead of        |               |                |                |
> > >> |contpte_ptep_get() |         21.31%|           4.72%|           4.80%|
> > >>
> > >> |Scenario 2         |       Original|  Modification 1|  Modification 2|
> > >> |-------------------|---------------|----------------|----------------|
> > >> |instructions       |    36701270862|     38729716276|     36115790086|
> > >> |test time          |         3.2335|          3.5732|          3.0874|
> > >> |Overhead of        |               |                |                |
> > >> |contpte_ptep_get() |         32.26%|          41.35%|          33.57%|
> > >>
> > >> |Scenario 3         |       Original|  Modification 1|  Modification 2|
> > >> |-------------------|---------------|----------------|----------------|
> > >> |instructions       |    36706279735|     38305241759|     36750881878|
> > >> |test time          |         3.2008|          3.5389|          3.1249|
> > >> |Overhead of        |               |                |                |
> > >> |contpte_ptep_get() |         31.94%|          41.30%|          34.59%|
> > >>
> > >>
> > >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> > >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> > >> benefit of 50.59% and a time benefit of 46.38%.
> > >>
> > >> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
> > >> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
> > >> the instructions and time due to additional conditional checks.
> > >>
> > >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> > >> the branches taken by Modification 1 and Modification 2 should be the same as
> > >> those of the original code. In fact, the test results of Modification 2 seem
> > >> to be closer to those of the original code. I don't know why there is a
> > >> performance regression in Modification 1.
> > >>
> > >> Therefore, I believe modifying the code according to Modification 2 can bring
> > >> maximum benefits. Everyone can discuss whether this approach is acceptable,
> > >> and if so, I will send Patch V4 to proceed with submitting this modification.
> > >>
> > >
> > >modification 2 is not correct. if pte0~pte14 are all young and no one
> > >is dirty, we are
> > >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
> > >
> > >                 if (pte_young(pte)) {
> > >                         orig_pte = pte_mkyoung(orig_pte);
> > >                         i++;
> > >                         ptep++;
> > >                         for (; i < CONT_PTES; i++, ptep++) {
> > >                                 pte = __ptep_get(ptep);
> > >                                 if (pte_dirty(pte)) {
> > >                                         orig_pte = pte_mkdirty(orig_pte);
> > >                                         break;
> > >                                 }
> > >                         }
> > >                         break;
> > >                 }
> > >
> >
> > I didn't understand which part you referred to when you said there were a lot of
> > useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
> > young and no one is dirty", Modification 2 will enter the following branch when
> > judging pte0:
> >
> > if (pte_young(pte)) {
> >         orig_pte = pte_mkyoung(orig_pte);
> >         // The dirty status of pte0 has already been checked, skip it.
> >         i++;
> >         ptep++;
> >         // Then we only need to check whether pte1~pte15 are dirty.
> >         for (; i < CONT_PTES; i++, ptep++) {
> >                 pte = __ptep_get(ptep);
> >                 if (pte_dirty(pte)) {
> >                         // Exit as soon as a dirty entry is found.
> >                         orig_pte = pte_mkdirty(orig_pte);
> >                         break;
> >                 }
> >         }
> >         // Exit directly here without going through the outer loop again.
> >         break;
> > }
> >
> > In this scenario, the total number of judgments in Modification 2 is nearly half less
> > than that of the original code. I should have understood it correctly, right?
>
> You're right—I missed the part where you're also taking a break, even though
> no one is dirty. Based on your data, modification 2 seems to be good.
>
> But I don't quite understand why you are doing
>          i++;
>          ptep++;
> before for (; i < CONT_PTES; i++, ptep++).
>
> it seems to be wrong. if i==15, you will get i=16. you are skipping
> the pte_dirty
> check for i==15. it is also true for any value between 0 and 15. My
> point is that
> you should drop it and re-test.

Sorry, I missed that you already checked pte_dirty(pte) before calling
pte_young(). So please ignore my comment. The code's a bit hard to
follow now. :-)

>
> >
> >
> > --
> >
> > Thanks,
> > Xavier
>

Thanks
barry


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-05-01 21:32                           ` Barry Song
@ 2025-05-04  2:39                             ` Xavier
  2025-05-08  1:29                               ` Barry Song
  2025-05-08  7:03                             ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
  1 sibling, 1 reply; 49+ messages in thread
From: Xavier @ 2025-05-04  2:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy










At 2025-05-02 05:32:50, "Barry Song" <21cnbao@gmail.com> wrote:
>On Fri, May 2, 2025 at 9:19 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
>> >
>> >
>> >
>> > Hi Barry,
>> >
>> >
>> > At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
>> > >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
>> > >>
>> > >>
>> > >> Hi all,
>> > >>
>> > >>
>> > >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>> > >> >On 15/04/2025 09:22, Xavier wrote:
>> > >> >> This commit optimizes the contpte_ptep_get function by adding early
>> > >> >>  termination logic. It checks if the dirty and young bits of orig_pte
>> > >> >>  are already set and skips redundant bit-setting operations during
>> > >> >>  the loop. This reduces unnecessary iterations and improves performance.
>> > >> >>
>> > >> >> Signed-off-by: Xavier <xavier_qy@163.com>
>> > >> >> ---
>> > >> >>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> > >> >>  1 file changed, 18 insertions(+), 2 deletions(-)
>> > >> >>
>> > >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> > >> >> index bcac4f55f9c1..0acfee604947 100644
>> > >> >> --- a/arch/arm64/mm/contpte.c
>> > >> >> +++ b/arch/arm64/mm/contpte.c
>> > >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> > >> >>  }
>> > >> >>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>> > >> >>
>> > >> >> +/* Note: in order to improve efficiency, using this macro will modify the
>> > >> >> + * passed-in parameters.*/
>> > >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> > >> >> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> > >> >> +            if (pte_##flag(__ptep_get(ptep))) { \
>> > >> >> +                            orig_pte = pte_mk##flag(orig_pte); \
>> > >> >> +                            break; \
>> > >> >> +            } \
>> > >> >> +    }
>> > >> >
>> > >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>> > >> >personally prefer to see the 2 extra loops open coded below.
>> > >> >
>> > >> >Or even better, could you provide results comparing this 3 loop version to the
>> > >> >simpler approach I suggested previously? If the performance is similar (which I
>> > >> >expect it will be, especially given Barry's point that your test always ensures
>> > >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
>> > >> >
>> > >>
>> > >> Based on the discussions in the previous email, two modifications were adopted
>> > >> and tested, and the results are as follows:
>> > >>
>> > >> Modification 1
>> > >>
>> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> > >> {
>> > >>         pte_t pte;
>> > >>         int i;
>> > >>
>> > >>         ptep = contpte_align_down(ptep);
>> > >>
>> > >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>> > >>                 pte = __ptep_get(ptep);
>> > >>
>> > >>                 if (pte_dirty(pte)) {
>> > >>                         orig_pte = pte_mkdirty(orig_pte);
>> > >>                         if (pte_young(orig_pte))
>> > >>                                 break;
>> > >>                 }
>> > >>
>> > >>                 if (pte_young(pte)) {
>> > >>                         orig_pte = pte_mkyoung(orig_pte);
>> > >>                         if (pte_dirty(orig_pte))
>> > >>                                 break;
>> > >>                 }
>> > >>         }
>> > >>
>> > >>         return orig_pte;
>> > >> }
>> > >>
>> > >> Modification 2
>> > >>
>> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> > >> {
>> > >>         pte_t pte;
>> > >>         int i;
>> > >>
>> > >>         ptep = contpte_align_down(ptep);
>> > >>
>> > >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>> > >>                 pte = __ptep_get(ptep);
>> > >>
>> > >>                 if (pte_dirty(pte)) {
>> > >>                         orig_pte = pte_mkdirty(orig_pte);
>> > >>                         for (; i < CONT_PTES; i++, ptep++) {
>> > >>                                 pte = __ptep_get(ptep);
>> > >>                                 if (pte_young(pte)) {
>> > >>                                         orig_pte = pte_mkyoung(orig_pte);
>> > >>                                         break;
>> > >>                                 }
>> > >>                         }
>> > >>                         break;
>> > >>                 }
>> > >>
>> > >>                 if (pte_young(pte)) {
>> > >>                         orig_pte = pte_mkyoung(orig_pte);
>> > >>                         i++;
>> > >>                         ptep++;
>> > >>                         for (; i < CONT_PTES; i++, ptep++) {
>> > >>                                 pte = __ptep_get(ptep);
>> > >>                                 if (pte_dirty(pte)) {
>> > >>                                         orig_pte = pte_mkdirty(orig_pte);
>> > >>                                         break;
>> > >>                                 }
>> > >>                         }
>> > >>                         break;
>> > >>                 }
>> > >>         }
>> > >>
>> > >>         return orig_pte;
>> > >> }
>> > >>
>> > >> Test Code:
>> > >>
>> > >> #define PAGE_SIZE 4096
>> > >> #define CONT_PTES 16
>> > >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>> > >> #define YOUNG_BIT 8
>> > >> void rwdata(char *buf)
>> > >> {
>> > >>         for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> > >>                 buf[i] = 'a';
>> > >>                 volatile char c = buf[i];
>> > >>         }
>> > >> }
>> > >> void clear_young_dirty(char *buf)
>> > >> {
>> > >>         if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>> > >>                 perror("madvise free failed");
>> > >>                 free(buf);
>> > >>                 exit(EXIT_FAILURE);
>> > >>         }
>> > >>         if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>> > >>                 perror("madvise free failed");
>> > >>                 free(buf);
>> > >>                 exit(EXIT_FAILURE);
>> > >>         }
>> > >> }
>> > >> void set_one_young(char *buf)
>> > >> {
>> > >>         for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>> > >>                 volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>> > >>         }
>> > >> }
>> > >>
>> > >> void test_contpte_perf() {
>> > >>         char *buf;
>> > >>         int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
>> > >>         if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
>> > >>                 perror("posix_memalign failed");
>> > >>                 exit(EXIT_FAILURE);
>> > >>         }
>> > >>
>> > >>         rwdata(buf);
>> > >> #if TEST_CASE2 || TEST_CASE3
>> > >>         clear_young_dirty(buf);
>> > >> #endif
>> > >> #if TEST_CASE2
>> > >>         set_one_young(buf);
>> > >> #endif
>> > >>
>> > >>         for (int j = 0; j < 500; j++) {
>> > >>                 mlock(buf, TEST_SIZE);
>> > >>
>> > >>                 munlock(buf, TEST_SIZE);
>> > >>         }
>> > >>         free(buf);
>> > >> }
>> > >> ---
>> > >>
>> > >> Descriptions of three test scenarios
>> > >>
>> > >> Scenario 1
>> > >> The data of all 16 PTEs are both dirty and young.
>> > >> #define TEST_CASE2 0
>> > >> #define TEST_CASE3 0
>> > >>
>> > >> Scenario 2
>> > >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
>> > >> #define TEST_CASE2 1
>> > >> #define TEST_CASE3 0
>> > >>
>> > >> Scenario 3
>> > >> Among the 16 PTEs, there are neither young nor dirty ones.
>> > >> #define TEST_CASE2 0
>> > >> #define TEST_CASE3 1
>> > >>
>> > >>
>> > >> Test results
>> > >>
>> > >> |Scenario 1         |       Original|  Modification 1|  Modification 2|
>> > >> |-------------------|---------------|----------------|----------------|
>> > >> |instructions       |    37912436160|     18303833386|     18731580031|
>> > >> |test time          |         4.2797|          2.2687|          2.2949|
>> > >> |overhead of        |               |                |                |
>> > >> |contpte_ptep_get() |         21.31%|           4.72%|           4.80%|
>> > >>
>> > >> |Scenario 2         |       Original|  Modification 1|  Modification 2|
>> > >> |-------------------|---------------|----------------|----------------|
>> > >> |instructions       |    36701270862|     38729716276|     36115790086|
>> > >> |test time          |         3.2335|          3.5732|          3.0874|
>> > >> |Overhead of        |               |                |                |
>> > >> |contpte_ptep_get() |         32.26%|          41.35%|          33.57%|
>> > >>
>> > >> |Scenario 3         |       Original|  Modification 1|  Modification 2|
>> > >> |-------------------|---------------|----------------|----------------|
>> > >> |instructions       |    36706279735|     38305241759|     36750881878|
>> > >> |test time          |         3.2008|          3.5389|          3.1249|
>> > >> |Overhead of        |               |                |                |
>> > >> |contpte_ptep_get() |         31.94%|          41.30%|          34.59%|
>> > >>
>> > >>
>> > >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
>> > >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
>> > >> benefit of 50.59% and a time benefit of 46.38%.
>> > >>
>> > >> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
>> > >> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
>> > >> the instructions and time due to additional conditional checks.
>> > >>
>> > >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
>> > >> the branches taken by Modification 1 and Modification 2 should be the same as
>> > >> those of the original code. In fact, the test results of Modification 2 seem
>> > >> to be closer to those of the original code. I don't know why there is a
>> > >> performance regression in Modification 1.
>> > >>
>> > >> Therefore, I believe modifying the code according to Modification 2 can bring
>> > >> maximum benefits. Everyone can discuss whether this approach is acceptable,
>> > >> and if so, I will send Patch V4 to proceed with submitting this modification.
>> > >>
>> > >
>> > >modification 2 is not correct. if pte0~pte14 are all young and no one
>> > >is dirty, we are
>> > >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
>> > >
>> > >                 if (pte_young(pte)) {
>> > >                         orig_pte = pte_mkyoung(orig_pte);
>> > >                         i++;
>> > >                         ptep++;
>> > >                         for (; i < CONT_PTES; i++, ptep++) {
>> > >                                 pte = __ptep_get(ptep);
>> > >                                 if (pte_dirty(pte)) {
>> > >                                         orig_pte = pte_mkdirty(orig_pte);
>> > >                                         break;
>> > >                                 }
>> > >                         }
>> > >                         break;
>> > >                 }
>> > >
>> >
>> > I didn't understand which part you referred to when you said there were a lot of
>> > useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
>> > young and no one is dirty", Modification 2 will enter the following branch when
>> > judging pte0:
>> >
>> > if (pte_young(pte)) {
>> >         orig_pte = pte_mkyoung(orig_pte);
>> >         // The dirty status of pte0 has already been checked, skip it.
>> >         i++;
>> >         ptep++;
>> >         // Then we only need to check whether pte1~pte15 are dirty.
>> >         for (; i < CONT_PTES; i++, ptep++) {
>> >                 pte = __ptep_get(ptep);
>> >                 if (pte_dirty(pte)) {
>> >                         // Exit as soon as a dirty entry is found.
>> >                         orig_pte = pte_mkdirty(orig_pte);
>> >                         break;
>> >                 }
>> >         }
>> >         // Exit directly here without going through the outer loop again.
>> >         break;
>> > }
>> >
>> > In this scenario, the total number of judgments in Modification 2 is nearly half less
>> > than that of the original code. I should have understood it correctly, right?
>>
>> You're right—I missed the part where you're also taking a break, even though
>> no one is dirty. Based on your data, modification 2 seems to be good.
>>
>> But I don't quite understand why you are doing
>>          i++;
>>          ptep++;
>> before for (; i < CONT_PTES; i++, ptep++).
>>
>> it seems to be wrong. if i==15, you will get i=16. you are skipping
>> the pte_dirty
>> check for i==15. it is also true for any value between 0 and 15. My
>> point is that
>> you should drop it and re-test.
>
>Sorry, I missed that you already checked pte_dirty(pte) before calling
>pte_young(). So please ignore my comment. The code's a bit hard to
>follow now. :-)
>

The modification 2 is indeed a bit difficult to understand, but this is also to improve
the execution efficiency. Thanks for your careful review. :-)

--
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-05-04  2:39                             ` Xavier
@ 2025-05-08  1:29                               ` Barry Song
  0 siblings, 0 replies; 49+ messages in thread
From: Barry Song @ 2025-05-08  1:29 UTC (permalink / raw)
  To: Xavier
  Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy

On Sun, May 4, 2025 at 2:56 PM Xavier <xavier_qy@163.com> wrote:

> At 2025-05-02 05:32:50, "Barry Song" <21cnbao@gmail.com> wrote:
> >On Fri, May 2, 2025 at 9:19 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
> >> >
> >> >
> >> >
> >> > Hi Barry,
> >> >
> >> >
> >> > At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
> >> > >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
> >> > >>
> >> > >>
> >> > >> Hi all,
> >> > >>
> >> > >>
> >> > >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> >> > >> >On 15/04/2025 09:22, Xavier wrote:
> >> > >> >> This commit optimizes the contpte_ptep_get function by adding early
> >> > >> >>  termination logic. It checks if the dirty and young bits of orig_pte
> >> > >> >>  are already set and skips redundant bit-setting operations during
> >> > >> >>  the loop. This reduces unnecessary iterations and improves performance.
> >> > >> >>
> >> > >> >> Signed-off-by: Xavier <xavier_qy@163.com>
> >> > >> >> ---
> >> > >> >>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> >> > >> >>  1 file changed, 18 insertions(+), 2 deletions(-)
> >> > >> >>
> >> > >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >> > >> >> index bcac4f55f9c1..0acfee604947 100644
> >> > >> >> --- a/arch/arm64/mm/contpte.c
> >> > >> >> +++ b/arch/arm64/mm/contpte.c
> >> > >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> >> > >> >>  }
> >> > >> >>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> >> > >> >>
> >> > >> >> +/* Note: in order to improve efficiency, using this macro will modify the
> >> > >> >> + * passed-in parameters.*/
> >> > >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> >> > >> >> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> >> > >> >> +            if (pte_##flag(__ptep_get(ptep))) { \
> >> > >> >> +                            orig_pte = pte_mk##flag(orig_pte); \
> >> > >> >> +                            break; \
> >> > >> >> +            } \
> >> > >> >> +    }
> >> > >> >
> >> > >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> >> > >> >personally prefer to see the 2 extra loops open coded below.
> >> > >> >
> >> > >> >Or even better, could you provide results comparing this 3 loop version to the
> >> > >> >simpler approach I suggested previously? If the performance is similar (which I
> >> > >> >expect it will be, especially given Barry's point that your test always ensures
> >> > >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> >> > >> >
> >> > >>
> >> > >> Based on the discussions in the previous email, two modifications were adopted
> >> > >> and tested, and the results are as follows:
> >> > >>
> >> > >> Modification 1
> >> > >>
> >> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> > >> {
> >> > >>         pte_t pte;
> >> > >>         int i;
> >> > >>
> >> > >>         ptep = contpte_align_down(ptep);
> >> > >>
> >> > >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
> >> > >>                 pte = __ptep_get(ptep);
> >> > >>
> >> > >>                 if (pte_dirty(pte)) {
> >> > >>                         orig_pte = pte_mkdirty(orig_pte);
> >> > >>                         if (pte_young(orig_pte))
> >> > >>                                 break;
> >> > >>                 }
> >> > >>
> >> > >>                 if (pte_young(pte)) {
> >> > >>                         orig_pte = pte_mkyoung(orig_pte);
> >> > >>                         if (pte_dirty(orig_pte))
> >> > >>                                 break;
> >> > >>                 }
> >> > >>         }
> >> > >>
> >> > >>         return orig_pte;
> >> > >> }
> >> > >>
> >> > >> Modification 2
> >> > >>
> >> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> > >> {
> >> > >>         pte_t pte;
> >> > >>         int i;
> >> > >>
> >> > >>         ptep = contpte_align_down(ptep);
> >> > >>
> >> > >>         for (i = 0; i < CONT_PTES; i++, ptep++) {
> >> > >>                 pte = __ptep_get(ptep);
> >> > >>
> >> > >>                 if (pte_dirty(pte)) {
> >> > >>                         orig_pte = pte_mkdirty(orig_pte);
> >> > >>                         for (; i < CONT_PTES; i++, ptep++) {
> >> > >>                                 pte = __ptep_get(ptep);
> >> > >>                                 if (pte_young(pte)) {
> >> > >>                                         orig_pte = pte_mkyoung(orig_pte);
> >> > >>                                         break;
> >> > >>                                 }
> >> > >>                         }
> >> > >>                         break;
> >> > >>                 }
> >> > >>
> >> > >>                 if (pte_young(pte)) {
> >> > >>                         orig_pte = pte_mkyoung(orig_pte);
> >> > >>                         i++;
> >> > >>                         ptep++;
> >> > >>                         for (; i < CONT_PTES; i++, ptep++) {
> >> > >>                                 pte = __ptep_get(ptep);
> >> > >>                                 if (pte_dirty(pte)) {
> >> > >>                                         orig_pte = pte_mkdirty(orig_pte);
> >> > >>                                         break;
> >> > >>                                 }
> >> > >>                         }
> >> > >>                         break;
> >> > >>                 }
> >> > >>         }
> >> > >>
> >> > >>         return orig_pte;
> >> > >> }
> >> > >>
> >> > >> Test Code:
> >> > >>
> >> > >> #define PAGE_SIZE 4096
> >> > >> #define CONT_PTES 16
> >> > >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> >> > >> #define YOUNG_BIT 8
> >> > >> void rwdata(char *buf)
> >> > >> {
> >> > >>         for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> >> > >>                 buf[i] = 'a';
> >> > >>                 volatile char c = buf[i];
> >> > >>         }
> >> > >> }
> >> > >> void clear_young_dirty(char *buf)
> >> > >> {
> >> > >>         if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> >> > >>                 perror("madvise free failed");
> >> > >>                 free(buf);
> >> > >>                 exit(EXIT_FAILURE);
> >> > >>         }
> >> > >>         if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> >> > >>                 perror("madvise free failed");
> >> > >>                 free(buf);
> >> > >>                 exit(EXIT_FAILURE);
> >> > >>         }
> >> > >> }
> >> > >> void set_one_young(char *buf)
> >> > >> {
> >> > >>         for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> >> > >>                 volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> >> > >>         }
> >> > >> }
> >> > >>
> >> > >> void test_contpte_perf() {
> >> > >>         char *buf;
> >> > >>         int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
> >> > >>         if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
> >> > >>                 perror("posix_memalign failed");
> >> > >>                 exit(EXIT_FAILURE);
> >> > >>         }
> >> > >>
> >> > >>         rwdata(buf);
> >> > >> #if TEST_CASE2 || TEST_CASE3
> >> > >>         clear_young_dirty(buf);
> >> > >> #endif
> >> > >> #if TEST_CASE2
> >> > >>         set_one_young(buf);
> >> > >> #endif
> >> > >>
> >> > >>         for (int j = 0; j < 500; j++) {
> >> > >>                 mlock(buf, TEST_SIZE);
> >> > >>
> >> > >>                 munlock(buf, TEST_SIZE);
> >> > >>         }
> >> > >>         free(buf);
> >> > >> }
> >> > >> ---
> >> > >>
> >> > >> Descriptions of three test scenarios
> >> > >>
> >> > >> Scenario 1
> >> > >> The data of all 16 PTEs are both dirty and young.
> >> > >> #define TEST_CASE2 0
> >> > >> #define TEST_CASE3 0
> >> > >>
> >> > >> Scenario 2
> >> > >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> >> > >> #define TEST_CASE2 1
> >> > >> #define TEST_CASE3 0
> >> > >>
> >> > >> Scenario 3
> >> > >> Among the 16 PTEs, there are neither young nor dirty ones.
> >> > >> #define TEST_CASE2 0
> >> > >> #define TEST_CASE3 1
> >> > >>
> >> > >>
> >> > >> Test results
> >> > >>
> >> > >> |Scenario 1         |       Original|  Modification 1|  Modification 2|
> >> > >> |-------------------|---------------|----------------|----------------|
> >> > >> |instructions       |    37912436160|     18303833386|     18731580031|
> >> > >> |test time          |         4.2797|          2.2687|          2.2949|
> >> > >> |overhead of        |               |                |                |
> >> > >> |contpte_ptep_get() |         21.31%|           4.72%|           4.80%|
> >> > >>
> >> > >> |Scenario 2         |       Original|  Modification 1|  Modification 2|
> >> > >> |-------------------|---------------|----------------|----------------|
> >> > >> |instructions       |    36701270862|     38729716276|     36115790086|
> >> > >> |test time          |         3.2335|          3.5732|          3.0874|
> >> > >> |Overhead of        |               |                |                |
> >> > >> |contpte_ptep_get() |         32.26%|          41.35%|          33.57%|
> >> > >>
> >> > >> |Scenario 3         |       Original|  Modification 1|  Modification 2|
> >> > >> |-------------------|---------------|----------------|----------------|
> >> > >> |instructions       |    36706279735|     38305241759|     36750881878|
> >> > >> |test time          |         3.2008|          3.5389|          3.1249|
> >> > >> |Overhead of        |               |                |                |
> >> > >> |contpte_ptep_get() |         31.94%|          41.30%|          34.59%|
> >> > >>
> >> > >>
> >> > >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> >> > >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> >> > >> benefit of 50.59% and a time benefit of 46.38%.
> >> > >>
> >> > >> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
> >> > >> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
> >> > >> the instructions and time due to additional conditional checks.
> >> > >>
> >> > >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> >> > >> the branches taken by Modification 1 and Modification 2 should be the same as
> >> > >> those of the original code. In fact, the test results of Modification 2 seem
> >> > >> to be closer to those of the original code. I don't know why there is a
> >> > >> performance regression in Modification 1.
> >> > >>
> >> > >> Therefore, I believe modifying the code according to Modification 2 can bring
> >> > >> maximum benefits. Everyone can discuss whether this approach is acceptable,
> >> > >> and if so, I will send Patch V4 to proceed with submitting this modification.
> >> > >>
> >> > >
> >> > >modification 2 is not correct. if pte0~pte14 are all young and no one
> >> > >is dirty, we are
> >> > >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
> >> > >
> >> > >                 if (pte_young(pte)) {
> >> > >                         orig_pte = pte_mkyoung(orig_pte);
> >> > >                         i++;
> >> > >                         ptep++;
> >> > >                         for (; i < CONT_PTES; i++, ptep++) {
> >> > >                                 pte = __ptep_get(ptep);
> >> > >                                 if (pte_dirty(pte)) {
> >> > >                                         orig_pte = pte_mkdirty(orig_pte);
> >> > >                                         break;
> >> > >                                 }
> >> > >                         }
> >> > >                         break;
> >> > >                 }
> >> > >
> >> >
> >> > I didn't understand which part you referred to when you said there were a lot of
> >> > useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
> >> > young and no one is dirty", Modification 2 will enter the following branch when
> >> > judging pte0:
> >> >
> >> > if (pte_young(pte)) {
> >> >         orig_pte = pte_mkyoung(orig_pte);
> >> >         // The dirty status of pte0 has already been checked, skip it.
> >> >         i++;
> >> >         ptep++;
> >> >         // Then we only need to check whether pte1~pte15 are dirty.
> >> >         for (; i < CONT_PTES; i++, ptep++) {
> >> >                 pte = __ptep_get(ptep);
> >> >                 if (pte_dirty(pte)) {
> >> >                         // Exit as soon as a dirty entry is found.
> >> >                         orig_pte = pte_mkdirty(orig_pte);
> >> >                         break;
> >> >                 }
> >> >         }
> >> >         // Exit directly here without going through the outer loop again.
> >> >         break;
> >> > }
> >> >
> >> > In this scenario, the total number of judgments in Modification 2 is nearly half less
> >> > than that of the original code. I should have understood it correctly, right?
> >>
> >> You're right—I missed the part where you're also taking a break, even though
> >> no one is dirty. Based on your data, modification 2 seems to be good.
> >>
> >> But I don't quite understand why you are doing
> >>          i++;
> >>          ptep++;
> >> before for (; i < CONT_PTES; i++, ptep++).
> >>
> >> it seems to be wrong. if i==15, you will get i=16. you are skipping
> >> the pte_dirty
> >> check for i==15. it is also true for any value between 0 and 15. My
> >> point is that
> >> you should drop it and re-test.
> >
> >Sorry, I missed that you already checked pte_dirty(pte) before calling
> >pte_young(). So please ignore my comment. The code's a bit hard to
> >follow now. :-)
> >
>
> The modification 2 is indeed a bit difficult to understand, but this is also to improve
> the execution efficiency. Thanks for your careful review. :-)

Agreed. Personally, I have no objections to Modification 2, as it demonstrates
performance improvements in nearly all cases according to your data.

>
> --
> Thanks,
> Xavier

Thanks
Barry


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-05-01 21:32                           ` Barry Song
  2025-05-04  2:39                             ` Xavier
@ 2025-05-08  7:03                             ` Xavier Xia
  2025-05-08  8:30                               ` David Hildenbrand
  2025-05-09  2:09                               ` Barry Song
  1 sibling, 2 replies; 49+ messages in thread
From: Xavier Xia @ 2025-05-08  7:03 UTC (permalink / raw)
  To: 21cnbao, ryan.roberts, dev.jain, ioworker0
  Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
	linux-kernel, will, willy, xavier_qy, ziy

This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
function by adding early termination logic. It checks if the dirty and
young bits of orig_pte are already set and skips redundant bit-setting
operations during the loop. This reduces unnecessary iterations and
improves performance.

In order to verify the optimization performance, a test function has been
designed. The function's execution time and instruction statistics have
been traced using perf, and the following are the operation results on a
certain Qualcomm mobile phone chip:

Test Code:

	#define PAGE_SIZE 4096
	#define CONT_PTES 16
	#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
	#define YOUNG_BIT 8
	void rwdata(char *buf)
	{
		for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
			buf[i] = 'a';
			volatile char c = buf[i];
		}
	}
	void clear_young_dirty(char *buf)
	{
		if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
			perror("madvise free failed");
			free(buf);
			exit(EXIT_FAILURE);
		}
		if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
			perror("madvise free failed");
			free(buf);
			exit(EXIT_FAILURE);
		}
	}
	void set_one_young(char *buf)
	{
		for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
			volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
		}
	}

	void test_contpte_perf() {
		char *buf;
		int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
				TEST_SIZE);
		if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
			perror("posix_memalign failed");
			exit(EXIT_FAILURE);
		}

		rwdata(buf);
	#if TEST_CASE2 || TEST_CASE3
		clear_young_dirty(buf);
	#endif
	#if TEST_CASE2
		set_one_young(buf);
	#endif

		for (int j = 0; j < 500; j++) {
			mlock(buf, TEST_SIZE);

			munlock(buf, TEST_SIZE);
		}
		free(buf);
	}

	Descriptions of three test scenarios

Scenario 1
	The data of all 16 PTEs are both dirty and young.
	#define TEST_CASE2 0
	#define TEST_CASE3 0

Scenario 2
	Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
	#define TEST_CASE2 1
	#define TEST_CASE3 0

Scenario 3
	Among the 16 PTEs, there are neither young nor dirty ones.
	#define TEST_CASE2 0
	#define TEST_CASE3 1

Test results

|Scenario 1         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    37912436160|     18731580031|
|test time          |         4.2797|          2.2949|
|overhead of        |               |                |
|contpte_ptep_get() |         21.31%|           4.80%|

|Scenario 2         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36701270862|     36115790086|
|test time          |         3.2335|          3.0874|
|Overhead of        |               |                |
|contpte_ptep_get() |         32.26%|          33.57%|

|Scenario 3         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36706279735|     36750881878|
|test time          |         3.2008|          3.1249|
|Overhead of        |               |                |
|contpte_ptep_get() |         31.94%|          34.59%|

For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
and a time benefit of 46.38%.
For Scenario 2, optimized code can achieve an instruction count benefit of
1.6% and a time benefit of 4.5%.
For Scenario 3, since all the PTEs have neither the young nor the dirty
flag, the branches taken by optimized code should be the same as those of
the original code. In fact, the test results of optimized code seem to be
closer to those of the original code.

It can be proven through test function that the optimization for
contpte_ptep_get is effective. Since the logic of contpte_ptep_get_lockless
is similar to that of contpte_ptep_get, the same optimization scheme is
also adopted for it.

Signed-off-by: Xavier Xia <xavier_qy@163.com>
---
 arch/arm64/mm/contpte.c | 71 +++++++++++++++++++++++++++++++++++------
 1 file changed, 62 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..e9882ec782fc 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -169,17 +169,41 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
-
-		if (pte_young(pte))
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
+
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}
 
 	return orig_pte;
 }
 EXPORT_SYMBOL_GPL(contpte_ptep_get);
 
+#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
+	(!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
+		pgprot_val(prot) != pgprot_val(orig_prot))
+
 pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 {
 	/*
@@ -221,16 +245,45 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 		pte = __ptep_get(ptep);
 		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
 
-		if (!pte_valid_cont(pte) ||
-		   pte_pfn(pte) != pfn ||
-		   pgprot_val(prot) != pgprot_val(orig_prot))
+		if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
 			goto retry;
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
-
-		if (pte_young(pte))
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+				prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+				if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
+					goto retry;
+
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
+
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			pfn++;
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+				prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+				if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
+					goto retry;
+
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}
 
 	return orig_pte;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-05-08  7:03                             ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
@ 2025-05-08  8:30                               ` David Hildenbrand
  2025-05-09  9:17                                 ` Xavier
  2025-05-09  2:09                               ` Barry Song
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2025-05-08  8:30 UTC (permalink / raw)
  To: Xavier Xia, 21cnbao, ryan.roberts, dev.jain, ioworker0
  Cc: akpm, catalin.marinas, gshan, linux-arm-kernel, linux-kernel,
	will, willy, ziy

On 08.05.25 09:03, Xavier Xia wrote:
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
> 
> In order to verify the optimization performance, a test function has been
> designed. The function's execution time and instruction statistics have
> been traced using perf, and the following are the operation results on a
> certain Qualcomm mobile phone chip:

For the future, please don't post vN+1 as reply to vN.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re:Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-05-08  8:30                               ` David Hildenbrand
@ 2025-05-09  9:17                                 ` Xavier
  2025-05-09  9:25                                   ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-05-09  9:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: 21cnbao, ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy




At 2025-05-08 16:30:14, "David Hildenbrand" <david@redhat.com> wrote:
>On 08.05.25 09:03, Xavier Xia wrote:
>> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
>> function by adding early termination logic. It checks if the dirty and
>> young bits of orig_pte are already set and skips redundant bit-setting
>> operations during the loop. This reduces unnecessary iterations and
>> improves performance.
>> 
>> In order to verify the optimization performance, a test function has been
>> designed. The function's execution time and instruction statistics have
>> been traced using perf, and the following are the operation results on a
>> certain Qualcomm mobile phone chip:
>
>For the future, please don't post vN+1 as reply to vN.

I will pay attention to it when I submit it later. Thank you for the reminder.

>
>-- 
>Cheers,
>
>David / dhildenb

--
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-05-09  9:17                                 ` Xavier
@ 2025-05-09  9:25                                   ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2025-05-09  9:25 UTC (permalink / raw)
  To: Xavier
  Cc: 21cnbao, ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy

On 09.05.25 11:17, Xavier wrote:
> 
> 
> 
> At 2025-05-08 16:30:14, "David Hildenbrand" <david@redhat.com> wrote:
>> On 08.05.25 09:03, Xavier Xia wrote:
>>> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
>>> function by adding early termination logic. It checks if the dirty and
>>> young bits of orig_pte are already set and skips redundant bit-setting
>>> operations during the loop. This reduces unnecessary iterations and
>>> improves performance.
>>>
>>> In order to verify the optimization performance, a test function has been
>>> designed. The function's execution time and instruction statistics have
>>> been traced using perf, and the following are the operation results on a
>>> certain Qualcomm mobile phone chip:
>>
>> For the future, please don't post vN+1 as reply to vN.
> 
> I will pay attention to it when I submit it later. Thank you for the reminder.

The rationale is that many people will just treat it as some discussion 
noise as part of vN and not really have a closer look, waiting for vN+1.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-05-08  7:03                             ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
  2025-05-08  8:30                               ` David Hildenbrand
@ 2025-05-09  2:09                               ` Barry Song
  2025-05-09  9:20                                 ` Xavier
  1 sibling, 1 reply; 49+ messages in thread
From: Barry Song @ 2025-05-09  2:09 UTC (permalink / raw)
  To: Xavier Xia
  Cc: ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy

On Thu, May 8, 2025 at 7:04 PM Xavier Xia <xavier_qy@163.com> wrote:
>
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
>
> In order to verify the optimization performance, a test function has been
> designed. The function's execution time and instruction statistics have
> been traced using perf, and the following are the operation results on a
> certain Qualcomm mobile phone chip:
>
> Test Code:
>
>         #define PAGE_SIZE 4096
>         #define CONT_PTES 16
>         #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>         #define YOUNG_BIT 8
>         void rwdata(char *buf)
>         {
>                 for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>                         buf[i] = 'a';
>                         volatile char c = buf[i];
>                 }
>         }
>         void clear_young_dirty(char *buf)
>         {
>                 if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>                         perror("madvise free failed");
>                         free(buf);
>                         exit(EXIT_FAILURE);
>                 }
>                 if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>                         perror("madvise free failed");
>                         free(buf);
>                         exit(EXIT_FAILURE);
>                 }
>         }
>         void set_one_young(char *buf)
>         {
>                 for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>                         volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>                 }
>         }
>
>         void test_contpte_perf() {
>                 char *buf;
>                 int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
>                                 TEST_SIZE);
>                 if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
>                         perror("posix_memalign failed");
>                         exit(EXIT_FAILURE);
>                 }
>
>                 rwdata(buf);
>         #if TEST_CASE2 || TEST_CASE3
>                 clear_young_dirty(buf);
>         #endif
>         #if TEST_CASE2
>                 set_one_young(buf);
>         #endif
>
>                 for (int j = 0; j < 500; j++) {
>                         mlock(buf, TEST_SIZE);
>
>                         munlock(buf, TEST_SIZE);
>                 }
>                 free(buf);
>         }
>
>         Descriptions of three test scenarios
>
> Scenario 1
>         The data of all 16 PTEs are both dirty and young.
>         #define TEST_CASE2 0
>         #define TEST_CASE3 0
>
> Scenario 2
>         Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
>         #define TEST_CASE2 1
>         #define TEST_CASE3 0
>
> Scenario 3
>         Among the 16 PTEs, there are neither young nor dirty ones.
>         #define TEST_CASE2 0
>         #define TEST_CASE3 1
>
> Test results
>
> |Scenario 1         |       Original|       Optimized|
> |-------------------|---------------|----------------|
> |instructions       |    37912436160|     18731580031|
> |test time          |         4.2797|          2.2949|
> |overhead of        |               |                |
> |contpte_ptep_get() |         21.31%|           4.80%|
>
> |Scenario 2         |       Original|       Optimized|
> |-------------------|---------------|----------------|
> |instructions       |    36701270862|     36115790086|
> |test time          |         3.2335|          3.0874|
> |Overhead of        |               |                |
> |contpte_ptep_get() |         32.26%|          33.57%|
>
> |Scenario 3         |       Original|       Optimized|
> |-------------------|---------------|----------------|
> |instructions       |    36706279735|     36750881878|
> |test time          |         3.2008|          3.1249|
> |Overhead of        |               |                |
> |contpte_ptep_get() |         31.94%|          34.59%|
>
> For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
> and a time benefit of 46.38%.
> For Scenario 2, optimized code can achieve an instruction count benefit of
> 1.6% and a time benefit of 4.5%.
> For Scenario 3, since all the PTEs have neither the young nor the dirty
> flag, the branches taken by optimized code should be the same as those of
> the original code. In fact, the test results of optimized code seem to be
> closer to those of the original code.
>
> It can be proven through test function that the optimization for
> contpte_ptep_get is effective. Since the logic of contpte_ptep_get_lockless
> is similar to that of contpte_ptep_get, the same optimization scheme is
> also adopted for it.
>
> Signed-off-by: Xavier Xia <xavier_qy@163.com>
> ---
>  arch/arm64/mm/contpte.c | 71 +++++++++++++++++++++++++++++++++++------
>  1 file changed, 62 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..e9882ec782fc 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -169,17 +169,41 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>                 pte = __ptep_get(ptep);
>
> -               if (pte_dirty(pte))
> +               if (pte_dirty(pte)) {
>                         orig_pte = pte_mkdirty(orig_pte);
> -
> -               if (pte_young(pte))
> +                       for (; i < CONT_PTES; i++, ptep++) {
> +                               pte = __ptep_get(ptep);
> +                               if (pte_young(pte)) {
> +                                       orig_pte = pte_mkyoung(orig_pte);
> +                                       break;
> +                               }
> +                       }
> +                       break;
> +               }
> +
> +               if (pte_young(pte)) {
>                         orig_pte = pte_mkyoung(orig_pte);
> +                       i++;
> +                       ptep++;
> +                       for (; i < CONT_PTES; i++, ptep++) {
> +                               pte = __ptep_get(ptep);
> +                               if (pte_dirty(pte)) {
> +                                       orig_pte = pte_mkdirty(orig_pte);
> +                                       break;
> +                               }
> +                       }
> +                       break;
> +               }
>         }
>
>         return orig_pte;
>  }
>  EXPORT_SYMBOL_GPL(contpte_ptep_get);
>
> +#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
> +       (!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
> +               pgprot_val(prot) != pgprot_val(orig_prot))

maybe make it a static inline function to improve readability. Also,
the name appears to
be not good: CHECK_CONTPTE_CONSISTENCY is actually checking for inconsistency,
not consistency.

it might be:

static inline bool contpte_is_consistent(...)
{
        return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
               pgprot_val(prot) == pgprot_val(orig_prot);
}

or another better name.

> +
>  pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>  {
>         /*
> @@ -221,16 +245,45 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>                 pte = __ptep_get(ptep);
>                 prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>
> -               if (!pte_valid_cont(pte) ||
> -                  pte_pfn(pte) != pfn ||
> -                  pgprot_val(prot) != pgprot_val(orig_prot))
> +               if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
>                         goto retry;
>
> -               if (pte_dirty(pte))
> +               if (pte_dirty(pte)) {
>                         orig_pte = pte_mkdirty(orig_pte);
> -
> -               if (pte_young(pte))
> +                       for (; i < CONT_PTES; i++, ptep++, pfn++) {
> +                               pte = __ptep_get(ptep);
> +                               prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +                               if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
> +                                       goto retry;
> +
> +                               if (pte_young(pte)) {
> +                                       orig_pte = pte_mkyoung(orig_pte);
> +                                       break;
> +                               }
> +                       }
> +                       break;
> +               }
> +
> +               if (pte_young(pte)) {
>                         orig_pte = pte_mkyoung(orig_pte);
> +                       i++;
> +                       ptep++;
> +                       pfn++;
> +                       for (; i < CONT_PTES; i++, ptep++, pfn++) {
> +                               pte = __ptep_get(ptep);
> +                               prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +                               if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
> +                                       goto retry;
> +
> +                               if (pte_dirty(pte)) {
> +                                       orig_pte = pte_mkdirty(orig_pte);
> +                                       break;
> +                               }
> +                       }
> +                       break;
> +               }
>         }
>
>         return orig_pte;
> --
> 2.34.1
>

Thanks
barry


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-05-09  2:09                               ` Barry Song
@ 2025-05-09  9:20                                 ` Xavier
  0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-05-09  9:20 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
	gshan, linux-arm-kernel, linux-kernel, will, willy, ziy



At 2025-05-09 10:09:21, "Barry Song" <21cnbao@gmail.com> wrote:
>On Thu, May 8, 2025 at 7:04 PM Xavier Xia <xavier_qy@163.com> wrote:
>>
>> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
>> function by adding early termination logic. It checks if the dirty and
>> young bits of orig_pte are already set and skips redundant bit-setting
>> operations during the loop. This reduces unnecessary iterations and
>> improves performance.
>>
>> In order to verify the optimization performance, a test function has been
>> designed. The function's execution time and instruction statistics have
>> been traced using perf, and the following are the operation results on a
>> certain Qualcomm mobile phone chip:
>>
>> Test Code:
>>
>>         #define PAGE_SIZE 4096
>>         #define CONT_PTES 16
>>         #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>>         #define YOUNG_BIT 8
>>         void rwdata(char *buf)
>>         {
>>                 for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>>                         buf[i] = 'a';
>>                         volatile char c = buf[i];
>>                 }
>>         }
>>         void clear_young_dirty(char *buf)
>>         {
>>                 if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>>                         perror("madvise free failed");
>>                         free(buf);
>>                         exit(EXIT_FAILURE);
>>                 }
>>                 if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>>                         perror("madvise free failed");
>>                         free(buf);
>>                         exit(EXIT_FAILURE);
>>                 }
>>         }
>>         void set_one_young(char *buf)
>>         {
>>                 for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>>                         volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>>                 }
>>         }
>>
>>         void test_contpte_perf() {
>>                 char *buf;
>>                 int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
>>                                 TEST_SIZE);
>>                 if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
>>                         perror("posix_memalign failed");
>>                         exit(EXIT_FAILURE);
>>                 }
>>
>>                 rwdata(buf);
>>         #if TEST_CASE2 || TEST_CASE3
>>                 clear_young_dirty(buf);
>>         #endif
>>         #if TEST_CASE2
>>                 set_one_young(buf);
>>         #endif
>>
>>                 for (int j = 0; j < 500; j++) {
>>                         mlock(buf, TEST_SIZE);
>>
>>                         munlock(buf, TEST_SIZE);
>>                 }
>>                 free(buf);
>>         }
>>
>>         Descriptions of three test scenarios
>>
>> Scenario 1
>>         The data of all 16 PTEs are both dirty and young.
>>         #define TEST_CASE2 0
>>         #define TEST_CASE3 0
>>
>> Scenario 2
>>         Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
>>         #define TEST_CASE2 1
>>         #define TEST_CASE3 0
>>
>> Scenario 3
>>         Among the 16 PTEs, there are neither young nor dirty ones.
>>         #define TEST_CASE2 0
>>         #define TEST_CASE3 1
>>
>> Test results
>>
>> |Scenario 1         |       Original|       Optimized|
>> |-------------------|---------------|----------------|
>> |instructions       |    37912436160|     18731580031|
>> |test time          |         4.2797|          2.2949|
>> |overhead of        |               |                |
>> |contpte_ptep_get() |         21.31%|           4.80%|
>>
>> |Scenario 2         |       Original|       Optimized|
>> |-------------------|---------------|----------------|
>> |instructions       |    36701270862|     36115790086|
>> |test time          |         3.2335|          3.0874|
>> |Overhead of        |               |                |
>> |contpte_ptep_get() |         32.26%|          33.57%|
>>
>> |Scenario 3         |       Original|       Optimized|
>> |-------------------|---------------|----------------|
>> |instructions       |    36706279735|     36750881878|
>> |test time          |         3.2008|          3.1249|
>> |Overhead of        |               |                |
>> |contpte_ptep_get() |         31.94%|          34.59%|
>>
>> For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
>> and a time benefit of 46.38%.
>> For Scenario 2, optimized code can achieve an instruction count benefit of
>> 1.6% and a time benefit of 4.5%.
>> For Scenario 3, since all the PTEs have neither the young nor the dirty
>> flag, the branches taken by optimized code should be the same as those of
>> the original code. In fact, the test results of optimized code seem to be
>> closer to those of the original code.
>>
>> It can be proven through test function that the optimization for
>> contpte_ptep_get is effective. Since the logic of contpte_ptep_get_lockless
>> is similar to that of contpte_ptep_get, the same optimization scheme is
>> also adopted for it.
>>
>> Signed-off-by: Xavier Xia <xavier_qy@163.com>
>> ---
>>  arch/arm64/mm/contpte.c | 71 +++++++++++++++++++++++++++++++++++------
>>  1 file changed, 62 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..e9882ec782fc 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -169,17 +169,41 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>         for (i = 0; i < CONT_PTES; i++, ptep++) {
>>                 pte = __ptep_get(ptep);
>>
>> -               if (pte_dirty(pte))
>> +               if (pte_dirty(pte)) {
>>                         orig_pte = pte_mkdirty(orig_pte);
>> -
>> -               if (pte_young(pte))
>> +                       for (; i < CONT_PTES; i++, ptep++) {
>> +                               pte = __ptep_get(ptep);
>> +                               if (pte_young(pte)) {
>> +                                       orig_pte = pte_mkyoung(orig_pte);
>> +                                       break;
>> +                               }
>> +                       }
>> +                       break;
>> +               }
>> +
>> +               if (pte_young(pte)) {
>>                         orig_pte = pte_mkyoung(orig_pte);
>> +                       i++;
>> +                       ptep++;
>> +                       for (; i < CONT_PTES; i++, ptep++) {
>> +                               pte = __ptep_get(ptep);
>> +                               if (pte_dirty(pte)) {
>> +                                       orig_pte = pte_mkdirty(orig_pte);
>> +                                       break;
>> +                               }
>> +                       }
>> +                       break;
>> +               }
>>         }
>>
>>         return orig_pte;
>>  }
>>  EXPORT_SYMBOL_GPL(contpte_ptep_get);
>>
>> +#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
>> +       (!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
>> +               pgprot_val(prot) != pgprot_val(orig_prot))
>
>maybe make it a static inline function to improve readability. Also,
>the name appears to
>be not good: CHECK_CONTPTE_CONSISTENCY is actually checking for inconsistency,
>not consistency.
>
>it might be:
>
>static inline bool contpte_is_consistent(...)
>{
>        return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
>               pgprot_val(prot) == pgprot_val(orig_prot);
>}
>
>or another better name.
>

You're right. What's being checked here is the inconsistency. I will make the modification
in the next version. Thank you for your suggestion.

--

Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-15  8:22             ` [mm/contpte v3 0/1] " Xavier
  2025-04-15  8:22               ` [mm/contpte v3 1/1] " Xavier
@ 2025-04-16  2:10               ` Andrew Morton
  2025-04-16  3:25                 ` Xavier
  2025-04-16 12:47               ` Catalin Marinas
  2025-04-16 12:48               ` Ryan Roberts
  3 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2025-04-16  2:10 UTC (permalink / raw)
  To: Xavier
  Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, catalin.marinas,
	david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy


Please try to avoid presentation of a [0/N] cover letter when N==1!  A
simple singleton patch is better.

On Tue, 15 Apr 2025 16:22:04 +0800 Xavier <xavier_qy@163.com> wrote:

> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there

which function?

> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:

All the words thus far appear to be discussing changes since v2.  For
the permanent kernel record, this isn't interesting or useful material.
So please present a standalone description which doesn't refer to
previous iterations.

It's great to present this what-i-changed-since-last-time material, but
that is better placed after the "^---$" separator, after the
Signed-off-by:, Reviewed-by: etc tags.

>
> ...
>


Below is what I came up with for a changelog.  Please check?

Optimize contpte_ptep_get() by adding early termination logic.  Check if
the dirty and young bits of orig_pte are already set and skip redundant
bit-setting operations during the loop.  This reduces unnecessary
iterations and improves performance.

The function's execution time and instruction statistics have been traced
using perf, and the following are the operation results on a certain
Qualcomm mobile phone chip:

Instruction Statistics - Before Optimization
#          count  event_name              # count / runtime
      20,814,352  branch-load-misses      # 662.244 K/sec
  41,894,986,323  branch-loads            # 1.333 G/sec
       1,957,415  iTLB-load-misses        # 62.278 K/sec
  49,872,282,100  iTLB-loads              # 1.587 G/sec
     302,808,096  L1-icache-load-misses   # 9.634 M/sec
  49,872,282,100  L1-icache-loads         # 1.587 G/sec

Total test time: 31.485237 seconds.

Instruction Statistics - After Optimization
#          count  event_name              # count / runtime
      19,340,524  branch-load-misses      # 688.753 K/sec
  38,510,185,183  branch-loads            # 1.371 G/sec
       1,812,716  iTLB-load-misses        # 64.554 K/sec
  47,673,923,151  iTLB-loads              # 1.698 G/sec
     675,853,661  L1-icache-load-misses   # 24.068 M/sec
  47,673,923,151  L1-icache-loads         # 1.698 G/sec

Total test time: 28.108048 seconds.

Function Statistics - Before Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1419716
Event count: 99618088900

Overhead   Symbol
21.42%     lock_release
21.26%     lock_acquire
20.88%     arch_counter_get_cntvct
14.32%     _raw_spin_unlock_irq
6.79%      contpte_ptep_get
2.20%      test_contpte_perf
1.82%      follow_page_pte
0.97%      lock_acquired
0.97%      rcu_is_watching
0.89%      mlock_pte_range
0.84%      sched_clock_noinstr
0.70%      handle_softirqs.llvm.8218488130471452153
0.58%      test_preempt_disable_long
0.57%      _raw_spin_unlock_irqrestore
0.54%      arch_stack_walk
0.51%      vm_normal_folio
0.48%      check_preemption_disabled
0.47%      stackinfo_get_task
0.36%      try_grab_folio
0.34%      preempt_count
0.32%      trace_preempt_on
0.29%      trace_preempt_off
0.24%      debug_smp_processor_id

Function Statistics - After Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1431006
Event count: 118856425042

Overhead   Symbol
22.59%     lock_release
22.13%     arch_counter_get_cntvct
22.08%     lock_acquire
15.32%     _raw_spin_unlock_irq
2.26%      test_contpte_perf
1.50%      follow_page_pte
1.49%      arch_stack_walk
1.30%      rcu_is_watching
1.09%      lock_acquired
1.07%      sched_clock_noinstr
0.88%      handle_softirqs.llvm.12507768597002095717
0.88%      trace_preempt_off
0.76%      _raw_spin_unlock_irqrestore
0.61%      check_preemption_disabled
0.52%      trace_preempt_on
0.50%      mlock_pte_range
0.43%      try_grab_folio
0.41%      folio_mark_accessed
0.40%      vm_normal_folio
0.38%      test_preempt_disable_long
0.28%      contpte_ptep_get
0.27%      __traceiter_android_rvh_preempt_disable
0.26%      debug_smp_processor_id
0.24%      return_address
0.20%      __pte_offset_map_lock
0.19%      unwind_next_frame_record

If there is no problem with my test program, it can be seen that there is a
significant performance improvement both in the overall number of instructions
and the execution time of contpte_ptep_get.

If any reviewers have time, you can also test it on your machines for comparison.
I have enabled THP and hugepages-64kB.

Test function:

#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)

void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}
void test_contpte_perf()
{
	char *buf;
	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
	if (ret != 0) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);

		rwdata(buf);

		munlock(buf, TEST_SIZE);
	}
	
	free(buf);
}



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-16  2:10               ` [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations Andrew Morton
@ 2025-04-16  3:25                 ` Xavier
  0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16  3:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, catalin.marinas,
	david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy


Hi Andrew,


At 2025-04-16 10:10:27, "Andrew Morton" <akpm@linux-foundation.org> wrote:
>
>Please try to avoid presentation of a [0/N] cover letter when N==1!  A
>simple singleton patch is better.

Got it, I'll keep this in mind for future submissions. Thanks for the reminder!

>
>On Tue, 15 Apr 2025 16:22:04 +0800 Xavier <xavier_qy@163.com> wrote:
>
>> Patch V3 has changed the while loop to a for loop according to the suggestions
>> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
>> been removed. This macro is only used within the current function and there
>
>which function?

It's contpte_ptep_get().

>
>> will be no additional risks. In order to verify the optimization performance of
>> Patch V3, a test function has been designed. By repeatedly calling mlock in a
>> loop, the kernel is made to call contpte_ptep_get extensively to test the
>> optimization effect of this function.
>> The function's execution time and instruction statistics have been traced using
>> perf, and the following are the operation results on a certain Qualcomm mobile
>> phone chip:
>
>All the words thus far appear to be discussing changes since v2.  For
>the permanent kernel record, this isn't interesting or useful material.
>So please present a standalone description which doesn't refer to
>previous iterations.
>
>It's great to present this what-i-changed-since-last-time material, but
>that is better placed after the "^---$" separator, after the
>Signed-off-by:, Reviewed-by: etc tags.
>

OK, I will follow this requirement for future submissions.

>>
>> ...
>>
>
>
>Below is what I came up with for a changelog.  Please check?

I've reviewed it, and it looks good. Thank you for your revisions!

>
>Optimize contpte_ptep_get() by adding early termination logic.  Check if
>the dirty and young bits of orig_pte are already set and skip redundant
>bit-setting operations during the loop.  This reduces unnecessary
>iterations and improves performance.
>
>The function's execution time and instruction statistics have been traced
>using perf, and the following are the operation results on a certain
>Qualcomm mobile phone chip:
>
>Instruction Statistics - Before Optimization
>#          count  event_name              # count / runtime
>      20,814,352  branch-load-misses      # 662.244 K/sec
>  41,894,986,323  branch-loads            # 1.333 G/sec
>       1,957,415  iTLB-load-misses        # 62.278 K/sec
>  49,872,282,100  iTLB-loads              # 1.587 G/sec
>     302,808,096  L1-icache-load-misses   # 9.634 M/sec
>  49,872,282,100  L1-icache-loads         # 1.587 G/sec
>
>Total test time: 31.485237 seconds.
>
>Instruction Statistics - After Optimization
>#          count  event_name              # count / runtime
>      19,340,524  branch-load-misses      # 688.753 K/sec
>  38,510,185,183  branch-loads            # 1.371 G/sec
>       1,812,716  iTLB-load-misses        # 64.554 K/sec
>  47,673,923,151  iTLB-loads              # 1.698 G/sec
>     675,853,661  L1-icache-load-misses   # 24.068 M/sec
>  47,673,923,151  L1-icache-loads         # 1.698 G/sec
>
>Total test time: 28.108048 seconds.
>
>Function Statistics - Before Optimization
>Arch: arm64
>Event: cpu-cycles (type 0, config 0)
>Samples: 1419716
>Event count: 99618088900
>
>Overhead   Symbol
>21.42%     lock_release
>21.26%     lock_acquire
>20.88%     arch_counter_get_cntvct
>14.32%     _raw_spin_unlock_irq
>6.79%      contpte_ptep_get
>2.20%      test_contpte_perf
>1.82%      follow_page_pte
>0.97%      lock_acquired
>0.97%      rcu_is_watching
>0.89%      mlock_pte_range
>0.84%      sched_clock_noinstr
>0.70%      handle_softirqs.llvm.8218488130471452153
>0.58%      test_preempt_disable_long
>0.57%      _raw_spin_unlock_irqrestore
>0.54%      arch_stack_walk
>0.51%      vm_normal_folio
>0.48%      check_preemption_disabled
>0.47%      stackinfo_get_task
>0.36%      try_grab_folio
>0.34%      preempt_count
>0.32%      trace_preempt_on
>0.29%      trace_preempt_off
>0.24%      debug_smp_processor_id
>
>Function Statistics - After Optimization
>Arch: arm64
>Event: cpu-cycles (type 0, config 0)
>Samples: 1431006
>Event count: 118856425042
>
>Overhead   Symbol
>22.59%     lock_release
>22.13%     arch_counter_get_cntvct
>22.08%     lock_acquire
>15.32%     _raw_spin_unlock_irq
>2.26%      test_contpte_perf
>1.50%      follow_page_pte
>1.49%      arch_stack_walk
>1.30%      rcu_is_watching
>1.09%      lock_acquired
>1.07%      sched_clock_noinstr
>0.88%      handle_softirqs.llvm.12507768597002095717
>0.88%      trace_preempt_off
>0.76%      _raw_spin_unlock_irqrestore
>0.61%      check_preemption_disabled
>0.52%      trace_preempt_on
>0.50%      mlock_pte_range
>0.43%      try_grab_folio
>0.41%      folio_mark_accessed
>0.40%      vm_normal_folio
>0.38%      test_preempt_disable_long
>0.28%      contpte_ptep_get
>0.27%      __traceiter_android_rvh_preempt_disable
>0.26%      debug_smp_processor_id
>0.24%      return_address
>0.20%      __pte_offset_map_lock
>0.19%      unwind_next_frame_record
>
>If there is no problem with my test program, it can be seen that there is a
>significant performance improvement both in the overall number of instructions
>and the execution time of contpte_ptep_get.
>
>If any reviewers have time, you can also test it on your machines for comparison.
>I have enabled THP and hugepages-64kB.
>
>Test function:
>
>#define PAGE_SIZE 4096
>#define CONT_PTES 16
>#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>
>void rwdata(char *buf)
>{
>	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>		buf[i] = 'a';
>		volatile char c = buf[i];
>	}
>}
>void test_contpte_perf()
>{
>	char *buf;
>	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
>	if (ret != 0) {
>		perror("posix_memalign failed");
>		exit(EXIT_FAILURE);
>	}
>
>	rwdata(buf);
>
>	for (int j = 0; j < 500; j++) {
>		mlock(buf, TEST_SIZE);
>
>		rwdata(buf);
>
>		munlock(buf, TEST_SIZE);
>	}
>	
>	free(buf);
>}


--

Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-15  8:22             ` [mm/contpte v3 0/1] " Xavier
  2025-04-15  8:22               ` [mm/contpte v3 1/1] " Xavier
  2025-04-16  2:10               ` [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations Andrew Morton
@ 2025-04-16 12:47               ` Catalin Marinas
  2025-04-16 15:08                 ` Xavier
  2025-04-16 12:48               ` Ryan Roberts
  3 siblings, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2025-04-16 12:47 UTC (permalink / raw)
  To: Xavier
  Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, david, gshan,
	linux-arm-kernel, linux-kernel, will, willy, ziy

On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev.

For some reason, my email (office365) rejected all these patches (not
even quarantined), I only got the replies. Anyway, I can get them from
the lore archive.

> Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there
> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:
> 
> Instruction Statistics - Before Optimization
> #          count  event_name              # count / runtime
>       20,814,352  branch-load-misses      # 662.244 K/sec
>   41,894,986,323  branch-loads            # 1.333 G/sec
>        1,957,415  iTLB-load-misses        # 62.278 K/sec
>   49,872,282,100  iTLB-loads              # 1.587 G/sec
>      302,808,096  L1-icache-load-misses   # 9.634 M/sec
>   49,872,282,100  L1-icache-loads         # 1.587 G/sec
> 
> Total test time: 31.485237 seconds.
> 
> Instruction Statistics - After Optimization
> #          count  event_name              # count / runtime
>       19,340,524  branch-load-misses      # 688.753 K/sec
>   38,510,185,183  branch-loads            # 1.371 G/sec
>        1,812,716  iTLB-load-misses        # 64.554 K/sec
>   47,673,923,151  iTLB-loads              # 1.698 G/sec
>      675,853,661  L1-icache-load-misses   # 24.068 M/sec
>   47,673,923,151  L1-icache-loads         # 1.698 G/sec
> 
> Total test time: 28.108048 seconds.

We'd need to reproduce these numbers on other platforms as well and with
different page sizes. I hope Ryan can do some tests next week.

Purely looking at the patch, I don't like the complexity. I'd rather go
with your v1 if the numbers are fairly similar (even if slightly slower).

However, I don't trust microbenchmarks like calling mlock() in a loop.
It was hand-crafted to dirty the whole buffer (making ptes young+dirty)
before mlock() to make the best out of the rewritten contpte_ptep_get().
Are there any real world workloads that would benefit from such change?

As it stands, I think this patch needs better justification.

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re:Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-16 12:47               ` Catalin Marinas
@ 2025-04-16 15:08                 ` Xavier
  0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 15:08 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, david, gshan,
	linux-arm-kernel, linux-kernel, will, willy, ziy



Hi Catalin,


At 2025-04-16 20:47:46, "Catalin Marinas" <catalin.marinas@arm.com> wrote:
>On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
>> Patch V3 has changed the while loop to a for loop according to the suggestions
>> of Dev.
>
>For some reason, my email (office365) rejected all these patches (not
>even quarantined), I only got the replies. Anyway, I can get them from
>the lore archive.
>
>> Meanwhile, to improve efficiency, the definition of local variables has
>> been removed. This macro is only used within the current function and there
>> will be no additional risks. In order to verify the optimization performance of
>> Patch V3, a test function has been designed. By repeatedly calling mlock in a
>> loop, the kernel is made to call contpte_ptep_get extensively to test the
>> optimization effect of this function.
>> The function's execution time and instruction statistics have been traced using
>> perf, and the following are the operation results on a certain Qualcomm mobile
>> phone chip:
>> 
>> Instruction Statistics - Before Optimization
>> #          count  event_name              # count / runtime
>>       20,814,352  branch-load-misses      # 662.244 K/sec
>>   41,894,986,323  branch-loads            # 1.333 G/sec
>>        1,957,415  iTLB-load-misses        # 62.278 K/sec
>>   49,872,282,100  iTLB-loads              # 1.587 G/sec
>>      302,808,096  L1-icache-load-misses   # 9.634 M/sec
>>   49,872,282,100  L1-icache-loads         # 1.587 G/sec
>> 
>> Total test time: 31.485237 seconds.
>> 
>> Instruction Statistics - After Optimization
>> #          count  event_name              # count / runtime
>>       19,340,524  branch-load-misses      # 688.753 K/sec
>>   38,510,185,183  branch-loads            # 1.371 G/sec
>>        1,812,716  iTLB-load-misses        # 64.554 K/sec
>>   47,673,923,151  iTLB-loads              # 1.698 G/sec
>>      675,853,661  L1-icache-load-misses   # 24.068 M/sec
>>   47,673,923,151  L1-icache-loads         # 1.698 G/sec
>> 
>> Total test time: 28.108048 seconds.
>
>We'd need to reproduce these numbers on other platforms as well and with
>different page sizes. I hope Ryan can do some tests next week.

Of course, it would be even better if any reviewers could verify it with some
real-world scenarios.

>
>Purely looking at the patch, I don't like the complexity. I'd rather go
>with your v1 if the numbers are fairly similar (even if slightly slower).

The implementation of this macro is indeed a bit complicated. If it
doesn't affect the performance, I will try to make it simpler.

>
>However, I don't trust microbenchmarks like calling mlock() in a loop.
>It was hand-crafted to dirty the whole buffer (making ptes young+dirty)
>before mlock() to make the best out of the rewritten contpte_ptep_get().
>Are there any real world workloads that would benefit from such change?
>
>As it stands, I think this patch needs better justification.
>

Indeed, I used mlock() because it is an example that can simply achieve
a large number of calls to contpte_ptep_get() in the user space. Of course,
there are many other scenarios where this function will be called, such as
madvise and mprotect, and so on. But essentially, there is no difference,
and all of them are aimed at verifying the performance improvement
brought by the contpte_ptep_get() function. It's true that the previous
test cases only tested the best-case scenario by dirtying the data.
Next, I will continue to test the impact of the new patch on performance
in scenarios where there are no "young" and "dirty" flags, or where there
is only a "young" flag.

--
Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-15  8:22             ` [mm/contpte v3 0/1] " Xavier
                                 ` (2 preceding siblings ...)
  2025-04-16 12:47               ` Catalin Marinas
@ 2025-04-16 12:48               ` Ryan Roberts
  2025-04-16 15:22                 ` Xavier
  3 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-04-16 12:48 UTC (permalink / raw)
  To: Xavier, dev.jain, ioworker0, 21cnbao
  Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
	linux-kernel, will, willy, ziy

On 15/04/2025 09:22, Xavier wrote:
> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there
> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:

Xavier, for some reason your emails aren't hitting my inbox - I'm only seeing
the replies from others. I'll monitor lore but appologies if I'm slow to respond
- that's the reason.

Please start the first line of the commit with "arm64/mm" instead of "mm/contpte".

Also I noticed that Andrew put this into mm-new last night. I'd prefer that this
go via the arm64 tree, if we decide we want it.


> 
> Instruction Statistics - Before Optimization
> #          count  event_name              # count / runtime
>       20,814,352  branch-load-misses      # 662.244 K/sec
>   41,894,986,323  branch-loads            # 1.333 G/sec
>        1,957,415  iTLB-load-misses        # 62.278 K/sec
>   49,872,282,100  iTLB-loads              # 1.587 G/sec
>      302,808,096  L1-icache-load-misses   # 9.634 M/sec
>   49,872,282,100  L1-icache-loads         # 1.587 G/sec
> 
> Total test time: 31.485237 seconds.
> 
> Instruction Statistics - After Optimization
> #          count  event_name              # count / runtime
>       19,340,524  branch-load-misses      # 688.753 K/sec
>   38,510,185,183  branch-loads            # 1.371 G/sec
>        1,812,716  iTLB-load-misses        # 64.554 K/sec
>   47,673,923,151  iTLB-loads              # 1.698 G/sec
>      675,853,661  L1-icache-load-misses   # 24.068 M/sec
>   47,673,923,151  L1-icache-loads         # 1.698 G/sec
> 
> Total test time: 28.108048 seconds.
> 
> Function Statistics - Before Optimization
> Arch: arm64
> Event: cpu-cycles (type 0, config 0)
> Samples: 1419716
> Event count: 99618088900
> 
> Overhead   Symbol
> 21.42%     lock_release
> 21.26%     lock_acquire
> 20.88%     arch_counter_get_cntvct
> 14.32%     _raw_spin_unlock_irq
> 6.79%      contpte_ptep_get
> 2.20%      test_contpte_perf
> 1.82%      follow_page_pte
> 0.97%      lock_acquired
> 0.97%      rcu_is_watching
> 0.89%      mlock_pte_range
> 0.84%      sched_clock_noinstr
> 0.70%      handle_softirqs.llvm.8218488130471452153
> 0.58%      test_preempt_disable_long
> 0.57%      _raw_spin_unlock_irqrestore
> 0.54%      arch_stack_walk
> 0.51%      vm_normal_folio
> 0.48%      check_preemption_disabled
> 0.47%      stackinfo_get_task
> 0.36%      try_grab_folio
> 0.34%      preempt_count
> 0.32%      trace_preempt_on
> 0.29%      trace_preempt_off
> 0.24%      debug_smp_processor_id
> 
> Function Statistics - After Optimization
> Arch: arm64
> Event: cpu-cycles (type 0, config 0)
> Samples: 1431006
> Event count: 118856425042
> 
> Overhead   Symbol
> 22.59%     lock_release
> 22.13%     arch_counter_get_cntvct
> 22.08%     lock_acquire
> 15.32%     _raw_spin_unlock_irq
> 2.26%      test_contpte_perf
> 1.50%      follow_page_pte
> 1.49%      arch_stack_walk
> 1.30%      rcu_is_watching
> 1.09%      lock_acquired
> 1.07%      sched_clock_noinstr
> 0.88%      handle_softirqs.llvm.12507768597002095717
> 0.88%      trace_preempt_off
> 0.76%      _raw_spin_unlock_irqrestore
> 0.61%      check_preemption_disabled
> 0.52%      trace_preempt_on
> 0.50%      mlock_pte_range
> 0.43%      try_grab_folio
> 0.41%      folio_mark_accessed
> 0.40%      vm_normal_folio
> 0.38%      test_preempt_disable_long
> 0.28%      contpte_ptep_get
> 0.27%      __traceiter_android_rvh_preempt_disable
> 0.26%      debug_smp_processor_id
> 0.24%      return_address
> 0.20%      __pte_offset_map_lock
> 0.19%      unwind_next_frame_record
> 
> If there is no problem with my test program, it can be seen that there is a
> significant performance improvement both in the overall number of instructions
> and the execution time of contpte_ptep_get.
> 
> If any reviewers have time, you can also test it on your machines for comparison.
> I have enabled THP and hugepages-64kB.
> 
> Test Function:
> ---
> #define PAGE_SIZE 4096
> #define CONT_PTES 16
> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> 
> void rwdata(char *buf)
> {
> 	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> 		buf[i] = 'a';
> 		volatile char c = buf[i];
> 	}
> }
> void test_contpte_perf()
> {
> 	char *buf;
> 	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
> 	if (ret != 0) {
> 		perror("posix_memalign failed");
> 		exit(EXIT_FAILURE);
> 	}
> 
> 	rwdata(buf);
> 
> 	for (int j = 0; j < 500; j++) {
> 		mlock(buf, TEST_SIZE);
> 
> 		rwdata(buf);
> 
> 		munlock(buf, TEST_SIZE);

This is a microbenchmark in a pathological case and it's showing ~11%
improvement. But in principle I'm ok with it. I have some comments on the actual
change though, which I'll send through against email.

Thanks,
Ryan

> 	}
> 	
> 	free(buf);
> }
> ---
> 
> Xavier (1):
>   mm/contpte: Optimize loop to reduce redundant operations
> 
>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
  2025-04-16 12:48               ` Ryan Roberts
@ 2025-04-16 15:22                 ` Xavier
  0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 15:22 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas, david, gshan,
	linux-arm-kernel, linux-kernel, will, willy, ziy


Hi Ryan,


At 2025-04-16 20:48:57, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> Patch V3 has changed the while loop to a for loop according to the suggestions
>> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
>> been removed. This macro is only used within the current function and there
>> will be no additional risks. In order to verify the optimization performance of
>> Patch V3, a test function has been designed. By repeatedly calling mlock in a
>> loop, the kernel is made to call contpte_ptep_get extensively to test the
>> optimization effect of this function.
>> The function's execution time and instruction statistics have been traced using
>> perf, and the following are the operation results on a certain Qualcomm mobile
>> phone chip:
>
>Xavier, for some reason your emails aren't hitting my inbox - I'm only seeing
>the replies from others. I'll monitor lore but appologies if I'm slow to respond
>- that's the reason.
>
>Please start the first line of the commit with "arm64/mm" instead of "mm/contpte".
>
>Also I noticed that Andrew put this into mm-new last night. I'd prefer that this
>go via the arm64 tree, if we decide we want it.

OK, I will change it to "arm64/mm" in the subsequent version.

>> 
>> Instruction Statistics - Before Optimization
>> #          count  event_name              # count / runtime
>>       20,814,352  branch-load-misses      # 662.244 K/sec
>>   41,894,986,323  branch-loads            # 1.333 G/sec
>>        1,957,415  iTLB-load-misses        # 62.278 K/sec
>>   49,872,282,100  iTLB-loads              # 1.587 G/sec
>>      302,808,096  L1-icache-load-misses   # 9.634 M/sec
>>   49,872,282,100  L1-icache-loads         # 1.587 G/sec
>> 
>> Total test time: 31.485237 seconds.
>> 
>> Instruction Statistics - After Optimization
>> #          count  event_name              # count / runtime
>>       19,340,524  branch-load-misses      # 688.753 K/sec
>>   38,510,185,183  branch-loads            # 1.371 G/sec
>>        1,812,716  iTLB-load-misses        # 64.554 K/sec
>>   47,673,923,151  iTLB-loads              # 1.698 G/sec
>>      675,853,661  L1-icache-load-misses   # 24.068 M/sec
>>   47,673,923,151  L1-icache-loads         # 1.698 G/sec
>> 
>> Total test time: 28.108048 seconds.
>> 
>> Function Statistics - Before Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1419716
>> Event count: 99618088900
>> 
>> Overhead   Symbol
>> 21.42%     lock_release
>> 21.26%     lock_acquire
>> 20.88%     arch_counter_get_cntvct
>> 14.32%     _raw_spin_unlock_irq
>> 6.79%      contpte_ptep_get
>> 2.20%      test_contpte_perf
>> 1.82%      follow_page_pte
>> 0.97%      lock_acquired
>> 0.97%      rcu_is_watching
>> 0.89%      mlock_pte_range
>> 0.84%      sched_clock_noinstr
>> 0.70%      handle_softirqs.llvm.8218488130471452153
>> 0.58%      test_preempt_disable_long
>> 0.57%      _raw_spin_unlock_irqrestore
>> 0.54%      arch_stack_walk
>> 0.51%      vm_normal_folio
>> 0.48%      check_preemption_disabled
>> 0.47%      stackinfo_get_task
>> 0.36%      try_grab_folio
>> 0.34%      preempt_count
>> 0.32%      trace_preempt_on
>> 0.29%      trace_preempt_off
>> 0.24%      debug_smp_processor_id
>> 
>> Function Statistics - After Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1431006
>> Event count: 118856425042
>> 
>> Overhead   Symbol
>> 22.59%     lock_release
>> 22.13%     arch_counter_get_cntvct
>> 22.08%     lock_acquire
>> 15.32%     _raw_spin_unlock_irq
>> 2.26%      test_contpte_perf
>> 1.50%      follow_page_pte
>> 1.49%      arch_stack_walk
>> 1.30%      rcu_is_watching
>> 1.09%      lock_acquired
>> 1.07%      sched_clock_noinstr
>> 0.88%      handle_softirqs.llvm.12507768597002095717
>> 0.88%      trace_preempt_off
>> 0.76%      _raw_spin_unlock_irqrestore
>> 0.61%      check_preemption_disabled
>> 0.52%      trace_preempt_on
>> 0.50%      mlock_pte_range
>> 0.43%      try_grab_folio
>> 0.41%      folio_mark_accessed
>> 0.40%      vm_normal_folio
>> 0.38%      test_preempt_disable_long
>> 0.28%      contpte_ptep_get
>> 0.27%      __traceiter_android_rvh_preempt_disable
>> 0.26%      debug_smp_processor_id
>> 0.24%      return_address
>> 0.20%      __pte_offset_map_lock
>> 0.19%      unwind_next_frame_record
>> 
>> If there is no problem with my test program, it can be seen that there is a
>> significant performance improvement both in the overall number of instructions
>> and the execution time of contpte_ptep_get.
>> 
>> If any reviewers have time, you can also test it on your machines for comparison.
>> I have enabled THP and hugepages-64kB.
>> 
>> Test Function:
>> ---
>> #define PAGE_SIZE 4096
>> #define CONT_PTES 16
>> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>> 
>> void rwdata(char *buf)
>> {
>> 	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> 		buf[i] = 'a';
>> 		volatile char c = buf[i];
>> 	}
>> }
>> void test_contpte_perf()
>> {
>> 	char *buf;
>> 	int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
>> 	if (ret != 0) {
>> 		perror("posix_memalign failed");
>> 		exit(EXIT_FAILURE);
>> 	}
>> 
>> 	rwdata(buf);
>> 
>> 	for (int j = 0; j < 500; j++) {
>> 		mlock(buf, TEST_SIZE);
>> 
>> 		rwdata(buf);
>> 
>> 		munlock(buf, TEST_SIZE);
>
>This is a microbenchmark in a pathological case and it's showing ~11%
>improvement. But in principle I'm ok with it. I have some comments on the actual
>change though, which I'll send through against email.

OK. Please refer to my reply to that email.

Thanks,
Xavier

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2025-05-09 10:24 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-07  9:22 [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations Xavier
2025-04-07 11:29 ` Lance Yang
2025-04-07 12:56   ` Xavier
2025-04-07 13:31     ` Lance Yang
2025-04-07 16:19       ` Dev Jain
2025-04-08  8:58         ` [PATCH v2 0/1] " Xavier
2025-04-08  8:58           ` [PATCH v2 1/1] " Xavier
2025-04-09  4:09             ` Gavin Shan
2025-04-09 15:10               ` Xavier
2025-04-10  0:58                 ` Gavin Shan
2025-04-10  2:53                   ` Xavier
2025-04-10  3:06                     ` Gavin Shan
2025-04-08  9:17         ` [PATCH v1] " Lance Yang
2025-04-09 15:15           ` Xavier
2025-04-10 21:25 ` Barry Song
2025-04-11 12:03   ` David Laight
2025-04-12  7:18     ` Barry Song
2025-04-11 17:30   ` Dev Jain
2025-04-12  5:05     ` Lance Yang
2025-04-12  5:27       ` Xavier
2025-04-14  8:06       ` Ryan Roberts
2025-04-14  8:51         ` Dev Jain
2025-04-14 12:11           ` Ryan Roberts
2025-04-15  8:22             ` [mm/contpte v3 0/1] " Xavier
2025-04-15  8:22               ` [mm/contpte v3 1/1] " Xavier
2025-04-15  9:01                 ` [mm/contpte v3] " Markus Elfring
2025-04-16  8:57                 ` [mm/contpte v3 1/1] " David Laight
2025-04-16 16:15                   ` Xavier
2025-04-16 12:54                 ` Ryan Roberts
2025-04-16 16:09                   ` Xavier
2025-04-22  9:33                   ` Xavier
2025-04-30 23:17                     ` Barry Song
2025-05-01 12:39                       ` Xavier
2025-05-01 21:19                         ` Barry Song
2025-05-01 21:32                           ` Barry Song
2025-05-04  2:39                             ` Xavier
2025-05-08  1:29                               ` Barry Song
2025-05-08  7:03                             ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
2025-05-08  8:30                               ` David Hildenbrand
2025-05-09  9:17                                 ` Xavier
2025-05-09  9:25                                   ` David Hildenbrand
2025-05-09  2:09                               ` Barry Song
2025-05-09  9:20                                 ` Xavier
2025-04-16  2:10               ` [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations Andrew Morton
2025-04-16  3:25                 ` Xavier
2025-04-16 12:47               ` Catalin Marinas
2025-04-16 15:08                 ` Xavier
2025-04-16 12:48               ` Ryan Roberts
2025-04-16 15:22                 ` Xavier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).