* [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-15 8:22 ` [mm/contpte v3 0/1] " Xavier
@ 2025-04-15 8:22 ` Xavier
2025-04-15 9:01 ` [mm/contpte v3] " Markus Elfring
` (2 more replies)
2025-04-16 2:10 ` [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations Andrew Morton
` (2 subsequent siblings)
3 siblings, 3 replies; 49+ messages in thread
From: Xavier @ 2025-04-15 8:22 UTC (permalink / raw)
To: ryan.roberts, dev.jain, ioworker0, 21cnbao
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, will, willy, xavier_qy, ziy
This commit optimizes the contpte_ptep_get function by adding early
termination logic. It checks if the dirty and young bits of orig_pte
are already set and skips redundant bit-setting operations during
the loop. This reduces unnecessary iterations and improves performance.
Signed-off-by: Xavier <xavier_qy@163.com>
---
arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..0acfee604947 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(__contpte_try_unfold);
+/* Note: in order to improve efficiency, using this macro will modify the
+ * passed-in parameters.*/
+#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
+ for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
+ if (pte_##flag(__ptep_get(ptep))) { \
+ orig_pte = pte_mk##flag(orig_pte); \
+ break; \
+ } \
+ }
+
pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
/*
@@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
- if (pte_dirty(pte))
+ if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
+ CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
+ break;
+ }
- if (pte_young(pte))
+ if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
+ CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
+ break;
+ }
}
return orig_pte;
--
2.34.1
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3] mm/contpte: Optimize loop to reduce redundant operations
2025-04-15 8:22 ` [mm/contpte v3 1/1] " Xavier
@ 2025-04-15 9:01 ` Markus Elfring
2025-04-16 8:57 ` [mm/contpte v3 1/1] " David Laight
2025-04-16 12:54 ` Ryan Roberts
2 siblings, 0 replies; 49+ messages in thread
From: Markus Elfring @ 2025-04-15 9:01 UTC (permalink / raw)
To: xavier_qy, linux-arm-kernel
Cc: LKML, Andrew Morton, Barry Song, Catalin Marinas,
David Hildenbrand, Dev Jain, Gavin Shan, Lance Yang,
Matthew Wilcox, Ryan Roberts, Will Deacon, Zi Yan
> This commit optimizes the contpte_ptep_get function by …
> Signed-off-by: Xavier <xavier_qy@163.com>
I guess that a complete personal name would be more desirable according to
the Developer's Certificate of Origin.
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.15-rc2#n436
Should another keyword be specified for the prefix in the subject?
See also:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.15-rc2#n94
Regards,
Markus
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-15 8:22 ` [mm/contpte v3 1/1] " Xavier
2025-04-15 9:01 ` [mm/contpte v3] " Markus Elfring
@ 2025-04-16 8:57 ` David Laight
2025-04-16 16:15 ` Xavier
2025-04-16 12:54 ` Ryan Roberts
2 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2025-04-16 8:57 UTC (permalink / raw)
To: Xavier
Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
On Tue, 15 Apr 2025 16:22:05 +0800
Xavier <xavier_qy@163.com> wrote:
> This commit optimizes the contpte_ptep_get function by adding early
> termination logic. It checks if the dirty and young bits of orig_pte
> are already set and skips redundant bit-setting operations during
> the loop. This reduces unnecessary iterations and improves performance.
Benchmarks?
As has been pointed out before, CONT_PTES is small and IIRC dirty+young
is unusual.
>
> Signed-off-by: Xavier <xavier_qy@163.com>
> ---
> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..0acfee604947 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>
> +/* Note: in order to improve efficiency, using this macro will modify the
> + * passed-in parameters.*/
... this macro modifies ...
But you can make it obvious by passing by reference.
The compiler will generate the same code.
David
> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> + if (pte_##flag(__ptep_get(ptep))) { \
> + orig_pte = pte_mk##flag(orig_pte); \
> + break; \
> + } \
> + }
> +
> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> {
> /*
> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> - if (pte_dirty(pte))
> + if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
> + break;
> + }
>
> - if (pte_young(pte))
> + if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
> + break;
> + }
> }
>
> return orig_pte;
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-16 8:57 ` [mm/contpte v3 1/1] " David Laight
@ 2025-04-16 16:15 ` Xavier
0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 16:15 UTC (permalink / raw)
To: David Laight
Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
At 2025-04-16 16:57:06, "David Laight" <david.laight.linux@gmail.com> wrote:
>On Tue, 15 Apr 2025 16:22:05 +0800
>Xavier <xavier_qy@163.com> wrote:
>
>> This commit optimizes the contpte_ptep_get function by adding early
>> termination logic. It checks if the dirty and young bits of orig_pte
>> are already set and skips redundant bit-setting operations during
>> the loop. This reduces unnecessary iterations and improves performance.
>
>Benchmarks?
>
>As has been pointed out before CONT_PTES is small and IIRC dirty+young
>is unusual.
I haven't found a suitable benchmark yet, so I will write some more
general test scenarios. Please watch for the follow-up emails.
>
>>
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> 1 file changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>
>... this macro modifies ...
>
>But you can make it obvious by passing by reference.
>The compiler will generate the same code.
>
This part may also be further refined.
--
Thanks,
Xavier
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-15 8:22 ` [mm/contpte v3 1/1] " Xavier
2025-04-15 9:01 ` [mm/contpte v3] " Markus Elfring
2025-04-16 8:57 ` [mm/contpte v3 1/1] " David Laight
@ 2025-04-16 12:54 ` Ryan Roberts
2025-04-16 16:09 ` Xavier
2025-04-22 9:33 ` Xavier
2 siblings, 2 replies; 49+ messages in thread
From: Ryan Roberts @ 2025-04-16 12:54 UTC (permalink / raw)
To: Xavier, dev.jain, ioworker0, 21cnbao
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, will, willy, ziy
On 15/04/2025 09:22, Xavier wrote:
> This commit optimizes the contpte_ptep_get function by adding early
> termination logic. It checks if the dirty and young bits of orig_pte
> are already set and skips redundant bit-setting operations during
> the loop. This reduces unnecessary iterations and improves performance.
>
> Signed-off-by: Xavier <xavier_qy@163.com>
> ---
> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..0acfee604947 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>
> +/* Note: in order to improve efficiency, using this macro will modify the
> + * passed-in parameters.*/
> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> + if (pte_##flag(__ptep_get(ptep))) { \
> + orig_pte = pte_mk##flag(orig_pte); \
> + break; \
> + } \
> + }
I'm really not a fan of this macro, it just obfuscates what is going on. I'd
personally prefer to see the 2 extra loops open coded below.
Or even better, could you provide results comparing this 3 loop version to the
simpler approach I suggested previously? If the performance is similar (which I
expect it will be, especially given Barry's point that your test always ensures
the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> +
> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> {
> /*
> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> - if (pte_dirty(pte))
> + if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
> + break;
> + }
>
> - if (pte_young(pte))
> + if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
> + break;
> + }
> }
>
> return orig_pte;
If we decide this is all worth the trouble, then I think we can (and *should*,
in order to be consistent) pull a similar stunt in contpte_ptep_get_lockless().
Thanks,
Ryan
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-16 12:54 ` Ryan Roberts
@ 2025-04-16 16:09 ` Xavier
2025-04-22 9:33 ` Xavier
1 sibling, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 16:09 UTC (permalink / raw)
To: Ryan Roberts
Cc: dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas, david, gshan,
linux-arm-kernel, linux-kernel, will, willy, ziy
At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> This commit optimizes the contpte_ptep_get function by adding early
>> termination logic. It checks if the dirty and young bits of orig_pte
>> are already set and skips redundant bit-setting operations during
>> the loop. This reduces unnecessary iterations and improves performance.
>>
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> 1 file changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> + if (pte_##flag(__ptep_get(ptep))) { \
>> + orig_pte = pte_mk##flag(orig_pte); \
>> + break; \
>> + } \
>> + }
>
>I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>personally prefer to see the 2 extra loops open coded below.
>
>Or even better, could you provide results comparing this 3 loop version to the
>simpler approach I suggested previously? If the performance is similar (which I
>expect it will be, especially given Barry's point that your test always ensures
>the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
I will test the performance differences among these implementations.
If the differences are not significant, I will go with the simpler method.
At the same time, following Barry's suggestion, I will add some more general
test cases to verify whether the optimization is effective.
Since I'm a bit busy at the moment, it may take me a few days to complete this.
>> +
>> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> {
>> /*
>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>>
>> - if (pte_dirty(pte))
>> + if (pte_dirty(pte)) {
>> orig_pte = pte_mkdirty(orig_pte);
>> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>> + break;
>> + }
>>
>> - if (pte_young(pte))
>> + if (pte_young(pte)) {
>> orig_pte = pte_mkyoung(orig_pte);
>> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>> + break;
>> + }
>> }
>>
>> return orig_pte;
>
>If we decide this is all worth the trouble, then I think we can (and *should*,
>in order to be consistent) pull a similar stunt in contpte_ptep_get_lockless().
OK. This function seems a bit more complicated; I'll try to work out how to
optimize it.
--
Thanks,
Xavier
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-16 12:54 ` Ryan Roberts
2025-04-16 16:09 ` Xavier
@ 2025-04-22 9:33 ` Xavier
2025-04-30 23:17 ` Barry Song
1 sibling, 1 reply; 49+ messages in thread
From: Xavier @ 2025-04-22 9:33 UTC (permalink / raw)
To: Ryan Roberts, dev.jain, ioworker0, 21cnbao
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, will, willy, ziy
Hi all,
At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> This commit optimizes the contpte_ptep_get function by adding early
>> termination logic. It checks if the dirty and young bits of orig_pte
>> are already set and skips redundant bit-setting operations during
>> the loop. This reduces unnecessary iterations and improves performance.
>>
>> Signed-off-by: Xavier <xavier_qy@163.com>
>> ---
>> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> 1 file changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> + if (pte_##flag(__ptep_get(ptep))) { \
>> + orig_pte = pte_mk##flag(orig_pte); \
>> + break; \
>> + } \
>> + }
>
>I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>personally prefer to see the 2 extra loops open coded below.
>
>Or even better, could you provide results comparing this 3 loop version to the
>simpler approach I suggested previously? If the performance is similar (which I
>expect it will be, especially given Barry's point that your test always ensures
>the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
>
Based on the discussions in the previous email, two modifications were adopted
and tested, and the results are as follows:
Modification 1
pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
pte_t pte;
int i;
ptep = contpte_align_down(ptep);
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
if (pte_young(orig_pte))
break;
}
if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
if (pte_dirty(orig_pte))
break;
}
}
return orig_pte;
}
Modification 2
pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
pte_t pte;
int i;
ptep = contpte_align_down(ptep);
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
for (; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
break;
}
}
break;
}
if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
i++;
ptep++;
for (; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
break;
}
}
break;
}
}
return orig_pte;
}
Test Code:
#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT 8
void rwdata(char *buf)
{
for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
buf[i] = 'a';
volatile char c = buf[i];
}
}
void clear_young_dirty(char *buf)
{
if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
perror("madvise free failed");
free(buf);
exit(EXIT_FAILURE);
}
if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
perror("madvise cold failed");
free(buf);
exit(EXIT_FAILURE);
}
}
void set_one_young(char *buf)
{
for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
}
}
void test_contpte_perf() {
char *buf;
int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
perror("posix_memalign failed");
exit(EXIT_FAILURE);
}
rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
clear_young_dirty(buf);
#endif
#if TEST_CASE2
set_one_young(buf);
#endif
for (int j = 0; j < 500; j++) {
mlock(buf, TEST_SIZE);
munlock(buf, TEST_SIZE);
}
free(buf);
}
---
Descriptions of three test scenarios
Scenario 1
The data of all 16 PTEs are both dirty and young.
#define TEST_CASE2 0
#define TEST_CASE3 0
Scenario 2
Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
#define TEST_CASE2 1
#define TEST_CASE3 0
Scenario 3
Among the 16 PTEs, there are neither young nor dirty ones.
#define TEST_CASE2 0
#define TEST_CASE3 1
Test results
|Scenario 1 | Original| Modification 1| Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions | 37912436160| 18303833386| 18731580031|
|test time | 4.2797| 2.2687| 2.2949|
|overhead of | | | |
|contpte_ptep_get() | 21.31%| 4.72%| 4.80%|
|Scenario 2 | Original| Modification 1| Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions | 36701270862| 38729716276| 36115790086|
|test time | 3.2335| 3.5732| 3.0874|
|Overhead of | | | |
|contpte_ptep_get() | 32.26%| 41.35%| 33.57%|
|Scenario 3 | Original| Modification 1| Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions | 36706279735| 38305241759| 36750881878|
|test time | 3.2008| 3.5389| 3.1249|
|Overhead of | | | |
|contpte_ptep_get() | 31.94%| 41.30%| 34.59%|
For Scenario 1, Modification 1 can achieve an instruction count benefit of
51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
benefit of 50.59% and a time benefit of 46.38%.
For Scenario 2, Modification 2 can achieve an instruction count benefit of
1.6% and a time benefit of 4.5%, while Modification 1 significantly increases
the instructions and time due to additional conditional checks.
For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
the branches taken by Modification 1 and Modification 2 should be the same as
those of the original code. In fact, the test results of Modification 2 seem
to be closer to those of the original code. I don't know why there is a
performance regression in Modification 1.
Therefore, I believe modifying the code according to Modification 2 can bring
maximum benefits. Everyone can discuss whether this approach is acceptable,
and if so, I will send Patch V4 to proceed with submitting this modification.
--
Thanks,
Xavier
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-22 9:33 ` Xavier
@ 2025-04-30 23:17 ` Barry Song
2025-05-01 12:39 ` Xavier
0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2025-04-30 23:17 UTC (permalink / raw)
To: Xavier
Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
>
>
> Hi all,
>
>
> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> >On 15/04/2025 09:22, Xavier wrote:
> >> This commit optimizes the contpte_ptep_get function by adding early
> >> termination logic. It checks if the dirty and young bits of orig_pte
> >> are already set and skips redundant bit-setting operations during
> >> the loop. This reduces unnecessary iterations and improves performance.
> >>
> >> Signed-off-by: Xavier <xavier_qy@163.com>
> >> ---
> >> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> >> 1 file changed, 18 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >> index bcac4f55f9c1..0acfee604947 100644
> >> --- a/arch/arm64/mm/contpte.c
> >> +++ b/arch/arm64/mm/contpte.c
> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> >> }
> >> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> >>
> >> +/* Note: in order to improve efficiency, using this macro will modify the
> >> + * passed-in parameters.*/
> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> >> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> >> + if (pte_##flag(__ptep_get(ptep))) { \
> >> + orig_pte = pte_mk##flag(orig_pte); \
> >> + break; \
> >> + } \
> >> + }
> >
> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> >personally prefer to see the 2 extra loops open coded below.
> >
> >Or even better, could you provide results comparing this 3 loop version to the
> >simpler approach I suggested previously? If the performance is similar (which I
> >expect it will be, especially given Barry's point that your test always ensures
> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> >
>
> Based on the discussions in the previous email, two modifications were adopted
> and tested, and the results are as follows:
>
> Modification 1
>
> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> {
> pte_t pte;
> int i;
>
> ptep = contpte_align_down(ptep);
>
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> if (pte_young(orig_pte))
> break;
> }
>
> if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> if (pte_dirty(orig_pte))
> break;
> }
> }
>
> return orig_pte;
> }
>
> Modification 2
>
> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> {
> pte_t pte;
> int i;
>
> ptep = contpte_align_down(ptep);
>
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> for (; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
> if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> break;
> }
> }
> break;
> }
>
> if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> i++;
> ptep++;
> for (; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
> if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> break;
> }
> }
> break;
> }
> }
>
> return orig_pte;
> }
>
> Test Code:
>
> #define PAGE_SIZE 4096
> #define CONT_PTES 16
> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> #define YOUNG_BIT 8
> void rwdata(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> buf[i] = 'a';
> volatile char c = buf[i];
> }
> }
> void clear_young_dirty(char *buf)
> {
> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> perror("madvise free failed");
> free(buf);
> exit(EXIT_FAILURE);
> }
> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> perror("madvise cold failed");
> free(buf);
> exit(EXIT_FAILURE);
> }
> }
> void set_one_young(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> }
> }
>
> void test_contpte_perf() {
> char *buf;
> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
> if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
> perror("posix_memalign failed");
> exit(EXIT_FAILURE);
> }
>
> rwdata(buf);
> #if TEST_CASE2 || TEST_CASE3
> clear_young_dirty(buf);
> #endif
> #if TEST_CASE2
> set_one_young(buf);
> #endif
>
> for (int j = 0; j < 500; j++) {
> mlock(buf, TEST_SIZE);
>
> munlock(buf, TEST_SIZE);
> }
> free(buf);
> }
> ---
>
> Descriptions of three test scenarios
>
> Scenario 1
> The data of all 16 PTEs are both dirty and young.
> #define TEST_CASE2 0
> #define TEST_CASE3 0
>
> Scenario 2
> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> #define TEST_CASE2 1
> #define TEST_CASE3 0
>
> Scenario 3
> Among the 16 PTEs, there are neither young nor dirty ones.
> #define TEST_CASE2 0
> #define TEST_CASE3 1
>
>
> Test results
>
> |Scenario 1 | Original| Modification 1| Modification 2|
> |-------------------|---------------|----------------|----------------|
> |instructions | 37912436160| 18303833386| 18731580031|
> |test time | 4.2797| 2.2687| 2.2949|
> |overhead of | | | |
> |contpte_ptep_get() | 21.31%| 4.72%| 4.80%|
>
> |Scenario 2 | Original| Modification 1| Modification 2|
> |-------------------|---------------|----------------|----------------|
> |instructions | 36701270862| 38729716276| 36115790086|
> |test time | 3.2335| 3.5732| 3.0874|
> |Overhead of | | | |
> |contpte_ptep_get() | 32.26%| 41.35%| 33.57%|
>
> |Scenario 3 | Original| Modification 1| Modification 2|
> |-------------------|---------------|----------------|----------------|
> |instructions | 36706279735| 38305241759| 36750881878|
> |test time | 3.2008| 3.5389| 3.1249|
> |Overhead of | | | |
> |contpte_ptep_get() | 31.94%| 41.30%| 34.59%|
>
>
> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> benefit of 50.59% and a time benefit of 46.38%.
>
> For Scenario 2, Modification 2 can achieve an instruction count benefit of
> 1.6% and a time benefit of 4.5%, while Modification 1 significantly increases
> the instructions and time due to additional conditional checks.
>
> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> the branches taken by Modification 1 and Modification 2 should be the same as
> those of the original code. In fact, the test results of Modification 2 seem
> to be closer to those of the original code. I don't know why there is a
> performance regression in Modification 1.
>
> Therefore, I believe modifying the code according to Modification 2 can bring
> maximum benefits. Everyone can discuss whether this approach is acceptable,
> and if so, I will send Patch V4 to proceed with submitting this modification.
>
Modification 2 is not correct: if pte0~pte14 are all young and none
is dirty, we end up with lots of useless iterations of
"for (; i < CONT_PTES; i++, ptep++)"
if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
i++;
ptep++;
for (; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
break;
}
}
break;
}
> Thanks,
> Xavier
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-30 23:17 ` Barry Song
@ 2025-05-01 12:39 ` Xavier
2025-05-01 21:19 ` Barry Song
0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-05-01 12:39 UTC (permalink / raw)
To: Barry Song
Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
Hi Barry,
At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
>On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
>>
>>
>> Hi all,
>>
>>
>> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>> >On 15/04/2025 09:22, Xavier wrote:
>> >> This commit optimizes the contpte_ptep_get function by adding early
>> >> termination logic. It checks if the dirty and young bits of orig_pte
>> >> are already set and skips redundant bit-setting operations during
>> >> the loop. This reduces unnecessary iterations and improves performance.
>> >>
>> >> Signed-off-by: Xavier <xavier_qy@163.com>
>> >> ---
>> >> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> >> 1 file changed, 18 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> >> index bcac4f55f9c1..0acfee604947 100644
>> >> --- a/arch/arm64/mm/contpte.c
>> >> +++ b/arch/arm64/mm/contpte.c
>> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> >> }
>> >> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>> >>
>> >> +/* Note: in order to improve efficiency, using this macro will modify the
>> >> + * passed-in parameters.*/
>> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> >> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> >> + if (pte_##flag(__ptep_get(ptep))) { \
>> >> + orig_pte = pte_mk##flag(orig_pte); \
>> >> + break; \
>> >> + } \
>> >> + }
>> >
>> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>> >personally prefer to see the 2 extra loops open coded below.
>> >
>> >Or even better, could you provide results comparing this 3 loop version to the
>> >simpler approach I suggested previously? If the performance is similar (which I
>> >expect it will be, especially given Barry's point that your test always ensures
>> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
>> >
>>
>> Based on the discussions in the previous email, two modifications were adopted
>> and tested, and the results are as follows:
>>
>> Modification 1
>>
>> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> {
>> pte_t pte;
>> int i;
>>
>> ptep = contpte_align_down(ptep);
>>
>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>>
>> if (pte_dirty(pte)) {
>> orig_pte = pte_mkdirty(orig_pte);
>> if (pte_young(orig_pte))
>> break;
>> }
>>
>> if (pte_young(pte)) {
>> orig_pte = pte_mkyoung(orig_pte);
>> if (pte_dirty(orig_pte))
>> break;
>> }
>> }
>>
>> return orig_pte;
>> }
>>
>> Modification 2
>>
>> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> {
>> pte_t pte;
>> int i;
>>
>> ptep = contpte_align_down(ptep);
>>
>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>>
>> if (pte_dirty(pte)) {
>> orig_pte = pte_mkdirty(orig_pte);
>> for (; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>> if (pte_young(pte)) {
>> orig_pte = pte_mkyoung(orig_pte);
>> break;
>> }
>> }
>> break;
>> }
>>
>> if (pte_young(pte)) {
>> orig_pte = pte_mkyoung(orig_pte);
>> i++;
>> ptep++;
>> for (; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>> if (pte_dirty(pte)) {
>> orig_pte = pte_mkdirty(orig_pte);
>> break;
>> }
>> }
>> break;
>> }
>> }
>>
>> return orig_pte;
>> }
>>
>> Test Code:
>>
>> #define PAGE_SIZE 4096
>> #define CONT_PTES 16
>> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>> #define YOUNG_BIT 8
>> void rwdata(char *buf)
>> {
>> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> buf[i] = 'a';
>> volatile char c = buf[i];
>> }
>> }
>> void clear_young_dirty(char *buf)
>> {
>> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>> perror("madvise free failed");
>> free(buf);
>> exit(EXIT_FAILURE);
>> }
>> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>> perror("madvise cold failed");
>> free(buf);
>> exit(EXIT_FAILURE);
>> }
>> }
>> void set_one_young(char *buf)
>> {
>> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>> }
>> }
>>
>> void test_contpte_perf() {
>> char *buf;
>> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
>> if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
>> perror("posix_memalign failed");
>> exit(EXIT_FAILURE);
>> }
>>
>> rwdata(buf);
>> #if TEST_CASE2 || TEST_CASE3
>> clear_young_dirty(buf);
>> #endif
>> #if TEST_CASE2
>> set_one_young(buf);
>> #endif
>>
>> for (int j = 0; j < 500; j++) {
>> mlock(buf, TEST_SIZE);
>>
>> munlock(buf, TEST_SIZE);
>> }
>> free(buf);
>> }
>> ---
>>
>> Descriptions of three test scenarios
>>
>> Scenario 1
>> The data of all 16 PTEs are both dirty and young.
>> #define TEST_CASE2 0
>> #define TEST_CASE3 0
>>
>> Scenario 2
>> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
>> #define TEST_CASE2 1
>> #define TEST_CASE3 0
>>
>> Scenario 3
>> Among the 16 PTEs, there are neither young nor dirty ones.
>> #define TEST_CASE2 0
>> #define TEST_CASE3 1
>>
>>
>> Test results
>>
>> |Scenario 1 | Original| Modification 1| Modification 2|
>> |-------------------|---------------|----------------|----------------|
>> |instructions | 37912436160| 18303833386| 18731580031|
>> |test time | 4.2797| 2.2687| 2.2949|
>> |overhead of | | | |
>> |contpte_ptep_get() | 21.31%| 4.72%| 4.80%|
>>
>> |Scenario 2 | Original| Modification 1| Modification 2|
>> |-------------------|---------------|----------------|----------------|
>> |instructions | 36701270862| 38729716276| 36115790086|
>> |test time | 3.2335| 3.5732| 3.0874|
>> |Overhead of | | | |
>> |contpte_ptep_get() | 32.26%| 41.35%| 33.57%|
>>
>> |Scenario 3 | Original| Modification 1| Modification 2|
>> |-------------------|---------------|----------------|----------------|
>> |instructions | 36706279735| 38305241759| 36750881878|
>> |test time | 3.2008| 3.5389| 3.1249|
>> |Overhead of | | | |
>> |contpte_ptep_get() | 31.94%| 41.30%| 34.59%|
>>
>>
>> For Scenario 1, Modification 1 can achieve an instruction count benefit of
>> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
>> benefit of 50.59% and a time benefit of 46.38%.
>>
>> For Scenario 2, Modification 2 can achieve an instruction count benefit of
>> 1.6% and a time benefit of 4.5%, while Modification 1 significantly increases
>> the instructions and time due to additional conditional checks.
>>
>> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
>> the branches taken by Modification 1 and Modification 2 should be the same as
>> those of the original code. In fact, the test results of Modification 2 seem
>> to be closer to those of the original code. I don't know why there is a
>> performance regression in Modification 1.
>>
>> Therefore, I believe modifying the code according to Modification 2 can bring
>> maximum benefits. Everyone can discuss whether this approach is acceptable,
>> and if so, I will send Patch V4 to proceed with submitting this modification.
>>
>
>modification 2 is not correct. if pte0~pte14 are all young and no one
>is dirty, we are
>having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
>
> if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> i++;
> ptep++;
> for (; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
> if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> break;
> }
> }
> break;
> }
>
I didn't quite follow which part you were referring to when you said there
were a lot of useless loops. In the scenario you mentioned, "if pte0~pte14 are
all young and no one is dirty", Modification 2 will enter the following branch
when checking pte0:
if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
// The dirty status of pte0 has already been checked, skip it.
i++;
ptep++;
// Then we only need to check whether pte1~pte15 are dirty.
for (; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
if (pte_dirty(pte)) {
// Exit as soon as a dirty entry is found.
orig_pte = pte_mkdirty(orig_pte);
break;
}
}
// Exit directly here without going through the outer loop again.
break;
}
In this scenario, Modification 2 performs nearly half as many flag checks as
the original code. Have I understood your concern correctly?
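To double-check the "nearly half" estimate, the flag-check counts can be
modeled in plain userspace C. This is only an illustrative sketch, not kernel
code: the young/dirty bits of the 16 PTEs are reduced to two bitmasks, and
pte_young()/pte_dirty() become bit tests; scan_original/scan_mod2 and the
helper names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define CONT_PTES 16

/* Original loop: tests both flags of every PTE, 2 * CONT_PTES checks. */
static int scan_original(uint16_t young, uint16_t dirty, int *checks)
{
	int res = 0;

	*checks = 0;
	for (int i = 0; i < CONT_PTES; i++) {
		(*checks)++;
		if ((dirty >> i) & 1)
			res |= 2;
		(*checks)++;
		if ((young >> i) & 1)
			res |= 1;
	}
	return res;	/* bit 0: young, bit 1: dirty */
}

/*
 * Modification 2: once one flag is found, only the other remains unknown,
 * so a dedicated inner loop searches for it and the scan stops as soon as
 * both are known (or the block ends).
 */
static int scan_mod2(uint16_t young, uint16_t dirty, int *checks)
{
	int res = 0;

	*checks = 0;
	for (int i = 0; i < CONT_PTES; i++) {
		(*checks)++;
		if ((dirty >> i) & 1) {
			res |= 2;
			/* young of pte i is still untested: restart at i */
			for (; i < CONT_PTES; i++) {
				(*checks)++;
				if ((young >> i) & 1) {
					res |= 1;
					break;
				}
			}
			break;
		}
		(*checks)++;
		if ((young >> i) & 1) {
			res |= 1;
			/* dirty of pte i was already tested: skip it */
			for (i++; i < CONT_PTES; i++) {
				(*checks)++;
				if ((dirty >> i) & 1) {
					res |= 2;
					break;
				}
			}
			break;
		}
	}
	return res;
}

static int orig_checks(uint16_t y, uint16_t d)
{
	int c;

	scan_original(y, d, &c);
	return c;
}

static int mod2_checks(uint16_t y, uint16_t d)
{
	int c;

	scan_mod2(y, d, &c);
	return c;
}

/* Sampled check that both versions report the same young/dirty state. */
static int scans_agree(void)
{
	int c1, c2;

	for (unsigned int y = 0; y < 1u << CONT_PTES; y += 257)
		for (unsigned int d = 0; d < 1u << CONT_PTES; d += 509)
			if (scan_original(y, d, &c1) != scan_mod2(y, d, &c2))
				return 0;
	return 1;
}
```

On the masks matching the scenarios above, this gives 2 checks instead of 32
for the all-young-and-dirty case, 17 instead of 32 for "pte0~pte14 young, none
dirty", and 25 instead of 32 for "only pte8 young".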
--
Thanks,
Xavier
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-05-01 12:39 ` Xavier
@ 2025-05-01 21:19 ` Barry Song
2025-05-01 21:32 ` Barry Song
0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2025-05-01 21:19 UTC (permalink / raw)
To: Xavier
Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
>
>
>
> Hi Barry,
>
>
> At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
> >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
> >>
> >>
> >> Hi all,
> >>
> >>
> >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> >> >On 15/04/2025 09:22, Xavier wrote:
> >> >> This commit optimizes the contpte_ptep_get function by adding early
> >> >> termination logic. It checks if the dirty and young bits of orig_pte
> >> >> are already set and skips redundant bit-setting operations during
> >> >> the loop. This reduces unnecessary iterations and improves performance.
> >> >>
> >> >> Signed-off-by: Xavier <xavier_qy@163.com>
> >> >> ---
> >> >> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> >> >> 1 file changed, 18 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >> >> index bcac4f55f9c1..0acfee604947 100644
> >> >> --- a/arch/arm64/mm/contpte.c
> >> >> +++ b/arch/arm64/mm/contpte.c
> >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> >> >> }
> >> >> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> >> >>
> >> >> +/* Note: in order to improve efficiency, using this macro will modify the
> >> >> + * passed-in parameters.*/
> >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> >> >> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> >> >> + if (pte_##flag(__ptep_get(ptep))) { \
> >> >> + orig_pte = pte_mk##flag(orig_pte); \
> >> >> + break; \
> >> >> + } \
> >> >> + }
> >> >
> >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> >> >personally prefer to see the 2 extra loops open coded below.
> >> >
> >> >Or even better, could you provide results comparing this 3 loop version to the
> >> >simpler approach I suggested previously? If the performance is similar (which I
> >> >expect it will be, especially given Barry's point that your test always ensures
> >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> >> >
> >>
> >> Based on the discussions in the previous email, two modifications were adopted
> >> and tested, and the results are as follows:
> >>
> >> Modification 1
> >>
> >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> {
> >> pte_t pte;
> >> int i;
> >>
> >> ptep = contpte_align_down(ptep);
> >>
> >> for (i = 0; i < CONT_PTES; i++, ptep++) {
> >> pte = __ptep_get(ptep);
> >>
> >> if (pte_dirty(pte)) {
> >> orig_pte = pte_mkdirty(orig_pte);
> >> if (pte_young(orig_pte))
> >> break;
> >> }
> >>
> >> if (pte_young(pte)) {
> >> orig_pte = pte_mkyoung(orig_pte);
> >> if (pte_dirty(orig_pte))
> >> break;
> >> }
> >> }
> >>
> >> return orig_pte;
> >> }
> >>
> >> Modification 2
> >>
> >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> {
> >> pte_t pte;
> >> int i;
> >>
> >> ptep = contpte_align_down(ptep);
> >>
> >> for (i = 0; i < CONT_PTES; i++, ptep++) {
> >> pte = __ptep_get(ptep);
> >>
> >> if (pte_dirty(pte)) {
> >> orig_pte = pte_mkdirty(orig_pte);
> >> for (; i < CONT_PTES; i++, ptep++) {
> >> pte = __ptep_get(ptep);
> >> if (pte_young(pte)) {
> >> orig_pte = pte_mkyoung(orig_pte);
> >> break;
> >> }
> >> }
> >> break;
> >> }
> >>
> >> if (pte_young(pte)) {
> >> orig_pte = pte_mkyoung(orig_pte);
> >> i++;
> >> ptep++;
> >> for (; i < CONT_PTES; i++, ptep++) {
> >> pte = __ptep_get(ptep);
> >> if (pte_dirty(pte)) {
> >> orig_pte = pte_mkdirty(orig_pte);
> >> break;
> >> }
> >> }
> >> break;
> >> }
> >> }
> >>
> >> return orig_pte;
> >> }
> >>
> >> Test Code:
> >>
> >> #define PAGE_SIZE 4096
> >> #define CONT_PTES 16
> >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> >> #define YOUNG_BIT 8
> >> void rwdata(char *buf)
> >> {
> >> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> >> buf[i] = 'a';
> >> volatile char c = buf[i];
> >> }
> >> }
> >> void clear_young_dirty(char *buf)
> >> {
> >> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> >> perror("madvise free failed");
> >> free(buf);
> >> exit(EXIT_FAILURE);
> >> }
> >> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> >> perror("madvise free failed");
> >> free(buf);
> >> exit(EXIT_FAILURE);
> >> }
> >> }
> >> void set_one_young(char *buf)
> >> {
> >> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> >> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> >> }
> >> }
> >>
> >> void test_contpte_perf() {
> >> char *buf;
> >> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
> >> if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
> >> perror("posix_memalign failed");
> >> exit(EXIT_FAILURE);
> >> }
> >>
> >> rwdata(buf);
> >> #if TEST_CASE2 || TEST_CASE3
> >> clear_young_dirty(buf);
> >> #endif
> >> #if TEST_CASE2
> >> set_one_young(buf);
> >> #endif
> >>
> >> for (int j = 0; j < 500; j++) {
> >> mlock(buf, TEST_SIZE);
> >>
> >> munlock(buf, TEST_SIZE);
> >> }
> >> free(buf);
> >> }
> >> ---
> >>
> >> Descriptions of three test scenarios
> >>
> >> Scenario 1
> >> The data of all 16 PTEs are both dirty and young.
> >> #define TEST_CASE2 0
> >> #define TEST_CASE3 0
> >>
> >> Scenario 2
> >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> >> #define TEST_CASE2 1
> >> #define TEST_CASE3 0
> >>
> >> Scenario 3
> >> Among the 16 PTEs, there are neither young nor dirty ones.
> >> #define TEST_CASE2 0
> >> #define TEST_CASE3 1
> >>
> >>
> >> Test results
> >>
> >> |Scenario 1 | Original| Modification 1| Modification 2|
> >> |-------------------|---------------|----------------|----------------|
> >> |instructions | 37912436160| 18303833386| 18731580031|
> >> |test time | 4.2797| 2.2687| 2.2949|
> >> |overhead of | | | |
> >> |contpte_ptep_get() | 21.31%| 4.72%| 4.80%|
> >>
> >> |Scenario 2 | Original| Modification 1| Modification 2|
> >> |-------------------|---------------|----------------|----------------|
> >> |instructions | 36701270862| 38729716276| 36115790086|
> >> |test time | 3.2335| 3.5732| 3.0874|
> >> |Overhead of | | | |
> >> |contpte_ptep_get() | 32.26%| 41.35%| 33.57%|
> >>
> >> |Scenario 3 | Original| Modification 1| Modification 2|
> >> |-------------------|---------------|----------------|----------------|
> >> |instructions | 36706279735| 38305241759| 36750881878|
> >> |test time | 3.2008| 3.5389| 3.1249|
> >> |Overhead of | | | |
> >> |contpte_ptep_get() | 31.94%| 41.30%| 34.59%|
> >>
> >>
> >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> >> benefit of 50.59% and a time benefit of 46.38%.
> >>
> >> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
> >> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
> >> the instructions and time due to additional conditional checks.
> >>
> >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> >> the branches taken by Modification 1 and Modification 2 should be the same as
> >> those of the original code. In fact, the test results of Modification 2 seem
> >> to be closer to those of the original code. I don't know why there is a
> >> performance regression in Modification 1.
> >>
> >> Therefore, I believe modifying the code according to Modification 2 can bring
> >> maximum benefits. Everyone can discuss whether this approach is acceptable,
> >> and if so, I will send Patch V4 to proceed with submitting this modification.
> >>
> >
> >modification 2 is not correct. if pte0~pte14 are all young and no one
> >is dirty, we are
> >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
> >
> > if (pte_young(pte)) {
> > orig_pte = pte_mkyoung(orig_pte);
> > i++;
> > ptep++;
> > for (; i < CONT_PTES; i++, ptep++) {
> > pte = __ptep_get(ptep);
> > if (pte_dirty(pte)) {
> > orig_pte = pte_mkdirty(orig_pte);
> > break;
> > }
> > }
> > break;
> > }
> >
>
> I didn't understand which part you referred to when you said there were a lot of
> useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
> young and no one is dirty", Modification 2 will enter the following branch when
> judging pte0:
>
> if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> // The dirty status of pte0 has already been checked, skip it.
> i++;
> ptep++;
> // Then we only need to check whether pte1~pte15 are dirty.
> for (; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
> if (pte_dirty(pte)) {
> // Exit as soon as a dirty entry is found.
> orig_pte = pte_mkdirty(orig_pte);
> break;
> }
> }
> // Exit directly here without going through the outer loop again.
> break;
> }
>
> In this scenario, the total number of judgments in Modification 2 is nearly half less
> than that of the original code. I should have understood it correctly, right?
You're right, I missed the part where you also take a break even though
no PTE is dirty. Based on your data, Modification 2 seems to be good.
But I don't quite understand why you are doing
	i++;
	ptep++;
before "for (; i < CONT_PTES; i++, ptep++)".
It seems to be wrong: if i == 15, you will get i == 16, skipping the
pte_dirty check for i == 15. The same is true for any value between 0
and 15. My point is that you should drop it and re-test.
>
>
> --
>
> Thanks,
> Xavier
Thanks
barry
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-05-01 21:19 ` Barry Song
@ 2025-05-01 21:32 ` Barry Song
2025-05-04 2:39 ` Xavier
2025-05-08 7:03 ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
0 siblings, 2 replies; 49+ messages in thread
From: Barry Song @ 2025-05-01 21:32 UTC (permalink / raw)
To: Xavier
Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
On Fri, May 2, 2025 at 9:19 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
> >
> >
> >
> > Hi Barry,
> >
> >
> > At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
> > >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
> > >>
> > >>
> > >> Hi all,
> > >>
> > >>
> > >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> > >> >On 15/04/2025 09:22, Xavier wrote:
> > >> >> This commit optimizes the contpte_ptep_get function by adding early
> > >> >> termination logic. It checks if the dirty and young bits of orig_pte
> > >> >> are already set and skips redundant bit-setting operations during
> > >> >> the loop. This reduces unnecessary iterations and improves performance.
> > >> >>
> > >> >> Signed-off-by: Xavier <xavier_qy@163.com>
> > >> >> ---
> > >> >> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> > >> >> 1 file changed, 18 insertions(+), 2 deletions(-)
> > >> >>
> > >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > >> >> index bcac4f55f9c1..0acfee604947 100644
> > >> >> --- a/arch/arm64/mm/contpte.c
> > >> >> +++ b/arch/arm64/mm/contpte.c
> > >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> > >> >> }
> > >> >> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> > >> >>
> > >> >> +/* Note: in order to improve efficiency, using this macro will modify the
> > >> >> + * passed-in parameters.*/
> > >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> > >> >> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> > >> >> + if (pte_##flag(__ptep_get(ptep))) { \
> > >> >> + orig_pte = pte_mk##flag(orig_pte); \
> > >> >> + break; \
> > >> >> + } \
> > >> >> + }
> > >> >
> > >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> > >> >personally prefer to see the 2 extra loops open coded below.
> > >> >
> > >> >Or even better, could you provide results comparing this 3 loop version to the
> > >> >simpler approach I suggested previously? If the performance is similar (which I
> > >> >expect it will be, especially given Barry's point that your test always ensures
> > >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> > >> >
> > >>
> > >> Based on the discussions in the previous email, two modifications were adopted
> > >> and tested, and the results are as follows:
> > >>
> > >> Modification 1
> > >>
> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> > >> {
> > >> pte_t pte;
> > >> int i;
> > >>
> > >> ptep = contpte_align_down(ptep);
> > >>
> > >> for (i = 0; i < CONT_PTES; i++, ptep++) {
> > >> pte = __ptep_get(ptep);
> > >>
> > >> if (pte_dirty(pte)) {
> > >> orig_pte = pte_mkdirty(orig_pte);
> > >> if (pte_young(orig_pte))
> > >> break;
> > >> }
> > >>
> > >> if (pte_young(pte)) {
> > >> orig_pte = pte_mkyoung(orig_pte);
> > >> if (pte_dirty(orig_pte))
> > >> break;
> > >> }
> > >> }
> > >>
> > >> return orig_pte;
> > >> }
> > >>
> > >> Modification 2
> > >>
> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> > >> {
> > >> pte_t pte;
> > >> int i;
> > >>
> > >> ptep = contpte_align_down(ptep);
> > >>
> > >> for (i = 0; i < CONT_PTES; i++, ptep++) {
> > >> pte = __ptep_get(ptep);
> > >>
> > >> if (pte_dirty(pte)) {
> > >> orig_pte = pte_mkdirty(orig_pte);
> > >> for (; i < CONT_PTES; i++, ptep++) {
> > >> pte = __ptep_get(ptep);
> > >> if (pte_young(pte)) {
> > >> orig_pte = pte_mkyoung(orig_pte);
> > >> break;
> > >> }
> > >> }
> > >> break;
> > >> }
> > >>
> > >> if (pte_young(pte)) {
> > >> orig_pte = pte_mkyoung(orig_pte);
> > >> i++;
> > >> ptep++;
> > >> for (; i < CONT_PTES; i++, ptep++) {
> > >> pte = __ptep_get(ptep);
> > >> if (pte_dirty(pte)) {
> > >> orig_pte = pte_mkdirty(orig_pte);
> > >> break;
> > >> }
> > >> }
> > >> break;
> > >> }
> > >> }
> > >>
> > >> return orig_pte;
> > >> }
> > >>
> > >> Test Code:
> > >>
> > >> #define PAGE_SIZE 4096
> > >> #define CONT_PTES 16
> > >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> > >> #define YOUNG_BIT 8
> > >> void rwdata(char *buf)
> > >> {
> > >> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> > >> buf[i] = 'a';
> > >> volatile char c = buf[i];
> > >> }
> > >> }
> > >> void clear_young_dirty(char *buf)
> > >> {
> > >> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> > >> perror("madvise free failed");
> > >> free(buf);
> > >> exit(EXIT_FAILURE);
> > >> }
> > >> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> > >> perror("madvise free failed");
> > >> free(buf);
> > >> exit(EXIT_FAILURE);
> > >> }
> > >> }
> > >> void set_one_young(char *buf)
> > >> {
> > >> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> > >> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> > >> }
> > >> }
> > >>
> > >> void test_contpte_perf() {
> > >> char *buf;
> > >> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
> > >> if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
> > >> perror("posix_memalign failed");
> > >> exit(EXIT_FAILURE);
> > >> }
> > >>
> > >> rwdata(buf);
> > >> #if TEST_CASE2 || TEST_CASE3
> > >> clear_young_dirty(buf);
> > >> #endif
> > >> #if TEST_CASE2
> > >> set_one_young(buf);
> > >> #endif
> > >>
> > >> for (int j = 0; j < 500; j++) {
> > >> mlock(buf, TEST_SIZE);
> > >>
> > >> munlock(buf, TEST_SIZE);
> > >> }
> > >> free(buf);
> > >> }
> > >> ---
> > >>
> > >> Descriptions of three test scenarios
> > >>
> > >> Scenario 1
> > >> The data of all 16 PTEs are both dirty and young.
> > >> #define TEST_CASE2 0
> > >> #define TEST_CASE3 0
> > >>
> > >> Scenario 2
> > >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> > >> #define TEST_CASE2 1
> > >> #define TEST_CASE3 0
> > >>
> > >> Scenario 3
> > >> Among the 16 PTEs, there are neither young nor dirty ones.
> > >> #define TEST_CASE2 0
> > >> #define TEST_CASE3 1
> > >>
> > >>
> > >> Test results
> > >>
> > >> |Scenario 1 | Original| Modification 1| Modification 2|
> > >> |-------------------|---------------|----------------|----------------|
> > >> |instructions | 37912436160| 18303833386| 18731580031|
> > >> |test time | 4.2797| 2.2687| 2.2949|
> > >> |overhead of | | | |
> > >> |contpte_ptep_get() | 21.31%| 4.72%| 4.80%|
> > >>
> > >> |Scenario 2 | Original| Modification 1| Modification 2|
> > >> |-------------------|---------------|----------------|----------------|
> > >> |instructions | 36701270862| 38729716276| 36115790086|
> > >> |test time | 3.2335| 3.5732| 3.0874|
> > >> |Overhead of | | | |
> > >> |contpte_ptep_get() | 32.26%| 41.35%| 33.57%|
> > >>
> > >> |Scenario 3 | Original| Modification 1| Modification 2|
> > >> |-------------------|---------------|----------------|----------------|
> > >> |instructions | 36706279735| 38305241759| 36750881878|
> > >> |test time | 3.2008| 3.5389| 3.1249|
> > >> |Overhead of | | | |
> > >> |contpte_ptep_get() | 31.94%| 41.30%| 34.59%|
> > >>
> > >>
> > >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> > >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> > >> benefit of 50.59% and a time benefit of 46.38%.
> > >>
> > >> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
> > >> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
> > >> the instructions and time due to additional conditional checks.
> > >>
> > >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> > >> the branches taken by Modification 1 and Modification 2 should be the same as
> > >> those of the original code. In fact, the test results of Modification 2 seem
> > >> to be closer to those of the original code. I don't know why there is a
> > >> performance regression in Modification 1.
> > >>
> > >> Therefore, I believe modifying the code according to Modification 2 can bring
> > >> maximum benefits. Everyone can discuss whether this approach is acceptable,
> > >> and if so, I will send Patch V4 to proceed with submitting this modification.
> > >>
> > >
> > >modification 2 is not correct. if pte0~pte14 are all young and no one
> > >is dirty, we are
> > >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
> > >
> > > if (pte_young(pte)) {
> > > orig_pte = pte_mkyoung(orig_pte);
> > > i++;
> > > ptep++;
> > > for (; i < CONT_PTES; i++, ptep++) {
> > > pte = __ptep_get(ptep);
> > > if (pte_dirty(pte)) {
> > > orig_pte = pte_mkdirty(orig_pte);
> > > break;
> > > }
> > > }
> > > break;
> > > }
> > >
> >
> > I didn't understand which part you referred to when you said there were a lot of
> > useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
> > young and no one is dirty", Modification 2 will enter the following branch when
> > judging pte0:
> >
> > if (pte_young(pte)) {
> > orig_pte = pte_mkyoung(orig_pte);
> > // The dirty status of pte0 has already been checked, skip it.
> > i++;
> > ptep++;
> > // Then we only need to check whether pte1~pte15 are dirty.
> > for (; i < CONT_PTES; i++, ptep++) {
> > pte = __ptep_get(ptep);
> > if (pte_dirty(pte)) {
> > // Exit as soon as a dirty entry is found.
> > orig_pte = pte_mkdirty(orig_pte);
> > break;
> > }
> > }
> > // Exit directly here without going through the outer loop again.
> > break;
> > }
> >
> > In this scenario, the total number of judgments in Modification 2 is nearly half less
> > than that of the original code. I should have understood it correctly, right?
>
> You're right—I missed the part where you're also taking a break, even though
> no one is dirty. Based on your data, modification 2 seems to be good.
>
> But I don't quite understand why you are doing
> i++;
> ptep++;
> before for (; i < CONT_PTES; i++, ptep++).
>
> it seems to be wrong. if i==15, you will get i=16. you are skipping
> the pte_dirty
> check for i==15. it is also true for any value between 0 and 15. My
> point is that
> you should drop it and re-test.
Sorry, I missed that you already checked pte_dirty(pte) before calling
pte_young(). So please ignore my comment. The code's a bit hard to
follow now. :-)
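For what it's worth, the "i++; ptep++;" skip can be sanity-checked with a
small userspace sketch. This is illustrative only, not kernel code: the flags
become bitmasks, Modification 2 is flattened into a helper that searches for
the one remaining flag, and the result is compared against a full scan
(find_flag, scan and scan_naive are made-up names for the sketch).

```c
#include <assert.h>
#include <stdint.h>

#define CONT_PTES 16

/* Search for one flag from index 'from' onwards. */
static int find_flag(uint16_t mask, int from)
{
	for (int i = from; i < CONT_PTES; i++)
		if ((mask >> i) & 1)
			return 1;
	return 0;
}

/*
 * Flattened Modification 2. When pte i is young but not dirty, its
 * dirty bit has already been tested in this iteration, so the search
 * for a dirty PTE may start at i + 1 without missing anything.
 */
static int scan(uint16_t young, uint16_t dirty)
{
	for (int i = 0; i < CONT_PTES; i++) {
		if ((dirty >> i) & 1)	/* young of pte i untested: start at i */
			return 2 | find_flag(young, i);
		if ((young >> i) & 1)	/* dirty of pte i tested: start at i + 1 */
			return 1 | (find_flag(dirty, i + 1) << 1);
	}
	return 0;	/* bit 0: young, bit 1: dirty */
}

/* Reference: unconditionally scan the whole block for both flags. */
static int scan_naive(uint16_t young, uint16_t dirty)
{
	return (find_flag(dirty, 0) << 1) | find_flag(young, 0);
}

/* Sampled agreement check between the two versions. */
static int scans_agree(void)
{
	for (unsigned int y = 0; y < 1u << CONT_PTES; y += 131)
		for (unsigned int d = 0; d < 1u << CONT_PTES; d += 151)
			if (scan(y, d) != scan_naive(y, d))
				return 0;
	return 1;
}
```

The i == 15 case is covered too: with only pte15 young, the inner search
starts at 16 and simply finds nothing, which is correct because every dirty
bit up to and including pte15 has already been tested.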
>
> >
> >
> > --
> >
> > Thanks,
> > Xavier
>
Thanks
barry
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-05-01 21:32 ` Barry Song
@ 2025-05-04 2:39 ` Xavier
2025-05-08 1:29 ` Barry Song
2025-05-08 7:03 ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
1 sibling, 1 reply; 49+ messages in thread
From: Xavier @ 2025-05-04 2:39 UTC (permalink / raw)
To: Barry Song
Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
At 2025-05-02 05:32:50, "Barry Song" <21cnbao@gmail.com> wrote:
>On Fri, May 2, 2025 at 9:19 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
>> >
>> >
>> >
>> > Hi Barry,
>> >
>> >
>> > At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
>> > >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
>> > >>
>> > >>
>> > >> Hi all,
>> > >>
>> > >>
>> > >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>> > >> >On 15/04/2025 09:22, Xavier wrote:
>> > >> >> This commit optimizes the contpte_ptep_get function by adding early
>> > >> >> termination logic. It checks if the dirty and young bits of orig_pte
>> > >> >> are already set and skips redundant bit-setting operations during
>> > >> >> the loop. This reduces unnecessary iterations and improves performance.
>> > >> >>
>> > >> >> Signed-off-by: Xavier <xavier_qy@163.com>
>> > >> >> ---
>> > >> >> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> > >> >> 1 file changed, 18 insertions(+), 2 deletions(-)
>> > >> >>
>> > >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> > >> >> index bcac4f55f9c1..0acfee604947 100644
>> > >> >> --- a/arch/arm64/mm/contpte.c
>> > >> >> +++ b/arch/arm64/mm/contpte.c
>> > >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> > >> >> }
>> > >> >> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>> > >> >>
>> > >> >> +/* Note: in order to improve efficiency, using this macro will modify the
>> > >> >> + * passed-in parameters.*/
>> > >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> > >> >> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> > >> >> + if (pte_##flag(__ptep_get(ptep))) { \
>> > >> >> + orig_pte = pte_mk##flag(orig_pte); \
>> > >> >> + break; \
>> > >> >> + } \
>> > >> >> + }
>> > >> >
>> > >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>> > >> >personally prefer to see the 2 extra loops open coded below.
>> > >> >
>> > >> >Or even better, could you provide results comparing this 3 loop version to the
>> > >> >simpler approach I suggested previously? If the performance is similar (which I
>> > >> >expect it will be, especially given Barry's point that your test always ensures
>> > >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
>> > >> >
>> > >>
>> > >> Based on the discussions in the previous email, two modifications were adopted
>> > >> and tested, and the results are as follows:
>> > >>
>> > >> Modification 1
>> > >>
>> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> > >> {
>> > >> pte_t pte;
>> > >> int i;
>> > >>
>> > >> ptep = contpte_align_down(ptep);
>> > >>
>> > >> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> > >> pte = __ptep_get(ptep);
>> > >>
>> > >> if (pte_dirty(pte)) {
>> > >> orig_pte = pte_mkdirty(orig_pte);
>> > >> if (pte_young(orig_pte))
>> > >> break;
>> > >> }
>> > >>
>> > >> if (pte_young(pte)) {
>> > >> orig_pte = pte_mkyoung(orig_pte);
>> > >> if (pte_dirty(orig_pte))
>> > >> break;
>> > >> }
>> > >> }
>> > >>
>> > >> return orig_pte;
>> > >> }
>> > >>
>> > >> Modification 2
>> > >>
>> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> > >> {
>> > >> pte_t pte;
>> > >> int i;
>> > >>
>> > >> ptep = contpte_align_down(ptep);
>> > >>
>> > >> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> > >> pte = __ptep_get(ptep);
>> > >>
>> > >> if (pte_dirty(pte)) {
>> > >> orig_pte = pte_mkdirty(orig_pte);
>> > >> for (; i < CONT_PTES; i++, ptep++) {
>> > >> pte = __ptep_get(ptep);
>> > >> if (pte_young(pte)) {
>> > >> orig_pte = pte_mkyoung(orig_pte);
>> > >> break;
>> > >> }
>> > >> }
>> > >> break;
>> > >> }
>> > >>
>> > >> if (pte_young(pte)) {
>> > >> orig_pte = pte_mkyoung(orig_pte);
>> > >> i++;
>> > >> ptep++;
>> > >> for (; i < CONT_PTES; i++, ptep++) {
>> > >> pte = __ptep_get(ptep);
>> > >> if (pte_dirty(pte)) {
>> > >> orig_pte = pte_mkdirty(orig_pte);
>> > >> break;
>> > >> }
>> > >> }
>> > >> break;
>> > >> }
>> > >> }
>> > >>
>> > >> return orig_pte;
>> > >> }
>> > >>
>> > >> Test Code:
>> > >>
>> > >> #define PAGE_SIZE 4096
>> > >> #define CONT_PTES 16
>> > >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>> > >> #define YOUNG_BIT 8
>> > >> void rwdata(char *buf)
>> > >> {
>> > >> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> > >> buf[i] = 'a';
>> > >> volatile char c = buf[i];
>> > >> }
>> > >> }
>> > >> void clear_young_dirty(char *buf)
>> > >> {
>> > >> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>> > >> perror("madvise free failed");
>> > >> free(buf);
>> > >> exit(EXIT_FAILURE);
>> > >> }
>> > >> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>> > >> perror("madvise free failed");
>> > >> free(buf);
>> > >> exit(EXIT_FAILURE);
>> > >> }
>> > >> }
>> > >> void set_one_young(char *buf)
>> > >> {
>> > >> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>> > >> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>> > >> }
>> > >> }
>> > >>
>> > >> void test_contpte_perf() {
>> > >> char *buf;
>> > >> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
>> > >> if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
>> > >> perror("posix_memalign failed");
>> > >> exit(EXIT_FAILURE);
>> > >> }
>> > >>
>> > >> rwdata(buf);
>> > >> #if TEST_CASE2 || TEST_CASE3
>> > >> clear_young_dirty(buf);
>> > >> #endif
>> > >> #if TEST_CASE2
>> > >> set_one_young(buf);
>> > >> #endif
>> > >>
>> > >> for (int j = 0; j < 500; j++) {
>> > >> mlock(buf, TEST_SIZE);
>> > >>
>> > >> munlock(buf, TEST_SIZE);
>> > >> }
>> > >> free(buf);
>> > >> }
>> > >> ---
>> > >>
>> > >> Descriptions of three test scenarios
>> > >>
>> > >> Scenario 1
>> > >> The data of all 16 PTEs are both dirty and young.
>> > >> #define TEST_CASE2 0
>> > >> #define TEST_CASE3 0
>> > >>
>> > >> Scenario 2
>> > >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
>> > >> #define TEST_CASE2 1
>> > >> #define TEST_CASE3 0
>> > >>
>> > >> Scenario 3
>> > >> Among the 16 PTEs, there are neither young nor dirty ones.
>> > >> #define TEST_CASE2 0
>> > >> #define TEST_CASE3 1
>> > >>
>> > >>
>> > >> Test results
>> > >>
>> > >> |Scenario 1 | Original| Modification 1| Modification 2|
>> > >> |-------------------|---------------|----------------|----------------|
>> > >> |instructions | 37912436160| 18303833386| 18731580031|
>> > >> |test time | 4.2797| 2.2687| 2.2949|
>> > >> |overhead of | | | |
>> > >> |contpte_ptep_get() | 21.31%| 4.72%| 4.80%|
>> > >>
>> > >> |Scenario 2 | Original| Modification 1| Modification 2|
>> > >> |-------------------|---------------|----------------|----------------|
>> > >> |instructions | 36701270862| 38729716276| 36115790086|
>> > >> |test time | 3.2335| 3.5732| 3.0874|
>> > >> |Overhead of | | | |
>> > >> |contpte_ptep_get() | 32.26%| 41.35%| 33.57%|
>> > >>
>> > >> |Scenario 3 | Original| Modification 1| Modification 2|
>> > >> |-------------------|---------------|----------------|----------------|
>> > >> |instructions | 36706279735| 38305241759| 36750881878|
>> > >> |test time | 3.2008| 3.5389| 3.1249|
>> > >> |Overhead of | | | |
>> > >> |contpte_ptep_get() | 31.94%| 41.30%| 34.59%|
>> > >>
>> > >>
>> > >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
>> > >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
>> > >> benefit of 50.59% and a time benefit of 46.38%.
>> > >>
>> > >> For Scenario 2, Modification 2 can achieve an instruction count benefit of
>> > >> 1.6% and a time benefit of 4.5%, while Modification 1 significantly increases
>> > >> the instructions and time due to its additional conditional checks.
>> > >>
>> > >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
>> > >> the branches taken by Modification 1 and Modification 2 should be the same as
>> > >> those of the original code. In fact, the test results of Modification 2 seem
>> > >> to be closer to those of the original code. I don't know why there is a
>> > >> performance regression in Modification 1.
>> > >>
>> > >> Therefore, I believe modifying the code according to Modification 2 can bring
>> > >> maximum benefits. Everyone can discuss whether this approach is acceptable,
>> > >> and if so, I will send Patch V4 to proceed with submitting this modification.
>> > >>
>> > >
>> > >modification 2 is not correct. if pte0~pte14 are all young and no one
>> > >is dirty, we are
>> > >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
>> > >
>> > > if (pte_young(pte)) {
>> > > orig_pte = pte_mkyoung(orig_pte);
>> > > i++;
>> > > ptep++;
>> > > for (; i < CONT_PTES; i++, ptep++) {
>> > > pte = __ptep_get(ptep);
>> > > if (pte_dirty(pte)) {
>> > > orig_pte = pte_mkdirty(orig_pte);
>> > > break;
>> > > }
>> > > }
>> > > break;
>> > > }
>> > >
>> >
>> > I didn't understand which part you referred to when you said there were a lot of
>> > useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
>> > young and no one is dirty", Modification 2 will enter the following branch when
>> > judging pte0:
>> >
>> > if (pte_young(pte)) {
>> > orig_pte = pte_mkyoung(orig_pte);
>> > // The dirty status of pte0 has already been checked, skip it.
>> > i++;
>> > ptep++;
>> > // Then we only need to check whether pte1~pte15 are dirty.
>> > for (; i < CONT_PTES; i++, ptep++) {
>> > pte = __ptep_get(ptep);
>> > if (pte_dirty(pte)) {
>> > // Exit as soon as a dirty entry is found.
>> > orig_pte = pte_mkdirty(orig_pte);
>> > break;
>> > }
>> > }
>> > // Exit directly here without going through the outer loop again.
>> > break;
>> > }
>> >
>> > In this scenario, the total number of checks in Modification 2 is nearly half
>> > that of the original code. Have I understood your concern correctly?
>>
>> You're right—I missed the part where you're also taking a break, even though
>> no one is dirty. Based on your data, modification 2 seems to be good.
>>
>> But I don't quite understand why you are doing
>> i++;
>> ptep++;
>> before for (; i < CONT_PTES; i++, ptep++).
>>
>> it seems to be wrong. if i==15, you will get i=16. you are skipping
>> the pte_dirty
>> check for i==15. it is also true for any value between 0 and 15. My
>> point is that
>> you should drop it and re-test.
>
>Sorry, I missed that you already checked pte_dirty(pte) before calling
>pte_young(). So please ignore my comment. The code's a bit hard to
>follow now. :-)
>
Modification 2 is indeed a bit difficult to follow, but that is the trade-off for the
improved execution efficiency. Thanks for your careful review. :-)
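To make the control flow easier to follow, here is a user-space sketch (a mock using plain boolean flags rather than real kernel PTE accessors; `scan_mod2` mirrors the shape of Modification 2) that counts how many PTE reads each version performs:

```c
#include <stdbool.h>

#define CONT_PTES 16

/* Mock PTE: just the two flags contpte_ptep_get() accumulates. */
struct mock_pte {
	bool dirty;
	bool young;
};

/* Original shape: unconditionally examines all CONT_PTES entries. */
static int scan_original(const struct mock_pte *p, bool *dirty, bool *young)
{
	int reads = 0;

	*dirty = *young = false;
	for (int i = 0; i < CONT_PTES; i++) {
		reads++;
		if (p[i].dirty)
			*dirty = true;
		if (p[i].young)
			*young = true;
	}
	return reads;
}

/*
 * Modification 2 shape: once one flag is found, a tighter inner loop
 * searches only for the other flag, then the whole scan stops.  The
 * dirty branch re-reads the current entry (its young bit has not been
 * tested yet); the young branch starts at the next entry (the current
 * one is already known to be clean).
 */
static int scan_mod2(const struct mock_pte *p, bool *dirty, bool *young)
{
	int reads = 0;

	*dirty = *young = false;
	for (int i = 0; i < CONT_PTES; i++) {
		reads++;
		if (p[i].dirty) {
			*dirty = true;
			for (; i < CONT_PTES; i++) {
				reads++;
				if (p[i].young) {
					*young = true;
					break;
				}
			}
			break;
		}
		if (p[i].young) {
			*young = true;
			for (i++; i < CONT_PTES; i++) {
				reads++;
				if (p[i].dirty) {
					*dirty = true;
					break;
				}
			}
			break;
		}
	}
	return reads;
}
```

When every entry is dirty and young (Scenario 1), scan_mod2 stops after two reads while scan_original always performs sixteen; when no flags are set at all, both versions walk all sixteen entries, consistent with the Scenario 3 numbers above.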
--
Thanks,
Xavier
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations
2025-05-04 2:39 ` Xavier
@ 2025-05-08 1:29 ` Barry Song
0 siblings, 0 replies; 49+ messages in thread
From: Barry Song @ 2025-05-08 1:29 UTC (permalink / raw)
To: Xavier
Cc: Ryan Roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
On Sun, May 4, 2025 at 2:56 PM Xavier <xavier_qy@163.com> wrote:
> At 2025-05-02 05:32:50, "Barry Song" <21cnbao@gmail.com> wrote:
> >On Fri, May 2, 2025 at 9:19 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Fri, May 2, 2025 at 12:41 AM Xavier <xavier_qy@163.com> wrote:
> >> >
> >> >
> >> >
> >> > Hi Barry,
> >> >
> >> >
> >> > At 2025-05-01 07:17:36, "Barry Song" <21cnbao@gmail.com> wrote:
> >> > >On Tue, Apr 22, 2025 at 9:34 PM Xavier <xavier_qy@163.com> wrote:
> >> > >>
> >> > >>
> >> > >> Hi all,
> >> > >>
> >> > >>
> >> > >> At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
> >> > >> >On 15/04/2025 09:22, Xavier wrote:
> >> > >> >> This commit optimizes the contpte_ptep_get function by adding early
> >> > >> >> termination logic. It checks if the dirty and young bits of orig_pte
> >> > >> >> are already set and skips redundant bit-setting operations during
> >> > >> >> the loop. This reduces unnecessary iterations and improves performance.
> >> > >> >>
> >> > >> >> Signed-off-by: Xavier <xavier_qy@163.com>
> >> > >> >> ---
> >> > >> >> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> >> > >> >> 1 file changed, 18 insertions(+), 2 deletions(-)
> >> > >> >>
> >> > >> >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >> > >> >> index bcac4f55f9c1..0acfee604947 100644
> >> > >> >> --- a/arch/arm64/mm/contpte.c
> >> > >> >> +++ b/arch/arm64/mm/contpte.c
> >> > >> >> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> >> > >> >> }
> >> > >> >> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
> >> > >> >>
> >> > >> >> +/* Note: in order to improve efficiency, using this macro will modify the
> >> > >> >> + * passed-in parameters.*/
> >> > >> >> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> >> > >> >> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
> >> > >> >> + if (pte_##flag(__ptep_get(ptep))) { \
> >> > >> >> + orig_pte = pte_mk##flag(orig_pte); \
> >> > >> >> + break; \
> >> > >> >> + } \
> >> > >> >> + }
> >> > >> >
> >> > >> >I'm really not a fan of this macro, it just obfuscates what is going on. I'd
> >> > >> >personally prefer to see the 2 extra loops open coded below.
> >> > >> >
> >> > >> >Or even better, could you provide results comparing this 3 loop version to the
> >> > >> >simpler approach I suggested previously? If the performance is similar (which I
> >> > >> >expect it will be, especially given Barry's point that your test always ensures
> >> > >> >the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
> >> > >> >
> >> > >>
> >> > >> Based on the discussions in the previous email, two modifications were adopted
> >> > >> and tested, and the results are as follows:
> >> > >>
> >> > >> Modification 1
> >> > >>
> >> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> > >> {
> >> > >> pte_t pte;
> >> > >> int i;
> >> > >>
> >> > >> ptep = contpte_align_down(ptep);
> >> > >>
> >> > >> for (i = 0; i < CONT_PTES; i++, ptep++) {
> >> > >> pte = __ptep_get(ptep);
> >> > >>
> >> > >> if (pte_dirty(pte)) {
> >> > >> orig_pte = pte_mkdirty(orig_pte);
> >> > >> if (pte_young(orig_pte))
> >> > >> break;
> >> > >> }
> >> > >>
> >> > >> if (pte_young(pte)) {
> >> > >> orig_pte = pte_mkyoung(orig_pte);
> >> > >> if (pte_dirty(orig_pte))
> >> > >> break;
> >> > >> }
> >> > >> }
> >> > >>
> >> > >> return orig_pte;
> >> > >> }
> >> > >>
> >> > >> Modification 2
> >> > >>
> >> > >> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >> > >> {
> >> > >> pte_t pte;
> >> > >> int i;
> >> > >>
> >> > >> ptep = contpte_align_down(ptep);
> >> > >>
> >> > >> for (i = 0; i < CONT_PTES; i++, ptep++) {
> >> > >> pte = __ptep_get(ptep);
> >> > >>
> >> > >> if (pte_dirty(pte)) {
> >> > >> orig_pte = pte_mkdirty(orig_pte);
> >> > >> for (; i < CONT_PTES; i++, ptep++) {
> >> > >> pte = __ptep_get(ptep);
> >> > >> if (pte_young(pte)) {
> >> > >> orig_pte = pte_mkyoung(orig_pte);
> >> > >> break;
> >> > >> }
> >> > >> }
> >> > >> break;
> >> > >> }
> >> > >>
> >> > >> if (pte_young(pte)) {
> >> > >> orig_pte = pte_mkyoung(orig_pte);
> >> > >> i++;
> >> > >> ptep++;
> >> > >> for (; i < CONT_PTES; i++, ptep++) {
> >> > >> pte = __ptep_get(ptep);
> >> > >> if (pte_dirty(pte)) {
> >> > >> orig_pte = pte_mkdirty(orig_pte);
> >> > >> break;
> >> > >> }
> >> > >> }
> >> > >> break;
> >> > >> }
> >> > >> }
> >> > >>
> >> > >> return orig_pte;
> >> > >> }
> >> > >>
> >> > >> Test Code:
> >> > >>
> >> > >> #define PAGE_SIZE 4096
> >> > >> #define CONT_PTES 16
> >> > >> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> >> > >> #define YOUNG_BIT 8
> >> > >> void rwdata(char *buf)
> >> > >> {
> >> > >> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> >> > >> buf[i] = 'a';
> >> > >> volatile char c = buf[i];
> >> > >> }
> >> > >> }
> >> > >> void clear_young_dirty(char *buf)
> >> > >> {
> >> > >> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> >> > >> perror("madvise free failed");
> >> > >> free(buf);
> >> > >> exit(EXIT_FAILURE);
> >> > >> }
> >> > >> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> >> > >> perror("madvise cold failed");
> >> > >> free(buf);
> >> > >> exit(EXIT_FAILURE);
> >> > >> }
> >> > >> }
> >> > >> void set_one_young(char *buf)
> >> > >> {
> >> > >> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> >> > >> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> >> > >> }
> >> > >> }
> >> > >>
> >> > >> void test_contpte_perf() {
> >> > >> char *buf;
> >> > >> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
> >> > >> if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
> >> > >> perror("posix_memalign failed");
> >> > >> exit(EXIT_FAILURE);
> >> > >> }
> >> > >>
> >> > >> rwdata(buf);
> >> > >> #if TEST_CASE2 || TEST_CASE3
> >> > >> clear_young_dirty(buf);
> >> > >> #endif
> >> > >> #if TEST_CASE2
> >> > >> set_one_young(buf);
> >> > >> #endif
> >> > >>
> >> > >> for (int j = 0; j < 500; j++) {
> >> > >> mlock(buf, TEST_SIZE);
> >> > >>
> >> > >> munlock(buf, TEST_SIZE);
> >> > >> }
> >> > >> free(buf);
> >> > >> }
> >> > >> ---
> >> > >>
> >> > >> Descriptions of three test scenarios
> >> > >>
> >> > >> Scenario 1
> >> > >> The data of all 16 PTEs are both dirty and young.
> >> > >> #define TEST_CASE2 0
> >> > >> #define TEST_CASE3 0
> >> > >>
> >> > >> Scenario 2
> >> > >> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> >> > >> #define TEST_CASE2 1
> >> > >> #define TEST_CASE3 0
> >> > >>
> >> > >> Scenario 3
> >> > >> Among the 16 PTEs, there are neither young nor dirty ones.
> >> > >> #define TEST_CASE2 0
> >> > >> #define TEST_CASE3 1
> >> > >>
> >> > >>
> >> > >> Test results
> >> > >>
> >> > >> |Scenario 1 | Original| Modification 1| Modification 2|
> >> > >> |-------------------|---------------|----------------|----------------|
> >> > >> |instructions | 37912436160| 18303833386| 18731580031|
> >> > >> |test time | 4.2797| 2.2687| 2.2949|
> >> > >> |overhead of | | | |
> >> > >> |contpte_ptep_get() | 21.31%| 4.72%| 4.80%|
> >> > >>
> >> > >> |Scenario 2 | Original| Modification 1| Modification 2|
> >> > >> |-------------------|---------------|----------------|----------------|
> >> > >> |instructions | 36701270862| 38729716276| 36115790086|
> >> > >> |test time | 3.2335| 3.5732| 3.0874|
> >> > >> |Overhead of | | | |
> >> > >> |contpte_ptep_get() | 32.26%| 41.35%| 33.57%|
> >> > >>
> >> > >> |Scenario 3 | Original| Modification 1| Modification 2|
> >> > >> |-------------------|---------------|----------------|----------------|
> >> > >> |instructions | 36706279735| 38305241759| 36750881878|
> >> > >> |test time | 3.2008| 3.5389| 3.1249|
> >> > >> |Overhead of | | | |
> >> > >> |contpte_ptep_get() | 31.94%| 41.30%| 34.59%|
> >> > >>
> >> > >>
> >> > >> For Scenario 1, Modification 1 can achieve an instruction count benefit of
> >> > >> 51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
> >> > >> benefit of 50.59% and a time benefit of 46.38%.
> >> > >>
> >> > >> For Scenarios 2, Modification 2 can achieve an instruction count benefit of
> >> > >> 1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
> >> > >> the instructions and time due to additional conditional checks.
> >> > >>
> >> > >> For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
> >> > >> the branches taken by Modification 1 and Modification 2 should be the same as
> >> > >> those of the original code. In fact, the test results of Modification 2 seem
> >> > >> to be closer to those of the original code. I don't know why there is a
> >> > >> performance regression in Modification 1.
> >> > >>
> >> > >> Therefore, I believe modifying the code according to Modification 2 can bring
> >> > >> maximum benefits. Everyone can discuss whether this approach is acceptable,
> >> > >> and if so, I will send Patch V4 to proceed with submitting this modification.
> >> > >>
> >> > >
> >> > >modification 2 is not correct. if pte0~pte14 are all young and no one
> >> > >is dirty, we are
> >> > >having lots of useless "for (; i < CONT_PTES; i++, ptep++)"
> >> > >
> >> > > if (pte_young(pte)) {
> >> > > orig_pte = pte_mkyoung(orig_pte);
> >> > > i++;
> >> > > ptep++;
> >> > > for (; i < CONT_PTES; i++, ptep++) {
> >> > > pte = __ptep_get(ptep);
> >> > > if (pte_dirty(pte)) {
> >> > > orig_pte = pte_mkdirty(orig_pte);
> >> > > break;
> >> > > }
> >> > > }
> >> > > break;
> >> > > }
> >> > >
> >> >
> >> > I didn't understand which part you referred to when you said there were a lot of
> >> > useless loops. According to the scenario you mentioned, "if pte0~pte14 are all
> >> > young and no one is dirty", Modification 2 will enter the following branch when
> >> > judging pte0:
> >> >
> >> > if (pte_young(pte)) {
> >> > orig_pte = pte_mkyoung(orig_pte);
> >> > // The dirty status of pte0 has already been checked, skip it.
> >> > i++;
> >> > ptep++;
> >> > // Then we only need to check whether pte1~pte15 are dirty.
> >> > for (; i < CONT_PTES; i++, ptep++) {
> >> > pte = __ptep_get(ptep);
> >> > if (pte_dirty(pte)) {
> >> > // Exit as soon as a dirty entry is found.
> >> > orig_pte = pte_mkdirty(orig_pte);
> >> > break;
> >> > }
> >> > }
> >> > // Exit directly here without going through the outer loop again.
> >> > break;
> >> > }
> >> >
> >> > In this scenario, the total number of judgments in Modification 2 is nearly half less
> >> > than that of the original code. I should have understood it correctly, right?
> >>
> >> You're right—I missed the part where you're also taking a break, even though
> >> no one is dirty. Based on your data, modification 2 seems to be good.
> >>
> >> But I don't quite understand why you are doing
> >> i++;
> >> ptep++;
> >> before for (; i < CONT_PTES; i++, ptep++).
> >>
> >> it seems to be wrong. if i==15, you will get i=16. you are skipping
> >> the pte_dirty
> >> check for i==15. it is also true for any value between 0 and 15. My
> >> point is that
> >> you should drop it and re-test.
> >
> >Sorry, I missed that you already checked pte_dirty(pte) before calling
> >pte_young(). So please ignore my comment. The code's a bit hard to
> >follow now. :-)
> >
>
> Modification 2 is indeed a bit difficult to follow, but that is the trade-off for the
> improved execution efficiency. Thanks for your careful review. :-)
Agreed. Personally, I have no objections to Modification 2, as it demonstrates
performance improvements in nearly all cases according to your data.
>
> --
> Thanks,
> Xavier
Thanks
Barry
* [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-01 21:32 ` Barry Song
2025-05-04 2:39 ` Xavier
@ 2025-05-08 7:03 ` Xavier Xia
2025-05-08 8:30 ` David Hildenbrand
2025-05-09 2:09 ` Barry Song
1 sibling, 2 replies; 49+ messages in thread
From: Xavier Xia @ 2025-05-08 7:03 UTC (permalink / raw)
To: 21cnbao, ryan.roberts, dev.jain, ioworker0
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, will, willy, xavier_qy, ziy
This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
functions by adding early-termination logic. Once the dirty and young
bits of orig_pte are both known to be set, the remaining bit-setting
work in the loop is redundant and can be skipped. This reduces
unnecessary iterations and improves performance.
To verify the optimization, a test function was designed. Its execution
time and instruction counts were traced with perf, and the following are
the results on a Qualcomm mobile phone SoC:
Test Code:
#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT 8
void rwdata(char *buf)
{
for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
buf[i] = 'a';
volatile char c = buf[i];
}
}
void clear_young_dirty(char *buf)
{
if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
perror("madvise free failed");
free(buf);
exit(EXIT_FAILURE);
}
if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
perror("madvise cold failed");
free(buf);
exit(EXIT_FAILURE);
}
}
void set_one_young(char *buf)
{
for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
}
}
void test_contpte_perf() {
char *buf;
int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
TEST_SIZE);
if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
perror("posix_memalign failed");
exit(EXIT_FAILURE);
}
rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
clear_young_dirty(buf);
#endif
#if TEST_CASE2
set_one_young(buf);
#endif
for (int j = 0; j < 500; j++) {
mlock(buf, TEST_SIZE);
munlock(buf, TEST_SIZE);
}
free(buf);
}
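One subtlety in the alignment sanity check above is worth calling out: `%` and `*` have equal precedence in C and associate left to right, so the unparenthesized form `(unsigned long)buf % CONT_PTES * PAGE_SIZE` parses as `((unsigned long)buf % CONT_PTES) * PAGE_SIZE`, which is zero for any page-aligned pointer and can never detect a missing CONT_PTES-sized alignment. A minimal sketch of the intended check (helper name invented for illustration):

```c
#include <stdbool.h>

/*
 * True when addr is aligned to contpte_size * page_size.  The
 * parentheses around the product are essential; without them the
 * modulus would apply to contpte_size alone.
 */
static bool is_cont_aligned(unsigned long addr, unsigned long contpte_size,
			    unsigned long page_size)
{
	return (addr % (contpte_size * page_size)) == 0;
}
```

For example, 0x9000 is page aligned but not 64 KiB aligned: the parenthesized check rejects it, while `0x9000 % 16 * 4096` evaluates to 0 and would let it pass.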
Descriptions of three test scenarios
Scenario 1
The data of all 16 PTEs are both dirty and young.
#define TEST_CASE2 0
#define TEST_CASE3 0
Scenario 2
Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
#define TEST_CASE2 1
#define TEST_CASE3 0
Scenario 3
Among the 16 PTEs, there are neither young nor dirty ones.
#define TEST_CASE2 0
#define TEST_CASE3 1
Test results
|Scenario 1 | Original| Optimized|
|-------------------|---------------|----------------|
|instructions | 37912436160| 18731580031|
|test time | 4.2797| 2.2949|
|overhead of | | |
|contpte_ptep_get() | 21.31%| 4.80%|
|Scenario 2 | Original| Optimized|
|-------------------|---------------|----------------|
|instructions | 36701270862| 36115790086|
|test time | 3.2335| 3.0874|
|Overhead of | | |
|contpte_ptep_get() | 32.26%| 33.57%|
|Scenario 3 | Original| Optimized|
|-------------------|---------------|----------------|
|instructions | 36706279735| 36750881878|
|test time | 3.2008| 3.1249|
|Overhead of | | |
|contpte_ptep_get() | 31.94%| 34.59%|
For Scenario 1, the optimized code achieves an instruction benefit of
50.59% and a time benefit of 46.38%.
For Scenario 2, the optimized code achieves an instruction count benefit
of 1.6% and a time benefit of 4.5%.
For Scenario 3, since none of the PTEs has either the young or the dirty
flag, the optimized code should take the same branches as the original
code, and indeed its results are close to the original's.
The test function shows that the optimization of contpte_ptep_get is
effective. Since the logic of contpte_ptep_get_lockless is similar to
that of contpte_ptep_get, the same optimization scheme is adopted for it
as well.
Signed-off-by: Xavier Xia <xavier_qy@163.com>
---
arch/arm64/mm/contpte.c | 71 +++++++++++++++++++++++++++++++++++------
1 file changed, 62 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..e9882ec782fc 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -169,17 +169,41 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
- if (pte_dirty(pte))
+ if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
-
- if (pte_young(pte))
+ for (; i < CONT_PTES; i++, ptep++) {
+ pte = __ptep_get(ptep);
+ if (pte_young(pte)) {
+ orig_pte = pte_mkyoung(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
+
+ if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
+ i++;
+ ptep++;
+ for (; i < CONT_PTES; i++, ptep++) {
+ pte = __ptep_get(ptep);
+ if (pte_dirty(pte)) {
+ orig_pte = pte_mkdirty(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
}
return orig_pte;
}
EXPORT_SYMBOL_GPL(contpte_ptep_get);
+#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
+ (!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
+ pgprot_val(prot) != pgprot_val(orig_prot))
+
pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
{
/*
@@ -221,16 +245,45 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
pte = __ptep_get(ptep);
prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
- if (!pte_valid_cont(pte) ||
- pte_pfn(pte) != pfn ||
- pgprot_val(prot) != pgprot_val(orig_prot))
+ if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
goto retry;
- if (pte_dirty(pte))
+ if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
-
- if (pte_young(pte))
+ for (; i < CONT_PTES; i++, ptep++, pfn++) {
+ pte = __ptep_get(ptep);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+ if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
+ goto retry;
+
+ if (pte_young(pte)) {
+ orig_pte = pte_mkyoung(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
+
+ if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
+ i++;
+ ptep++;
+ pfn++;
+ for (; i < CONT_PTES; i++, ptep++, pfn++) {
+ pte = __ptep_get(ptep);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+ if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
+ goto retry;
+
+ if (pte_dirty(pte)) {
+ orig_pte = pte_mkdirty(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
}
return orig_pte;
--
2.34.1
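For readers less familiar with the lockless variant, the consistency-check-and-retry pattern that the patch preserves can be sketched in user space as follows. This is a hedged illustration with invented mock types and field names, not the kernel implementation: the real code compares the pfn and pgprot of each entry and jumps back to a retry label when the block changed underneath it.

```c
#include <stdbool.h>

#define CONT_PTES 16

/* Mock PTE carrying just the fields the consistency check compares. */
struct mock_pte {
	unsigned long pfn;
	unsigned long prot;
	bool cont;	/* contiguous bit still set? */
	bool dirty;
	bool young;
};

static bool entry_consistent(const struct mock_pte *p, unsigned long pfn,
			     unsigned long prot)
{
	return p->cont && p->pfn == pfn && p->prot == prot;
}

/*
 * Scan a contpte block without a lock.  If any entry disagrees with
 * the first one, the block is being modified concurrently: restart.
 * 'resolve' stands in for the concurrent writer finishing its update
 * between our passes.  Returns the number of retries taken.
 */
static int lockless_scan(struct mock_pte *p, bool *dirty, bool *young,
			 void (*resolve)(struct mock_pte *))
{
	unsigned long pfn, prot;
	int retries = 0;

retry:
	*dirty = *young = false;
	pfn = p[0].pfn;
	prot = p[0].prot;
	for (int i = 0; i < CONT_PTES; i++) {
		if (!entry_consistent(&p[i], pfn + i, prot)) {
			retries++;
			if (resolve)
				resolve(p);
			goto retry;
		}
		if (p[i].dirty)
			*dirty = true;
		if (p[i].young)
			*young = true;
	}
	return retries;
}

/* Demo: entry 5 starts mid-update; the "writer" then repairs it. */
static void demo_resolve(struct mock_pte *p)
{
	p[5].cont = true;
}

static int run_demo(bool *dirty, bool *young)
{
	struct mock_pte p[CONT_PTES];

	for (int i = 0; i < CONT_PTES; i++) {
		p[i].pfn = 100 + i;
		p[i].prot = 7;
		p[i].cont = true;
		p[i].dirty = false;
		p[i].young = false;
	}
	p[3].dirty = true;
	p[9].young = true;
	p[5].cont = false;	/* concurrent modification in progress */

	return lockless_scan(p, dirty, young, demo_resolve);
}
```

With one entry made temporarily inconsistent and then repaired by `resolve`, the scan restarts exactly once and still reports the accumulated dirty and young state from the clean pass.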
* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-08 7:03 ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
@ 2025-05-08 8:30 ` David Hildenbrand
2025-05-09 9:17 ` Xavier
2025-05-09 2:09 ` Barry Song
1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2025-05-08 8:30 UTC (permalink / raw)
To: Xavier Xia, 21cnbao, ryan.roberts, dev.jain, ioworker0
Cc: akpm, catalin.marinas, gshan, linux-arm-kernel, linux-kernel,
will, willy, ziy
On 08.05.25 09:03, Xavier Xia wrote:
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
>
> In order to verify the optimization performance, a test function has been
> designed. The function's execution time and instruction statistics have
> been traced using perf, and the following are the operation results on a
> certain Qualcomm mobile phone chip:
For the future, please don't post vN+1 as reply to vN.
--
Cheers,
David / dhildenb
* Re:Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-08 8:30 ` David Hildenbrand
@ 2025-05-09 9:17 ` Xavier
2025-05-09 9:25 ` David Hildenbrand
0 siblings, 1 reply; 49+ messages in thread
From: Xavier @ 2025-05-09 9:17 UTC (permalink / raw)
To: David Hildenbrand
Cc: 21cnbao, ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
At 2025-05-08 16:30:14, "David Hildenbrand" <david@redhat.com> wrote:
>On 08.05.25 09:03, Xavier Xia wrote:
>> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
>> function by adding early termination logic. It checks if the dirty and
>> young bits of orig_pte are already set and skips redundant bit-setting
>> operations during the loop. This reduces unnecessary iterations and
>> improves performance.
>>
>> In order to verify the optimization performance, a test function has been
>> designed. The function's execution time and instruction statistics have
>> been traced using perf, and the following are the operation results on a
>> certain Qualcomm mobile phone chip:
>
>For the future, please don't post vN+1 as reply to vN.
I will pay attention to it when I submit it later. Thank you for the reminder.
>
>--
>Cheers,
>
>David / dhildenb
--
Thanks,
Xavier
* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-09 9:17 ` Xavier
@ 2025-05-09 9:25 ` David Hildenbrand
0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2025-05-09 9:25 UTC (permalink / raw)
To: Xavier
Cc: 21cnbao, ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
On 09.05.25 11:17, Xavier wrote:
>
>
>
> At 2025-05-08 16:30:14, "David Hildenbrand" <david@redhat.com> wrote:
>> On 08.05.25 09:03, Xavier Xia wrote:
>>> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
>>> function by adding early termination logic. It checks if the dirty and
>>> young bits of orig_pte are already set and skips redundant bit-setting
>>> operations during the loop. This reduces unnecessary iterations and
>>> improves performance.
>>>
>>> In order to verify the optimization performance, a test function has been
>>> designed. The function's execution time and instruction statistics have
>>> been traced using perf, and the following are the operation results on a
>>> certain Qualcomm mobile phone chip:
>>
>> For the future, please don't post vN+1 as reply to vN.
>
> I will pay attention to it when I submit it later. Thank you for the reminder.
The rationale is that many people will just treat it as some discussion
noise as part of vN and not really have a closer look, waiting for vN+1.
--
Cheers,
David / dhildenb
* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-08 7:03 ` [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
2025-05-08 8:30 ` David Hildenbrand
@ 2025-05-09 2:09 ` Barry Song
2025-05-09 9:20 ` Xavier
1 sibling, 1 reply; 49+ messages in thread
From: Barry Song @ 2025-05-09 2:09 UTC (permalink / raw)
To: Xavier Xia
Cc: ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
On Thu, May 8, 2025 at 7:04 PM Xavier Xia <xavier_qy@163.com> wrote:
>
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
>
> In order to verify the optimization performance, a test function has been
> designed. The function's execution time and instruction statistics have
> been traced using perf, and the following are the operation results on a
> certain Qualcomm mobile phone chip:
>
> Test Code:
>
> #define PAGE_SIZE 4096
> #define CONT_PTES 16
> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> #define YOUNG_BIT 8
> void rwdata(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> buf[i] = 'a';
> volatile char c = buf[i];
> }
> }
> void clear_young_dirty(char *buf)
> {
> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> perror("madvise free failed");
> free(buf);
> exit(EXIT_FAILURE);
> }
> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> perror("madvise cold failed");
> free(buf);
> exit(EXIT_FAILURE);
> }
> }
> void set_one_young(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> }
> }
>
> void test_contpte_perf() {
> char *buf;
> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
> TEST_SIZE);
> if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
> perror("posix_memalign failed");
> exit(EXIT_FAILURE);
> }
>
> rwdata(buf);
> #if TEST_CASE2 || TEST_CASE3
> clear_young_dirty(buf);
> #endif
> #if TEST_CASE2
> set_one_young(buf);
> #endif
>
> for (int j = 0; j < 500; j++) {
> mlock(buf, TEST_SIZE);
>
> munlock(buf, TEST_SIZE);
> }
> free(buf);
> }
>
> Descriptions of three test scenarios
>
> Scenario 1
> The data of all 16 PTEs are both dirty and young.
> #define TEST_CASE2 0
> #define TEST_CASE3 0
>
> Scenario 2
> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> #define TEST_CASE2 1
> #define TEST_CASE3 0
>
> Scenario 3
> Among the 16 PTEs, there are neither young nor dirty ones.
> #define TEST_CASE2 0
> #define TEST_CASE3 1
>
> Test results
>
> |Scenario 1 | Original| Optimized|
> |-------------------|---------------|----------------|
> |instructions | 37912436160| 18731580031|
> |test time | 4.2797| 2.2949|
> |overhead of | | |
> |contpte_ptep_get() | 21.31%| 4.80%|
>
> |Scenario 2 | Original| Optimized|
> |-------------------|---------------|----------------|
> |instructions | 36701270862| 36115790086|
> |test time | 3.2335| 3.0874|
> |Overhead of | | |
> |contpte_ptep_get() | 32.26%| 33.57%|
>
> |Scenario 3 | Original| Optimized|
> |-------------------|---------------|----------------|
> |instructions | 36706279735| 36750881878|
> |test time | 3.2008| 3.1249|
> |Overhead of | | |
> |contpte_ptep_get() | 31.94%| 34.59%|
>
> For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
> and a time benefit of 46.38%.
> For Scenario 2, optimized code can achieve an instruction count benefit of
> 1.6% and a time benefit of 4.5%.
> For Scenario 3, since all the PTEs have neither the young nor the dirty
> flag, the branches taken by optimized code should be the same as those of
> the original code. In fact, the test results of optimized code seem to be
> closer to those of the original code.
>
> It can be proven through test function that the optimization for
> contpte_ptep_get is effective. Since the logic of contpte_ptep_get_lockless
> is similar to that of contpte_ptep_get, the same optimization scheme is
> also adopted for it.
>
> Signed-off-by: Xavier Xia <xavier_qy@163.com>
> ---
> arch/arm64/mm/contpte.c | 71 +++++++++++++++++++++++++++++++++++------
> 1 file changed, 62 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..e9882ec782fc 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -169,17 +169,41 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> - if (pte_dirty(pte))
> + if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> -
> - if (pte_young(pte))
> + for (; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> + if (pte_young(pte)) {
> + orig_pte = pte_mkyoung(orig_pte);
> + break;
> + }
> + }
> + break;
> + }
> +
> + if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> + i++;
> + ptep++;
> + for (; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> + if (pte_dirty(pte)) {
> + orig_pte = pte_mkdirty(orig_pte);
> + break;
> + }
> + }
> + break;
> + }
> }
>
> return orig_pte;
> }
> EXPORT_SYMBOL_GPL(contpte_ptep_get);
>
> +#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
> + (!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
> + pgprot_val(prot) != pgprot_val(orig_prot))
Maybe make it a static inline function to improve readability. Also, the name
seems wrong: CHECK_CONTPTE_CONSISTENCY is actually checking for inconsistency,
not consistency.
it might be:
static inline bool contpte_is_consistent(...)
{
return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
pgprot_val(prot) == pgprot_val(orig_prot);
}
or another better name.
> +
> pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> {
> /*
> @@ -221,16 +245,45 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> pte = __ptep_get(ptep);
> prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>
> - if (!pte_valid_cont(pte) ||
> - pte_pfn(pte) != pfn ||
> - pgprot_val(prot) != pgprot_val(orig_prot))
> + if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
> goto retry;
>
> - if (pte_dirty(pte))
> + if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> -
> - if (pte_young(pte))
> + for (; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> + if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
> + goto retry;
> +
> + if (pte_young(pte)) {
> + orig_pte = pte_mkyoung(orig_pte);
> + break;
> + }
> + }
> + break;
> + }
> +
> + if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> + i++;
> + ptep++;
> + pfn++;
> + for (; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> + if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
> + goto retry;
> +
> + if (pte_dirty(pte)) {
> + orig_pte = pte_mkdirty(orig_pte);
> + break;
> + }
> + }
> + break;
> + }
> }
>
> return orig_pte;
> --
> 2.34.1
>
Thanks
barry
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v4] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-09 2:09 ` Barry Song
@ 2025-05-09 9:20 ` Xavier
0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-05-09 9:20 UTC (permalink / raw)
To: Barry Song
Cc: ryan.roberts, dev.jain, ioworker0, akpm, catalin.marinas, david,
gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
At 2025-05-09 10:09:21, "Barry Song" <21cnbao@gmail.com> wrote:
>On Thu, May 8, 2025 at 7:04 PM Xavier Xia <xavier_qy@163.com> wrote:
>>
>> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
>> function by adding early termination logic. It checks if the dirty and
>> young bits of orig_pte are already set and skips redundant bit-setting
>> operations during the loop. This reduces unnecessary iterations and
>> improves performance.
>>
>> In order to verify the optimization performance, a test function has been
>> designed. The function's execution time and instruction statistics have
>> been traced using perf, and the following are the operation results on a
>> certain Qualcomm mobile phone chip:
>>
>> Test Code:
>>
>> #define PAGE_SIZE 4096
>> #define CONT_PTES 16
>> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>> #define YOUNG_BIT 8
>> void rwdata(char *buf)
>> {
>> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> buf[i] = 'a';
>> volatile char c = buf[i];
>> }
>> }
>> void clear_young_dirty(char *buf)
>> {
>> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
>> perror("madvise free failed");
>> free(buf);
>> exit(EXIT_FAILURE);
>> }
>> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
>> perror("madvise cold failed");
>> free(buf);
>> exit(EXIT_FAILURE);
>> }
>> }
>> void set_one_young(char *buf)
>> {
>> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
>> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
>> }
>> }
>>
>> void test_contpte_perf() {
>> char *buf;
>> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
>> TEST_SIZE);
>> if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
>> perror("posix_memalign failed");
>> exit(EXIT_FAILURE);
>> }
>>
>> rwdata(buf);
>> #if TEST_CASE2 || TEST_CASE3
>> clear_young_dirty(buf);
>> #endif
>> #if TEST_CASE2
>> set_one_young(buf);
>> #endif
>>
>> for (int j = 0; j < 500; j++) {
>> mlock(buf, TEST_SIZE);
>>
>> munlock(buf, TEST_SIZE);
>> }
>> free(buf);
>> }
>>
>> Descriptions of three test scenarios
>>
>> Scenario 1
>> The data of all 16 PTEs are both dirty and young.
>> #define TEST_CASE2 0
>> #define TEST_CASE3 0
>>
>> Scenario 2
>> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
>> #define TEST_CASE2 1
>> #define TEST_CASE3 0
>>
>> Scenario 3
>> Among the 16 PTEs, there are neither young nor dirty ones.
>> #define TEST_CASE2 0
>> #define TEST_CASE3 1
>>
>> Test results
>>
>> |Scenario 1 | Original| Optimized|
>> |-------------------|---------------|----------------|
>> |instructions | 37912436160| 18731580031|
>> |test time (s) | 4.2797| 2.2949|
>> |Overhead of | | |
>> |contpte_ptep_get() | 21.31%| 4.80%|
>>
>> |Scenario 2 | Original| Optimized|
>> |-------------------|---------------|----------------|
>> |instructions | 36701270862| 36115790086|
>> |test time (s) | 3.2335| 3.0874|
>> |Overhead of | | |
>> |contpte_ptep_get() | 32.26%| 33.57%|
>>
>> |Scenario 3 | Original| Optimized|
>> |-------------------|---------------|----------------|
>> |instructions | 36706279735| 36750881878|
>> |test time (s) | 3.2008| 3.1249|
>> |Overhead of | | |
>> |contpte_ptep_get() | 31.94%| 34.59%|
>>
>> For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
>> and a time benefit of 46.38%.
>> For Scenario 2, optimized code can achieve an instruction count benefit of
>> 1.6% and a time benefit of 4.5%.
>> For Scenario 3, since all the PTEs have neither the young nor the dirty
>> flag, the branches taken by optimized code should be the same as those of
>> the original code. Indeed, the measured results of the optimized code are
>> close to those of the original code.
>>
>> It can be seen from the test function that the optimization for
>> contpte_ptep_get is effective. Since the logic of contpte_ptep_get_lockless
>> is similar to that of contpte_ptep_get, the same optimization scheme is
>> also adopted for it.
>>
>> Signed-off-by: Xavier Xia <xavier_qy@163.com>
>> ---
>> arch/arm64/mm/contpte.c | 71 +++++++++++++++++++++++++++++++++++------
>> 1 file changed, 62 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..e9882ec782fc 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -169,17 +169,41 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>>
>> - if (pte_dirty(pte))
>> + if (pte_dirty(pte)) {
>> orig_pte = pte_mkdirty(orig_pte);
>> -
>> - if (pte_young(pte))
>> + for (; i < CONT_PTES; i++, ptep++) {
>> + pte = __ptep_get(ptep);
>> + if (pte_young(pte)) {
>> + orig_pte = pte_mkyoung(orig_pte);
>> + break;
>> + }
>> + }
>> + break;
>> + }
>> +
>> + if (pte_young(pte)) {
>> orig_pte = pte_mkyoung(orig_pte);
>> + i++;
>> + ptep++;
>> + for (; i < CONT_PTES; i++, ptep++) {
>> + pte = __ptep_get(ptep);
>> + if (pte_dirty(pte)) {
>> + orig_pte = pte_mkdirty(orig_pte);
>> + break;
>> + }
>> + }
>> + break;
>> + }
>> }
>>
>> return orig_pte;
>> }
>> EXPORT_SYMBOL_GPL(contpte_ptep_get);
>>
>> +#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
>> + (!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
>> + pgprot_val(prot) != pgprot_val(orig_prot))
>
>Maybe make it a static inline function to improve readability. Also, the name
>seems wrong: CHECK_CONTPTE_CONSISTENCY is actually checking for inconsistency,
>not consistency.
>
>it might be:
>
>static inline bool contpte_is_consistent(...)
>{
> return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
> pgprot_val(prot) == pgprot_val(orig_prot);
>}
>
>or another better name.
>
You're right. What's being checked here is the inconsistency. I will make the modification
in the next version. Thank you for your suggestion.
--
Thanks,
Xavier
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-15 8:22 ` [mm/contpte v3 0/1] " Xavier
2025-04-15 8:22 ` [mm/contpte v3 1/1] " Xavier
@ 2025-04-16 2:10 ` Andrew Morton
2025-04-16 3:25 ` Xavier
2025-04-16 12:47 ` Catalin Marinas
2025-04-16 12:48 ` Ryan Roberts
3 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2025-04-16 2:10 UTC (permalink / raw)
To: Xavier
Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
Please try to avoid presentation of a [0/N] cover letter when N==1! A
simple singleton patch is better.
On Tue, 15 Apr 2025 16:22:04 +0800 Xavier <xavier_qy@163.com> wrote:
> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there
which function?
> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:
All the words thus far appear to be discussing changes since v2. For
the permanent kernel record, this isn't interesting or useful material.
So please present a standalone description which doesn't refer to
previous iterations.
It's great to present this what-i-changed-since-last-time material, but
that is better placed after the "^---$" separator, after the
Signed-off-by:, Reviewed-by: etc tags.
>
> ...
>
Below is what I came up with for a changelog. Please check?
Optimize contpte_ptep_get() by adding early termination logic. Check if
the dirty and young bits of orig_pte are already set and skip redundant
bit-setting operations during the loop. This reduces unnecessary
iterations and improves performance.
The function's execution time and instruction statistics have been traced
using perf, and the following are the operation results on a certain
Qualcomm mobile phone chip:
Instruction Statistics - Before Optimization
# count event_name # count / runtime
20,814,352 branch-load-misses # 662.244 K/sec
41,894,986,323 branch-loads # 1.333 G/sec
1,957,415 iTLB-load-misses # 62.278 K/sec
49,872,282,100 iTLB-loads # 1.587 G/sec
302,808,096 L1-icache-load-misses # 9.634 M/sec
49,872,282,100 L1-icache-loads # 1.587 G/sec
Total test time: 31.485237 seconds.
Instruction Statistics - After Optimization
# count event_name # count / runtime
19,340,524 branch-load-misses # 688.753 K/sec
38,510,185,183 branch-loads # 1.371 G/sec
1,812,716 iTLB-load-misses # 64.554 K/sec
47,673,923,151 iTLB-loads # 1.698 G/sec
675,853,661 L1-icache-load-misses # 24.068 M/sec
47,673,923,151 L1-icache-loads # 1.698 G/sec
Total test time: 28.108048 seconds.
Function Statistics - Before Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1419716
Event count: 99618088900
Overhead Symbol
21.42% lock_release
21.26% lock_acquire
20.88% arch_counter_get_cntvct
14.32% _raw_spin_unlock_irq
6.79% contpte_ptep_get
2.20% test_contpte_perf
1.82% follow_page_pte
0.97% lock_acquired
0.97% rcu_is_watching
0.89% mlock_pte_range
0.84% sched_clock_noinstr
0.70% handle_softirqs.llvm.8218488130471452153
0.58% test_preempt_disable_long
0.57% _raw_spin_unlock_irqrestore
0.54% arch_stack_walk
0.51% vm_normal_folio
0.48% check_preemption_disabled
0.47% stackinfo_get_task
0.36% try_grab_folio
0.34% preempt_count
0.32% trace_preempt_on
0.29% trace_preempt_off
0.24% debug_smp_processor_id
Function Statistics - After Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1431006
Event count: 118856425042
Overhead Symbol
22.59% lock_release
22.13% arch_counter_get_cntvct
22.08% lock_acquire
15.32% _raw_spin_unlock_irq
2.26% test_contpte_perf
1.50% follow_page_pte
1.49% arch_stack_walk
1.30% rcu_is_watching
1.09% lock_acquired
1.07% sched_clock_noinstr
0.88% handle_softirqs.llvm.12507768597002095717
0.88% trace_preempt_off
0.76% _raw_spin_unlock_irqrestore
0.61% check_preemption_disabled
0.52% trace_preempt_on
0.50% mlock_pte_range
0.43% try_grab_folio
0.41% folio_mark_accessed
0.40% vm_normal_folio
0.38% test_preempt_disable_long
0.28% contpte_ptep_get
0.27% __traceiter_android_rvh_preempt_disable
0.26% debug_smp_processor_id
0.24% return_address
0.20% __pte_offset_map_lock
0.19% unwind_next_frame_record
If there is no problem with my test program, it can be seen that there is a
significant performance improvement both in the overall number of instructions
and the execution time of contpte_ptep_get.
If any reviewers have time, you can also test it on your machines for comparison.
I have enabled THP and hugepages-64kB.
Test function:
#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
void rwdata(char *buf)
{
for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
buf[i] = 'a';
volatile char c = buf[i];
}
}
void test_contpte_perf()
{
char *buf;
int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
if (ret != 0) {
perror("posix_memalign failed");
exit(EXIT_FAILURE);
}
rwdata(buf);
for (int j = 0; j < 500; j++) {
mlock(buf, TEST_SIZE);
rwdata(buf);
munlock(buf, TEST_SIZE);
}
free(buf);
}
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-16 2:10 ` [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations Andrew Morton
@ 2025-04-16 3:25 ` Xavier
0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 3:25 UTC (permalink / raw)
To: Andrew Morton
Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy
Hi Andrew,
At 2025-04-16 10:10:27, "Andrew Morton" <akpm@linux-foundation.org> wrote:
>
>Please try to avoid presentation of a [0/N] cover letter when N==1! A
>simple singleton patch is better.
Got it, I'll keep this in mind for future submissions. Thanks for the reminder!
>
>On Tue, 15 Apr 2025 16:22:04 +0800 Xavier <xavier_qy@163.com> wrote:
>
>> Patch V3 has changed the while loop to a for loop according to the suggestions
>> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
>> been removed. This macro is only used within the current function and there
>
>which function?
It's contpte_ptep_get().
>
>> will be no additional risks. In order to verify the optimization performance of
>> Patch V3, a test function has been designed. By repeatedly calling mlock in a
>> loop, the kernel is made to call contpte_ptep_get extensively to test the
>> optimization effect of this function.
>> The function's execution time and instruction statistics have been traced using
>> perf, and the following are the operation results on a certain Qualcomm mobile
>> phone chip:
>
>All the words thus far appear to be discussing changes since v2. For
>the permanent kernel record, this isn't interesting or useful material.
>So please present a standalone description which doesn't refer to
>previous iterations.
>
>It's great to present this what-i-changed-since-last-time material, but
>that is better placed after the "^---$" separator, after the
>Signed-off-by:, Reviewed-by: etc tags.
>
OK, I will follow this requirement for future submissions.
>>
>> ...
>>
>
>
>Below is what I came up with for a changelog. Please check?
I've reviewed it, and it looks good. Thank you for your revisions!
>
>Optimize contpte_ptep_get() by adding early termination logic. Check if
>the dirty and young bits of orig_pte are already set and skip redundant
>bit-setting operations during the loop. This reduces unnecessary
>iterations and improves performance.
>
>The function's execution time and instruction statistics have been traced
>using perf, and the following are the operation results on a certain
>Qualcomm mobile phone chip:
>
>Instruction Statistics - Before Optimization
># count event_name # count / runtime
> 20,814,352 branch-load-misses # 662.244 K/sec
> 41,894,986,323 branch-loads # 1.333 G/sec
> 1,957,415 iTLB-load-misses # 62.278 K/sec
> 49,872,282,100 iTLB-loads # 1.587 G/sec
> 302,808,096 L1-icache-load-misses # 9.634 M/sec
> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>
>Total test time: 31.485237 seconds.
>
>Instruction Statistics - After Optimization
># count event_name # count / runtime
> 19,340,524 branch-load-misses # 688.753 K/sec
> 38,510,185,183 branch-loads # 1.371 G/sec
> 1,812,716 iTLB-load-misses # 64.554 K/sec
> 47,673,923,151 iTLB-loads # 1.698 G/sec
> 675,853,661 L1-icache-load-misses # 24.068 M/sec
> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>
>Total test time: 28.108048 seconds.
>
>Function Statistics - Before Optimization
>Arch: arm64
>Event: cpu-cycles (type 0, config 0)
>Samples: 1419716
>Event count: 99618088900
>
>Overhead Symbol
>21.42% lock_release
>21.26% lock_acquire
>20.88% arch_counter_get_cntvct
>14.32% _raw_spin_unlock_irq
>6.79% contpte_ptep_get
>2.20% test_contpte_perf
>1.82% follow_page_pte
>0.97% lock_acquired
>0.97% rcu_is_watching
>0.89% mlock_pte_range
>0.84% sched_clock_noinstr
>0.70% handle_softirqs.llvm.8218488130471452153
>0.58% test_preempt_disable_long
>0.57% _raw_spin_unlock_irqrestore
>0.54% arch_stack_walk
>0.51% vm_normal_folio
>0.48% check_preemption_disabled
>0.47% stackinfo_get_task
>0.36% try_grab_folio
>0.34% preempt_count
>0.32% trace_preempt_on
>0.29% trace_preempt_off
>0.24% debug_smp_processor_id
>
>Function Statistics - After Optimization
>Arch: arm64
>Event: cpu-cycles (type 0, config 0)
>Samples: 1431006
>Event count: 118856425042
>
>Overhead Symbol
>22.59% lock_release
>22.13% arch_counter_get_cntvct
>22.08% lock_acquire
>15.32% _raw_spin_unlock_irq
>2.26% test_contpte_perf
>1.50% follow_page_pte
>1.49% arch_stack_walk
>1.30% rcu_is_watching
>1.09% lock_acquired
>1.07% sched_clock_noinstr
>0.88% handle_softirqs.llvm.12507768597002095717
>0.88% trace_preempt_off
>0.76% _raw_spin_unlock_irqrestore
>0.61% check_preemption_disabled
>0.52% trace_preempt_on
>0.50% mlock_pte_range
>0.43% try_grab_folio
>0.41% folio_mark_accessed
>0.40% vm_normal_folio
>0.38% test_preempt_disable_long
>0.28% contpte_ptep_get
>0.27% __traceiter_android_rvh_preempt_disable
>0.26% debug_smp_processor_id
>0.24% return_address
>0.20% __pte_offset_map_lock
>0.19% unwind_next_frame_record
>
>If there is no problem with my test program, it can be seen that there is a
>significant performance improvement both in the overall number of instructions
>and the execution time of contpte_ptep_get.
>
>If any reviewers have time, you can also test it on your machines for comparison.
>I have enabled THP and hugepages-64kB.
>
>Test function:
>
>#define PAGE_SIZE 4096
>#define CONT_PTES 16
>#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
>
>void rwdata(char *buf)
>{
> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> buf[i] = 'a';
> volatile char c = buf[i];
> }
>}
>void test_contpte_perf()
>{
> char *buf;
> int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
> if (ret != 0) {
> perror("posix_memalign failed");
> exit(EXIT_FAILURE);
> }
>
> rwdata(buf);
>
> for (int j = 0; j < 500; j++) {
> mlock(buf, TEST_SIZE);
>
> rwdata(buf);
>
> munlock(buf, TEST_SIZE);
> }
>
> free(buf);
>}
--
Thanks,
Xavier
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-15 8:22 ` [mm/contpte v3 0/1] " Xavier
2025-04-15 8:22 ` [mm/contpte v3 1/1] " Xavier
2025-04-16 2:10 ` [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations Andrew Morton
@ 2025-04-16 12:47 ` Catalin Marinas
2025-04-16 15:08 ` Xavier
2025-04-16 12:48 ` Ryan Roberts
3 siblings, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2025-04-16 12:47 UTC (permalink / raw)
To: Xavier
Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, david, gshan,
linux-arm-kernel, linux-kernel, will, willy, ziy
On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev.
For some reason, my email (office365) rejected all these patches (not
even quarantined), I only got the replies. Anyway, I can get them from
the lore archive.
> Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there
> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:
>
> Instruction Statistics - Before Optimization
> # count event_name # count / runtime
> 20,814,352 branch-load-misses # 662.244 K/sec
> 41,894,986,323 branch-loads # 1.333 G/sec
> 1,957,415 iTLB-load-misses # 62.278 K/sec
> 49,872,282,100 iTLB-loads # 1.587 G/sec
> 302,808,096 L1-icache-load-misses # 9.634 M/sec
> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>
> Total test time: 31.485237 seconds.
>
> Instruction Statistics - After Optimization
> # count event_name # count / runtime
> 19,340,524 branch-load-misses # 688.753 K/sec
> 38,510,185,183 branch-loads # 1.371 G/sec
> 1,812,716 iTLB-load-misses # 64.554 K/sec
> 47,673,923,151 iTLB-loads # 1.698 G/sec
> 675,853,661 L1-icache-load-misses # 24.068 M/sec
> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>
> Total test time: 28.108048 seconds.
We'd need to reproduce these numbers on other platforms as well and with
different page sizes. I hope Ryan can do some tests next week.
Purely looking at the patch, I don't like the complexity. I'd rather go
with your v1 if the numbers are fairly similar (even if slightly slower).
However, I don't trust microbenchmarks like calling mlock() in a loop.
It was hand-crafted to dirty the whole buffer (making ptes young+dirty)
before mlock() to make the best out of the rewritten contpte_ptep_get().
Are there any real world workloads that would benefit from such change?
As it stands, I think this patch needs better justification.
Thanks.
--
Catalin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re:Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-16 12:47 ` Catalin Marinas
@ 2025-04-16 15:08 ` Xavier
0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 15:08 UTC (permalink / raw)
To: Catalin Marinas
Cc: ryan.roberts, dev.jain, ioworker0, 21cnbao, akpm, david, gshan,
linux-arm-kernel, linux-kernel, will, willy, ziy
Hi Catalin,
At 2025-04-16 20:47:46, "Catalin Marinas" <catalin.marinas@arm.com> wrote:
>On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
>> Patch V3 has changed the while loop to a for loop according to the suggestions
>> of Dev.
>
>For some reason, my email (office365) rejected all these patches (not
>even quarantined), I only got the replies. Anyway, I can get them from
>the lore archive.
>
>> Meanwhile, to improve efficiency, the definition of local variables has
>> been removed. This macro is only used within the current function and there
>> will be no additional risks. In order to verify the optimization performance of
>> Patch V3, a test function has been designed. By repeatedly calling mlock in a
>> loop, the kernel is made to call contpte_ptep_get extensively to test the
>> optimization effect of this function.
>> The function's execution time and instruction statistics have been traced using
>> perf, and the following are the operation results on a certain Qualcomm mobile
>> phone chip:
>>
>> Instruction Statistics - Before Optimization
>> # count event_name # count / runtime
>> 20,814,352 branch-load-misses # 662.244 K/sec
>> 41,894,986,323 branch-loads # 1.333 G/sec
>> 1,957,415 iTLB-load-misses # 62.278 K/sec
>> 49,872,282,100 iTLB-loads # 1.587 G/sec
>> 302,808,096 L1-icache-load-misses # 9.634 M/sec
>> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>>
>> Total test time: 31.485237 seconds.
>>
>> Instruction Statistics - After Optimization
>> # count event_name # count / runtime
>> 19,340,524 branch-load-misses # 688.753 K/sec
>> 38,510,185,183 branch-loads # 1.371 G/sec
>> 1,812,716 iTLB-load-misses # 64.554 K/sec
>> 47,673,923,151 iTLB-loads # 1.698 G/sec
>> 675,853,661 L1-icache-load-misses # 24.068 M/sec
>> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>>
>> Total test time: 28.108048 seconds.
>
>We'd need to reproduce these numbers on other platforms as well and with
>different page sizes. I hope Ryan can do some tests next week.
Of course, it would be even better if any reviewers could verify it with some
real-world scenarios.
>
>Purely looking at the patch, I don't like the complexity. I'd rather go
>with your v1 if the numbers are fairly similar (even if slightly slower).
The implementation of this macro is indeed a bit complicated. If it
doesn't affect the performance, I will try to make it simpler.
>
>However, I don't trust microbenchmarks like calling mlock() in a loop.
>It was hand-crafted to dirty the whole buffer (making ptes young+dirty)
>before mlock() to make the best out of the rewritten contpte_ptep_get().
>Are there any real world workloads that would benefit from such change?
>
>As it stands, I think this patch needs better justification.
>
Indeed, I used mlock() because it is a simple way to trigger a large number
of contpte_ptep_get() calls from user space. Of course, there are many other
paths that call this function, such as madvise() and mprotect(). But
essentially there is no difference: all of them exercise the performance
of contpte_ptep_get(). It's true that the previous test case only covered
the best-case scenario by dirtying the data.
Next, I will test the impact of the new patch in scenarios where neither
the "young" nor the "dirty" flag is set, or where only the "young" flag
is set.
--
Thanks,
Xavier
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-15 8:22 ` [mm/contpte v3 0/1] " Xavier
` (2 preceding siblings ...)
2025-04-16 12:47 ` Catalin Marinas
@ 2025-04-16 12:48 ` Ryan Roberts
2025-04-16 15:22 ` Xavier
3 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-04-16 12:48 UTC (permalink / raw)
To: Xavier, dev.jain, ioworker0, 21cnbao
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, will, willy, ziy
On 15/04/2025 09:22, Xavier wrote:
> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there
> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:
Xavier, for some reason your emails aren't hitting my inbox - I'm only seeing
the replies from others. I'll monitor lore, but apologies if I'm slow to respond
- that's the reason.
Please start the first line of the commit with "arm64/mm" instead of "mm/contpte".
Also I noticed that Andrew put this into mm-new last night. I'd prefer that this
go via the arm64 tree, if we decide we want it.
>
> Instruction Statistics - Before Optimization
> # count event_name # count / runtime
> 20,814,352 branch-load-misses # 662.244 K/sec
> 41,894,986,323 branch-loads # 1.333 G/sec
> 1,957,415 iTLB-load-misses # 62.278 K/sec
> 49,872,282,100 iTLB-loads # 1.587 G/sec
> 302,808,096 L1-icache-load-misses # 9.634 M/sec
> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>
> Total test time: 31.485237 seconds.
>
> Instruction Statistics - After Optimization
> # count event_name # count / runtime
> 19,340,524 branch-load-misses # 688.753 K/sec
> 38,510,185,183 branch-loads # 1.371 G/sec
> 1,812,716 iTLB-load-misses # 64.554 K/sec
> 47,673,923,151 iTLB-loads # 1.698 G/sec
> 675,853,661 L1-icache-load-misses # 24.068 M/sec
> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>
> Total test time: 28.108048 seconds.
>
> Function Statistics - Before Optimization
> Arch: arm64
> Event: cpu-cycles (type 0, config 0)
> Samples: 1419716
> Event count: 99618088900
>
> Overhead Symbol
> 21.42% lock_release
> 21.26% lock_acquire
> 20.88% arch_counter_get_cntvct
> 14.32% _raw_spin_unlock_irq
> 6.79% contpte_ptep_get
> 2.20% test_contpte_perf
> 1.82% follow_page_pte
> 0.97% lock_acquired
> 0.97% rcu_is_watching
> 0.89% mlock_pte_range
> 0.84% sched_clock_noinstr
> 0.70% handle_softirqs.llvm.8218488130471452153
> 0.58% test_preempt_disable_long
> 0.57% _raw_spin_unlock_irqrestore
> 0.54% arch_stack_walk
> 0.51% vm_normal_folio
> 0.48% check_preemption_disabled
> 0.47% stackinfo_get_task
> 0.36% try_grab_folio
> 0.34% preempt_count
> 0.32% trace_preempt_on
> 0.29% trace_preempt_off
> 0.24% debug_smp_processor_id
>
> Function Statistics - After Optimization
> Arch: arm64
> Event: cpu-cycles (type 0, config 0)
> Samples: 1431006
> Event count: 118856425042
>
> Overhead Symbol
> 22.59% lock_release
> 22.13% arch_counter_get_cntvct
> 22.08% lock_acquire
> 15.32% _raw_spin_unlock_irq
> 2.26% test_contpte_perf
> 1.50% follow_page_pte
> 1.49% arch_stack_walk
> 1.30% rcu_is_watching
> 1.09% lock_acquired
> 1.07% sched_clock_noinstr
> 0.88% handle_softirqs.llvm.12507768597002095717
> 0.88% trace_preempt_off
> 0.76% _raw_spin_unlock_irqrestore
> 0.61% check_preemption_disabled
> 0.52% trace_preempt_on
> 0.50% mlock_pte_range
> 0.43% try_grab_folio
> 0.41% folio_mark_accessed
> 0.40% vm_normal_folio
> 0.38% test_preempt_disable_long
> 0.28% contpte_ptep_get
> 0.27% __traceiter_android_rvh_preempt_disable
> 0.26% debug_smp_processor_id
> 0.24% return_address
> 0.20% __pte_offset_map_lock
> 0.19% unwind_next_frame_record
>
> If there is no problem with my test program, it can be seen that there is a
> significant performance improvement both in the overall number of instructions
> and the execution time of contpte_ptep_get.
>
> If any reviewers have time, you can also test it on your machines for comparison.
> I have enabled THP and hugepages-64kB.
>
> Test Function:
> ---
> #define PAGE_SIZE 4096
> #define CONT_PTES 16
> #define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
>
> void rwdata(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> buf[i] = 'a';
> volatile char c = buf[i];
> }
> }
> void test_contpte_perf()
> {
> char *buf;
> int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
> if (ret != 0) {
> perror("posix_memalign failed");
> exit(EXIT_FAILURE);
> }
>
> rwdata(buf);
>
> for (int j = 0; j < 500; j++) {
> mlock(buf, TEST_SIZE);
>
> rwdata(buf);
>
> munlock(buf, TEST_SIZE);
This is a microbenchmark in a pathological case and it's showing ~11%
improvement. But in principle I'm ok with it. I have some comments on the actual
change though, which I'll send through separately by email.
Thanks,
Ryan
> }
>
> free(buf);
> }
> ---
>
> Xavier (1):
> mm/contpte: Optimize loop to reduce redundant operations
>
> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations
2025-04-16 12:48 ` Ryan Roberts
@ 2025-04-16 15:22 ` Xavier
0 siblings, 0 replies; 49+ messages in thread
From: Xavier @ 2025-04-16 15:22 UTC (permalink / raw)
To: Ryan Roberts
Cc: dev.jain, ioworker0, 21cnbao, akpm, catalin.marinas, david, gshan,
linux-arm-kernel, linux-kernel, will, willy, ziy
Hi Ryan,
At 2025-04-16 20:48:57, "Ryan Roberts" <ryan.roberts@arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> Patch V3 changes the while loop to a for loop, as Dev suggested. To improve
>> efficiency, the local variable definitions have also been removed. The macro
>> is only used within the current function, so this introduces no additional
>> risk. To verify the performance of Patch V3, I designed a test function that
>> repeatedly calls mlock in a loop, making the kernel call contpte_ptep_get
>> extensively so the effect of the optimization can be measured. The function's
>> execution time and instruction statistics were traced with perf; below are
>> the results on a certain Qualcomm mobile phone chip:
>
>Xavier, for some reason your emails aren't hitting my inbox - I'm only seeing
>the replies from others. I'll monitor lore, but apologies if I'm slow to
>respond - that's the reason.
>
>Please start the first line of the commit with "arm64/mm" instead of "mm/contpte".
>
>Also I noticed that Andrew put this into mm-new last night. I'd prefer that this
>go via the arm64 tree, if we decide we want it.
OK, I will change it to "arm64/mm" in the subsequent version.
>>
>> Instruction Statistics - Before Optimization
>> # count event_name # count / runtime
>> 20,814,352 branch-load-misses # 662.244 K/sec
>> 41,894,986,323 branch-loads # 1.333 G/sec
>> 1,957,415 iTLB-load-misses # 62.278 K/sec
>> 49,872,282,100 iTLB-loads # 1.587 G/sec
>> 302,808,096 L1-icache-load-misses # 9.634 M/sec
>> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>>
>> Total test time: 31.485237 seconds.
>>
>> Instruction Statistics - After Optimization
>> # count event_name # count / runtime
>> 19,340,524 branch-load-misses # 688.753 K/sec
>> 38,510,185,183 branch-loads # 1.371 G/sec
>> 1,812,716 iTLB-load-misses # 64.554 K/sec
>> 47,673,923,151 iTLB-loads # 1.698 G/sec
>> 675,853,661 L1-icache-load-misses # 24.068 M/sec
>> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>>
>> Total test time: 28.108048 seconds.
>>
>> Function Statistics - Before Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1419716
>> Event count: 99618088900
>>
>> Overhead Symbol
>> 21.42% lock_release
>> 21.26% lock_acquire
>> 20.88% arch_counter_get_cntvct
>> 14.32% _raw_spin_unlock_irq
>> 6.79% contpte_ptep_get
>> 2.20% test_contpte_perf
>> 1.82% follow_page_pte
>> 0.97% lock_acquired
>> 0.97% rcu_is_watching
>> 0.89% mlock_pte_range
>> 0.84% sched_clock_noinstr
>> 0.70% handle_softirqs.llvm.8218488130471452153
>> 0.58% test_preempt_disable_long
>> 0.57% _raw_spin_unlock_irqrestore
>> 0.54% arch_stack_walk
>> 0.51% vm_normal_folio
>> 0.48% check_preemption_disabled
>> 0.47% stackinfo_get_task
>> 0.36% try_grab_folio
>> 0.34% preempt_count
>> 0.32% trace_preempt_on
>> 0.29% trace_preempt_off
>> 0.24% debug_smp_processor_id
>>
>> Function Statistics - After Optimization
>> Arch: arm64
>> Event: cpu-cycles (type 0, config 0)
>> Samples: 1431006
>> Event count: 118856425042
>>
>> Overhead Symbol
>> 22.59% lock_release
>> 22.13% arch_counter_get_cntvct
>> 22.08% lock_acquire
>> 15.32% _raw_spin_unlock_irq
>> 2.26% test_contpte_perf
>> 1.50% follow_page_pte
>> 1.49% arch_stack_walk
>> 1.30% rcu_is_watching
>> 1.09% lock_acquired
>> 1.07% sched_clock_noinstr
>> 0.88% handle_softirqs.llvm.12507768597002095717
>> 0.88% trace_preempt_off
>> 0.76% _raw_spin_unlock_irqrestore
>> 0.61% check_preemption_disabled
>> 0.52% trace_preempt_on
>> 0.50% mlock_pte_range
>> 0.43% try_grab_folio
>> 0.41% folio_mark_accessed
>> 0.40% vm_normal_folio
>> 0.38% test_preempt_disable_long
>> 0.28% contpte_ptep_get
>> 0.27% __traceiter_android_rvh_preempt_disable
>> 0.26% debug_smp_processor_id
>> 0.24% return_address
>> 0.20% __pte_offset_map_lock
>> 0.19% unwind_next_frame_record
>>
>> If my test program is correct, the results show a significant performance
>> improvement in both the overall instruction count and the execution time of
>> contpte_ptep_get.
>>
>> If any reviewers have time, please also test it on your machines for comparison.
>> I have enabled THP and hugepages-64kB.
>>
>> Test Function:
>> ---
>> #define PAGE_SIZE 4096
>> #define CONT_PTES 16
>> #define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
>>
>> void rwdata(char *buf)
>> {
>> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
>> buf[i] = 'a';
>> volatile char c = buf[i];
>> }
>> }
>> void test_contpte_perf()
>> {
>> char *buf;
>> int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
>> if (ret != 0) {
>> perror("posix_memalign failed");
>> exit(EXIT_FAILURE);
>> }
>>
>> rwdata(buf);
>>
>> for (int j = 0; j < 500; j++) {
>> mlock(buf, TEST_SIZE);
>>
>> rwdata(buf);
>>
>> munlock(buf, TEST_SIZE);
>
>This is a microbenchmark in a pathological case and it's showing ~11%
>improvement. But in principle I'm ok with it. I have some comments on the actual
>change though, which I'll send through separately by email.
OK. Please refer to my reply to that email.
Thanks,
Xavier
^ permalink raw reply [flat|nested] 49+ messages in thread