* [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
@ 2025-05-10 12:59 Xavier Xia
2025-06-03 8:00 ` Xavier
2025-06-04 16:02 ` Ryan Roberts
0 siblings, 2 replies; 10+ messages in thread
From: Xavier Xia @ 2025-05-10 12:59 UTC (permalink / raw)
To: 21cnbao, ryan.roberts, dev.jain, ioworker0
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, will, willy, xavier_qy, ziy, Barry Song
This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
functions by adding early-termination logic. Once the dirty and young bits
of orig_pte are both known to be set, the remaining PTEs no longer need to
be checked for those bits, so the redundant bit-setting work in the loop is
skipped. This reduces unnecessary iterations and improves performance.
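
To illustrate the idea, here is a minimal userspace sketch of the same scan
pattern (the struct and function names are invented for this example; it is
not the kernel code, which appears in the diff below):

#include <stdbool.h>
#include <stddef.h>

struct flags { bool dirty; bool young; };

/* Fold the per-entry flags into one result, stopping the scan as soon
 * as both bits are known to be set. */
static struct flags fold_flags(const struct flags *p, size_t n)
{
	struct flags out = { false, false };

	for (size_t i = 0; i < n; i++) {
		if (p[i].dirty) {
			out.dirty = true;
			/* Only "young" is still unknown: scan for it alone. */
			for (; i < n; i++) {
				if (p[i].young) {
					out.young = true;
					break;
				}
			}
			break;
		}
		if (p[i].young) {
			out.young = true;
			/* Only "dirty" is still unknown: scan for it alone. */
			for (i++; i < n; i++) {
				if (p[i].dirty) {
					out.dirty = true;
					break;
				}
			}
			break;
		}
	}
	return out;
}

In the worst case this still visits each entry once, as before; in the best
case (the first entry is both dirty and young) it stops after a single
iteration.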
To verify the optimization, a test function was designed. Its execution time
and instruction counts were traced with perf, and the following results were
obtained on a Qualcomm mobile SoC:
Test Code:
#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT 8

void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}

void clear_young_dirty(char *buf)
{
	if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
	if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
		perror("madvise cold failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
}

void set_one_young(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
		volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
	}
}

void test_contpte_perf(void)
{
	char *buf;
	int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
				 TEST_SIZE);
	/* The modulo must bind to (CONT_PTES * PAGE_SIZE) as a whole. */
	if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
	clear_young_dirty(buf);
#endif
#if TEST_CASE2
	set_one_young(buf);
#endif

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);
		munlock(buf, TEST_SIZE);
	}
	free(buf);
}
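
The listing above is not a complete program; a minimal harness around it
might look as follows (a sketch: the headers, main() and the perf invocation
are assumptions, since the author's originals were not posted):

/*
 * Build and run, e.g.:
 *   gcc -O2 test_contpte.c -o test_contpte
 *   perf stat -- ./test_contpte
 */
#define _DEFAULT_SOURCE		/* MADV_FREE/MADV_COLD in <sys/mman.h> */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define TEST_CASE2 0		/* scenario selection, see below */
#define TEST_CASE3 0

/* ... rwdata(), clear_young_dirty(), set_one_young() and
 * test_contpte_perf() exactly as listed above ... */

int main(void)
{
	test_contpte_perf();
	return 0;
}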
Descriptions of the three test scenarios

Scenario 1
  All 16 PTEs are both dirty and young.
  #define TEST_CASE2 0
  #define TEST_CASE3 0

Scenario 2
  Among the 16 PTEs, only the one at index YOUNG_BIT (8) is young, and none
  are dirty.
  #define TEST_CASE2 1
  #define TEST_CASE3 0

Scenario 3
  None of the 16 PTEs are young or dirty.
  #define TEST_CASE2 0
  #define TEST_CASE3 1
Test results

|Scenario 1         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    37912436160|     18731580031|
|test time (s)      |         4.2797|          2.2949|
|Overhead of        |               |                |
|contpte_ptep_get() |         21.31%|           4.80%|

|Scenario 2         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36701270862|     36115790086|
|test time (s)      |         3.2335|          3.0874|
|Overhead of        |               |                |
|contpte_ptep_get() |         32.26%|          33.57%|

|Scenario 3         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36706279735|     36750881878|
|test time (s)      |         3.2008|          3.1249|
|Overhead of        |               |                |
|contpte_ptep_get() |         31.94%|          34.59%|
For Scenario 1, the optimized code achieves an instruction-count benefit of
50.59% and a time benefit of 46.38%.
For Scenario 2, the optimized code achieves an instruction-count benefit of
1.6% and a time benefit of 4.5%.
For Scenario 3, since none of the PTEs are young or dirty, the optimized code
takes the same branches as the original code; accordingly, its test results
are close to those of the original code.
The test function demonstrates that the optimization of contpte_ptep_get is
effective: the loop exits as soon as both flags have been accumulated, so it
never examines more PTEs than the original code. Since the logic of
contpte_ptep_get_lockless is similar to that of contpte_ptep_get, the same
optimization scheme is also adopted for it.
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Xavier Xia <xavier_qy@163.com>
---
Changes in v6:
- Move prot = pte_pgprot(pte_mkold(pte_mkclean(pte))) into
  contpte_is_consistent(), as suggested by Barry.
- Link to v5: https://lore.kernel.org/all/20250509122728.2379466-1-xavier_qy@163.com/
Changes in v5:
- Replace macro CHECK_CONTPTE_CONSISTENCY with inline function contpte_is_consistent
for improved readability and clarity, as suggested by Barry.
- Link to v4: https://lore.kernel.org/all/20250508070353.2370826-1-xavier_qy@163.com/
Changes in v4:
- Convert macro CHECK_CONTPTE_FLAG to an internal loop for better readability.
- Refactor contpte_ptep_get_lockless using the same optimization logic, as suggested by Ryan.
- Link to v3: https://lore.kernel.org/all/3d338f91.8c71.1965cd8b1b8.Coremail.xavier_qy@163.com/
---
arch/arm64/mm/contpte.c | 74 +++++++++++++++++++++++++++++++++++------
1 file changed, 64 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..71efe7dff0ad 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -169,17 +169,46 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);

-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
-
-		if (pte_young(pte))
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
+
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}

 	return orig_pte;
 }
 EXPORT_SYMBOL_GPL(contpte_ptep_get);

+static inline bool contpte_is_consistent(pte_t pte, unsigned long pfn,
+					 pgprot_t orig_prot)
+{
+	pgprot_t prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+	return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
+	       pgprot_val(prot) == pgprot_val(orig_prot);
+}
+
 pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 {
 	/*
@@ -202,7 +231,6 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 	pgprot_t orig_prot;
 	unsigned long pfn;
 	pte_t orig_pte;
-	pgprot_t prot;
 	pte_t *ptep;
 	pte_t pte;
 	int i;
@@ -219,18 +247,44 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)

 	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
 		pte = __ptep_get(ptep);
-		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

-		if (!pte_valid_cont(pte) ||
-		    pte_pfn(pte) != pfn ||
-		    pgprot_val(prot) != pgprot_val(orig_prot))
+		if (!contpte_is_consistent(pte, pfn, orig_prot))
 			goto retry;

-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+
+				if (!contpte_is_consistent(pte, pfn, orig_prot))
+					goto retry;
+
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}

-		if (pte_young(pte))
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			pfn++;
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+
+				if (!contpte_is_consistent(pte, pfn, orig_prot))
+					goto retry;
+
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}

 	return orig_pte;
--
2.34.1
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-10 12:59 [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
@ 2025-06-03 8:00 ` Xavier
2025-06-04 2:31 ` [PATCH " Andrew Morton
2025-06-04 16:02 ` Ryan Roberts
1 sibling, 1 reply; 10+ messages in thread
From: Xavier @ 2025-06-03 8:00 UTC (permalink / raw)
To: will, ryan.roberts, 21cnbao, dev.jain, ioworker0
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, willy, ziy, Barry Song
Hi All,
It has been three weeks since the submission of Patch v6. I take the liberty of
asking whether you still have any comments or suggestions on this version,
and whether the test data indicates that it is ready for merging. I sincerely
look forward to your valuable feedback and guidance.
--
Thanks,
Xavier
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-06-03 8:00 ` Xavier
@ 2025-06-04 2:31 ` Andrew Morton
2025-06-04 12:54 ` Ryan Roberts
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2025-06-04 2:31 UTC (permalink / raw)
To: Xavier
Cc: will, ryan.roberts, 21cnbao, dev.jain, ioworker0, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, willy, ziy,
Barry Song
On Tue, 3 Jun 2025 16:00:11 +0800 (CST) Xavier <xavier_qy@163.com> wrote:
> It has been three weeks since the submission of Patch v6. I take the liberty of
> asking whether you still have any comments or suggestions on this version,
> and whether the test data indicates that it is ready for merging. I sincerely
> look forward to your valuable feedback and guidance.
It was late in the 6.15-rcX cycle and I was hoping that arm people
would handle this.
It would be better if the patches had cc:linux-mm@kvack.org.
I'll grab it and shall include it in -next after -rc1. If ARM people
want to grab it, please do so and I'll autodrop the mm.git copy.
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-06-04 2:31 ` [PATCH " Andrew Morton
@ 2025-06-04 12:54 ` Ryan Roberts
0 siblings, 0 replies; 10+ messages in thread
From: Ryan Roberts @ 2025-06-04 12:54 UTC (permalink / raw)
To: Andrew Morton, Xavier
Cc: will, 21cnbao, dev.jain, ioworker0, catalin.marinas, david, gshan,
linux-arm-kernel, linux-kernel, willy, ziy, Barry Song
On 04/06/2025 03:31, Andrew Morton wrote:
> On Tue, 3 Jun 2025 16:00:11 +0800 (CST) Xavier <xavier_qy@163.com> wrote:
>
>> It has been three weeks since the submission of Patch v6. I take the liberty of
>> asking whether you still have any comments or suggestions on this version,
>> and whether the test data indicates that it is ready for merging. I sincerely
>> look forward to your valuable feedback and guidance.
>
> It was late in the 6.15-rcX cycle and I was hoping that arm people
> would handle this.
>
> It would be better if the patches had cc:linux-mm@kvack.org.
>
> I'll grab it and shall include it in -next after -rc1. If ARM people
> want to grab it, please do so and I'll autodrop the mm.git copy.
>
Apologies for the radio silence; as Catalin and I both said before, the arm
email servers drop all mail from Xavier <xavier_qy@163.com> on the floor. I'm
not sure why, but any chance you could use a different email account to make
this smoother?

I'm doing some testing of this now and will respond properly in the next few
days. But I don't personally like the complexity of the patch, and am not
totally clear on whether it is likely to show a real benefit in real-world
use cases (as I commented previously).

I'll respond properly once I have some results.

Given this is only touching arm64, I'd really prefer that this go via the
arm64 tree, if it goes in at all.
Thanks,
Ryan
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-05-10 12:59 [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
2025-06-03 8:00 ` Xavier
@ 2025-06-04 16:02 ` Ryan Roberts
2025-06-05 5:54 ` Xavier Xia
1 sibling, 1 reply; 10+ messages in thread
From: Ryan Roberts @ 2025-06-04 16:02 UTC (permalink / raw)
To: Xavier Xia, 21cnbao, dev.jain, ioworker0
Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
linux-kernel, will, willy, ziy, Barry Song
On 10/05/2025 13:59, Xavier Xia wrote:
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
>
> In order to verify the optimization performance, a test function has been
> designed. The function's execution time and instruction statistics have
> been traced using perf, and the following are the operation results on a
> certain Qualcomm mobile phone chip:
>
> Test Code:
nit: It would have been good to include the source for the whole program,
including #includes and the main() function to make it quicker for others to get
up and running.
>
> #define PAGE_SIZE 4096
> #define CONT_PTES 16
> #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> #define YOUNG_BIT 8
> void rwdata(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> buf[i] = 'a';
> volatile char c = buf[i];
> }
> }
> void clear_young_dirty(char *buf)
> {
> if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> perror("madvise free failed");
> free(buf);
> exit(EXIT_FAILURE);
> }
> if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> perror("madvise free failed");
> free(buf);
> exit(EXIT_FAILURE);
> }
nit: MADV_FREE clears both young and dirty so I don't think MADV_COLD is
required? (MADV_COLD only clears young I think?)
> }
> void set_one_young(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> }
> }
>
> void test_contpte_perf() {
> char *buf;
> int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
> TEST_SIZE);
> if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
> perror("posix_memalign failed");
> exit(EXIT_FAILURE);
> }
>
> rwdata(buf);
> #if TEST_CASE2 || TEST_CASE3
> clear_young_dirty(buf);
> #endif
> #if TEST_CASE2
> set_one_young(buf);
> #endif
>
> for (int j = 0; j < 500; j++) {
> mlock(buf, TEST_SIZE);
>
> munlock(buf, TEST_SIZE);
> }
> free(buf);
> }
>
> Descriptions of three test scenarios
>
> Scenario 1
> The data of all 16 PTEs are both dirty and young.
> #define TEST_CASE2 0
> #define TEST_CASE3 0
>
> Scenario 2
> Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> #define TEST_CASE2 1
> #define TEST_CASE3 0
>
> Scenario 3
> Among the 16 PTEs, there are neither young nor dirty ones.
> #define TEST_CASE2 0
> #define TEST_CASE3 1
>
> Test results
>
> |Scenario 1 | Original| Optimized|
> |-------------------|---------------|----------------|
> |instructions | 37912436160| 18731580031|
> |test time | 4.2797| 2.2949|
> |overhead of | | |
> |contpte_ptep_get() | 21.31%| 4.80%|
>
> |Scenario 2 | Original| Optimized|
> |-------------------|---------------|----------------|
> |instructions | 36701270862| 36115790086|
> |test time | 3.2335| 3.0874|
> |Overhead of | | |
> |contpte_ptep_get() | 32.26%| 33.57%|
>
> |Scenario 3 | Original| Optimized|
> |-------------------|---------------|----------------|
> |instructions | 36706279735| 36750881878|
> |test time | 3.2008| 3.1249|
> |Overhead of | | |
> |contpte_ptep_get() | 31.94%| 34.59%|
>
> For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
> and a time benefit of 46.38%.
> For Scenario 2, optimized code can achieve an instruction count benefit of
> 1.6% and a time benefit of 4.5%.
> For Scenario 3, since all the PTEs have neither the young nor the dirty
> flag, the branches taken by optimized code should be the same as those of
> the original code. In fact, the test results of optimized code seem to be
> closer to those of the original code.
I re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.
Scenario 1: reduced to 56% of baseline execution time
Scenario 2: reduced to 89% of baseline execution time
Scenario 3: reduced to 91% of baseline execution time
I'm pretty amazed that scenario 3 got faster given it is doing the same number
of loops.
>
> It can be proven through test function that the optimization for
> contpte_ptep_get is effective. Since the logic of contpte_ptep_get_lockless
> is similar to that of contpte_ptep_get, the same optimization scheme is
> also adopted for it.
>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Xavier Xia <xavier_qy@163.com>
I don't love the extra complexity, but this version is much tidier. While the
micro-benchmark is clearly contrived, it shows that there will be cases where it
will be faster and there are no cases where it is slower. This will probably be
more valuable for 16K kernels because the number of PTEs in a contpte block is
128 there:
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> Changes in v6:
> - Move prot = pte_pgprot(pte_mkold(pte_mkclean(pte))) into the contpte_is_consistent(),
> as suggested by Barry.
> - Link to v5: https://lore.kernel.org/all/20250509122728.2379466-1-xavier_qy@163.com/
>
> Changes in v5:
> - Replace macro CHECK_CONTPTE_CONSISTENCY with inline function contpte_is_consistent
> for improved readability and clarity, as suggested by Barry.
> - Link to v4: https://lore.kernel.org/all/20250508070353.2370826-1-xavier_qy@163.com/
>
> Changes in v4:
> - Convert macro CHECK_CONTPTE_FLAG to an internal loop for better readability.
> - Refactor contpte_ptep_get_lockless using the same optimization logic, as suggested by Ryan.
> - Link to v3: https://lore.kernel.org/all/3d338f91.8c71.1965cd8b1b8.Coremail.xavier_qy@163.com/
> ---
> arch/arm64/mm/contpte.c | 74 +++++++++++++++++++++++++++++++++++------
> 1 file changed, 64 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..71efe7dff0ad 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -169,17 +169,46 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> - if (pte_dirty(pte))
> + if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> -
> - if (pte_young(pte))
> + for (; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> + if (pte_young(pte)) {
> + orig_pte = pte_mkyoung(orig_pte);
> + break;
> + }
> + }
> + break;
> + }
> +
> + if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> + i++;
> + ptep++;
> + for (; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> + if (pte_dirty(pte)) {
> + orig_pte = pte_mkdirty(orig_pte);
> + break;
> + }
> + }
> + break;
> + }
> }
>
> return orig_pte;
> }
> EXPORT_SYMBOL_GPL(contpte_ptep_get);
>
> +static inline bool contpte_is_consistent(pte_t pte, unsigned long pfn,
> + pgprot_t orig_prot)
> +{
> + pgprot_t prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> + return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
> + pgprot_val(prot) == pgprot_val(orig_prot);
> +}
> +
> pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> {
> /*
> @@ -202,7 +231,6 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> pgprot_t orig_prot;
> unsigned long pfn;
> pte_t orig_pte;
> - pgprot_t prot;
> pte_t *ptep;
> pte_t pte;
> int i;
> @@ -219,18 +247,44 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>
> for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> pte = __ptep_get(ptep);
> - prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>
> - if (!pte_valid_cont(pte) ||
> - pte_pfn(pte) != pfn ||
> - pgprot_val(prot) != pgprot_val(orig_prot))
> + if (!contpte_is_consistent(pte, pfn, orig_prot))
> goto retry;
>
> - if (pte_dirty(pte))
> + if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> + for (; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> +
> + if (!contpte_is_consistent(pte, pfn, orig_prot))
> + goto retry;
> +
> + if (pte_young(pte)) {
> + orig_pte = pte_mkyoung(orig_pte);
> + break;
> + }
> + }
> + break;
I considered for a while whether it is safe for contpte_ptep_get_lockless() to
exit early without having seen every PTE in the contpte block and confirmed
that they are all consistent. I eventually concluded that it is: as long as
all the PTEs that it does check are consistent, this is fine.
> + }
>
> - if (pte_young(pte))
> + if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> + i++;
> + ptep++;
> + pfn++;
> + for (; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> +
> + if (!contpte_is_consistent(pte, pfn, orig_prot))
> + goto retry;
> +
> + if (pte_dirty(pte)) {
> + orig_pte = pte_mkdirty(orig_pte);
> + break;
> + }
> + }
> + break;
> + }
> }
>
> return orig_pte;
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-06-04 16:02 ` Ryan Roberts
@ 2025-06-05 5:54 ` Xavier Xia
2025-06-05 7:15 ` Ryan Roberts
0 siblings, 1 reply; 10+ messages in thread
From: Xavier Xia @ 2025-06-05 5:54 UTC (permalink / raw)
To: Ryan Roberts
Cc: Xavier Xia, 21cnbao, dev.jain, ioworker0, akpm, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy,
Barry Song, linux-mm
Hi Ryan,
Thank you for your review, and for reproducing and verifying the test cases.
I am using a Gmail address to reply to your message, hoping you can receive it.
Please check the details below.
On Thu, Jun 5, 2025 at 11:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/05/2025 13:59, Xavier Xia wrote:
> > This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> > function by adding early termination logic. It checks if the dirty and
> > young bits of orig_pte are already set and skips redundant bit-setting
> > operations during the loop. This reduces unnecessary iterations and
> > improves performance.
> >
> > In order to verify the optimization performance, a test function has been
> > designed. The function's execution time and instruction statistics have
> > been traced using perf, and the following are the operation results on a
> > certain Qualcomm mobile phone chip:
> >
> > Test Code:
>
> nit: It would have been good to include the source for the whole program,
> including #includes and the main() function to make it quicker for others to get
> up and running.
OK, I will pay attention to it in the future. This test case is quite
simple, so I didn't add it.
>
> >
> > #define PAGE_SIZE 4096
> > #define CONT_PTES 16
> > #define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
> > #define YOUNG_BIT 8
> > void rwdata(char *buf)
> > {
> > for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> > buf[i] = 'a';
> > volatile char c = buf[i];
> > }
> > }
> > void clear_young_dirty(char *buf)
> > {
> > if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> > perror("madvise free failed");
> > free(buf);
> > exit(EXIT_FAILURE);
> > }
> > if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> > perror("madvise free failed");
> > free(buf);
> > exit(EXIT_FAILURE);
> > }
>
> nit: MADV_FREE clears both young and dirty so I don't think MADV_COLD is
> required? (MADV_COLD only clears young I think?)
You're right, MADV_COLD here can probably be removed.
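If so, the helper would reduce to the single MADV_FREE call, e.g. something
like:

void clear_young_dirty(char *buf)
{
	/* MADV_FREE alone clears both the young and dirty bits. */
	if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
}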
>
> > [...]
>
> I re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.
>
> Scenario 1: reduced to 56% of baseline execution time
> Scenario 2: reduced to 89% of baseline execution time
> Scenario 3: reduced to 91% of baseline execution time
>
> I'm pretty amazed that scenario 3 got faster given it is doing the same number
> of loops.
It seems that the data you obtained is similar to my test data. For scenario 3,
it's faster even when running the same code, which I can't quite figure out
either.
> >
> > It can be proven through test function that the optimization for
> > contpte_ptep_get is effective. Since the logic of contpte_ptep_get_lockless
> > is similar to that of contpte_ptep_get, the same optimization scheme is
> > also adopted for it.
> >
> > Reviewed-by: Barry Song <baohua@kernel.org>
> > Signed-off-by: Xavier Xia <xavier_qy@163.com>
>
> I don't love the extra complexity, but this version is much tidier. While the
> micro-benchmark is clearly contrived, it shows that there will be cases where it
> will be faster and there are no cases where it is slower. This will probably be
> more valuable for 16K kernels because the number of PTEs in a contpte block is
> 128 there:
Okay, this version has been revised multiple times based on your previous
feedback and Barry's comments, and it seems much less complicated to
understand now. :)
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Tested-by: Ryan Roberts <ryan.roberts@arm.com>
>
> > [...]
> > - if (pte_dirty(pte))
> > + if (pte_dirty(pte)) {
> > orig_pte = pte_mkdirty(orig_pte);
> > + for (; i < CONT_PTES; i++, ptep++, pfn++) {
> > + pte = __ptep_get(ptep);
> > +
> > + if (!contpte_is_consistent(pte, pfn, orig_prot))
> > + goto retry;
> > +
> > + if (pte_young(pte)) {
> > + orig_pte = pte_mkyoung(orig_pte);
> > + break;
> > + }
> > + }
> > + break;
>
> I considered for a while whether it is safe for contpte_ptep_get_lockless() to
> exit early without having seen every PTE in the contpte block and confirmed
> that they are all consistent. I eventually concluded that it is: as long as
> all the PTEs that it does check are consistent, this is fine.
So, it looks like my changes here will be okay.
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-06-05 5:54 ` Xavier Xia
@ 2025-06-05 7:15 ` Ryan Roberts
2025-06-20 7:56 ` Xavier Xia
0 siblings, 1 reply; 10+ messages in thread
From: Ryan Roberts @ 2025-06-05 7:15 UTC (permalink / raw)
To: Xavier Xia
Cc: Xavier Xia, 21cnbao, dev.jain, ioworker0, akpm, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, will, willy, ziy,
Barry Song, linux-mm
On 05/06/2025 06:54, Xavier Xia wrote:
> Hi Ryan,
>
> Thank you for your review, and for reproducing and verifying the test cases.
> I am using a Gmail address to reply to your message, hoping you can receive it.
> Please check the details below.
Ahh yes, this arrived in my inbox without issue!
Thanks,
Ryan
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-06-05 7:15 ` Ryan Roberts
@ 2025-06-20 7:56 ` Xavier Xia
2025-06-22 22:34 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Xavier Xia @ 2025-06-20 7:56 UTC (permalink / raw)
To: Ryan Roberts, will
Cc: Xavier Xia, 21cnbao, dev.jain, ioworker0, akpm, catalin.marinas,
david, gshan, linux-arm-kernel, linux-kernel, willy, ziy,
Barry Song, linux-mm
Hi all,
May I follow up: does this patch require any further changes, or does it now
meet the criteria for merging?
--
Thanks,
Xavier
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-06-20 7:56 ` Xavier Xia
@ 2025-06-22 22:34 ` Andrew Morton
2025-06-24 14:02 ` Xavier Xia
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2025-06-22 22:34 UTC (permalink / raw)
To: Xavier Xia
Cc: Ryan Roberts, will, Xavier Xia, 21cnbao, dev.jain, ioworker0,
catalin.marinas, david, gshan, linux-arm-kernel, linux-kernel,
willy, ziy, Barry Song, linux-mm
On Fri, 20 Jun 2025 15:56:55 +0800 Xavier Xia <xavier.qyxia@gmail.com> wrote:
> May I follow up: does this patch require any further changes, or does it now
> meet the criteria for merging?
I think a refresh/retest/resend would be best at this time, please; it will
help people take a fresh look at it.
* Re: [PATCH v6] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
2025-06-22 22:34 ` Andrew Morton
@ 2025-06-24 14:02 ` Xavier Xia
0 siblings, 0 replies; 10+ messages in thread
From: Xavier Xia @ 2025-06-24 14:02 UTC (permalink / raw)
To: Andrew Morton
Cc: Ryan Roberts, will, Xavier Xia, 21cnbao, dev.jain, ioworker0,
catalin.marinas, david, gshan, linux-arm-kernel, linux-kernel,
willy, ziy, Barry Song, linux-mm
On Mon, Jun 23, 2025 at 6:34 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 20 Jun 2025 15:56:55 +0800 Xavier Xia <xavier.qyxia@gmail.com> wrote:
>
> > May I follow up: does this patch require any further changes, or does it now
> > meet the criteria for merging?
>
> I think a refresh/retest/resend would be best at this time, please; it will
> help people take a fresh look at it.
Okay, I'll try sending a new version. Thank you for the reminder.
--
Thanks,
Xavier