[PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep

* [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of  contpte_ptep_get
@ 2025-06-24 15:25 Xavier Xia
  2025-06-24 21:15 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Xavier Xia @ 2025-06-24 15:25 UTC (permalink / raw)
  To: ryan.roberts, will, 21cnbao, ioworker0, dev.jain
  Cc: akpm, catalin.marinas, david, gshan, linux-arm-kernel,
	linux-kernel, linux-mm, willy, xavier_qy, ziy, Xavier Xia,
	Barry Song

This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
functions by adding early termination logic. Once the dirty or young bit
has been set in orig_pte, the loop stops testing for that bit, and it
exits as soon as both bits are set. This avoids redundant bit-setting
operations and unnecessary iterations, which improves performance.

To verify the optimization, a test program was written. Its execution time
and instruction counts were measured with perf; the results below were
obtained on a Qualcomm mobile phone chip:

Test Code:
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <stdio.h>

	#define PAGE_SIZE 4096
	#define CONT_PTES 16
	#define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
	#define YOUNG_BIT 8
	void rwdata(char *buf)
	{
		for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
			buf[i] = 'a';
			volatile char c = buf[i];
		}
	}
	void clear_young_dirty(char *buf)
	{
		if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
			perror("madvise free failed");
			free(buf);
			exit(EXIT_FAILURE);
		}
		if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
			perror("madvise cold failed");
			free(buf);
			exit(EXIT_FAILURE);
		}
	}
	void set_one_young(char *buf)
	{
		for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
			volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
		}
	}

	void test_contpte_perf() {
		char *buf;
		int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
				TEST_SIZE);
		/* '%' and '*' have equal precedence, so parenthesize the alignment. */
		if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
			perror("posix_memalign failed");
			exit(EXIT_FAILURE);
		}

		rwdata(buf);
	#if TEST_CASE2 || TEST_CASE3
		clear_young_dirty(buf);
	#endif
	#if TEST_CASE2
		set_one_young(buf);
	#endif

		for (int j = 0; j < 500; j++) {
			mlock(buf, TEST_SIZE);

			munlock(buf, TEST_SIZE);
		}
		free(buf);
	}

	int main(void) 
	{
		test_contpte_perf();
		return 0;
	}

Descriptions of the three test scenarios:

Scenario 1
	All 16 PTEs are both dirty and young.
	#define TEST_CASE2 0
	#define TEST_CASE3 0

Scenario 2
	Among the 16 PTEs, only the one at index 8 (YOUNG_BIT) is young, and none is dirty.
	#define TEST_CASE2 1
	#define TEST_CASE3 0

Scenario 3
	None of the 16 PTEs is young or dirty.
	#define TEST_CASE2 0
	#define TEST_CASE3 1

Test results

|Scenario 1                  |       Original|       Optimized|
|----------------------------|---------------|----------------|
|instructions                |    37912436160|     18731580031|
|test time                   |         4.2797|          2.2949|
|contpte_ptep_get() overhead |         21.31%|           4.80%|

|Scenario 2                  |       Original|       Optimized|
|----------------------------|---------------|----------------|
|instructions                |    36701270862|     36115790086|
|test time                   |         3.2335|          3.0874|
|contpte_ptep_get() overhead |         32.26%|          33.57%|

|Scenario 3                  |       Original|       Optimized|
|----------------------------|---------------|----------------|
|instructions                |    36706279735|     36750881878|
|test time                   |         3.2008|          3.1249|
|contpte_ptep_get() overhead |         31.94%|          34.59%|

For Scenario 1, the optimized code reduces the instruction count by 50.59%
and the test time by 46.38%.
For Scenario 2, the optimized code reduces the instruction count by 1.6%
and the test time by 4.5%.
For Scenario 3, since no PTE has the young or the dirty flag, the optimized
code should take the same branches as the original code. The test results
of the optimized code are indeed close to those of the original code.

Ryan re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.

Scenario 1: reduced to 56% of baseline execution time
Scenario 2: reduced to 89% of baseline execution time
Scenario 3: reduced to 91% of baseline execution time

The test results show that the optimization of contpte_ptep_get is
effective. Since the logic of contpte_ptep_get_lockless is similar to that
of contpte_ptep_get, the same optimization is applied to it as well.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Xavier Xia <xavier.qyxia@gmail.com>
---
Changes in v7:
- Update the header files and main function of the test program, as well as Ryan's validation data.
- Link to v6: https://lore.kernel.org/all/20250510125948.2383778-1-xavier_qy@163.com/

Changes in v6:
- Move prot = pte_pgprot(pte_mkold(pte_mkclean(pte))) into the contpte_is_consistent(),
  as suggested by Barry.
- Link to v5: https://lore.kernel.org/all/20250509122728.2379466-1-xavier_qy@163.com/

Changes in v5:
- Replace macro CHECK_CONTPTE_CONSISTENCY with inline function contpte_is_consistent
  for improved readability and clarity, as suggested by Barry.
- Link to v4: https://lore.kernel.org/all/20250508070353.2370826-1-xavier_qy@163.com/

Changes in v4:
- Convert macro CHECK_CONTPTE_FLAG to an internal loop for better readability.
- Refactor contpte_ptep_get_lockless using the same optimization logic, as suggested by Ryan.
- Link to v3: https://lore.kernel.org/all/3d338f91.8c71.1965cd8b1b8.Coremail.xavier_qy@163.com/
---
 arch/arm64/mm/contpte.c | 74 +++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..71efe7dff0ad 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -169,17 +169,46 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
-
-		if (pte_young(pte))
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
+
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}
 
 	return orig_pte;
 }
 EXPORT_SYMBOL_GPL(contpte_ptep_get);
 
+static inline bool contpte_is_consistent(pte_t pte, unsigned long pfn,
+					pgprot_t orig_prot)
+{
+	pgprot_t prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+	return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
+			pgprot_val(prot) == pgprot_val(orig_prot);
+}
+
 pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 {
 	/*
@@ -202,7 +231,6 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 	pgprot_t orig_prot;
 	unsigned long pfn;
 	pte_t orig_pte;
-	pgprot_t prot;
 	pte_t *ptep;
 	pte_t pte;
 	int i;
@@ -219,18 +247,44 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 
 	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
 		pte = __ptep_get(ptep);
-		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
 
-		if (!pte_valid_cont(pte) ||
-		   pte_pfn(pte) != pfn ||
-		   pgprot_val(prot) != pgprot_val(orig_prot))
+		if (!contpte_is_consistent(pte, pfn, orig_prot))
 			goto retry;
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+
+				if (!contpte_is_consistent(pte, pfn, orig_prot))
+					goto retry;
+
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 
-		if (pte_young(pte))
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			pfn++;
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+
+				if (!contpte_is_consistent(pte, pfn, orig_prot))
+					goto retry;
+
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}
 
 	return orig_pte;
-- 
2.34.1




* Re: [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of  contpte_ptep_get
  2025-06-24 15:25 [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
@ 2025-06-24 21:15 ` Andrew Morton
  2025-07-01 13:58 ` Catalin Marinas
  2025-07-03 19:04 ` Catalin Marinas
  2 siblings, 0 replies; 5+ messages in thread
From: Andrew Morton @ 2025-06-24 21:15 UTC (permalink / raw)
  To: Xavier Xia
  Cc: ryan.roberts, will, 21cnbao, ioworker0, dev.jain, catalin.marinas,
	david, gshan, linux-arm-kernel, linux-kernel, linux-mm, willy,
	xavier_qy, ziy, Barry Song

On Tue, 24 Jun 2025 23:25:49 +0800 Xavier Xia <xavier.qyxia@gmail.com> wrote:

> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.

Thanks, I added this to mm.git for some testing.  But perhaps the ARM
tree would be a better merge path - more testing, at least.



* Re: [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of  contpte_ptep_get
  2025-06-24 15:25 [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
  2025-06-24 21:15 ` Andrew Morton
@ 2025-07-01 13:58 ` Catalin Marinas
  2025-07-02  9:00   ` Xavier Xia
  2025-07-03 19:04 ` Catalin Marinas
  2 siblings, 1 reply; 5+ messages in thread
From: Catalin Marinas @ 2025-07-01 13:58 UTC (permalink / raw)
  To: Xavier Xia
  Cc: ryan.roberts, will, 21cnbao, ioworker0, dev.jain, akpm, david,
	gshan, linux-arm-kernel, linux-kernel, linux-mm, willy, xavier_qy,
	ziy, Barry Song

On Tue, Jun 24, 2025 at 11:25:49PM +0800, Xavier Xia wrote:
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
> 
> [...]
> 
> Ryan re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.
> 
> Scenario 1: reduced to 56% of baseline execution time
> Scenario 2: reduced to 89% of baseline execution time
> Scenario 3: reduced to 91% of baseline execution time

Still not keen on microbenchmarks to justify such a change, but at least
the code is more readable than the macro approach in some earlier
versions.

Do you have any numbers to see how it compares with your v1:

https://lore.kernel.org/all/20250407092243.2207837-1-xavier_qy@163.com/

That patch was a lot simpler.

Thanks.

-- 
Catalin



* Re: [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-07-01 13:58 ` Catalin Marinas
@ 2025-07-02  9:00   ` Xavier Xia
  0 siblings, 0 replies; 5+ messages in thread
From: Xavier Xia @ 2025-07-02  9:00 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: ryan.roberts, will, 21cnbao, ioworker0, dev.jain, akpm, david,
	gshan, linux-arm-kernel, linux-kernel, linux-mm, willy, xavier_qy,
	ziy, Barry Song

Hi Catalin,


On Tue, Jul 1, 2025 at 9:59 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, Jun 24, 2025 at 11:25:49PM +0800, Xavier Xia wrote:
> > This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> > function by adding early termination logic. It checks if the dirty and
> > young bits of orig_pte are already set and skips redundant bit-setting
> > operations during the loop. This reduces unnecessary iterations and
> > improves performance.
> >
> > [...]
> >
> > Ryan re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.
> >
> > Scenario 1: reduced to 56% of baseline execution time
> > Scenario 2: reduced to 89% of baseline execution time
> > Scenario 3: reduced to 91% of baseline execution time
>
> Still not keen on microbenchmarks to justify such change but at least
> the code is more readable than the macro approach in some earlier
> version.
>
> Do you have any numbers to see how it compares with your v1:
>
> https://lore.kernel.org/all/20250407092243.2207837-1-xavier_qy@163.com/
>
> That patch was a lot simpler.
>

You can check the comparison data via:

https://lore.kernel.org/all/3d338f91.8c71.1965cd8b1b8.Coremail.xavier_qy@163.com/

v1 only optimizes the Scenario 1 case (where all PTEs are both young and
dirty), but it degrades performance in the other scenarios. Although the
current version increases code complexity, its performance gains are
significant.

--

Thanks,
Xavier



* Re: [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  2025-06-24 15:25 [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get Xavier Xia
  2025-06-24 21:15 ` Andrew Morton
  2025-07-01 13:58 ` Catalin Marinas
@ 2025-07-03 19:04 ` Catalin Marinas
  2 siblings, 0 replies; 5+ messages in thread
From: Catalin Marinas @ 2025-07-03 19:04 UTC (permalink / raw)
  To: ryan.roberts, will, dev.jain, Barry Song, Lance Yang, Xavier Xia
  Cc: akpm, david, gshan, linux-arm-kernel, linux-kernel, linux-mm,
	willy, xavier_qy, ziy

On Tue, 24 Jun 2025 23:25:49 +0800, Xavier Xia wrote:
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
> 
> In order to verify the optimization performance, a test function has been
> designed. The function's execution time and instruction statistics have
> been traced using perf, and the following are the operation results on a
> certain Qualcomm mobile phone chip:
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
      https://git.kernel.org/arm64/c/093ae7a033cf

-- 
Catalin



