linux-mm.kvack.org archive mirror
* [PATCH 0/2] Fix mincore() tmpfs test failure
@ 2025-03-26  3:38 Baolin Wang
  2025-03-26  3:38 ` [PATCH 1/2] selftests: mincore: fix tmpfs mincore " Baolin Wang
  2025-03-26  3:38 ` [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios Baolin Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Baolin Wang @ 2025-03-26  3:38 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, 21cnbao, ryan.roberts, ziy, baolin.wang, linux-mm,
	linux-kernel

Patch 1 fixes a failure in the mincore() tmpfs selftest. Patch 2, prompted
by testing the mincore() syscall, improves mincore() performance when mTHP
is enabled.

Baolin Wang (2):
  selftests: mincore: fix tmpfs mincore test failure
  mm: mincore: use folio_pte_batch() to batch process large folios

 mm/mincore.c                                  | 27 +++++++++++++++----
 .../selftests/mincore/mincore_selftest.c      | 25 +++++++++++++++--
 2 files changed, 45 insertions(+), 7 deletions(-)

-- 
2.43.5



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] selftests: mincore: fix tmpfs mincore test failure
  2025-03-26  3:38 [PATCH 0/2] Fix mincore() tmpfs test failure Baolin Wang
@ 2025-03-26  3:38 ` Baolin Wang
  2025-03-27 14:36   ` Zi Yan
  2025-04-01 12:54   ` David Hildenbrand
  2025-03-26  3:38 ` [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios Baolin Wang
  1 sibling, 2 replies; 23+ messages in thread
From: Baolin Wang @ 2025-03-26  3:38 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, 21cnbao, ryan.roberts, ziy, baolin.wang, linux-mm,
	linux-kernel

When running mincore test cases, I encountered the following failures:

"
mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 (0)
mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in memory
check_tmpfs_mmap: Test terminated by assertion
          FAIL  global.check_tmpfs_mmap
not ok 5 global.check_tmpfs_mmap
FAILED: 4 / 5 tests passed
"

The test case fails because my system automatically enables tmpfs large folio
allocation via the 'transparent_hugepage_tmpfs=always' kernel cmdline
parameter. However, the test case still expects the tmpfs mounted on /dev/shm
to allocate small folios, which leads to assertion failures when verifying
readahead pages.

To fix this issue, mount a new tmpfs instance on a dedicated test directory
with the 'huge=never' parameter so that no large folios are allocated, which
allows the test to pass.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 .../selftests/mincore/mincore_selftest.c      | 25 +++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
index e949a43a6145..e8d7a3a4739f 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -12,6 +12,7 @@
 #include <unistd.h>
 #include <stdlib.h>
 #include <sys/mman.h>
+#include <sys/mount.h>
 #include <string.h>
 #include <fcntl.h>
 
@@ -283,7 +284,7 @@ TEST(check_file_mmap)
 	free(vec);
 }
 
-
+#define INPUT_MAX 80
 /*
  * Test mincore() behavior on a page backed by a tmpfs file.  This test
  * performs the same steps as the previous one. However, we don't expect
@@ -291,6 +292,9 @@ TEST(check_file_mmap)
  */
 TEST(check_tmpfs_mmap)
 {
+	char tmpfs_template[] = "/tmp/check_tmpfs_XXXXXX";
+	const char *tmpfs_loc = mkdtemp(tmpfs_template);
+	char testfile[INPUT_MAX];
 	unsigned char *vec;
 	int vec_size;
 	char *addr;
@@ -300,6 +304,10 @@ TEST(check_tmpfs_mmap)
 	int i;
 	int ra_pages = 0;
 
+	ASSERT_NE(NULL, tmpfs_loc) {
+		TH_LOG("Can't mkdir tmpfs dentry\n");
+	}
+
 	page_size = sysconf(_SC_PAGESIZE);
 	vec_size = FILE_SIZE / page_size;
 	if (FILE_SIZE % page_size)
@@ -311,7 +319,18 @@ TEST(check_tmpfs_mmap)
 	}
 
 	errno = 0;
-	fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
+	/* Do not use large folios for tmpfs mincore testing */
+	retval = mount("tmpfs", tmpfs_loc, "tmpfs", 0, "huge=never,size=4M");
+	ASSERT_EQ(0, retval) {
+		TH_LOG("Unable to mount tmpfs for testing\n");
+	}
+
+	retval = snprintf(testfile, INPUT_MAX, "%s/test_file", tmpfs_loc);
+	ASSERT_GE(INPUT_MAX, retval) {
+		TH_LOG("Unable to create a tmpfs for testing\n");
+	}
+
+	fd = open(testfile, O_CREAT|O_RDWR, 0664);
 	ASSERT_NE(-1, fd) {
 		TH_LOG("Can't create temporary file: %s",
 			strerror(errno));
@@ -363,6 +382,8 @@ TEST(check_tmpfs_mmap)
 	munmap(addr, FILE_SIZE);
 	close(fd);
 	free(vec);
+	umount(tmpfs_loc);
+	rmdir(tmpfs_loc);
 }
 
 TEST_HARNESS_MAIN
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-26  3:38 [PATCH 0/2] Fix mincore() tmpfs test failure Baolin Wang
  2025-03-26  3:38 ` [PATCH 1/2] selftests: mincore: fix tmpfs mincore " Baolin Wang
@ 2025-03-26  3:38 ` Baolin Wang
  2025-03-27 10:49   ` Oscar Salvador
                     ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Baolin Wang @ 2025-03-26  3:38 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, 21cnbao, ryan.roberts, ziy, baolin.wang, linux-mm,
	linux-kernel

When testing the mincore() syscall, I observed that it takes longer with
64K mTHP enabled on my Arm64 server. The reason is that mincore_pte_range()
still checks each PTE individually, even when the PTEs are contiguous, which
is not efficient.

Thus we can use folio_pte_batch() to determine the number of present
contiguous PTEs in one batch, which improves performance. I tested the
mincore() syscall with 1G of anonymous memory populated with 64K mTHP, and
observed an obvious performance improvement:

w/o patch		w/ patch		changes
6022us			1115us			+81%

Moreover, I also tested mincore() with mTHP/THP disabled, and did not
see any obvious regression.
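
For reference, below is a minimal userspace sketch of this kind of
measurement. It is illustrative only, not the exact benchmark used above,
and it assumes a 64K mTHP size has already been enabled via sysfs:

/*
 * Illustrative sketch (not the exact benchmark above): populate 1G of
 * anonymous memory and time a single mincore() call over it. Assumes
 * 64K mTHP has already been enabled, e.g. via
 * /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled.
 */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define SIZE (1UL << 30)	/* 1G */

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *vec = malloc(SIZE / page_size);
	char *addr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	long us;

	if (!vec || addr == MAP_FAILED)
		return 1;

	memset(addr, 1, SIZE);	/* fault all pages in */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (mincore(addr, SIZE, vec))
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	us = (t1.tv_sec - t0.tv_sec) * 1000000 +
	     (t1.tv_nsec - t0.tv_nsec) / 1000;
	printf("mincore() took %ld us\n", us);
	return 0;
}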

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/mincore.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 832f29f46767..88be180b5550 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -21,6 +21,7 @@
 
 #include <linux/uaccess.h>
 #include "swap.h"
+#include "internal.h"
 
 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
@@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *ptep;
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	int step, i;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
-	for (; addr != end; ptep++, addr += PAGE_SIZE) {
+	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
 		pte_t pte = ptep_get(ptep);
 
+		step = 1;
 		/* We need to do cache lookup too for pte markers */
 		if (pte_none_mostly(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 						 vma, vec);
-		else if (pte_present(pte))
-			*vec = 1;
-		else { /* pte is a swap entry */
+		else if (pte_present(pte)) {
+			if (pte_batch_hint(ptep, pte) > 1) {
+				struct folio *folio = vm_normal_folio(vma, addr, pte);
+
+				if (folio && folio_test_large(folio)) {
+					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+								FPB_IGNORE_SOFT_DIRTY;
+					int max_nr = (end - addr) / PAGE_SIZE;
+
+					step = folio_pte_batch(folio, addr, ptep, pte,
+							max_nr, fpb_flags, NULL, NULL, NULL);
+				}
+			}
+
+			for (i = 0; i < step; i++)
+				vec[i] = 1;
+		} else { /* pte is a swap entry */
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
 			if (non_swap_entry(entry)) {
@@ -146,7 +163,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 #endif
 			}
 		}
-		vec++;
+		vec += step;
 	}
 	pte_unmap_unlock(ptep - 1, ptl);
 out:
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-26  3:38 ` [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios Baolin Wang
@ 2025-03-27 10:49   ` Oscar Salvador
  2025-03-27 11:54     ` Baolin Wang
  2025-03-27 14:08   ` Ryan Roberts
  2025-05-07  5:12   ` Dev Jain
  2 siblings, 1 reply; 23+ messages in thread
From: Oscar Salvador @ 2025-03-27 10:49 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, 21cnbao, ryan.roberts, ziy, linux-mm,
	linux-kernel

On Wed, Mar 26, 2025 at 11:38:11AM +0800, Baolin Wang wrote:
> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  		walk->action = ACTION_AGAIN;
>  		return 0;
>  	}
> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>  		pte_t pte = ptep_get(ptep);
>  
> +		step = 1;
>  		/* We need to do cache lookup too for pte markers */
>  		if (pte_none_mostly(pte))
>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>  						 vma, vec);
> -		else if (pte_present(pte))
> -			*vec = 1;
> -		else { /* pte is a swap entry */
> +		else if (pte_present(pte)) {
> +			if (pte_batch_hint(ptep, pte) > 1) {

AFAIU, you will only batch if CONT_PTE is set, but that is only true for arm64,
so we lose the ability to batch on e.g. x86 when we have contiguous
entries, right?

So why not have folio_pte_batch() take care of it directly, without involving
pte_batch_hint() here?
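
For context: on architectures that do not override it, pte_batch_hint() is
just the generic fallback below, so the batching path in the patch never
triggers there. A rough paraphrase of the definition in
include/linux/pgtable.h (simplified, not verbatim):

/*
 * Generic fallback: architectures without a contiguous-PTE hint always
 * report a batch of one. arm64 overrides this to return the number of
 * remaining entries in a contpte block when the contiguous bit is set.
 */
#ifndef pte_batch_hint
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif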

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-27 10:49   ` Oscar Salvador
@ 2025-03-27 11:54     ` Baolin Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-03-27 11:54 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, hughd, willy, david, 21cnbao, ryan.roberts, ziy, linux-mm,
	linux-kernel



On 2025/3/27 18:49, Oscar Salvador wrote:
> On Wed, Mar 26, 2025 at 11:38:11AM +0800, Baolin Wang wrote:
>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>   		walk->action = ACTION_AGAIN;
>>   		return 0;
>>   	}
>> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>   		pte_t pte = ptep_get(ptep);
>>   
>> +		step = 1;
>>   		/* We need to do cache lookup too for pte markers */
>>   		if (pte_none_mostly(pte))
>>   			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>   						 vma, vec);
>> -		else if (pte_present(pte))
>> -			*vec = 1;
>> -		else { /* pte is a swap entry */
>> +		else if (pte_present(pte)) {
>> +			if (pte_batch_hint(ptep, pte) > 1) {
> 
> AFAIU, you will only batch if the CONT_PTE is set, but that is only true for arm64,
> and so we lose the ability to batch in e.g: x86 when we have contiguous
> entries, right?
> 
> So why not have folio_pte_batch take care of it directly without involving
> pte_batch_hint here?

Good question; this was the first approach I tried.

However, I found an obvious performance regression with small folios
(where CONT_PTE is not set). I think the overhead introduced by
vm_normal_folio() and folio_pte_batch() is greater than the benefit
gained from batch processing small folios.

For large folios where CONT_PTE is set, ptep_get()--->contpte_ptep_get()
wastes a significant amount of CPU time, so using folio_pte_batch()
improves performance noticeably.
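
Roughly why: on arm64, reading one PTE of a contpte mapping has to fold the
access/dirty bits of the whole 16-entry block back into the returned PTE.
A simplified paraphrase of contpte_ptep_get() in arch/arm64/mm/contpte.c
(not verbatim):

pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	int i;

	/*
	 * Walk all CONT_PTES (16) entries of the block to gather the
	 * accumulated young/dirty state, i.e. ~16 memory reads for a
	 * single ptep_get() in the old per-PTE mincore loop.
	 */
	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte_t pte = __ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}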


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-26  3:38 ` [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios Baolin Wang
  2025-03-27 10:49   ` Oscar Salvador
@ 2025-03-27 14:08   ` Ryan Roberts
  2025-03-28 13:10     ` Oscar Salvador
  2025-03-30 19:57     ` Baolin Wang
  2025-05-07  5:12   ` Dev Jain
  2 siblings, 2 replies; 23+ messages in thread
From: Ryan Roberts @ 2025-03-27 14:08 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, 21cnbao, ziy, linux-mm, linux-kernel

On 25/03/2025 23:38, Baolin Wang wrote:
> When I tested the mincore() syscall, I observed that it takes longer with
> 64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range()
> still checks each PTE individually, even when the PTEs are contiguous,
> which is not efficient.
> 
> Thus we can use folio_pte_batch() to get the batch number of the present
> contiguous PTEs, which can improve the performance. I tested the mincore()
> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
> obvious performance improvement:
> 
> w/o patch		w/ patch		changes
> 6022us			1115us			+81%
> 
> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
> see any obvious regression.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/mincore.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 832f29f46767..88be180b5550 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -21,6 +21,7 @@
>  
>  #include <linux/uaccess.h>
>  #include "swap.h"
> +#include "internal.h"
>  
>  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>  			unsigned long end, struct mm_walk *walk)
> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	pte_t *ptep;
>  	unsigned char *vec = walk->private;
>  	int nr = (end - addr) >> PAGE_SHIFT;
> +	int step, i;
>  
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  		walk->action = ACTION_AGAIN;
>  		return 0;
>  	}
> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>  		pte_t pte = ptep_get(ptep);
>  
> +		step = 1;
>  		/* We need to do cache lookup too for pte markers */
>  		if (pte_none_mostly(pte))
>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>  						 vma, vec);
> -		else if (pte_present(pte))
> -			*vec = 1;
> -		else { /* pte is a swap entry */
> +		else if (pte_present(pte)) {
> +			if (pte_batch_hint(ptep, pte) > 1) {
> +				struct folio *folio = vm_normal_folio(vma, addr, pte);
> +
> +				if (folio && folio_test_large(folio)) {
> +					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> +								FPB_IGNORE_SOFT_DIRTY;
> +					int max_nr = (end - addr) / PAGE_SIZE;
> +
> +					step = folio_pte_batch(folio, addr, ptep, pte,
> +							max_nr, fpb_flags, NULL, NULL, NULL);
> +				}
> +			}

You could simplify to the following, I think, to avoid needing to grab the folio
and call folio_pte_batch():

			else if (pte_present(pte)) {
				int max_nr = (end - addr) / PAGE_SIZE;
				step = min(pte_batch_hint(ptep, pte), max_nr);
			} ...

I expect the regression you are seeing here is all due to calling ptep_get() for
every pte in the contpte batch, which will cause 16 memory reads per pte (to
gather the access/dirty bits). For small folios it's just 1 read per pte.
pte_batch_hint() will skip forward in blocks of 16, so you end up with the
same number as for the small folio case. You don't need all the fancy extras
that folio_pte_batch() gives you here.

Thanks,
Ryan


> +
> +			for (i = 0; i < step; i++)
> +				vec[i] = 1;
> +		} else { /* pte is a swap entry */
>  			swp_entry_t entry = pte_to_swp_entry(pte);
>  
>  			if (non_swap_entry(entry)) {
> @@ -146,7 +163,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  #endif
>  			}
>  		}
> -		vec++;
> +		vec += step;
>  	}
>  	pte_unmap_unlock(ptep - 1, ptl);
>  out:



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] selftests: mincore: fix tmpfs mincore test failure
  2025-03-26  3:38 ` [PATCH 1/2] selftests: mincore: fix tmpfs mincore " Baolin Wang
@ 2025-03-27 14:36   ` Zi Yan
  2025-03-30 19:47     ` Baolin Wang
  2025-04-01 12:54   ` David Hildenbrand
  1 sibling, 1 reply; 23+ messages in thread
From: Zi Yan @ 2025-03-27 14:36 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, 21cnbao, ryan.roberts, linux-mm,
	linux-kernel

On 25 Mar 2025, at 23:38, Baolin Wang wrote:

> When running mincore test cases, I encountered the following failures:
>
> "
> mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 (0)
> mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in memory
> check_tmpfs_mmap: Test terminated by assertion
>           FAIL  global.check_tmpfs_mmap
> not ok 5 global.check_tmpfs_mmap
> FAILED: 4 / 5 tests passed
> "
>
> The reason for the test case failure is that my system automatically enabled
> tmpfs large folio allocation by adding the 'transparent_hugepage_tmpfs=always'
> cmdline. However, the test case still expects the tmpfs mounted on /dev/shm to
> allocate small folios, which leads to assertion failures when verifying readahead
> pages.
>
> To fix this issue, remount tmpfs to a new test directory and set the 'huge=never'
> parameter to avoid allocating large folios, which can pass the test.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  .../selftests/mincore/mincore_selftest.c      | 25 +++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
>

<snip>

>
>  	errno = 0;
> -	fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
> +	/* Do not use large folios for tmpfs mincore testing */
> +	retval = mount("tmpfs", tmpfs_loc, "tmpfs", 0, "huge=never,size=4M");
> +	ASSERT_EQ(0, retval) {
> +		TH_LOG("Unable to mount tmpfs for testing\n");
> +	}
> +
> +	retval = snprintf(testfile, INPUT_MAX, "%s/test_file", tmpfs_loc);
> +	ASSERT_GE(INPUT_MAX, retval) {
> +		TH_LOG("Unable to create a tmpfs for testing\n");
> +	}
> +
> +	fd = open(testfile, O_CREAT|O_RDWR, 0664);

The file creation mode is changed from 0600 to 0664, but it probably does not
matter.

>  	ASSERT_NE(-1, fd) {
>  		TH_LOG("Can't create temporary file: %s",
>  			strerror(errno));
> @@ -363,6 +382,8 @@ TEST(check_tmpfs_mmap)
>  	munmap(addr, FILE_SIZE);
>  	close(fd);
>  	free(vec);
> +	umount(tmpfs_loc);
> +	rmdir(tmpfs_loc);
>  }
>
>  TEST_HARNESS_MAIN
> -- 
> 2.43.5

Otherwise, LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-27 14:08   ` Ryan Roberts
@ 2025-03-28 13:10     ` Oscar Salvador
  2025-03-30 19:57     ` Baolin Wang
  1 sibling, 0 replies; 23+ messages in thread
From: Oscar Salvador @ 2025-03-28 13:10 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Baolin Wang, akpm, hughd, willy, david, 21cnbao, ziy, linux-mm,
	linux-kernel

On Thu, Mar 27, 2025 at 10:08:56AM -0400, Ryan Roberts wrote:
> You could simplify to the following, I think, to avoid needing to grab the folio
> and call folio_pte_batch():
> 
> 			else if (pte_present(pte)) {
> 				int max_nr = (end - addr) / PAGE_SIZE;
> 				step = min(pte_batch_hint(ptep, pte), max_nr);
> 			} ...

Yes, I think this makes much more sense; in the end, as you said, we do
not really need what folio_pte_batch() gives us here.

With the API I am working on, batching will be done internally, so we
will not have to expose this here.

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] selftests: mincore: fix tmpfs mincore test failure
  2025-03-27 14:36   ` Zi Yan
@ 2025-03-30 19:47     ` Baolin Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-03-30 19:47 UTC (permalink / raw)
  To: Zi Yan
  Cc: akpm, hughd, willy, david, 21cnbao, ryan.roberts, linux-mm,
	linux-kernel



On 2025/3/27 22:36, Zi Yan wrote:
> On 25 Mar 2025, at 23:38, Baolin Wang wrote:
> 
>> When running mincore test cases, I encountered the following failures:
>>
>> "
>> mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 (0)
>> mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in memory
>> check_tmpfs_mmap: Test terminated by assertion
>>            FAIL  global.check_tmpfs_mmap
>> not ok 5 global.check_tmpfs_mmap
>> FAILED: 4 / 5 tests passed
>> "
>>
>> The reason for the test case failure is that my system automatically enabled
>> tmpfs large folio allocation by adding the 'transparent_hugepage_tmpfs=always'
>> cmdline. However, the test case still expects the tmpfs mounted on /dev/shm to
>> allocate small folios, which leads to assertion failures when verifying readahead
>> pages.
>>
>> To fix this issue, remount tmpfs to a new test directory and set the 'huge=never'
>> parameter to avoid allocating large folios, which can pass the test.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   .../selftests/mincore/mincore_selftest.c      | 25 +++++++++++++++++--
>>   1 file changed, 23 insertions(+), 2 deletions(-)
>>
> 
> <snip>
> 
>>
>>   	errno = 0;
>> -	fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
>> +	/* Do not use large folios for tmpfs mincore testing */
>> +	retval = mount("tmpfs", tmpfs_loc, "tmpfs", 0, "huge=never,size=4M");
>> +	ASSERT_EQ(0, retval) {
>> +		TH_LOG("Unable to mount tmpfs for testing\n");
>> +	}
>> +
>> +	retval = snprintf(testfile, INPUT_MAX, "%s/test_file", tmpfs_loc);
>> +	ASSERT_GE(INPUT_MAX, retval) {
>> +		TH_LOG("Unable to create a tmpfs for testing\n");
>> +	}
>> +
>> +	fd = open(testfile, O_CREAT|O_RDWR, 0664);
> 
> The fd permission is changed from 0600 to 0664, but it probably does not
> matter.

It is just a temp file, so it doesn't matter.

> 
>>   	ASSERT_NE(-1, fd) {
>>   		TH_LOG("Can't create temporary file: %s",
>>   			strerror(errno));
>> @@ -363,6 +382,8 @@ TEST(check_tmpfs_mmap)
>>   	munmap(addr, FILE_SIZE);
>>   	close(fd);
>>   	free(vec);
>> +	umount(tmpfs_loc);
>> +	rmdir(tmpfs_loc);
>>   }
>>
>>   TEST_HARNESS_MAIN
>> -- 
>> 2.43.5
> 
> Otherwise, LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Thanks for reviewing.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-27 14:08   ` Ryan Roberts
  2025-03-28 13:10     ` Oscar Salvador
@ 2025-03-30 19:57     ` Baolin Wang
  2025-04-01 10:45       ` Ryan Roberts
  1 sibling, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-03-30 19:57 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, 21cnbao, ziy, linux-mm, linux-kernel



On 2025/3/27 22:08, Ryan Roberts wrote:
> On 25/03/2025 23:38, Baolin Wang wrote:
>> When I tested the mincore() syscall, I observed that it takes longer with
>> 64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range()
>> still checks each PTE individually, even when the PTEs are contiguous,
>> which is not efficient.
>>
>> Thus we can use folio_pte_batch() to get the batch number of the present
>> contiguous PTEs, which can improve the performance. I tested the mincore()
>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>> obvious performance improvement:
>>
>> w/o patch		w/ patch		changes
>> 6022us			1115us			+81%
>>
>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>> see any obvious regression.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/mincore.c | 27 ++++++++++++++++++++++-----
>>   1 file changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index 832f29f46767..88be180b5550 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -21,6 +21,7 @@
>>   
>>   #include <linux/uaccess.h>
>>   #include "swap.h"
>> +#include "internal.h"
>>   
>>   static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>>   			unsigned long end, struct mm_walk *walk)
>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>   	pte_t *ptep;
>>   	unsigned char *vec = walk->private;
>>   	int nr = (end - addr) >> PAGE_SHIFT;
>> +	int step, i;
>>   
>>   	ptl = pmd_trans_huge_lock(pmd, vma);
>>   	if (ptl) {
>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>   		walk->action = ACTION_AGAIN;
>>   		return 0;
>>   	}
>> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>   		pte_t pte = ptep_get(ptep);
>>   
>> +		step = 1;
>>   		/* We need to do cache lookup too for pte markers */
>>   		if (pte_none_mostly(pte))
>>   			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>   						 vma, vec);
>> -		else if (pte_present(pte))
>> -			*vec = 1;
>> -		else { /* pte is a swap entry */
>> +		else if (pte_present(pte)) {
>> +			if (pte_batch_hint(ptep, pte) > 1) {
>> +				struct folio *folio = vm_normal_folio(vma, addr, pte);
>> +
>> +				if (folio && folio_test_large(folio)) {
>> +					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>> +								FPB_IGNORE_SOFT_DIRTY;
>> +					int max_nr = (end - addr) / PAGE_SIZE;
>> +
>> +					step = folio_pte_batch(folio, addr, ptep, pte,
>> +							max_nr, fpb_flags, NULL, NULL, NULL);
>> +				}
>> +			}
> 
> You could simplify to the following, I think, to avoid needing to grab the folio
> and call folio_pte_batch():
> 
> 			else if (pte_present(pte)) {
> 				int max_nr = (end - addr) / PAGE_SIZE;
> 				step = min(pte_batch_hint(ptep, pte), max_nr);
> 			} ...
> 
> I expect the regression you are seeing here is all due to calling ptep_get() for
> every pte in the contpte batch, which will cause 16 memory reads per pte (to
> gather the access/dirty bits). For small folios its just 1 read per pte.

Right.

> pte_batch_hint() will skip forward in blocks of 16 so you now end up with the
> same number as for the small folio case. You don't need all the fancy extras
> that folio_pte_batch() gives you here.

Sounds reasonable. Your suggestion looks simple, but my method can batch 
the whole large folio (such as large folios containing more than 16 
contiguous PTEs) at once. Anyway, let me do some performance 
measurements for your suggestion. Thanks.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-30 19:57     ` Baolin Wang
@ 2025-04-01 10:45       ` Ryan Roberts
  2025-04-01 13:04         ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Ryan Roberts @ 2025-04-01 10:45 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, 21cnbao, ziy, linux-mm, linux-kernel

On 30/03/2025 15:57, Baolin Wang wrote:
> 
> 
> On 2025/3/27 22:08, Ryan Roberts wrote:
>> On 25/03/2025 23:38, Baolin Wang wrote:
>>> When I tested the mincore() syscall, I observed that it takes longer with
>>> 64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range()
>>> still checks each PTE individually, even when the PTEs are contiguous,
>>> which is not efficient.
>>>
>>> Thus we can use folio_pte_batch() to get the batch number of the present
>>> contiguous PTEs, which can improve the performance. I tested the mincore()
>>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>>> obvious performance improvement:
>>>
>>> w/o patch        w/ patch        changes
>>> 6022us            1115us            +81%
>>>
>>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>>> see any obvious regression.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>   1 file changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>> index 832f29f46767..88be180b5550 100644
>>> --- a/mm/mincore.c
>>> +++ b/mm/mincore.c
>>> @@ -21,6 +21,7 @@
>>>     #include <linux/uaccess.h>
>>>   #include "swap.h"
>>> +#include "internal.h"
>>>     static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long
>>> addr,
>>>               unsigned long end, struct mm_walk *walk)
>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
>>> addr, unsigned long end,
>>>       pte_t *ptep;
>>>       unsigned char *vec = walk->private;
>>>       int nr = (end - addr) >> PAGE_SHIFT;
>>> +    int step, i;
>>>         ptl = pmd_trans_huge_lock(pmd, vma);
>>>       if (ptl) {
>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
>>> addr, unsigned long end,
>>>           walk->action = ACTION_AGAIN;
>>>           return 0;
>>>       }
>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>           pte_t pte = ptep_get(ptep);
>>>   +        step = 1;
>>>           /* We need to do cache lookup too for pte markers */
>>>           if (pte_none_mostly(pte))
>>>               __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>                            vma, vec);
>>> -        else if (pte_present(pte))
>>> -            *vec = 1;
>>> -        else { /* pte is a swap entry */
>>> +        else if (pte_present(pte)) {
>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>>> +
>>> +                if (folio && folio_test_large(folio)) {
>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>> +
>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>> +                }
>>> +            }
>>
>> You could simplify to the following, I think, to avoid needing to grab the folio
>> and call folio_pte_batch():
>>
>>             else if (pte_present(pte)) {
>>                 int max_nr = (end - addr) / PAGE_SIZE;
>>                 step = min(pte_batch_hint(ptep, pte), max_nr);
>>             } ...
>>
>> I expect the regression you are seeing here is all due to calling ptep_get() for
>> every pte in the contpte batch, which will cause 16 memory reads per pte (to
>> gather the access/dirty bits). For small folios its just 1 read per pte.
> 
> Right.
> 
>> pte_batch_hint() will skip forward in blocks of 16 so you now end up with the
>> same number as for the small folio case. You don't need all the fancy extras
>> that folio_pte_batch() gives you here.
> 
> Sounds reasonable. Your suggestion looks simple, but my method can batch the
> whole large folio (such as large folios containing more than 16 contiguous PTEs)
> at once. 

Sure, but folio_pte_batch() just implements that with another loop that calls
pte_batch_hint(), so it all amounts to the same thing. In fact, there are some
extra checks in folio_pte_batch() that you don't need, so it might be a bit slower.

Thanks,
Ryan

> Anyway, let me do some performance measurements for your suggestion.
> Thanks.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] selftests: mincore: fix tmpfs mincore test failure
  2025-03-26  3:38 ` [PATCH 1/2] selftests: mincore: fix tmpfs mincore " Baolin Wang
  2025-03-27 14:36   ` Zi Yan
@ 2025-04-01 12:54   ` David Hildenbrand
  2025-04-07  3:49     ` Baolin Wang
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-04-01 12:54 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel

On 26.03.25 04:38, Baolin Wang wrote:
> When running mincore test cases, I encountered the following failures:
> 
> "
> mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 (0)
> mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in memory
> check_tmpfs_mmap: Test terminated by assertion
>            FAIL  global.check_tmpfs_mmap
> not ok 5 global.check_tmpfs_mmap
> FAILED: 4 / 5 tests passed
> "
> 
> The reason for the test case failure is that my system automatically enabled
> tmpfs large folio allocation by adding the 'transparent_hugepage_tmpfs=always'
> cmdline. However, the test case still expects the tmpfs mounted on /dev/shm to
> allocate small folios, which leads to assertion failures when verifying readahead
> pages.
> 
> To fix this issue, remount tmpfs to a new test directory and set the 'huge=never'
> parameter to avoid allocating large folios, which can pass the test.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   .../selftests/mincore/mincore_selftest.c      | 25 +++++++++++++++++--
>   1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
> index e949a43a6145..e8d7a3a4739f 100644
> --- a/tools/testing/selftests/mincore/mincore_selftest.c
> +++ b/tools/testing/selftests/mincore/mincore_selftest.c
> @@ -12,6 +12,7 @@
>   #include <unistd.h>
>   #include <stdlib.h>
>   #include <sys/mman.h>
> +#include <sys/mount.h>
>   #include <string.h>
>   #include <fcntl.h>
>   
> @@ -283,7 +284,7 @@ TEST(check_file_mmap)
>   	free(vec);
>   }
>   
> -
> +#define INPUT_MAX 80
>   /*
>    * Test mincore() behavior on a page backed by a tmpfs file.  This test
>    * performs the same steps as the previous one. However, we don't expect
> @@ -291,6 +292,9 @@ TEST(check_file_mmap)
>    */
>   TEST(check_tmpfs_mmap)
>   {
> +	char tmpfs_template[] = "/tmp/check_tmpfs_XXXXXX";
> +	const char *tmpfs_loc = mkdtemp(tmpfs_template);
> +	char testfile[INPUT_MAX];
>   	unsigned char *vec;
>   	int vec_size;
>   	char *addr;
> @@ -300,6 +304,10 @@ TEST(check_tmpfs_mmap)
>   	int i;
>   	int ra_pages = 0;
>   
> +	ASSERT_NE(NULL, tmpfs_loc) {
> +		TH_LOG("Can't mkdir tmpfs dentry\n");
> +	}
> +
>   	page_size = sysconf(_SC_PAGESIZE);
>   	vec_size = FILE_SIZE / page_size;
>   	if (FILE_SIZE % page_size)
> @@ -311,7 +319,18 @@ TEST(check_tmpfs_mmap)
>   	}
>   
>   	errno = 0;
> -	fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
> +	/* Do not use large folios for tmpfs mincore testing */
> +	retval = mount("tmpfs", tmpfs_loc, "tmpfs", 0, "huge=never,size=4M");
> +	ASSERT_EQ(0, retval) {
> +		TH_LOG("Unable to mount tmpfs for testing\n");
> +	}
> +
> +	retval = snprintf(testfile, INPUT_MAX, "%s/test_file", tmpfs_loc);
> +	ASSERT_GE(INPUT_MAX, retval) {
> +		TH_LOG("Unable to create a tmpfs for testing\n");
> +	}
> +
> +	fd = open(testfile, O_CREAT|O_RDWR, 0664);
>   	ASSERT_NE(-1, fd) {
>   		TH_LOG("Can't create temporary file: %s",
>   			strerror(errno));
> @@ -363,6 +382,8 @@ TEST(check_tmpfs_mmap)
>   	munmap(addr, FILE_SIZE);
>   	close(fd);
>   	free(vec);
> +	umount(tmpfs_loc);
> +	rmdir(tmpfs_loc);
>   }
>   
>   TEST_HARNESS_MAIN

Is there anything that cleans up the mount in case something goes wrong 
(we run into an assertion), or will the directory+mount stick around 
forever?

But I also wonder whether check_tmpfs_mmap() should be changed to live 
with the fact that readahead ("faultaround") could now happen. What's 
the reason for not doing that?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-04-01 10:45       ` Ryan Roberts
@ 2025-04-01 13:04         ` David Hildenbrand
  2025-04-07  6:33           ` Baolin Wang
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-04-01 13:04 UTC (permalink / raw)
  To: Ryan Roberts, Baolin Wang, akpm, hughd
  Cc: willy, 21cnbao, ziy, linux-mm, linux-kernel

On 01.04.25 12:45, Ryan Roberts wrote:
> On 30/03/2025 15:57, Baolin Wang wrote:
>>
>>
>> On 2025/3/27 22:08, Ryan Roberts wrote:
>>> On 25/03/2025 23:38, Baolin Wang wrote:
>>>> When I tested the mincore() syscall, I observed that it takes longer with
>>>> 64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range()
>>>> still checks each PTE individually, even when the PTEs are contiguous,
>>>> which is not efficient.
>>>>
>>>> Thus we can use folio_pte_batch() to get the batch number of the present
>>>> contiguous PTEs, which can improve the performance. I tested the mincore()
>>>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>>>> obvious performance improvement:
>>>>
>>>> w/o patch        w/ patch        changes
>>>> 6022us            1115us            +81%
>>>>
>>>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>>>> see any obvious regression.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>>    1 file changed, 22 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>>> index 832f29f46767..88be180b5550 100644
>>>> --- a/mm/mincore.c
>>>> +++ b/mm/mincore.c
>>>> @@ -21,6 +21,7 @@
>>>>      #include <linux/uaccess.h>
>>>>    #include "swap.h"
>>>> +#include "internal.h"
>>>>      static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long
>>>> addr,
>>>>                unsigned long end, struct mm_walk *walk)
>>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
>>>> addr, unsigned long end,
>>>>        pte_t *ptep;
>>>>        unsigned char *vec = walk->private;
>>>>        int nr = (end - addr) >> PAGE_SHIFT;
>>>> +    int step, i;
>>>>          ptl = pmd_trans_huge_lock(pmd, vma);
>>>>        if (ptl) {
>>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
>>>> addr, unsigned long end,
>>>>            walk->action = ACTION_AGAIN;
>>>>            return 0;
>>>>        }
>>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>>            pte_t pte = ptep_get(ptep);
>>>>    +        step = 1;
>>>>            /* We need to do cache lookup too for pte markers */
>>>>            if (pte_none_mostly(pte))
>>>>                __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>>                             vma, vec);
>>>> -        else if (pte_present(pte))
>>>> -            *vec = 1;
>>>> -        else { /* pte is a swap entry */
>>>> +        else if (pte_present(pte)) {
>>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>>>> +
>>>> +                if (folio && folio_test_large(folio)) {
>>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>>> +                }
>>>> +            }
>>>
>>> You could simplify to the following, I think, to avoid needing to grab the folio
>>> and call folio_pte_batch():
>>>
>>>              else if (pte_present(pte)) {
>>>                  int max_nr = (end - addr) / PAGE_SIZE;
>>>                  step = min(pte_batch_hint(ptep, pte), max_nr);
>>>              } ...
>>>
>>> I expect the regression you are seeing here is all due to calling ptep_get() for
>>> every pte in the contpte batch, which will cause 16 memory reads per pte (to
>>> gather the access/dirty bits). For small folios its just 1 read per pte.
>>
>> Right.
>>
>>> pte_batch_hint() will skip forward in blocks of 16 so you now end up with the
>>> same number as for the small folio case. You don't need all the fancy extras
>>> that folio_pte_batch() gives you here.
>>
>> Sounds reasonable. Your suggestion looks simple, but my method can batch the
>> whole large folio (such as large folios containing more than 16 contiguous PTEs)
>> at once.
> 
> Sure but folio_pte_batch() just implements that with another loop that calls
> pte_batch_hint(), so it all amounts to the same thing. In fact there are some
> extra checks in folio_pte_batch() that you don't need so it might be a bit slower.

I don't enjoy open-coding the batching, especially if only cont-pte 
users will benefit from it. But I also don't enjoy the open-coded 
pte_batch_hint() :)

But we really don't need the folio here, so I assume the short variant 
you (Ryan) suggest is alright to just avoid the ptep_get().

As Oscar says, these details might soon be hidden inside a new page 
table walker API (even though it will likely end up using 
folio_pte_batch() internally, TBD).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] selftests: mincore: fix tmpfs mincore test failure
  2025-04-01 12:54   ` David Hildenbrand
@ 2025-04-07  3:49     ` Baolin Wang
  2025-04-07  7:49       ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-04-07  3:49 UTC (permalink / raw)
  To: David Hildenbrand, akpm, hughd
  Cc: willy, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel



On 2025/4/1 20:54, David Hildenbrand wrote:
> On 26.03.25 04:38, Baolin Wang wrote:
>> When running mincore test cases, I encountered the following failures:
>>
>> "
>> mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 (0)
>> mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in memory
>> check_tmpfs_mmap: Test terminated by assertion
>>            FAIL  global.check_tmpfs_mmap
>> not ok 5 global.check_tmpfs_mmap
>> FAILED: 4 / 5 tests passed
>> "
>>
>> The reason for the test case failure is that my system automatically 
>> enabled
>> tmpfs large folio allocation by adding the 
>> 'transparent_hugepage_tmpfs=always'
>> cmdline. However, the test case still expects the tmpfs mounted on 
>> /dev/shm to
>> allocate small folios, which leads to assertion failures when 
>> verifying readahead
>> pages.
>>
>> To fix this issue, remount tmpfs to a new test directory and set the 
>> 'huge=never'
>> parameter to avoid allocating large folios, which can pass the test.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   .../selftests/mincore/mincore_selftest.c      | 25 +++++++++++++++++--
>>   1 file changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/testing/selftests/mincore/mincore_selftest.c 
>> b/tools/testing/selftests/mincore/mincore_selftest.c
>> index e949a43a6145..e8d7a3a4739f 100644
>> --- a/tools/testing/selftests/mincore/mincore_selftest.c
>> +++ b/tools/testing/selftests/mincore/mincore_selftest.c
>> @@ -12,6 +12,7 @@
>>   #include <unistd.h>
>>   #include <stdlib.h>
>>   #include <sys/mman.h>
>> +#include <sys/mount.h>
>>   #include <string.h>
>>   #include <fcntl.h>
>> @@ -283,7 +284,7 @@ TEST(check_file_mmap)
>>       free(vec);
>>   }
>> -
>> +#define INPUT_MAX 80
>>   /*
>>    * Test mincore() behavior on a page backed by a tmpfs file.  This test
>>    * performs the same steps as the previous one. However, we don't 
>> expect
>> @@ -291,6 +292,9 @@ TEST(check_file_mmap)
>>    */
>>   TEST(check_tmpfs_mmap)
>>   {
>> +    char tmpfs_template[] = "/tmp/check_tmpfs_XXXXXX";
>> +    const char *tmpfs_loc = mkdtemp(tmpfs_template);
>> +    char testfile[INPUT_MAX];
>>       unsigned char *vec;
>>       int vec_size;
>>       char *addr;
>> @@ -300,6 +304,10 @@ TEST(check_tmpfs_mmap)
>>       int i;
>>       int ra_pages = 0;
>> +    ASSERT_NE(NULL, tmpfs_loc) {
>> +        TH_LOG("Can't mkdir tmpfs dentry\n");
>> +    }
>> +
>>       page_size = sysconf(_SC_PAGESIZE);
>>       vec_size = FILE_SIZE / page_size;
>>       if (FILE_SIZE % page_size)
>> @@ -311,7 +319,18 @@ TEST(check_tmpfs_mmap)
>>       }
>>       errno = 0;
>> -    fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
>> +    /* Do not use large folios for tmpfs mincore testing */
>> +    retval = mount("tmpfs", tmpfs_loc, "tmpfs", 0, 
>> "huge=never,size=4M");
>> +    ASSERT_EQ(0, retval) {
>> +        TH_LOG("Unable to mount tmpfs for testing\n");
>> +    }
>> +
>> +    retval = snprintf(testfile, INPUT_MAX, "%s/test_file", tmpfs_loc);
>> +    ASSERT_GE(INPUT_MAX, retval) {
>> +        TH_LOG("Unable to create a tmpfs for testing\n");
>> +    }
>> +
>> +    fd = open(testfile, O_CREAT|O_RDWR, 0664);
>>       ASSERT_NE(-1, fd) {
>>           TH_LOG("Can't create temporary file: %s",
>>               strerror(errno));
>> @@ -363,6 +382,8 @@ TEST(check_tmpfs_mmap)
>>       munmap(addr, FILE_SIZE);
>>       close(fd);
>>       free(vec);
>> +    umount(tmpfs_loc);
>> +    rmdir(tmpfs_loc);
>>   }
>>   TEST_HARNESS_MAIN
> 
> Is there anything that cleans up the mount in case something goes wrong 
> (we run into an assertion), or will the directory+mount stick around 
> forever?

Good point, I will clean up the mount in the next version.

> 
> But I also wonder whether check_tmpfs_mmap() should be changed to live 
> with the fact that readahead ("faultaround") could now happen. What's 
> the reason for not doing that?

 From this test case's description, it doesn't expect any readahead.
"
/*
  * Test mincore() behavior on a page backed by a tmpfs file.  This test
  * performs the same steps as the previous one. However, we don't expect
  * any readahead in this case.
  */
"

Maybe we could add a new test case that expects readahead for a tmpfs file.
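
A minimal sketch of what such a case might look like (illustrative only, not
part of this series). It reuses FILE_SIZE and the kselftest harness macros
from mincore_selftest.c and deliberately does not assert anything about how
many neighbouring pages were faulted in:

/*
 * Illustrative sketch: tmpfs variant that tolerates fault-around / large
 * folio allocation. Assumes FILE_SIZE and the kselftest harness macros
 * from mincore_selftest.c.
 */
TEST(check_tmpfs_mmap_readahead)
{
	unsigned char *vec;
	int vec_size, fd, retval, i, resident = 0;
	long page_size = sysconf(_SC_PAGESIZE);
	char *addr;

	vec_size = FILE_SIZE / page_size;
	vec = calloc(vec_size, sizeof(unsigned char));
	ASSERT_NE(NULL, vec);

	fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
	ASSERT_NE(-1, fd);
	ASSERT_EQ(0, fallocate(fd, 0, 0, FILE_SIZE));

	addr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	ASSERT_NE(MAP_FAILED, addr);

	/* Fault in a single page in the middle of the mapping. */
	addr[FILE_SIZE / 2] = 1;

	retval = mincore(addr, FILE_SIZE, vec);
	ASSERT_EQ(0, retval);

	/* The touched page must be resident; neighbours may or may not be,
	 * depending on large folio allocation and fault-around. */
	ASSERT_NE(0, vec[(FILE_SIZE / 2) / page_size] & 1);

	for (i = 0; i < vec_size; i++)
		resident += vec[i] & 1;
	TH_LOG("%d of %d pages resident", resident, vec_size);

	munmap(addr, FILE_SIZE);
	close(fd);
	free(vec);
}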


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-04-01 13:04         ` David Hildenbrand
@ 2025-04-07  6:33           ` Baolin Wang
  2025-04-14 13:46             ` Ryan Roberts
  0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-04-07  6:33 UTC (permalink / raw)
  To: David Hildenbrand, Ryan Roberts, akpm, hughd
  Cc: willy, 21cnbao, ziy, linux-mm, linux-kernel



On 2025/4/1 21:04, David Hildenbrand wrote:
> On 01.04.25 12:45, Ryan Roberts wrote:
>> On 30/03/2025 15:57, Baolin Wang wrote:
>>>
>>>
>>> On 2025/3/27 22:08, Ryan Roberts wrote:
>>>> On 25/03/2025 23:38, Baolin Wang wrote:
>>>>> When I tested the mincore() syscall, I observed that it takes 
>>>>> longer with
>>>>> 64K mTHP enabled on my Arm64 server. The reason is the 
>>>>> mincore_pte_range()
>>>>> still checks each PTE individually, even when the PTEs are contiguous,
>>>>> which is not efficient.
>>>>>
>>>>> Thus we can use folio_pte_batch() to get the batch number of the 
>>>>> present
>>>>> contiguous PTEs, which can improve the performance. I tested the 
>>>>> mincore()
>>>>> syscall with 1G anonymous memory populated with 64K mTHP, and 
>>>>> observed an
>>>>> obvious performance improvement:
>>>>>
>>>>> w/o patch        w/ patch        changes
>>>>> 6022us            1115us            +81%
>>>>>
>>>>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>>>>> see any obvious regression.
>>>>>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> ---
>>>>>    mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>>>    1 file changed, 22 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>>>> index 832f29f46767..88be180b5550 100644
>>>>> --- a/mm/mincore.c
>>>>> +++ b/mm/mincore.c
>>>>> @@ -21,6 +21,7 @@
>>>>>      #include <linux/uaccess.h>
>>>>>    #include "swap.h"
>>>>> +#include "internal.h"
>>>>>      static int mincore_hugetlb(pte_t *pte, unsigned long hmask, 
>>>>> unsigned long
>>>>> addr,
>>>>>                unsigned long end, struct mm_walk *walk)
>>>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, 
>>>>> unsigned long
>>>>> addr, unsigned long end,
>>>>>        pte_t *ptep;
>>>>>        unsigned char *vec = walk->private;
>>>>>        int nr = (end - addr) >> PAGE_SHIFT;
>>>>> +    int step, i;
>>>>>          ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>        if (ptl) {
>>>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, 
>>>>> unsigned long
>>>>> addr, unsigned long end,
>>>>>            walk->action = ACTION_AGAIN;
>>>>>            return 0;
>>>>>        }
>>>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>>>            pte_t pte = ptep_get(ptep);
>>>>>    +        step = 1;
>>>>>            /* We need to do cache lookup too for pte markers */
>>>>>            if (pte_none_mostly(pte))
>>>>>                __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>>>                             vma, vec);
>>>>> -        else if (pte_present(pte))
>>>>> -            *vec = 1;
>>>>> -        else { /* pte is a swap entry */
>>>>> +        else if (pte_present(pte)) {
>>>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>>>> +                struct folio *folio = vm_normal_folio(vma, addr, 
>>>>> pte);
>>>>> +
>>>>> +                if (folio && folio_test_large(folio)) {
>>>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>>>> +
>>>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>>>> +                }
>>>>> +            }
>>>>
>>>> You could simplify to the following, I think, to avoid needing to 
>>>> grab the folio
>>>> and call folio_pte_batch():
>>>>
>>>>              else if (pte_present(pte)) {
>>>>                  int max_nr = (end - addr) / PAGE_SIZE;
>>>>                  step = min(pte_batch_hint(ptep, pte), max_nr);
>>>>              } ...
>>>>
>>>> I expect the regression you are seeing here is all due to calling 
>>>> ptep_get() for
>>>> every pte in the contpte batch, which will cause 16 memory reads per 
>>>> pte (to
>>>> gather the access/dirty bits). For small folios its just 1 read per 
>>>> pte.
>>>
>>> Right.
>>>
>>>> pte_batch_hint() will skip forward in blocks of 16 so you now end up 
>>>> with the
>>>> same number as for the small folio case. You don't need all the 
>>>> fancy extras
>>>> that folio_pte_batch() gives you here.
>>>
>>> Sounds reasonable. Your suggestion looks simple, but my method can 
>>> batch the
>>> whole large folio (such as large folios containing more than 16 
>>> contiguous PTEs)
>>> at once.
>>
>> Sure but folio_pte_batch() just implements that with another loop that 
>> calls
>> pte_batch_hint(), so it all amounts to the same thing. In fact there 
>> are some
>> extra checks in folio_pte_batch() that you don't need so it might be a 
>> bit slower.

Right. I tested your suggestion, and yes, it is much better.

> I don't enjoy open-coding the batching, especially if only cont-pte 
> users will benefit from it. But I also don't enjoy the open-coded 
> pte_batch_hint() :)
> 
> But we really don't need the folio here, so I assume the short variant 
> you (Ryan) suggest is alright to just avoid the ptep_get().
> 
> As Oscar says, these details might soon be hidden inside a new page 
> table walker API (even though it will likely end up using 
> folio_pte_batch() internally, TBD).

OK. I can drop this patch if it will be addressed in the following patches.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] selftests: mincore: fix tmpfs mincore test failure
  2025-04-07  3:49     ` Baolin Wang
@ 2025-04-07  7:49       ` David Hildenbrand
  2025-04-07  8:35         ` Baolin Wang
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-04-07  7:49 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel

On 07.04.25 05:49, Baolin Wang wrote:
> 
> 
> On 2025/4/1 20:54, David Hildenbrand wrote:
>> On 26.03.25 04:38, Baolin Wang wrote:
>>> When running mincore test cases, I encountered the following failures:
>>>
>>> "
>>> mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 (0)
>>> mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in memory
>>> check_tmpfs_mmap: Test terminated by assertion
>>>             FAIL  global.check_tmpfs_mmap
>>> not ok 5 global.check_tmpfs_mmap
>>> FAILED: 4 / 5 tests passed
>>> "
>>>
>>> The reason for the test case failure is that my system automatically
>>> enabled
>>> tmpfs large folio allocation by adding the
>>> 'transparent_hugepage_tmpfs=always'
>>> cmdline. However, the test case still expects the tmpfs mounted on
>>> /dev/shm to
>>> allocate small folios, which leads to assertion failures when
>>> verifying readahead
>>> pages.
>>>
>>> To fix this issue, remount tmpfs to a new test directory and set the
>>> 'huge=never'
>>> parameter to avoid allocating large folios, which can pass the test.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>    .../selftests/mincore/mincore_selftest.c      | 25 +++++++++++++++++--
>>>    1 file changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/mincore/mincore_selftest.c
>>> b/tools/testing/selftests/mincore/mincore_selftest.c
>>> index e949a43a6145..e8d7a3a4739f 100644
>>> --- a/tools/testing/selftests/mincore/mincore_selftest.c
>>> +++ b/tools/testing/selftests/mincore/mincore_selftest.c
>>> @@ -12,6 +12,7 @@
>>>    #include <unistd.h>
>>>    #include <stdlib.h>
>>>    #include <sys/mman.h>
>>> +#include <sys/mount.h>
>>>    #include <string.h>
>>>    #include <fcntl.h>
>>> @@ -283,7 +284,7 @@ TEST(check_file_mmap)
>>>        free(vec);
>>>    }
>>> -
>>> +#define INPUT_MAX 80
>>>    /*
>>>     * Test mincore() behavior on a page backed by a tmpfs file.  This test
>>>     * performs the same steps as the previous one. However, we don't
>>> expect
>>> @@ -291,6 +292,9 @@ TEST(check_file_mmap)
>>>     */
>>>    TEST(check_tmpfs_mmap)
>>>    {
>>> +    char tmpfs_template[] = "/tmp/check_tmpfs_XXXXXX";
>>> +    const char *tmpfs_loc = mkdtemp(tmpfs_template);
>>> +    char testfile[INPUT_MAX];
>>>        unsigned char *vec;
>>>        int vec_size;
>>>        char *addr;
>>> @@ -300,6 +304,10 @@ TEST(check_tmpfs_mmap)
>>>        int i;
>>>        int ra_pages = 0;
>>> +    ASSERT_NE(NULL, tmpfs_loc) {
>>> +        TH_LOG("Can't mkdir tmpfs dentry\n");
>>> +    }
>>> +
>>>        page_size = sysconf(_SC_PAGESIZE);
>>>        vec_size = FILE_SIZE / page_size;
>>>        if (FILE_SIZE % page_size)
>>> @@ -311,7 +319,18 @@ TEST(check_tmpfs_mmap)
>>>        }
>>>        errno = 0;
>>> -    fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
>>> +    /* Do not use large folios for tmpfs mincore testing */
>>> +    retval = mount("tmpfs", tmpfs_loc, "tmpfs", 0,
>>> "huge=never,size=4M");
>>> +    ASSERT_EQ(0, retval) {
>>> +        TH_LOG("Unable to mount tmpfs for testing\n");
>>> +    }
>>> +
>>> +    retval = snprintf(testfile, INPUT_MAX, "%s/test_file", tmpfs_loc);
>>> +    ASSERT_GE(INPUT_MAX, retval) {
>>> +        TH_LOG("Unable to create a tmpfs for testing\n");
>>> +    }
>>> +
>>> +    fd = open(testfile, O_CREAT|O_RDWR, 0664);
>>>        ASSERT_NE(-1, fd) {
>>>            TH_LOG("Can't create temporary file: %s",
>>>                strerror(errno));
>>> @@ -363,6 +382,8 @@ TEST(check_tmpfs_mmap)
>>>        munmap(addr, FILE_SIZE);
>>>        close(fd);
>>>        free(vec);
>>> +    umount(tmpfs_loc);
>>> +    rmdir(tmpfs_loc);
>>>    }
>>>    TEST_HARNESS_MAIN
>>
>> Is there anything that cleans up the mount in case something goes wrong
>> (we run into an assertion), or will the directory+mount stick around
>> forever?
> 
> Good point, will cleanup the mount in the next version.
> 
>>
>> But I also wonder whether check_tmpfs_mmap() should be changed to live
>> with the fact that readahead ("faultaround") could now happen. What's
>> the reason for not doing that?
> 
>   From this test case's description, it doesn't expect any readahead.

Yes, but why are we testing for that *at all*? We don't make such
assumptions/tests for anon memory ("no faultaround happened").

Why not simply remove the "We expect only that page to be fetched into 
memory." documentation + checking?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] selftests: mincore: fix tmpfs mincore test failure
  2025-04-07  7:49       ` David Hildenbrand
@ 2025-04-07  8:35         ` Baolin Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-04-07  8:35 UTC (permalink / raw)
  To: David Hildenbrand, akpm, hughd
  Cc: willy, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel



On 2025/4/7 15:49, David Hildenbrand wrote:
> On 07.04.25 05:49, Baolin Wang wrote:
>>
>>
>> On 2025/4/1 20:54, David Hildenbrand wrote:
>>> On 26.03.25 04:38, Baolin Wang wrote:
>>>> When running mincore test cases, I encountered the following failures:
>>>>
>>>> "
>>>> mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 
>>>> (0)
>>>> mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in 
>>>> memory
>>>> check_tmpfs_mmap: Test terminated by assertion
>>>>             FAIL  global.check_tmpfs_mmap
>>>> not ok 5 global.check_tmpfs_mmap
>>>> FAILED: 4 / 5 tests passed
>>>> "
>>>>
>>>> The reason for the test case failure is that my system automatically
>>>> enabled
>>>> tmpfs large folio allocation by adding the
>>>> 'transparent_hugepage_tmpfs=always'
>>>> cmdline. However, the test case still expects the tmpfs mounted on
>>>> /dev/shm to
>>>> allocate small folios, which leads to assertion failures when
>>>> verifying readahead
>>>> pages.
>>>>
>>>> To fix this issue, remount tmpfs to a new test directory and set the
>>>> 'huge=never'
>>>> parameter to avoid allocating large folios, which can pass the test.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    .../selftests/mincore/mincore_selftest.c      | 25 
>>>> +++++++++++++++++--
>>>>    1 file changed, 23 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/tools/testing/selftests/mincore/mincore_selftest.c
>>>> b/tools/testing/selftests/mincore/mincore_selftest.c
>>>> index e949a43a6145..e8d7a3a4739f 100644
>>>> --- a/tools/testing/selftests/mincore/mincore_selftest.c
>>>> +++ b/tools/testing/selftests/mincore/mincore_selftest.c
>>>> @@ -12,6 +12,7 @@
>>>>    #include <unistd.h>
>>>>    #include <stdlib.h>
>>>>    #include <sys/mman.h>
>>>> +#include <sys/mount.h>
>>>>    #include <string.h>
>>>>    #include <fcntl.h>
>>>> @@ -283,7 +284,7 @@ TEST(check_file_mmap)
>>>>        free(vec);
>>>>    }
>>>> -
>>>> +#define INPUT_MAX 80
>>>>    /*
>>>>     * Test mincore() behavior on a page backed by a tmpfs file.  
>>>> This test
>>>>     * performs the same steps as the previous one. However, we don't
>>>> expect
>>>> @@ -291,6 +292,9 @@ TEST(check_file_mmap)
>>>>     */
>>>>    TEST(check_tmpfs_mmap)
>>>>    {
>>>> +    char tmpfs_template[] = "/tmp/check_tmpfs_XXXXXX";
>>>> +    const char *tmpfs_loc = mkdtemp(tmpfs_template);
>>>> +    char testfile[INPUT_MAX];
>>>>        unsigned char *vec;
>>>>        int vec_size;
>>>>        char *addr;
>>>> @@ -300,6 +304,10 @@ TEST(check_tmpfs_mmap)
>>>>        int i;
>>>>        int ra_pages = 0;
>>>> +    ASSERT_NE(NULL, tmpfs_loc) {
>>>> +        TH_LOG("Can't mkdir tmpfs dentry\n");
>>>> +    }
>>>> +
>>>>        page_size = sysconf(_SC_PAGESIZE);
>>>>        vec_size = FILE_SIZE / page_size;
>>>>        if (FILE_SIZE % page_size)
>>>> @@ -311,7 +319,18 @@ TEST(check_tmpfs_mmap)
>>>>        }
>>>>        errno = 0;
>>>> -    fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
>>>> +    /* Do not use large folios for tmpfs mincore testing */
>>>> +    retval = mount("tmpfs", tmpfs_loc, "tmpfs", 0,
>>>> "huge=never,size=4M");
>>>> +    ASSERT_EQ(0, retval) {
>>>> +        TH_LOG("Unable to mount tmpfs for testing\n");
>>>> +    }
>>>> +
>>>> +    retval = snprintf(testfile, INPUT_MAX, "%s/test_file", tmpfs_loc);
>>>> +    ASSERT_GE(INPUT_MAX, retval) {
>>>> +        TH_LOG("Unable to create a tmpfs for testing\n");
>>>> +    }
>>>> +
>>>> +    fd = open(testfile, O_CREAT|O_RDWR, 0664);
>>>>        ASSERT_NE(-1, fd) {
>>>>            TH_LOG("Can't create temporary file: %s",
>>>>                strerror(errno));
>>>> @@ -363,6 +382,8 @@ TEST(check_tmpfs_mmap)
>>>>        munmap(addr, FILE_SIZE);
>>>>        close(fd);
>>>>        free(vec);
>>>> +    umount(tmpfs_loc);
>>>> +    rmdir(tmpfs_loc);
>>>>    }
>>>>    TEST_HARNESS_MAIN
>>>
>>> Is there anything that cleans up the mount in case something goes wrong
>>> (we run into an assertion), or will the directory+mount stick around
>>> forever?
>>
>> Good point, will cleanup the mount in the next version.
>>
>>>
>>> But I also wonder whether check_tmpfs_mmap() should be changed to live
>>> with the fact that readahead ("faultaround") could now happen. What's
>>> the reason for not doing that?
>>
>>   From this test case's description, it doesn't expect any readahead.
> 
> Yes, but why are we testing for that *at all*? We don't make such 
> assumptions/tests for anon memory ("no faultaround happened").
> 
> Why not simply remove the "We expect only that page to be fetched into 
> memory." documentation + checking?

OK. I'm fine with dropping the readahead check. Thanks.
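
Concretely, that means deleting the read-ahead accounting from
check_tmpfs_mmap(). A sketch of the hunk, reconstructed from the assertion
message quoted earlier (the exact lines in mincore_selftest.c may differ
slightly):

-	ra_pages = 0;
-	for (i = 0; i < vec_size; i++)
-		if (vec[i])
-			ra_pages++;
-	ASSERT_EQ(ra_pages, 0) {
-		TH_LOG("Read-ahead pages found in memory");
-	}

along with rewording the "We expect only that page to be fetched into
memory." comment at the top of the test.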


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-04-07  6:33           ` Baolin Wang
@ 2025-04-14 13:46             ` Ryan Roberts
  0 siblings, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2025-04-14 13:46 UTC (permalink / raw)
  To: Baolin Wang, David Hildenbrand, akpm, hughd
  Cc: willy, 21cnbao, ziy, linux-mm, linux-kernel

On 07/04/2025 07:33, Baolin Wang wrote:
> 
> 
> On 2025/4/1 21:04, David Hildenbrand wrote:
>> On 01.04.25 12:45, Ryan Roberts wrote:
>>> On 30/03/2025 15:57, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/3/27 22:08, Ryan Roberts wrote:
>>>>> On 25/03/2025 23:38, Baolin Wang wrote:
>>>>>> When I tested the mincore() syscall, I observed that it takes longer with
>>>>>> 64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range()
>>>>>> still checks each PTE individually, even when the PTEs are contiguous,
>>>>>> which is not efficient.
>>>>>>
>>>>>> Thus we can use folio_pte_batch() to get the batch number of the present
>>>>>> contiguous PTEs, which can improve the performance. I tested the mincore()
>>>>>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>>>>>> obvious performance improvement:
>>>>>>
>>>>>> w/o patch        w/ patch        changes
>>>>>> 6022us            1115us            +81%
>>>>>>
>>>>>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>>>>>> see any obvious regression.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>>    mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>>>>    1 file changed, 22 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>>>>> index 832f29f46767..88be180b5550 100644
>>>>>> --- a/mm/mincore.c
>>>>>> +++ b/mm/mincore.c
>>>>>> @@ -21,6 +21,7 @@
>>>>>>      #include <linux/uaccess.h>
>>>>>>    #include "swap.h"
>>>>>> +#include "internal.h"
>>>>>>      static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned
>>>>>> long
>>>>>> addr,
>>>>>>                unsigned long end, struct mm_walk *walk)
>>>>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
>>>>>> addr, unsigned long end,
>>>>>>        pte_t *ptep;
>>>>>>        unsigned char *vec = walk->private;
>>>>>>        int nr = (end - addr) >> PAGE_SHIFT;
>>>>>> +    int step, i;
>>>>>>          ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>>        if (ptl) {
>>>>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
>>>>>> addr, unsigned long end,
>>>>>>            walk->action = ACTION_AGAIN;
>>>>>>            return 0;
>>>>>>        }
>>>>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>>>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>>>>            pte_t pte = ptep_get(ptep);
>>>>>>    +        step = 1;
>>>>>>            /* We need to do cache lookup too for pte markers */
>>>>>>            if (pte_none_mostly(pte))
>>>>>>                __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>>>>                             vma, vec);
>>>>>> -        else if (pte_present(pte))
>>>>>> -            *vec = 1;
>>>>>> -        else { /* pte is a swap entry */
>>>>>> +        else if (pte_present(pte)) {
>>>>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>>>>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>>>>>> +
>>>>>> +                if (folio && folio_test_large(folio)) {
>>>>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>>>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>>>>> +
>>>>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>>>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>>>>> +                }
>>>>>> +            }
>>>>>
>>>>> You could simplify to the following, I think, to avoid needing to grab the
>>>>> folio
>>>>> and call folio_pte_batch():
>>>>>
>>>>>              else if (pte_present(pte)) {
>>>>>                  int max_nr = (end - addr) / PAGE_SIZE;
>>>>>                  step = min(pte_batch_hint(ptep, pte), max_nr);
>>>>>              } ...
>>>>>
>>>>> I expect the regression you are seeing here is all due to calling
>>>>> ptep_get() for
>>>>> every pte in the contpte batch, which will cause 16 memory reads per pte (to
>>>>> gather the access/dirty bits). For small folios its just 1 read per pte.
>>>>
>>>> Right.
>>>>
>>>>> pte_batch_hint() will skip forward in blocks of 16 so you now end up with the
>>>>> same number as for the small folio case. You don't need all the fancy extras
>>>>> that folio_pte_batch() gives you here.
>>>>
>>>> Sounds reasonable. Your suggestion looks simple, but my method can batch the
>>>> whole large folio (such as large folios containing more than 16 contiguous
>>>> PTEs)
>>>> at once.
>>>
>>> Sure but folio_pte_batch() just implements that with another loop that calls
>>> pte_batch_hint(), so it all amounts to the same thing. In fact there are some
>>> extra checks in folio_pte_batch() that you don't need so it might be a bit
>>> slower.
> 
> Right. I tested your suggestion, yes, much better.
> 
>> I don't enjoy open-coding the batching, especially if only cont-pte users will
>> benefit from it. But I also don't enjoy the open-coded pte_batch_hint() :)

I'm not quite sure what you are saying here? Is:

    else if (pte_present(pte)) {
        int max_nr = (end - addr) / PAGE_SIZE;
        step = min(pte_batch_hint(ptep, pte), max_nr);
    }

really to be considered open-coding? pte_batch_hint() is a generic API and it
feels pretty reasonable to use it in this situation?
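
For context, the generic fallback really is trivial, which is part of why
using it directly does not feel like open-coding. Paraphrased from
include/linux/pgtable.h (double-check against the tree in use); arm64
overrides this to return the number of remaining entries in the current
cont-PTE block:

#ifndef pte_batch_hint
/*
 * No arch support: no cheap hint is available, so callers fall back to
 * the old one-PTE-at-a-time behaviour.
 */
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif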

>>
>> But we really don't need the folio here, so I assume the short variant you
>> (Ryan) suggest is alright to just avoid the ptep_get().

Yes, that would get my vote.

>>
>> As Oscar says, these details might soon be hidden inside a new page table
>> walker API (even though it will likely end up using folio_pte_batch()
>> internally, TBD).
> 
> OK. I can drop this patch if it will be addressed in the following patches.

I'm assuming a large chunk of the speedup is actually fixing a regression (it
would be good to see the numbers for non-mTHP mappings for comparison), so
personally I'd prefer we put this patch in now rather than waiting for the new
API to land and then for someone to get around to converting this user.

Thanks,
Ryan





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-03-26  3:38 ` [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios Baolin Wang
  2025-03-27 10:49   ` Oscar Salvador
  2025-03-27 14:08   ` Ryan Roberts
@ 2025-05-07  5:12   ` Dev Jain
  2025-05-07  9:48     ` Baolin Wang
  2 siblings, 1 reply; 23+ messages in thread
From: Dev Jain @ 2025-05-07  5:12 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel



On 26/03/25 9:08 am, Baolin Wang wrote:
> When I tested the mincore() syscall, I observed that it takes longer with
> 64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range()
> still checks each PTE individually, even when the PTEs are contiguous,
> which is not efficient.
> 
> Thus we can use folio_pte_batch() to get the batch number of the present
> contiguous PTEs, which can improve the performance. I tested the mincore()
> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
> obvious performance improvement:
> 
> w/o patch		w/ patch		changes
> 6022us			1115us			+81%
> 
> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
> see any obvious regression.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   mm/mincore.c | 27 ++++++++++++++++++++++-----
>   1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 832f29f46767..88be180b5550 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -21,6 +21,7 @@
>   
>   #include <linux/uaccess.h>
>   #include "swap.h"
> +#include "internal.h"
>   
>   static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>   			unsigned long end, struct mm_walk *walk)
> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	pte_t *ptep;
>   	unsigned char *vec = walk->private;
>   	int nr = (end - addr) >> PAGE_SHIFT;
> +	int step, i;
>   
>   	ptl = pmd_trans_huge_lock(pmd, vma);
>   	if (ptl) {
> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   		walk->action = ACTION_AGAIN;
>   		return 0;
>   	}
> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>   		pte_t pte = ptep_get(ptep);
>   
> +		step = 1;
>   		/* We need to do cache lookup too for pte markers */
>   		if (pte_none_mostly(pte))
>   			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>   						 vma, vec);
> -		else if (pte_present(pte))
> -			*vec = 1;
> -		else { /* pte is a swap entry */
> +		else if (pte_present(pte)) {
> +			if (pte_batch_hint(ptep, pte) > 1) {
> +				struct folio *folio = vm_normal_folio(vma, addr, pte);
> +
> +				if (folio && folio_test_large(folio)) {
> +					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> +								FPB_IGNORE_SOFT_DIRTY;
> +					int max_nr = (end - addr) / PAGE_SIZE;
> +
> +					step = folio_pte_batch(folio, addr, ptep, pte,
> +							max_nr, fpb_flags, NULL, NULL, NULL);
> +				}
> +			}

Can we go ahead with this along with [1]? That will help us generalize
for all arches.

[1] https://lore.kernel.org/all/20250506050056.59250-3-dev.jain@arm.com/
(Please replace PAGE_SIZE with 1)

> +
> +			for (i = 0; i < step; i++)
> +				vec[i] = 1;
> +		} else { /* pte is a swap entry */
>   			swp_entry_t entry = pte_to_swp_entry(pte);
>   
>   			if (non_swap_entry(entry)) {
> @@ -146,7 +163,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   #endif
>   			}
>   		}
> -		vec++;
> +		vec += step;
>   	}
>   	pte_unmap_unlock(ptep - 1, ptl);
>   out:



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-05-07  5:12   ` Dev Jain
@ 2025-05-07  9:48     ` Baolin Wang
  2025-05-07  9:54       ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-05-07  9:48 UTC (permalink / raw)
  To: Dev Jain, akpm, hughd
  Cc: willy, david, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel



On 2025/5/7 13:12, Dev Jain wrote:
> 
> 
> On 26/03/25 9:08 am, Baolin Wang wrote:
>> When I tested the mincore() syscall, I observed that it takes longer with
>> 64K mTHP enabled on my Arm64 server. The reason is the 
>> mincore_pte_range()
>> still checks each PTE individually, even when the PTEs are contiguous,
>> which is not efficient.
>>
>> Thus we can use folio_pte_batch() to get the batch number of the present
>> contiguous PTEs, which can improve the performance. I tested the 
>> mincore()
>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>> obvious performance improvement:
>>
>> w/o patch        w/ patch        changes
>> 6022us            1115us            +81%
>>
>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>> see any obvious regression.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/mincore.c | 27 ++++++++++++++++++++++-----
>>   1 file changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index 832f29f46767..88be180b5550 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/uaccess.h>
>>   #include "swap.h"
>> +#include "internal.h"
>>   static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned 
>> long addr,
>>               unsigned long end, struct mm_walk *walk)
>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned 
>> long addr, unsigned long end,
>>       pte_t *ptep;
>>       unsigned char *vec = walk->private;
>>       int nr = (end - addr) >> PAGE_SHIFT;
>> +    int step, i;
>>       ptl = pmd_trans_huge_lock(pmd, vma);
>>       if (ptl) {
>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, 
>> unsigned long addr, unsigned long end,
>>           walk->action = ACTION_AGAIN;
>>           return 0;
>>       }
>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>           pte_t pte = ptep_get(ptep);
>> +        step = 1;
>>           /* We need to do cache lookup too for pte markers */
>>           if (pte_none_mostly(pte))
>>               __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>                            vma, vec);
>> -        else if (pte_present(pte))
>> -            *vec = 1;
>> -        else { /* pte is a swap entry */
>> +        else if (pte_present(pte)) {
>> +            if (pte_batch_hint(ptep, pte) > 1) {
>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>> +
>> +                if (folio && folio_test_large(folio)) {
>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>> +                                FPB_IGNORE_SOFT_DIRTY;
>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>> +
>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>> +                }
>> +            }
> 
> Can we go ahead with this along with [1], that will help us generalize 
> for all arches.
> 
> [1] https://lore.kernel.org/all/20250506050056.59250-3-dev.jain@arm.com/
> (Please replace PAGE_SIZE with 1)

As discussed with Ryan, we don’t need to call folio_pte_batch() at all 
(something like the code below is enough), so your patch seems 
unnecessarily complicated. However, David is unhappy about the 
open-coded pte_batch_hint().

  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned 
long addr,
                         unsigned long end, struct mm_walk *walk)
@@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned 
long addr, unsigned long end,
         pte_t *ptep;
         unsigned char *vec = walk->private;
         int nr = (end - addr) >> PAGE_SHIFT;
+       int step, i;

         ptl = pmd_trans_huge_lock(pmd, vma);
         if (ptl) {
@@ -118,16 +120,21 @@ static int mincore_pte_range(pmd_t *pmd, unsigned 
long addr, unsigned long end,
                 walk->action = ACTION_AGAIN;
                 return 0;
         }
-       for (; addr != end; ptep++, addr += PAGE_SIZE) {
+       for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
                 pte_t pte = ptep_get(ptep);

+               step = 1;
                 /* We need to do cache lookup too for pte markers */
                 if (pte_none_mostly(pte))
                         __mincore_unmapped_range(addr, addr + PAGE_SIZE,
                                                  vma, vec);
-               else if (pte_present(pte))
-                       *vec = 1;
-               else { /* pte is a swap entry */
+               else if (pte_present(pte)) {
+                       unsigned int max_nr = (end - addr) / PAGE_SIZE;
+
+                       step = min(pte_batch_hint(ptep, pte), max_nr);
+                       for (i = 0; i < step; i++)
+                               vec[i] = 1;
+               } else { /* pte is a swap entry */
                         swp_entry_t entry = pte_to_swp_entry(pte);

                         if (non_swap_entry(entry)) {
@@ -146,7 +153,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned 
long addr, unsigned long end,
  #endif
                         }
                 }
-               vec++;
+               vec += step;
         }
         pte_unmap_unlock(ptep - 1, ptl);
  out:


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-05-07  9:48     ` Baolin Wang
@ 2025-05-07  9:54       ` David Hildenbrand
  2025-05-07 10:03         ` Baolin Wang
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-07  9:54 UTC (permalink / raw)
  To: Baolin Wang, Dev Jain, akpm, hughd
  Cc: willy, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel

On 07.05.25 11:48, Baolin Wang wrote:
> 
> 
> On 2025/5/7 13:12, Dev Jain wrote:
>>
>>
>> On 26/03/25 9:08 am, Baolin Wang wrote:
>>> When I tested the mincore() syscall, I observed that it takes longer with
>>> 64K mTHP enabled on my Arm64 server. The reason is the
>>> mincore_pte_range()
>>> still checks each PTE individually, even when the PTEs are contiguous,
>>> which is not efficient.
>>>
>>> Thus we can use folio_pte_batch() to get the batch number of the present
>>> contiguous PTEs, which can improve the performance. I tested the
>>> mincore()
>>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>>> obvious performance improvement:
>>>
>>> w/o patch        w/ patch        changes
>>> 6022us            1115us            +81%
>>>
>>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>>> see any obvious regression.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>    mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>    1 file changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>> index 832f29f46767..88be180b5550 100644
>>> --- a/mm/mincore.c
>>> +++ b/mm/mincore.c
>>> @@ -21,6 +21,7 @@
>>>    #include <linux/uaccess.h>
>>>    #include "swap.h"
>>> +#include "internal.h"
>>>    static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned
>>> long addr,
>>>                unsigned long end, struct mm_walk *walk)
>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned
>>> long addr, unsigned long end,
>>>        pte_t *ptep;
>>>        unsigned char *vec = walk->private;
>>>        int nr = (end - addr) >> PAGE_SHIFT;
>>> +    int step, i;
>>>        ptl = pmd_trans_huge_lock(pmd, vma);
>>>        if (ptl) {
>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd,
>>> unsigned long addr, unsigned long end,
>>>            walk->action = ACTION_AGAIN;
>>>            return 0;
>>>        }
>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>            pte_t pte = ptep_get(ptep);
>>> +        step = 1;
>>>            /* We need to do cache lookup too for pte markers */
>>>            if (pte_none_mostly(pte))
>>>                __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>                             vma, vec);
>>> -        else if (pte_present(pte))
>>> -            *vec = 1;
>>> -        else { /* pte is a swap entry */
>>> +        else if (pte_present(pte)) {
>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>>> +
>>> +                if (folio && folio_test_large(folio)) {
>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>> +
>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>> +                }
>>> +            }
>>
>> Can we go ahead with this along with [1], that will help us generalize
>> for all arches.
>>
>> [1] https://lore.kernel.org/all/20250506050056.59250-3-dev.jain@arm.com/
>> (Please replace PAGE_SIZE with 1)
> 
> As discussed with Ryan, we don’t need to call folio_pte_batch()
> (something like the code below), so your patch seems unnecessarily
> complicated. However, David is unhappy about the open-coded
> pte_batch_hint().

I can live with the below :)

Having something more universal may not make sense here. Any form of 
batching contiguous PTEs (contiguous PFNs) -- whether with folios or 
not -- is not required here, as we really only want to

(a) Identify pte_present() PTEs
(b) Avoid the cost of repeated ptep_get() with cont-pte.

> 
>    static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned
> long addr,
>                           unsigned long end, struct mm_walk *walk)
> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned
> long addr, unsigned long end,
>           pte_t *ptep;
>           unsigned char *vec = walk->private;
>           int nr = (end - addr) >> PAGE_SHIFT;
> +       int step, i;
> 
>           ptl = pmd_trans_huge_lock(pmd, vma);
>           if (ptl) {
> @@ -118,16 +120,21 @@ static int mincore_pte_range(pmd_t *pmd, unsigned
> long addr, unsigned long end,
>                   walk->action = ACTION_AGAIN;
>                   return 0;
>           }
> -       for (; addr != end; ptep++, addr += PAGE_SIZE) {
> +       for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>                   pte_t pte = ptep_get(ptep);
> 
> +               step = 1;
>                   /* We need to do cache lookup too for pte markers */
>                   if (pte_none_mostly(pte))
>                           __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>                                                    vma, vec);
> -               else if (pte_present(pte))
> -                       *vec = 1;
> -               else { /* pte is a swap entry */
> +               else if (pte_present(pte)) {
> +                       unsigned int max_nr = (end - addr) / PAGE_SIZE;
> +
> +                       step = min(pte_batch_hint(ptep, pte), max_nr);
> +                       for (i = 0; i < step; i++)
> +                               vec[i] = 1;
> +               } else { /* pte is a swap entry */
>                           swp_entry_t entry = pte_to_swp_entry(pte);
> 
>                           if (non_swap_entry(entry)) {
> @@ -146,7 +153,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned
> long addr, unsigned long end,
>    #endif
>                           }
>                   }
> -               vec++;
> +               vec += step;
>           }
>           pte_unmap_unlock(ptep - 1, ptl);
>    out:
> 


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-05-07  9:54       ` David Hildenbrand
@ 2025-05-07 10:03         ` Baolin Wang
  2025-05-07 11:14           ` Ryan Roberts
  0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-05-07 10:03 UTC (permalink / raw)
  To: David Hildenbrand, Dev Jain, akpm, hughd
  Cc: willy, 21cnbao, ryan.roberts, ziy, linux-mm, linux-kernel



On 2025/5/7 17:54, David Hildenbrand wrote:
> On 07.05.25 11:48, Baolin Wang wrote:
>>
>>
>> On 2025/5/7 13:12, Dev Jain wrote:
>>>
>>>
>>> On 26/03/25 9:08 am, Baolin Wang wrote:
>>>> When I tested the mincore() syscall, I observed that it takes longer 
>>>> with
>>>> 64K mTHP enabled on my Arm64 server. The reason is the
>>>> mincore_pte_range()
>>>> still checks each PTE individually, even when the PTEs are contiguous,
>>>> which is not efficient.
>>>>
>>>> Thus we can use folio_pte_batch() to get the batch number of the 
>>>> present
>>>> contiguous PTEs, which can improve the performance. I tested the
>>>> mincore()
>>>> syscall with 1G anonymous memory populated with 64K mTHP, and 
>>>> observed an
>>>> obvious performance improvement:
>>>>
>>>> w/o patch        w/ patch        changes
>>>> 6022us            1115us            +81%
>>>>
>>>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>>>> see any obvious regression.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>>    1 file changed, 22 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>>> index 832f29f46767..88be180b5550 100644
>>>> --- a/mm/mincore.c
>>>> +++ b/mm/mincore.c
>>>> @@ -21,6 +21,7 @@
>>>>    #include <linux/uaccess.h>
>>>>    #include "swap.h"
>>>> +#include "internal.h"
>>>>    static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned
>>>> long addr,
>>>>                unsigned long end, struct mm_walk *walk)
>>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned
>>>> long addr, unsigned long end,
>>>>        pte_t *ptep;
>>>>        unsigned char *vec = walk->private;
>>>>        int nr = (end - addr) >> PAGE_SHIFT;
>>>> +    int step, i;
>>>>        ptl = pmd_trans_huge_lock(pmd, vma);
>>>>        if (ptl) {
>>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd,
>>>> unsigned long addr, unsigned long end,
>>>>            walk->action = ACTION_AGAIN;
>>>>            return 0;
>>>>        }
>>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>>            pte_t pte = ptep_get(ptep);
>>>> +        step = 1;
>>>>            /* We need to do cache lookup too for pte markers */
>>>>            if (pte_none_mostly(pte))
>>>>                __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>>                             vma, vec);
>>>> -        else if (pte_present(pte))
>>>> -            *vec = 1;
>>>> -        else { /* pte is a swap entry */
>>>> +        else if (pte_present(pte)) {
>>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>>>> +
>>>> +                if (folio && folio_test_large(folio)) {
>>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>>> +                }
>>>> +            }
>>>
>>> Can we go ahead with this along with [1], that will help us generalize
>>> for all arches.
>>>
>>> [1] https://lore.kernel.org/all/20250506050056.59250-3-dev.jain@arm.com/
>>> (Please replace PAGE_SIZE with 1)
>>
>> As discussed with Ryan, we don’t need to call folio_pte_batch()
>> (something like the code below), so your patch seems unnecessarily
>> complicated. However, David is unhappy about the open-coded
>> pte_batch_hint().
> 
> I can live with the below :)
> 
> Having something more universal may not make sense here. Any form of 
> batching contiguous PTEs (contiguous PFNs) -- whether with folios or 
> not -- is not required here, as we really only want to
> 
> (a) Identify pte_present() PTEs
> (b) Avoid the cost of repeated ptep_get() with cont-pte.

Good. I will change the patch and resend it. Thanks.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
  2025-05-07 10:03         ` Baolin Wang
@ 2025-05-07 11:14           ` Ryan Roberts
  0 siblings, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2025-05-07 11:14 UTC (permalink / raw)
  To: Baolin Wang, David Hildenbrand, Dev Jain, akpm, hughd
  Cc: willy, 21cnbao, ziy, linux-mm, linux-kernel

On 07/05/2025 11:03, Baolin Wang wrote:
> 
> 
> On 2025/5/7 17:54, David Hildenbrand wrote:
>> On 07.05.25 11:48, Baolin Wang wrote:
>>>
>>>
>>> On 2025/5/7 13:12, Dev Jain wrote:
>>>>
>>>>
>>>> On 26/03/25 9:08 am, Baolin Wang wrote:
>>>>> When I tested the mincore() syscall, I observed that it takes longer with
>>>>> 64K mTHP enabled on my Arm64 server. The reason is the
>>>>> mincore_pte_range()
>>>>> still checks each PTE individually, even when the PTEs are contiguous,
>>>>> which is not efficient.
>>>>>
>>>>> Thus we can use folio_pte_batch() to get the batch number of the present
>>>>> contiguous PTEs, which can improve the performance. I tested the
>>>>> mincore()
>>>>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>>>>> obvious performance improvement:
>>>>>
>>>>> w/o patch        w/ patch        changes
>>>>> 6022us            1115us            +81%
>>>>>
>>>>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>>>>> see any obvious regression.
>>>>>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> ---
>>>>>    mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>>>    1 file changed, 22 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>>>> index 832f29f46767..88be180b5550 100644
>>>>> --- a/mm/mincore.c
>>>>> +++ b/mm/mincore.c
>>>>> @@ -21,6 +21,7 @@
>>>>>    #include <linux/uaccess.h>
>>>>>    #include "swap.h"
>>>>> +#include "internal.h"
>>>>>    static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned
>>>>> long addr,
>>>>>                unsigned long end, struct mm_walk *walk)
>>>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned
>>>>> long addr, unsigned long end,
>>>>>        pte_t *ptep;
>>>>>        unsigned char *vec = walk->private;
>>>>>        int nr = (end - addr) >> PAGE_SHIFT;
>>>>> +    int step, i;
>>>>>        ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>        if (ptl) {
>>>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd,
>>>>> unsigned long addr, unsigned long end,
>>>>>            walk->action = ACTION_AGAIN;
>>>>>            return 0;
>>>>>        }
>>>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>>>            pte_t pte = ptep_get(ptep);
>>>>> +        step = 1;
>>>>>            /* We need to do cache lookup too for pte markers */
>>>>>            if (pte_none_mostly(pte))
>>>>>                __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>>>                             vma, vec);
>>>>> -        else if (pte_present(pte))
>>>>> -            *vec = 1;
>>>>> -        else { /* pte is a swap entry */
>>>>> +        else if (pte_present(pte)) {
>>>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>>>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>>>>> +
>>>>> +                if (folio && folio_test_large(folio)) {
>>>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>>>> +
>>>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>>>> +                }
>>>>> +            }
>>>>
>>>> Can we go ahead with this along with [1], that will help us generalize
>>>> for all arches.
>>>>
>>>> [1] https://lore.kernel.org/all/20250506050056.59250-3-dev.jain@arm.com/
>>>> (Please replace PAGE_SIZE with 1)
>>>
>>> As discussed with Ryan, we don’t need to call folio_pte_batch()
>>> (something like the code below), so your patch seems unnecessarily
>>> complicated. However, David is unhappy about the open-coded
>>> pte_batch_hint().
>>
>> I can live with the below :)
>>
>> Having something more universal may not make sense here. Any form of
>> batching contiguous PTEs (contiguous PFNs) -- whether with folios or not -- is
>> not required here, as we really only want to
>>
>> (a) Identify pte_present() PTEs
>> (b) Avoid the cost of repeated ptep_get() with cont-pte.
> 
> Good. I will change the patch and resend it. Thanks.

Agreed. Thanks Baolin!



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread

Thread overview: 23+ messages:
2025-03-26  3:38 [PATCH 0/2] Fix mincore() tmpfs test failure Baolin Wang
2025-03-26  3:38 ` [PATCH 1/2] selftests: mincore: fix tmpfs mincore " Baolin Wang
2025-03-27 14:36   ` Zi Yan
2025-03-30 19:47     ` Baolin Wang
2025-04-01 12:54   ` David Hildenbrand
2025-04-07  3:49     ` Baolin Wang
2025-04-07  7:49       ` David Hildenbrand
2025-04-07  8:35         ` Baolin Wang
2025-03-26  3:38 ` [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios Baolin Wang
2025-03-27 10:49   ` Oscar Salvador
2025-03-27 11:54     ` Baolin Wang
2025-03-27 14:08   ` Ryan Roberts
2025-03-28 13:10     ` Oscar Salvador
2025-03-30 19:57     ` Baolin Wang
2025-04-01 10:45       ` Ryan Roberts
2025-04-01 13:04         ` David Hildenbrand
2025-04-07  6:33           ` Baolin Wang
2025-04-14 13:46             ` Ryan Roberts
2025-05-07  5:12   ` Dev Jain
2025-05-07  9:48     ` Baolin Wang
2025-05-07  9:54       ` David Hildenbrand
2025-05-07 10:03         ` Baolin Wang
2025-05-07 11:14           ` Ryan Roberts
