linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
@ 2024-06-25  9:06 Gavin Shan
  2024-06-25  9:06 ` [PATCH 1/4] mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray Gavin Shan
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Gavin Shan @ 2024-06-25  9:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, linux-kernel, david, djwong, willy, akpm, hughd,
	torvalds, zhenyzha, shan.gavin

Currently, xarray can't support arbitrary page cache size. More details
can be found from the WARN_ON() statement in xas_split_alloc(). In our
test whose code is attached below, we hit the WARN_ON() on ARM64 system
where the base page size is 64KB and huge page size is 512MB. The issue
was reported long time ago and some discussions on it can be found here
[1].

[1] https://www.spinics.net/lists/linux-xfs/msg75404.html 
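
For reference, below is a minimal userspace sketch of the order arithmetic
involved. The constants mirror the values discussed in this series and are
assumptions of the sketch: XA_CHUNK_SHIFT is 6 when CONFIG_BASE_SMALL is
disabled, and a 512MB huge page on a 64KB base page corresponds to order 13.

#include <stdio.h>

#define XA_CHUNK_SHIFT	6			/* CONFIG_BASE_SMALL disabled */
#define MAX_XAS_ORDER	(2 * XA_CHUNK_SHIFT - 1)

int main(void)
{
	int page_shift = 16;			/* 64KB base page */
	int pmd_shift = 29;			/* 512MB huge page */
	int hpage_pmd_order = pmd_shift - page_shift;

	printf("HPAGE_PMD_ORDER=%d MAX_XAS_ORDER=%d\n",
	       hpage_pmd_order, MAX_XAS_ORDER);
	printf("order-%d page cache entries %s be split by xas_split_alloc()\n",
	       hpage_pmd_order,
	       hpage_pmd_order > MAX_XAS_ORDER ? "cannot" : "can");
	return 0;
}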

In order to fix the issue, we need to adjust MAX_PAGECACHE_ORDER to one
supported by xarray and avoid PMD-sized page cache if needed. The code
changes are suggested by David Hildenbrand.

PATCH[1] adjusts MAX_PAGECACHE_ORDER to that supported by xarray
PATCH[2-3] avoids PMD-sized page cache in the synchronous readahead path
PATCH[4] avoids PMD-sized page cache for shmem files if needed

Test program
============
# cat test.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/mman.h>

#define TEST_XFS_FILENAME	"/tmp/data"
#define TEST_SHMEM_FILENAME	"/dev/shm/data"
#define TEST_MEM_SIZE		0x20000000

int main(int argc, char **argv)
{
	const char *filename;
	int fd = 0;
	void *buf = (void *)-1, *p;
	int pgsize = getpagesize();
	int ret;

	if (pgsize != 0x10000) {
		fprintf(stderr, "64KB base page size is required\n");
		return -EPERM;
	}

	system("echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled");
	system("rm -fr /tmp/data");
	system("rm -fr /dev/shm/data");
	system("echo 1 > /proc/sys/vm/drop_caches");

	/* Open xfs or shmem file */
	filename = TEST_XFS_FILENAME;
	if (argc > 1 && !strcmp(argv[1], "shmem"))
		filename = TEST_SHMEM_FILENAME;

	fd = open(filename, O_CREAT | O_RDWR | O_TRUNC, 0666);
	if (fd < 0) {
		fprintf(stderr, "Unable to open <%s>\n", filename);
		return -EIO;
	}

	/* Extend file size */
	ret = ftruncate(fd, TEST_MEM_SIZE);
	if (ret) {
		fprintf(stderr, "Error %d to ftruncate()\n", ret);
		goto cleanup;
	}

	/* Create VMA */
	buf = mmap(NULL, TEST_MEM_SIZE,
		   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == (void *)-1) {
		fprintf(stderr, "Unable to mmap <%s>\n", filename);
		goto cleanup;
	}

	fprintf(stdout, "mapped buffer at %p\n", buf);
	ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
	if (ret) {
		fprintf(stderr, "Unable to madvise(MADV_HUGEPAGE)\n");
		goto cleanup;
	}

	/* Populate VMA */
	ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_WRITE);
	if (ret) {
		fprintf(stderr, "Error %d to madvise(MADV_POPULATE_WRITE)\n", ret);
		goto cleanup;
	}

	/* Punch the file to enforce xarray split */
	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
			TEST_MEM_SIZE - pgsize, pgsize);
	if (ret)
		fprintf(stderr, "Error %d to fallocate()\n", ret);

cleanup:
	if (buf != (void *)-1)
		munmap(buf, TEST_MEM_SIZE);
	if (fd > 0)
		close(fd);

	return 0;
}

# gcc test.c -o test
# cat /proc/1/smaps | grep KernelPageSize | head -n 1
KernelPageSize:       64 kB
# ./test shmem
   :
------------[ cut here ]------------
WARNING: CPU: 17 PID: 5253 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
ip_set nf_tables rfkill nfnetlink vfat fat virtio_balloon          \
drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64  \
virtio_net sha1_ce net_failover failover virtio_console virtio_blk \
dimlib virtio_mmio
CPU: 17 PID: 5253 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #12
Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : xas_split_alloc+0xf8/0x128
lr : split_huge_page_to_list_to_order+0x1c4/0x720
sp : ffff80008a92f5b0
x29: ffff80008a92f5b0 x28: ffff80008a92f610 x27: ffff80008a92f728
x26: 0000000000000cc0 x25: 000000000000000d x24: ffff0000cf00c858
x23: ffff80008a92f610 x22: ffffffdfc0600000 x21: 0000000000000000
x20: 0000000000000000 x19: ffffffdfc0600000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000018000000000 x15: 3374004000000000
x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
x11: 3374000000000000 x10: 3374e1c0ffff6000 x9 : ffffb463a84c681c
x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00011c976ce0
x5 : ffffb463aa47e378 x4 : 0000000000000000 x3 : 0000000000000cc0
x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
Call trace:
 xas_split_alloc+0xf8/0x128
 split_huge_page_to_list_to_order+0x1c4/0x720
 truncate_inode_partial_folio+0xdc/0x160
 shmem_undo_range+0x2bc/0x6a8
 shmem_fallocate+0x134/0x430
 vfs_fallocate+0x124/0x2e8
 ksys_fallocate+0x4c/0xa0
 __arm64_sys_fallocate+0x24/0x38
 invoke_syscall.constprop.0+0x7c/0xd8
 do_el0_svc+0xb4/0xd0
 el0_svc+0x44/0x1d8
 el0t_64_sync_handler+0x134/0x150
 el0t_64_sync+0x17c/0x180

Gavin Shan (4):
  mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray
  mm/filemap: Skip to allocate PMD-sized folios if needed
  mm/readahead: Limit page cache size in page_cache_ra_order()
  mm/shmem: Disable PMD-sized page cache if needed

 include/linux/pagemap.h | 11 +++++++++--
 mm/filemap.c            |  2 +-
 mm/readahead.c          |  8 ++++----
 mm/shmem.c              | 15 +++++++++++++--
 4 files changed, 27 insertions(+), 9 deletions(-)

-- 
2.45.1



* [PATCH 1/4] mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray
  2024-06-25  9:06 [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Gavin Shan
@ 2024-06-25  9:06 ` Gavin Shan
  2024-06-25 18:43   ` David Hildenbrand
  2024-06-25  9:06 ` [PATCH 2/4] mm/filemap: Skip to allocate PMD-sized folios if needed Gavin Shan
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Gavin Shan @ 2024-06-25  9:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, linux-kernel, david, djwong, willy, akpm, hughd,
	torvalds, zhenyzha, shan.gavin

The largest page cache order can be HPAGE_PMD_ORDER (13) on ARM64
with 64KB base page size. The xarray entry with this order can't
be split as the following error messages indicate.

------------[ cut here ]------------
WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm      \
fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64      \
sha1_ce virtio_net net_failover virtio_console virtio_blk failover \
dimlib virtio_mmio
CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : xas_split_alloc+0xf8/0x128
lr : split_huge_page_to_list_to_order+0x1c4/0x720
sp : ffff800087a4f6c0
x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff
x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858
x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000
x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000
x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000
x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28
x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8
x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40
x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
Call trace:
 xas_split_alloc+0xf8/0x128
 split_huge_page_to_list_to_order+0x1c4/0x720
 truncate_inode_partial_folio+0xdc/0x160
 truncate_inode_pages_range+0x1b4/0x4a8
 truncate_pagecache_range+0x84/0xa0
 xfs_flush_unmap_range+0x70/0x90 [xfs]
 xfs_file_fallocate+0xfc/0x4d8 [xfs]
 vfs_fallocate+0x124/0x2e8
 ksys_fallocate+0x4c/0xa0
 __arm64_sys_fallocate+0x24/0x38
 invoke_syscall.constprop.0+0x7c/0xd8
 do_el0_svc+0xb4/0xd0
 el0_svc+0x44/0x1d8
 el0t_64_sync_handler+0x134/0x150
 el0t_64_sync+0x17c/0x180

Fix it by decreasing MAX_PAGECACHE_ORDER to the largest order supported
by xarray. For this specific case, MAX_PAGECACHE_ORDER is dropped from
13 to 11 when CONFIG_BASE_SMALL is disabled.
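
As a quick sanity check of the arithmetic (assuming XA_CHUNK_SHIFT is 6
when CONFIG_BASE_SMALL is disabled):

  MAX_XAS_ORDER       = 2 * XA_CHUNK_SHIFT - 1 = 11
  HPAGE_PMD_ORDER     = PMD_SHIFT - PAGE_SHIFT = 29 - 16 = 13  (ARM64, 64KB pages)
  MAX_PAGECACHE_ORDER = min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER) = 11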

Fixes: 4f6617011910 ("filemap: Allow __filemap_get_folio to allocate large folios")
Cc: stable@kernel.org # v6.6+
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 include/linux/pagemap.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 59f1df0cde5a..a0a026d2d244 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -354,11 +354,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  * a good order (that's 1MB if you're using 4kB pages)
  */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#define PREFERRED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
 #else
-#define MAX_PAGECACHE_ORDER	8
+#define PREFERRED_MAX_PAGECACHE_ORDER	8
 #endif
 
+/*
+ * xas_split_alloc() does not support arbitrary orders. This implies no
+ * 512MB THP on ARM64 with 64KB base page size.
+ */
+#define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
+#define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
  * @mapping: The file.
-- 
2.45.1



* [PATCH 2/4] mm/filemap: Skip to allocate PMD-sized folios if needed
  2024-06-25  9:06 [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Gavin Shan
  2024-06-25  9:06 ` [PATCH 1/4] mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray Gavin Shan
@ 2024-06-25  9:06 ` Gavin Shan
  2024-06-25 18:44   ` David Hildenbrand
  2024-06-25  9:06 ` [PATCH 3/4] mm/readahead: Limit page cache size in page_cache_ra_order() Gavin Shan
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Gavin Shan @ 2024-06-25  9:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, linux-kernel, david, djwong, willy, akpm, hughd,
	torvalds, zhenyzha, shan.gavin

On ARM64, HPAGE_PMD_ORDER is 13 when the base page size is 64KB. The
PMD-sized page cache can't be supported by xarray as the following
error messages indicate.

------------[ cut here ]------------
WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm      \
fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64      \
sha1_ce virtio_net net_failover virtio_console virtio_blk failover \
dimlib virtio_mmio
CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : xas_split_alloc+0xf8/0x128
lr : split_huge_page_to_list_to_order+0x1c4/0x720
sp : ffff800087a4f6c0
x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff
x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858
x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000
x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000
x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000
x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28
x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8
x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40
x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
Call trace:
 xas_split_alloc+0xf8/0x128
 split_huge_page_to_list_to_order+0x1c4/0x720
 truncate_inode_partial_folio+0xdc/0x160
 truncate_inode_pages_range+0x1b4/0x4a8
 truncate_pagecache_range+0x84/0xa0
 xfs_flush_unmap_range+0x70/0x90 [xfs]
 xfs_file_fallocate+0xfc/0x4d8 [xfs]
 vfs_fallocate+0x124/0x2e8
 ksys_fallocate+0x4c/0xa0
 __arm64_sys_fallocate+0x24/0x38
 invoke_syscall.constprop.0+0x7c/0xd8
 do_el0_svc+0xb4/0xd0
 el0_svc+0x44/0x1d8
 el0t_64_sync_handler+0x134/0x150
 el0t_64_sync+0x17c/0x180

Fix it by skipping the allocation of PMD-sized page cache when its
order (HPAGE_PMD_ORDER) is larger than MAX_PAGECACHE_ORDER. In that
case, we fall back to the regular path where the readahead window is
determined by the BDI's sysfs file (read_ahead_kb).
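
For reference, that fallback readahead window is controlled by the
per-device sysfs knob, for example (the device name and the value shown
are only illustrative):

  # cat /sys/block/vda/queue/read_ahead_kb
  128
  # echo 512 > /sys/block/vda/queue/read_ahead_kb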

Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Cc: stable@kernel.org # v5.18+
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 876cc64aadd7..b306861d9d36 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3124,7 +3124,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* Use the readahead code, even if readahead is disabled */
-	if (vm_flags & VM_HUGEPAGE) {
+	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
 		ra->size = HPAGE_PMD_NR;
-- 
2.45.1



* [PATCH 3/4] mm/readahead: Limit page cache size in page_cache_ra_order()
  2024-06-25  9:06 [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Gavin Shan
  2024-06-25  9:06 ` [PATCH 1/4] mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray Gavin Shan
  2024-06-25  9:06 ` [PATCH 2/4] mm/filemap: Skip to allocate PMD-sized folios if needed Gavin Shan
@ 2024-06-25  9:06 ` Gavin Shan
  2024-06-25 18:45   ` David Hildenbrand
  2024-06-25  9:06 ` [PATCH 4/4] mm/shmem: Disable PMD-sized page cache if needed Gavin Shan
  2024-06-25 18:37 ` [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Andrew Morton
  4 siblings, 1 reply; 20+ messages in thread
From: Gavin Shan @ 2024-06-25  9:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, linux-kernel, david, djwong, willy, akpm, hughd,
	torvalds, zhenyzha, shan.gavin

In page_cache_ra_order(), the maximal order of the page cache to be
allocated shouldn't be larger than MAX_PAGECACHE_ORDER. Otherwise,
such a large page cache may not be supported by xarray when the
corresponding xarray entry needs to be split.

For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size
is 64KB. The PMD-sized page cache can't be supported by xarray.
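
A hypothetical trace with that configuration (MAX_PAGECACHE_ORDER is 11
after the first patch in this series): if page_cache_ra_order() is entered
with new_order = 13, the old code skipped the clamping block entirely and
kept order 13; with this change the order is always clamped to
min(13, 11, ilog2(ra->size)), i.e. at most 11.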

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 mm/readahead.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c1b23989d9ca..817b2a352d78 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -503,11 +503,11 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	limit = min(limit, index + ra->size - 1);
 
-	if (new_order < MAX_PAGECACHE_ORDER) {
+	if (new_order < MAX_PAGECACHE_ORDER)
 		new_order += 2;
-		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
-		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
-	}
+
+	new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
+	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 
 	/* See comment in page_cache_ra_unbounded() */
 	nofs = memalloc_nofs_save();
-- 
2.45.1



* [PATCH 4/4] mm/shmem: Disable PMD-sized page cache if needed
  2024-06-25  9:06 [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Gavin Shan
                   ` (2 preceding siblings ...)
  2024-06-25  9:06 ` [PATCH 3/4] mm/readahead: Limit page cache size in page_cache_ra_order() Gavin Shan
@ 2024-06-25  9:06 ` Gavin Shan
  2024-06-25 18:50   ` David Hildenbrand
  2024-06-25 18:37 ` [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Andrew Morton
  4 siblings, 1 reply; 20+ messages in thread
From: Gavin Shan @ 2024-06-25  9:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, linux-kernel, david, djwong, willy, akpm, hughd,
	torvalds, zhenyzha, shan.gavin

For shmem files, it's possible that PMD-sized page cache can't be
supported by xarray. For example, a 512MB page cache entry on ARM64
with a 64KB base page size is too large for xarray to split. Attempting
to split this sort of xarray entry leads to errors, as the following
messages indicate.

WARNING: CPU: 34 PID: 7578 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6   \
nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject        \
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4  \
ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse xfs  \
libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net \
net_failover virtio_console virtio_blk failover dimlib virtio_mmio
CPU: 34 PID: 7578 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : xas_split_alloc+0xf8/0x128
lr : split_huge_page_to_list_to_order+0x1c4/0x720
sp : ffff8000882af5f0
x29: ffff8000882af5f0 x28: ffff8000882af650 x27: ffff8000882af768
x26: 0000000000000cc0 x25: 000000000000000d x24: ffff00010625b858
x23: ffff8000882af650 x22: ffffffdfc0900000 x21: 0000000000000000
x20: 0000000000000000 x19: ffffffdfc0900000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000018000000000 x15: 52f8004000000000
x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
x11: 52f8000000000000 x10: 52f8e1c0ffff6000 x9 : ffffbeb9619a681c
x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00010b02ddb0
x5 : ffffbeb96395e378 x4 : 0000000000000000 x3 : 0000000000000cc0
x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
Call trace:
 xas_split_alloc+0xf8/0x128
 split_huge_page_to_list_to_order+0x1c4/0x720
 truncate_inode_partial_folio+0xdc/0x160
 shmem_undo_range+0x2bc/0x6a8
 shmem_fallocate+0x134/0x430
 vfs_fallocate+0x124/0x2e8
 ksys_fallocate+0x4c/0xa0
 __arm64_sys_fallocate+0x24/0x38
 invoke_syscall.constprop.0+0x7c/0xd8
 do_el0_svc+0xb4/0xd0
 el0_svc+0x44/0x1d8
 el0t_64_sync_handler+0x134/0x150
 el0t_64_sync+0x17c/0x180

Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is
larger than MAX_PAGECACHE_ORDER.
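
To spell out the effect: on ARM64 with a 64KB base page size
(HPAGE_PMD_ORDER = 13, which is larger than MAX_PAGECACHE_ORDER = 11 after
the first patch in this series), shmem_is_huge() now returns false
regardless of the sysfs/mount settings, so shmem falls back to small
folios instead of creating entries that xarray can't split.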

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 mm/shmem.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index a8b181a63402..5453875e3810 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -541,8 +541,9 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 
-bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
-		   struct mm_struct *mm, unsigned long vm_flags)
+static bool __shmem_is_huge(struct inode *inode, pgoff_t index,
+			    bool shmem_huge_force, struct mm_struct *mm,
+			    unsigned long vm_flags)
 {
 	loff_t i_size;
 
@@ -573,6 +574,16 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
 	}
 }
 
+bool shmem_is_huge(struct inode *inode, pgoff_t index,
+		   bool shmem_huge_force, struct mm_struct *mm,
+		   unsigned long vm_flags)
+{
+	if (!__shmem_is_huge(inode, index, shmem_huge_force, mm, vm_flags))
+		return false;
+
+	return HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER;
+}
+
 #if defined(CONFIG_SYSFS)
 static int shmem_parse_huge(const char *str)
 {
-- 
2.45.1



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-25  9:06 [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Gavin Shan
                   ` (3 preceding siblings ...)
  2024-06-25  9:06 ` [PATCH 4/4] mm/shmem: Disable PMD-sized page cache if needed Gavin Shan
@ 2024-06-25 18:37 ` Andrew Morton
  2024-06-25 18:51   ` David Hildenbrand
  4 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2024-06-25 18:37 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linux-mm, linux-fsdevel, linux-kernel, david, djwong, willy,
	hughd, torvalds, zhenyzha, shan.gavin

On Tue, 25 Jun 2024 19:06:42 +1000 Gavin Shan <gshan@redhat.com> wrote:

> Currently, xarray can't support arbitrary page cache size. More details
> can be found from the WARN_ON() statement in xas_split_alloc(). In our
> test whose code is attached below, we hit the WARN_ON() on ARM64 system
> where the base page size is 64KB and huge page size is 512MB. The issue
> was reported long time ago and some discussions on it can be found here
> [1].
> 
> [1] https://www.spinics.net/lists/linux-xfs/msg75404.html 
> 
> In order to fix the issue, we need to adjust MAX_PAGECACHE_ORDER to one
> supported by xarray and avoid PMD-sized page cache if needed. The code
> changes are suggested by David Hildenbrand.
> 
> PATCH[1] adjusts MAX_PAGECACHE_ORDER to that supported by xarray
> PATCH[2-3] avoids PMD-sized page cache in the synchronous readahead path
> PATCH[4] avoids PMD-sized page cache for shmem files if needed

Questions on the timing of these.

1&2 are cc:stable whereas 3&4 are not.

I could split them and feed 1&2 into 6.10-rcX and 3&4 into 6.11-rc1.  A
problem with this approach is that we're putting a basically untested
combination into -stable: 1&2 might have bugs which were accidentally
fixed in 3&4.  A way to avoid this is to add cc:stable to all four
patches.

What are your thoughts on this matter?


* Re: [PATCH 1/4] mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray
  2024-06-25  9:06 ` [PATCH 1/4] mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray Gavin Shan
@ 2024-06-25 18:43   ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2024-06-25 18:43 UTC (permalink / raw)
  To: Gavin Shan, linux-mm
  Cc: linux-fsdevel, linux-kernel, djwong, willy, akpm, hughd, torvalds,
	zhenyzha, shan.gavin

On 25.06.24 11:06, Gavin Shan wrote:
> The largest page cache order can be HPAGE_PMD_ORDER (13) on ARM64
> with 64KB base page size. The xarray entry with this order can't
> be split as the following error messages indicate.
> 
> ------------[ cut here ]------------
> WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
> Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
> ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm      \
> fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64      \
> sha1_ce virtio_net net_failover virtio_console virtio_blk failover \
> dimlib virtio_mmio
> CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
> Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
> pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> pc : xas_split_alloc+0xf8/0x128
> lr : split_huge_page_to_list_to_order+0x1c4/0x720
> sp : ffff800087a4f6c0
> x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff
> x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858
> x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000
> x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000
> x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000
> x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
> x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28
> x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8
> x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40
> x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> Call trace:
>   xas_split_alloc+0xf8/0x128
>   split_huge_page_to_list_to_order+0x1c4/0x720
>   truncate_inode_partial_folio+0xdc/0x160
>   truncate_inode_pages_range+0x1b4/0x4a8
>   truncate_pagecache_range+0x84/0xa0
>   xfs_flush_unmap_range+0x70/0x90 [xfs]
>   xfs_file_fallocate+0xfc/0x4d8 [xfs]
>   vfs_fallocate+0x124/0x2e8
>   ksys_fallocate+0x4c/0xa0
>   __arm64_sys_fallocate+0x24/0x38
>   invoke_syscall.constprop.0+0x7c/0xd8
>   do_el0_svc+0xb4/0xd0
>   el0_svc+0x44/0x1d8
>   el0t_64_sync_handler+0x134/0x150
>   el0t_64_sync+0x17c/0x180
> 
> Fix it by decreasing MAX_PAGECACHE_ORDER to the largest supported order
> by xarray. For this specific case, MAX_PAGECACHE_ORDER is dropped from
> 13 to 11 when CONFIG_BASE_SMALL is disabled.
> 
> Fixes: 4f6617011910 ("filemap: Allow __filemap_get_folio to allocate large folios")
> Cc: stable@kernel.org # v6.6+
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   include/linux/pagemap.h | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 59f1df0cde5a..a0a026d2d244 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -354,11 +354,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>    * a good order (that's 1MB if you're using 4kB pages)
>    */
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> +#define PREFERRED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
>   #else
> -#define MAX_PAGECACHE_ORDER	8
> +#define PREFERRED_MAX_PAGECACHE_ORDER	8
>   #endif
>   
> +/*
> + * xas_split_alloc() does not support arbitrary orders. This implies no
> + * 512MB THP on ARM64 with 64KB base page size.
> + */
> +#define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
> +#define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
> +
>   /**
>    * mapping_set_large_folios() - Indicate the file supports large folios.
>    * @mapping: The file.

Thanks!

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 2/4] mm/filemap: Skip to allocate PMD-sized folios if needed
  2024-06-25  9:06 ` [PATCH 2/4] mm/filemap: Skip to allocate PMD-sized folios if needed Gavin Shan
@ 2024-06-25 18:44   ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2024-06-25 18:44 UTC (permalink / raw)
  To: Gavin Shan, linux-mm
  Cc: linux-fsdevel, linux-kernel, djwong, willy, akpm, hughd, torvalds,
	zhenyzha, shan.gavin

On 25.06.24 11:06, Gavin Shan wrote:
> On ARM64, HPAGE_PMD_ORDER is 13 when the base page size is 64KB. The
> PMD-sized page cache can't be supported by xarray as the following
> error messages indicate.
> 
> ------------[ cut here ]------------
> WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
> Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
> ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm      \
> fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64      \
> sha1_ce virtio_net net_failover virtio_console virtio_blk failover \
> dimlib virtio_mmio
> CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
> Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
> pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> pc : xas_split_alloc+0xf8/0x128
> lr : split_huge_page_to_list_to_order+0x1c4/0x720
> sp : ffff800087a4f6c0
> x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff
> x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858
> x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000
> x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000
> x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000
> x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
> x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28
> x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8
> x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40
> x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> Call trace:
>   xas_split_alloc+0xf8/0x128
>   split_huge_page_to_list_to_order+0x1c4/0x720
>   truncate_inode_partial_folio+0xdc/0x160
>   truncate_inode_pages_range+0x1b4/0x4a8
>   truncate_pagecache_range+0x84/0xa0
>   xfs_flush_unmap_range+0x70/0x90 [xfs]
>   xfs_file_fallocate+0xfc/0x4d8 [xfs]
>   vfs_fallocate+0x124/0x2e8
>   ksys_fallocate+0x4c/0xa0
>   __arm64_sys_fallocate+0x24/0x38
>   invoke_syscall.constprop.0+0x7c/0xd8
>   do_el0_svc+0xb4/0xd0
>   el0_svc+0x44/0x1d8
>   el0t_64_sync_handler+0x134/0x150
>   el0t_64_sync+0x17c/0x180
> 
> Fix it by skipping to allocate PMD-sized page cache when its size
> is larger than MAX_PAGECACHE_ORDER. For this specific case, we will
> fall to regular path where the readahead window is determined by BDI's
> sysfs file (read_ahead_kb).
> 
> Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> Cc: stable@kernel.org # v5.18+
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   mm/filemap.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 876cc64aadd7..b306861d9d36 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3124,7 +3124,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>   
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   	/* Use the readahead code, even if readahead is disabled */
> -	if (vm_flags & VM_HUGEPAGE) {
> +	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
>   		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>   		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>   		ra->size = HPAGE_PMD_NR;

As discussed, one option is using min(HPAGE_PMD_ORDER,
MAX_PAGECACHE_ORDER) here, but it doesn't quite result in the
expected performance on arm64 with 64k.
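
A rough sketch of that alternative, for the record (this is hypothetical
and not what the patch does; only the lines visible in the hunk above are
shown, and the remainder of the block would use 'order' accordingly):

	if (vm_flags & VM_HUGEPAGE) {
		/* hypothetical: clamp the order instead of skipping */
		unsigned int order = min_t(unsigned int, HPAGE_PMD_ORDER,
					   MAX_PAGECACHE_ORDER);

		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
		ractl._index &= ~((1UL << order) - 1);
		ra->size = 1UL << order;
		/* remainder as in the existing code */
	}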

This code dates back to PMD-THP times, so we'll leave it like that for now.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 3/4] mm/readahead: Limit page cache size in page_cache_ra_order()
  2024-06-25  9:06 ` [PATCH 3/4] mm/readahead: Limit page cache size in page_cache_ra_order() Gavin Shan
@ 2024-06-25 18:45   ` David Hildenbrand
  2024-06-26  0:48     ` Gavin Shan
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-06-25 18:45 UTC (permalink / raw)
  To: Gavin Shan, linux-mm
  Cc: linux-fsdevel, linux-kernel, djwong, willy, akpm, hughd, torvalds,
	zhenyzha, shan.gavin

On 25.06.24 11:06, Gavin Shan wrote:
> In page_cache_ra_order(), the maximal order of the page cache to be
> allocated shouldn't be larger than MAX_PAGECACHE_ORDER. Otherwise,
> it's possible the large page cache can't be supported by xarray when
> the corresponding xarray entry is split.
> 
> For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size
> is 64KB. The PMD-sized page cache can't be supported by xarray.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>

Heh, you came up with this yourself concurrently :) so feel free to drop 
that.

Acked-by: David Hildenbrand <david@redhat.com>

> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   mm/readahead.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index c1b23989d9ca..817b2a352d78 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -503,11 +503,11 @@ void page_cache_ra_order(struct readahead_control *ractl,
>   
>   	limit = min(limit, index + ra->size - 1);
>   
> -	if (new_order < MAX_PAGECACHE_ORDER) {
> +	if (new_order < MAX_PAGECACHE_ORDER)
>   		new_order += 2;
> -		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
> -		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> -	}
> +
> +	new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
> +	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>   
>   	/* See comment in page_cache_ra_unbounded() */
>   	nofs = memalloc_nofs_save();

-- 
Cheers,

David / dhildenb



* Re: [PATCH 4/4] mm/shmem: Disable PMD-sized page cache if needed
  2024-06-25  9:06 ` [PATCH 4/4] mm/shmem: Disable PMD-sized page cache if needed Gavin Shan
@ 2024-06-25 18:50   ` David Hildenbrand
  2024-06-26  8:24     ` Ryan Roberts
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-06-25 18:50 UTC (permalink / raw)
  To: Gavin Shan, linux-mm
  Cc: linux-fsdevel, linux-kernel, djwong, willy, akpm, hughd, torvalds,
	zhenyzha, shan.gavin, Ryan Roberts

On 25.06.24 11:06, Gavin Shan wrote:
> For shmem files, it's possible that PMD-sized page cache can't be
> supported by xarray. For example, 512MB page cache on ARM64 when
> the base page size is 64KB can't be supported by xarray. It leads
> to errors as the following messages indicate when this sort of xarray
> entry is split.
> 
> WARNING: CPU: 34 PID: 7578 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
> Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6   \
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject        \
> nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4  \
> ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse xfs  \
> libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net \
> net_failover virtio_console virtio_blk failover dimlib virtio_mmio
> CPU: 34 PID: 7578 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
> Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
> pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> pc : xas_split_alloc+0xf8/0x128
> lr : split_huge_page_to_list_to_order+0x1c4/0x720
> sp : ffff8000882af5f0
> x29: ffff8000882af5f0 x28: ffff8000882af650 x27: ffff8000882af768
> x26: 0000000000000cc0 x25: 000000000000000d x24: ffff00010625b858
> x23: ffff8000882af650 x22: ffffffdfc0900000 x21: 0000000000000000
> x20: 0000000000000000 x19: ffffffdfc0900000 x18: 0000000000000000
> x17: 0000000000000000 x16: 0000018000000000 x15: 52f8004000000000
> x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
> x11: 52f8000000000000 x10: 52f8e1c0ffff6000 x9 : ffffbeb9619a681c
> x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00010b02ddb0
> x5 : ffffbeb96395e378 x4 : 0000000000000000 x3 : 0000000000000cc0
> x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> Call trace:
>   xas_split_alloc+0xf8/0x128
>   split_huge_page_to_list_to_order+0x1c4/0x720
>   truncate_inode_partial_folio+0xdc/0x160
>   shmem_undo_range+0x2bc/0x6a8
>   shmem_fallocate+0x134/0x430
>   vfs_fallocate+0x124/0x2e8
>   ksys_fallocate+0x4c/0xa0
>   __arm64_sys_fallocate+0x24/0x38
>   invoke_syscall.constprop.0+0x7c/0xd8
>   do_el0_svc+0xb4/0xd0
>   el0_svc+0x44/0x1d8
>   el0t_64_sync_handler+0x134/0x150
>   el0t_64_sync+0x17c/0x180
> 
> Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is
> larger than MAX_PAGECACHE_ORDER.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   mm/shmem.c | 15 +++++++++++++--
>   1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a8b181a63402..5453875e3810 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -541,8 +541,9 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>   
>   static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>   
> -bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
> -		   struct mm_struct *mm, unsigned long vm_flags)
> +static bool __shmem_is_huge(struct inode *inode, pgoff_t index,
> +			    bool shmem_huge_force, struct mm_struct *mm,
> +			    unsigned long vm_flags)
>   {
>   	loff_t i_size;
>   
> @@ -573,6 +574,16 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
>   	}
>   }
>   
> +bool shmem_is_huge(struct inode *inode, pgoff_t index,
> +		   bool shmem_huge_force, struct mm_struct *mm,
> +		   unsigned long vm_flags)
> +{
> +	if (!__shmem_is_huge(inode, index, shmem_huge_force, mm, vm_flags))
> +		return false;
> +
> +	return HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER;

Why not check for that upfront?

> +}
> +
>   #if defined(CONFIG_SYSFS)
>   static int shmem_parse_huge(const char *str)
>   {

This should make __thp_vma_allowable_orders() happy for shmem, and consequently, also khugepaged IIRC.

Acked-by: David Hildenbrand <david@redhat.com>


@Ryan,

should we do something like the following on top? The use of PUD_ORDER for ordinary pagecache is
wrong. Really only DAX is special and can support that in its own weird ways.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2aa986a5cd1b..ac63233fed6c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -72,14 +72,25 @@ extern struct kobj_attribute shmem_enabled_attr;
  #define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
  
  /*
- * Mask of all large folio orders supported for file THP.
+ * Mask of all large folio orders supported for FSDAX THP.
   */
-#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER) | BIT(PUD_ORDER))
+#define THP_ORDERS_ALL_DAX     (BIT(PMD_ORDER) | BIT(PUD_ORDER))
+
+
+/*
+ * Mask of all large folio orders supported for ordinary pagecache (file/shmem)
+ * THP.
+ */
+#if PMD_ORDER <= MAX_PAGECACHE_ORDER
+#define THP_ORDERS_ALL_FILE    0
+#else
+#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER))
+#endif
  
  /*
   * Mask of all large folio orders supported for THP.
   */
-#define THP_ORDERS_ALL         (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
+#define THP_ORDERS_ALL         (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE )
  
  #define TVA_SMAPS              (1 << 0)        /* Will be used for procfs */
  #define TVA_IN_PF              (1 << 1)        /* Page fault handler */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89932fd0f62e..95d4a2edae39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -88,9 +88,15 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
         bool smaps = tva_flags & TVA_SMAPS;
         bool in_pf = tva_flags & TVA_IN_PF;
         bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
+
         /* Check the intersection of requested and supported orders. */
-       orders &= vma_is_anonymous(vma) ?
-                       THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
+       if (vma_is_anonymous(vma))
+               orders &= THP_ORDERS_ALL_ANON;
+       else if (vma_is_dax(vma))
+               orders &= THP_ORDERS_ALL_DAX;
+       else
+               orders &= THP_ORDERS_ALL_FILE;
+
         if (!orders)
                 return 0;
  



-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-25 18:37 ` [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Andrew Morton
@ 2024-06-25 18:51   ` David Hildenbrand
  2024-06-25 18:58     ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-06-25 18:51 UTC (permalink / raw)
  To: Andrew Morton, Gavin Shan
  Cc: linux-mm, linux-fsdevel, linux-kernel, djwong, willy, hughd,
	torvalds, zhenyzha, shan.gavin

On 25.06.24 20:37, Andrew Morton wrote:
> On Tue, 25 Jun 2024 19:06:42 +1000 Gavin Shan <gshan@redhat.com> wrote:
> 
>> Currently, xarray can't support arbitrary page cache size. More details
>> can be found from the WARN_ON() statement in xas_split_alloc(). In our
>> test whose code is attached below, we hit the WARN_ON() on ARM64 system
>> where the base page size is 64KB and huge page size is 512MB. The issue
>> was reported long time ago and some discussions on it can be found here
>> [1].
>>
>> [1] https://www.spinics.net/lists/linux-xfs/msg75404.html
>>
>> In order to fix the issue, we need to adjust MAX_PAGECACHE_ORDER to one
>> supported by xarray and avoid PMD-sized page cache if needed. The code
>> changes are suggested by David Hildenbrand.
>>
>> PATCH[1] adjusts MAX_PAGECACHE_ORDER to that supported by xarray
>> PATCH[2-3] avoids PMD-sized page cache in the synchronous readahead path
>> PATCH[4] avoids PMD-sized page cache for shmem files if needed
> 
> Questions on the timing of these.
> 
> 1&2 are cc:stable whereas 3&4 are not.
> 
> I could split them and feed 1&2 into 6.10-rcX and 3&4 into 6.11-rc1.  A
> problem with this approach is that we're putting a basically untested
> combination into -stable: 1&2 might have bugs which were accidentally
> fixed in 3&4.  A way to avoid this is to add cc:stable to all four
> patches.
> 
> What are your thoughts on this matter?

Especially 4 should also be CC stable, so likely we should just do it 
for all of them.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-25 18:51   ` David Hildenbrand
@ 2024-06-25 18:58     ` Andrew Morton
  2024-06-25 19:05       ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2024-06-25 18:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gavin Shan, linux-mm, linux-fsdevel, linux-kernel, djwong, willy,
	hughd, torvalds, zhenyzha, shan.gavin

On Tue, 25 Jun 2024 20:51:13 +0200 David Hildenbrand <david@redhat.com> wrote:

> > I could split them and feed 1&2 into 6.10-rcX and 3&4 into 6.11-rc1.  A
> > problem with this approach is that we're putting a basically untested
> > combination into -stable: 1&2 might have bugs which were accidentally
> > fixed in 3&4.  A way to avoid this is to add cc:stable to all four
> > patches.
> > 
> > What are your thoughts on this matter?
> 
> Especially 4 should also be CC stable, so likely we should just do it 
> for all of them.

Fine.  A Fixes: for 3 & 4 would be good.  Otherwise we're potentially
asking for those to be backported further than 1 & 2, which seems
wrong.

Then again, by having different Fixes: in the various patches we're
suggesting that people split the patch series apart as they slot things
into the indicated places.  In other words, it's not a patch series at
all - it's a sprinkle of independent fixes.  Are we OK thinking of it
in that fashion?



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-25 18:58     ` Andrew Morton
@ 2024-06-25 19:05       ` David Hildenbrand
  2024-06-26  0:37         ` Gavin Shan
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-06-25 19:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gavin Shan, linux-mm, linux-fsdevel, linux-kernel, djwong, willy,
	hughd, torvalds, zhenyzha, shan.gavin

On 25.06.24 20:58, Andrew Morton wrote:
> On Tue, 25 Jun 2024 20:51:13 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>>> I could split them and feed 1&2 into 6.10-rcX and 3&4 into 6.11-rc1.  A
>>> problem with this approach is that we're putting a basically untested
>>> combination into -stable: 1&2 might have bugs which were accidentally
>>> fixed in 3&4.  A way to avoid this is to add cc:stable to all four
>>> patches.
>>>
>>> What are your thoughts on this matter?
>>
>> Especially 4 should also be CC stable, so likely we should just do it
>> for all of them.
> 
> Fine.  A Fixes: for 3 & 4 would be good.  Otherwise we're potentially
> asking for those to be backported further than 1 & 2, which seems
> wrong.

4 is shmem fix, which likely dates back a bit longer.

> 
> Then again, by having different Fixes: in the various patches we're
> suggesting that people split the patch series apart as they slot things
> into the indicated places.  In other words, it's not a patch series at
> all - it's a sprinkle of independent fixes.  Are we OK thinking of it
> in that fashion?

The common theme is "pagecache cannot handle > order-11", #1-3 tackle 
"ordinary" file THP, #4 tackles shmem THP.

So I'm not sure we should be splitting it apart. It's just that shmem 
THP arrived before file THP :)

-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-25 19:05       ` David Hildenbrand
@ 2024-06-26  0:37         ` Gavin Shan
  2024-06-26 20:38           ` Andrew Morton
  2024-06-26 20:54           ` Matthew Wilcox
  0 siblings, 2 replies; 20+ messages in thread
From: Gavin Shan @ 2024-06-26  0:37 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: linux-mm, linux-fsdevel, linux-kernel, djwong, willy, hughd,
	torvalds, zhenyzha, shan.gavin

On 6/26/24 5:05 AM, David Hildenbrand wrote:
> On 25.06.24 20:58, Andrew Morton wrote:
>> On Tue, 25 Jun 2024 20:51:13 +0200 David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I could split them and feed 1&2 into 6.10-rcX and 3&4 into 6.11-rc1.  A
>>>> problem with this approach is that we're putting a basically untested
>>>> combination into -stable: 1&2 might have bugs which were accidentally
>>>> fixed in 3&4.  A way to avoid this is to add cc:stable to all four
>>>> patches.
>>>>
>>>> What are your thoughts on this matter?
>>>
>>> Especially 4 should also be CC stable, so likely we should just do it
>>> for all of them.
>>
>> Fine.  A Fixes: for 3 & 4 would be good.  Otherwise we're potentially
>> asking for those to be backported further than 1 & 2, which seems
>> wrong.
> 
> 4 is shmem fix, which likely dates back a bit longer.
> 
>>
>> Then again, by having different Fixes: in the various patches we're
>> suggesting that people split the patch series apart as they slot things
>> into the indicated places.  In other words, it's not a patch series at
>> all - it's a sprinkle of independent fixes.  Are we OK thinking of it
>> in that fashion?
> 
> The common themes is "pagecache cannot handle > order-11", #1-3 tackle "ordinary" file THP, #4 tackles shmem THP.
> 
> So I'm not sure we should be splitting it apart. It's just that shmem THP arrived before file THP :)
> 

I rechecked the history; it's a bit hard to have a precise Fixes tag for PATCH[4].
Please let me know if you have a better one for PATCH[4].

#4
   Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
   Cc: stable@kernel.org # v4.10+
   Fixes: 552446a41661 ("shmem: Convert shmem_add_to_page_cache to XArray")
   Cc: stable@kernel.org # v4.20+
#3
   Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
   Cc: stable@kernel.org # v5.18+
#2
   Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
   Cc: stable@kernel.org # v5.18+
#1
   Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
   Cc: stable@kernel.org # v5.18+

I probably need to move PATCH[3] before PATCH[2] since PATCH[1] and PATCH[2]
point to same commit.

Thanks,
Gavin



* Re: [PATCH 3/4] mm/readahead: Limit page cache size in page_cache_ra_order()
  2024-06-25 18:45   ` David Hildenbrand
@ 2024-06-26  0:48     ` Gavin Shan
  0 siblings, 0 replies; 20+ messages in thread
From: Gavin Shan @ 2024-06-26  0:48 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-fsdevel, linux-kernel, djwong, willy, akpm, hughd, torvalds,
	zhenyzha, shan.gavin

On 6/26/24 4:45 AM, David Hildenbrand wrote:
> On 25.06.24 11:06, Gavin Shan wrote:
>> In page_cache_ra_order(), the maximal order of the page cache to be
>> allocated shouldn't be larger than MAX_PAGECACHE_ORDER. Otherwise,
>> it's possible the large page cache can't be supported by xarray when
>> the corresponding xarray entry is split.
>>
>> For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size
>> is 64KB. The PMD-sized page cache can't be supported by xarray.
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
> 
> Heh, you came up with this yourself concurrently :) so feel free to drop that.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 

David, thanks for your follow-up and reviews. I will drop that tag in the next respin :)

Thanks,
Gavin

>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   mm/readahead.c | 8 ++++----
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index c1b23989d9ca..817b2a352d78 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -503,11 +503,11 @@ void page_cache_ra_order(struct readahead_control *ractl,
>>       limit = min(limit, index + ra->size - 1);
>> -    if (new_order < MAX_PAGECACHE_ORDER) {
>> +    if (new_order < MAX_PAGECACHE_ORDER)
>>           new_order += 2;
>> -        new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
>> -        new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>> -    }
>> +
>> +    new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
>> +    new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>>       /* See comment in page_cache_ra_unbounded() */
>>       nofs = memalloc_nofs_save();
> 



* Re: [PATCH 4/4] mm/shmem: Disable PMD-sized page cache if needed
  2024-06-25 18:50   ` David Hildenbrand
@ 2024-06-26  8:24     ` Ryan Roberts
  0 siblings, 0 replies; 20+ messages in thread
From: Ryan Roberts @ 2024-06-26  8:24 UTC (permalink / raw)
  To: David Hildenbrand, Gavin Shan, linux-mm
  Cc: linux-fsdevel, linux-kernel, djwong, willy, akpm, hughd, torvalds,
	zhenyzha, shan.gavin

On 25/06/2024 19:50, David Hildenbrand wrote:
> On 25.06.24 11:06, Gavin Shan wrote:
>> For shmem files, it's possible that PMD-sized page cache can't be
>> supported by xarray. For example, 512MB page cache on ARM64 when
>> the base page size is 64KB can't be supported by xarray. It leads
>> to errors as the following messages indicate when this sort of xarray
>> entry is split.
>>
>> WARNING: CPU: 34 PID: 7578 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
>> Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6   \
>> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject        \
>> nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4  \
>> ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse xfs  \
>> libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net \
>> net_failover virtio_console virtio_blk failover dimlib virtio_mmio
>> CPU: 34 PID: 7578 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
>> Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
>> pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>> pc : xas_split_alloc+0xf8/0x128
>> lr : split_huge_page_to_list_to_order+0x1c4/0x720
>> sp : ffff8000882af5f0
>> x29: ffff8000882af5f0 x28: ffff8000882af650 x27: ffff8000882af768
>> x26: 0000000000000cc0 x25: 000000000000000d x24: ffff00010625b858
>> x23: ffff8000882af650 x22: ffffffdfc0900000 x21: 0000000000000000
>> x20: 0000000000000000 x19: ffffffdfc0900000 x18: 0000000000000000
>> x17: 0000000000000000 x16: 0000018000000000 x15: 52f8004000000000
>> x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
>> x11: 52f8000000000000 x10: 52f8e1c0ffff6000 x9 : ffffbeb9619a681c
>> x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00010b02ddb0
>> x5 : ffffbeb96395e378 x4 : 0000000000000000 x3 : 0000000000000cc0
>> x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
>> Call trace:
>>   xas_split_alloc+0xf8/0x128
>>   split_huge_page_to_list_to_order+0x1c4/0x720
>>   truncate_inode_partial_folio+0xdc/0x160
>>   shmem_undo_range+0x2bc/0x6a8
>>   shmem_fallocate+0x134/0x430
>>   vfs_fallocate+0x124/0x2e8
>>   ksys_fallocate+0x4c/0xa0
>>   __arm64_sys_fallocate+0x24/0x38
>>   invoke_syscall.constprop.0+0x7c/0xd8
>>   do_el0_svc+0xb4/0xd0
>>   el0_svc+0x44/0x1d8
>>   el0t_64_sync_handler+0x134/0x150
>>   el0t_64_sync+0x17c/0x180
>>
>> Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is
>> larger than MAX_PAGECACHE_ORDER.
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   mm/shmem.c | 15 +++++++++++++--
>>   1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index a8b181a63402..5453875e3810 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -541,8 +541,9 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>     static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>>   -bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
>> -           struct mm_struct *mm, unsigned long vm_flags)
>> +static bool __shmem_is_huge(struct inode *inode, pgoff_t index,
>> +                bool shmem_huge_force, struct mm_struct *mm,
>> +                unsigned long vm_flags)
>>   {
>>       loff_t i_size;
>>   @@ -573,6 +574,16 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index,
>> bool shmem_huge_force,
>>       }
>>   }
>>   +bool shmem_is_huge(struct inode *inode, pgoff_t index,
>> +           bool shmem_huge_force, struct mm_struct *mm,
>> +           unsigned long vm_flags)
>> +{
>> +    if (!__shmem_is_huge(inode, index, shmem_huge_force, mm, vm_flags))
>> +        return false;
>> +
>> +    return HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER;

Sorry I don't have the context of the original post, but this seems odd to me,
given that MAX_PAGECACHE_ORDER is defined as HPAGE_PMD_ORDER (unless you changed
this in an earlier patch in the series?)

at least v6.10-rc4 has:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
#else
#define MAX_PAGECACHE_ORDER	8
#endif

> 
> Why not check for that upfront?
> 
>> +}
>> +
>>   #if defined(CONFIG_SYSFS)
>>   static int shmem_parse_huge(const char *str)
>>   {
> 
> This should make __thp_vma_allowable_orders() happy for shmem, and consequently,
> also khugepaged IIRC.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> 
> @Ryan,
> 
> should we do something like the following on top? The use of PUD_ORDER for
> ordinary pagecache is
> wrong. Really only DAX is special and can support that in its own weird ways.

I'll take your word for that. If correct, then I agree we should change this.
Note that arm64 doesn't support PUD THP mappings at all.

> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2aa986a5cd1b..ac63233fed6c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -72,14 +72,25 @@ extern struct kobj_attribute shmem_enabled_attr;
>  #define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>  
>  /*
> - * Mask of all large folio orders supported for file THP.
> + * Mask of all large folio orders supported for FSDAX THP.
>   */
> -#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER) | BIT(PUD_ORDER))
> +#define THP_ORDERS_ALL_DAX     (BIT(PMD_ORDER) | BIT(PUD_ORDER))
> +
> +
> +/*
> + * Mask of all large folio orders supported for ordinary pagecache (file/shmem)
> + * THP.
> + */
> +#if PMD_ORDER <= MAX_PAGECACHE_ORDER

Shouldn't this be ">="? (assuming MAX_PAGECACHE_ORDER is now defined
independently of PMD_ORDER, as per above).

Gavin's commit log only mentions shmem as being a problem with a really big PMD,
though. Is it also a problem for regular files?

> +#define THP_ORDERS_ALL_FILE    0
> +#else
> +#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER))
> +#endif
>  
>  /*
>   * Mask of all large folio orders supported for THP.
>   */
> -#define THP_ORDERS_ALL         (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
> +#define THP_ORDERS_ALL         (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE )

And I think this needs to include THP_ORDERS_ALL_DAX so that THPeligible in
show_smap() continues to work for DAX VMAs?

>  
>  #define TVA_SMAPS              (1 << 0)        /* Will be used for procfs */
>  #define TVA_IN_PF              (1 << 1)        /* Page fault handler */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 89932fd0f62e..95d4a2edae39 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -88,9 +88,15 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>         bool smaps = tva_flags & TVA_SMAPS;
>         bool in_pf = tva_flags & TVA_IN_PF;
>         bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
> +
>         /* Check the intersection of requested and supported orders. */
> -       orders &= vma_is_anonymous(vma) ?
> -                       THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
> +       if (vma_is_anonymous(vma))
> +               orders &= THP_ORDERS_ALL_ANON;
> +       else if (vma_is_dax(vma))
> +               orders &= THP_ORDERS_ALL_DAX;
> +       else
> +               orders &= THP_ORDERS_ALL_FILE;
> +
>         if (!orders)
>                 return 0;
>  
> 
> 
> 
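Putting Ryan's two observations together (only allow PMD-sized file/shmem THP when it
fits in the page cache, and keep the DAX orders in THP_ORDERS_ALL so THPeligible in
show_smap() keeps working for DAX VMAs), a corrected huge_mm.h hunk might look roughly
like the sketch below. This is illustrative only, not the patch that was eventually
posted:

/* DAX maps PMD- and PUD-sized folios in its own special way. */
#define THP_ORDERS_ALL_DAX	(BIT(PMD_ORDER) | BIT(PUD_ORDER))

/* Ordinary page cache (file/shmem) is limited by what the xarray can split. */
#if PMD_ORDER > MAX_PAGECACHE_ORDER
#define THP_ORDERS_ALL_FILE	0
#else
#define THP_ORDERS_ALL_FILE	(BIT(PMD_ORDER))
#endif

#define THP_ORDERS_ALL	\
	(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_DAX | THP_ORDERS_ALL_FILE)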



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-26  0:37         ` Gavin Shan
@ 2024-06-26 20:38           ` Andrew Morton
  2024-06-26 23:05             ` Gavin Shan
  2024-06-26 20:54           ` Matthew Wilcox
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2024-06-26 20:38 UTC (permalink / raw)
  To: Gavin Shan
  Cc: David Hildenbrand, linux-mm, linux-fsdevel, linux-kernel, djwong,
	willy, hughd, torvalds, zhenyzha, shan.gavin

On Wed, 26 Jun 2024 10:37:00 +1000 Gavin Shan <gshan@redhat.com> wrote:

> 
> I rechecked the history; it's a bit hard to have a precise fix tag for PATCH[4].
> Please let me know if you have a better one for PATCH[4].
> 
> #4
>    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
>    Cc: stable@kernel.org # v4.10+
>    Fixes: 552446a41661 ("shmem: Convert shmem_add_to_page_cache to XArray")
>    Cc: stable@kernel.org # v4.20+
> #3
>    Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>    Cc: stable@kernel.org # v5.18+
> #2
>    Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
>    Cc: stable@kernel.org # v5.18+
> #1
>    Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>    Cc: stable@kernel.org # v5.18+
> 
> I probably need to move PATCH[3] before PATCH[2] since PATCH[1] and PATCH[3]
> point to the same commit.

OK, thanks.

I assume you'll be sending a new revision of the series.  And Ryan had
comments.  Please incorporate the above into the updated changelogs as
best you can.



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-26  0:37         ` Gavin Shan
  2024-06-26 20:38           ` Andrew Morton
@ 2024-06-26 20:54           ` Matthew Wilcox
  2024-06-26 23:48             ` Gavin Shan
  1 sibling, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2024-06-26 20:54 UTC (permalink / raw)
  To: Gavin Shan
  Cc: David Hildenbrand, Andrew Morton, linux-mm, linux-fsdevel,
	linux-kernel, djwong, hughd, torvalds, zhenyzha, shan.gavin

On Wed, Jun 26, 2024 at 10:37:00AM +1000, Gavin Shan wrote:
> On 6/26/24 5:05 AM, David Hildenbrand wrote:
> > On 25.06.24 20:58, Andrew Morton wrote:
> > > On Tue, 25 Jun 2024 20:51:13 +0200 David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > > > I could split them and feed 1&2 into 6.10-rcX and 3&4 into 6.11-rc1.  A
> > > > > problem with this approach is that we're putting a basically untested
> > > > > combination into -stable: 1&2 might have bugs which were accidentally
> > > > > fixed in 3&4.  A way to avoid this is to add cc:stable to all four
> > > > > patches.
> > > > > 
> > > > > What are your thoughts on this matter?
> > > > 
> > > > Especially 4 should also be CC stable, so likely we should just do it
> > > > for all of them.
> > > 
> > > Fine.  A Fixes: for 3 & 4 would be good.  Otherwise we're potentially
> > > asking for those to be backported further than 1 & 2, which seems
> > > wrong.
> > 
> > 4 is shmem fix, which likely dates back a bit longer.
> > 
> > > 
> > > Then again, by having different Fixes: in the various patches we're
> > > suggesting that people split the patch series apart as they slot things
> > > into the indicated places.  In other words, it's not a patch series at
> > > all - it's a sprinkle of independent fixes.  Are we OK thinking of it
> > > in that fashion?
> > 
> > The common theme is "pagecache cannot handle > order-11"; #1-3 tackle "ordinary" file THP, #4 tackles shmem THP.
> > 
> > So I'm not sure we should be splitting it apart. It's just that shmem THP arrived before file THP :)
> > 
> 
> I rechecked the history; it's a bit hard to have a precise fix tag for PATCH[4].
> Please let me know if you have a better one for PATCH[4].
> 
> #4
>   Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
>   Cc: stable@kernel.org # v4.10+
>   Fixes: 552446a41661 ("shmem: Convert shmem_add_to_page_cache to XArray")
>   Cc: stable@kernel.org # v4.20+
> #3
>   Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>   Cc: stable@kernel.org # v5.18+
> #2
>   Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
>   Cc: stable@kernel.org # v5.18+
> #1
>   Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>   Cc: stable@kernel.org # v5.18+

I actually think it's this:

commit 6b24ca4a1a8d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jun 27 22:19:08 2020 -0400

    mm: Use multi-index entries in the page cache

    We currently store large folios as 2^N consecutive entries.  While this
    consumes rather more memory than necessary, it also turns out to be buggy.
    A writeback operation which starts within a tail page of a dirty folio will
    not write back the folio as the xarray's dirty bit is only set on the
    head index.  With multi-index entries, the dirty bit will be found no
    matter where in the folio the operation starts.

    This does end up simplifying the page cache slightly, although not as
    much as I had hoped.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Before this, we could split an arbitrary size folio to order 0.  After
it, we're limited to whatever the xarray allows us to split.
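
For concreteness, here is the arithmetic on the failing configuration (arm64 with a
64KB base page size), as a standalone sketch; it assumes the xarray split limit works
out to 2 * XA_CHUNK_SHIFT - 1 (order 11), the cap this series applies to
MAX_PAGECACHE_ORDER:

#include <stdio.h>

int main(void)
{
	const unsigned int base_page_shift = 16;	/* 64KB base pages */
	const unsigned int pmd_shift = 29;		/* 512MB PMD huge page */
	const unsigned int xa_chunk_shift = 6;		/* XA_CHUNK_SHIFT */

	unsigned int pmd_order = pmd_shift - base_page_shift;	/* 13 */
	unsigned int max_split_order = 2 * xa_chunk_shift - 1;	/* 11 */

	/* Order-13 folios cannot be split by the xarray, hence the WARN_ON(). */
	printf("PMD order %u, xarray split limit %u -> %s\n",
	       pmd_order, max_split_order,
	       pmd_order > max_split_order ? "unsupported" : "ok");
	return 0;
}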


* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-26 20:38           ` Andrew Morton
@ 2024-06-26 23:05             ` Gavin Shan
  0 siblings, 0 replies; 20+ messages in thread
From: Gavin Shan @ 2024-06-26 23:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, linux-mm, linux-fsdevel, linux-kernel, djwong,
	willy, hughd, torvalds, zhenyzha, shan.gavin

On 6/27/24 6:38 AM, Andrew Morton wrote:
> On Wed, 26 Jun 2024 10:37:00 +1000 Gavin Shan <gshan@redhat.com> wrote:
>>
>> I rechecked the history; it's a bit hard to have a precise fix tag for PATCH[4].
>> Please let me know if you have a better one for PATCH[4].
>>
>> #4
>>     Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
>>     Cc: stable@kernel.org # v4.10+
>>     Fixes: 552446a41661 ("shmem: Convert shmem_add_to_page_cache to XArray")
>>     Cc: stable@kernel.org # v4.20+
>> #3
>>     Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>>     Cc: stable@kernel.org # v5.18+
>> #2
>>     Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
>>     Cc: stable@kernel.org # v5.18+
>> #1
>>     Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>>     Cc: stable@kernel.org # v5.18+
>>
>> I probably need to move PATCH[3] before PATCH[2] since PATCH[1] and PATCH[3]
>> point to the same commit.
> 
> OK, thanks.
> 
> I assume you'll be sending a new revision of the series.  And Ryan had
> comments.  Please incorporate the above into the updated changelogs as
> best you can.
> 

Yes, I will post a new revision where all pending comments will be addressed.

Thanks,
Gavin



* Re: [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray
  2024-06-26 20:54           ` Matthew Wilcox
@ 2024-06-26 23:48             ` Gavin Shan
  0 siblings, 0 replies; 20+ messages in thread
From: Gavin Shan @ 2024-06-26 23:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Andrew Morton, linux-mm, linux-fsdevel,
	linux-kernel, djwong, hughd, torvalds, zhenyzha, shan.gavin

On 6/27/24 6:54 AM, Matthew Wilcox wrote:
> On Wed, Jun 26, 2024 at 10:37:00AM +1000, Gavin Shan wrote:
>> On 6/26/24 5:05 AM, David Hildenbrand wrote:
>>> On 25.06.24 20:58, Andrew Morton wrote:
>>>> On Tue, 25 Jun 2024 20:51:13 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>>> I could split them and feed 1&2 into 6.10-rcX and 3&4 into 6.11-rc1.  A
>>>>>> problem with this approach is that we're putting a basically untested
>>>>>> combination into -stable: 1&2 might have bugs which were accidentally
>>>>>> fixed in 3&4.  A way to avoid this is to add cc:stable to all four
>>>>>> patches.
>>>>>>
>>>>>> What are your thoughts on this matter?
>>>>>
>>>>> Especially 4 should also be CC stable, so likely we should just do it
>>>>> for all of them.
>>>>
>>>> Fine.  A Fixes: for 3 & 4 would be good.  Otherwise we're potentially
>>>> asking for those to be backported further than 1 & 2, which seems
>>>> wrong.
>>>
>>> 4 is shmem fix, which likely dates back a bit longer.
>>>
>>>>
>>>> Then again, by having different Fixes: in the various patches we're
>>>> suggesting that people split the patch series apart as they slot things
>>>> into the indicated places.  In other words, it's not a patch series at
>>>> all - it's a sprinkle of independent fixes.  Are we OK thinking of it
>>>> in that fashion?
>>>
>>> The common theme is "pagecache cannot handle > order-11"; #1-3 tackle "ordinary" file THP, #4 tackles shmem THP.
>>>
>>> So I'm not sure we should be splitting it apart. It's just that shmem THP arrived before file THP :)
>>>
>>
>> I rechecked the history; it's a bit hard to have a precise fix tag for PATCH[4].
>> Please let me know if you have a better one for PATCH[4].
>>
>> #4
>>    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
>>    Cc: stable@kernel.org # v4.10+
>>    Fixes: 552446a41661 ("shmem: Convert shmem_add_to_page_cache to XArray")
>>    Cc: stable@kernel.org # v4.20+
>> #3
>>    Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>>    Cc: stable@kernel.org # v5.18+
>> #2
>>    Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
>>    Cc: stable@kernel.org # v5.18+
>> #1
>>    Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>>    Cc: stable@kernel.org # v5.18+
> 
> I actually think it's this:
> 
> commit 6b24ca4a1a8d
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Sat Jun 27 22:19:08 2020 -0400
> 
>      mm: Use multi-index entries in the page cache
> 
>      We currently store large folios as 2^N consecutive entries.  While this
>      consumes rather more memory than necessary, it also turns out to be buggy.
>      A writeback operation which starts within a tail page of a dirty folio will
>      not write back the folio as the xarray's dirty bit is only set on the
>      head index.  With multi-index entries, the dirty bit will be found no
>      matter where in the folio the operation starts.
> 
>      This does end up simplifying the page cache slightly, although not as
>      much as I had hoped.
> 
>      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
> 
> Before this, we could split an arbitrary size folio to order 0.  After
> it, we're limited to whatever the xarray allows us to split.
> 

Thanks. PATCH[4]'s fix tag will point to 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache"),
which was merged into v5.17. The fix tags for the other patches are correct.

Thanks,
Gavin



end of thread, other threads:[~2024-06-26 23:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-25  9:06 [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Gavin Shan
2024-06-25  9:06 ` [PATCH 1/4] mm/filemap: Make MAX_PAGECACHE_ORDER acceptable to xarray Gavin Shan
2024-06-25 18:43   ` David Hildenbrand
2024-06-25  9:06 ` [PATCH 2/4] mm/filemap: Skip to allocate PMD-sized folios if needed Gavin Shan
2024-06-25 18:44   ` David Hildenbrand
2024-06-25  9:06 ` [PATCH 3/4] mm/readahead: Limit page cache size in page_cache_ra_order() Gavin Shan
2024-06-25 18:45   ` David Hildenbrand
2024-06-26  0:48     ` Gavin Shan
2024-06-25  9:06 ` [PATCH 4/4] mm/shmem: Disable PMD-sized page cache if needed Gavin Shan
2024-06-25 18:50   ` David Hildenbrand
2024-06-26  8:24     ` Ryan Roberts
2024-06-25 18:37 ` [PATCH 0/4] mm/filemap: Limit page cache size to that supported by xarray Andrew Morton
2024-06-25 18:51   ` David Hildenbrand
2024-06-25 18:58     ` Andrew Morton
2024-06-25 19:05       ` David Hildenbrand
2024-06-26  0:37         ` Gavin Shan
2024-06-26 20:38           ` Andrew Morton
2024-06-26 23:05             ` Gavin Shan
2024-06-26 20:54           ` Matthew Wilcox
2024-06-26 23:48             ` Gavin Shan
