Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks

public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
       [not found] ` <20240822135018.1931258-5-kernel@pankajraghav.com>
@ 2024-08-29 10:51   ` Sven Schnelle
  2024-08-29 18:46     ` Luis Chamberlain
  0 siblings, 1 reply; 12+ messages in thread
From: Sven Schnelle @ 2024-08-29 10:51 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts,
	David Howells, linux-s390

Hi,

"Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:

> From: Luis Chamberlain <mcgrof@kernel.org>
>
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Tested-by: David Howells <dhowells@redhat.com>

This causes the following warning on s390 with linux-next starting from
next-20240827:

[  112.690518] BUG: Bad page map in process ksm01  pte:a5801317 pmd:99054000
[  112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801
[  112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff)
[  112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000
[  112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000
[  112.690548] page dumped because: bad pte
[  112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102
[  112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
[  112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441
[  112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
[  112.690568] Call Trace:
[  112.690571]  [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0
[  112.690579]  [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0
[  112.690584]  [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870
[  112.690598]  [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0
[  112.690601]  [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0
[  112.690603]  [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0
[  112.690605]  [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0
[  112.690607]  [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0
[  112.690609]  [<000003ffe0405724>] exit_mmap+0xc4/0x3e0
[  112.690613]  [<000003ffe0154aa2>] mmput+0x72/0x170
[  112.690616]  [<000003ffe015e2c6>] exit_mm+0xd6/0x150
[  112.690618]  [<000003ffe015e52c>] do_exit+0x1ec/0x490
[  112.690620]  [<000003ffe015e9a4>] do_group_exit+0x44/0xc0
[  112.690621]  [<000003ffe016f000>] get_signal+0x7f0/0x800
[  112.690624]  [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320
[  112.690628]  [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170
[  112.690632]  [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0
[  112.690634]  [<000003ffe0ec303c>] system_call+0x74/0x98
[  112.690638] Disabling lock debugging due to kernel taint

To reproduce, running the ksm01 testsuite from ltp seems to be
enough. The splat is always triggered immediately. The output from ksm01
is:

tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f
tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x
tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
mem.c:422: TINFO: wait for all children to stop.
mem.c:388: TINFO: child 0 stops.
mem.c:388: TINFO: child 1 stops.
mem.c:388: TINFO: child 2 stops.
mem.c:495: TINFO: KSM merging...
mem.c:434: TINFO: resume all children.
mem.c:422: TINFO: wait for all children to stop.
mem.c:344: TINFO: child 0 continues...
mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
mem.c:344: TINFO: child 1 continues...
mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
mem.c:344: TINFO: child 2 continues...
mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
mem.c:400: TINFO: child 1 stops.
mem.c:400: TINFO: child 2 stops.
mem.c:400: TINFO: child 0 stops.
Test timeouted, sending SIGKILL!
tst_test.c:1700: TINFO: Killed the leftover descendant processes
tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1708: TBROK: Test killed! (timeout?)

Thanks
Sven

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 10:51   ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Sven Schnelle
@ 2024-08-29 18:46     ` Luis Chamberlain
  2024-08-29 19:55       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Luis Chamberlain @ 2024-08-29 18:46 UTC (permalink / raw)
  To: Sven Schnelle
  Cc: Pankaj Raghav (Samsung), brauner, akpm, chandan.babu,
	linux-fsdevel, djwong, hare, gost.dev, linux-xfs, hch, david,
	Zi Yan, yang, linux-kernel, linux-mm, willy, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

On Thu, Aug 29, 2024 at 12:51:25PM +0200, Sven Schnelle wrote:
> Hi,
> 
> "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:
> 
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > split_folio() and split_folio_to_list() assume order 0, to support
> > minorder for non-anonymous folios, we must expand these to check the
> > folio mapping order and use that.
> >
> > Set new_order to be at least minimum folio order if it is set in
> > split_huge_page_to_list() so that we can maintain minimum folio order
> > requirement in the page cache.
> >
> > Update the debugfs write files used for testing to ensure the order
> > is respected as well. We simply enforce the min order when a file
> > mapping is used.
> >
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Tested-by: David Howells <dhowells@redhat.com>
> 
> This causes the following warning on s390 with linux-next starting from
> next-20240827:
> 
> [  112.690518] BUG: Bad page map in process ksm01  pte:a5801317 pmd:99054000
> [  112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801
> [  112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff)
> [  112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000
> [  112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000
> [  112.690548] page dumped because: bad pte
> [  112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102
> [  112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> [  112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441
> [  112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
> [  112.690568] Call Trace:
> [  112.690571]  [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0
> [  112.690579]  [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0
> [  112.690584]  [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870
> [  112.690598]  [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0
> [  112.690601]  [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0
> [  112.690603]  [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0
> [  112.690605]  [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0
> [  112.690607]  [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0
> [  112.690609]  [<000003ffe0405724>] exit_mmap+0xc4/0x3e0
> [  112.690613]  [<000003ffe0154aa2>] mmput+0x72/0x170
> [  112.690616]  [<000003ffe015e2c6>] exit_mm+0xd6/0x150
> [  112.690618]  [<000003ffe015e52c>] do_exit+0x1ec/0x490
> [  112.690620]  [<000003ffe015e9a4>] do_group_exit+0x44/0xc0
> [  112.690621]  [<000003ffe016f000>] get_signal+0x7f0/0x800
> [  112.690624]  [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320
> [  112.690628]  [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170
> [  112.690632]  [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0
> [  112.690634]  [<000003ffe0ec303c>] system_call+0x74/0x98
> [  112.690638] Disabling lock debugging due to kernel taint
> 
> To reproduce, running the ksm01 testsuite from ltp seems to be
> enough. The splat is always triggered immediately. The output from ksm01
> is:
> 
> tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
> tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f
> tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x
> tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
> mem.c:422: TINFO: wait for all children to stop.
> mem.c:388: TINFO: child 0 stops.
> mem.c:388: TINFO: child 1 stops.
> mem.c:388: TINFO: child 2 stops.
> mem.c:495: TINFO: KSM merging...
> mem.c:434: TINFO: resume all children.
> mem.c:422: TINFO: wait for all children to stop.
> mem.c:344: TINFO: child 0 continues...
> mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
> mem.c:344: TINFO: child 1 continues...
> mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
> mem.c:344: TINFO: child 2 continues...
> mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
> mem.c:400: TINFO: child 1 stops.
> mem.c:400: TINFO: child 2 stops.
> mem.c:400: TINFO: child 0 stops.
> Test timeouted, sending SIGKILL!
> tst_test.c:1700: TINFO: Killed the leftover descendant processes
> tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> tst_test.c:1708: TBROK: Test killed! (timeout?)

Thanks for the report and test reproducer, I was able to reproduce on
x86_64 easily with ltp/testcases/kernel/mem/ksm/ksm01 as well:

ltp/testcases/kernel/mem/ksm/ksm01
tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6fc20
tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #56 SMP
PREEMPT_DYNAMIC Tue Aug 27 08:10:26 UTC 2024 x86_64
tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
mem.c:422: TINFO: wait for all children to stop.
mem.c:388: TINFO: child 0 stops.
mem.c:388: TINFO: child 1 stops.
mem.c:388: TINFO: child 2 stops.
mem.c:495: TINFO: KSM merging...
mem.c:434: TINFO: resume all children.
mem.c:422: TINFO: wait for all children to stop.
mem.c:344: TINFO: child 0 continues...
mem.c:344: TINFO: child 2 continues...
mem.c:344: TINFO: child 1 continues...
mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
mem.c:400: TINFO: child 1 stops.
mem.c:400: TINFO: child 0 stops.
mem.c:400: TINFO: child 2 stops.


Test timeouted, sending SIGKILL!
tst_test.c:1700: TINFO: Killed the leftover descendant processes
tst_test.c:1706: TINFO: If you are running on slow machine, try
exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1708: TBROK: Test killed! (timeout?)

Summary:
passed   0
failed   0
broken   1
skipped  0
warnings 0


With vm debugging however I get more information about the issue:

Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Modules linked in: xfs nvme_fabrics 9p kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd pcspkr 9pnet_virtio joydev virtio_balloon virtio_console evdev button serio_raw drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover nvme psmouse crc32_pclmul crc32c_intel nvme_core virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ---[ end trace 0000000000000000 ]---
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554

Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine
cmp_and_merge_page() on the split case:

                } else if (split) {                                             
                        /*                                                      
                         * We are here if we tried to merge two pages and       
                         * failed because they both belonged to the same        
                         * compound page. We will split the page now, but no    
                         * merging will take place.                             
                         * We do not want to add the cost of a full lock; if    
                         * the page is locked, it is better to skip it and      
                         * perhaps try again later.                             
                         */                                                     
                        if (!trylock_page(page))                                
                                return;                                         
                        split_huge_page(page);                                  
                        unlock_page(page);                                      
                }

The trylock_page() is faulty. I'm digging in further.

  Luis

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 18:46     ` Luis Chamberlain
@ 2024-08-29 19:55       ` Matthew Wilcox
  2024-08-29 22:12         ` Zi Yan
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-08-29 19:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Sven Schnelle, Pankaj Raghav (Samsung), brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, Zi Yan, yang, linux-kernel, linux-mm, john.g.garry,
	cl, p.raghav, ryan.roberts, David Howells, linux-s390

On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote:
> With vm debugging however I get more information about the issue:
> 
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!

This is in folio_unlock().  We're trying to unlock a folio which isn't
locked!

> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
[...]
> Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine
> cmp_and_merge_page() on the split case:
> 
>                 } else if (split) {                                             
>                         /*                                                      
>                          * We are here if we tried to merge two pages and       
>                          * failed because they both belonged to the same        
>                          * compound page. We will split the page now, but no    
>                          * merging will take place.                             
>                          * We do not want to add the cost of a full lock; if    
>                          * the page is locked, it is better to skip it and      
>                          * perhaps try again later.                             
>                          */                                                     
>                         if (!trylock_page(page))                                
>                                 return;                                         
>                         split_huge_page(page);                                  
>                         unlock_page(page);                                      

Obviously the page is locked when we call split_huge_page().  There's
an assert inside it.  And the lock bit is _supposed_ to be transferred
to the head page of the page which is being split.  My guess is that
this is messed up somehow; we're perhaps transferring the lock bit to
the wrong page?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 19:55       ` Matthew Wilcox
@ 2024-08-29 22:12         ` Zi Yan
  2024-08-29 23:41           ` Luis Chamberlain
  0 siblings, 1 reply; 12+ messages in thread
From: Zi Yan @ 2024-08-29 22:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Sven Schnelle, Pankaj Raghav (Samsung), brauner,
	akpm, chandan.babu, linux-fsdevel, djwong, hare, gost.dev,
	linux-xfs, hch, david, yang, linux-kernel, linux-mm, john.g.garry,
	cl, p.raghav, ryan.roberts, David Howells, linux-s390

[-- Attachment #1: Type: text/plain, Size: 7906 bytes --]

On 29 Aug 2024, at 15:55, Matthew Wilcox wrote:

> On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote:
>> With vm debugging however I get more information about the issue:
>>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!
>
> This is in folio_unlock().  We're trying to unlock a folio which isn't
> locked!
>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
> [...]
>> Looking at the KSM code in context ksm_scan_thread+0x175 is mm/ksm.c routine
>> cmp_and_merge_page() on the split case:
>>
>>                 } else if (split) {
>>                         /*
>>                          * We are here if we tried to merge two pages and
>>                          * failed because they both belonged to the same
>>                          * compound page. We will split the page now, but no
>>                          * merging will take place.
>>                          * We do not want to add the cost of a full lock; if
>>                          * the page is locked, it is better to skip it and
>>                          * perhaps try again later.
>>                          */
>>                         if (!trylock_page(page))
>>                                 return;
>>                         split_huge_page(page);
>>                         unlock_page(page);
>
> Obviously the page is locked when we call split_huge_page().  There's
> an assert inside it.  And the lock bit is _supposed_ to be transferred
> to the head page of the page which is being split.  My guess is that
> this is messed up somehow; we're perhaps transferring the lock bit to
> the wrong page?

The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
But this patch changes the “page” passed into split_huge_page_to_list_to_order()
always to the head page.

This fixes the crash on my x86 VM, but it can be improved:

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7c50aeed0522..eff5d2fb5d4e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                unsigned int new_order);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
-static inline int split_huge_page(struct page *page)
-{
-       return split_folio(page_folio(page));
-}
+int split_huge_page(struct page *page);
 void deferred_split_folio(struct folio *folio);

 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c29af9451d92..4d723dab4336 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
        return ret;
 }

+int split_huge_page(struct page *page)
+{
+       unsigned int min_order = 0;
+       struct folio *folio = page_folio(page);
+
+       if (folio_test_anon(folio))
+               goto out;
+
+       if (!folio->mapping) {
+               if (folio_test_pmd_mappable(folio))
+                       count_vm_event(THP_SPLIT_PAGE_FAILED);
+               return -EBUSY;
+       }
+
+       min_order = mapping_min_folio_order(folio->mapping);
+out:
+       return split_huge_page_to_list_to_order(page, NULL, min_order);
+}
+
 int split_folio_to_list(struct folio *folio, struct list_head *list)
 {
        unsigned int min_order = 0;

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 22:12         ` Zi Yan
@ 2024-08-29 23:41           ` Luis Chamberlain
  2024-08-30  5:57             ` Sven Schnelle
                               ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Luis Chamberlain @ 2024-08-29 23:41 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, Sven Schnelle, Pankaj Raghav (Samsung), brauner,
	akpm, chandan.babu, linux-fsdevel, djwong, hare, gost.dev,
	linux-xfs, hch, david, yang, linux-kernel, linux-mm, john.g.garry,
	cl, p.raghav, ryan.roberts, David Howells, linux-s390

On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
> always to the head page.
> 
> This fixes the crash on my x86 VM, but it can be improved:
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7c50aeed0522..eff5d2fb5d4e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                 unsigned int new_order);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
> -static inline int split_huge_page(struct page *page)
> -{
> -       return split_folio(page_folio(page));
> -}
> +int split_huge_page(struct page *page);
>  void deferred_split_folio(struct folio *folio);
> 
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c29af9451d92..4d723dab4336 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>         return ret;
>  }
> 
> +int split_huge_page(struct page *page)
> +{
> +       unsigned int min_order = 0;
> +       struct folio *folio = page_folio(page);
> +
> +       if (folio_test_anon(folio))
> +               goto out;
> +
> +       if (!folio->mapping) {
> +               if (folio_test_pmd_mappable(folio))
> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
> +               return -EBUSY;
> +       }
> +
> +       min_order = mapping_min_folio_order(folio->mapping);
> +out:
> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
> +}
> +
>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>  {
>         unsigned int min_order = 0;


Confirmed, and also although you suggest it can be improved, I thought
that we could do that by sharing more code and putting things in the
headers, the below also fixes this but tries to share more code, but
I think it is perhaps less easier to understand than your patch.

So I think your patch is cleaner and easier as a fix.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..99cd9c7bf55b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
+#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
@@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
-int split_folio_to_list(struct folio *folio, struct list_head *list);
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_folio(page_folio(page));
+	return split_page_folio_to_list(page, page_folio(page), NULL);
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
 	return 0;
 }
 
-static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+static inline int split_page_folio_to_list(struct page *page,
+					   struct folio *folio,
+					   struct list_head *list)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b115bfe63b52 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list)
 {
 	unsigned int min_order = 0;
 
@@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
 
 	min_order = mapping_min_folio_order(folio->mapping);
 out:
-	return split_huge_page_to_list_to_order(&folio->page, list,
-							min_order);
+	return split_huge_page_to_list_to_order(page, list, min_order);
 }
 
 void __folio_undo_large_rmappable(struct folio *folio)

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 23:41           ` Luis Chamberlain
@ 2024-08-30  5:57             ` Sven Schnelle
  2024-08-30 11:58             ` Daniel Gomez
  2024-08-30 14:59             ` Pankaj Raghav
  2 siblings, 0 replies; 12+ messages in thread
From: Sven Schnelle @ 2024-08-30  5:57 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Zi Yan, Matthew Wilcox, Pankaj Raghav (Samsung), brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>> 
>> This fixes the crash on my x86 VM, but it can be improved:

It also fixes the issue on s390x. Thanks!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 23:41           ` Luis Chamberlain
  2024-08-30  5:57             ` Sven Schnelle
@ 2024-08-30 11:58             ` Daniel Gomez
  2024-08-30 14:59             ` Pankaj Raghav
  2 siblings, 0 replies; 12+ messages in thread
From: Daniel Gomez @ 2024-08-30 11:58 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Zi Yan, Matthew Wilcox, Sven Schnelle, Pankaj Raghav (Samsung),
	brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, yang, linux-kernel, linux-mm,
	john.g.garry, cl, p.raghav, ryan.roberts, David Howells,
	linux-s390

On Thu, Aug 29, 2024 at 04:41:48PM -0700, Luis Chamberlain wrote:
> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
> > The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
> > unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
> > to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
> > But this patch changes the “page” passed into split_huge_page_to_list_to_order()
> > always to the head page.
> > 
> > This fixes the crash on my x86 VM, but it can be improved:
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 7c50aeed0522..eff5d2fb5d4e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
> >  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >                 unsigned int new_order);
> >  int split_folio_to_list(struct folio *folio, struct list_head *list);
> > -static inline int split_huge_page(struct page *page)
> > -{
> > -       return split_folio(page_folio(page));
> > -}
> > +int split_huge_page(struct page *page);
> >  void deferred_split_folio(struct folio *folio);
> > 
> >  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c29af9451d92..4d723dab4336 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >         return ret;
> >  }
> > 
> > +int split_huge_page(struct page *page)
> > +{
> > +       unsigned int min_order = 0;
> > +       struct folio *folio = page_folio(page);
> > +
> > +       if (folio_test_anon(folio))
> > +               goto out;
> > +
> > +       if (!folio->mapping) {
> > +               if (folio_test_pmd_mappable(folio))
> > +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
> > +               return -EBUSY;
> > +       }
> > +
> > +       min_order = mapping_min_folio_order(folio->mapping);
> > +out:
> > +       return split_huge_page_to_list_to_order(page, NULL, min_order);
> > +}
> > +
> >  int split_folio_to_list(struct folio *folio, struct list_head *list)
> >  {
> >         unsigned int min_order = 0;
> 
> 
> Confirmed, and also although you suggest it can be improved, I thought
> that we could do that by sharing more code and putting things in the
> headers, the below also fixes this but tries to share more code, but
> I think it is perhaps less easier to understand than your patch.
> 
> So I think your patch is cleaner and easier as a fix.

I reproduced it in arm64 as well. And both fixes provided solved the issue.

> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..99cd9c7bf55b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
>  #define split_folio(f) split_folio_to_list(f, NULL)
> +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
>  
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
> @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> -int split_folio_to_list(struct folio *folio, struct list_head *list);
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_folio(page_folio(page));
> +	return split_page_folio_to_list(page, page_folio(page), NULL);
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
>  	return 0;
>  }
>  
> -static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +static inline int split_page_folio_to_list(struct page *page,
> +					   struct folio *folio,
> +					   struct list_head *list)
>  {
>  	return 0;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b115bfe63b52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list)
>  {
>  	unsigned int min_order = 0;
>  
> @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>  
>  	min_order = mapping_min_folio_order(folio->mapping);
>  out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -							min_order);
> +	return split_huge_page_to_list_to_order(page, list, min_order);
>  }
>  
>  void __folio_undo_large_rmappable(struct folio *folio)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 23:41           ` Luis Chamberlain
  2024-08-30  5:57             ` Sven Schnelle
  2024-08-30 11:58             ` Daniel Gomez
@ 2024-08-30 14:59             ` Pankaj Raghav
  2024-08-30 17:12               ` Luis Chamberlain
                                 ` (2 more replies)
  2 siblings, 3 replies; 12+ messages in thread
From: Pankaj Raghav @ 2024-08-30 14:59 UTC (permalink / raw)
  To: Luis Chamberlain, Zi Yan
  Cc: Matthew Wilcox, Sven Schnelle, brauner, akpm, chandan.babu,
	linux-fsdevel, djwong, hare, gost.dev, linux-xfs, hch, david,
	yang, linux-kernel, linux-mm, john.g.garry, cl, p.raghav,
	ryan.roberts, David Howells, linux-s390

On 30/08/2024 01:41, Luis Chamberlain wrote:
> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>>
>> This fixes the crash on my x86 VM, but it can be improved:
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 7c50aeed0522..eff5d2fb5d4e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>                 unsigned int new_order);
>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>> -static inline int split_huge_page(struct page *page)
>> -{
>> -       return split_folio(page_folio(page));
>> -}
>> +int split_huge_page(struct page *page);
>>  void deferred_split_folio(struct folio *folio);
>>
>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c29af9451d92..4d723dab4336 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>         return ret;
>>  }
>>
>> +int split_huge_page(struct page *page)
>> +{
>> +       unsigned int min_order = 0;
>> +       struct folio *folio = page_folio(page);
>> +
>> +       if (folio_test_anon(folio))
>> +               goto out;
>> +
>> +       if (!folio->mapping) {
>> +               if (folio_test_pmd_mappable(folio))
>> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
>> +               return -EBUSY;
>> +       }
>> +
>> +       min_order = mapping_min_folio_order(folio->mapping);
>> +out:
>> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
>> +}
>> +
>>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>>  {
>>         unsigned int min_order = 0;
> 
> 
> Confirmed, and also although you suggest it can be improved, I thought
> that we could do that by sharing more code and putting things in the
> headers, the below also fixes this but tries to share more code, but
> I think it is perhaps less easier to understand than your patch.
> 
It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

How about we extract the code that returns the min order so that we don't repeat.

Something like this:


diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..d27febd5c639 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                unsigned int new_order);
+int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-       return split_folio(page_folio(page));
+       struct folio *folio = page_folio(page);
+       int ret = min_order_for_split(folio);
+
+       if (ret)
+               return ret;
+
+       /*
+        * split_huge_page() locks the page before splitting and
+        * expects the same page that has been split to be locked when
+        * returned. split_folio_to_list() cannot be used here because
+        * it converts the page to folio and passes the head page to be
+        * split.
+        */
+       return split_huge_page_to_list_to_order(page, NULL, ret);
 }
 void deferred_split_folio(struct folio *folio);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b167e036d01b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
        return ret;
 }

-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int min_order_for_split(struct folio *folio)
 {
-       unsigned int min_order = 0;
-
        if (folio_test_anon(folio))
-               goto out;
+               return 0;

        if (!folio->mapping) {
                if (folio_test_pmd_mappable(folio))
@@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
                return -EBUSY;
        }

-       min_order = mapping_min_folio_order(folio->mapping);
-out:
-       return split_huge_page_to_list_to_order(&folio->page, list,
-                                                       min_order);
+       return mapping_min_folio_order(folio->mapping);
+}
+
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+       int ret = min_order_for_split(folio);
+
+       if (ret)
+               return ret;
+
+       return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }

 void __folio_undo_large_rmappable(struct folio *folio)

> So I think your patch is cleaner and easier as a fix.
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..99cd9c7bf55b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
>  #define split_folio(f) split_folio_to_list(f, NULL)
> +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
>  
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
> @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> -int split_folio_to_list(struct folio *folio, struct list_head *list);
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_folio(page_folio(page));
> +	return split_page_folio_to_list(page, page_folio(page), NULL);
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
>  	return 0;
>  }
>  
> -static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +static inline int split_page_folio_to_list(struct page *page,
> +					   struct folio *folio,
> +					   struct list_head *list)
>  {
>  	return 0;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b115bfe63b52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list)
>  {
>  	unsigned int min_order = 0;
>  
> @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>  
>  	min_order = mapping_min_folio_order(folio->mapping);
>  out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -							min_order);
> +	return split_huge_page_to_list_to_order(page, list, min_order);
>  }
>  
>  void __folio_undo_large_rmappable(struct folio *folio)

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 14:59             ` Pankaj Raghav
@ 2024-08-30 17:12               ` Luis Chamberlain
  2024-08-31 22:38                 ` Zi Yan
  2024-08-30 22:42               ` Matthew Wilcox
  2024-08-31 22:35               ` Zi Yan
  2 siblings, 1 reply; 12+ messages in thread
From: Luis Chamberlain @ 2024-08-30 17:12 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Zi Yan, Matthew Wilcox, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

Agreed..

> How about we extract the code that returns the min order so that we don't repeat.
> Something like this:

Of all solutions I like this the most, Zi, do you have any preference?

  Luis

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 14:59             ` Pankaj Raghav
  2024-08-30 17:12               ` Luis Chamberlain
@ 2024-08-30 22:42               ` Matthew Wilcox
  2024-08-31 22:35               ` Zi Yan
  2 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2024-08-30 22:42 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Zi Yan, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

We do that in the rmap code.

But this is not a performance path.  We should keep this as simple as
possible.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 14:59             ` Pankaj Raghav
  2024-08-30 17:12               ` Luis Chamberlain
  2024-08-30 22:42               ` Matthew Wilcox
@ 2024-08-31 22:35               ` Zi Yan
  2 siblings, 0 replies; 12+ messages in thread
From: Zi Yan @ 2024-08-31 22:35 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Matthew Wilcox, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

[-- Attachment #1: Type: text/plain, Size: 5966 bytes --]

On 30 Aug 2024, at 10:59, Pankaj Raghav wrote:

> On 30/08/2024 01:41, Luis Chamberlain wrote:
>> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>>> always to the head page.
>>>
>>> This fixes the crash on my x86 VM, but it can be improved:
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 7c50aeed0522..eff5d2fb5d4e 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>>>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>                 unsigned int new_order);
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>> -static inline int split_huge_page(struct page *page)
>>> -{
>>> -       return split_folio(page_folio(page));
>>> -}
>>> +int split_huge_page(struct page *page);
>>>  void deferred_split_folio(struct folio *folio);
>>>
>>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c29af9451d92..4d723dab4336 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>         return ret;
>>>  }
>>>
>>> +int split_huge_page(struct page *page)
>>> +{
>>> +       unsigned int min_order = 0;
>>> +       struct folio *folio = page_folio(page);
>>> +
>>> +       if (folio_test_anon(folio))
>>> +               goto out;
>>> +
>>> +       if (!folio->mapping) {
>>> +               if (folio_test_pmd_mappable(folio))
>>> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
>>> +               return -EBUSY;
>>> +       }
>>> +
>>> +       min_order = mapping_min_folio_order(folio->mapping);
>>> +out:
>>> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
>>> +}
>>> +
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>>>  {
>>>         unsigned int min_order = 0;
>>
>>
>> Confirmed, and also although you suggest it can be improved, I thought
>> that we could do that by sharing more code and putting things in the
>> headers, the below also fixes this but tries to share more code, but
>> I think it is perhaps less easier to understand than your patch.
>>
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
>
> How about we extract the code that returns the min order so that we don't repeat.
>
> Something like this:
>
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..d27febd5c639 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                 unsigned int new_order);
> +int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);

Since split_folio() is no longer used below, this line can be removed.

>  static inline int split_huge_page(struct page *page)
>  {
> -       return split_folio(page_folio(page));
> +       struct folio *folio = page_folio(page);
> +       int ret = min_order_for_split(folio);
> +
> +       if (ret)
> +               return ret;

min_order_for_split() returns -EBUSY, 0, and a positive min_order. This if
statement should be "if (ret < 0)"?

> +
> +       /*
> +        * split_huge_page() locks the page before splitting and
> +        * expects the same page that has been split to be locked when
> +        * returned. split_folio_to_list() cannot be used here because
> +        * it converts the page to folio and passes the head page to be
> +        * split.
> +        */
> +       return split_huge_page_to_list_to_order(page, NULL, ret);
>  }
>  void deferred_split_folio(struct folio *folio);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b167e036d01b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>         return ret;
>  }
>
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int min_order_for_split(struct folio *folio)
>  {
> -       unsigned int min_order = 0;
> -
>         if (folio_test_anon(folio))
> -               goto out;
> +               return 0;
>
>         if (!folio->mapping) {
>                 if (folio_test_pmd_mappable(folio))
> @@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>                 return -EBUSY;
>         }
>
> -       min_order = mapping_min_folio_order(folio->mapping);
> -out:
> -       return split_huge_page_to_list_to_order(&folio->page, list,
> -                                                       min_order);
> +       return mapping_min_folio_order(folio->mapping);
> +}
> +
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +       int ret = min_order_for_split(folio);
> +
> +       if (ret)
> +               return ret;

Ditto.

> +
> +       return split_huge_page_to_list_to_order(&folio->page, list, ret);
>  }
>
>  void __folio_undo_large_rmappable(struct folio *folio)
>


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 17:12               ` Luis Chamberlain
@ 2024-08-31 22:38                 ` Zi Yan
  0 siblings, 0 replies; 12+ messages in thread
From: Zi Yan @ 2024-08-31 22:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Pankaj Raghav, Matthew Wilcox, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

[-- Attachment #1: Type: text/plain, Size: 523 bytes --]

On 30 Aug 2024, at 13:12, Luis Chamberlain wrote:

> On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
>> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
>
> Agreed..
>
>> How about we extract the code that returns the min order so that we don't repeat.
>> Something like this:
>
> Of all solutions I like this the most, Zi, do you have any preference?

No preference, as long as Pankaj fixed the "if (ret)" issue I mentioned in the
other email.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-08-31 22:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240822135018.1931258-1-kernel@pankajraghav.com>
     [not found] ` <20240822135018.1931258-5-kernel@pankajraghav.com>
2024-08-29 10:51   ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Sven Schnelle
2024-08-29 18:46     ` Luis Chamberlain
2024-08-29 19:55       ` Matthew Wilcox
2024-08-29 22:12         ` Zi Yan
2024-08-29 23:41           ` Luis Chamberlain
2024-08-30  5:57             ` Sven Schnelle
2024-08-30 11:58             ` Daniel Gomez
2024-08-30 14:59             ` Pankaj Raghav
2024-08-30 17:12               ` Luis Chamberlain
2024-08-31 22:38                 ` Zi Yan
2024-08-30 22:42               ` Matthew Wilcox
2024-08-31 22:35               ` Zi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox