* [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
@ 2025-02-10 19:37 David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 01/17] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
                   ` (18 more replies)
  0 siblings, 19 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Against mm-hotfixes-stable for now.

Discussing the PageTail() call in make_device_exclusive_range() with
Willy, I recently discovered [1] that device-exclusive handling does
not properly work with THP, making the hmm-tests selftests fail if THPs
are enabled on the system.

Looking into more detail, I found that hugetlb is not properly fenced,
and I realized that something that had been bugging me for a while -- how
device-exclusive entries interact with mapcounts -- completely breaks
migration/swapout/split/hwpoison handling of these folios while they have
device-exclusive PTEs.

The program below can be used to allocate 1 GiB worth of pages and
make them device-exclusive on a kernel with CONFIG_TEST_HMM.

Once they are device-exclusive, these folios cannot get swapped out
(/proc/$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
hard one forces memory reclaim), and with a memory block onlined
to ZONE_MOVABLE, trying to offline it will loop forever and complain about
failed migration of a page that should be movable.
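
For the swapout part, one way to force reclaim while the reproducer below
is running is the cgroup v2 memory.reclaim interface (the cgroup name and
$pid are just placeholders):

# mkdir /sys/fs/cgroup/hmm-test
# echo $pid > /sys/fs/cgroup/hmm-test/cgroup.procs
# echo 1G > /sys/fs/cgroup/hmm-test/memory.reclaim
# grep Rss: /proc/$pid/smaps_rollup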

# echo offline > /sys/devices/system/memory/memory136/state
# echo online_movable > /sys/devices/system/memory/memory136/state
# ./hmm-swap &
... wait until everything is device-exclusive
# echo offline > /sys/devices/system/memory/memory136/state
[  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
  index:0x7f20671f7 pfn:0x442b6a
[  285.196618][T14882] memcg:ffff888179298000
[  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
  dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
[  285.201734][T14882] raw: ...
[  285.204464][T14882] raw: ...
[  285.207196][T14882] page dumped because: migration failure
[  285.209072][T14882] page_owner tracks the page as allocated
[  285.210915][T14882] page last allocated via order 0, migratetype
  Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
  id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
[  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
[  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
[  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
[  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
[  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
[  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
[  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
[  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
...

This series fixes all issues I found so far. There is no easy way to fix
them without a bigger rework/cleanup. I have a bunch of cleanups on top
(some previously sent, some the result of the discussion on v1) that I
will send out separately once this has landed and I get to it.

I wish we could just use some special present PROT_NONE PTEs instead of
these (non-present, non-none) fake-swap entries; but that just results in
the same problem we keep having (lack of spare PTE bits), and staring at
other similar fake-swap entries, that ship has sailed.

With this series, make_device_exclusive() doesn't actually belong in
mm/rmap.c anymore, but I'll leave moving it for another day.

I only tested this series with the hmm-tests selftests due to lack of HW,
so I'd appreciate some testing, especially to verify that the interaction
between two GPUs wanting a device-exclusive entry works as expected.

<program>
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/ioctl.h>

#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)

struct hmm_dmirror_cmd {
	__u64 addr;
	__u64 ptr;
	__u64 npages;
	__u64 cpages;
	__u64 faults;
};

const size_t size = 1 * 1024 * 1024 * 1024ul;
const size_t chunk_size = 2 * 1024 * 1024ul;

int main(void)
{
	struct hmm_dmirror_cmd cmd;
	size_t cur_size;
	int fd, ret;
	char *addr, *mirror;

	fd = open("/dev/hmm_dmirror1", O_RDWR, 0);
	if (fd < 0) {
		perror("open failed\n");
		exit(1);
	}

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap failed\n");
		exit(1);
	}
	/* Avoid THPs: work on individual order-0 pages / PTEs. */
	madvise(addr, size, MADV_NOHUGEPAGE);
	/* Populate all anonymous pages. */
	memset(addr, 1, size);

	mirror = malloc(chunk_size);

	/* Mark the whole range device-exclusive in 2 MiB chunks. */
	for (cur_size = 0; cur_size < size; cur_size += chunk_size) {
		cmd.addr = (uintptr_t)addr + cur_size;
		cmd.ptr = (uintptr_t)mirror;
		cmd.npages = chunk_size / getpagesize();
		ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd);
		if (ret) {
			perror("ioctl failed\n");
			exit(1);
		}
	}
	pause();
	return 0;
}
</program>
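
To build and run the reproducer, something like the following should work
(assuming CONFIG_TEST_HMM=m so that the test_hmm module provides
/dev/hmm_dmirror1; the source file name is just an example):

# modprobe test_hmm
# gcc -O2 -o hmm-swap hmm-swap.c
# ./hmm-swap &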

[1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alex Shi <alexs@kernel.org>
Cc: Yanteng Si <si.yanteng@linux.dev>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>

v1 -> v2:
 * "mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()"
  -> Fix and simplify return value handling when calling dmirror_atomic_map()
  -> Fix parameter order when calling make_device_exclusive()
  [both issues were already addressed by the separate cleanups I sent
   previously; I only realized that when re-testing the fixes here]
  -> Heavily extend documentation of make_device_exclusive()
 * "mm/rmap: implement make_device_exclusive() using folio_walk instead of
    rmap walk"
  -> Keep MMU_NOTIFY_EXCLUSIVE, and update comments/description
 * "mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()"
  -> Handle PageHWPoison with device-private pages differently
 * Added a bunch of "handle device-exclusive entries correctly" fixes,
   now handling all page_vma_mapped_walk() callers correctly
 * Added "mm/rmap: avoid -EBUSY from make_device_exclusive()" to fix some
   hmm selftest failures I saw while testing under memory pressure
 * Plenty of comment/description updates and improvements

David Hildenbrand (17):
  mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
  mm/rmap: convert make_device_exclusive_range() to
    make_device_exclusive()
  mm/rmap: implement make_device_exclusive() using folio_walk instead of
    rmap walk
  mm/memory: detect writability in restore_exclusive_pte() through
    can_change_pte_writable()
  mm: use single SWP_DEVICE_EXCLUSIVE entry type
  mm/page_vma_mapped: device-exclusive entries are not migration entries
  kernel/events/uprobes: handle device-exclusive entries correctly in
    __replace_page()
  mm/ksm: handle device-exclusive entries correctly in
    write_protect_page()
  mm/rmap: handle device-exclusive entries correctly in
    try_to_unmap_one()
  mm/rmap: handle device-exclusive entries correctly in
    try_to_migrate_one()
  mm/rmap: handle device-exclusive entries correctly in
    page_vma_mkclean_one()
  mm/page_idle: handle device-exclusive entries correctly in
    page_idle_clear_pte_refs_one()
  mm/damon: handle device-exclusive entries correctly in
    damon_folio_young_one()
  mm/damon: handle device-exclusive entries correctly in
    damon_folio_mkold_one()
  mm/rmap: keep mapcount untouched for device-exclusive entries
  mm/rmap: avoid -EBUSY from make_device_exclusive()

 Documentation/mm/hmm.rst                    |   2 +-
 Documentation/translations/zh_CN/mm/hmm.rst |   2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c       |   5 +-
 include/linux/mmu_notifier.h                |   2 +-
 include/linux/rmap.h                        |   5 +-
 include/linux/swap.h                        |   7 +-
 include/linux/swapops.h                     |  27 +-
 kernel/events/uprobes.c                     |  13 +-
 lib/test_hmm.c                              |  41 +-
 mm/damon/ops-common.c                       |  23 +-
 mm/damon/paddr.c                            |  10 +-
 mm/gup.c                                    |   3 +
 mm/ksm.c                                    |   9 +-
 mm/memory.c                                 |  28 +-
 mm/mprotect.c                               |   8 -
 mm/page_idle.c                              |   9 +-
 mm/page_table_check.c                       |   5 +-
 mm/page_vma_mapped.c                        |   3 +-
 mm/rmap.c                                   | 469 +++++++++-----------
 19 files changed, 315 insertions(+), 356 deletions(-)


base-commit: e5b2a356dc8a88708d97bd47cca3b8f7ed7af6cb
-- 
2.48.1



* [PATCH v2 01/17] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 02/17] mm/rmap: reject hugetlb folios in folio_make_device_exclusive() David Hildenbrand
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, John Hubbard, stable

We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb
early, make_device_exclusive_range() can end up getting called on
hugetlb VMAs.

Right now, this means that with a PMD-sized hugetlb page, we can end
up calling split_huge_pmd(), because pmd_trans_huge() also succeeds
with hugetlb PMDs.

For example, using a modified hmm-test selftest one can trigger:

[  207.017134][T14945] ------------[ cut here ]------------
[  207.018614][T14945] kernel BUG at mm/page_table_check.c:87!
[  207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[  207.021072][T14945] CPU: 3 UID: 0 PID: ...
[  207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[  207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510
[  207.026128][T14945] Code: ...
[  207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293
[  207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd
[  207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80
[  207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000
[  207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001
[  207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000
[  207.038711][T14945] FS:  00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000
[  207.040407][T14945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0
[  207.043196][T14945] PKRU: 55555554
[  207.043880][T14945] Call Trace:
[  207.044506][T14945]  <TASK>
[  207.045086][T14945]  ? __die+0x51/0x92
[  207.045864][T14945]  ? die+0x29/0x50
[  207.046596][T14945]  ? do_trap+0x250/0x320
[  207.047430][T14945]  ? do_error_trap+0xe7/0x220
[  207.048346][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.049535][T14945]  ? handle_invalid_op+0x34/0x40
[  207.050494][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.051681][T14945]  ? exc_invalid_op+0x2e/0x50
[  207.052589][T14945]  ? asm_exc_invalid_op+0x1a/0x20
[  207.053596][T14945]  ? page_table_check_clear.part.0+0x1fd/0x510
[  207.054790][T14945]  ? page_table_check_clear.part.0+0x487/0x510
[  207.055993][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.057195][T14945]  ? page_table_check_clear.part.0+0x487/0x510
[  207.058384][T14945]  __page_table_check_pmd_clear+0x34b/0x5a0
[  207.059524][T14945]  ? __pfx___page_table_check_pmd_clear+0x10/0x10
[  207.060775][T14945]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
[  207.061940][T14945]  ? __pfx___lock_acquire+0x10/0x10
[  207.062967][T14945]  pmdp_huge_clear_flush+0x279/0x360
[  207.064024][T14945]  split_huge_pmd_locked+0x82b/0x3750
...

Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic
follow_page_mask code"), we would have ignored the flag; instead, let's
simply refuse the combination completely in check_vma_flags(): the
caller is likely not prepared to handle any hugetlb folios.

We'll teach make_device_exclusive_range() separately to ignore any hugetlb
folios as a future-proof safety net.

Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 3883b307780ea..61e751baf862c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1283,6 +1283,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
+		return -EOPNOTSUPP;
+
 	if (vma_is_secretmem(vma))
 		return -EFAULT;
 
-- 
2.48.1



* [PATCH v2 02/17] mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 01/17] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, stable

Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP,
let's add a safety net in case FOLL_SPLIT_PMD usage is ever reworked.

In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the
generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have
returned a page. Also note that hugetlb folios that are not PMD-sized
would never have been prone to FOLL_SPLIT_PMD.

hugetlb folios can be anonymous, and page_make_device_exclusive_one() is
not really prepared to handle them at all. So let's spell that out.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7e..17fbfa61f7efb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2499,7 +2499,7 @@ static bool folio_make_device_exclusive(struct folio *folio,
 	 * Restrict to anonymous folios for now to avoid potential writeback
 	 * issues.
 	 */
-	if (!folio_test_anon(folio))
+	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
 		return false;
 
 	rmap_walk(folio, &rwc);
-- 
2.48.1



* [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 01/17] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 02/17] mm/rmap: reject hugetlb folios in folio_make_device_exclusive() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-11  5:00   ` Andrew Morton
  2025-02-10 19:37 ` [PATCH v2 04/17] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk David Hildenbrand
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, Simona Vetter

The single "real" user in the tree of make_device_exclusive_range() always
requests making only a single address exclusive. The current implementation
is hard to fix for properly supporting anonymous THP / large folios and
for avoiding messing with rmap walks in weird ways.

So let's always process a single address/page and return folio + page to
minimize page -> folio lookups. This is a preparation for further
changes.

Reject any non-anonymous or hugetlb folios early, directly after GUP.

While at it, extend the documentation of make_device_exclusive() to
clarify some things.
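
For reference, with this change a driver caller ends up with roughly the
following pattern (a sketch modeled on the nouveau/test_hmm conversions
below, not an exact excerpt; error handling is abbreviated):

	struct folio *folio;
	struct page *page;

	mmap_read_lock(mm);
	page = make_device_exclusive(mm, addr, owner, &folio);
	mmap_read_unlock(mm);
	if (IS_ERR(page))
		return PTR_ERR(page);

	/* ... program the device mapping for addr ... */

	/* Dropping the folio lock + reference re-allows CPU access on fault. */
	folio_unlock(folio);
	folio_put(folio);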

Acked-by: Simona Vetter <simona.vetter@ffwll.ch>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/mm/hmm.rst                    |   2 +-
 Documentation/translations/zh_CN/mm/hmm.rst |   2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c       |   5 +-
 include/linux/mmu_notifier.h                |   2 +-
 include/linux/rmap.h                        |   5 +-
 lib/test_hmm.c                              |  41 +++-----
 mm/rmap.c                                   | 103 ++++++++++++--------
 7 files changed, 83 insertions(+), 77 deletions(-)

diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index f6d53c37a2ca8..7d61b7a8b65b7 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -400,7 +400,7 @@ Exclusive access memory
 Some devices have features such as atomic PTE bits that can be used to implement
 atomic access to system memory. To support atomic operations to a shared virtual
 memory page such a device needs access to that page which is exclusive of any
-userspace access from the CPU. The ``make_device_exclusive_range()`` function
+userspace access from the CPU. The ``make_device_exclusive()`` function
 can be used to make a memory range inaccessible from userspace.
 
 This replaces all mappings for pages in the given range with special swap
diff --git a/Documentation/translations/zh_CN/mm/hmm.rst b/Documentation/translations/zh_CN/mm/hmm.rst
index 0669f947d0bc9..22c210f4e94f3 100644
--- a/Documentation/translations/zh_CN/mm/hmm.rst
+++ b/Documentation/translations/zh_CN/mm/hmm.rst
@@ -326,7 +326,7 @@ devm_memunmap_pages() 和 devm_release_mem_region() 当资源可以绑定到 ``s
 
 一些设备具有诸如原子PTE位的功能,可以用来实现对系统内存的原子访问。为了支持对一
 个共享的虚拟内存页的原子操作,这样的设备需要对该页的访问是排他的,而不是来自CPU
-的任何用户空间访问。  ``make_device_exclusive_range()`` 函数可以用来使一
+的任何用户空间访问。  ``make_device_exclusive()`` 函数可以用来使一
 个内存范围不能从用户空间访问。
 
 这将用特殊的交换条目替换给定范围内的所有页的映射。任何试图访问交换条目的行为都会
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index b4da82ddbb6b2..39e3740980bb7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -609,10 +609,9 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
 
 		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
 		mmap_read_lock(mm);
-		ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
-					    &page, drm->dev);
+		page = make_device_exclusive(mm, start, drm->dev, &folio);
 		mmap_read_unlock(mm);
-		if (ret <= 0 || !page) {
+		if (IS_ERR(page)) {
 			ret = -EINVAL;
 			goto out;
 		}
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index e2dd57ca368b0..d4e7146618262 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -46,7 +46,7 @@ struct mmu_interval_notifier;
  * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
  * longer have exclusive access to the page. When sent during creation of an
  * exclusive range the owner will be initialised to the value provided by the
- * caller of make_device_exclusive_range(), otherwise the owner will be NULL.
+ * caller of make_device_exclusive(), otherwise the owner will be NULL.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 683a04088f3f2..86425d42c1a90 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -663,9 +663,8 @@ int folio_referenced(struct folio *, int is_locked,
 void try_to_migrate(struct folio *folio, enum ttu_flags flags);
 void try_to_unmap(struct folio *, enum ttu_flags flags);
 
-int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, struct page **pages,
-				void *arg);
+struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
+		void *owner, struct folio **foliop);
 
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 056f2e411d7b4..e4afca8d18802 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -780,10 +780,8 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 	unsigned long start, end, addr;
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
-	struct page *pages[64];
 	struct dmirror_bounce bounce;
-	unsigned long next;
-	int ret;
+	int ret = 0;
 
 	start = cmd->addr;
 	end = start + size;
@@ -795,36 +793,27 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 		return -EINVAL;
 
 	mmap_read_lock(mm);
-	for (addr = start; addr < end; addr = next) {
-		unsigned long mapped = 0;
-		int i;
-
-		next = min(end, addr + (ARRAY_SIZE(pages) << PAGE_SHIFT));
+	for (addr = start; !ret && addr < end; addr += PAGE_SIZE) {
+		struct folio *folio;
+		struct page *page;
 
-		ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
-		/*
-		 * Do dmirror_atomic_map() iff all pages are marked for
-		 * exclusive access to avoid accessing uninitialized
-		 * fields of pages.
-		 */
-		if (ret == (next - addr) >> PAGE_SHIFT)
-			mapped = dmirror_atomic_map(addr, next, pages, dmirror);
-		for (i = 0; i < ret; i++) {
-			if (pages[i]) {
-				unlock_page(pages[i]);
-				put_page(pages[i]);
-			}
+		page = make_device_exclusive(mm, addr, NULL, &folio);
+		if (IS_ERR(page)) {
+			ret = PTR_ERR(page);
+			break;
 		}
 
-		if (addr + (mapped << PAGE_SHIFT) < next) {
-			mmap_read_unlock(mm);
-			mmput(mm);
-			return -EBUSY;
-		}
+		ret = dmirror_atomic_map(addr, addr + PAGE_SIZE, &page, dmirror);
+		ret = ret == 1 ? 0 : -EBUSY;
+		folio_unlock(folio);
+		folio_put(folio);
 	}
 	mmap_read_unlock(mm);
 	mmput(mm);
 
+	if (ret)
+		return ret;
+
 	/* Return the migrated data for verification. */
 	ret = dmirror_bounce_init(&bounce, start, size);
 	if (ret)
diff --git a/mm/rmap.c b/mm/rmap.c
index 17fbfa61f7efb..7ccf850565d33 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2495,70 +2495,89 @@ static bool folio_make_device_exclusive(struct folio *folio,
 		.arg = &args,
 	};
 
-	/*
-	 * Restrict to anonymous folios for now to avoid potential writeback
-	 * issues.
-	 */
-	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
-		return false;
-
 	rmap_walk(folio, &rwc);
 
 	return args.valid && !folio_mapcount(folio);
 }
 
 /**
- * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * make_device_exclusive() - Mark a page for exclusive use by a device
  * @mm: mm_struct of associated target process
- * @start: start of the region to mark for exclusive device access
- * @end: end address of region
- * @pages: returns the pages which were successfully marked for exclusive access
+ * @addr: the virtual address to mark for exclusive device access
  * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
+ * @foliop: folio pointer will be stored here on success.
+ *
+ * This function looks up the page mapped at the given address, grabs a
+ * folio reference, locks the folio and replaces the PTE with special
+ * device-exclusive PFN swap entry, preventing access through the process
+ * page tables. The function will return with the folio locked and referenced.
  *
- * Returns: number of pages found in the range by GUP. A page is marked for
- * exclusive access only if the page pointer is non-NULL.
+ * On fault, the device-exclusive entries are replaced with the original PTE
+ * under folio lock, after calling MMU notifiers.
  *
- * This function finds ptes mapping page(s) to the given address range, locks
- * them and replaces mappings with special swap entries preventing userspace CPU
- * access. On fault these entries are replaced with the original mapping after
- * calling MMU notifiers.
+ * Only anonymous non-hugetlb folios are supported and the VMA must have
+ * write permissions such that we can fault in the anonymous page writable
+ * in order to mark it exclusive. The caller must hold the mmap_lock in read
+ * mode.
  *
  * A driver using this to program access from a device must use a mmu notifier
  * critical section to hold a device specific lock during programming. Once
- * programming is complete it should drop the page lock and reference after
+ * programming is complete it should drop the folio lock and reference after
  * which point CPU access to the page will revoke the exclusive access.
+ *
+ * Notes:
+ *   #. This function always operates on individual PTEs mapping individual
+ *      pages. PMD-sized THPs are first remapped to be mapped by PTEs before
+ *      the conversion happens on a single PTE corresponding to @addr.
+ *   #. While concurrent access through the process page tables is prevented,
+ *      concurrent access through other page references (e.g., earlier GUP
+ *      invocation) is not handled and not supported.
+ *   #. device-exclusive entries are considered "clean" and "old" by core-mm.
+ *      Device drivers must update the folio state when informed by MMU
+ *      notifiers.
+ *
+ * Returns: pointer to mapped page on success, otherwise a negative error.
  */
-int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, struct page **pages,
-				void *owner)
+struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
+		void *owner, struct folio **foliop)
 {
-	long npages = (end - start) >> PAGE_SHIFT;
-	long i;
+	struct folio *folio;
+	struct page *page;
+	long npages;
+
+	mmap_assert_locked(mm);
 
-	npages = get_user_pages_remote(mm, start, npages,
+	/*
+	 * Fault in the page writable and try to lock it; note that if the
+	 * address would already be marked for exclusive use by a device,
+	 * the GUP call would undo that first by triggering a fault.
+	 */
+	npages = get_user_pages_remote(mm, addr, 1,
 				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
-				       pages, NULL);
-	if (npages < 0)
-		return npages;
-
-	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
-		struct folio *folio = page_folio(pages[i]);
-		if (PageTail(pages[i]) || !folio_trylock(folio)) {
-			folio_put(folio);
-			pages[i] = NULL;
-			continue;
-		}
+				       &page, NULL);
+	if (npages != 1)
+		return ERR_PTR(npages);
+	folio = page_folio(page);
 
-		if (!folio_make_device_exclusive(folio, mm, start, owner)) {
-			folio_unlock(folio);
-			folio_put(folio);
-			pages[i] = NULL;
-		}
+	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
+		folio_put(folio);
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	if (!folio_trylock(folio)) {
+		folio_put(folio);
+		return ERR_PTR(-EBUSY);
 	}
 
-	return npages;
+	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return ERR_PTR(-EBUSY);
+	}
+	*foliop = folio;
+	return page;
 }
-EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+EXPORT_SYMBOL_GPL(make_device_exclusive);
 #endif
 
 void __put_anon_vma(struct anon_vma *anon_vma)
-- 
2.48.1



* [PATCH v2 04/17] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 05/17] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable() David Hildenbrand
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

We require a writable PTE and only support anonymous folios: we can only
have exactly one PTE pointing at that page, which we can just look up
using a folio walk, avoiding the rmap walk and the anon VMA lock.

So let's stop doing an rmap walk and perform a folio walk instead, so we
can easily just modify a single PTE and avoid relying on rmap/mapcounts.

We now effectively work on a single PTE instead of multiple PTEs of
a large folio, allowing for conversion of individual PTEs from
non-exclusive to device-exclusive -- note that the opposite direction
always works on single PTEs: restore_exclusive_pte().

With this change, device-exclusive handling is fully compatible with THPs /
large folios. We still require PMD-sized THPs to get PTE-mapped, and
supporting PMD-mapped THP (without the PTE-remapping) is a different
endeavour that might not be worth it at this point: it might even have
negative side-effects [1].

This gets rid of the "folio_mapcount()" usage and lets us fix ordinary
rmap walks (migration/swapout) next. Spell out that messing with the
mapcount is wrong and must be fixed.

[1] https://lkml.kernel.org/r/Z5tI-cOSyzdLjoe_@phenom.ffwll.local

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 200 ++++++++++++++++++------------------------------------
 1 file changed, 67 insertions(+), 133 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7ccf850565d33..0cd2a2d3de00d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2375,131 +2375,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 }
 
 #ifdef CONFIG_DEVICE_PRIVATE
-struct make_exclusive_args {
-	struct mm_struct *mm;
-	unsigned long address;
-	void *owner;
-	bool valid;
-};
-
-static bool page_make_device_exclusive_one(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long address, void *priv)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-	struct make_exclusive_args *args = priv;
-	pte_t pteval;
-	struct page *subpage;
-	bool ret = true;
-	struct mmu_notifier_range range;
-	swp_entry_t entry;
-	pte_t swp_pte;
-	pte_t ptent;
-
-	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
-				      vma->vm_mm, address, min(vma->vm_end,
-				      address + folio_size(folio)),
-				      args->owner);
-	mmu_notifier_invalidate_range_start(&range);
-
-	while (page_vma_mapped_walk(&pvmw)) {
-		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-
-		ptent = ptep_get(pvmw.pte);
-		if (!pte_present(ptent)) {
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			break;
-		}
-
-		subpage = folio_page(folio,
-				pte_pfn(ptent) - folio_pfn(folio));
-		address = pvmw.address;
-
-		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(ptent));
-		pteval = ptep_clear_flush(vma, address, pvmw.pte);
-
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
-
-		/*
-		 * Check that our target page is still mapped at the expected
-		 * address.
-		 */
-		if (args->mm == mm && args->address == address &&
-		    pte_write(pteval))
-			args->valid = true;
-
-		/*
-		 * Store the pfn of the page in a special migration
-		 * pte. do_swap_page() will wait until the migration
-		 * pte is removed and then restart fault handling.
-		 */
-		if (pte_write(pteval))
-			entry = make_writable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		else
-			entry = make_readable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		swp_pte = swp_entry_to_pte(entry);
-		if (pte_soft_dirty(pteval))
-			swp_pte = pte_swp_mksoft_dirty(swp_pte);
-		if (pte_uffd_wp(pteval))
-			swp_pte = pte_swp_mkuffd_wp(swp_pte);
-
-		set_pte_at(mm, address, pvmw.pte, swp_pte);
-
-		/*
-		 * There is a reference on the page for the swap entry which has
-		 * been removed, so shouldn't take another.
-		 */
-		folio_remove_rmap_pte(folio, subpage, vma);
-	}
-
-	mmu_notifier_invalidate_range_end(&range);
-
-	return ret;
-}
-
-/**
- * folio_make_device_exclusive - Mark the folio exclusively owned by a device.
- * @folio: The folio to replace page table entries for.
- * @mm: The mm_struct where the folio is expected to be mapped.
- * @address: Address where the folio is expected to be mapped.
- * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
- *
- * Tries to remove all the page table entries which are mapping this
- * folio and replace them with special device exclusive swap entries to
- * grant a device exclusive access to the folio.
- *
- * Context: Caller must hold the folio lock.
- * Return: false if the page is still mapped, or if it could not be unmapped
- * from the expected address. Otherwise returns true (success).
- */
-static bool folio_make_device_exclusive(struct folio *folio,
-		struct mm_struct *mm, unsigned long address, void *owner)
-{
-	struct make_exclusive_args args = {
-		.mm = mm,
-		.address = address,
-		.owner = owner,
-		.valid = false,
-	};
-	struct rmap_walk_control rwc = {
-		.rmap_one = page_make_device_exclusive_one,
-		.done = folio_not_mapped,
-		.anon_lock = folio_lock_anon_vma_read,
-		.arg = &args,
-	};
-
-	rmap_walk(folio, &rwc);
-
-	return args.valid && !folio_mapcount(folio);
-}
-
 /**
  * make_device_exclusive() - Mark a page for exclusive use by a device
  * @mm: mm_struct of associated target process
@@ -2541,22 +2416,31 @@ static bool folio_make_device_exclusive(struct folio *folio,
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		void *owner, struct folio **foliop)
 {
-	struct folio *folio;
+	struct mmu_notifier_range range;
+	struct folio *folio, *fw_folio;
+	struct vm_area_struct *vma;
+	struct folio_walk fw;
 	struct page *page;
-	long npages;
+	swp_entry_t entry;
+	pte_t swp_pte;
 
 	mmap_assert_locked(mm);
+	addr = PAGE_ALIGN_DOWN(addr);
 
 	/*
 	 * Fault in the page writable and try to lock it; note that if the
 	 * address would already be marked for exclusive use by a device,
 	 * the GUP call would undo that first by triggering a fault.
+	 *
+	 * If any other device would already map this page exclusively, the
+	 * fault will trigger a conversion to an ordinary
+	 * (non-device-exclusive) PTE and issue a MMU_NOTIFY_EXCLUSIVE.
 	 */
-	npages = get_user_pages_remote(mm, addr, 1,
-				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
-				       &page, NULL);
-	if (npages != 1)
-		return ERR_PTR(npages);
+	page = get_user_page_vma_remote(mm, addr,
+					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+					&vma);
+	if (IS_ERR(page))
+		return page;
 	folio = page_folio(page);
 
 	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
@@ -2569,11 +2453,61 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
+	/*
+	 * Inform secondary MMUs that we are going to convert this PTE to
+	 * device-exclusive, such that they unmap it now. Note that the
+	 * caller must filter this event out to prevent livelocks.
+	 */
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
+				      mm, addr, addr + PAGE_SIZE, owner);
+	mmu_notifier_invalidate_range_start(&range);
+
+	/*
+	 * Let's do a second walk and make sure we still find the same page
+	 * mapped writable. Note that any page of an anonymous folio can
+	 * only be mapped writable using exactly one PTE ("exclusive"), so
+	 * there cannot be other mappings.
+	 */
+	fw_folio = folio_walk_start(&fw, vma, addr, 0);
+	if (fw_folio != folio || fw.page != page ||
+	    fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
+		if (fw_folio)
+			folio_walk_end(&fw, vma);
+		mmu_notifier_invalidate_range_end(&range);
 		folio_unlock(folio);
 		folio_put(folio);
 		return ERR_PTR(-EBUSY);
 	}
+
+	/* Nuke the page table entry so we get the uptodate dirty bit. */
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	fw.pte = ptep_clear_flush(vma, addr, fw.ptep);
+
+	/* Set the dirty flag on the folio now the PTE is gone. */
+	if (pte_dirty(fw.pte))
+		folio_mark_dirty(folio);
+
+	/*
+	 * Store the pfn of the page in a special device-exclusive PFN swap PTE.
+	 * do_swap_page() will trigger the conversion back while holding the
+	 * folio lock.
+	 */
+	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
+	swp_pte = swp_entry_to_pte(entry);
+	if (pte_soft_dirty(fw.pte))
+		swp_pte = pte_swp_mksoft_dirty(swp_pte);
+	/* The pte is writable, uffd-wp does not apply. */
+	set_pte_at(mm, addr, fw.ptep, swp_pte);
+
+	/*
+	 * TODO: The device-exclusive PFN swap PTE holds a folio reference but
+	 * does not count as a mapping (mapcount), which is wrong and must be
+	 * fixed, otherwise RMAP walks don't behave as expected.
+	 */
+	folio_remove_rmap_pte(folio, page, vma);
+
+	folio_walk_end(&fw, vma);
+	mmu_notifier_invalidate_range_end(&range);
 	*foliop = folio;
 	return page;
 }
-- 
2.48.1



* [PATCH v2 05/17] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 04/17] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 06/17] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Let's do it just like mprotect write-upgrade or during NUMA-hinting
faults on PROT_NONE PTEs: detect if the PTE can be writable by using
can_change_pte_writable().

Set the PTE dirty only if the folio is dirty: we might not necessarily
have had a write access, and setting the PTE writable doesn't require
setting the PTE dirty.

From a CPU perspective, these entries are clean. So only set the PTE
dirty if the folio is dirty.

With this change in place, there is no need to have separate
readable and writable device-exclusive entry types, and we'll merge
them next separately.

Note that, during fork(), we first convert the device-exclusive entries
back to ordinary PTEs, and we only ever allow conversion of writable
PTEs to device-exclusive -- only mprotect can currently change them to
readable-device-exclusive. Consequently, we always expect
PageAnonExclusive(page)==true and can_change_pte_writable()==true,
unless we are dealing with soft-dirty tracking or uffd-wp. But reusing
can_change_pte_writable() for now is cleaner.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d545..ba33ba3b7ea17 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -723,18 +723,21 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 	struct folio *folio = page_folio(page);
 	pte_t orig_pte;
 	pte_t pte;
-	swp_entry_t entry;
 
 	orig_pte = ptep_get(ptep);
 	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
 
-	entry = pte_to_swp_entry(orig_pte);
 	if (pte_swp_uffd_wp(orig_pte))
 		pte = pte_mkuffd_wp(pte);
-	else if (is_writable_device_exclusive_entry(entry))
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	if ((vma->vm_flags & VM_WRITE) &&
+	    can_change_pte_writable(vma, address, pte)) {
+		if (folio_test_dirty(folio))
+			pte = pte_mkdirty(pte);
+		pte = pte_mkwrite(pte, vma);
+	}
 
 	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
 					   PageAnonExclusive(page)), folio);
-- 
2.48.1



* [PATCH v2 06/17] mm: use single SWP_DEVICE_EXCLUSIVE entry type
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (4 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 05/17] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 07/17] mm/page_vma_mapped: device-exclusive entries are not migration entries David Hildenbrand
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, Simona Vetter

There is no need for the distinction anymore; let's merge the readable
and writable device-exclusive entries into a single device-exclusive
entry type.

Acked-by: Simona Vetter <simona.vetter@ffwll.ch>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swap.h    |  7 +++----
 include/linux/swapops.h | 27 ++++-----------------------
 mm/mprotect.c           |  8 --------
 mm/page_table_check.c   |  5 ++---
 mm/rmap.c               |  2 +-
 5 files changed, 10 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b13b72645db33..26b1d8cc5b0e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -74,14 +74,13 @@ static inline int current_is_kswapd(void)
  * to a special SWP_DEVICE_{READ|WRITE} entry.
  *
  * When a page is mapped by the device for exclusive access we set the CPU page
- * table entries to special SWP_DEVICE_EXCLUSIVE_* entries.
+ * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 4
+#define SWP_DEVICE_NUM 3
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
-#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
-#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
+#define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 96f26e29fefed..64ea151a7ae39 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -186,26 +186,16 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 
-static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
 {
-	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
-}
-
-static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
-{
-	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+	return swp_entry(SWP_DEVICE_EXCLUSIVE, offset);
 }
 
 static inline bool is_device_exclusive_entry(swp_entry_t entry)
 {
-	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
-		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
 }
 
-static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
-{
-	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
-}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -227,12 +217,7 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 	return false;
 }
 
-static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
-{
-	return swp_entry(0, 0);
-}
-
-static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
 }
@@ -242,10 +227,6 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
 	return false;
 }
 
-static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
-{
-	return false;
-}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 516b1d847e2cd..9cb6ab7c40480 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,14 +225,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
-			} else if (is_writable_device_exclusive_entry(entry)) {
-				entry = make_readable_device_exclusive_entry(
-							swp_offset(entry));
-				newpte = swp_entry_to_pte(entry);
-				if (pte_swp_soft_dirty(oldpte))
-					newpte = pte_swp_mksoft_dirty(newpte);
-				if (pte_swp_uffd_wp(oldpte))
-					newpte = pte_swp_mkuffd_wp(newpte);
 			} else if (is_pte_marker_entry(entry)) {
 				/*
 				 * Ignore error swap entries unconditionally,
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 509c6ef8de400..c2b3600429a0c 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -196,9 +196,8 @@ EXPORT_SYMBOL(__page_table_check_pud_clear);
 /* Whether the swap entry cached writable information */
 static inline bool swap_cached_writable(swp_entry_t entry)
 {
-	return is_writable_device_exclusive_entry(entry) ||
-	    is_writable_device_private_entry(entry) ||
-	    is_writable_migration_entry(entry);
+	return is_writable_device_private_entry(entry) ||
+	       is_writable_migration_entry(entry);
 }
 
 static inline void page_table_check_pte_flags(pte_t pte)
diff --git a/mm/rmap.c b/mm/rmap.c
index 0cd2a2d3de00d..1129ed132af94 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2492,7 +2492,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	 * do_swap_page() will trigger the conversion back while holding the
 	 * folio lock.
 	 */
-	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
+	entry = make_device_exclusive_entry(page_to_pfn(page));
 	swp_pte = swp_entry_to_pte(entry);
 	if (pte_soft_dirty(fw.pte))
 		swp_pte = pte_swp_mksoft_dirty(swp_pte);
-- 
2.48.1



* [PATCH v2 07/17] mm/page_vma_mapped: device-exclusive entries are not migration entries
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (5 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 06/17] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 08/17] kernel/events/uprobes: handle device-exclusive entries correctly in __replace_page() David Hildenbrand
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

It's unclear why they would be considered migration entries; they are
not.

Likely we'll never really trigger that case in practice, because
migration (including folio split) of a folio that has device-exclusive
entries is never started, as we would detect "additional references":
device-exclusive entries adjust the mapcount, but not the refcount.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_vma_mapped.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 81839a9e74f16..32679be22d30c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -111,8 +111,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 		entry = pte_to_swp_entry(ptent);
 
-		if (!is_migration_entry(entry) &&
-		    !is_device_exclusive_entry(entry))
+		if (!is_migration_entry(entry))
 			return false;
 
 		pfn = swp_offset_pfn(entry);
-- 
2.48.1



* [PATCH v2 08/17] kernel/events/uprobes: handle device-exclusive entries correctly in __replace_page()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (6 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 07/17] mm/page_vma_mapped: device-exclusive entries are not migration entries David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 09/17] mm/ksm: handle device-exclusive entries correctly in write_protect_page() David Hildenbrand
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

__replace_page() is not prepared for that, so teach it about these
PFN swap PTEs. Note that device-private entries are so far not
applicable on that path, because GUP would never have returned such
folios (conversion to device-private happens by page migration, not
in-place conversion of the PTE).

There is a race between GUP and us locking the folio to look it up
using page_vma_mapped_walk(), so this is likely a fix (unless something
else could prevent that race, but it doesn't look like it). pte_pfn() on
something that is not a present pte could give us garbage, and we'd
wrongly mess up the mapcount because it was already adjusted by calling
folio_remove_rmap_pte() when making the entry device-exclusive.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/events/uprobes.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2ca797cbe465f..cd6105b100325 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -173,6 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, 0);
 	int err;
 	struct mmu_notifier_range range;
+	pte_t pte;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
 				addr + PAGE_SIZE);
@@ -192,6 +193,16 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	if (!page_vma_mapped_walk(&pvmw))
 		goto unlock;
 	VM_BUG_ON_PAGE(addr != pvmw.address, old_page);
+	pte = ptep_get(pvmw.pte);
+
+	/*
+	 * Handle PFN swap PTES, such as device-exclusive ones, that actually
+	 * map pages: simply trigger GUP again to fix it up.
+	 */
+	if (unlikely(!pte_present(pte))) {
+		page_vma_mapped_walk_done(&pvmw);
+		goto unlock;
+	}
 
 	if (new_page) {
 		folio_get(new_folio);
@@ -206,7 +217,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		inc_mm_counter(mm, MM_ANONPAGES);
 	}
 
-	flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
+	flush_cache_page(vma, addr, pte_pfn(pte));
 	ptep_clear_flush(vma, addr, pvmw.pte);
 	if (new_page)
 		set_pte_at(mm, addr, pvmw.pte,
-- 
2.48.1



* [PATCH v2 09/17] mm/ksm: handle device-exclusive entries correctly in write_protect_page()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (7 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 08/17] kernel/events/uprobes: handle device-exclusive entries correctly in __replace_page() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 10/17] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one() David Hildenbrand
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

write_protect_page() is not prepared for that, so teach it about these
PFN swap PTEs. Note that device-private entries are so far not
applicable on that path, because GUP would never have returned such
folios (conversion to device-private happens by page migration, not
in-place conversion of the PTE).

There is a race between performing the folio_walk (which fails on
non-present PTEs) and locking the folio to look it up using
page_vma_mapped_walk() again, so this is likely a fix (unless something
else could prevent that race, but it doesn't look like it). In the
future this could be handled if ever required; for now, just give up and
ignore such PTEs like folio_walk would.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/ksm.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 8be2b144fefd6..8583fb91ef136 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1270,8 +1270,15 @@ static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
 	if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
 		goto out_unlock;
 
-	anon_exclusive = PageAnonExclusive(&folio->page);
 	entry = ptep_get(pvmw.pte);
+	/*
+	 * Handle PFN swap PTEs, such as device-exclusive ones, that actually
+	 * map pages: give up just like the next folio_walk would.
+	 */
+	if (unlikely(!pte_present(entry)))
+		goto out_unlock;
+
+	anon_exclusive = PageAnonExclusive(&folio->page);
 	if (pte_write(entry) || pte_dirty(entry) ||
 	    anon_exclusive || mm_tlb_flush_pending(mm)) {
 		swapped = folio_test_swapcache(folio);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 10/17] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (8 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 09/17] mm/ksm: handle device-exclusive entries correctly in write_protect_page() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 11/17] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one() David Hildenbrand
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

try_to_unmap_one() is not prepared for that, so teach it about these
PFN swap PTEs. Note that device-private entries are so far not
applicable on that path, as we expect ZONE_DEVICE pages so far only in
migration code when it comes to the RMAP.

Note that we could currently only run into this case with
device-exclusive entries on THPs. We still adjust the mapcount on
conversion to device-exclusive; this makes the rmap walk
abort early for small folios, because we'll always have
!folio_mapped() with a single device-exclusive entry. We'll adjust the
mapcount logic once all page_vma_mapped_walk() users can properly
handle device-exclusive entries.

Further note that try_to_unmap() calls MMU notifiers and holds the
folio lock, so any device-exclusive users should be properly prepared
for a device-exclusive PTE to "vanish".
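
For illustration, a rough sketch of what such preparation could look like
on the driver side; the my_device_* structure and helper below are
hypothetical, and real drivers additionally have to respect
mmu_notifier_range_blockable() and their own locking:

  static bool my_device_invalidate(struct mmu_interval_notifier *mni,
                                   const struct mmu_notifier_range *range,
                                   unsigned long cur_seq)
  {
          /* Hypothetical per-range driver state that embeds the notifier. */
          struct my_device_range *drange =
                  container_of(mni, struct my_device_range, notifier);

          mmu_interval_set_seq(mni, cur_seq);
          /*
           * Hypothetical helper: revoke the device's exclusive/atomic
           * mapping, so that a later device fault re-establishes
           * exclusivity via make_device_exclusive().
           */
          my_device_zap_mapping(drange, range->start, range->end);
          return true;
  }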

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1129ed132af94..47142a656ae51 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1648,9 +1648,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	bool anon_exclusive, ret = true;
 	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
@@ -1722,7 +1722,18 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
-		pfn = pte_pfn(ptep_get(pvmw.pte));
+		/*
+		 * Handle PFN swap PTEs, such as device-exclusive ones, that
+		 * actually map pages.
+		 */
+		pteval = ptep_get(pvmw.pte);
+		if (likely(pte_present(pteval))) {
+			pfn = pte_pfn(pteval);
+		} else {
+			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
+			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+		}
+
 		subpage = folio_page(folio, pfn - folio_pfn(folio));
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
@@ -1778,7 +1789,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				hugetlb_vma_unlock_write(vma);
 			}
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-		} else {
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+		} else if (likely(pte_present(pteval))) {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
 			if (should_defer_flush(mm, flags)) {
@@ -1796,6 +1809,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+		} else {
+			pte_clear(mm, address, pvmw.pte);
 		}
 
 		/*
@@ -1805,10 +1822,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 */
 		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
 
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
-
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
@@ -1822,8 +1835,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				dec_mm_counter(mm, mm_counter(folio));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
-
-		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
+			   !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
 			 * interest anymore. Simply discard the pte, vmscan
@@ -1902,6 +1915,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
+
+			/*
+			 * arch_unmap_one() is expected to be a NOP on
+			 * architectures where we could have PFN swap PTEs,
+			 * so we'll not check/care.
+			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
 				swap_free(entry);
 				set_pte_at(mm, address, pvmw.pte, pteval);
@@ -1926,10 +1945,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (anon_exclusive)
 				swp_pte = pte_swp_mkexclusive(swp_pte);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			if (likely(pte_present(pteval))) {
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			} else {
+				if (pte_swp_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_swp_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			}
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 		} else {
 			/*
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 11/17] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (9 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 10/17] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 12/17] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one() David Hildenbrand
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

try_to_migrate_one() is not prepared for that, so teach it about these
PFN swap PTEs. We already handle device-private entries by specializing
on the folio, so we can reshuffle that code to make it work on the
PFN swap PTEs instead.

Get rid of the folio_is_device_private() handling. Note that we currently
never expect device-private folios with the HWPoison flag set at that
point, so add a warning in case that ever changes, at which point we can
figure out what the right thing to do is.

Note that we could currently only run into this case with
device-exclusive entries on THPs. We still adjust the mapcount on
conversion to device-exclusive; this makes the rmap walk
abort early for small folios, because we'll always have
!folio_mapped() with a single device-exclusive entry. We'll adjust the
mapcount logic once all page_vma_mapped_walk() users can properly
handle device-exclusive entries.

Further note that try_to_migrate() calls MMU notifiers and holds the
folio lock, so any device-exclusive users should be properly prepared
for a device-exclusive PTE to "vanish".

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 124 ++++++++++++++++++++++--------------------------------
 1 file changed, 51 insertions(+), 73 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 47142a656ae51..7c471c3ea64c4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2039,9 +2039,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	bool anon_exclusive, writable, ret = true;
 	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
@@ -2108,24 +2108,19 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
-		pfn = pte_pfn(ptep_get(pvmw.pte));
-
-		if (folio_is_zone_device(folio)) {
-			/*
-			 * Our PTE is a non-present device exclusive entry and
-			 * calculating the subpage as for the common case would
-			 * result in an invalid pointer.
-			 *
-			 * Since only PAGE_SIZE pages can currently be
-			 * migrated, just set it to page. This will need to be
-			 * changed when hugepage migrations to device private
-			 * memory are supported.
-			 */
-			VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
-			subpage = &folio->page;
+		/*
+		 * Handle PFN swap PTEs, such as device-exclusive ones, that
+		 * actually map pages.
+		 */
+		pteval = ptep_get(pvmw.pte);
+		if (likely(pte_present(pteval))) {
+			pfn = pte_pfn(pteval);
 		} else {
-			subpage = folio_page(folio, pfn - folio_pfn(folio));
+			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
+			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 		}
+
+		subpage = folio_page(folio, pfn - folio_pfn(folio));
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
@@ -2181,7 +2176,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 			/* Nuke the hugetlb page table entry */
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-		} else {
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+			writable = pte_write(pteval);
+		} else if (likely(pte_present(pteval))) {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
 			if (should_defer_flush(mm, flags)) {
@@ -2199,54 +2197,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+			writable = pte_write(pteval);
+		} else {
+			pte_clear(mm, address, pvmw.pte);
+			writable = is_writable_device_private_entry(pte_to_swp_entry(pteval));
 		}
 
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
+		VM_WARN_ON_FOLIO(writable && folio_test_anon(folio) &&
+				!anon_exclusive, folio);
 
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (folio_is_device_private(folio)) {
-			unsigned long pfn = folio_pfn(folio);
-			swp_entry_t entry;
-			pte_t swp_pte;
-
-			if (anon_exclusive)
-				WARN_ON_ONCE(folio_try_share_anon_rmap_pte(folio,
-									   subpage));
+		if (PageHWPoison(subpage)) {
+			VM_WARN_ON_FOLIO(folio_is_device_private(folio), folio);
 
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = pte_to_swp_entry(pteval);
-			if (is_writable_device_private_entry(entry))
-				entry = make_writable_migration_entry(pfn);
-			else if (anon_exclusive)
-				entry = make_readable_exclusive_migration_entry(pfn);
-			else
-				entry = make_readable_migration_entry(pfn);
-			swp_pte = swp_entry_to_pte(entry);
-
-			/*
-			 * pteval maps a zone device page and is therefore
-			 * a swap pte.
-			 */
-			if (pte_swp_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_swp_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
-			trace_set_migration_pte(pvmw.address, pte_val(swp_pte),
-						folio_order(folio));
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
-		} else if (PageHWPoison(subpage)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
 				hugetlb_count_sub(folio_nr_pages(folio), mm);
@@ -2256,8 +2223,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				dec_mm_counter(mm, mm_counter(folio));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
-
-		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
+			   !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
 			 * interest anymore. Simply discard the pte, vmscan
@@ -2273,6 +2240,11 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			swp_entry_t entry;
 			pte_t swp_pte;
 
+			/*
+			 * arch_unmap_one() is expected to be a NOP on
+			 * architectures where we could have PFN swap PTEs,
+			 * so we'll not check/care.
+			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
 				if (folio_test_hugetlb(folio))
 					set_huge_pte_at(mm, address, pvmw.pte,
@@ -2283,8 +2255,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
-			VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
-				       !anon_exclusive, subpage);
 
 			/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
 			if (folio_test_hugetlb(folio)) {
@@ -2309,7 +2279,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			if (pte_write(pteval))
+			if (writable)
 				entry = make_writable_migration_entry(
 							page_to_pfn(subpage));
 			else if (anon_exclusive)
@@ -2318,15 +2288,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-			if (pte_young(pteval))
-				entry = make_migration_entry_young(entry);
-			if (pte_dirty(pteval))
-				entry = make_migration_entry_dirty(entry);
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			if (likely(pte_present(pteval))) {
+				if (pte_young(pteval))
+					entry = make_migration_entry_young(entry);
+				if (pte_dirty(pteval))
+					entry = make_migration_entry_dirty(entry);
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			} else {
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_swp_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			}
 			if (folio_test_hugetlb(folio))
 				set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
 						hsz);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 12/17] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (10 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 11/17] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 13/17] mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one() David Hildenbrand
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

page_vma_mkclean_one() is not prepared for that, so teach it about these
PFN swap PTEs. Note that device-private entries are so far not applicable
on that path, as we expect ZONE_DEVICE pages so far only in migration code
when it comes to the RMAP.

Note that we could currently only run into this case with
device-exclusive entries on THPs. We still adjust the mapcount on
conversion to device-exclusive; this makes the rmap walk
abort early for small folios, because we'll always have
!folio_mapped() with a single device-exclusive entry. We'll adjust the
mapcount logic once all page_vma_mapped_walk() users can properly
handle device-exclusive entries.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7c471c3ea64c4..7b737f0f68fb5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1044,6 +1044,14 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 			pte_t *pte = pvmw->pte;
 			pte_t entry = ptep_get(pte);
 
+			/*
+			 * PFN swap PTEs, such as device-exclusive ones, that
+			 * actually map pages are clean and not writable from a
+			 * CPU perspective. The MMU notifier takes care of any
+			 * device aspects.
+			 */
+			if (!pte_present(entry))
+				continue;
 			if (!pte_dirty(entry) && !pte_write(entry))
 				continue;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 13/17] mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (11 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 12/17] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-11 20:48   ` SeongJae Park
  2025-02-10 19:37 ` [PATCH v2 14/17] mm/damon: handle device-exclusive entries correctly in damon_folio_young_one() David Hildenbrand
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

page_idle_clear_pte_refs_one() is not prepared for that, so let's
teach it what to do with these PFN swap PTEs. Note that device-private
entries are so far not applicable on that path, as page_idle_get_folio()
filters out non-lru folios.

Should we just skip PFN swap PTEs completely? Possible, but it seems
straightforward to just handle them correctly.

Note that we could currently only run into this case with
device-exclusive entries on THPs. We still adjust the mapcount on
conversion to device-exclusive; this makes the rmap walk
abort early for small folios, because we'll always have
!folio_mapped() with a single device-exclusive entry. We'll adjust the
mapcount logic once all page_vma_mapped_walk() users can properly
handle device-exclusive entries.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_idle.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 947c7c7a37289..408aaf29a3ea6 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -62,9 +62,14 @@ static bool page_idle_clear_pte_refs_one(struct folio *folio,
 			/*
 			 * For PTE-mapped THP, one sub page is referenced,
 			 * the whole THP is referenced.
+			 *
+			 * PFN swap PTEs, such as device-exclusive ones, that
+			 * actually map pages are "old" from a CPU perspective.
+			 * The MMU notifier takes care of any device aspects.
 			 */
-			if (ptep_clear_young_notify(vma, addr, pvmw.pte))
-				referenced = true;
+			if (likely(pte_present(ptep_get(pvmw.pte))))
+				referenced |= ptep_test_and_clear_young(vma, addr, pvmw.pte);
+			referenced |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
 				referenced = true;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 14/17] mm/damon: handle device-exclusive entries correctly in damon_folio_young_one()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (12 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 13/17] mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-11  6:59   ` SeongJae Park
  2025-02-10 19:37 ` [PATCH v2 15/17] mm/damon: handle device-exclusive entries correctly in damon_folio_mkold_one() David Hildenbrand
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

damon_folio_young_one() is not prepared for that, so teach it about these
PFN swap PTEs. Note that device-private entries are so far not applicable
on that path, as we expect ZONE_DEVICE pages so far only in migration code
when it comes to the RMAP.

The impact is rather small: we'd be calling pte_young() on a
non-present PTE, which does not really have defined semantics.

Note that we could currently only run into this case with
device-exclusive entries on THPs. We still adjust the mapcount on
conversion to device-exclusive; this makes the rmap walk
abort early for small folios, because we'll always have
!folio_mapped() with a single device-exclusive entry. We'll adjust the
mapcount logic once all page_vma_mapped_walk() users can properly
handle device-exclusive entries.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/damon/paddr.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 0f9ae14f884dd..10d75f9ceeafb 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -92,12 +92,20 @@ static bool damon_folio_young_one(struct folio *folio,
 {
 	bool *accessed = arg;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0);
+	pte_t pte;
 
 	*accessed = false;
 	while (page_vma_mapped_walk(&pvmw)) {
 		addr = pvmw.address;
 		if (pvmw.pte) {
-			*accessed = pte_young(ptep_get(pvmw.pte)) ||
+			pte = ptep_get(pvmw.pte);
+
+			/*
+			 * PFN swap PTEs, such as device-exclusive ones, that
+			 * actually map pages are "old" from a CPU perspective.
+			 * The MMU notifier takes care of any device aspects.
+			 */
+			*accessed = (pte_present(pte) && pte_young(pte)) ||
 				!folio_test_idle(folio) ||
 				mmu_notifier_test_young(vma->vm_mm, addr);
 		} else {
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 15/17] mm/damon: handle device-exclusive entries correctly in damon_folio_mkold_one()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (13 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 14/17] mm/damon: handle device-exclusive entries correctly in damon_folio_young_one() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-11  7:00   ` SeongJae Park
  2025-02-10 19:37 ` [PATCH v2 16/17] mm/rmap: keep mapcount untouched for device-exclusive entries David Hildenbrand
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

damon_folio_mkold_one() is not prepared for that and calls
damon_ptep_mkold() with PFN swap PTEs. Teach damon_ptep_mkold() to deal
with these PFN swap PTEs. Note that device-private entries are so far not
applicable on that path, as damon_get_folio() filters out non-lru
folios.

Should we just skip PFN swap PTEs completely? Possible, but it seems
straightforward to just handle them correctly.

Note that we could currently only run into this case with
device-exclusive entries on THPs. We still adjust the mapcount on
conversion to device-exclusive; this makes the rmap walk
abort early for small folios, because we'll always have
!folio_mapped() with a single device-exclusive entry. We'll adjust the
mapcount logic once all page_vma_mapped_walk() users can properly
handle device-exclusive entries.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/damon/ops-common.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index d25d99cb5f2bb..86a50e8fbc806 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -9,6 +9,8 @@
 #include <linux/page_idle.h>
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #include "ops-common.h"
 
@@ -39,12 +41,29 @@ struct folio *damon_get_folio(unsigned long pfn)
 
 void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr)
 {
-	struct folio *folio = damon_get_folio(pte_pfn(ptep_get(pte)));
+	pte_t pteval = ptep_get(pte);
+	struct folio *folio;
+	bool young = false;
+	unsigned long pfn;
+
+	if (likely(pte_present(pteval)))
+		pfn = pte_pfn(pteval);
+	else
+		pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
 
+	folio = damon_get_folio(pfn);
 	if (!folio)
 		return;
 
-	if (ptep_clear_young_notify(vma, addr, pte))
+	/*
+	 * PFN swap PTEs, such as device-exclusive ones, that actually map pages
+	 * are "old" from a CPU perspective. The MMU notifier takes care of any
+	 * device aspects.
+	 */
+	if (likely(pte_present(pteval)))
+		young |= ptep_test_and_clear_young(vma, addr, pte);
+	young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
+	if (young)
 		folio_set_young(folio);
 
 	folio_set_idle(folio);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 16/17] mm/rmap: keep mapcount untouched for device-exclusive entries
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (14 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 15/17] mm/damon: handle device-exclusive entries correctly in damon_folio_mkold_one() David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 19:37 ` [PATCH v2 17/17] mm/rmap: avoid -EBUSY from make_device_exclusive() David Hildenbrand
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Now that conversion to device-exclusive no longer performs an
rmap walk and all page_vma_mapped_walk() users have been taught to
properly handle device-exclusive entries, let's treat device-exclusive
entries just as if they were present, similar to how we already handle
device-private entries.

This fixes swapout/migration/split/hwpoison of folios with
device-exclusive entries.

We only had to take care of page_vma_mapped_walk() users, because these
traditionally assume pte_present(). Other page table walkers already
have to handle !pte_present(), and some of them might simply skip them
(e.g., MADV_PAGEOUT) if they are not specialized on them. This change
doesn't modify the latter.

Note that while folios with device-exclusive PTEs can now get migrated,
khugepaged will not collapse a THP if there is a device-exclusive PTE.
Doing so might also not be desired if the device frequently performs
atomics to the same page. Similarly, KSM will never merge order-0 folios
that are device-exclusive.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 17 +----------------
 mm/rmap.c   |  7 -------
 2 files changed, 1 insertion(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ba33ba3b7ea17..e9f54065b117f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -741,20 +741,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 
 	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
 					   PageAnonExclusive(page)), folio);
-
-	/*
-	 * No need to take a page reference as one was already
-	 * created when the swap entry was made.
-	 */
-	if (folio_test_anon(folio))
-		folio_add_anon_rmap_pte(folio, page, vma, address, RMAP_NONE);
-	else
-		/*
-		 * Currently device exclusive access only supports anonymous
-		 * memory so the entry shouldn't point to a filebacked page.
-		 */
-		WARN_ON_ONCE(1);
-
 	set_pte_at(vma->vm_mm, address, ptep, pte);
 
 	/*
@@ -1626,8 +1612,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		 */
 		WARN_ON_ONCE(!vma_is_anonymous(vma));
 		rss[mm_counter(folio)]--;
-		if (is_device_private_entry(entry))
-			folio_remove_rmap_pte(folio, page, vma);
+		folio_remove_rmap_pte(folio, page, vma);
 		folio_put(folio);
 	} else if (!non_swap_entry(entry)) {
 		/* Genuine swap entries, hence a private anon pages */
diff --git a/mm/rmap.c b/mm/rmap.c
index 7b737f0f68fb5..e2a543f639ce3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2511,13 +2511,6 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	/* The pte is writable, uffd-wp does not apply. */
 	set_pte_at(mm, addr, fw.ptep, swp_pte);
 
-	/*
-	 * TODO: The device-exclusive PFN swap PTE holds a folio reference but
-	 * does not count as a mapping (mapcount), which is wrong and must be
-	 * fixed, otherwise RMAP walks don't behave as expected.
-	 */
-	folio_remove_rmap_pte(folio, page, vma);
-
 	folio_walk_end(&fw, vma);
 	mmu_notifier_invalidate_range_end(&range);
 	*foliop = folio;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 17/17] mm/rmap: avoid -EBUSY from make_device_exclusive()
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (15 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 16/17] mm/rmap: keep mapcount untouched for device-exclusive entries David Hildenbrand
@ 2025-02-10 19:37 ` David Hildenbrand
  2025-02-10 23:05 ` [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) Andrew Morton
  2025-02-13 11:03 ` Alistair Popple
  18 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-10 19:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, linux-trace-kernel,
	linux-perf-users, damon, David Hildenbrand, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe

Failing to obtain the folio lock, for example because the folio is
concurrently getting migrated or swapped out, can easily make the callers
fail: the hmm selftest can sometimes be observed to fail because of this.
Instead of forcing the caller to retry, let's simply retry in this
to-be-expected case.

Similarly, avoid spurious failures simply because we raced with someone
(e.g., swapout) modifying the page table such that our folio_walk fails.

Simply unconditionally lock the folio, and retry GUP if our folio_walk
fails. Note that the folio_walk repeatedly failing is not something we
expect.

Note that we might want to avoid grabbing the folio lock at some point;
for now, keep that as is and only unconditionally lock the folio.

With this change, the hmm selftests don't fail simply because the folio
is already locked. While this fixes the selftests in some cases, it's
likely not something that deserves a "Fixes:".
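
For illustration, a rough caller-side sketch with this change in place;
the function and my_device_map_page() are hypothetical, the "owner"
cookie is the caller's, and the MMU-notifier sequence checking a real
driver needs is omitted. The point is merely that no -EBUSY retry loop
around make_device_exclusive() is needed anymore:

  static int my_device_make_atomic(struct mm_struct *mm, unsigned long addr,
                                   void *owner)
  {
          struct folio *folio;
          struct page *page;
          int ret;

          mmap_read_lock(mm);
          page = make_device_exclusive(mm, addr, owner, &folio);
          mmap_read_unlock(mm);
          if (IS_ERR(page))
                  return PTR_ERR(page); /* e.g., -EOPNOTSUPP or a fatal signal */

          /* Hypothetical helper: program the device PTE for this page. */
          ret = my_device_map_page(owner, addr, page_to_pfn(page));

          /* The folio comes back locked and referenced; drop both. */
          folio_unlock(folio);
          folio_put(folio);
          return ret;
  }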

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index e2a543f639ce3..0f760b93fc0a2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2435,6 +2435,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	struct page *page;
 	swp_entry_t entry;
 	pte_t swp_pte;
+	int ret;
 
 	mmap_assert_locked(mm);
 	addr = PAGE_ALIGN_DOWN(addr);
@@ -2448,6 +2449,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	 * fault will trigger a conversion to an ordinary
 	 * (non-device-exclusive) PTE and issue a MMU_NOTIFY_EXCLUSIVE.
 	 */
+retry:
 	page = get_user_page_vma_remote(mm, addr,
 					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
 					&vma);
@@ -2460,9 +2462,10 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		return ERR_PTR(-EOPNOTSUPP);
 	}
 
-	if (!folio_trylock(folio)) {
+	ret = folio_lock_killable(folio);
+	if (ret) {
 		folio_put(folio);
-		return ERR_PTR(-EBUSY);
+		return ERR_PTR(ret);
 	}
 
 	/*
@@ -2488,7 +2491,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		mmu_notifier_invalidate_range_end(&range);
 		folio_unlock(folio);
 		folio_put(folio);
-		return ERR_PTR(-EBUSY);
+		goto retry;
 	}
 
 	/* Nuke the page table entry so we get the uptodate dirty bit. */
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (16 preceding siblings ...)
  2025-02-10 19:37 ` [PATCH v2 17/17] mm/rmap: avoid -EBUSY from make_device_exclusive() David Hildenbrand
@ 2025-02-10 23:05 ` Andrew Morton
  2025-02-10 23:39   ` Barry Song
  2025-02-13 11:03 ` Alistair Popple
  18 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-02-10 23:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, Barry Song

On Mon, 10 Feb 2025 20:37:42 +0100 David Hildenbrand <david@redhat.com> wrote:

> Against mm-hotfixes-stable for now.
> 
> Discussing the PageTail() call in make_device_exclusive_range() with
> Willy, I recently discovered [1] that device-exclusive handling does
> not properly work with THP, making the hmm-tests selftests fail if THPs
> are enabled on the system.
> 
> Looking into more details, I found that hugetlb is not properly fenced,
> and I realized that something that was bugging me for longer -- how
> device-exclusive entries interact with mapcounts -- completely breaks
> migration/swapout/split/hwpoison handling of these folios while they have
> device-exclusive PTEs.
> 
> The program below can be used to allocate 1 GiB worth of pages and
> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> 
> Once they are device-exclusive, these folios cannot get swapped out
> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> much one forces memory reclaim), and when having a memory block onlined
> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> failed migration of a page that should be movable.
> 
> # echo offline > /sys/devices/system/memory/memory136/state
> # echo online_movable > /sys/devices/system/memory/memory136/state
> # ./hmm-swap &
> ... wait until everything is device-exclusive
> # echo offline > /sys/devices/system/memory/memory136/state
> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>   index:0x7f20671f7 pfn:0x442b6a
> [  285.196618][T14882] memcg:ffff888179298000
> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>   dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> [  285.201734][T14882] raw: ...
> [  285.204464][T14882] raw: ...
> [  285.207196][T14882] page dumped because: migration failure
> [  285.209072][T14882] page_owner tracks the page as allocated
> [  285.210915][T14882] page last allocated via order 0, migratetype
>   Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>   id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> ...
> 
> This series fixes all issues I found so far.

Cool.

Barry, could you please redo your series "mm: batched unmap lazyfree
large folios during reclamation" on top of this (on top of mm-unstable,
ideally).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
  2025-02-10 23:05 ` [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) Andrew Morton
@ 2025-02-10 23:39   ` Barry Song
  0 siblings, 0 replies; 31+ messages in thread
From: Barry Song @ 2025-02-10 23:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, linux-trace-kernel, linux-perf-users, damon,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, Barry Song

On Tue, Feb 11, 2025 at 12:05 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Mon, 10 Feb 2025 20:37:42 +0100 David Hildenbrand <david@redhat.com> wrote:
>
> > Against mm-hotfixes-stable for now.
> >
> > Discussing the PageTail() call in make_device_exclusive_range() with
> > Willy, I recently discovered [1] that device-exclusive handling does
> > not properly work with THP, making the hmm-tests selftests fail if THPs
> > are enabled on the system.
> >
> > Looking into more details, I found that hugetlb is not properly fenced,
> > and I realized that something that was bugging me for longer -- how
> > device-exclusive entries interact with mapcounts -- completely breaks
> > migration/swapout/split/hwpoison handling of these folios while they have
> > device-exclusive PTEs.
> >
> > The program below can be used to allocate 1 GiB worth of pages and
> > making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> >
> > Once they are device-exclusive, these folios cannot get swapped out
> > (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> > much one forces memory reclaim), and when having a memory block onlined
> > to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> > failed migration of a page that should be movable.
> >
> > # echo offline > /sys/devices/system/memory/memory136/state
> > # echo online_movable > /sys/devices/system/memory/memory136/state
> > # ./hmm-swap &
> > ... wait until everything is device-exclusive
> > # echo offline > /sys/devices/system/memory/memory136/state
> > [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
> >   index:0x7f20671f7 pfn:0x442b6a
> > [  285.196618][T14882] memcg:ffff888179298000
> > [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
> >   dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> > [  285.201734][T14882] raw: ...
> > [  285.204464][T14882] raw: ...
> > [  285.207196][T14882] page dumped because: migration failure
> > [  285.209072][T14882] page_owner tracks the page as allocated
> > [  285.210915][T14882] page last allocated via order 0, migratetype
> >   Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
> >   id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> > [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> > [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> > [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> > [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> > [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> > [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> > [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> > [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> > ...
> >
> > This series fixes all issues I found so far.
>
> Cool.
>
> Barry, could you please redo your series "mm: batched unmap lazyfree
> large folios during reclamation" on top of this (on top of mm-unstable,
> ideally).

Sure. Thanks for letting me know.

>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-02-10 19:37 ` [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
@ 2025-02-11  5:00   ` Andrew Morton
  2025-02-11  8:33     ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-02-11  5:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, Simona Vetter

On Mon, 10 Feb 2025 20:37:45 +0100 David Hildenbrand <david@redhat.com> wrote:

> The single "real" user in the tree of make_device_exclusive_range() always
> requests making only a single address exclusive. The current implementation
> is hard to fix for properly supporting anonymous THP / large folios and
> for avoiding messing with rmap walks in weird ways.
> 
> So let's always process a single address/page and return folio + page to
> minimize page -> folio lookups. This is a preparation for further
> changes.
> 
> Reject any non-anonymous or hugetlb folios early, directly after GUP.
> 
> While at it, extend the documentation of make_device_exclusive() to
> clarify some things.

x86_64 allmodconfig:

drivers/gpu/drm/nouveau/nouveau_svm.c: In function 'nouveau_atomic_range_fault':
drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: error: 'folio' undeclared (first use in this function)
  612 |                 page = make_device_exclusive(mm, start, drm->dev, &folio);
      |                                                                    ^~~~~
drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: note: each undeclared identifier is reported only once for each function it appears in



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 14/17] mm/damon: handle device-exclusive entries correctly in damon_folio_young_one()
  2025-02-10 19:37 ` [PATCH v2 14/17] mm/damon: handle device-exclusive entries correctly in damon_folio_young_one() David Hildenbrand
@ 2025-02-11  6:59   ` SeongJae Park
  0 siblings, 0 replies; 31+ messages in thread
From: SeongJae Park @ 2025-02-11  6:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: SeongJae Park, linux-kernel, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	Pasha Tatashin, Peter Xu, Alistair Popple, Jason Gunthorpe

On Mon, 10 Feb 2025 20:37:56 +0100 David Hildenbrand <david@redhat.com> wrote:

> Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
> we can return with a device-exclusive entry from page_vma_mapped_walk().
> 
> damon_folio_young_one() is not prepared for that, so teach it about these
> PFN swap PTEs. Note that device-private entries are so far not applicable
> on that path, as we expect ZONE_DEVICE pages so far only in migration code
> when it comes to the RMAP.
> 
> The impact is rather small: we'd be calling pte_young() on a
> non-present PTE, which is not really defined to have semantic.
> 
> Note that we could currently only run into this case with
> device-exclusive entries on THPs. We still adjust the mapcount on
> conversion to device-exclusive; this makes the rmap walk
> abort early for small folios, because we'll always have
> !folio_mapped() with a single device-exclusive entry. We'll adjust the
> mapcount logic once all page_vma_mapped_walk() users can properly
> handle device-exclusive entries.
> 
> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 15/17] mm/damon: handle device-exclusive entries correctly in damon_folio_mkold_one()
  2025-02-10 19:37 ` [PATCH v2 15/17] mm/damon: handle device-exclusive entries correctly in damon_folio_mkold_one() David Hildenbrand
@ 2025-02-11  7:00   ` SeongJae Park
  0 siblings, 0 replies; 31+ messages in thread
From: SeongJae Park @ 2025-02-11  7:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: SeongJae Park, linux-kernel, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	Pasha Tatashin, Peter Xu, Alistair Popple, Jason Gunthorpe

On Mon, 10 Feb 2025 20:37:57 +0100 David Hildenbrand <david@redhat.com> wrote:

> Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
> we can return with a device-exclusive entry from page_vma_mapped_walk().
> 
> damon_folio_mkold_one() is not prepared for that and calls
> damon_ptep_mkold() with PFN swap PTEs. Teach damon_ptep_mkold() to deal
> with these PFN swap PTEs. Note that device-private entries are so far not
> applicable on that path, as damon_get_folio() filters out non-lru
> folios.
> 
> Should we just skip PFN swap PTEs completely? Possible, but it seems
> straight forward to just handle it correctly.
> 
> Note that we could currently only run into this case with
> device-exclusive entries on THPs. We still adjust the mapcount on
> conversion to device-exclusive; this makes the rmap walk
> abort early for small folios, because we'll always have
> !folio_mapped() with a single device-exclusive entry. We'll adjust the
> mapcount logic once all page_vma_mapped_walk() users can properly
> handle device-exclusive entries.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-02-11  5:00   ` Andrew Morton
@ 2025-02-11  8:33     ` David Hildenbrand
  2025-02-17  0:01       ` Alistair Popple
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-02-11  8:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, Simona Vetter

On 11.02.25 06:00, Andrew Morton wrote:
> On Mon, 10 Feb 2025 20:37:45 +0100 David Hildenbrand <david@redhat.com> wrote:
> 
>> The single "real" user in the tree of make_device_exclusive_range() always
>> requests making only a single address exclusive. The current implementation
>> is hard to fix for properly supporting anonymous THP / large folios and
>> for avoiding messing with rmap walks in weird ways.
>>
>> So let's always process a single address/page and return folio + page to
>> minimize page -> folio lookups. This is a preparation for further
>> changes.
>>
>> Reject any non-anonymous or hugetlb folios early, directly after GUP.
>>
>> While at it, extend the documentation of make_device_exclusive() to
>> clarify some things.
> 
> x86_64 allmodconfig:
> 
> drivers/gpu/drm/nouveau/nouveau_svm.c: In function 'nouveau_atomic_range_fault':
> drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: error: 'folio' undeclared (first use in this function)
>    612 |                 page = make_device_exclusive(mm, start, drm->dev, &folio);
>        |                                                                    ^~~~~
> drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: note: each undeclared identifier is reported only once for each function it appears in

Ah! That's because I was carrying, on the same branch, the SVM fixes [1]
that have been getting surprisingly little attention so far.


The following sorts it out for now:

 From 337c68bf24af59f36477be11ea6ef7c7ce9aa8ae Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 11 Feb 2025 09:33:04 +0100
Subject: [PATCH] merge

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  drivers/gpu/drm/nouveau/nouveau_svm.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 39e3740980bb7..1fed638b9eba8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -590,6 +590,7 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
  	unsigned long timeout =
  		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
  	struct mm_struct *mm = svmm->notifier.mm;
+	struct folio *folio;
  	struct page *page;
  	unsigned long start = args->p.addr;
  	unsigned long notifier_seq;
-- 
2.48.1


I'll resend [1] once this stuff here has landed.

Let me know if you want a full resend of this series, thanks.


[1] https://lkml.kernel.org/r/20250124181524.3584236-1-david@redhat.com
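
For reference, a minimal sketch of the caller side with the converted
interface. It assumes (as the -EBUSY patch later in this series suggests)
that failures are reported via ERR_PTR(), and that the folio comes back
locked and referenced so the caller must unlock and put it; the error
handling below is illustrative only, not taken from the nouveau code:

	struct folio *folio;
	struct page *page;

	page = make_device_exclusive(mm, start, drm->dev, &folio);
	if (IS_ERR(page))
		return PTR_ERR(page);	/* e.g. -EBUSY; caller may retry */

	/* ... program the device-side atomic mapping ... */

	folio_unlock(folio);	/* assumption: returned locked + referenced */
	folio_put(folio);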

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 13/17] mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one()
  2025-02-10 19:37 ` [PATCH v2 13/17] mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one() David Hildenbrand
@ 2025-02-11 20:48   ` SeongJae Park
  0 siblings, 0 replies; 31+ messages in thread
From: SeongJae Park @ 2025-02-11 20:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: SeongJae Park, linux-kernel, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	Pasha Tatashin, Peter Xu, Alistair Popple, Jason Gunthorpe

On Mon, 10 Feb 2025 20:37:55 +0100 David Hildenbrand <david@redhat.com> wrote:

> Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
> we can return with a device-exclusive entry from page_vma_mapped_walk().
> 
> page_idle_clear_pte_refs_one() is not prepared for that, so let's
> teach it what to do with these PFN swap PTEs. Note that device-private
> entries are so far not applicable on that path, as page_idle_get_folio()
> filters out non-lru folios.
> 
> Should we just skip PFN swap PTEs completely? Possible, but it seems
> straightforward to just handle them correctly.
> 
> Note that we could currently only run into this case with
> device-exclusive entries on THPs. We still adjust the mapcount on
> conversion to device-exclusive; this makes the rmap walk
> abort early for small folios, because we'll always have
> !folio_mapped() with a single device-exclusive entry. We'll adjust the
> mapcount logic once all page_vma_mapped_walk() users can properly
> handle device-exclusive entries.
> 
> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]
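
As a rough illustration of what "handle them correctly" means for a
page_vma_mapped_walk() user (a sketch under assumptions, not code from the
patch): instead of assuming a present PTE, the walker body can also resolve
the page behind a device-exclusive PFN swap PTE. The helpers below exist in
the kernel; the surrounding control flow is made up for the example:

	pte_t pteval = ptep_get(pvmw.pte);
	struct page *page;

	if (pte_present(pteval)) {
		page = pte_page(pteval);
	} else {
		swp_entry_t entry = pte_to_swp_entry(pteval);

		/* device-exclusive entries are non-present PFN swap PTEs */
		if (!is_device_exclusive_entry(entry))
			continue;	/* nothing else expected on this path */
		page = pfn_swap_entry_to_page(entry);
	}
	/* from here on, treat 'page' like any other mapped page */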

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
  2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (17 preceding siblings ...)
  2025-02-10 23:05 ` [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) Andrew Morton
@ 2025-02-13 11:03 ` Alistair Popple
  2025-02-13 11:15   ` David Hildenbrand
  18 siblings, 1 reply; 31+ messages in thread
From: Alistair Popple @ 2025-02-13 11:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Jason Gunthorpe

On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
> Against mm-hotfixes-stable for now.
> 
> Discussing the PageTail() call in make_device_exclusive_range() with
> Willy, I recently discovered [1] that device-exclusive handling does
> not properly work with THP, making the hmm-tests selftests fail if THPs
> are enabled on the system.
> 
> Looking into more details, I found that hugetlb is not properly fenced,
> and I realized that something that was bugging me for longer -- how
> device-exclusive entries interact with mapcounts -- completely breaks
> migration/swapout/split/hwpoison handling of these folios while they have
> device-exclusive PTEs.
> 
> The program below can be used to allocate 1 GiB worth of pages and
> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> 
> Once they are device-exclusive, these folios cannot get swapped out
> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> much one forces memory reclaim), and when having a memory block onlined
> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> failed migration of a page that should be movable.
> 
> # echo offline > /sys/devices/system/memory/memory136/state
> # echo online_movable > /sys/devices/system/memory/memory136/state
> # ./hmm-swap &
> ... wait until everything is device-exclusive
> # echo offline > /sys/devices/system/memory/memory136/state
> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>   index:0x7f20671f7 pfn:0x442b6a
> [  285.196618][T14882] memcg:ffff888179298000
> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>   dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> [  285.201734][T14882] raw: ...
> [  285.204464][T14882] raw: ...
> [  285.207196][T14882] page dumped because: migration failure
> [  285.209072][T14882] page_owner tracks the page as allocated
> [  285.210915][T14882] page last allocated via order 0, migratetype
>   Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>   id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> ...
> 
> This series fixes all issues I found so far. There is no easy way to fix
> without a bigger rework/cleanup. I have a bunch of cleanups on top (some
> previous sent, some the result of the discussion in v1) that I will send
> out separately once this landed and I get to it.
> I wish we could just use some special present PROT_NONE PTEs instead of

First off, David, thanks for finding and fixing these issues. If you have further
clean-ups in mind that you need help with, please let me know, as I'd be happy
to help.

> these (non-present, non-none) fake-swap entries; but that just results in
> the same problem we keep having (lack of spare PTE bits), and staring at
> other similar fake-swap entries, that ship has sailed.
> 
> With this series, make_device_exclusive() doesn't actually belong into
> mm/rmap.c anymore, but I'll leave moving that for another day.
> 
> I only tested this series with the hmm-tests selftests due to lack of HW,
> so I'd appreciate some testing, especially if the interaction between
> two GPUs wanting a device-exclusive entry works as expected.

I'm still reviewing the series, but so far testing on my single GPU system
shows everything working as expected. I will try to fire up a dual GPU system
tomorrow and test it there as well.

 - Alistair

> <program>
> #include <stdio.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <linux/types.h>
> #include <linux/ioctl.h>
> 
> #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
> 
> struct hmm_dmirror_cmd {
> 	__u64 addr;
> 	__u64 ptr;
> 	__u64 npages;
> 	__u64 cpages;
> 	__u64 faults;
> };
> 
> const size_t size = 1 * 1024 * 1024 * 1024ul;
> const size_t chunk_size = 2 * 1024 * 1024ul;
> 
> int main(void)
> {
> 	struct hmm_dmirror_cmd cmd;
> 	size_t cur_size;
> 	int fd, ret;
> 	char *addr, *mirror;
> 
> 	fd = open("/dev/hmm_dmirror1", O_RDWR, 0);
> 	if (fd < 0) {
> 		perror("open failed\n");
> 		exit(1);
> 	}
> 
> 	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> 		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if (addr == MAP_FAILED) {
> 		perror("mmap failed\n");
> 		exit(1);
> 	}
> 	madvise(addr, size, MADV_NOHUGEPAGE);
> 	memset(addr, 1, size);
> 
> 	mirror = malloc(chunk_size);
> 
> 	for (cur_size = 0; cur_size < size; cur_size += chunk_size) {
> 		cmd.addr = (uintptr_t)addr + cur_size;
> 		cmd.ptr = (uintptr_t)mirror;
> 		cmd.npages = chunk_size / getpagesize();
> 		ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd);
> 		if (ret) {
> 			perror("ioctl failed\n");
> 			exit(1);
> 		}
> 	}
> 	pause();
> 	return 0;
> }
> </program>
> 
> [1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Alex Shi <alexs@kernel.org>
> Cc: Yanteng Si <si.yanteng@linux.dev>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> 
> v1 -> v2:
>  * "mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()"
>   -> Fix and simplify return value handling when calling dmirror_atomic_map()
>   -> Fix parameter order when calling make_device_exclusive()
>   [both things were fixed by the separate cleanups I previously sent, realized
>    it when re-testing the fixes here only]
>   -> Heavily extend documentation of make_device_exclusive()
>  * "mm/rmap: implement make_device_exclusive() using folio_walk instead of
>     rmap walk"
>   -> Keep MMU_NOTIFY_EXCLUSIVE, and update comments/description
>  * "mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()"
>   -> Handle PageHWPoison with device-private pages differently
>  * Added a bunch of "handle device-exclusive entries correctly" fixes,
>    now handling all page_vma_mapped_walk() callers correctly
>  * Added "mm/rmap: avoid -EBUSY from make_device_exclusive()" to fix some
>    hmm selftest failures I saw while testing under memory pressure
>  * Plenty of comment/description updates and improvements
> 
> David Hildenbrand (17):
>   mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
>   mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
>   mm/rmap: convert make_device_exclusive_range() to
>     make_device_exclusive()
>   mm/rmap: implement make_device_exclusive() using folio_walk instead of
>     rmap walk
>   mm/memory: detect writability in restore_exclusive_pte() through
>     can_change_pte_writable()
>   mm: use single SWP_DEVICE_EXCLUSIVE entry type
>   mm/page_vma_mapped: device-exclusive entries are not migration entries
>   kernel/events/uprobes: handle device-exclusive entries correctly in
>     __replace_page()
>   mm/ksm: handle device-exclusive entries correctly in
>     write_protect_page()
>   mm/rmap: handle device-exclusive entries correctly in
>     try_to_unmap_one()
>   mm/rmap: handle device-exclusive entries correctly in
>     try_to_migrate_one()
>   mm/rmap: handle device-exclusive entries correctly in
>     page_vma_mkclean_one()
>   mm/page_idle: handle device-exclusive entries correctly in
>     page_idle_clear_pte_refs_one()
>   mm/damon: handle device-exclusive entries correctly in
>     damon_folio_young_one()
>   mm/damon: handle device-exclusive entries correctly in
>     damon_folio_mkold_one()
>   mm/rmap: keep mapcount untouched for device-exclusive entries
>   mm/rmap: avoid -EBUSY from make_device_exclusive()
> 
>  Documentation/mm/hmm.rst                    |   2 +-
>  Documentation/translations/zh_CN/mm/hmm.rst |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.c       |   5 +-
>  include/linux/mmu_notifier.h                |   2 +-
>  include/linux/rmap.h                        |   5 +-
>  include/linux/swap.h                        |   7 +-
>  include/linux/swapops.h                     |  27 +-
>  kernel/events/uprobes.c                     |  13 +-
>  lib/test_hmm.c                              |  41 +-
>  mm/damon/ops-common.c                       |  23 +-
>  mm/damon/paddr.c                            |  10 +-
>  mm/gup.c                                    |   3 +
>  mm/ksm.c                                    |   9 +-
>  mm/memory.c                                 |  28 +-
>  mm/mprotect.c                               |   8 -
>  mm/page_idle.c                              |   9 +-
>  mm/page_table_check.c                       |   5 +-
>  mm/page_vma_mapped.c                        |   3 +-
>  mm/rmap.c                                   | 469 +++++++++-----------
>  19 files changed, 315 insertions(+), 356 deletions(-)
> 
> 
> base-commit: e5b2a356dc8a88708d97bd47cca3b8f7ed7af6cb
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
  2025-02-13 11:03 ` Alistair Popple
@ 2025-02-13 11:15   ` David Hildenbrand
  2025-02-14  1:25     ` Alistair Popple
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-02-13 11:15 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Jason Gunthorpe

On 13.02.25 12:03, Alistair Popple wrote:
> On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
>> Against mm-hotfixes-stable for now.
>>
>> Discussing the PageTail() call in make_device_exclusive_range() with
>> Willy, I recently discovered [1] that device-exclusive handling does
>> not properly work with THP, making the hmm-tests selftests fail if THPs
>> are enabled on the system.
>>
>> Looking into more details, I found that hugetlb is not properly fenced,
>> and I realized that something that was bugging me for longer -- how
>> device-exclusive entries interact with mapcounts -- completely breaks
>> migration/swapout/split/hwpoison handling of these folios while they have
>> device-exclusive PTEs.
>>
>> The program below can be used to allocate 1 GiB worth of pages and
>> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
>>
>> Once they are device-exclusive, these folios cannot get swapped out
>> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
>> much one forces memory reclaim), and when having a memory block onlined
>> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
>> failed migration of a page that should be movable.
>>
>> # echo offline > /sys/devices/system/memory/memory136/state
>> # echo online_movable > /sys/devices/system/memory/memory136/state
>> # ./hmm-swap &
>> ... wait until everything is device-exclusive
>> # echo offline > /sys/devices/system/memory/memory136/state
>> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>>    index:0x7f20671f7 pfn:0x442b6a
>> [  285.196618][T14882] memcg:ffff888179298000
>> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>>    dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
>> [  285.201734][T14882] raw: ...
>> [  285.204464][T14882] raw: ...
>> [  285.207196][T14882] page dumped because: migration failure
>> [  285.209072][T14882] page_owner tracks the page as allocated
>> [  285.210915][T14882] page last allocated via order 0, migratetype
>>    Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>>    id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
>> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
>> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
>> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
>> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
>> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
>> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
>> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
>> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
>> ...
>>
>> This series fixes all issues I found so far. There is no easy way to fix
>> without a bigger rework/cleanup. I have a bunch of cleanups on top (some
>> previous sent, some the result of the discussion in v1) that I will send
>> out separately once this landed and I get to it.
>> I wish we could just use some special present PROT_NONE PTEs instead of
> 
> First off David thanks for finding and fixing these issues. If you have further
> clean-ups in mind that you need help with please let me know as I'd be happy
> to help.

Sure! I have some cleanups TBD as a result of the previous discussion, but
nothing bigger so far.

(removing the folio lock could be considered bigger, if we want to go 
down that path)

> 
>> these (non-present, non-none) fake-swap entries; but that just results in
>> the same problem we keep having (lack of spare PTE bits), and staring at
>> other similar fake-swap entries, that ship has sailed.
>>
>> With this series, make_device_exclusive() doesn't actually belong into
>> mm/rmap.c anymore, but I'll leave moving that for another day.
>>
>> I only tested this series with the hmm-tests selftests due to lack of HW,
>> so I'd appreciate some testing, especially if the interaction between
>> two GPUs wanting a device-exclusive entry works as expected.
> 
> I'm still reviewing the series but so far testing on my single GPU system
> appears to be working as expected. I will try and fire up a dual GPU system
> tomorrow and test it there as well.

Great, thanks a bunch for testing!

Out of interest: does the nvidia driver make use of this interface as 
well, and are you testing with that or with the nouveau driver? I saw 
some reports that nvidia at least checks for it [1] when building the 
module:

	CONFTEST: make_device_exclusive_range

[1] 
https://www.googlecloudcommunity.com/gc/AI-ML/Can-t-Install-Nvidia-Drivers-on-6-1-0-18-Kernel/m-p/722596

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
  2025-02-13 11:15   ` David Hildenbrand
@ 2025-02-14  1:25     ` Alistair Popple
  2025-02-14 10:37       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Alistair Popple @ 2025-02-14  1:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Jason Gunthorpe

On Thu, Feb 13, 2025 at 12:15:58PM +0100, David Hildenbrand wrote:
> On 13.02.25 12:03, Alistair Popple wrote:
> > On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
> > > Against mm-hotfixes-stable for now.
> > > 
> > > Discussing the PageTail() call in make_device_exclusive_range() with
> > > Willy, I recently discovered [1] that device-exclusive handling does
> > > not properly work with THP, making the hmm-tests selftests fail if THPs
> > > are enabled on the system.
> > > 
> > > Looking into more details, I found that hugetlb is not properly fenced,
> > > and I realized that something that was bugging me for longer -- how
> > > device-exclusive entries interact with mapcounts -- completely breaks
> > > migration/swapout/split/hwpoison handling of these folios while they have
> > > device-exclusive PTEs.
> > > 
> > > The program below can be used to allocate 1 GiB worth of pages and
> > > making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> > > 
> > > Once they are device-exclusive, these folios cannot get swapped out
> > > (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> > > much one forces memory reclaim), and when having a memory block onlined
> > > to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> > > failed migration of a page that should be movable.
> > > 
> > > # echo offline > /sys/devices/system/memory/memory136/state
> > > # echo online_movable > /sys/devices/system/memory/memory136/state
> > > # ./hmm-swap &
> > > ... wait until everything is device-exclusive
> > > # echo offline > /sys/devices/system/memory/memory136/state
> > > [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
> > >    index:0x7f20671f7 pfn:0x442b6a
> > > [  285.196618][T14882] memcg:ffff888179298000
> > > [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
> > >    dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> > > [  285.201734][T14882] raw: ...
> > > [  285.204464][T14882] raw: ...
> > > [  285.207196][T14882] page dumped because: migration failure
> > > [  285.209072][T14882] page_owner tracks the page as allocated
> > > [  285.210915][T14882] page last allocated via order 0, migratetype
> > >    Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
> > >    id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> > > [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> > > [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> > > [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> > > [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> > > [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> > > [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> > > [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> > > [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> > > ...
> > > 
> > > This series fixes all issues I found so far. There is no easy way to fix
> > > without a bigger rework/cleanup. I have a bunch of cleanups on top (some
> > > previous sent, some the result of the discussion in v1) that I will send
> > > out separately once this landed and I get to it.
> > > I wish we could just use some special present PROT_NONE PTEs instead of

Yeah, that was my initial instinct when I first investigated this. As you point
out, a lack of spare PTE bits made it hard/impossible. Of course I'm about to
give you all one back; maybe I should keep it :) I'm only kidding though - I'm
sure there are more interesting things to spend it on.

> > 
> > First off David thanks for finding and fixing these issues. If you have further
> > clean-ups in mind that you need help with please let me know as I'd be happy
> > to help.
> 
> Sure! I have some cleanups TBD as result of the previous discussion, but
> nothing bigger so far.
> 
> (removing the folio lock could be considered bigger, if we want to go down
> that path)
> 
> > 
> > > these (non-present, non-none) fake-swap entries; but that just results in
> > > the same problem we keep having (lack of spare PTE bits), and staring at
> > > other similar fake-swap entries, that ship has sailed.
> > > 
> > > With this series, make_device_exclusive() doesn't actually belong into
> > > mm/rmap.c anymore, but I'll leave moving that for another day.
> > > 
> > > I only tested this series with the hmm-tests selftests due to lack of HW,
> > > so I'd appreciate some testing, especially if the interaction between
> > > two GPUs wanting a device-exclusive entry works as expected.
> > 
> > I'm still reviewing the series but so far testing on my single GPU system
> > appears to be working as expected. I will try and fire up a dual GPU system
> > tomorrow and test it there as well.
> 
> Great, thanks a bunch for testing!
> 
> Out of interest: does the nvidia driver make use of this interface as well,
> and are you testing with that or with the nouveau driver? I saw some reports
> that nvidia at least checks for it [1] when building the module:

Both. I have tested Nouveau with the Mesa OpenCL stack and a simple stress test
that just thrashes atomic accesses between the CPU and GPU, plus a similar test
for the nvidia driver.

In practice the nvidia driver probably doesn't use this that often, as it
migrates data more aggressively, but it does use it as a fallback. It's also
possible for users to force residency on the CPU, in which case this path is
used; that is what the test does.

Anyway I have just finished testing on a multi-GPU setup so please feel free to
add for the series:

Tested-by: Alistair Popple <apopple@nvidia.com>

> 
> 	CONFTEST: make_device_exclusive_range
> 
> [1] https://www.googlecloudcommunity.com/gc/AI-ML/Can-t-Install-Nvidia-Drivers-on-6-1-0-18-Kernel/m-p/722596
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
  2025-02-14  1:25     ` Alistair Popple
@ 2025-02-14 10:37       ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-14 10:37 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	linux-trace-kernel, linux-perf-users, damon, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Jason Gunthorpe

On 14.02.25 02:25, Alistair Popple wrote:
> On Thu, Feb 13, 2025 at 12:15:58PM +0100, David Hildenbrand wrote:
>> On 13.02.25 12:03, Alistair Popple wrote:
>>> On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
>>>> Against mm-hotfixes-stable for now.
>>>>
>>>> Discussing the PageTail() call in make_device_exclusive_range() with
>>>> Willy, I recently discovered [1] that device-exclusive handling does
>>>> not properly work with THP, making the hmm-tests selftests fail if THPs
>>>> are enabled on the system.
>>>>
>>>> Looking into more details, I found that hugetlb is not properly fenced,
>>>> and I realized that something that was bugging me for longer -- how
>>>> device-exclusive entries interact with mapcounts -- completely breaks
>>>> migration/swapout/split/hwpoison handling of these folios while they have
>>>> device-exclusive PTEs.
>>>>
>>>> The program below can be used to allocate 1 GiB worth of pages and
>>>> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
>>>>
>>>> Once they are device-exclusive, these folios cannot get swapped out
>>>> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
>>>> much one forces memory reclaim), and when having a memory block onlined
>>>> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
>>>> failed migration of a page that should be movable.
>>>>
>>>> # echo offline > /sys/devices/system/memory/memory136/state
>>>> # echo online_movable > /sys/devices/system/memory/memory136/state
>>>> # ./hmm-swap &
>>>> ... wait until everything is device-exclusive
>>>> # echo offline > /sys/devices/system/memory/memory136/state
>>>> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>>>>     index:0x7f20671f7 pfn:0x442b6a
>>>> [  285.196618][T14882] memcg:ffff888179298000
>>>> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>>>>     dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
>>>> [  285.201734][T14882] raw: ...
>>>> [  285.204464][T14882] raw: ...
>>>> [  285.207196][T14882] page dumped because: migration failure
>>>> [  285.209072][T14882] page_owner tracks the page as allocated
>>>> [  285.210915][T14882] page last allocated via order 0, migratetype
>>>>     Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>>>>     id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
>>>> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
>>>> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
>>>> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
>>>> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
>>>> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
>>>> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
>>>> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
>>>> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
>>>> ...
>>>>
>>>> This series fixes all issues I found so far. There is no easy way to fix
>>>> without a bigger rework/cleanup. I have a bunch of cleanups on top (some
>>>> previous sent, some the result of the discussion in v1) that I will send
>>>> out separately once this landed and I get to it.
>>>> I wish we could just use some special present PROT_NONE PTEs instead of
> 
> Yeah, that was my initial instinct when I first investigated this. As you point
> out a lack of spare PTE bits made it hard/impossible. Of course I'm about to
> give you all one back, maybe I should keep it :) I'm only kidding though - I'm
> sure there's more interesting things to spend it on.

Yes. And I think it could actually be valuable to have the option for 
more fake-prot-none things.

For example, right now we cannot really distinguish NUMA-hinting 
prot-none from ordinary prot-none without guessing based on some VMA flags.

One could implement NUMA hinting using a PFN swap entry in an 
arch-independent way, I guess.

So there are pros and cons to it. The biggest con is that, while RMAP 
can now handle it, other page table walkers mostly skip these entries.
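
For illustration (a generic walker pattern, not a specific call site): most
page table walkers bail out on anything that is not a present PTE, so a
fake-swap entry is simply ignored there, e.g.

	pte_t pteval = ptep_get(pte);

	if (!pte_present(pteval))
		continue;	/* device-exclusive and other PFN swap entries skipped */

which is fine for device-exclusive entries today, but such walkers would have
to be taught about any new fake-prot-none use like NUMA hinting.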

> 
>>>
>>> First off David thanks for finding and fixing these issues. If you have further
>>> clean-ups in mind that you need help with please let me know as I'd be happy
>>> to help.
>>
>> Sure! I have some cleanups TBD as result of the previous discussion, but
>> nothing bigger so far.
>>
>> (removing the folio lock could be considered bigger, if we want to go down
>> that path)
>>
>>>
>>>> these (non-present, non-none) fake-swap entries; but that just results in
>>>> the same problem we keep having (lack of spare PTE bits), and staring at
>>>> other similar fake-swap entries, that ship has sailed.
>>>>
>>>> With this series, make_device_exclusive() doesn't actually belong into
>>>> mm/rmap.c anymore, but I'll leave moving that for another day.
>>>>
>>>> I only tested this series with the hmm-tests selftests due to lack of HW,
>>>> so I'd appreciate some testing, especially if the interaction between
>>>> two GPUs wanting a device-exclusive entry works as expected.
>>>
>>> I'm still reviewing the series but so far testing on my single GPU system
>>> appears to be working as expected. I will try and fire up a dual GPU system
>>> tomorrow and test it there as well.
>>
>> Great, thanks a bunch for testing!
>>
>> Out of interest: does the nvidia driver make use of this interface as well,
>> and are you testing with that or with the nouveau driver? I saw some reports
>> that nvidia at least checks for it [1] when building the module:
> 
> Both. I have tested Nouveau with the Mesa OpenCL stack and a simple stress test
> that just thrashes atomic accesses between CPU and GPU and a similar test for
> the nvidia driver.
> 
> In practice the nvidia driver probably doesn't use this that often as it
> more aggressively migrates data but it does use this as a fallback. Also it's
> possible for users to force residency on the CPU in which case this is used,
> which is what the test does.

Cool, thanks! (So even though nouveau is not enabled in RHEL, we'd 
effectively be using that functionality in RHEL kernels via the nvidia 
driver.)

> 
> Anyway I have just finished testing on a multi-GPU setup so please feel free to
> add for the series:
> 
> Tested-by: Alistair Popple <apopple@nvidia.com>

Thanks a bunch!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-02-11  8:33     ` David Hildenbrand
@ 2025-02-17  0:01       ` Alistair Popple
  2025-02-17  9:32         ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Alistair Popple @ 2025-02-17  0:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, linux-trace-kernel, linux-perf-users, damon,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Jason Gunthorpe,
	Simona Vetter

On Tue, Feb 11, 2025 at 09:33:54AM +0100, David Hildenbrand wrote:
> On 11.02.25 06:00, Andrew Morton wrote:
> > On Mon, 10 Feb 2025 20:37:45 +0100 David Hildenbrand <david@redhat.com> wrote:
> > 
> > > The single "real" user in the tree of make_device_exclusive_range() always
> > > requests making only a single address exclusive. The current implementation
> > > is hard to fix for properly supporting anonymous THP / large folios and
> > > for avoiding messing with rmap walks in weird ways.
> > > 
> > > So let's always process a single address/page and return folio + page to
> > > minimize page -> folio lookups. This is a preparation for further
> > > changes.
> > > 
> > > Reject any non-anonymous or hugetlb folios early, directly after GUP.
> > > 
> > > While at it, extend the documentation of make_device_exclusive() to
> > > clarify some things.
> > 
> > x86_64 allmodconfig:
> > 
> > drivers/gpu/drm/nouveau/nouveau_svm.c: In function 'nouveau_atomic_range_fault':
> > drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: error: 'folio' undeclared (first use in this function)
> >    612 |                 page = make_device_exclusive(mm, start, drm->dev, &folio);
> >        |                                                                    ^~~~~
> > drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: note: each undeclared identifier is reported only once for each function it appears in
> 
> Ah! Because I was carrying on the same branch SVM fixes [1] that are
> getting surprisingly little attention so far.

I believe this has been picked up in drm-misc-fixes now:

https://lore.kernel.org/dri-devel/Z69eloo_7LM6NneO@cassiopeiae/

> 
> 
> The following sorts it out for now:
> 
> From 337c68bf24af59f36477be11ea6ef7c7ce9aa8ae Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 11 Feb 2025 09:33:04 +0100
> Subject: [PATCH] merge
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_svm.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index 39e3740980bb7..1fed638b9eba8 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -590,6 +590,7 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
>  	unsigned long timeout =
>  		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>  	struct mm_struct *mm = svmm->notifier.mm;
> +	struct folio *folio;
>  	struct page *page;
>  	unsigned long start = args->p.addr;
>  	unsigned long notifier_seq;
> -- 
> 2.48.1
> 
> 
> I'll resend [1] once this stuff here landed.
> 
> Let me know if you want a full resend of this series, thanks.
> 
> 
> [1] https://lkml.kernel.org/r/20250124181524.3584236-1-david@redhat.com
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-02-17  0:01       ` Alistair Popple
@ 2025-02-17  9:32         ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-02-17  9:32 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, linux-trace-kernel, linux-perf-users, damon,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra,
	SeongJae Park, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Jason Gunthorpe,
	Simona Vetter

On 17.02.25 01:01, Alistair Popple wrote:
> On Tue, Feb 11, 2025 at 09:33:54AM +0100, David Hildenbrand wrote:
>> On 11.02.25 06:00, Andrew Morton wrote:
>>> On Mon, 10 Feb 2025 20:37:45 +0100 David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> The single "real" user in the tree of make_device_exclusive_range() always
>>>> requests making only a single address exclusive. The current implementation
>>>> is hard to fix for properly supporting anonymous THP / large folios and
>>>> for avoiding messing with rmap walks in weird ways.
>>>>
>>>> So let's always process a single address/page and return folio + page to
>>>> minimize page -> folio lookups. This is a preparation for further
>>>> changes.
>>>>
>>>> Reject any non-anonymous or hugetlb folios early, directly after GUP.
>>>>
>>>> While at it, extend the documentation of make_device_exclusive() to
>>>> clarify some things.
>>>
>>> x86_64 allmodconfig:
>>>
>>> drivers/gpu/drm/nouveau/nouveau_svm.c: In function 'nouveau_atomic_range_fault':
>>> drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: error: 'folio' undeclared (first use in this function)
>>>     612 |                 page = make_device_exclusive(mm, start, drm->dev, &folio);
>>>         |                                                                    ^~~~~
>>> drivers/gpu/drm/nouveau/nouveau_svm.c:612:68: note: each undeclared identifier is reported only once for each function it appears in
>>
>> Ah! Because I was carrying on the same branch SVM fixes [1] that are
>> getting surprisingly little attention so far.
> 
> I believe this has been picked up in drm-misc-fixes now:
> 
> https://lore.kernel.org/dri-devel/Z69eloo_7LM6NneO@cassiopeiae/

Yes. Both trees should merge without conflicts. However, we can later 
get rid of the now-superfluous page_folio() that was required in the drm 
fix.
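
To make the remark concrete (a sketch under assumptions, not the actual drm
patch): the fix in drm-misc-fixes presumably derives the folio from the
returned page itself, along the lines of

	/* assumed shape of the drm fix */
	folio = page_folio(page);

whereas make_device_exclusive() from this series already hands the folio back
through its output parameter, so that extra lookup can be dropped once both
trees have merged.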

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-02-17  9:32 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-10 19:37 [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 01/17] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 02/17] mm/rmap: reject hugetlb folios in folio_make_device_exclusive() David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 03/17] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
2025-02-11  5:00   ` Andrew Morton
2025-02-11  8:33     ` David Hildenbrand
2025-02-17  0:01       ` Alistair Popple
2025-02-17  9:32         ` David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 04/17] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 05/17] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable() David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 06/17] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 07/17] mm/page_vma_mapped: device-exclusive entries are not migration entries David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 08/17] kernel/events/uprobes: handle device-exclusive entries correctly in __replace_page() David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 09/17] mm/ksm: handle device-exclusive entries correctly in write_protect_page() David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 10/17] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one() David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 11/17] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one() David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 12/17] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one() David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 13/17] mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one() David Hildenbrand
2025-02-11 20:48   ` SeongJae Park
2025-02-10 19:37 ` [PATCH v2 14/17] mm/damon: handle device-exclusive entries correctly in damon_folio_young_one() David Hildenbrand
2025-02-11  6:59   ` SeongJae Park
2025-02-10 19:37 ` [PATCH v2 15/17] mm/damon: handle device-exclusive entries correctly in damon_folio_mkold_one() David Hildenbrand
2025-02-11  7:00   ` SeongJae Park
2025-02-10 19:37 ` [PATCH v2 16/17] mm/rmap: keep mapcount untouched for device-exclusive entries David Hildenbrand
2025-02-10 19:37 ` [PATCH v2 17/17] mm/rmap: avoid -EBUSY from make_device_exclusive() David Hildenbrand
2025-02-10 23:05 ` [PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm) Andrew Morton
2025-02-10 23:39   ` Barry Song
2025-02-13 11:03 ` Alistair Popple
2025-02-13 11:15   ` David Hildenbrand
2025-02-14  1:25     ` Alistair Popple
2025-02-14 10:37       ` David Hildenbrand
