linux-perf-users.vger.kernel.org archive mirror
* [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
@ 2025-03-04 15:48 David Hildenbrand
  2025-03-04 15:48 ` [PATCH -next v1 1/3] kernel/events/uprobes: pass VMA instead of MM to remove_breakpoint() David Hildenbrand
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-03-04 15:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	David Hildenbrand, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

Based on -next, because a related fix [1] is not in mm.git. This is
the follow-up to [2] (later than I wanted to send it out), now that
Willy also stumbled over this [3].

Since the RFC, I rewrote it once again, now using a folio_walk instead of
our old pagewalk infrastructure.


Currently, uprobe_write_opcode() implements COW-breaking manually, which is
really far from ideal. Further, there is interest in supporting uprobes on
hugetlb pages [1], and leaving at least the COW-breaking to the core will
make this much easier.

Also, I think the current code doesn't really handle some things
properly (see patch #3) when replacing/zapping pages.

Let's rewrite it, to leave COW-breaking to the fault handler, and handle
registration/unregistration by temporarily unmapping the anonymous page,
modifying it, and mapping it again. We still have to implement
zapping of anonymous pages ourselves, unfortunately.

We could look into not performing the temporary unmapping if we can
perform the write atomically, which would likely also make adding hugetlb
support a lot easier. But, limited (e.g., only PMD/PUD) hugetlb support
could be added on top of this with some tweaking.

Note that we now won't have to allocate another anonymous folio when
unregistering (which will be beneficial for hugetlb as well), we can simply
modify the already-mapped one from the registration (if any). When
registering a uprobe, we'll first trigger a ptrace-like write fault to
break COW, and then modify the already-mapped page.

Briefly sanity tested with perf:
  [root@localhost ~]# perf probe -x /usr/bin/bash -a main
  ...
  [root@localhost ~]# perf record -e probe_bash:main -aR sleep 10 &
  [1] 2196
  [root@localhost ~]# bash
  [root@localhost ~]# exit
  exit
  [root@localhost ~]# bash
  [root@localhost ~]# exit
  exit
  [root@localhost ~]# [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.287 MB perf.data (8 samples) ]
  ...
  [root@localhost ~]# perf report --stdio
  # To display the perf.data header info, please use --header/--header-only optio>
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 8  of event 'probe_bash:main'
  # Event count (approx.): 8
  #
  # Overhead  Command      Shared Object  Symbol
  # ........  ...........  .............  ........
  #
      75.00%  grepconf.sh  bash           [.] main
      25.00%  bash         bash           [.] main
  ...

Are there any uprobe tests / benchmarks that are worth running?

RFC -> v1:
* Use folio_walk and simplify the logic

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Tong Tiangen <tongtiangen@huawei.com>

[1] https://lkml.kernel.org/r/20250224031149.1598949-1-tongtiangen@huawei.com
[2] https://lore.kernel.org/linux-mm/20240604122548.359952-2-david@redhat.com/T/
[3] https://lore.kernel.org/all/d7971673-19ed-448a-9e54-8ffbde5059dc@redhat.com/T/
[4] https://lkml.kernel.org/r/ZiK50qob9yl5e0Xz@bender.morinfr.org

David Hildenbrand (3):
  kernel/events/uprobes: pass VMA instead of MM to remove_breakpoint()
  kernel/events/uprobes: pass VMA to set_swbp(), set_orig_insn() and
    uprobe_write_opcode()
  kernel/events/uprobes: uprobe_write_opcode() rewrite

 arch/arm/probes/uprobes/core.c |   4 +-
 include/linux/uprobes.h        |   6 +-
 kernel/events/uprobes.c        | 363 +++++++++++++++++----------------
 3 files changed, 190 insertions(+), 183 deletions(-)


base-commit: cd3215bbcb9d4321def93fea6cfad4d5b42b9d1d
-- 
2.48.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH -next v1 1/3] kernel/events/uprobes: pass VMA instead of MM to remove_breakpoint()
  2025-03-04 15:48 [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
@ 2025-03-04 15:48 ` David Hildenbrand
  2025-03-04 15:48 ` [PATCH -next v1 2/3] kernel/events/uprobes: pass VMA to set_swbp(), set_orig_insn() and uprobe_write_opcode() David Hildenbrand
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-03-04 15:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	David Hildenbrand, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

... and remove the "MM" argument from install_breakpoint(), because it
can easily be derived from the VMA.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/events/uprobes.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8fc53813779a4..991aacc80d0e0 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1134,10 +1134,10 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
 	return ret;
 }
 
-static int
-install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
-			struct vm_area_struct *vma, unsigned long vaddr)
+static int install_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma,
+		unsigned long vaddr)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	bool first_uprobe;
 	int ret;
 
@@ -1162,9 +1162,11 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
 	return ret;
 }
 
-static int
-remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr)
+static int remove_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma,
+		unsigned long vaddr)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	set_bit(MMF_RECALC_UPROBES, &mm->flags);
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
@@ -1296,10 +1298,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 		if (is_register) {
 			/* consult only the "caller", new consumer. */
 			if (consumer_filter(new, mm))
-				err = install_breakpoint(uprobe, mm, vma, info->vaddr);
+				err = install_breakpoint(uprobe, vma, info->vaddr);
 		} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
 			if (!filter_chain(uprobe, mm))
-				err |= remove_breakpoint(uprobe, mm, info->vaddr);
+				err |= remove_breakpoint(uprobe, vma, info->vaddr);
 		}
 
  unlock:
@@ -1472,7 +1474,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
 			continue;
 
 		vaddr = offset_to_vaddr(vma, uprobe->offset);
-		err |= remove_breakpoint(uprobe, mm, vaddr);
+		err |= remove_breakpoint(uprobe, vma, vaddr);
 	}
 	mmap_read_unlock(mm);
 
@@ -1610,7 +1612,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 		if (!fatal_signal_pending(current) &&
 		    filter_chain(uprobe, vma->vm_mm)) {
 			unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
-			install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
+			install_breakpoint(uprobe, vma, vaddr);
 		}
 		put_uprobe(uprobe);
 	}
-- 
2.48.1



* [PATCH -next v1 2/3] kernel/events/uprobes: pass VMA to set_swbp(), set_orig_insn() and uprobe_write_opcode()
  2025-03-04 15:48 [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
  2025-03-04 15:48 ` [PATCH -next v1 1/3] kernel/events/uprobes: pass VMA instead of MM to remove_breakpoint() David Hildenbrand
@ 2025-03-04 15:48 ` David Hildenbrand
  2025-03-04 15:48 ` [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
  2025-03-05 15:20 ` [PATCH -next v1 0/3] " Oleg Nesterov
  3 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-03-04 15:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	David Hildenbrand, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

We already have the VMA, no need to look it up using
get_user_page_vma_remote(). We can now switch to
get_user_pages_remote().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/arm/probes/uprobes/core.c |  4 ++--
 include/linux/uprobes.h        |  6 +++---
 kernel/events/uprobes.c        | 33 +++++++++++++++++----------------
 3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c
index f5f790c6e5f89..885e0c5e8c20d 100644
--- a/arch/arm/probes/uprobes/core.c
+++ b/arch/arm/probes/uprobes/core.c
@@ -26,10 +26,10 @@ bool is_swbp_insn(uprobe_opcode_t *insn)
 		(UPROBE_SWBP_ARM_INSN & 0x0fffffff);
 }
 
-int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm,
+int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 	     unsigned long vaddr)
 {
-	return uprobe_write_opcode(auprobe, mm, vaddr,
+	return uprobe_write_opcode(auprobe, vma, vaddr,
 		   __opcode_to_mem_arm(auprobe->bpinsn));
 }
 
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a40efdda9052b..4da3bce5e062d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -186,13 +186,13 @@ struct uprobes_state {
 };
 
 extern void __init uprobes_init(void);
-extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
-extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
+extern int set_swbp(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
+extern int set_orig_insn(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
 extern bool is_swbp_insn(uprobe_opcode_t *insn);
 extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
-extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t);
 extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 991aacc80d0e0..0276defd6fbfa 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -474,19 +474,19 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  *
  * uprobe_write_opcode - write the opcode at a given virtual address.
  * @auprobe: arch specific probepoint information.
- * @mm: the probed process address space.
+ * @vma: the probed virtual memory area.
  * @vaddr: the virtual address to store the opcode.
  * @opcode: opcode to be written at @vaddr.
  *
  * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
-int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-			unsigned long vaddr, uprobe_opcode_t opcode)
+int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+		unsigned long vaddr, uprobe_opcode_t opcode)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct uprobe *uprobe;
 	struct page *old_page, *new_page;
-	struct vm_area_struct *vma;
 	int ret, is_register, ref_ctr_updated = 0;
 	bool orig_page_huge = false;
 	unsigned int gup_flags = FOLL_FORCE;
@@ -498,9 +498,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (is_register)
 		gup_flags |= FOLL_SPLIT_PMD;
 	/* Read the page with vaddr into memory */
-	old_page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
-	if (IS_ERR(old_page))
-		return PTR_ERR(old_page);
+	ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &old_page, NULL);
+	if (ret != 1)
+		return ret;
 
 	ret = verify_opcode(old_page, vaddr, &opcode);
 	if (ret <= 0)
@@ -590,30 +590,31 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 /**
  * set_swbp - store breakpoint at a given address.
  * @auprobe: arch specific probepoint information.
- * @mm: the probed process address space.
+ * @vma: the probed virtual memory area.
  * @vaddr: the virtual address to insert the opcode.
  *
  * For mm @mm, store the breakpoint instruction at @vaddr.
  * Return 0 (success) or a negative errno.
  */
-int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
+int __weak set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+		unsigned long vaddr)
 {
-	return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
+	return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN);
 }
 
 /**
  * set_orig_insn - Restore the original instruction.
- * @mm: the probed process address space.
+ * @vma: the probed virtual memory area.
  * @auprobe: arch specific probepoint information.
  * @vaddr: the virtual address to insert the opcode.
  *
  * For mm @mm, restore the original opcode (opcode) at @vaddr.
  * Return 0 (success) or a negative errno.
  */
-int __weak
-set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
+int __weak set_orig_insn(struct arch_uprobe *auprobe,
+		struct vm_area_struct *vma, unsigned long vaddr)
 {
-	return uprobe_write_opcode(auprobe, mm, vaddr,
+	return uprobe_write_opcode(auprobe, vma, vaddr,
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
@@ -1153,7 +1154,7 @@ static int install_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma,
 	if (first_uprobe)
 		set_bit(MMF_HAS_UPROBES, &mm->flags);
 
-	ret = set_swbp(&uprobe->arch, mm, vaddr);
+	ret = set_swbp(&uprobe->arch, vma, vaddr);
 	if (!ret)
 		clear_bit(MMF_RECALC_UPROBES, &mm->flags);
 	else if (first_uprobe)
@@ -1168,7 +1169,7 @@ static int remove_breakpoint(struct uprobe *uprobe, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 
 	set_bit(MMF_RECALC_UPROBES, &mm->flags);
-	return set_orig_insn(&uprobe->arch, mm, vaddr);
+	return set_orig_insn(&uprobe->arch, vma, vaddr);
 }
 
 struct map_info {
-- 
2.48.1



* [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-04 15:48 [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
  2025-03-04 15:48 ` [PATCH -next v1 1/3] kernel/events/uprobes: pass VMA instead of MM to remove_breakpoint() David Hildenbrand
  2025-03-04 15:48 ` [PATCH -next v1 2/3] kernel/events/uprobes: pass VMA to set_swbp(), set_orig_insn() and uprobe_write_opcode() David Hildenbrand
@ 2025-03-04 15:48 ` David Hildenbrand
  2025-03-05 19:30   ` Matthew Wilcox
  2025-03-10 17:03   ` Oleg Nesterov
  2025-03-05 15:20 ` [PATCH -next v1 0/3] " Oleg Nesterov
  3 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-03-04 15:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	David Hildenbrand, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

uprobe_write_opcode() does some pretty low-level things that it really
shouldn't be doing: for example, manually breaking COW by allocating
anonymous folios and replacing mapped pages.

Further, it does seem to do some shaky things: for example, writing to
possibly COW-shared anonymous pages or zapping anonymous pages that might
be pinned. We're also not taking care of uffd, uffd-wp, softdirty ...
although these are rather corner cases. Let's just get it right, like
ordinary ptrace writes do.

Let's rewrite the code, leaving COW-breaking to core-MM, triggered by
FOLL_FORCE|FOLL_WRITE (note that the code was already using FOLL_FORCE).

We'll use GUP to look up / fault in the page and break COW if required.
Then, we'll walk the page tables using a folio_walk to perform our page
modification atomically, by temporarily unmapping the PTE and flushing
the TLB.

Likely, we could avoid the temporary unmap if we can just atomically
write the instruction, but that will be a separate project.

Unfortunately, we still have to implement the zapping logic manually,
because we only want to zap in specific circumstances (e.g., page
content identical).

Note that we can now handle large folios (compound pages) and the shared
zeropage just fine, so drop these checks.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/events/uprobes.c | 316 ++++++++++++++++++++--------------------
 1 file changed, 160 insertions(+), 156 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0276defd6fbfa..4e39280f8f424 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -29,6 +29,7 @@
 #include <linux/workqueue.h>
 #include <linux/srcu.h>
 #include <linux/oom.h>          /* check_stable_address_space */
+#include <linux/pagewalk.h>
 
 #include <linux/uprobes.h>
 
@@ -151,91 +152,6 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
 	return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
 }
 
-/**
- * __replace_page - replace page in vma by new page.
- * based on replace_page in mm/ksm.c
- *
- * @vma:      vma that holds the pte pointing to page
- * @addr:     address the old @page is mapped at
- * @old_page: the page we are replacing by new_page
- * @new_page: the modified page we replace page by
- *
- * If @new_page is NULL, only unmap @old_page.
- *
- * Returns 0 on success, negative error code otherwise.
- */
-static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
-				struct page *old_page, struct page *new_page)
-{
-	struct folio *old_folio = page_folio(old_page);
-	struct folio *new_folio;
-	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, 0);
-	int err;
-	struct mmu_notifier_range range;
-	pte_t pte;
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-				addr + PAGE_SIZE);
-
-	if (new_page) {
-		new_folio = page_folio(new_page);
-		err = mem_cgroup_charge(new_folio, vma->vm_mm, GFP_KERNEL);
-		if (err)
-			return err;
-	}
-
-	/* For folio_free_swap() below */
-	folio_lock(old_folio);
-
-	mmu_notifier_invalidate_range_start(&range);
-	err = -EAGAIN;
-	if (!page_vma_mapped_walk(&pvmw))
-		goto unlock;
-	VM_BUG_ON_PAGE(addr != pvmw.address, old_page);
-	pte = ptep_get(pvmw.pte);
-
-	/*
-	 * Handle PFN swap PTES, such as device-exclusive ones, that actually
-	 * map pages: simply trigger GUP again to fix it up.
-	 */
-	if (unlikely(!pte_present(pte))) {
-		page_vma_mapped_walk_done(&pvmw);
-		goto unlock;
-	}
-
-	if (new_page) {
-		folio_get(new_folio);
-		folio_add_new_anon_rmap(new_folio, vma, addr, RMAP_EXCLUSIVE);
-		folio_add_lru_vma(new_folio, vma);
-	} else
-		/* no new page, just dec_mm_counter for old_page */
-		dec_mm_counter(mm, MM_ANONPAGES);
-
-	if (!folio_test_anon(old_folio)) {
-		dec_mm_counter(mm, mm_counter_file(old_folio));
-		inc_mm_counter(mm, MM_ANONPAGES);
-	}
-
-	flush_cache_page(vma, addr, pte_pfn(pte));
-	ptep_clear_flush(vma, addr, pvmw.pte);
-	if (new_page)
-		set_pte_at(mm, addr, pvmw.pte,
-			   mk_pte(new_page, vma->vm_page_prot));
-
-	folio_remove_rmap_pte(old_folio, old_page, vma);
-	if (!folio_mapped(old_folio))
-		folio_free_swap(old_folio);
-	page_vma_mapped_walk_done(&pvmw);
-	folio_put(old_folio);
-
-	err = 0;
- unlock:
-	mmu_notifier_invalidate_range_end(&range);
-	folio_unlock(old_folio);
-	return err;
-}
-
 /**
  * is_swbp_insn - check if instruction is breakpoint instruction.
  * @insn: instruction to be checked.
@@ -463,6 +379,105 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
 	return ret;
 }
 
+static bool orig_page_is_identical(struct vm_area_struct *vma,
+		unsigned long vaddr, struct page *page, bool *pmd_mappable)
+{
+	const pgoff_t index = vaddr_to_offset(vma, vaddr) >> PAGE_SHIFT;
+	struct page *orig_page = find_get_page(vma->vm_file->f_inode->i_mapping,
+					       index);
+	struct folio *orig_folio;
+	bool identical;
+
+	if (!orig_page)
+		return false;
+	orig_folio = page_folio(orig_page);
+
+	*pmd_mappable = folio_test_pmd_mappable(orig_folio);
+	identical = folio_test_uptodate(orig_folio) &&
+		    pages_identical(page, orig_page);
+	folio_put(orig_folio);
+	return identical;
+}
+
+static int __uprobe_write_opcode(struct vm_area_struct *vma,
+		struct folio_walk *fw, struct folio *folio,
+		unsigned long opcode_vaddr, uprobe_opcode_t opcode)
+{
+	const unsigned long vaddr = opcode_vaddr & PAGE_MASK;
+	const bool is_register = !!is_swbp_insn(&opcode);
+	bool pmd_mappable;
+
+	/* We're done if we don't find an anonymous folio when unregistering. */
+	if (!folio_test_anon(folio))
+		return is_register ? -EFAULT : 0;
+
+	/* For now, we'll only handle PTE-mapped folios. */
+	if (fw->level != FW_LEVEL_PTE)
+		return -EFAULT;
+
+	/*
+	 * See can_follow_write_pte(): we'd actually prefer a writable PTE here,
+	 * but the VMA might not be writable.
+	 */
+	if (!pte_write(fw->pte)) {
+		if (!PageAnonExclusive(fw->page))
+			return -EFAULT;
+		if (unlikely(userfaultfd_pte_wp(vma, fw->pte)))
+			return -EFAULT;
+		/* SOFTDIRTY is handled via pte_mkdirty() below. */
+	}
+
+	/*
+	 * We'll temporarily unmap the page and flush the TLB, such that we can
+	 * modify the page atomically.
+	 */
+	flush_cache_page(vma, vaddr, pte_pfn(fw->pte));
+	fw->pte = ptep_clear_flush(vma, vaddr, fw->ptep);
+
+	/* Verify that the page content is still as expected. */
+	if (verify_opcode(fw->page, opcode_vaddr, &opcode) <= 0) {
+		set_pte_at(vma->vm_mm, vaddr, fw->ptep, fw->pte);
+		return -EAGAIN;
+	}
+	copy_to_page(fw->page, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+
+	/*
+	 * When unregistering, we may only zap a PTE if uffd is disabled and
+	 * there are no unexpected folio references ...
+	 */
+	if (is_register || userfaultfd_missing(vma) ||
+	    (folio_ref_count(folio) != folio_mapcount(folio) +
+	     folio_test_swapcache(folio) * folio_nr_pages(folio)))
+		goto remap;
+
+	/*
+	 * ... and the mapped page is identical to the original page that
+	 * would get faulted in on next access.
+	 */
+	if (!orig_page_is_identical(vma, vaddr, fw->page, &pmd_mappable))
+		goto remap;
+
+	dec_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	folio_remove_rmap_pte(folio, fw->page, vma);
+	if (!folio_mapped(folio) && folio_test_swapcache(folio) &&
+	     folio_trylock(folio)) {
+		folio_free_swap(folio);
+		folio_unlock(folio);
+	}
+	folio_put(folio);
+
+	return pmd_mappable;
+remap:
+	/*
+	 * Make sure that our copy_to_page() changes become visible before the
+	 * set_pte_at() write.
+	 */
+	smp_wmb();
+	/* We modified the page. Make sure to mark the PTE dirty. */
+	set_pte_at(vma->vm_mm, vaddr, fw->ptep, pte_mkdirty(fw->pte));
+	return 0;
+}
+
 /*
  * NOTE:
  * Expect the breakpoint instruction to be the smallest size instruction for
@@ -475,116 +490,105 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  * uprobe_write_opcode - write the opcode at a given virtual address.
  * @auprobe: arch specific probepoint information.
  * @vma: the probed virtual memory area.
- * @vaddr: the virtual address to store the opcode.
- * @opcode: opcode to be written at @vaddr.
+ * @opcode_vaddr: the virtual address to store the opcode.
+ * @opcode: opcode to be written at @opcode_vaddr.
  *
  * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
-		unsigned long vaddr, uprobe_opcode_t opcode)
+		const unsigned long opcode_vaddr, uprobe_opcode_t opcode)
 {
+	const unsigned long vaddr = opcode_vaddr & PAGE_MASK;
 	struct mm_struct *mm = vma->vm_mm;
 	struct uprobe *uprobe;
-	struct page *old_page, *new_page;
 	int ret, is_register, ref_ctr_updated = 0;
-	bool orig_page_huge = false;
 	unsigned int gup_flags = FOLL_FORCE;
+	struct mmu_notifier_range range;
+	struct folio_walk fw;
+	struct folio *folio;
+	struct page *page;
 
 	is_register = is_swbp_insn(&opcode);
 	uprobe = container_of(auprobe, struct uprobe, arch);
 
-retry:
+	if (WARN_ON_ONCE(!is_cow_mapping(vma->vm_flags)))
+		return -EINVAL;
+
+	/*
+	 * When registering, we have to break COW to get an exclusive anonymous
+	 * page that we can safely modify. Use FOLL_WRITE to trigger a write
+	 * fault if required. When unregistering, we might be lucky and the
+	 * anon page is already gone. So defer write faults until really
+	 * required. Use FOLL_SPLIT_PMD, because __uprobe_write_opcode()
+	 * cannot deal with PMDs yet.
+	 */
 	if (is_register)
-		gup_flags |= FOLL_SPLIT_PMD;
-	/* Read the page with vaddr into memory */
-	ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &old_page, NULL);
+		gup_flags |= FOLL_WRITE | FOLL_SPLIT_PMD;
+
+retry:
+	ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &page, NULL);
 	if (ret != 1)
-		return ret;
+		goto out;
 
-	ret = verify_opcode(old_page, vaddr, &opcode);
+	ret = verify_opcode(page, opcode_vaddr, &opcode);
+	put_page(page);
 	if (ret <= 0)
-		goto put_old;
-
-	if (is_zero_page(old_page)) {
-		ret = -EINVAL;
-		goto put_old;
-	}
-
-	if (WARN(!is_register && PageCompound(old_page),
-		 "uprobe unregister should never work on compound page\n")) {
-		ret = -EINVAL;
-		goto put_old;
-	}
+		goto out;
 
 	/* We are going to replace instruction, update ref_ctr. */
 	if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
 		ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);
 		if (ret)
-			goto put_old;
+			goto out;
 
 		ref_ctr_updated = 1;
 	}
 
-	ret = 0;
-	if (!is_register && !PageAnon(old_page))
-		goto put_old;
-
-	ret = anon_vma_prepare(vma);
-	if (ret)
-		goto put_old;
-
-	ret = -ENOMEM;
-	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
-	if (!new_page)
-		goto put_old;
-
-	__SetPageUptodate(new_page);
-	copy_highpage(new_page, old_page);
-	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
-
 	if (!is_register) {
-		struct page *orig_page;
-		pgoff_t index;
-
-		VM_BUG_ON_PAGE(!PageAnon(old_page), old_page);
-
-		index = vaddr_to_offset(vma, vaddr & PAGE_MASK) >> PAGE_SHIFT;
-		orig_page = find_get_page(vma->vm_file->f_inode->i_mapping,
-					  index);
-
-		if (orig_page) {
-			if (PageUptodate(orig_page) &&
-			    pages_identical(new_page, orig_page)) {
-				/* let go new_page */
-				put_page(new_page);
-				new_page = NULL;
-
-				if (PageCompound(orig_page))
-					orig_page_huge = true;
-			}
-			put_page(orig_page);
-		}
+		/*
+		 * In the common case, we'll be able to zap the page when
+		 * unregistering. So trigger MMU notifiers now, as we won't
+		 * be able to do it under PTL.
+		 */
+		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+					vaddr, vaddr + PAGE_SIZE);
+		mmu_notifier_invalidate_range_start(&range);
+	}
+
+	/* Walk the page tables again, to perform the actual update. */
+	folio = folio_walk_start(&fw, vma, vaddr, 0);
+	if (folio) {
+		ret = __uprobe_write_opcode(vma, &fw, folio, opcode_vaddr,
+					    opcode);
+		folio_walk_end(&fw, vma);
+	} else {
+		ret = -EAGAIN;
 	}
 
-	ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page);
-	if (new_page)
-		put_page(new_page);
-put_old:
-	put_page(old_page);
+	if (!is_register)
+		mmu_notifier_invalidate_range_end(&range);
 
-	if (unlikely(ret == -EAGAIN))
+	switch (ret) {
+	case -EFAULT:
+		gup_flags |= FOLL_WRITE | FOLL_SPLIT_PMD;
+		fallthrough;
+	case -EAGAIN:
 		goto retry;
+	default:
+		break;
+	}
 
+out:
 	/* Revert back reference counter if instruction update failed. */
-	if (ret && is_register && ref_ctr_updated)
+	if (ret < 0 && is_register && ref_ctr_updated)
 		update_ref_ctr(uprobe, mm, -1);
 
 	/* try collapse pmd for compound page */
-	if (!ret && orig_page_huge)
+	if (ret > 0)
 		collapse_pte_mapped_thp(mm, vaddr, false);
 
-	return ret;
+	return ret < 0 ? ret : 0;
 }
 
 /**
-- 
2.48.1



* Re: [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-04 15:48 [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-03-04 15:48 ` [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
@ 2025-03-05 15:20 ` Oleg Nesterov
  2025-03-05 19:43   ` Andrii Nakryiko
  3 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-05 15:20 UTC (permalink / raw)
  To: David Hildenbrand, Andrii Nakryiko, Jiri Olsa
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linux-trace-kernel,
	linux-perf-users, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Ian Rogers, Adrian Hunter, Liang, Kan,
	Tong Tiangen

On 03/04, David Hildenbrand wrote:
>
> Currently, uprobe_write_opcode() implements COW-breaking manually, which is
> really far from ideal.

To say the least ;)

David, thanks for doing this. I'll try to read 3/3 tomorrow, but I don't
think I can really help. Let me repeat, this code was written many years
ago, I forgot everything, and today my understanding of mm/ is very poor.
But I'll try anyway.

> Are there any uprobe tests / benchmarks that are worth running?

All I know about uprobe tests is that bpf people run a lot of tests which
use uprobes.

Andrii, Jiri, what do you advise?

Oleg.



* Re: [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-04 15:48 ` [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
@ 2025-03-05 19:30   ` Matthew Wilcox
  2025-03-05 19:37     ` David Hildenbrand
  2025-03-10 17:03   ` Oleg Nesterov
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2025-03-05 19:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linux-trace-kernel,
	linux-perf-users, Andrew Morton, Russell King, Masami Hiramatsu,
	Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

On Tue, Mar 04, 2025 at 04:48:46PM +0100, David Hildenbrand wrote:
> +static bool orig_page_is_identical(struct vm_area_struct *vma,
> +		unsigned long vaddr, struct page *page, bool *pmd_mappable)
> +{
> +	const pgoff_t index = vaddr_to_offset(vma, vaddr) >> PAGE_SHIFT;
> +	struct page *orig_page = find_get_page(vma->vm_file->f_inode->i_mapping,
> +					       index);
> +	struct folio *orig_folio;
> +	bool identical;
> +
> +	if (!orig_page)
> +		return false;
> +	orig_folio = page_folio(orig_page);

I'd rather write this as:

	struct folio *orig_folio = filemap_get_folio(vma->vm_file->f_mapping,
			index);
	struct page *orig_page;

	if (IS_ERR(orig_folio))
		return false;
	orig_page = folio_file_page(orig_folio, index);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-05 19:30   ` Matthew Wilcox
@ 2025-03-05 19:37     ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-03-05 19:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linux-trace-kernel,
	linux-perf-users, Andrew Morton, Russell King, Masami Hiramatsu,
	Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

On 05.03.25 20:30, Matthew Wilcox wrote:
> On Tue, Mar 04, 2025 at 04:48:46PM +0100, David Hildenbrand wrote:
>> +static bool orig_page_is_identical(struct vm_area_struct *vma,
>> +		unsigned long vaddr, struct page *page, bool *pmd_mappable)
>> +{
>> +	const pgoff_t index = vaddr_to_offset(vma, vaddr) >> PAGE_SHIFT;
>> +	struct page *orig_page = find_get_page(vma->vm_file->f_inode->i_mapping,
>> +					       index);
>> +	struct folio *orig_folio;
>> +	bool identical;
>> +
>> +	if (!orig_page)
>> +		return false;
>> +	orig_folio = page_folio(orig_page);
> 
> I'd rather write this as:
> 
> 	struct folio *orig_folio = filemap_get_folio(vma->vm_file->f_mapping,
> 			index);
> 	struct page *orig_page;
> 
> 	if (IS_ERR(orig_folio))
> 		return false;
> 	orig_page = folio_file_page(orig_folio, index);
> 

Willy, you read my mind. I was staring at that code and thought "there 
must be an easier way using pages", and folio_file_page() was the 
missing piece.

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-05 15:20 ` [PATCH -next v1 0/3] " Oleg Nesterov
@ 2025-03-05 19:43   ` Andrii Nakryiko
  2025-03-05 19:47     ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Andrii Nakryiko @ 2025-03-05 19:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Hildenbrand, Andrii Nakryiko, Jiri Olsa, linux-kernel,
	linux-mm, linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	Andrew Morton, Matthew Wilcox, Russell King, Masami Hiramatsu,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, Liang, Kan, Tong Tiangen

On Wed, Mar 5, 2025 at 7:22 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 03/04, David Hildenbrand wrote:
> >
> > Currently, uprobe_write_opcode() implements COW-breaking manually, which is
> > really far from ideal.
>
> To say the least ;)
>
> David, thanks for doing this. I'll try to read 3/3 tomorrow, but I don't
> think I can really help. Let me repeat, this code was written many years
> ago, I forgot everything, and today my understanding of mm/ is very poor.
> But I'll try anyway.
>
> > Are there any uprobe tests / benchmarks that are worth running?
>
> All I know about uprobe tests is that bpf people run a lot of tests which
> use uprobes.
>
> Andrii, Jiri, what do you advise?
>

We do have a bunch of tests within BPF selftests:

cd tools/testing/selftests/bpf && make -j$(nproc) && sudo ./test_progs -t uprobe

I also built an uprobe-stress tool to validate the uprobe optimizations
I was doing; it is the most stand-alone thing to use for testing, so
please consider checking it out. You can find it at [0]; see also [1]
and [2], where I was helping Peter build it from source, which might be
useful for you as well if you run into problems with building. Running
something like `sudo ./uprobe-stress -a10 -t5 -m5 -f3` would hammer on
this quite a bit.

I'm just about to leave on a short vacation, so won't have time to go
over patches, but I plan to look at them when I'm back next week.

  [0] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress
  [1] https://lore.kernel.org/linux-trace-kernel/CAEf4BzZ+ygwfk8FKn5AS_Ny=igvGcFzdDLE2FjcvwjCKazEWMA@mail.gmail.com/
  [2] https://lore.kernel.org/linux-trace-kernel/CAEf4BzZqKCR-EQz6LTi-YvFY4RnYb_NnQXtwgZCv6aUo7gjkHg@mail.gmail.com

> Oleg.
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-05 19:43   ` Andrii Nakryiko
@ 2025-03-05 19:47     ` David Hildenbrand
  2025-03-05 19:58       ` Andrii Nakryiko
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-03-05 19:47 UTC (permalink / raw)
  To: Andrii Nakryiko, Oleg Nesterov
  Cc: Andrii Nakryiko, Jiri Olsa, linux-kernel, linux-mm,
	linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	Andrew Morton, Matthew Wilcox, Russell King, Masami Hiramatsu,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, Liang, Kan, Tong Tiangen

On 05.03.25 20:43, Andrii Nakryiko wrote:
> On Wed, Mar 5, 2025 at 7:22 AM Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> On 03/04, David Hildenbrand wrote:
>>>
>>> Currently, uprobe_write_opcode() implements COW-breaking manually, which is
>>> really far from ideal.
>>
> >> To say the least ;)
>>
>> David, thanks for doing this. I'll try to read 3/3 tomorrow, but I don't
>> think I can really help. Let me repeat, this code was written many years
>> ago, I forgot everything, and today my understanding of mm/ is very poor.
>> But I'll try anyway.
>>
>>> Are there any uprobe tests / benchmarks that are worth running?
>>
>> All I know about uprobe tests is that bpf people run a lot of tests which
>> use uprobes.
>>
>> Andrii, Jiri, what do you advise?
>>
> 
> We do have a bunch of tests within BPF selftests:
> 
> cd tools/testing/selftests/bpf && make -j$(nproc) && sudo ./test_progs -t uprobe

I stumbled over them, but was so far not successful in building them in 
my test VM (did not try too hard, though). Will try harder now that I 
know that they actually test uprobes properly :)

> 
> I also built an uprobe-stress tool to validate uprobe optimizations I
> was doing, this one is the most stand-alone thing to use for testing,
> please consider checking that. You can find it at [0], and see also
> [1] and [2] where  I was helping Peter to build it from sources, so
> that might be useful for you as well, if you run into problems with
> building. Running something like `sudo ./uprobe-stress -a10 -t5 -m5
> -f3` would hammer on this quite a bit.

Thanks, I'll play with that as well.

> 
> I'm just about to leave on a short vacation, so won't have time to go
> over patches, but I plan to look at them when I'm back next week.
> 
>    [0] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress
>    [1] https://lore.kernel.org/linux-trace-kernel/CAEf4BzZ+ygwfk8FKn5AS_Ny=igvGcFzdDLE2FjcvwjCKazEWMA@mail.gmail.com/
>    [2] https://lore.kernel.org/linux-trace-kernel/CAEf4BzZqKCR-EQz6LTi-YvFY4RnYb_NnQXtwgZCv6aUo7gjkHg@mail.gmail.com
> 
>> Oleg.
>>
> 


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-05 19:47     ` David Hildenbrand
@ 2025-03-05 19:58       ` Andrii Nakryiko
  2025-03-05 20:53         ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Andrii Nakryiko @ 2025-03-05 19:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oleg Nesterov, Andrii Nakryiko, Jiri Olsa, linux-kernel, linux-mm,
	linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	Andrew Morton, Matthew Wilcox, Russell King, Masami Hiramatsu,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, Liang, Kan, Tong Tiangen

On Wed, Mar 5, 2025 at 11:47 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.03.25 20:43, Andrii Nakryiko wrote:
> > On Wed, Mar 5, 2025 at 7:22 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> On 03/04, David Hildenbrand wrote:
> >>>
> >>> Currently, uprobe_write_opcode() implements COW-breaking manually, which is
> >>> really far from ideal.
> >>
> >> To say the least ;)
> >>
> >> David, thanks for doing this. I'll try to read 3/3 tomorrow, but I don't
> >> think I can really help. Let me repeat, this code was written many years
> >> ago, I forgot everything, and today my understanding of mm/ is very poor.
> >> But I'll try anyway.
> >>
> >>> Are there any uprobe tests / benchmarks that are worth running?
> >>
> >> All I know about uprobe tests is that bpf people run a lot of tests which
> >> use uprobes.
> >>
> >> Andrii, Jiri, what do you advise?
> >>
> >
> > We do have a bunch of tests within BPF selftests:
> >
> > cd tools/testing/selftests/bpf && make -j$(nproc) && sudo ./test_progs -t uprobe
>
> I stumbled over them, but was so far not successful in building them in
> my test VM (did not try too hard, though). Will try harder now that I
> know that it actually tests uprobe properly :)

If you have a decently recent Clang and pahole, then just make sure you
have the kernel built before you build the selftests. So the above
instructions are more like:

1. cd <linux-repo>
2. cat tools/testing/selftests/bpf/{config,config.<your_arch>} >> .config
3. make -j$(nproc) # build kernel with that adjusted config
4. cd tools/testing/selftests/bpf
5. make -j$(nproc) # build BPF selftests
6. sudo ./test_progs -t uprobe # run selftests with "uprobe" in their name
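
One shell pitfall worth noting for step 2: bash brace expansion must not
contain spaces, otherwise the braces are passed through literally and
cat fails. A quick self-contained check (file names here are made up for
the demonstration):

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
echo one > config
echo two > config.x86_64

# Correct: no space after the comma, so both files are expanded.
cat {config,config.x86_64} > combined
[ "$(wc -l < combined)" -eq 2 ] && echo "no-space form: ok"

# Broken: the space makes bash split the token into '{config,' and
# 'config.x86_64}', neither of which exists, so cat fails.
if ! cat {config, config.x86_64} > /dev/null 2>&1; then
	echo "space form: cat fails as expected"
fi
```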

>
> >
> > I also built an uprobe-stress tool to validate uprobe optimizations I
> > was doing, this one is the most stand-alone thing to use for testing,
> > please consider checking that. You can find it at [0], and see also
> > [1] and [2] where  I was helping Peter to build it from sources, so
> > that might be useful for you as well, if you run into problems with
> > building. Running something like `sudo ./uprobe-stress -a10 -t5 -m5
> > -f3` would hammer on this quite a bit.
>
> Thanks, I'll play with that as well.
>
> >
> > I'm just about to leave on a short vacation, so won't have time to go
> > over patches, but I plan to look at them when I'm back next week.
> >
> >    [0] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress
> >    [1] https://lore.kernel.org/linux-trace-kernel/CAEf4BzZ+ygwfk8FKn5AS_Ny=igvGcFzdDLE2FjcvwjCKazEWMA@mail.gmail.com/
> >    [2] https://lore.kernel.org/linux-trace-kernel/CAEf4BzZqKCR-EQz6LTi-YvFY4RnYb_NnQXtwgZCv6aUo7gjkHg@mail.gmail.com
> >
> >> Oleg.
> >>
> >
>
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-05 19:58       ` Andrii Nakryiko
@ 2025-03-05 20:53         ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-03-05 20:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Andrii Nakryiko, Jiri Olsa, linux-kernel, linux-mm,
	linux-arm-kernel, linux-trace-kernel, linux-perf-users,
	Andrew Morton, Matthew Wilcox, Russell King, Masami Hiramatsu,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, Liang, Kan, Tong Tiangen

On 05.03.25 20:58, Andrii Nakryiko wrote:
> On Wed, Mar 5, 2025 at 11:47 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 05.03.25 20:43, Andrii Nakryiko wrote:
>>> On Wed, Mar 5, 2025 at 7:22 AM Oleg Nesterov <oleg@redhat.com> wrote:
>>>>
>>>> On 03/04, David Hildenbrand wrote:
>>>>>
>>>>> Currently, uprobe_write_opcode() implements COW-breaking manually, which is
>>>>> really far from ideal.
>>>>
>>>> To say the least ;)
>>>>
>>>> David, thanks for doing this. I'll try to read 3/3 tomorrow, but I don't
>>>> think I can really help. Let me repeat, this code was written many years
>>>> ago, I forgot everything, and today my understanding of mm/ is very poor.
>>>> But I'll try anyway.
>>>>
>>>>> Are there any uprobe tests / benchmarks that are worth running?
>>>>
>>>> All I know about uprobe tests is that bpf people run a lot of tests which
>>>> use uprobes.
>>>>
>>>> Andrii, Jiri, what do you advise?
>>>>
>>>
>>> We do have a bunch of tests within BPF selftests:
>>>
>>> cd tools/testing/selftests/bpf && make -j$(nproc) && sudo ./test_progs -t uprobe
>>
>> I stumbled over them, but was so far not successful in building them in
>> my test VM (did not try too hard, though). Will try harder now that I
>> know that it actually tests uprobe properly :)
> 
> If you have decently recent Clang and pahole, then just make sure you
> have kernel built before you build selftests. So above instructions
> are more like:
> 
> 1. cd <linux-repo>
> 2. cat tools/testing/selftests/bpf/{config,config.<your_arch>} >> .config

^ that did the trick

> 3. make -j$(nproc) # build kernel with that adjusted config
> 4. cd tools/testing/selftests/bpf
> 5. make -j$(nproc) # build BPF selftests
> 6. sudo ./test_progs -t uprobe # run selftests with "uprobe" in their name

#444     uprobe:OK
#445     uprobe_autoattach:OK
#446/1   uprobe_multi_test/skel_api:OK
#446/2   uprobe_multi_test/attach_api_pattern:OK
#446/3   uprobe_multi_test/attach_api_syms:OK
#446/4   uprobe_multi_test/link_api:OK
#446/5   uprobe_multi_test/bench_uprobe:OK
#446/6   uprobe_multi_test/bench_usdt:OK
#446/7   uprobe_multi_test/attach_api_fails:OK
#446/8   uprobe_multi_test/attach_uprobe_fails:OK
#446/9   uprobe_multi_test/consumers:OK
#446/10  uprobe_multi_test/filter_fork:OK
#446/11  uprobe_multi_test/filter_clone_vm:OK
#446/12  uprobe_multi_test/session:OK
#446/13  uprobe_multi_test/session_single:OK
#446/14  uprobe_multi_test/session_cookie:OK
#446/15  uprobe_multi_test/session_cookie_recursive:OK
#446/16  uprobe_multi_test/uprobe_sesison_return_0:OK
#446/17  uprobe_multi_test/uprobe_sesison_return_1:OK
#446/18  uprobe_multi_test/uprobe_sesison_return_2:OK
#446     uprobe_multi_test:OK
#447/1   uprobe_syscall/uretprobe_regs_equal:OK
#447/2   uprobe_syscall/uretprobe_regs_change:OK
#447/3   uprobe_syscall/uretprobe_syscall_call:OK
#447/4   uprobe_syscall/uretprobe_shadow_stack:SKIP
#447     uprobe_syscall:OK (SKIP: 1/4)
Summary: 4/21 PASSED, 1 SKIPPED, 0 FAILED


Looks promising, thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-04 15:48 ` [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
  2025-03-05 19:30   ` Matthew Wilcox
@ 2025-03-10 17:03   ` Oleg Nesterov
  2025-03-11  9:54     ` David Hildenbrand
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-10 17:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linux-trace-kernel,
	linux-perf-users, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

On 03/04, David Hildenbrand wrote:
>
> uprobe_write_opcode() does some pretty low-level things that really, it
> shouldn't be doing:

Agreed. Thanks again for doing this.

David, as I said, I can't review. I don't understand this mm/folio magic
with or without your changes.

However. With your changes the code looks "better" and more understandable
to me. So I'd vote for your patches even if I can't ack them.

But I'd like to ask some stupid (no, really) questions.
__uprobe_write_opcode() does:

	/* We're done if we don't find an anonymous folio when unregistering. */
	if (!folio_test_anon(folio))
		return is_register ? -EFAULT : 0;

Yes, but we do not expect !folio_test_anon() if is_register == true, right?
See also below.

	/* Verify that the page content is still as expected. */
	if (verify_opcode(fw->page, opcode_vaddr, &opcode) <= 0) {
		set_pte_at(vma->vm_mm, vaddr, fw->ptep, fw->pte);
		return -EAGAIN;
	}

The caller, uprobe_write_opcode(), has already called verify_opcode(),
why do we need to re-check?

But whatever the reason, can we change uprobe_write_opcode() to
"delay" put_page() and, instead of

	/* Walk the page tables again, to perform the actual update. */
	folio = folio_walk_start(&fw, vma, vaddr, 0);
	if (folio) {
		ret = __uprobe_write_opcode(vma, &fw, folio, opcode_vaddr,
					    opcode);
		folio_walk_end(&fw, vma);
	} else {
		ret = -EAGAIN;
	}

do something like

	/* Walk the page tables again, to perform the actual update. */
	ret = -EAGAIN;
	folio = folio_walk_start(&fw, vma, vaddr, 0);
	if (folio) {
		if (fw.page == page) {
			WARN_ON(is_register && !folio_test_anon(folio));
			ret = __uprobe_write_opcode(vma, &fw, folio, opcode_vaddr,
					            opcode);
		}
		folio_walk_end(&fw, vma);
	}

?

Once again, I am not trying to review. I am trying to understand the
basics of your code.

Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-10 17:03   ` Oleg Nesterov
@ 2025-03-11  9:54     ` David Hildenbrand
  2025-03-11 12:32       ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-03-11  9:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linux-trace-kernel,
	linux-perf-users, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

On 10.03.25 18:03, Oleg Nesterov wrote:
> On 03/04, David Hildenbrand wrote:
>>
>> uprobe_write_opcode() does some pretty low-level things that really, it
>> shouldn't be doing:
> 
> Agreed. Thanks again for doing this.
> 
> David, as I said, I can't review. I don't understand this mm/folio magic
> with or without your changes.
> 

No worries! Thanks for taking a look!

> However. With your changes the code looks "better" and more understandable
> to me. So I'd vote for your patches even if I can't ack them.
> 
> But I'd like to ask some stupid (no, really) questions.
> __uprobe_write_opcode() does:
> 
> 	/* We're done if we don't find an anonymous folio when unregistering. */
> 	if (!folio_test_anon(folio))
> 		return is_register ? -EFAULT : 0;
> 
> Yes, but we do not expect !folio_test_anon() if is_register == true, right?
 > See also below.>
> 	/* Verify that the page content is still as expected. */
> 	if (verify_opcode(fw->page, opcode_vaddr, &opcode) <= 0) {
> 		set_pte_at(vma->vm_mm, vaddr, fw->ptep, fw->pte);
> 		return -EAGAIN;
> 	}
> 
> The caller, uprobe_write_opcode(), has already called verify_opcode(),
> why do we need to re-check?

Regarding both questions, the code is fairly racy. Nothing would stop 
user space from (a) modifying that memory or (b) zapping the anon page 
using MADV_DONTNEED (if we don't hold the mmap lock in write mode).

Regarding the latter, uprobe_write_opcode() is documented to: "Called 
with mm->mmap_lock held for read or write.".

Note that both checks are fairly cheap.

> 
> But whatever reason we have. Can we change uprobe_write_opcode() to
> "delay" put_page() and instead of

I was debating with myself whether we should do that and went back and 
forth a couple of times.

> 
> 	/* Walk the page tables again, to perform the actual update. */
> 	folio = folio_walk_start(&fw, vma, vaddr, 0);
> 	if (folio) {
> 		ret = __uprobe_write_opcode(vma, &fw, folio, opcode_vaddr,
> 					    opcode);
> 		folio_walk_end(&fw, vma);
> 	} else {
> 		ret = -EAGAIN;
> 	}
> 
> do something like
> 
> 	/* Walk the page tables again, to perform the actual update. */
> 	ret = -EAGAIN;
> 	folio = folio_walk_start(&fw, vma, vaddr, 0);
> 	if (folio) {
> 		if (fw.page == page) {
> 			WARN_ON(is_register && !folio_test_anon(folio));

Yes, that would work (we could leave the WARN_ON in 
__uprobe_write_opcode), but I am not sure if the end result is better. 
No strong opinion on the details though.

> 			ret = __uprobe_write_opcode(vma, &fw, folio, opcode_vaddr,
> 					            opcode);
> 		}
> 		folio_walk_end(&fw, vma);
> 	}
> 
> ?
> 
> Once again, I am not trying to review. I am trying to understand the
> basics of your code.

Any feedback is welcome, thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-11  9:54     ` David Hildenbrand
@ 2025-03-11 12:32       ` Oleg Nesterov
  2025-03-11 20:02         ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-11 12:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linux-trace-kernel,
	linux-perf-users, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

On 03/11, David Hildenbrand wrote:
>
> Regarding both questions, the code is fairly racy. Nothing would stop user
> space from (a) modifying that memory

Yes, but we don't really care. uprobes.c assumes that user-space won't play
with the probed memory.

Note that if is_register is false, then the vma can even be writable. Hmm,
why? Perhaps valid_vma() should ignore is_register and nack VM_MAYWRITE?
But this doesn't really matter; say, gdb can change this memory anyway.
Again, we don't really care.

> >do something like
> >
> >	/* Walk the page tables again, to perform the actual update. */
> >	ret = -EAGAIN;
> >	folio = folio_walk_start(&fw, vma, vaddr, 0);
> >	if (folio) {
> >		if (fw.page == page) {
> >			WARN_ON(is_register && !folio_test_anon(folio));
>
> Yes, that would work (we could leave the WARN_ON in __uprobe_write_opcode),
> but I am not sure if the end result is better. No strong opinion on
> the details though.

Well, this way __uprobe_write_opcode() will look a little bit simpler...

But I won't insist, please do what you think is better.

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite
  2025-03-11 12:32       ` Oleg Nesterov
@ 2025-03-11 20:02         ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-03-11 20:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linux-trace-kernel,
	linux-perf-users, Andrew Morton, Matthew Wilcox, Russell King,
	Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Tong Tiangen

On 11.03.25 13:32, Oleg Nesterov wrote:
> On 03/11, David Hildenbrand wrote:
>>
>> Regarding both questions, the code is fairly racy. Nothing would stop user
>> space from (a) modifying that memory
> 
> Yes, but we don't really care. uprobes.c assumes that user-space won't play
> with the probed memory.

Right, I primarily care that, if user space were to do it, we don't 
trigger unintended behavior (e.g., overwriting pagecache pages, 
triggering WARNs, etc.).

Likely, re-validating the page content is indeed something we can drop.

> 
> Note that if is_register is false, then vma can be even writable. Hmm, why?
> Perhaps valid_vma() should ignore is_register and nack VM_MAYWRITE ? But
> this doesn't really matter, say, gdb can change this memory anyway. Again,
> we don't really care.
> 
>>> do something like
>>>
>>> 	/* Walk the page tables again, to perform the actual update. */
>>> 	ret = -EAGAIN;
>>> 	folio = folio_walk_start(&fw, vma, vaddr, 0);
>>> 	if (folio) {
>>> 		if (fw.page == page) {
>>> 			WARN_ON(is_register && !folio_test_anon(folio));
>>
>> Yes, that would work (we could leave the WARN_ON in __uprobe_write_opcode),
>> but I am not sure if the end result is better better. No strong opinion on
>> the details though.
> 
> Will, this way __uprobe_write_opcode() will look a little bit simpler...
> 
> But I won't insist, please do what you think is better.

I'll take another look at this series probably next week (I'm on PTO 
this week) to then resend once adjusted + retested.

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-03-11 20:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-04 15:48 [PATCH -next v1 0/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
2025-03-04 15:48 ` [PATCH -next v1 1/3] kernel/events/uprobes: pass VMA instead of MM to remove_breakpoint() David Hildenbrand
2025-03-04 15:48 ` [PATCH -next v1 2/3] kernel/events/uprobes: pass VMA to set_swbp(), set_orig_insn() and uprobe_write_opcode() David Hildenbrand
2025-03-04 15:48 ` [PATCH -next v1 3/3] kernel/events/uprobes: uprobe_write_opcode() rewrite David Hildenbrand
2025-03-05 19:30   ` Matthew Wilcox
2025-03-05 19:37     ` David Hildenbrand
2025-03-10 17:03   ` Oleg Nesterov
2025-03-11  9:54     ` David Hildenbrand
2025-03-11 12:32       ` Oleg Nesterov
2025-03-11 20:02         ` David Hildenbrand
2025-03-05 15:20 ` [PATCH -next v1 0/3] " Oleg Nesterov
2025-03-05 19:43   ` Andrii Nakryiko
2025-03-05 19:47     ` David Hildenbrand
2025-03-05 19:58       ` Andrii Nakryiko
2025-03-05 20:53         ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).