public inbox for linux-trace-kernel@vger.kernel.org
From: Darko Tominac <dtominac@cisco.com>
To: Masami Hiramatsu <mhiramat@kernel.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Namhyung Kim <namhyung@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	James Clark <james.clark@linaro.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Lorenzo Stoakes <ljs@kernel.org>,
	David Hildenbrand <david@kernel.org>,
	Vlastimil Babka <vbabka@kernel.org>, Jann Horn <jannh@google.com>
Cc: xe-linux-external@cisco.com, danielwa@cisco.com,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
Date: Wed, 29 Apr 2026 15:15:18 +0200
Message-ID: <20260429131522.4049054-1-dtominac@cisco.com>

When uprobes are active, MADV_DONTNEED can discard file-backed pages
that contain uprobe software breakpoint instructions.  Because the
uprobe infrastructure does not re-instrument pages on individual page
faults (uprobe_mmap() is only called during VMA creation, not on
page-in), the breakpoints are silently lost once the discarded pages are
re-read from the backing file.  The probes stop firing with no error
indication, and the only recovery is to unregister and re-register the
affected uprobes.

Note that MADV_FREE is not affected: it only operates on anonymous VMAs
(madvise_free_single_vma() rejects non-anonymous VMAs with -EINVAL),
while uprobes only instrument file-backed mappings, so the two can never
overlap.

A concrete example is a userspace memory reclamation subsystem that
periodically calls madvise(MADV_DONTNEED) on file-backed text pages to
release memory.  This silently clears uprobe breakpoints placed by
eBPF-based security and tracing tools that use uprobes to attach eBPF
programs to user-space functions, causing those tools to stop
functioning within seconds of the first reclamation pass.
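
The loss can be reproduced in miniature from user space without any
uprobes involved.  The sketch below is an illustration only and not part
of this patch: a private write to a file-backed page stands in for the
breakpoint, and byte_after_dontneed() is a hypothetical helper name.
After MADV_DONTNEED the next access faults the page back in from the
file, so the private modification is gone.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page of a temporary file privately, modify it, discard it with
 * MADV_DONTNEED and read it back.  Returns the byte seen at offset 0
 * after the discard, or -1 if any step fails.
 */
int byte_after_dontneed(void)
{
	char path[] = "/tmp/dontneed-demo-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *map;
	int result = -1;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	/* The backing file holds 'A' at offset 0. */
	if (ftruncate(fd, page) != 0 || pwrite(fd, "A", 1, 0) != 1)
		goto out_close;

	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Private modification: plays the role of the breakpoint byte. */
	map[0] = 'B';

	/* Discard the page; the next read repopulates it from the file. */
	if (madvise(map, page, MADV_DONTNEED) == 0)
		result = map[0];

	munmap(map, page);
out_close:
	close(fd);
	return result;
}
```

On Linux the helper is expected to return 'A' rather than 'B', which is
the same mechanism by which a breakpoint vanishes from a text page.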

Add a check to madvise_dontneed_free(), the common handler for
MADV_DONTNEED, MADV_DONTNEED_LOCKED and MADV_FREE.  When CONFIG_UPROBES
is enabled, the check detects whether the target range contains active
uprobes:

  - Fast path: if no uprobes are registered system-wide, or the VMA is
    not file-backed (uprobes only instrument file-backed mappings, so
    anonymous VMAs -- including MADV_FREE targets -- can never contain
    breakpoints), or no uprobes are present in the VMA range, proceed
    with the discard as before.
  - Slow path: when uprobes are detected in the range, use
    vma_first_uprobe_addr() to jump directly to each uprobe page via
    the rbtree, zapping the clean ranges between them.  This is
    O(M * log N) where M is the number of uprobes in the range and
    N is the total uprobe count, rather than O(pages).  madvise()
    still returns success, consistent with the advisory nature of
    MADV_DONTNEED.
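
The slow-path walk can be modelled in user space as follows.  This is a
minimal sketch under stated assumptions, not the kernel code: all demo_*
names are hypothetical, the page size is fixed at 4 KiB, and a linear
scan over a sorted array stands in for the rbtree lookup done by
vma_first_uprobe_addr().  It only computes which ranges would be zapped.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

struct demo_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/*
 * Stand-in for vma_first_uprobe_addr(): return the page-aligned address
 * of the lowest probe in [start, end), or 0 if there is none.  'probes'
 * holds probe addresses sorted in ascending order.
 */
static unsigned long demo_first_probe_page(const unsigned long *probes,
					   size_t n, unsigned long start,
					   unsigned long end)
{
	for (size_t i = 0; i < n; i++)
		if (probes[i] >= start && probes[i] < end)
			return probes[i] & ~(DEMO_PAGE_SIZE - 1);
	return 0;
}

/*
 * Collect the clean ranges that would be zapped for [start, end),
 * skipping every page that contains a probe.  Returns the number of
 * ranges written to 'out' (at most max_out).
 */
size_t demo_zap_ranges(const unsigned long *probes, size_t n,
		       unsigned long start, unsigned long end,
		       struct demo_range *out, size_t max_out)
{
	unsigned long cur = start;
	size_t count = 0;

	while (cur < end) {
		unsigned long page = demo_first_probe_page(probes, n,
							   cur, end);
		unsigned long stop = page ? page : end;

		/* Zap the clean range before the probe page (or the rest). */
		if (cur < stop && count < max_out)
			out[count++] = (struct demo_range){ cur, stop };
		if (!page)
			break;
		/* Skip past the probe page. */
		cur = page + DEMO_PAGE_SIZE;
	}
	return count;
}
```

With probes at 0x12010, 0x12800 and 0x15000 and a range of
[0x10000, 0x18000), the model yields three clean ranges and leaves the
pages at 0x12000 and 0x15000 untouched.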

When CONFIG_UPROBES is disabled, the stub helpers return false or 0, so
the original behaviour is preserved with no overhead.

To support the above, export vma_has_uprobes() and add new helpers
any_uprobes_registered() and vma_first_uprobe_addr() in the uprobes
subsystem.  vma_first_uprobe_addr() returns the page-aligned virtual
address of the lowest-offset uprobe in a given VMA range by leveraging
the (inode, offset)-sorted global rbtree.
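
The address arithmetic behind that return value can be sketched in user
space.  This is an illustration under assumptions, not the kernel
implementation: struct demo_vma and the demo_* functions are
hypothetical stand-ins that keep only the fields used here, and a 4 KiB
page size is assumed.  demo_vaddr_to_offset() mirrors the
vaddr_to_offset() helper visible in the diff context, and
demo_offset_to_vaddr() is its inverse.

```c
#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Reduced stand-in for struct vm_area_struct. */
struct demo_vma {
	unsigned long vm_start;	/* first mapped virtual address */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* File offset that backs virtual address 'vaddr' in this mapping. */
long long demo_vaddr_to_offset(const struct demo_vma *vma,
			       unsigned long vaddr)
{
	return ((long long)vma->vm_pgoff << DEMO_PAGE_SHIFT) +
	       (long long)(vaddr - vma->vm_start);
}

/* Virtual address at which file offset 'offset' is mapped. */
unsigned long demo_offset_to_vaddr(const struct demo_vma *vma,
				   long long offset)
{
	long long base = (long long)vma->vm_pgoff << DEMO_PAGE_SHIFT;

	return vma->vm_start + (unsigned long)(offset - base);
}

/* Page-aligned virtual address of the page holding a probe offset. */
unsigned long demo_probe_page(const struct demo_vma *vma, long long offset)
{
	return demo_offset_to_vaddr(vma, offset) & ~(DEMO_PAGE_SIZE - 1);
}
```

For a mapping that starts at 0x400000 with vm_pgoff 2, a probe at file
offset 0x3a10 maps to 0x401a10, so the page to preserve starts at
0x401000.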

Cc: xe-linux-external@cisco.com
Cc: danielwa@cisco.com
Signed-off-by: Darko Tominac <dtominac@cisco.com>
---
 include/linux/uprobes.h | 21 +++++++++++
 kernel/events/uprobes.c | 79 +++++++++++++++++++++++++++++++++++++++--
 mm/madvise.c            | 73 +++++++++++++++++++++++++++++++++----
 3 files changed, 164 insertions(+), 9 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f548fea2adec..9ce5c46fd2e9 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -212,6 +212,11 @@ extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consum
 extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern unsigned long vma_first_uprobe_addr(struct vm_area_struct *vma,
+					   unsigned long start,
+					   unsigned long end);
+extern bool any_uprobes_registered(void);
 extern void uprobe_start_dup_mmap(void);
 extern void uprobe_end_dup_mmap(void);
 extern void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm);
@@ -278,6 +283,22 @@ static inline void
 uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
 }
+static inline bool
+vma_has_uprobes(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end)
+{
+	return false;
+}
+static inline unsigned long
+vma_first_uprobe_addr(struct vm_area_struct *vma, unsigned long start,
+		      unsigned long end)
+{
+	return 0;
+}
+static inline bool any_uprobes_registered(void)
+{
+	return false;
+}
 static inline void uprobe_start_dup_mmap(void)
 {
 }
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..0f8aea99b96f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -152,6 +152,19 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
 	return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
 }
 
+/**
+ * any_uprobes_registered - check if any uprobes are currently registered
+ *
+ * Check whether the global uprobe rbtree has any entries, indicating
+ * that at least one uprobe is currently active in the system.
+ *
+ * Return: true if one or more uprobes are registered, false otherwise.
+ */
+bool any_uprobes_registered(void)
+{
+	return !no_uprobe_events();
+}
+
 /**
  * is_swbp_insn - check if instruction is breakpoint instruction.
  * @insn: instruction to be checked.
@@ -1635,8 +1648,16 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	return 0;
 }
 
-static bool
-vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+/**
+ * vma_has_uprobes - check whether a vma range contains any uprobes.
+ * @vma: the vma to search.
+ * @start: start address of the range (inclusive).
+ * @end: end address of the range (exclusive).
+ *
+ * Return: true if at least one uprobe is registered in [@start, @end),
+ * false otherwise.
+ */
+bool vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
 	loff_t min, max;
 	struct inode *inode;
@@ -1654,6 +1675,60 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
 	return !!n;
 }
 
+/**
+ * vma_first_uprobe_addr - find first uprobe in a vma range.
+ * @vma: the vma to search.
+ * @start: start address of the range (inclusive).
+ * @end: end address of the range (exclusive).
+ *
+ * Used by madvise to skip directly to uprobe pages.
+ *
+ * Return: the page-aligned virtual address of the first uprobe in
+ * [@start, @end), or 0 if none exists.
+ */
+unsigned long vma_first_uprobe_addr(struct vm_area_struct *vma,
+				    unsigned long start, unsigned long end)
+{
+	loff_t min, max, first_offset;
+	struct inode *inode;
+	struct rb_node *n, *t;
+	struct uprobe *u;
+
+	/* No uprobes possible on anonymous mappings */
+	if (!vma->vm_file)
+		return 0;
+
+	/* Empty range -- nothing to search */
+	if (start >= end)
+		return 0;
+
+	inode = file_inode(vma->vm_file);
+
+	min = vaddr_to_offset(vma, start);
+	max = min + (end - start) - 1;
+
+	read_lock(&uprobes_treelock);
+	n = find_node_in_range(inode, min, max);
+	if (!n) {
+		read_unlock(&uprobes_treelock);
+		return 0;
+	}
+
+	/* Walk left to find the lowest offset in range */
+	u = rb_entry(n, struct uprobe, rb_node);
+	first_offset = u->offset;
+	for (t = rb_prev(n); t; t = rb_prev(t)) {
+		u = rb_entry(t, struct uprobe, rb_node);
+		if (u->inode != inode || u->offset < min)
+			break;
+		first_offset = u->offset;
+	}
+	read_unlock(&uprobes_treelock);
+
+	/* Return page-aligned vaddr containing this uprobe */
+	return PAGE_ALIGN_DOWN(offset_to_vaddr(vma, first_offset));
+}
+
 /*
  * Called in context of a munmap of a vma.
  */
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..c73f1131224b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -32,6 +32,7 @@
 #include <linux/leafops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/uprobes.h>
 
 #include <asm/tlb.h>
 
@@ -862,6 +863,30 @@ static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior)
 	return 0;
 }
 
+static long madvise_dontneed_free_range(struct madvise_behavior *madv_behavior,
+					unsigned long start, unsigned long end)
+{
+	struct madvise_behavior_range *range = &madv_behavior->range;
+	unsigned long saved_start = range->start;
+	unsigned long saved_end = range->end;
+	int behavior = madv_behavior->behavior;
+	long ret;
+
+	range->start = start;
+	range->end = end;
+
+	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
+		ret = madvise_dontneed_single_vma(madv_behavior);
+	else if (behavior == MADV_FREE)
+		ret = madvise_free_single_vma(madv_behavior);
+	else
+		ret = -EINVAL;
+
+	range->start = saved_start;
+	range->end = saved_end;
+	return ret;
+}
+
 static
 bool madvise_dontneed_free_valid_vma(struct madvise_behavior *madv_behavior)
 {
@@ -898,7 +923,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
 {
 	struct mm_struct *mm = madv_behavior->mm;
 	struct madvise_behavior_range *range = &madv_behavior->range;
-	int behavior = madv_behavior->behavior;
+	unsigned long cur, end, uprobe_addr;
 
 	if (!madvise_dontneed_free_valid_vma(madv_behavior))
 		return -EINVAL;
@@ -947,12 +972,46 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
 		VM_WARN_ON(range->start > range->end);
 	}
 
-	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
-		return madvise_dontneed_single_vma(madv_behavior);
-	else if (behavior == MADV_FREE)
-		return madvise_free_single_vma(madv_behavior);
-	else
-		return -EINVAL;
+	/*
+	 * Preserve uprobes: if any uprobes are active in this VMA range,
+	 * avoid discarding pages that contain active breakpoints.
+	 *
+	 * Fast path: if no uprobes are registered system-wide, or the VMA
+	 * is not file-backed (uprobes only instrument file-backed mappings,
+	 * so anonymous VMAs can never contain breakpoints), or no uprobes
+	 * are present in this VMA range, proceed with the full operation.
+	 */
+	if (likely(!any_uprobes_registered()) ||
+	    !madv_behavior->vma->vm_file ||
+	    !vma_has_uprobes(madv_behavior->vma, range->start, range->end))
+		return madvise_dontneed_free_range(madv_behavior,
+						   range->start, range->end);
+
+	/*
+	 * Slow path: jump from uprobe to uprobe via rbtree lookup, zapping
+	 * the clean range before each uprobe page. This is O(M * log N)
+	 * where M is the number of uprobes in the range and N is the total
+	 * uprobe count, versus O(pages) for a page-by-page scan. 'cur'
+	 * tracks the beginning of the current clean range.
+	 */
+	cur = range->start;
+	end = range->end;
+	while (cur < end) {
+		uprobe_addr = vma_first_uprobe_addr(madv_behavior->vma,
+						    cur, end);
+		if (!uprobe_addr) {
+			/* No more uprobes - zap the rest */
+			madvise_dontneed_free_range(madv_behavior, cur, end);
+			break;
+		}
+		/* Zap the clean range before the uprobe page */
+		if (cur < uprobe_addr)
+			madvise_dontneed_free_range(madv_behavior, cur,
+						    uprobe_addr);
+		/* Skip past the uprobe page */
+		cur = uprobe_addr + PAGE_SIZE;
+	}
+	return 0;
 }
 
 static long madvise_populate(struct madvise_behavior *madv_behavior)
-- 
2.35.6

