From: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
To: sj@kernel.org, damon@lists.linux.dev, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Cc: akpm@linux-foundation.org, corbet@lwn.net, bijan311@gmail.com,
ajayjoshi@micron.com, honggyu.kim@sk.com, yunjeong.mun@sk.com,
ravis.opensrc@gmail.com, bharata@amd.com
Subject: [RFC PATCH 3/7] mm/damon/core: replace mutex-protected report buffer with per-CPU lockless ring
Date: Sat, 16 May 2026 15:34:28 -0700
Message-ID: <20260516223439.4033-4-ravis.opensrc@gmail.com>
In-Reply-To: <20260516223439.4033-1-ravis.opensrc@gmail.com>

Replace the mutex-protected fixed-size array (DAMON_ACCESS_REPORTS_CAP=1000)
with a per-CPU lockless ring buffer. This enables damon_report_access()
to be called from NMI context.
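
For illustration, a minimal sketch of how an NMI-context sampler (such
as the IBS backend in patch 7) could feed this; damon_ibs_nmi_handler(),
ibs_sample_paddr() and the .addr/.size field names are placeholders for
this write-up, not part of this patch:

	static int damon_ibs_nmi_handler(unsigned int cmd,
			struct pt_regs *regs)
	{
		struct damon_access_report report = {
			.addr = ibs_sample_paddr(), /* hypothetical decode */
			.size = PAGE_SIZE,	    /* assumed field names */
		};

		/* NMI-safe: no locks, no allocation; drops on overflow. */
		damon_report_access(&report);
		return NMI_HANDLED;
	}
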
Ring design:
- The producer is serialized per CPU: only one in-flight producer per CPU
  is allowed. A per-CPU damon_report_ring_busy counter detects an NMI
  nesting on a process-context producer and drops the nested attempt,
  preserving the single-writer invariant on the slot.
- head is advanced only by the producer; an smp_wmb() orders the entry
  write before the head publish.
- tail is advanced only by the consumer (kdamond), with a release store
  after the entries[] reads so a slot is never recycled while it is
  still being read.
- Overflow: sample silently dropped. NMI context is allocation-free
and access reports are best-effort.
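
The index arithmetic is the usual power-of-two scheme; note that the
full test keeps one slot unused, so a ring holds at most
DAMON_REPORT_RING_SIZE - 1 (255) buffered reports:

	empty:	head == tail
	full:	((head + 1) & DAMON_REPORT_RING_MASK) == tail
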
To keep the producer/consumer pattern scalable on systems with many
CPUs and a high NMI rate, the ring layout follows three rules:
- head, tail and entries[] live on separate cache lines via
____cacheline_aligned_in_smp, so producer and consumer do not
invalidate each other's working set on every advance.
- DAMON_REPORT_RING_SIZE is bounded so the per-CPU footprint stays
  small (256 entries x sizeof(struct damon_access_report), plus the
  head and tail cache lines), so that draining every ring in one
  kdamond tick does not evict unrelated data from the caches of
  contemporary server parts.
- A cpumask, damon_rings_pending, is set by the producer after
  publishing and cleared by the consumer before it drains the
  corresponding ring, so the consumer iterates only over CPUs with
  pending entries instead of walking every online CPU. An
  smp_mb__before_atomic() between the head publish and the
  cpumask_set_cpu() ensures observers of the pending bit also observe
  the published head; without it, weakly-ordered architectures could
  let the consumer drain a stale head and delay the report. The
  consumer pairs this with an smp_mb__after_atomic() between
  cpumask_clear_cpu() and reading head, so a producer that publishes
  between the consumer's clear and head-read is observed via the bit
  it re-sets rather than silently stranded. The pairing is condensed
  in the sketch after this list.
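
Condensed into the usual two-column form (a restatement of the code
below; no ordering beyond what the patch implements is implied):

	producer (CPU n, NMI)		consumer (kdamond)
	---------------------		------------------
	entries[head] = *report;
	smp_wmb();
	WRITE_ONCE(head, next);		cpumask_clear_cpu(n, pending);
	smp_mb__before_atomic();	smp_mb__after_atomic();
	cpumask_set_cpu(n, pending);	head = READ_ONCE(head);
					smp_rmb();
					reads of entries[tail..head)
					smp_store_release(&tail, head);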

The consumer, kdamond_check_reported_accesses(), drains the rings of
the CPUs set in damon_rings_pending and applies the reports to the
monitoring targets.

Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
---
mm/damon/core.c | 143 ++++++++++++++++++++++++++++++++++--------------
1 file changed, 101 insertions(+), 42 deletions(-)
diff --git a/mm/damon/core.c b/mm/damon/core.c
index b605d36b29b1a..9ed789e932ebd 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -25,7 +25,26 @@
#define CREATE_TRACE_POINTS
#include <trace/events/damon.h>
-#define DAMON_ACCESS_REPORTS_CAP 1000
+/* Sized so the per-CPU ring set fits in L3 on typical multi-socket boxes. */
+#define DAMON_REPORT_RING_SIZE 256
+#define DAMON_REPORT_RING_MASK (DAMON_REPORT_RING_SIZE - 1)
+
+struct damon_report_ring {
+	unsigned int head;	/* written by the producer (NMI) */
+	unsigned int tail	/* written by the consumer (kdamond) */
+		____cacheline_aligned_in_smp;
+	struct damon_access_report entries[DAMON_REPORT_RING_SIZE]
+		____cacheline_aligned_in_smp;
+};
+
+static DEFINE_PER_CPU(struct damon_report_ring, damon_report_rings);
+static DEFINE_PER_CPU(int, damon_report_ring_busy);
+/*
+ * Per-CPU bitmap: the producer (NMI) sets a bit after publishing a report;
+ * the consumer (kdamond) clears it before draining the corresponding ring.
+ * Written at a high rate under sampling load - do NOT mark __read_mostly.
+ */
+static cpumask_t damon_rings_pending;
static DEFINE_MUTEX(damon_lock);
static int nr_running_ctxs;
@@ -36,10 +55,6 @@ static struct damon_operations damon_registered_ops[NR_DAMON_OPS];
static struct kmem_cache *damon_region_cache __ro_after_init;
-static DEFINE_MUTEX(damon_access_reports_lock);
-static struct damon_access_report damon_access_reports[
-		DAMON_ACCESS_REPORTS_CAP];
-static int damon_access_reports_len;
/* Should be called under damon_ops_lock with id smaller than NR_DAMON_OPS */
static bool __damon_is_registered_ops(enum damon_ops_id id)
@@ -2127,33 +2142,56 @@ int damos_walk(struct damon_ctx *ctx, struct damos_walk_control *control)
}
/**
- * damon_report_access() - Report identified access events to DAMON.
- * @report: The reporting access information.
+ * damon_report_access() - Report a hardware-observed memory access.
+ * @report: Pointer to the filled &struct damon_access_report.
*
- * Report access events to DAMON.
- *
- * Context: May sleep.
- *
- * NOTE: we may be able to implement this as a lockless queue, and allow any
- * context. As the overhead is unknown, and region-based DAMON logics would
- * guarantee the reports would be not made that frequently, let's start with
- * this simple implementation.
+ * Context: NMI-safe. No sleeping, no allocation, no locks.
*/
void damon_report_access(struct damon_access_report *report)
{
-	struct damon_access_report *dst;
+	struct damon_report_ring *ring;
+	unsigned int head, next;

-	/* silently fail for races */
-	if (!mutex_trylock(&damon_access_reports_lock))
-		return;
-	dst = &damon_access_reports[damon_access_reports_len++];
-	/* just drop all existing reports in favor of simplicity. */
-	if (damon_access_reports_len == DAMON_ACCESS_REPORTS_CAP)
-		damon_access_reports_len = 0;
-	*dst = *report;
-	dst->report_jiffies = jiffies;
-	mutex_unlock(&damon_access_reports_lock);
+	/* Pin to a CPU so the SPSC invariant holds for preemptible callers. */
+	preempt_disable();
+	/*
+	 * An NMI nesting on a process-context producer on the same CPU
+	 * would stomp the same entries[head] slot. Detect and drop instead.
+	 */
+	if (this_cpu_inc_return(damon_report_ring_busy) != 1) {
+		/* NMI nested on a process-context producer; drop. */
+		goto out;
+	}
+
+	ring = this_cpu_ptr(&damon_report_rings);
+	head = ring->head;
+	next = (head + 1) & DAMON_REPORT_RING_MASK;
+
+	if (next == READ_ONCE(ring->tail)) {
+		/* Ring full; the consumer is behind, drop the report. */
+		goto out;
+	}
+
+	ring->entries[head] = *report;
+	ring->entries[head].report_jiffies = jiffies;
+	smp_wmb();	/* make the entry visible before the head advance */
+	WRITE_ONCE(ring->head, next);
+	/*
+	 * Order the head advance before publishing the pending bit so
+	 * that the consumer, on observing the bit, is also guaranteed to
+	 * observe the new head. set_bit()/cpumask_set_cpu() are
+	 * documented as unordered RMW (Documentation/atomic_bitops.txt),
+	 * hence the explicit barrier; without it, a weakly-ordered arch
+	 * could let the consumer drain a stale head, clear the bit, and
+	 * delay the report until the next producer sets the bit again.
+	 */
+	smp_mb__before_atomic();
+	cpumask_set_cpu(smp_processor_id(), &damon_rings_pending);
+out:
+	this_cpu_dec(damon_report_ring_busy);
+	preempt_enable();
}
+EXPORT_SYMBOL_GPL(damon_report_access);
#ifdef CONFIG_MMU
void damon_report_page_fault(struct vm_fault *vmf, bool huge_pmd)
@@ -3814,26 +3852,47 @@ static unsigned int kdamond_apply_zero_access_report(struct damon_ctx *ctx)
static unsigned int kdamond_check_reported_accesses(struct damon_ctx *ctx)
{
-	int i;
-	struct damon_access_report *report;
+	int cpu;
	struct damon_target *t;

-	/* currently damon_access_report supports only physical address */
-	if (damon_target_has_pid(ctx))
-		return 0;
+	for_each_cpu(cpu, &damon_rings_pending) {
+		struct damon_report_ring *ring =
+				per_cpu_ptr(&damon_report_rings, cpu);
+		unsigned int head, tail;

-	mutex_lock(&damon_access_reports_lock);
-	for (i = 0; i < damon_access_reports_len; i++) {
-		report = &damon_access_reports[i];
-		if (time_before(report->report_jiffies,
-				jiffies -
-				usecs_to_jiffies(
-					ctx->attrs.sample_interval)))
-			continue;
-		damon_for_each_target(t, ctx)
-			kdamond_apply_access_report(report, t, ctx);
+		cpumask_clear_cpu(cpu, &damon_rings_pending);
+		/*
+		 * Pair with the producer's smp_mb__before_atomic() between
+		 * the head publish and cpumask_set_cpu(): order the bit
+		 * clear before the head read, so that a producer publishing
+		 * between our clear and our READ_ONCE(head) is observed via
+		 * the bit it re-sets, not lost as a stale-head drain.
+		 */
+		smp_mb__after_atomic();
+		head = READ_ONCE(ring->head);
+		/*
+		 * Pair with the smp_wmb() in damon_report_access(): entry
+		 * data published before the producer advanced head must be
+		 * visible to the entries[] reads in the loop below.
+		 */
+		smp_rmb();
+		tail = ring->tail;
+
+		while (tail != head) {
+			struct damon_access_report *report =
+					&ring->entries[tail];
+
+			if (!time_before(report->report_jiffies,
+					jiffies - usecs_to_jiffies(
+						ctx->attrs.sample_interval))) {
+				damon_for_each_target(t, ctx)
+					kdamond_apply_access_report(report,
+							t, ctx);
+			}
+			tail = (tail + 1) & DAMON_REPORT_RING_MASK;
+		}
+		/* release: order the entries[] reads before this publish */
+		smp_store_release(&ring->tail, tail);
	}
-	mutex_unlock(&damon_access_reports_lock);

	/* For nr_accesses_bp, absence of access should also be reported. */
	return kdamond_apply_zero_access_report(ctx);
}
--
2.43.0

Thread overview: 8+ messages
2026-05-16 22:34 [RFC PATCH 0/7] mm/damon: hardware-sampled access reports + AMD IBS Op example Ravi Jonnalagadda
2026-05-16 22:34 ` [RFC PATCH 1/7] mm/damon/core: refcount ops owner module to prevent rmmod UAF Ravi Jonnalagadda
2026-05-16 22:34 ` [RFC PATCH 2/7] mm/damon/paddr: export damon_pa_* ops for IBS module Ravi Jonnalagadda
2026-05-16 22:34 ` [RFC PATCH 3/7] mm/damon/core: replace mutex-protected report buffer with per-CPU lockless ring Ravi Jonnalagadda [this message]
2026-05-16 22:34 ` [RFC PATCH 4/7] mm/damon/core: flat-array snapshot + bsearch in ring-drain loop Ravi Jonnalagadda
2026-05-16 22:34 ` [RFC PATCH 5/7] mm/damon: add sysfs binding and dispatch hookup for paddr_ibs operations Ravi Jonnalagadda
2026-05-16 22:34 ` [RFC PATCH 6/7] mm/damon/core: accept paddr_ibs in node_eligible_mem_bp ops check Ravi Jonnalagadda
2026-05-16 22:34 ` [RFC PATCH 7/7] mm/damon/damon_ibs: add AMD IBS-based access sampling backend Ravi Jonnalagadda