Linux Documentation


* Re: [PATCH v2 1/3] Doc: deprecated.rst: add strlcat()
From: Kees Cook @ 2026-05-14 16:31 UTC (permalink / raw)
  To: Manuel Ebner
  Cc: Andy Shevchenko, Jonathan Corbet, Shuah Khan, Andy Whitcroft,
	Joe Perches, Dwaipayan Ray, Lukas Bulwahn, Geert Uytterhoeven,
	David Laight, Randy Dunlap, Jani Nikula, Heiko Carstens,
	open list:DOCUMENTATION PROCESS, open list:DOCUMENTATION,
	open list
In-Reply-To: <20260514162652.107714-2-manuelebner@mailbox.org>

On Thu, May 14, 2026 at 06:26:53PM +0200, Manuel Ebner wrote:
> add strlcat and alternatives
> 
> Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
> ---
>  Documentation/process/deprecated.rst | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst
> index fed56864d036..06e802f4bbfd 100644
> --- a/Documentation/process/deprecated.rst
> +++ b/Documentation/process/deprecated.rst
> @@ -153,6 +153,13 @@ used, and the destinations should be marked with the `__nonstring
>  attribute to avoid future compiler warnings. For cases still needing
>  NUL-padding, strtomem_pad() can be used.
>  
> +strlcat()
> +---------
> +strlcat() must re-scan the destination string from the beginning on each
> +call (O(n^2) behavior). Alternatives are seq_buf_puts() and seq_buf_printf().
> +snprintf(), scnprintf() and sysfs_emit() are possible as well, but their
> +arguments need to be adapted accordingly.
> +

How about just:

strlcat() must re-scan the destination string from the beginning on each
call (O(n^2) behavior). Use the seq_buf API or similar instead.
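
To illustrate the O(n^2) point, here is a minimal userspace sketch, not kernel code. model_strlcat() and struct cursor_buf are hypothetical names made up for this example; the cursor buffer only mimics the general idea of remembering the write position, as the seq_buf API does, and is not the seq_buf implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of strlcat(); also counts bytes scanned in dst. */
static size_t model_strlcat(char *dst, const char *src, size_t size,
			    size_t *scanned)
{
	size_t dlen = 0;
	size_t slen = strlen(src);

	/* Every call walks dst from the start to find its end. */
	while (dlen < size && dst[dlen])
		dlen++;
	*scanned += dlen;

	if (dlen == size)
		return size + slen;	/* dst is not NUL-terminated within size */

	if (slen < size - dlen) {
		memcpy(dst + dlen, src, slen + 1);
	} else {
		memcpy(dst + dlen, src, size - dlen - 1);
		dst[size - 1] = '\0';	/* truncate, keep the terminator */
	}
	return dlen + slen;
}

/* Hypothetical cursor-based buffer: the end position is stored, not searched. */
struct cursor_buf {
	char *buf;
	size_t size;
	size_t len;	/* bytes used, excluding the NUL */
	int overflow;
};

static void cursor_buf_puts(struct cursor_buf *cb, const char *src)
{
	size_t slen = strlen(src);

	if (cb->overflow || slen >= cb->size - cb->len) {
		cb->overflow = 1;	/* refuse partial writes in this sketch */
		return;
	}
	memcpy(cb->buf + cb->len, src, slen + 1);
	cb->len += slen;
}
```

Appending a two-byte string ten times with model_strlcat() scans 0 + 2 + ... + 18 = 90 destination bytes in total, while cursor_buf_puts() never re-reads what it has already written.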


>  strlcpy()
>  ---------
>  strlcpy() reads the entire source buffer first (since the return value
> -- 
> 2.54.0
> 

-- 
Kees Cook


* [PATCH v2 2/3] scripts: checkpatch.pl: add warning for strlcat()
From: Manuel Ebner @ 2026-05-14 16:28 UTC (permalink / raw)
  To: Andy Shevchenko, Kees Cook, Jonathan Corbet, Shuah Khan,
	Andy Whitcroft, Joe Perches, Dwaipayan Ray, Lukas Bulwahn,
	Geert Uytterhoeven, David Laight, Randy Dunlap, Jani Nikula,
	Heiko Carstens, open list:DOCUMENTATION PROCESS,
	open list:DOCUMENTATION, open list
  Cc: Manuel Ebner
In-Reply-To: <20260514160719.105084-3-manuelebner@mailbox.org>

add a warning for strlcat()

Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
---
 scripts/checkpatch.pl | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 0492d6afc9a1..4c1b43ebe00d 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -7085,6 +7085,12 @@ sub process {
 			     "Prefer strscpy over strlcpy - see: https://github.com/KSPP/linux/issues/89\n" . $herecurr);
 		}
 
+# strlcat uses that should be replaced by a better-supported function
+		if ($line =~ /\bstrlcat\s*\(/ && !is_userspace($realfile)) {
+			WARN("STRLCAT",
+			     "Prefer a more supported function over strlcat - see: https://github.com/KSPP/linux/issues/370\n" . $herecurr);
+		}
+
 # strncpy uses that should likely be strscpy or strscpy_pad
 		if ($line =~ /\bstrncpy\s*\(/ && !is_userspace($realfile)) {
 			WARN("STRNCPY",
-- 
2.54.0



* [PATCH v2 1/3] Doc: deprecated.rst: add strlcat()
From: Manuel Ebner @ 2026-05-14 16:26 UTC (permalink / raw)
  To: Andy Shevchenko, Kees Cook, Jonathan Corbet, Shuah Khan,
	Andy Whitcroft, Joe Perches, Dwaipayan Ray, Lukas Bulwahn,
	Geert Uytterhoeven, David Laight, Randy Dunlap, Jani Nikula,
	Heiko Carstens, open list:DOCUMENTATION PROCESS,
	open list:DOCUMENTATION, open list
  Cc: Manuel Ebner
In-Reply-To: <20260514160719.105084-3-manuelebner@mailbox.org>

add strlcat and alternatives

Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
---
 Documentation/process/deprecated.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst
index fed56864d036..06e802f4bbfd 100644
--- a/Documentation/process/deprecated.rst
+++ b/Documentation/process/deprecated.rst
@@ -153,6 +153,13 @@ used, and the destinations should be marked with the `__nonstring
 attribute to avoid future compiler warnings. For cases still needing
 NUL-padding, strtomem_pad() can be used.
 
+strlcat()
+---------
+strlcat() must re-scan the destination string from the beginning on each
+call (O(n^2) behavior). Alternatives are seq_buf_puts() and seq_buf_printf().
+snprintf(), scnprintf() and sysfs_emit() are possible as well, but their
+arguments need to be adapted accordingly.
+
 strlcpy()
 ---------
 strlcpy() reads the entire source buffer first (since the return value
-- 
2.54.0



* [PATCH v2 0/3] Doc, scripts: facilitate phaseout of strlcat
From: Manuel Ebner @ 2026-05-14 16:07 UTC (permalink / raw)
  To: Andy Shevchenko, Kees Cook, Jonathan Corbet, Shuah Khan,
	Andy Whitcroft, Joe Perches, Dwaipayan Ray, Lukas Bulwahn,
	Geert Uytterhoeven, David Laight, Randy Dunlap, Jani Nikula,
	Heiko Carstens, open list:DOCUMENTATION PROCESS,
	open list:DOCUMENTATION, open list
  Cc: Manuel Ebner

Thanks for all the feedback. I tried to incorporate it in this version.

The goal of this series is to facilitate the transition away from strlcat().

[v2]
 add recipients
 add remarks to strlcat definition in
  lib/string.c
  tools/include/nolibc/string.h
  -> [3/3]
 


* Re: [PATCH v10 8/9] platform/chrome: Protect cros_ec_device lifecycle with revocable
From: Jason Gunthorpe @ 2026-05-14 16:02 UTC (permalink / raw)
  To: Tzung-Bi Shih
  Cc: Arnd Bergmann, Greg Kroah-Hartman, Bartosz Golaszewski,
	Linus Walleij, Benson Leung, linux-kernel, chrome-platform,
	driver-core, linux-doc, linux-gpio, Rafael J. Wysocki,
	Danilo Krummrich, Jonathan Corbet, Shuah Khan, Laurent Pinchart,
	Wolfram Sang, Johan Hovold, Paul E . McKenney
In-Reply-To: <agVCoxuTu7l60TH-@google.com>

On Thu, May 14, 2026 at 03:33:55AM +0000, Tzung-Bi Shih wrote:

> > Given you say this is such a bug I think you really should be sending
> > a series that is patches 5 through 7 from the other series and a
> > simple rwsem instead of misc_deregister_sync() to deal with this bug
> > ASAP. No need to complicate a simple bug fix in a driver with all
> > these core changes.
> 
> Apologies for missing this suggestion.
> 
> For "patches 5 through 7 from the other series" I guess you're referring to:
> - https://lore.kernel.org/all/20260427134659.95181-6-tzungbi@kernel.org
> - https://lore.kernel.org/all/20260427134659.95181-7-tzungbi@kernel.org
> - https://lore.kernel.org/all/20260427134659.95181-8-tzungbi@kernel.org

Yes

> Could you provide a bit more detail on the rwsem approach?  I'm not
> entirely clear on what data or operations the rwsem would be protecting.

Just put a rwsem, or even SRCU, inside the driver's fops.

You can refactor that out to a misc or revocable later.

Jason


* Re: [PATCH v11 4/5] platform/chrome: Protect cros_ec_device lifecycle with revocable
From: Jason Gunthorpe @ 2026-05-14 16:00 UTC (permalink / raw)
  To: Tzung-Bi Shih
  Cc: Arnd Bergmann, Greg Kroah-Hartman, Bartosz Golaszewski,
	Linus Walleij, Benson Leung, linux-kernel, chrome-platform,
	driver-core, linux-doc, linux-gpio, Rafael J. Wysocki,
	Danilo Krummrich, Jonathan Corbet, Shuah Khan, Laurent Pinchart,
	Wolfram Sang, Johan Hovold, Paul E . McKenney
In-Reply-To: <agVCtBbqT6aZL0mx@google.com>

On Thu, May 14, 2026 at 03:34:12AM +0000, Tzung-Bi Shih wrote:

> To help me understand, could you elaborate on why the revocable mechanism
> isn't suitable here?

Stay within one driver. Create the revocable in probe, consume it
within that driver's fops/etc, destroy it on remove. Do not randomly
pass it to other drivers.

> I'm wondering because if this piece of code were to transition to
> Rust in the future, would the concerns you have also apply to using
> Revocable[1] in the Rust context for this driver?

Yes, even in Rust, driver-local revocable objects should not be
spaghetti-coded through different layers.

Jason


* [PATCH] docs: hwmon: sy7636a: fix temperature sysfs attribute name
From: Chen-Shi-Hong @ 2026-05-14 15:39 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Jonathan Corbet, Shuah Khan, linux-hwmon, linux-doc, linux-kernel,
	Chen-Shi-Hong

The hwmon sysfs naming convention uses
temp[1-*]_input for temperature channels.

Documentation/hwmon/sy7636a-hwmon.rst currently documents
temp0_input, while the driver uses the standard hwmon
temperature channel interface.

Update the documentation to use temp1_input.
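
For illustration only, a small userspace sketch of the 1-based naming rule. temp_input_name() is a hypothetical helper written for this example and is not part of the hwmon core or of the sy7636a driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "temp<N>_input" for a 1-based channel number. */
static int temp_input_name(char *buf, size_t size, unsigned int channel)
{
	int n;

	if (channel == 0)
		return -1;	/* temperature channels are numbered from 1 */

	n = snprintf(buf, size, "temp%u_input", channel);
	if (n < 0 || (size_t)n >= size)
		return -1;	/* output error or truncated name */
	return 0;
}
```

With this helper, channel 1 yields "temp1_input" and channel 0 is rejected, which matches the documentation fix above.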

Signed-off-by: Chen-Shi-Hong <eric039eric@gmail.com>
---
 Documentation/hwmon/sy7636a-hwmon.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/hwmon/sy7636a-hwmon.rst b/Documentation/hwmon/sy7636a-hwmon.rst
index 0143ce0e5db7..03d866aba6e8 100644
--- a/Documentation/hwmon/sy7636a-hwmon.rst
+++ b/Documentation/hwmon/sy7636a-hwmon.rst
@@ -22,5 +22,5 @@ The following sensors are supported
 sysfs-Interface
 ---------------
 
-temp0_input
+temp1_input
 	- Temperature of external NTC (milli-degree C)
-- 
2.53.0



* Re: [PATCH v4 02/16] vfio/pci: Preserve vfio-pci device files across Live Update
From: Pratyush Yadav @ 2026-05-14 15:24 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Vipin Sharma, David Matlack, kvm, linux-doc, linux-kernel,
	linux-kselftest, linux-pci, ajayachandra, alex, amastro, ankita,
	apopple, chrisl, corbet, graf, jacob.pan, jgg, jgg, jrhilke,
	julianr, kevin.tian, leon, leonro, lukas, michal.winiarski, parav,
	pasha.tatashin, praan, pratyush, rananta, rientjes, rodrigo.vivi,
	rppt, saeedm, skhan, vivek.kasireddy, witu, yanjun.zhu, yi.l.liu
In-Reply-To: <agT9bYpXskVwW0E_@google.com>

On Wed, May 13 2026, Samiullah Khawaja wrote:

> On Tue, May 12, 2026 at 02:29:19PM -0700, Vipin Sharma wrote:
>>On Tue, May 12, 2026 at 01:59:51PM -0700, David Matlack wrote:
>>> On Mon, May 11, 2026 at 4:48 PM Vipin Sharma <vipinsh@google.com> wrote:
>>>
>>> > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>>> > index c12d614fc6c4..019de053f116 100644
>>> > --- a/drivers/vfio/pci/Kconfig
>>> > +++ b/drivers/vfio/pci/Kconfig
>>> > @@ -45,13 +45,15 @@ config VFIO_PCI_IGD
>>> >
>>> >  config VFIO_PCI_LIVEUPDATE
>>> >         bool "VFIO PCI support for Live Update (EXPERIMENTAL)"
>>> > -       depends on PCI_LIVEUPDATE
>>> > +       depends on PCI_LIVEUPDATE && VFIO_DEVICE_CDEV
>>> >         help
>>> >           Support for preserving devices bound to vfio-pci across a Live
>>> >           Update. This option should only be enabled by developers working on
>>> >           implementing this support. Once enough support has landed in the
>>> >           kernel, this option will no longer be marked EXPERIMENTAL.
>>> >
>>> > +         Enabling this will disable support for VFIO PCI DMA buffer.
>>> > +
>>> >           If you don't know what to do here, say N.
>>> >
>>> >  endif
>>> > @@ -68,7 +70,7 @@ config VFIO_PCI_ZDEV_KVM
>>> >           To enable s390x KVM vfio-pci extensions, say Y.
>>> >
>>> >  config VFIO_PCI_DMABUF
>>> > -       def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
>>> > +       def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER && !VFIO_PCI_LIVEUPDATE
>>>
>>> Why does enabling VFIO_PCI_LIVEUPDATE require disabling
>>> VFIO_PCI_DMABUF? I saw the cover letter says "to keep things simple",
>>> but what specific problem does this solve or simplify?
>>
>>I should have provided more details there.
>>
>>When device is getting reset in vfio_pci_liveupdate_freeze(), we are
>>zapping userspace mapped bars, we also need to use
>>vfio_pci_dma_buf_move() to revoke dma buffer access or
>>vfio_pci_dma_buf_cleanup() combination. Cleanup takes the memory lock
>>which freeze already takes, and there are some refcounts which are
>>managed in both of these APIs. This was causing complexities with code
>>flow based on result of pci_load_saved_state(). All this was adding more
>>refactoring than I wanted in the series.
>
> Maybe we can return -EOPNOTSUPP if any dmabufs for this vfio cdev are
> exported during preserve?

Whichever way you go with, a TODO/comment would be nice to have so
someone (including future you) looking at this code knows why this
restriction exists.

-- 
Regards,
Pratyush Yadav


* [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
From: Horst Birthelmer @ 2026-05-14 15:13 UTC (permalink / raw)
  To: Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer

From: Horst Birthelmer <hbirthelmer@ddn.com>

The dcache only shrinks under memory pressure, which is rarely reached
on machines with ample RAM, so cached negative dentries can accumulate
without bound.  Give administrators a soft cap they can set,
and a background worker that prefers negative dentries when reclaiming.

Two new sysctls under /proc/sys/fs/:

  dentry-limit             -- soft cap on nr_dentry.  0 (default)
                              disables the feature; behaviour is then
                              identical to before.
  dentry-limit-interval-ms -- pacing for the worker while still over
                              the cap.  Default 1000, minimum 1.

When the cap is exceeded, a delayed_work runs in two phases:

  1. iterate_supers() draining only negative dentries from every LRU.
     Positive entries are rotated past so the walk makes progress.
     DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed
     cap should evict even hot negatives before any positive entry.
  2. If still over the cap, iterate_supers() again with the same
     isolate callback the memory-pressure shrinker uses.
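
The two phases can be sketched in plain userspace C as follows. This is a toy model with hypothetical names (struct toy_dentry, toy_prune(), toy_enforce_limit()); it uses a flat array instead of per-superblock LRU lists and ignores locking, reference counts and DCACHE_REFERENCED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an unused dentry on an LRU list. */
struct toy_dentry {
	bool negative;	/* no backing inode */
	bool evicted;
};

/* Evict up to 'over' entries; with negatives_only set, skip positive ones. */
static long toy_prune(struct toy_dentry *lru, size_t n, long over,
		      bool negatives_only)
{
	long freed = 0;

	for (size_t i = 0; i < n && freed < over; i++) {
		if (lru[i].evicted)
			continue;
		if (negatives_only && !lru[i].negative)
			continue;
		lru[i].evicted = true;
		freed++;
	}
	return freed;
}

/* Returns how many entries are still over the cap after both phases. */
static long toy_enforce_limit(struct toy_dentry *lru, size_t n, long live,
			      long limit)
{
	long over;

	if (limit == 0 || live <= limit)	/* a limit of 0 disables the cap */
		return 0;

	over = live - limit;
	over -= toy_prune(lru, n, over, true);		/* phase 1: negatives only */
	if (over > 0)
		over -= toy_prune(lru, n, over, false);	/* phase 2: any entry */
	return over;
}
```

With five entries (negative, positive, negative, positive, positive) and a limit of two, phase 1 evicts both negatives and phase 2 evicts only the first positive.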

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
---
There was a discussion at LSFMM about servers with too many cached
negative dentries.
That gave me the idea of letting the system administrator cap the
total number of dentries when needed.

This is somewhat related to [1]: it addresses the same symptoms, but
less obtrusively, by garbage collecting negative dentries first and
then unused cache entries.

The other effect I have seen is that FUSE will not forget inodes
(no FORGET call to the FUSE server) until long after the last
reference has been closed.

In a FUSE server that mirrors the kernel's cached inodes in user space,
because it has to keep a lot of private data for every node,
this puts an unnecessary memory strain on that userspace process,
especially if memory is limited for its cgroup.

[1]: https://lore.kernel.org/linux-fsdevel/20260331012925.74840-1-raven@themaw.net/
---
 Documentation/admin-guide/sysctl/fs.rst |  28 +++++
 fs/dcache.c                             | 197 ++++++++++++++++++++++++++++++++
 2 files changed, 225 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 9b7f65c3efd8..0229aea45d85 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -38,6 +38,34 @@ requests.  ``aio-max-nr`` allows you to change the maximum value
 ``aio-max-nr`` does not result in the
 pre-allocation or re-sizing of any kernel data structures.
 
+dentry-limit
+------------
+
+Soft cap on the total number of dentries allocated system-wide (i.e. on
+``nr_dentry`` from ``dentry-state``).  A value of ``0`` (the default)
+disables the feature and the dcache grows or shrinks only under memory
+pressure as before.
+
+When set to a non-zero value, a background worker is woken whenever
+the live dentry count exceeds the limit. The worker walks every
+superblock's LRU and prefers to evict negative dentries first; if it
+cannot get back under the limit using negative entries alone it falls
+back to the same LRU policy used by the memory-pressure shrinker.
+
+The limit is *soft*: allocations never fail because of it, and brief
+overshoots while the worker catches up are expected. Set the cap a
+comfortable margin above your steady-state working set.
+
+dentry-limit-interval-ms
+------------------------
+
+How often, in milliseconds, the ``dentry-limit`` worker re-runs while
+``nr_dentry`` is still above the cap. Defaults to ``1000`` (one
+second); the minimum accepted value is ``1``. Smaller values trim the
+cache more aggressively at the cost of more CPU spent walking LRUs;
+larger values let temporary spikes ride out before any work is done.
+Has no effect when ``dentry-limit`` is ``0``.
+
 dentry-negative
 ----------------------------
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 2c61aeea41f4..4959d2c011c0 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -144,6 +144,19 @@ static DEFINE_PER_CPU(long, nr_dentry_unused);
 static DEFINE_PER_CPU(long, nr_dentry_negative);
 static int dentry_negative_policy;
 
+/*
+ * Soft cap on the total number of dentries. When non-zero and exceeded,
+ * a background worker prunes unused dentries (preferring negative ones)
+ * until we are back under the limit. Zero (the default) disables the
+ * feature entirely; the fast path in __d_alloc() only pays the cost of
+ * a READ_ONCE and a branch in that case.
+ */
+static unsigned long sysctl_dentry_limit __read_mostly;
+static unsigned int sysctl_dentry_limit_interval_ms __read_mostly = 1000;
+static unsigned long dentry_limit_last_kick;
+
+static void dentry_limit_kick(void);
+
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 /* Statistics gathering. */
 static struct dentry_stat_t dentry_stat = {
@@ -199,6 +212,20 @@ static int proc_nr_dentry(const struct ctl_table *table, int write, void *buffer
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+/*
+ * Writing fs.dentry-limit should give prompt feedback to admins
+ * lowering the cap, so kick the worker on every successful write.
+ */
+static int proc_dentry_limit(const struct ctl_table *table, int write,
+			     void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (write && !ret)
+		dentry_limit_kick();
+	return ret;
+}
+
 static const struct ctl_table fs_dcache_sysctls[] = {
 	{
 		.procname	= "dentry-state",
@@ -207,6 +234,21 @@ static const struct ctl_table fs_dcache_sysctls[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_nr_dentry,
 	},
+	{
+		.procname	= "dentry-limit",
+		.data		= &sysctl_dentry_limit,
+		.maxlen		= sizeof(sysctl_dentry_limit),
+		.mode		= 0644,
+		.proc_handler	= proc_dentry_limit,
+	},
+	{
+		.procname	= "dentry-limit-interval-ms",
+		.data		= &sysctl_dentry_limit_interval_ms,
+		.maxlen		= sizeof(sysctl_dentry_limit_interval_ms),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "dentry-negative",
 		.data		= &dentry_negative_policy,
@@ -1325,6 +1367,160 @@ static enum lru_status dentry_lru_isolate_shrink(struct list_head *item,
 	return LRU_REMOVED;
 }
 
+#define DENTRY_LIMIT_BATCH	1024UL
+
+static void dentry_limit_worker_fn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(dentry_limit_work, dentry_limit_worker_fn);
+
+/*
+ * Variant of dentry_lru_isolate() that only frees negative dentries.
+ * DCACHE_REFERENCED is intentionally not honoured here: the whole point
+ * of an admin-imposed cap on negatives is that even frequently-looked-up
+ * negative entries should be evicted before any positive dentry.
+ * Positive entries are rotated to the tail so the walk continues to
+ * make progress without disturbing their LRU position.
+ */
+static enum lru_status dentry_lru_isolate_negative(struct list_head *item,
+		struct list_lru_one *lru, void *arg)
+{
+	struct list_head *freeable = arg;
+	struct dentry *dentry = container_of(item, struct dentry, d_lru);
+
+	if (!spin_trylock(&dentry->d_lock))
+		return LRU_SKIP;
+
+	/* Same handling as dentry_lru_isolate() for in-use entries. */
+	if (dentry->d_lockref.count) {
+		d_lru_isolate(lru, dentry);
+		spin_unlock(&dentry->d_lock);
+		return LRU_REMOVED;
+	}
+
+	if (!d_is_negative(dentry)) {
+		spin_unlock(&dentry->d_lock);
+		return LRU_ROTATE;
+	}
+
+	d_lru_shrink_move(lru, dentry, freeable);
+	spin_unlock(&dentry->d_lock);
+	return LRU_REMOVED;
+}
+
+struct dentry_limit_ctx {
+	long over;		/* remaining dentries to evict */
+	list_lru_walk_cb isolate;
+};
+
+static void dentry_limit_prune_sb(struct super_block *sb, void *arg)
+{
+	struct dentry_limit_ctx *ctx = arg;
+	unsigned long walked = 0;
+	unsigned long budget;
+
+	if (ctx->over <= 0)
+		return;
+
+	/*
+	 * Walk up to one full pass of this superblock's LRU, in
+	 * DENTRY_LIMIT_BATCH-sized chunks. The loop matters mainly for
+	 * phase 1: dentry_lru_isolate_negative() returns LRU_ROTATE for
+	 * positive dentries, which still counts against list_lru_walk()'s
+	 * nr_to_walk. A single batch can therefore finish having freed
+	 * nothing when positives crowd the head of the LRU, and without
+	 * the inner loop the worker would have to wait a full
+	 * dentry-limit-interval-ms before retrying, possibly never reaching
+	 * the negatives buried behind a long run of positives.
+	 *
+	 * The budget is snapshot at entry so a filesystem allocating
+	 * dentries faster than we drain them can't keep us spinning here
+	 * forever; freshly added dentries are picked up on the next
+	 * worker invocation.
+	 *
+	 * Phase 2 normally exits much sooner: its isolate callback frees
+	 * any non-referenced dentry, so ctx->over typically hits zero
+	 * inside the first batch. The worst-case over-eviction is one
+	 * batch past the cap, which is within the soft semantics of
+	 * fs.dentry-limit.
+	 */
+	budget = list_lru_count(&sb->s_dentry_lru);
+
+	while (ctx->over > 0 && walked < budget) {
+		LIST_HEAD(dispose);
+		unsigned long nr;
+		long freed;
+
+		nr = min(DENTRY_LIMIT_BATCH, budget - walked);
+		freed = list_lru_walk(&sb->s_dentry_lru, ctx->isolate,
+				      &dispose, nr);
+		shrink_dentry_list(&dispose);
+
+		ctx->over -= freed;
+		walked += nr;
+
+		cond_resched();
+	}
+}
+
+static void dentry_limit_worker_fn(struct work_struct *work)
+{
+	struct dentry_limit_ctx ctx;
+	unsigned long limit = READ_ONCE(sysctl_dentry_limit);
+	unsigned int ms;
+	long nr;
+
+	if (!limit)
+		return;
+
+	nr = get_nr_dentry();
+	if (nr <= (long)limit)
+		return;
+
+	ctx.over = nr - (long)limit;
+
+	/* Phase 1: drain negative dentries across every superblock. */
+	ctx.isolate = dentry_lru_isolate_negative;
+	iterate_supers(dentry_limit_prune_sb, &ctx);
+
+	/* Phase 2: still over? Apply the ordinary LRU policy. */
+	if (ctx.over > 0) {
+		ctx.isolate = dentry_lru_isolate;
+		iterate_supers(dentry_limit_prune_sb, &ctx);
+	}
+
+	/*
+	 * Re-arm while still above the limit. Re-read the sysctls in
+	 * case the admin raised the cap or disabled the feature during
+	 * the walk.
+	 */
+	limit = READ_ONCE(sysctl_dentry_limit);
+	if (!limit || get_nr_dentry() <= (long)limit)
+		return;
+
+	ms = READ_ONCE(sysctl_dentry_limit_interval_ms);
+	queue_delayed_work(system_unbound_wq, &dentry_limit_work,
+			   msecs_to_jiffies(ms));
+}
+
+static void dentry_limit_kick(void)
+{
+	unsigned long limit = READ_ONCE(sysctl_dentry_limit);
+	unsigned long now;
+
+	if (!limit)
+		return;
+	if (delayed_work_pending(&dentry_limit_work))
+		return;
+
+	now = jiffies;
+	if (time_before(now, READ_ONCE(dentry_limit_last_kick) + HZ / 10))
+		return;
+	WRITE_ONCE(dentry_limit_last_kick, now);
+
+	if (get_nr_dentry() <= (long)limit)
+		return;
+
+	queue_delayed_work(system_unbound_wq, &dentry_limit_work, 0);
+}
 
 /**
  * shrink_dcache_sb - shrink dcache for a superblock
@@ -1868,6 +2064,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	}
 
 	this_cpu_inc(nr_dentry);
+	dentry_limit_kick();
 
 	return dentry;
 }

---
base-commit: 5d6919055dec134de3c40167a490f33c74c12581
change-id: 20260513-limit-dentries-cache-63685729672b

Best regards,
-- 
Horst Birthelmer <hbirthelmer@ddn.com>



* Re: [PATCH] driver core: Add cmdline option to force probe type
From: Jianlin Lv @ 2026-05-14 15:09 UTC (permalink / raw)
  To: Greg KH
  Cc: corbet, skhan, rafael, dakr, jianlv, linux-kernel, linux-doc,
	driver-core
In-Reply-To: <2026051406-corridor-equation-c50e@gregkh>

On Thu, May 14, 2026 at 9:49 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, May 14, 2026 at 09:35:08PM +0800, Jianlin Lv wrote:
> > On Thu, May 14, 2026 at 6:16 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Thu, May 14, 2026 at 05:49:55PM +0800, Jianlin Lv wrote:
> > > > From: Jianlin Lv <iecedge@gmail.com>
> > > >
> > > > Device drivers that use asynchronous probing can cause non-deterministic
> > > > device ordering and naming across reboots. A typical example is storage
> > > > drivers (like sd/nvme): asynchronous probing can lead to inconsistent disk
> > > > logical names after reboot. In scenarios where disk naming consistency is
> > > > critical, the probe type should be set to synchronous.
> > > >
> > > > This patch introduces a driver_probe kernel parameter that overrides any
> > > > driver's hard-coded probe type settings and allows runtime control without
> > > > requiring kernel recompilation:
> > > >
> > > >   driver_probe=PROBE_TYPE_SYNC,nvme,sd      # Force specific drivers sync
> > > >   driver_probe=PROBE_TYPE_ASYNC,*,usb       # Force all async except usb
> > > >   driver_probe=PROBE_TYPE_SYNC,*            # Force all drivers synchronous
> > > >
> > > > The implementation replaces the limited driver_async_probe parameter with
> > > > a more flexible interface that can force either synchronous or asynchronous
> > > > probing as needed.
> > > >
> > > > Signed-off-by: Jianlin Lv <iecedge@gmail.com>
> > > > ---
> > > >  .../admin-guide/kernel-parameters.txt         | 27 +++++--
> > > >  drivers/base/dd.c                             | 71 ++++++++++++++-----
> > > >  2 files changed, 74 insertions(+), 24 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index 4d0f545fb3ec..b43a8bd20356 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -1377,12 +1377,27 @@ Kernel parameters
> > > >                       it becomes active and is searched during signature
> > > >                       verification.
> > > >
> > > > -     driver_async_probe=  [KNL]
> > > > -                     List of driver names to be probed asynchronously. *
> > > > -                     matches with all driver names. If * is specified, the
> > > > -                     rest of the listed driver names are those that will NOT
> > > > -                     match the *.
> > > > -                     Format: <driver_name1>,<driver_name2>...
> > >
> > > You can not remove an existing user/kernel api, sorry, that is not
> > > allowed as you just broke all systems that were relying on this :(
> > >
> > Could you provide more suggestions on how to improve this patch?
>
> Not really, sorry, I don't think this is a change that should be done at
> all.  disk naming is a long-solved issue, to think that you can fix that
> by doing sync/async device probing is not understanding both the issues
> involved, and how we solved it already :)

Do you mean referencing disks via by-path/by-id? In our production env
they can also be unstable; this is an example I encountered before:
https://lore.kernel.org/all/CAFA-uR_jk6jCmf9DTebSVBRwtoLuXuyvf1Biq+OObqRVAOZbBw@mail.gmail.com/

I understand that device naming in the kernel can change at any time. However,
would it still make sense to provide an interface that lets users choose
the probe mode themselves?
Currently, driver_async_probe has lower priority than the drivers’
hard-coded probe_type settings.
Could we adjust the code as follows so that driver_async_probe has the
highest priority?

 static bool driver_allows_async_probing(const struct device_driver *drv)
 {
+       if (cmdline_requested_async_probing(drv->name))
+                        return true;
+
        switch (drv->probe_type) {
        case PROBE_PREFER_ASYNCHRONOUS:
                return true;
@@ -876,9 +879,6 @@ static bool driver_allows_async_probing(const struct device_driver *drv)
                return false;

        default:
-               if (cmdline_requested_async_probing(drv->name))
-                       return true;
-
                if (module_requested_async_probing(drv->owner))

Jianlin

>
> Hint, never count on block device, or any device, names to be the same
> across reboots.  That has NEVER been guaranteed on systems built in the
> past 20+ years.
>
> Please, just use the existing solutions, no new command line option
> should ever be needed here.
>
> thanks,
>
> greg k-h


* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-14 14:37 UTC (permalink / raw)
  To: Lance Yang
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260514132830.25622-1-lance.yang@linux.dev>

On Thu, May 14, 2026 at 09:28:30PM +0800, Lance Yang wrote:
> 
> On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
> >get_any_page() collapses three different failure modes into a single
> >-EIO return:
> >
> >  * the put_page race in the !count_increased path;
> >  * the HWPoisonHandlable() rejection that bounces out of
> >    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
> >  * the HWPoisonHandlable() rejection that goes through the
> >    count_increased / put_page / shake_page retry loop.
> >
> >The first is transient (the page is racing with the allocator).  The
> >second can be either transient (a userspace folio briefly off LRU
> >during migration/compaction) or stable (slab/vmalloc/page-table/
> >kernel-stack pages).  The third describes a stable kernel-owned page
> >that the count_increased=true caller already held a reference on.
> >
> >Distinguish them on the return path: keep -EIO for both the put_page
> >race and the -EBUSY-after-retries branch (shake_page() cannot drag a
> >folio back from active migration, so we cannot prove the page is
> >permanently kernel-owned from there), keep -EBUSY for the allocation
> >race (unchanged), and return -ENOTRECOVERABLE only from the
> >count_increased-true HWPoisonHandlable() rejection that exhausts its
> >retries -- the caller's reference is structural evidence that the
> >page is owned by the kernel.
> >
> >Extend the unhandlable-page pr_err() to fire for either errno and
> >update the get_hwpoison_page() kerneldoc.
> >
> >memory_failure() still folds every negative return into
> >MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> >this patch is a no-op for users of memory_failure() and only changes
> >the errno that soft_offline_page() can propagate to its callers.  A
> >follow-up wires the new return code through memory_failure() and
> >reports MF_MSG_KERNEL for the unrecoverable cases.
> >
> >Suggested-by: David Hildenbrand <david@kernel.org>
> >Signed-off-by: Breno Leitao <leitao@debian.org>
> >---
> > mm/memory-failure.c | 18 +++++++++++++++---
> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >
> >diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >index 49bcfbd04d213..bae883df3ccb2 100644
> >--- a/mm/memory-failure.c
> >+++ b/mm/memory-failure.c
> >@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
> > 				shake_page(p);
> > 				goto try_again;
> > 			}
> >+			/*
> >+			 * Return -EIO rather than -ENOTRECOVERABLE: this
> >+			 * branch is also reached for pages that are merely
> >+			 * off-LRU transiently (e.g. a folio in the middle
> >+			 * of migration or compaction), which shake_page()
> >+			 * cannot drag back.  The caller cannot prove the
> >+			 * page is permanently kernel-owned from here, so
> >+			 * keep it on the recoverable errno.
> >+			 */
> > 			ret = -EIO;
> > 			goto out;
> > 		}
> >@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> > 			goto try_again;
> > 		}
> > 		put_page(p);
> >-		ret = -EIO;
> >+		ret = -ENOTRECOVERABLE;
> > 	}
> > out:
> >-	if (ret == -EIO)
> >+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
> > 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> > 
> > 	return ret;
> >@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
> >  *         -EIO for pages on which we can not handle memory errors,
> >  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
> >  *         operations like allocation and free,
> >- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
> >+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
> >+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
> >+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
> >+ *         kernel stacks, and similar non-LRU/non-buddy pages).
> 
> Did you test this patch series? I don't see how we ever get to
> -ENOTRECOVERABLE there ...

Yes, I did. I am using the following test case:

https://github.com/leitao/linux/commit/cfebe84ddeab5ac34ed456331db980d57e7025dc

	# RUN_DESTRUCTIVE=1 tools/testing/selftests/mm/hwpoison-panic.sh
	# enabling /proc/sys/vm/panic_on_unrecoverable_memory_failure
	# injecting hwpoison at phys 0x2a00000 (Kernel rodata)
	# expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'
	[  501.113256] Memory failure: 0x2a00: recovery action for reserved kernel page: Ignored
	[  501.113956] Kernel panic - not syncing: Memory failure: 0x2a00: unrecoverable page


> Even with MF_COUNT_INCREASED, the first pass does:
> 
> 	if (flags & MF_COUNT_INCREASED)
> 		count_increased = true;
> 
> 	[...]
> 
> 	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
> 		ret = 1;
> 	} else {
> 		if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
> 			put_page(p);
> 			shake_page(p);
> 			count_increased = false;
> 			goto try_again; <-
> 		}
> 		put_page(p);
> 		ret = -ENOTRECOVERABLE;
> 	}
> 
> Then we come back with count_increased=false:
> 
> try_again:
> 	if (!count_increased) {
> 		ret = __get_hwpoison_page(p, flags); <-
> 		if (!ret) {
> 		[...]
> 		} else if (ret == -EBUSY) { <-
> 		[...]
> 			ret = -EIO;
> 			goto out; <-
> 		}
> 	}
> 
> For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:
> 
> 	if (!HWPoisonHandlable(&folio->page, flags))
> 		return -EBUSY;
> 
> so they still seem to end up as -EIO ... Am I missing something?

You are not, and thanks for catching this. I traced it again and the
-ENOTRECOVERABLE branch is unreachable for slab/vmalloc/page-table pages
exactly as you described. The __get_hwpoison_page() → -EBUSY → shake → retry
loop catches them first and they exit as -EIO.
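
The flow can be checked with a small user-space model. This is a simplified
sketch, not the kernel code: model_page, model_get_hwpoison_page() and
model_get_any_page() are hypothetical stand-ins, the retry limit of 3 is an
assumption of the model, shake_page() and the free-page (ret == 0) case are
left out, and the page is assumed to stay unhandlable across retries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_RETRY 3	/* assumed retry limit for this model */

/* Hypothetical stand-in for a page; 'handlable' models HWPoisonHandlable(). */
struct model_page {
	bool handlable;
};

/* Models __get_hwpoison_page(): -EBUSY for unhandlable pages, else 1. */
static int model_get_hwpoison_page(const struct model_page *p)
{
	return p->handlable ? 1 : -EBUSY;
}

/* Models only the retry structure of get_any_page() discussed above. */
static int model_get_any_page(const struct model_page *p, bool count_increased)
{
	int pass = 0;
	int ret;

	assert(p);
try_again:
	if (!count_increased) {
		ret = model_get_hwpoison_page(p);
		if (ret == -EBUSY) {
			if (pass++ < MODEL_MAX_RETRY)
				goto try_again;
			return -EIO;	/* unhandlable pages exit here */
		}
	}

	if (p->handlable)
		return 1;

	if (pass++ < MODEL_MAX_RETRY) {
		count_increased = false;	/* the reference was dropped */
		goto try_again;
	}
	return -ENOTRECOVERABLE;	/* not reached while 'handlable' is stable */
}
```

In this model a page with handlable == false returns -EIO whether or not the
caller starts with count_increased set, because the first retry clears
count_increased and every later pass goes through the -EBUSY branch.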

The selftest I am using (link above) only validated the PageReserved
short-circuit added in patch 3, which lives in memory_failure() and never
reaches get_any_page().

I did consider this code path, but I was not convinced that it should return
-ENOTRECOVERABLE, so I documented the reasoning as follows (as in the current
patch):

	@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
			shake_page(p);
			goto try_again;
		}
	+            /*
	+             * Return -EIO rather than -ENOTRECOVERABLE: this
	+             * branch is also reached for pages that are merely
	+             * off-LRU transiently (e.g. a folio in the middle
	+             * of migration or compaction), which shake_page()
	+             * cannot drag back.  The caller cannot prove the
	+             * page is permanently kernel-owned from here, so
	+             * keep it on the recoverable errno.
	+             */
		ret = -EIO;

^ permalink raw reply

* Re: [PATCH 09/12] swap: push down setting sis->bdev into ->swap_activate
From: Darrick J. Wong @ 2026-05-14 14:37 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Andrew Morton, Chris Li, Kairui Song,
	Christian Brauner, Jens Axboe, David Sterba, Theodore Ts'o,
	Jaegeuk Kim, Chao Yu, Trond Myklebust, Anna Schumaker,
	Namjae Jeon, Hyunchul Lee, Steve French, Paulo Alcantara,
	Carlos Maiolino, Naohiro Aota, linux-xfs, linux-fsdevel,
	linux-doc, linux-mm, linux-block, linux-btrfs, linux-ext4,
	linux-f2fs-devel, linux-nfs, linux-cifs
In-Reply-To: <b37ca8a7-289e-45a0-8cbd-eb14d7453b97@kernel.org>

On Wed, May 13, 2026 at 04:58:37PM +0900, Damien Le Moal wrote:
> On 5/13/26 16:46, Christoph Hellwig wrote:
> > On Wed, May 13, 2026 at 04:44:53PM +0900, Damien Le Moal wrote:
> >> Hmmm... With zonefs, swap files can be created on top of conventional zone
> >> files. So enforcing "no swap on zoned device" here would break that.
> > 
> > We can check that none of the extents fall onto sequential zones instead
> > of just devices.
> > 
> > I still wonder why you bother with swap to zonefs at all, though.
> 
> Yeah. I do not think anyone actually use that... But since it is there from the
> start, kind of stuck with it now.

Ahh, right, I forgot that zoned devices can have conventional zones
where swap would actually work.  Question withdrawn.

--D

^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Jiri Olsa @ 2026-05-14 14:35 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Michal Hocko, Breno Leitao, Andrew Morton, corbet, skhan,
	linux-doc, linux-kernel, linux-kselftest, gregkh, akinobu.mita,
	live-patching
In-Reply-To: <agIbaeBQAr-RkqYc@laps>

On Mon, May 11, 2026 at 02:09:45PM -0400, Sasha Levin wrote:

SNIP

> > > Even if I'm okay with rebooting that often (and I really really would prefer
> > > not to), this doesn't solve the issues of a larger fleet of servers that can't
> > > just reboot that often.
> > > 
> > > What am I missing?
> > 
> > For one, you are missing more maintainers of code modification infrastructures.
> 
> Happy to add more, but I don't want to be too spammy. I'll add in the
> livepatching ML and the fault injection maintainer (I couldn't find a list).
> Please add any other folks/lists who you think might want to contribute to this
> discussion.

hi,
could you please add bpf (bpf@vger.kernel.org) to the loop?

thanks,
jirka

^ permalink raw reply

* [RFC PATCH v2.1 28/28] Docs/admin-guide/mm/damon/usage: update for memcg damon filter
From: SeongJae Park @ 2026-05-14 14:09 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260514140904.119781-1-sj@kernel.org>

Update the DAMON usage document for the newly added feature that monitors
which memory cgroup the data belongs to.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/admin-guide/mm/damon/usage.rst | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 465bcdf89b182..84741b4cd1877 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -74,7 +74,7 @@ comma (",").
     │ │ │ │ │ │ nr_regions/min,max
     │ │ │ │ │ │ :ref:`probes <damon_usage_sysfs_probes>`/nr_probes
     │ │ │ │ │ │ │ 0/filters/nr_filters
-    │ │ │ │ │ │ │ │ │ 0/type,matching,allow
+    │ │ │ │ │ │ │ │ │ 0/type,matching,allow,path
     │ │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
@@ -289,7 +289,9 @@ the data attribute for the probe.
 In the beginning, ``filters`` directory has only one file, ``nr_filters``.
 Writing a number (``N``) to the file creates the number of child directories
 named ``0`` to ``N-1``.  Each directory represents each filter and work in a
-way similar to that for :ref:`DAMOS filter <sysfs_filters>`.
+way similar to that for :ref:`DAMOS filter <sysfs_filters>`.  When the filter
+``type`` is ``memcg``, the ``path`` file plays the role of ``memcg_path`` for
+:ref:`DAMOS filter <sysfs_filters>`.
 
 .. _sysfs_targets:
 
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH v2.1 27/28] Docs/mm/damon/design: update for memcg damon filter
From: SeongJae Park @ 2026-05-14 14:09 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260514140904.119781-1-sj@kernel.org>

Update the DAMON design document for the newly added feature that monitors
which memory cgroup the data belongs to.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/mm/damon/design.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 887b45cbeb716..a24f9f00d1837 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -293,8 +293,8 @@ registration is made by specifying a probe per attribute.  Each of the probe
 specifies a rule to determine if a given memory region has the related
 attribute.  The rule is constructed with multiple filters.  The filters work
 same to :ref:`DAMOS filters <damon_design_damos_filters>` except the supported
-filter types.  Currently only ``anon`` filter type is supported for data
-attributes monitoring.
+filter types.  Currently only ``anon`` and ``memcg`` filter types are supported
+for data attributes monitoring.
 
 If such probes are registered, DAMON executes the probes for each region's
 sampling memory when it does the access :ref:`sampling
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH v2.1 21/28] Docs/admin-guide/mm/damon/usage: document data attributes monitoring
From: SeongJae Park @ 2026-05-14 14:08 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260514140904.119781-1-sj@kernel.org>

Update DAMON usage document for the newly added data attributes
monitoring feature.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/admin-guide/mm/damon/usage.rst | 46 +++++++++++++++++---
 Documentation/mm/damon/design.rst            |  2 +
 2 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 11c75a598393c..465bcdf89b182 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -72,6 +72,11 @@ comma (",").
     │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
     │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us
     │ │ │ │ │ │ nr_regions/min,max
+    │ │ │ │ │ │ :ref:`probes <damon_usage_sysfs_probes>`/nr_probes
+    │ │ │ │ │ │ │ 0/filters/nr_filters
+    │ │ │ │ │ │ │ │ │ 0/type,matching,allow
+    │ │ │ │ │ │ │ │ │ ...
+    │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
     │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target,obsolete_target
     │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions
@@ -97,7 +102,10 @@ comma (",").
     │ │ │ │ │ │ │ │ 0/id,weight
     │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds,nr_snapshots,max_nr_snapshots
     │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
-    │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
+    │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed,
+    │ │ │ │ │ │ │ │ │ probes
+    │ │ │ │ │ │ │ │ │ │ 0/hits
+    │ │ │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ │ ...
     │ │ │ │ ...
@@ -227,8 +235,8 @@ contexts/<N>/monitoring_attrs/
 
 Files for specifying attributes of the monitoring including required quality
 and efficiency of the monitoring are in ``monitoring_attrs`` directory.
-Specifically, two directories, ``intervals`` and ``nr_regions`` exist in this
-directory.
+Specifically, three directories, ``intervals``, ``nr_regions`` and ``probes``,
+exist in this directory.
 
 Under ``intervals`` directory, three files for DAMON's sampling interval
 (``sample_us``), aggregation interval (``aggr_us``), and update interval
@@ -262,6 +270,27 @@ tuning-applied current values of the two intervals can be read from the
 ``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to
 the ``state`` file.
 
+.. _damon_usage_sysfs_probes:
+
+contexts/<N>/monitoring_attrs/probes/
+-------------------------------------
+
+A directory for registering :ref:`data attributes monitoring
+<damon_design_data_attrs_monitoring>` probes.
+
+In the beginning, this directory has only one file, ``nr_probes``.  Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``.  Each directory represents each monitoring probe.
+
+In each probe directory, one directory, ``filters``, exists.  The directory
+contains files for installing filters for the probe, which are used to determine
+the data attribute for the probe.
+
+In the beginning, ``filters`` directory has only one file, ``nr_filters``.
+Writing a number (``N``) to the file creates the number of child directories
+named ``0`` to ``N-1``.  Each directory represents each filter and work in a
+way similar to that for :ref:`DAMOS filter <sysfs_filters>`.
+
 .. _sysfs_targets:
 
 contexts/<N>/targets/
@@ -614,10 +643,13 @@ set the ``access pattern`` as their interested pattern that they want to query.
 tried_regions/<N>/
 ------------------
 
-In each region directory, you will find five files (``start``, ``end``,
-``nr_accesses``, ``age``, and ``sz_filter_passed``).  Reading the files will
-show the properties of the region that corresponding DAMON-based operation
-scheme ``action`` has tried to be applied.
+In each region directory, you will find six files (``start``, ``end``,
+``nr_accesses``, ``age``, ``sz_filter_passed`` and ``probe_hits``).  Reading
+the files will show the properties of the region that corresponding DAMON-based
+operation scheme ``action`` has tried to be applied.
+
+Reading ``probe_hits`` shows the number of data attributes monitoring
+probe-hit positive samples of the region.
 
 Example
 ~~~~~~~
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 6731c3102d0ff..887b45cbeb716 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -276,6 +276,8 @@ interval``, DAMON checks if the region's size and access frequency
 (``nr_accesses``) has significantly changed.  If so, the counter is reset to
 zero.  Otherwise, the counter is increased.
 
+.. _damon_design_data_attrs_monitoring:
+
 Data Attributes Monitoring
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH v2.1 20/28] Docs/mm/damon/design: document data attributes monitoring
From: SeongJae Park @ 2026-05-14 14:08 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260514140904.119781-1-sj@kernel.org>

Update DAMON design document for newly added data attributes monitoring
feature.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/mm/damon/design.rst | 37 +++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index fa7392b5a331d..6731c3102d0ff 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -276,6 +276,43 @@ interval``, DAMON checks if the region's size and access frequency
 (``nr_accesses``) has significantly changed.  If so, the counter is reset to
 zero.  Otherwise, the counter is increased.
 
+Data Attributes Monitoring
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Data access pattern is only one type of data attribute.  In some use cases,
+users need to know more data attributes information.  For example, users may
+need to know how much of a given hot or cold memory region is backed by
+anonymous pages, or belongs to a specific cgroup.  For such use cases, the data
+attributes monitoring feature is provided.
+
+Using the feature, users can register data attributes of their interest to the
+DAMON :ref:`context <damon_design_execution_model_and_data_structures>`.  The
+registration is made by specifying a probe per attribute.  Each of the probe
+specifies a rule to determine if a given memory region has the related
+attribute.  The rule is constructed with multiple filters.  The filters work
+same to :ref:`DAMOS filters <damon_design_damos_filters>` except the supported
+filter types.  Currently only ``anon`` filter type is supported for data
+attributes monitoring.
+
+If such probes are registered, DAMON executes the probes for each region's
+sampling memory when it does the access :ref:`sampling
+<damon_design_region_based_sampling>`.  The number of samples that are identified
+as having the data attribute (hitting the probe) per :ref:`aggregation interval
+<damon_design_monitoring>` is accounted in a per-region per-probe counter.
+Users can therefore know how much of a given DAMON region has a specific data
+attribute by reading the per-region per-probe probe hits counter after each
+aggregation interval.
+
+This is a sampling based mechanism.  Hence, it is lightweight but the output
+may include some measurement errors.  The output should be used with good
+understanding of statistics.
+
+Another way to do this for higher accuracy is using :ref:`DAMOS filter
+<damon_design_damos_filters>` with ``stat`` :ref:`action
+<damon_design_damos_action>` and ``sz_ops_filter_passed`` :ref:`stat
+<damon_design_damos_stat>`.  This approach provides the data attributes
+information at page level.  But, because it operates at page level, the
+overhead is proportional to the size of the memory.
 
 Dynamic Target Space Updates Handling
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH v2.1 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-14 14:08 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In the long term, this will help extend DAMON
to multiple access event capture primitives (e.g., page faults and
PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But the monitoring part is still only for
data accesses.

Data access patterns are good information, but some users need more
holistic views.  Particularly, users want to show the access pattern
information together with the types of the memory.  For example, users
who work on using huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.

To meet this demand, we developed a DAMOS extension for page level
properties based monitoring [1], which landed in 6.14.  Using the
feature, users can specify the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on a randomly selected one
percent of the machines in their fleet.  If the runtime is long enough
and the fleet is big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes in addition to the data
access they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  The 'data probe'
should be able to be applied to any memory, and determine if the given
memory has the appropriate data attribute.  For example, a probe can
determine whether the memory at physical address 42 belongs to cgroup A.
Each 'data probe' is configured with filters that are very similar to
the DAMOS filters.

When DAMON checks if the sampling address memory of each region has been
accessed since the last check, it also applies the data probes if any
are registered.  In the same way that it accounts the number of access
check-positive samples (nr_accesses), it accounts the number of positive
samples for each data probe in another per-region counter array, namely
'probe_hits'.  When DAMON resets nr_accesses every aggregation interval,
it resets 'probe_hits' together.

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how much of the hot/cold memory regions has the data
attributes of their interest.  For example, 30 percent of this system's
hot memory belongs to cgroup A, and 80 percent of that cgroup A hot
memory is backed by huge pages.
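
The accounting described above can be sketched as a user-space model.  This
is only an illustration under stated assumptions: every model_* name below is
hypothetical and is not the damon_probe/damon_filter API of this series, a
probe is reduced to a single predicate instead of a filter list, and the
sample is assumed to already carry the facts the filters would look up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PROBES 4	/* this version limits data probes to four */

/* Hypothetical per-sample facts that a probe's filters would inspect. */
struct model_sample {
	bool accessed;
	bool anon;
	int memcg_id;
};

/* Hypothetical probe: returns true if the sample has the attribute. */
typedef bool (*model_probe_fn)(const struct model_sample *s);

/* Hypothetical region with the two kinds of per-region counters. */
struct model_region {
	unsigned int nr_accesses;
	unsigned int probe_hits[MODEL_MAX_PROBES];
};

/* Account one sampling result: the access check plus every probe. */
static void model_account_sample(struct model_region *r,
				 const struct model_sample *s,
				 const model_probe_fn *probes,
				 size_t nr_probes)
{
	size_t i;

	assert(nr_probes <= MODEL_MAX_PROBES);
	if (s->accessed)
		r->nr_accesses++;
	for (i = 0; i < nr_probes; i++)
		if (probes[i](s))
			r->probe_hits[i]++;
}

/* Reset both counters together at the end of an aggregation interval. */
static void model_reset_aggregated(struct model_region *r)
{
	size_t i;

	r->nr_accesses = 0;
	for (i = 0; i < MODEL_MAX_PROBES; i++)
		r->probe_hits[i] = 0;
}

static bool model_probe_anon(const struct model_sample *s)
{
	return s->anon;
}

static bool model_probe_memcg_a(const struct model_sample *s)
{
	return s->memcg_id == 1;	/* 1 stands for "cgroup A" here */
}
```

The point of the model is that the probes run on the same sample that the
access check already looks at, so each extra attribute costs one more check
per sample rather than a walk over every page of the region.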

Patches Sequence
================

The first eight patches implement the core feature, interface and the
working support.  Patch 1 introduces data probe data structure, namely
damon_probe.  Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter.  Patch 4 updates damon_ctx commit function
to handle the probes.  Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits.  Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation.  Patch 7 updates kdamond_fn() to invoke the probes
applying callback.  Patch 8 finally implements the probes support on
paddr ops.

Ten changes for the user interface (patches 9-18) come next.  Patches 9-13
implement sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively.  Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Following three patches (patches 15-17) implement sysfs directories and
files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively.  Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 document the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be split into
a separate series in the future.  Patch 22 defines the DAMON filter type
for the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23 adds the
support on paddr ops.  Patch 24 updates the sysfs interface for setup of
the target memcg.  Patch 25 moves code for easy reuse of the filter
target memcg setup.  Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.

Discussions
===========

This allows page properties monitoring with overhead that is low enough
to keep it always enabled on real world workloads.  Because the sampling
time for the access check is reused for the data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for the access check is reused for the data
attributes check, the additional overhead is minimal.

DAMOS-based page level properties monitoring should still be useful,
because it provides deterministic page level information.  When in
doubt about the sampling based information, running the DAMOS-based one
together and comparing the results would be useful for debugging and
tuning.
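
Such a comparison could be sketched as below.  This is a hedged
illustration, not part of the series: the model_* helpers are hypothetical,
and the estimate assumes one sample per sampling interval, so that the
number of samples per aggregation interval is aggr_us / sample_us and
probe_hits out of that many samples approximates the fraction of the region
that has the attribute.

```c
#include <assert.h>
#include <stdint.h>

/* Samples per aggregation, assuming one sample per sampling interval. */
static uint64_t model_samples_per_aggr(uint64_t sample_us, uint64_t aggr_us)
{
	assert(sample_us > 0);
	return aggr_us / sample_us;
}

/* Sampling-based estimate of the bytes in a region having the attribute. */
static uint64_t model_estimated_bytes(uint64_t region_sz, uint64_t probe_hits,
				      uint64_t nr_samples)
{
	assert(nr_samples > 0 && probe_hits <= nr_samples);
	return region_sz * probe_hits / nr_samples;
}

/* Absolute gap between the estimate and a page level (DAMOS) byte count. */
static uint64_t model_abs_diff(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}
```

For example, with a 5 ms sampling interval and a 100 ms aggregation interval
there are 20 samples per aggregation.  Six probe hits on a 100 MiB region
then give an estimate of 30 MiB, which can be compared against the byte
count that the DAMOS-based page level monitoring reports for the same
region.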

Plan for Dropping RFC tag
=========================

I'm considering renaming the tracepoint for exposing probe_hits
(damon_aggregated_v2).

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
======================

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in the
future.  I personally think it should be enough for common use cases,
though, and therefore I am not giving it high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of the work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of the data attributes, there is no
reason to treat data access as that special.  There might be some users
who are not interested in access at all but want to know the location of
memory of a specific type.  Data probes interface will allow doing that.  Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute.  Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.

Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  48 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  67 +++
 include/trace/events/damon.h                 |  38 ++
 mm/damon/core.c                              | 197 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 222 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1284 insertions(+), 51 deletions(-)

base-commit: 678b6bc7ce120b8c51d4e05fcb8eb0a92f9be3f6
-- 
2.47.3

^ permalink raw reply

* Re: [PATCH] driver core: Add cmdline option to force probe type
From: Greg KH @ 2026-05-14 13:50 UTC (permalink / raw)
  To: Jianlin Lv
  Cc: corbet, skhan, rafael, dakr, jianlv, linux-kernel, linux-doc,
	driver-core
In-Reply-To: <CAFA-uR93Wf2ALpYnnU79kruv7XO=uFePqioaEXNNEfrUtRw2xQ@mail.gmail.com>

On Thu, May 14, 2026 at 09:35:08PM +0800, Jianlin Lv wrote:
> On Thu, May 14, 2026 at 6:16 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, May 14, 2026 at 05:49:55PM +0800, Jianlin Lv wrote:
> > > From: Jianlin Lv <iecedge@gmail.com>
> > >
> > > Device drivers that use asynchronous probing can cause non-deterministic
> > > device ordering and naming across reboots. A typical example is storage
> > > drivers (like sd/nvme): asynchronous probing can lead to inconsistent disk
> > > logical names after reboot. In scenarios where disk naming consistency is
> > > critical, the probe type should be set to synchronous.
> > >
> > > This patch introduces a driver_probe kernel parameter that overrides any
> > > driver's hard-coded probe type settings and allows runtime control without
> > > requiring kernel recompilation:
> > >
> > >   driver_probe=PROBE_TYPE_SYNC,nvme,sd      # Force specific drivers sync
> > >   driver_probe=PROBE_TYPE_ASYNC,*,usb       # Force all async except usb
> > >   driver_probe=PROBE_TYPE_SYNC,*            # Force all drivers synchronous
> > >
> > > The implementation replaces the limited driver_async_probe parameter with
> > > a more flexible interface that can force either synchronous or asynchronous
> > > probing as needed.
> > >
> > > Signed-off-by: Jianlin Lv <iecedge@gmail.com>
> > > ---
> > >  .../admin-guide/kernel-parameters.txt         | 27 +++++--
> > >  drivers/base/dd.c                             | 71 ++++++++++++++-----
> > >  2 files changed, 74 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 4d0f545fb3ec..b43a8bd20356 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -1377,12 +1377,27 @@ Kernel parameters
> > >                       it becomes active and is searched during signature
> > >                       verification.
> > >
> > > -     driver_async_probe=  [KNL]
> > > -                     List of driver names to be probed asynchronously. *
> > > -                     matches with all driver names. If * is specified, the
> > > -                     rest of the listed driver names are those that will NOT
> > > -                     match the *.
> > > -                     Format: <driver_name1>,<driver_name2>...
> >
> > You can not remove an existing user/kernel api, sorry, that is not
> > allowed as you just broke all systems that were relying on this :(
> >
> Could you provide more suggestions on how to improve this patch?

Not really, sorry, I don't think this is a change that should be made at
all.  Disk naming is a long-solved issue; thinking that you can fix it
with sync/async device probing misunderstands both the issues involved
and how we already solved them :)

Hint, never count on block device, or any device, names to be the same
across reboots.  That has NEVER been guaranteed on systems built in the
past 20+ years.

Please, just use the existing solutions, no new command line option
should ever be needed here.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] driver core: Add cmdline option to force probe type
From: Jianlin Lv @ 2026-05-14 13:35 UTC (permalink / raw)
  To: Greg KH
  Cc: corbet, skhan, rafael, dakr, jianlv, linux-kernel, linux-doc,
	driver-core
In-Reply-To: <2026051443-exuberant-important-534f@gregkh>

On Thu, May 14, 2026 at 6:16 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, May 14, 2026 at 05:49:55PM +0800, Jianlin Lv wrote:
> > From: Jianlin Lv <iecedge@gmail.com>
> >
> > Device drivers that use asynchronous probing can cause non-deterministic
> > device ordering and naming across reboots. A typical example is storage
> > drivers (like sd/nvme): asynchronous probing can lead to inconsistent disk
> > logical names after reboot. In scenarios where disk naming consistency is
> > critical, the probe type should be set to synchronous.
> >
> > This patch introduces a driver_probe kernel parameter that overrides any
> > driver's hard-coded probe type settings and allows runtime control without
> > requiring kernel recompilation:
> >
> >   driver_probe=PROBE_TYPE_SYNC,nvme,sd      # Force specific drivers sync
> >   driver_probe=PROBE_TYPE_ASYNC,*,usb       # Force all async except usb
> >   driver_probe=PROBE_TYPE_SYNC,*            # Force all drivers synchronous
> >
> > The implementation replaces the limited driver_async_probe parameter with
> > a more flexible interface that can force either synchronous or asynchronous
> > probing as needed.
> >
> > Signed-off-by: Jianlin Lv <iecedge@gmail.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         | 27 +++++--
> >  drivers/base/dd.c                             | 71 ++++++++++++++-----
> >  2 files changed, 74 insertions(+), 24 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 4d0f545fb3ec..b43a8bd20356 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1377,12 +1377,27 @@ Kernel parameters
> >                       it becomes active and is searched during signature
> >                       verification.
> >
> > -     driver_async_probe=  [KNL]
> > -                     List of driver names to be probed asynchronously. *
> > -                     matches with all driver names. If * is specified, the
> > -                     rest of the listed driver names are those that will NOT
> > -                     match the *.
> > -                     Format: <driver_name1>,<driver_name2>...
>
> You can not remove an existing user/kernel api, sorry, that is not
> allowed as you just broke all systems that were relying on this :(
>
Could you provide more suggestions on how to improve this patch?
If I extend driver_async_probe to implement a 'force synchronous probe'
feature, that doesn’t really match the name driver_async_probe and
may feel ambiguous.
If I add a new 'driver_probe' parameter while keeping driver_async_probe,
then their functionality would partially overlap.

Regards,
Jianlin

> thanks,
>
> greg k-h

^ permalink raw reply

* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Lance Yang @ 2026-05-14 13:28 UTC (permalink / raw)
  To: leitao
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260513-ecc_panic-v7-2-be2e578e61da@debian.org>


On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
>get_any_page() collapses three different failure modes into a single
>-EIO return:
>
>  * the put_page race in the !count_increased path;
>  * the HWPoisonHandlable() rejection that bounces out of
>    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
>  * the HWPoisonHandlable() rejection that goes through the
>    count_increased / put_page / shake_page retry loop.
>
>The first is transient (the page is racing with the allocator).  The
>second can be either transient (a userspace folio briefly off LRU
>during migration/compaction) or stable (slab/vmalloc/page-table/
>kernel-stack pages).  The third describes a stable kernel-owned page
>that the count_increased=true caller already held a reference on.
>
>Distinguish them on the return path: keep -EIO for both the put_page
>race and the -EBUSY-after-retries branch (shake_page() cannot drag a
>folio back from active migration, so we cannot prove the page is
>permanently kernel-owned from there), keep -EBUSY for the allocation
>race (unchanged), and return -ENOTRECOVERABLE only from the
>count_increased-true HWPoisonHandlable() rejection that exhausts its
>retries -- the caller's reference is structural evidence that the
>page is owned by the kernel.
>
>Extend the unhandlable-page pr_err() to fire for either errno and
>update the get_hwpoison_page() kerneldoc.
>
>memory_failure() still folds every negative return into
>MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>this patch is a no-op for users of memory_failure() and only changes
>the errno that soft_offline_page() can propagate to its callers.  A
>follow-up wires the new return code through memory_failure() and
>reports MF_MSG_KERNEL for the unrecoverable cases.
>
>Suggested-by: David Hildenbrand <david@kernel.org>
>Signed-off-by: Breno Leitao <leitao@debian.org>
>---
> mm/memory-failure.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
>diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>index 49bcfbd04d213..bae883df3ccb2 100644
>--- a/mm/memory-failure.c
>+++ b/mm/memory-failure.c
>@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
> 				shake_page(p);
> 				goto try_again;
> 			}
>+			/*
>+			 * Return -EIO rather than -ENOTRECOVERABLE: this
>+			 * branch is also reached for pages that are merely
>+			 * off-LRU transiently (e.g. a folio in the middle
>+			 * of migration or compaction), which shake_page()
>+			 * cannot drag back.  The caller cannot prove the
>+			 * page is permanently kernel-owned from here, so
>+			 * keep it on the recoverable errno.
>+			 */
> 			ret = -EIO;
> 			goto out;
> 		}
>@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> 			goto try_again;
> 		}
> 		put_page(p);
>-		ret = -EIO;
>+		ret = -ENOTRECOVERABLE;
> 	}
> out:
>-	if (ret == -EIO)
>+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
> 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> 
> 	return ret;
>@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
>  *         -EIO for pages on which we can not handle memory errors,
>  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
>  *         operations like allocation and free,
>- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
>+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
>+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
>+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
>+ *         kernel stacks, and similar non-LRU/non-buddy pages).

Did you test this patch series? I don't see how we ever get to
-ENOTRECOVERABLE there ...

Even with MF_COUNT_INCREASED, the first pass does:

	if (flags & MF_COUNT_INCREASED)
		count_increased = true;

	[...]

	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
		ret = 1;
	} else {
		if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
			put_page(p);
			shake_page(p);
			count_increased = false;
			goto try_again; <-
		}
		put_page(p);
		ret = -ENOTRECOVERABLE;
	}

Then we come back with count_increased=false:

try_again:
	if (!count_increased) {
		ret = __get_hwpoison_page(p, flags); <-
		if (!ret) {
		[...]
		} else if (ret == -EBUSY) { <-
		[...]
			ret = -EIO;
			goto out; <-
		}
	}

For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:

	if (!HWPoisonHandlable(&folio->page, flags))
		return -EBUSY;

so they still seem to end up as -EIO ... Am I missing something?
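
The flow traced above can be reduced to a small userspace model. This is
only a sketch, not the kernel code: model_get_any_page() and
MODEL_MAX_RETRY are hypothetical names, the page is assumed to stay
unhandlable across all retries, and the elided -EBUSY retry condition is
assumed to be a bounded retry sharing the same pass counter.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_RETRY 3	/* assumed retry bound for this sketch */

/*
 * Toy model of the retry flow discussed above.  "handlable" stands in
 * for HWPoisonHandlable() and is constant for the whole call, i.e. the
 * page never changes state between retries.
 */
static int model_get_any_page(bool count_increased, bool handlable)
{
	int pass = 0;

try_again:
	if (!count_increased) {
		/* Models __get_hwpoison_page() returning -EBUSY. */
		if (!handlable) {
			if (pass++ < MODEL_MAX_RETRY)
				goto try_again;
			return -EIO;
		}
		/* Reference taken, fall through to the check below. */
	}

	if (handlable)
		return 1;

	/* Only reachable while count_increased is still true. */
	if (pass++ < MODEL_MAX_RETRY) {
		count_increased = false;	/* reference dropped, retry */
		goto try_again;
	}
	return -ENOTRECOVERABLE;
}
```

Under these assumptions, a caller that starts with count_increased=true
takes the lower branch exactly once (pass is 0), clears count_increased,
and from then on can only leave through the -EIO exit.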

>  */
> static int get_hwpoison_page(struct page *p, unsigned long flags)
> {
>
>-- 
>2.53.0-Meta
>
>

^ permalink raw reply

* Re: [PATCH v3 2/3] Documentation: security-bugs: explain what is and is not a security bug
From: Willy Tarreau @ 2026-05-14 13:13 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Greg KH, Leon Romanovsky, skhan, security, workflows, linux-doc,
	linux-kernel
In-Reply-To: <87wlx6uqob.fsf@trenco.lwn.net>

On Thu, May 14, 2026 at 06:22:12AM -0600, Jonathan Corbet wrote:
> Willy Tarreau <w@1wt.eu> writes:
> 
> >> (While I was there, I noticed that threat-model.rst has no SPDX line;
> >> what's your preference there?)
> >
> > I didn't notice any was needed, I tried to get inspiration from other
> > files for the format (I'm still not familiar with the rst format
> > though this time I could successfully install the tools).
> 
> In theory every file in the kernel tree is supposed to have one; many
> documentation files lag a bit behind on that front, but we try...

OK thanks for the background.

> > Same for
> > the label at the top BTW, I just did what I found somewhere else,
> > probably security-bugs.rst which is similar (no SPDX line and has a
> > label). So regarding SPDX, I do not have any preference. If one is
> > needed, let's pick what's used by default, I do not care, as long
> > as it allows the doc to be published.
> 
> The top-of-file label got started somewhere and has been cargo-culted
> extensively since then; it has proved hard to eradicate.

I'm not surprised, everyone likely does like me: look at another file
to see what it should look like, and does it again. Apparently in
security-bugs it was added 10 years ago by this:

  609d99a3b72e3 ("Documentation/HOWTO: add cross-references to other documents")

> As for SPDX, the most common is the basic:
> 
> .. SPDX-License-Identifier: GPL-2.0

This works for me for the new file. For existing security-bugs, why
not do the same at the same time? Before SPDX tags it has been covered
by GPL-2.0 as well via the COPYING file, and further contributions did
not change its license. And in the worst case a total of 10 people
touched the 3 names of that file over its Git history, I doubt there
would be too much resistance against an update.

Willy

^ permalink raw reply

* RE: [PATCH v10 3/6] iio: adc: ad4691: add triggered buffer support
From: Sabau, Radu bogdan @ 2026-05-14 12:43 UTC (permalink / raw)
  To: Jonathan Cameron, Radu Sabau via B4 Relay
  Cc: Lars-Peter Clausen, Hennerich, Michael, David Lechner, Sa, Nuno,
	Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Uwe Kleine-König, Liam Girdwood, Mark Brown, Linus Walleij,
	Bartosz Golaszewski, Philipp Zabel, Jonathan Corbet, Shuah Khan,
	linux-iio@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pwm@vger.kernel.org,
	linux-gpio@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <20260512164514.38bbfd4c@jic23-huawei>

> -----Original Message-----
> From: Jonathan Cameron <jic23@kernel.org>
> Sent: Tuesday, May 12, 2026 6:45 PM
> To: Radu Sabau via B4 Relay <devnull+radu.sabau.

...

> > The CNV Burst Mode sampling frequency (PWM period) is exposed as a
> > buffer-level attribute via IIO_DEVICE_ATTR.
> >
> > Signed-off-by: Radu Sabau <radu.sabau@analog.com>
> 
> Sashiko pointed out you have a buffer that is big endian but
> chan_spec doesn't reflect that.  That should have generated obvious
> garbage output (unless you are actually testing on a be machine!)
> 

I thought I had IIO_BE in chan_spec, my bad. I either forgot to copy the latest
kernel image before testing (though I don't know why I would have removed that
line) or I did something wrong when rebasing the patches.
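
For illustration only, here is a minimal userspace sketch of why a
big-endian buffer without a matching endianness marking reads as garbage
on a little-endian host. It is not driver code; sample_from_be16() and
sample_from_host() are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Decode two bytes as a big-endian 16-bit sample (what the buffer holds). */
static uint16_t sample_from_be16(const uint8_t buf[2])
{
	return (uint16_t)((buf[0] << 8) | buf[1]);
}

/*
 * Decode the same two bytes in host byte order, which is what a reader
 * does when nothing tells it the data is big endian.
 */
static uint16_t sample_from_host(const uint8_t buf[2])
{
	uint16_t v;

	memcpy(&v, buf, sizeof(v));
	return v;
}
```

For the bytes 0x12 0x34, the first helper yields 0x1234 on any host,
while the second yields 0x3412 on a little-endian machine and matches
only on a big-endian one.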

> Various other things came up, some of which I thought were in previous
> reviews - but maybe I'm confusing drivers.
> 
> Thanks
> 
> Jonathan
> 
> > @@ -204,7 +230,14 @@ static const struct ad4691_chip_info ad4694_chip_info = {
> >  struct ad4691_state {
> >  	const struct ad4691_chip_info *info;
> >  	struct regmap *regmap;
> > +	struct spi_device *spi;

...

> > +	/*
> > +	 * Append a 4-byte state-reset transfer [addr_hi, addr_lo,
> > +	 * STATE_RESET_ALL, OSC_EN=1]. CS is asserted throughout, so
> > +	 * ADDR_DESCENDING writes byte[3]=1 to OSC_EN_REG (0x180) as a
> > +	 * deliberate side-write, keeping the oscillator enabled.
> > +	 */
> > +	put_unaligned_be16(AD4691_STATE_RESET_REG, st->scan_tx_reset);
> > +	st->scan_tx_reset[2] = AD4691_STATE_RESET_ALL;
> > +	st->scan_tx_reset[3] = 1;
> > +	st->scan_xfers[2 * k].tx_buf = st->scan_tx_reset;
> > +	st->scan_xfers[2 * k].len = sizeof(st->scan_tx_reset);
> > +	st->scan_xfers[2 * k].cs_change = 1;
> 
> Our old friend - cs_change = 1 is very rarely the right thing to do on a
> final message.  I thought this came up in an earlier version.
> 

I thought I had this removed, it must have been lost.

> > +	spi_message_add_tail(&st->scan_xfers[2 * k], &st->scan_msg);
> > +
> > +	ret = spi_optimize_message(st->spi, &st->scan_msg);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = regmap_write(st->regmap, AD4691_STD_SEQ_CONFIG,
> > +			   bitmap_read(indio_dev->active_scan_mask, 0,
> > +				       iio_get_masklength(indio_dev)));

...

> > +static irqreturn_t ad4691_trigger_handler(int irq, void *p)
> > +{
> > +	struct iio_poll_func *pf = p;
> > +	struct iio_dev *indio_dev = pf->indio_dev;
> > +	struct ad4691_state *st = iio_priv(indio_dev);
> > +
> > +	ad4691_read_scan(indio_dev, pf->timestamp);
> > +	if (!st->manual_mode)
> > +		enable_irq(st->irq);
> 
> Maybe it was a different driver but I thought I commented on this before.
> There are a bunch of races if you reenable this here - needs to be
> in the trigger reenable callback.
> (Sashiko is pointing this out as well with more detail on what those
> races are)  The short story is that you can race and have a trigger between
> the enable and the notify_done which will be dropped on the floor meaning
> we never get in here again - IIRC there is (rather convoluted) code to handle
> that corner case via the reenable callback and a work item.
> 

I also thought I had this covered, it appears it is not...

> > +	iio_trigger_notify_done(indio_dev->trig);
> > +	return IRQ_HANDLED;
> > +}
> > +
> >  static const struct iio_info ad4691_info = {
> >  	.read_raw = &ad4691_read_raw,
> >  	.write_raw = &ad4691_write_raw,
> >  	.read_avail = &ad4691_read_avail,
> >  	.debugfs_reg_access = &ad4691_reg_access,
> > +	.validate_trigger = iio_validate_own_trigger,
> 

...

> >  static int ad4691_probe(struct spi_device *spi)
> >  {
> >  	struct device *dev = &spi->dev;
> > @@ -663,6 +1200,7 @@ static int ad4691_probe(struct spi_device *spi)
> >  		return -ENOMEM;
> >
> >  	st = iio_priv(indio_dev);
> > +	st->spi = spi;
> >  	st->info = spi_get_device_match_data(spi);
> >  	if (!st->info)
> >  		return -ENODEV;
> > @@ -692,8 +1230,9 @@ static int ad4691_probe(struct spi_device *spi)
> >  	indio_dev->info = &ad4691_info;
> >  	indio_dev->modes = INDIO_DIRECT_MODE;
> >
> > -	indio_dev->channels = st->info->sw_info->channels;
> > -	indio_dev->num_channels = st->info->sw_info->num_channels;
> 
> You've lost me here. Where are these now set?
> 

At this point I am starting to think that whatever I have sent is different from
whatever I tested. It seems like changes I have running on the kernel image
do not appear here. I am pretty sure I have done something wrong while
rebasing, I am very sorry for that. I'll try and send a better/cleaner version
next time.

> > +	ret = ad4691_setup_triggered_buffer(indio_dev, st);
> > +	if (ret)
> > +		return ret;
> >
> >  	return devm_iio_device_register(dev, indio_dev);
> >  }
> >


^ permalink raw reply

* Re: [PATCH v4] docs: reporting-issues: replace "these advices" with "all of this advice"
From: WangYuli @ 2026-05-14 12:30 UTC (permalink / raw)
  To: Jonathan Corbet, Chen-Shi-Hong, linux; +Cc: skhan, linux-doc, linux-kernel
In-Reply-To: <87zf22uquc.fsf@trenco.lwn.net>

Hi Jon,

On 2026/5/14 20:18, Jonathan Corbet wrote:
> Please, no.  If you start churning the code in that way you will
> certainly get pushback.  Typo fixes are a fine way to learn the process,
> but I really hope that contributors will move on quickly to more
> substantial work.

You're right. I take back my suggestion.

Thanks,

---

WangYuli


^ permalink raw reply

* Re: [RFC net-next 0/4] devlink: Add boot-time defaults
From: Mark Bloch @ 2026-05-14 12:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Parav Pandit, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
	Andrew Lunn, David S. Miller, Jonathan Corbet, Shuah Khan,
	Simon Horman, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Andrew Morton, Borislav Petkov (AMD), Randy Dunlap, Dave Hansen,
	Christian Brauner, Petr Mladek, Peter Zijlstra (Intel),
	Thomas Gleixner, Pawan Gupta, Dapeng Mi, Kees Cook, Marco Elver,
	Eric Biggers, NBU-Contact-Li Rongqing (EXTERNAL),
	Paul E. McKenney, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org
In-Reply-To: <agRcYDkjsQuS7ArD@FV6GYCPJ69>



On 13/05/2026 14:11, Jiri Pirko wrote:
> Wed, May 13, 2026 at 07:53:05AM CEST, mbloch@nvidia.com wrote:
>>
>>
>> On 12/05/2026 21:35, Jiri Pirko wrote:
>>> Tue, May 12, 2026 at 05:25:21PM CEST, parav@nvidia.com wrote:
>>>>
>>>>
>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>> Sent: 12 May 2026 07:37 PM
>>>>>
>>>>> Tue, May 12, 2026 at 03:48:32PM CEST, parav@nvidia.com wrote:
>>>>>>
>>>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>>>> Sent: 12 May 2026 02:16 PM
>>>>>>>
>>>>>>> Mon, May 11, 2026 at 08:21:37PM +0200, parav@nvidia.com wrote:
>>>>>>>>
>>>>>>>>> From: Mark Bloch <mbloch@nvidia.com>
>>>>>>>>> Sent: 10 May 2026 06:02 PM
>>>>>>>>>
>>>>>>>>
>>>>>>>> [..]
>>>>>>>>
>>>>>>>>>> I look at it from the perspective that from some CX generation,
>>>>>>>>>> switchdev mode should be default. So that is a device-based decision.
>>>>>>>>>> I believe as such it can optionally be permanently configured (nv config)
>>>>>>>>>> on older device. Why not?
>>>>>>>>>
>>>>>>>> Because sometimes switchdev_inactive is needed and sometimes not.
>>>>>>>> Such knob is not device decision.
>>>>>>>
>>>>>>> That is what I would call corner case. In that, user can use userspace
>>>>>>> configuration to change the mode in runtime.
>>>>>>>
>>>>>> Corner vs common depends on users one talks to. :)
>>>>>> If fw has switchdev (active) as default and the user needs to run
>>>>>> switchdev_inactive, it will actually break their switching applications.
>>>>>
>>>>> Can you describe the actual breakage please?
>>>>>
>>>> Driver default was switchdev so all the traffic is forwarded to the switch,
>>>> and user didn't have chance to setup the fdb rules.
>>>> So packets are dropped but user didn't expect the traffic to be forwarded.
>>>
>>> User may switch mode to switchdev_inactive early on, before any of the
>>> representors are created. What's the issue then?
>>
>> That is the ordering problem I am trying to solve.
>>
>> On a DPU, the host PF cannot finish loading until the ECPF moves the eswitch to
>> switchdev/switchdev_inactive. So we need to do that transition during ECPF
>> driver init, as early as possible. Waiting for userspace means the host PF stays
>> blocked until userspace is up and has the right logic.
>>
>> That is not always true in practice, the driver may be built in, loaded from an
>> initramfs, or the initramfs may simply not contain the devlink policy we need.
>>
>> Also, after talking with Parav, my understanding is that we need to support both
>> switchdev and switchdev_inactive, since different customers want different boot
>> behavior. Once we do the transition, the host PF can load and may start sending
>> packets. At that point the initial mode already matters: in switchdev_inactive
>> packets are dropped until userspace programs the pipeline; in switchdev they may
>> reach the FDB before the pipeline is ready.
>>
>> So I do not think an early userspace transition is equivalent here. The initial
>> mode needs to be known by the kernel before userspace runs, which is why I am
>> proposing the devlink= command line default.
> 
> Okay fair enough. Could you please at least make sure this is mode only
> config and no one would ever think about abusing this for any other
> configuration? Perhaps call it "devlink_eswitch_mode=" to remove
> the "devlink=" namespace flexibility?

Sure, something along these lines:
devlink_eswitch_mode=[*]:switchdev
devlink_eswitch_mode=[pci/0000:08:00.0,pci/0000:09:00.1]:switchdev_inactive
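
As a rough sketch of how a value in this shape could be split into a
device list and a mode (an assumption about the syntax shown above, not
code from the series; parse_eswitch_mode_arg() is a hypothetical name):

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "[<devices>]:<mode>" into its two parts.  Device handles such as
 * pci/0000:08:00.0 contain ':' themselves, so the closing ']' is located
 * first and the separator is expected right after it.
 * Returns 0 on success, -EINVAL on malformed or oversized input.
 */
static int parse_eswitch_mode_arg(const char *arg,
				  char *devs, size_t devs_len,
				  char *mode, size_t mode_len)
{
	const char *close;
	size_t n;

	if (arg[0] != '[')
		return -EINVAL;

	close = strchr(arg, ']');
	if (!close || close[1] != ':' || close[2] == '\0')
		return -EINVAL;

	n = (size_t)(close - (arg + 1));
	if (n == 0 || n >= devs_len || strlen(close + 2) >= mode_len)
		return -EINVAL;

	memcpy(devs, arg + 1, n);
	devs[n] = '\0';
	strcpy(mode, close + 2);	/* length checked above */
	return 0;
}
```

With "[*]:switchdev" this yields the device list "*" and the mode
"switchdev"; a value without the bracketed list is rejected.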

The proper (not RFC) series will have 3 patches:

- devlink: add the command-line default eswitch mode handling
- mlx5: cleanup/prep patch
- mlx5: use the devlink API to apply the early eswitch mode

Since the mlx5 changes are part of the series, I suspect this will need to
go through Tariq. The patches are ready, but are currently in
our submission queue.

Mark

^ permalink raw reply
